ericsonwillians/distilbert-base-uncased-steam-sentiment

@@ -1,141 +1,172 @@
-# Steam Reviews Sentiment Analysis
-This repository contains code and documentation to fine-tune a transformer-based model (DistilBERT) on a dataset of Steam reviews to determine whether a given review is positive or negative.
-## Project Overview
-- **Data Processing:** The raw data is downloaded and stored in `data/raw`. We preprocess the data and store cleaned results in `data/interim`.
-- **Model Training:** Using Hugging Face Transformers, we fine-tune a `distilbert-base-uncased` model for sentiment classification.
-- **Evaluation & Results:** After training, the model’s performance is evaluated on a hold-out test set. Results, including accuracy, are printed and logged.
-- **Inference:** A command-line inference script allows you to enter a new review and see the predicted sentiment. This script uses `rich` for a fancy user interface.
-## Project Structure
-```
-project/
-├─ data/
-│  ├─ raw/             # Original downloaded datasets
-│  ├─ interim/         # Intermediate preprocessed data
-│  └─ processed/       # Final data (if applicable)
-├─ notebooks/           # Jupyter notebooks for EDA and experiments
-├─ scripts/             # Python scripts for download, preprocess, training, inference
-│  ├─ setup_project.sh  # Script to set up directory structure
-│  ├─ download_data.py  # Script to download the dataset using kagglehub
-│  ├─ preprocess_data.py# Script to preprocess the raw data
-│  ├─ train_transformer.py  # Script to train and evaluate the transformer model
-│  └─ inference.py      # Script to run inference on a single review
-├─ models/              # Saved model files and tokenizer
-├─ config/              # Configuration files (if any)
-├─ logs/                # Logs (if any)
-├─ tests/               # Tests (if any)
-└─ README.md            # This file
-```
-## Requirements
-- Python 3.10 (or compatible)
-- [Poetry](https://python-poetry.org/) for dependency management
-- [Hugging Face Transformers](https://github.com/huggingface/transformers)
-- [datasets](https://github.com/huggingface/datasets)
-- [evaluate](https://github.com/huggingface/evaluate)
-- [rich](https://github.com/Textualize/rich)
-- [scikit-learn](https://scikit-learn.org/)
-- [torch](https://pytorch.org/)
-Ensure these dependencies are listed in your `pyproject.toml` or `requirements.txt`.
-## Setup Instructions
-1. **Clone the repository:**
-   ```bash
-   git clone https://github.com/yourusername/steam-reviews-sentiment-analysis.git
-   cd steam-reviews-sentiment-analysis
-   ```
-2. **Set up the environment:**
-   If using Poetry:
-   ```bash
-   poetry install
-   poetry shell
-   ```
-3. **Initialize Project Structure (if needed):**
-   ```bash
-   chmod +x scripts/setup_project.sh
-   ./scripts/setup_project.sh
-   ```
-   This will create the necessary directories.
-4. **Download Data:**
-   Make sure you have `kagglehub` configured and run:
-   ```bash
-   python3 scripts/download_data.py
-   ```
-   This will fetch the dataset into `data/raw`.
-5. **Preprocess Data:**
-   ```bash
-   python3 scripts/preprocess_data.py
-   ```
-   The cleaned, intermediate data will be stored in `data/interim`.
-6. **Train the Model:**
-   ```bash
-   python3 scripts/train_transformer.py
-   ```
-   This step will fine-tune DistilBERT on your dataset. Once completed, it will save the model and tokenizer in `models/transformer_model`.
-## Inference
-After training, you can run inference on a custom review:
 ```bash
-python3 scripts/inference.py
 ```
-You’ll be prompted to enter a review. The script will then display the predicted sentiment and associated probabilities.
-## Example
-**Running Inference:**
 ```bash
-(venv) $ python3 scripts/inference.py
 ```
-You might see something like:
 ```
 Steam Review Sentiment Inference
-**Welcome!**
 This tool uses a fine-tuned DistilBERT model to predict whether a given Steam review is *Positive* or *Negative*.
-Please enter the Steam review text (This game is amazing!): This game sucks!
 Loading model and tokenizer...
 Running inference...
 Inference Result
 Predicted Sentiment: Negative
 Sentiment Probabilities:
- Positive: 0.0224
- Negative: 0.9776
 ```
-## Troubleshooting
-- **Model Directory Not Found:** Ensure that you have trained the model and that `models/transformer_model` exists.
-- **CUDA/GPUs:** If you have a GPU and want to speed up training or inference, ensure PyTorch is installed with CUDA support and that `torch.cuda.is_available()` returns `True`.
-- **Missing Dependencies:** If you encounter `ModuleNotFoundError`, install missing packages with `poetry add <package>` or `pip install <package>` depending on your setup.
 ## License
-This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
-## Acknowledgments
-- [Hugging Face](https://huggingface.co/) for Transformers & Datasets libraries.
-- [Rich](https://github.com/Textualize/rich) for enhanced CLI UI.
-- [Steam](https://store.steampowered.com/) for the dataset.
-- The open-source community for tools and inspiration.

+---
+language: en
+license: mit
+datasets:
+- steam_reviews
+tags:
+- sentiment-analysis
+- text-classification
+- transformers
+- distilbert
+- pytorch
+metrics:
+- accuracy
+widget:
+- text: This game blew my mind! Loved every minute.
+library_name: transformers
+pipeline_tag: text-classification
+model_name: distilbert-base-uncased-steam-sentiment
+base_model:
+- distilbert/distilbert-base-uncased
+---
+# DistilBERT for Steam Reviews Sentiment Analysis
+This repository provides a DistilBERT-based model fine-tuned on a dataset of Steam reviews to classify reviews as **Positive** or **Negative**. DistilBERT is a smaller, faster distillation of BERT, which makes the model a good fit for large-scale or latency-sensitive applications.
+## Model Description
+- **Base Model:** [DistilBERT-base-uncased](https://huggingface.co/distilbert-base-uncased)
+- **Task:** Binary sentiment classification
+- **Trained On:** A large collection of user reviews from Steam
+- **Performance:** ~89% accuracy on the test set
+The model is trained specifically on Steam reviews, where language can be raw and sometimes offensive. It may also work on other short texts such as movie reviews, but expect accuracy to drop outside the gaming domain.
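For reference, accuracy is the fraction of test reviews whose predicted label matches the true label. The sketch below shows that calculation, assuming labels are encoded as 1 for Positive and 0 for Negative (the encoding implied by the direct-inference snippet later in this README). The `accuracy` helper and the sample labels are made up for illustration and are not taken from the training script.

```python
def accuracy(predictions, references):
    """Return the fraction of predictions that equal their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up labels for illustration only: 1 = Positive, 0 = Negative.
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
print(f"accuracy = {accuracy(y_pred, y_true):.2f}")
```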
+## Use Cases
+- **Game Recommendation Systems:** Identify user sentiment towards titles to refine recommendation algorithms.
+- **Community Management:** Spot negative feedback early and improve customer support responses.
+- **Market Research & Insights:** Understand what features or aspects of a product users love or dislike.
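As one possible shape for the community-management use case, the sketch below picks out reviews whose Negative probability is high enough to warrant a follow-up. It assumes each review has already been scored by the model; the `ScoredReview` class, the `flag_for_follow_up` helper, the 0.9 threshold, and the sample scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredReview:
    text: str
    negative_prob: float  # model's probability for the Negative class


def flag_for_follow_up(reviews, threshold=0.9):
    """Return reviews at or above the threshold, most negative first."""
    flagged = [r for r in reviews if r.negative_prob >= threshold]
    return sorted(flagged, key=lambda r: r.negative_prob, reverse=True)


# Made-up scores for illustration; real values would come from the model.
scored = [
    ScoredReview("Refunded after an hour.", 0.93),
    ScoredReview("Great soundtrack, decent story.", 0.05),
    ScoredReview("Crashes on launch every time.", 0.98),
]
for review in flag_for_follow_up(scored):
    print(f"{review.negative_prob:.2f}  {review.text}")
```

The threshold trades recall for precision: lowering it surfaces more complaints but also more borderline reviews.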
+## Installation Requirements
+### Python & Environment Setup
+- **Python version:** 3.10 or later recommended.
+- **Package Manager:** [Poetry](https://python-poetry.org/) recommended, or you may use `pip`.
+### Necessary Libraries
+- [transformers](https://github.com/huggingface/transformers) (for loading and using the model)
+- [torch](https://pytorch.org/) (for model inference and tensor operations)
+- [rich](https://github.com/Textualize/rich) (for a more appealing command-line UI)
+- [evaluate](https://github.com/huggingface/evaluate) (optional, for metrics if needed)
+- [scikit-learn](https://scikit-learn.org/) (optional, if you want to train or evaluate metrics locally)
+**Install with Poetry:**
+```bash
+poetry install
+poetry shell
+```
+**Install with pip:**
 ```bash
+pip install torch transformers rich
 ```
+## Model Files
+After placing the model and tokenizer files in the repository root, you should have:
+- `config.json`
+- `model.safetensors` (or `pytorch_model.bin` if you used that format)
+- `special_tokens_map.json`
+- `tokenizer_config.json`
+- `tokenizer.json`
+- `vocab.txt`
+- `training_args.bin` (optional, stores training parameters)
+- `README.md` (this file)
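A quick way to confirm the directory is complete before loading the model is to check for the files listed above. This is a minimal sketch using only the standard library; the `missing_model_files` helper is hypothetical, and it treats every listed tokenizer file as expected even though some setups may load with fewer.

```python
from pathlib import Path

# File names taken from the list above; training_args.bin is optional and skipped.
EXPECTED_FILES = [
    "config.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "vocab.txt",
]
# Either weight format is acceptable.
WEIGHT_FILES = ["model.safetensors", "pytorch_model.bin"]


def missing_model_files(model_dir="."):
    """Return the names of expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


print(missing_model_files() or "all expected files present")
```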
+## Running Inference
+We provide an `inference.py` script that:
+- Prompts the user for a review string.
+- Loads the model and tokenizer directly from the current directory.
+- Uses the model to predict whether the review is Positive or Negative.
+- Displays probabilities and predictions using a rich UI.
+### Example Inference
+**Usage:**
 ```bash
+python inference.py
 ```
+**Example Output:**
 ```
 Steam Review Sentiment Inference
+Welcome!
 This tool uses a fine-tuned DistilBERT model to predict whether a given Steam review is *Positive* or *Negative*.
+Please enter the Steam review text (This game is amazing!): This game is boring and repetitive
 Loading model and tokenizer...
 Running inference...
 Inference Result
 Predicted Sentiment: Negative
 Sentiment Probabilities:
+ Positive: 0.1234
+ Negative: 0.8766
 ```
+### Code Snippet for Direct Inference
+If you want to run inference programmatically (without the script):
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+# Load the tokenizer and fine-tuned weights from the current directory.
+model_name = "./"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+review_text = "I absolutely loved this game!"
+# Tokenize the review, truncating or padding it to 128 tokens.
+inputs = tokenizer(review_text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
+# Inference only, so skip gradient tracking.
+with torch.no_grad():
+    outputs = model(**inputs)
+    # Convert the two logits into class probabilities.
+    probs = torch.softmax(outputs.logits, dim=1)
+    predicted_class = torch.argmax(probs, dim=1).item()
+# Class index 1 is Positive, index 0 is Negative.
+sentiment = "Positive" if predicted_class == 1 else "Negative"
+print(sentiment, probs.tolist())
+```
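The softmax step in the snippet above turns the model's two raw logits into the probabilities shown in the example output. Below is a dependency-free sketch of that conversion, assuming index 0 is Negative and index 1 is Positive as in the snippet. The `label_probabilities` helper and the example logits are illustrative, not part of the repository.

```python
import math

# Assumed label order, matching the snippet above: 0 = Negative, 1 = Positive.
LABELS = ("Negative", "Positive")


def label_probabilities(logits):
    """Apply a numerically stable softmax and key the result by label name."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(LABELS, exps)}


# Made-up logits for illustration.
for label, prob in label_probabilities([-1.5, 2.0]).items():
    print(f"{label}: {prob:.4f}")
```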
+## Limitations & Biases
+- The model is trained on Steam reviews, where language can be harsh or contain slurs. It may inherit biases from the data.
+- Not guaranteed to understand sarcasm, humor, or context unrelated to gaming.
+- Results outside the gaming domain might be less accurate.
 ## License
+This project is released under the [MIT License](./LICENSE).
+## Contact & Feedback
+If you have suggestions, want to contribute, or encounter issues, feel free to open a discussion or contact Ericson Willians ([email protected]). Your feedback is appreciated!
+---
+With this setup, you can easily integrate this sentiment analysis model into your pipelines, dashboards, or research projects. Enjoy exploring the sentiment of Steam reviews!