ericsonwillians committed
Commit d977458 · verified · 1 Parent(s): 60c61cf

Update README.md

Files changed (1)
  1. README.md +138 -107
README.md CHANGED
@@ -1,141 +1,172 @@
1
- # Steam Reviews Sentiment Analysis
2
 
3
- This repository contains code and documentation to fine-tune a transformer-based model (DistilBERT) on a dataset of Steam reviews to determine whether a given review is positive or negative.
4
 
5
- ## Project Overview
6
 
7
- - **Data Processing:** The raw data is downloaded and stored in `data/raw`. We preprocess the data and store cleaned results in `data/interim`.
8
- - **Model Training:** Using Hugging Face Transformers, we fine-tune a `distilbert-base-uncased` model for sentiment classification.
9
- - **Evaluation & Results:** After training, the model’s performance is evaluated on a hold-out test set. Results, including accuracy, are printed and logged.
10
- - **Inference:** A command-line inference script allows you to enter a new review and see the predicted sentiment. This script uses `rich` for a fancy user interface.
11
 
12
- ## Project Structure
13
 
14
- ```
15
- project/
16
- ├─ data/
17
- │ ├─ raw/ # Original downloaded datasets
18
- │ ├─ interim/ # Intermediate preprocessed data
19
- │ └─ processed/ # Final data (if applicable)
20
- ├─ notebooks/ # Jupyter notebooks for EDA and experiments
21
- ├─ scripts/ # Python scripts for download, preprocess, training, inference
22
- │ ├─ setup_project.sh # Script to set up directory structure
23
- │ ├─ download_data.py # Script to download the dataset using kagglehub
24
- │ ├─ preprocess_data.py# Script to preprocess the raw data
25
- │ ├─ train_transformer.py # Script to train and evaluate the transformer model
26
- │ └─ inference.py # Script to run inference on a single review
27
- ├─ models/ # Saved model files and tokenizer
28
- ├─ config/ # Configuration files (if any)
29
- ├─ logs/ # Logs (if any)
30
- ├─ tests/ # Tests (if any)
31
- └─ README.md # This file
32
- ```
33
 
34
- ## Requirements
35
-
36
- - Python 3.10 (or compatible)
37
- - [Poetry](https://python-poetry.org/) for dependency management
38
- - [Hugging Face Transformers](https://github.com/huggingface/transformers)
39
- - [datasets](https://github.com/huggingface/datasets)
40
- - [evaluate](https://github.com/huggingface/evaluate)
41
- - [rich](https://github.com/Textualize/rich)
42
- - [scikit-learn](https://scikit-learn.org/)
43
- - [torch](https://pytorch.org/)
44
-
45
- Ensure these dependencies are listed in your `pyproject.toml` or `requirements.txt`.
46
-
47
- ## Setup Instructions
48
-
49
- 1. **Clone the repository:**
50
- ```bash
51
- git clone https://github.com/yourusername/steam-reviews-sentiment-analysis.git
52
- cd steam-reviews-sentiment-analysis
53
- ```
54
-
55
- 2. **Set up the environment:**
56
- If using Poetry:
57
- ```bash
58
- poetry install
59
- poetry shell
60
- ```
61
-
62
- 3. **Initialize Project Structure (if needed):**
63
- ```bash
64
- chmod +x scripts/setup_project.sh
65
- ./scripts/setup_project.sh
66
- ```
67
- This will create the necessary directories.
68
-
69
- 4. **Download Data:**
70
- Make sure you have `kagglehub` configured and run:
71
- ```bash
72
- python3 scripts/download_data.py
73
- ```
74
- This will fetch the dataset into `data/raw`.
75
-
76
- 5. **Preprocess Data:**
77
- ```bash
78
- python3 scripts/preprocess_data.py
79
- ```
80
- The cleaned, intermediate data will be stored in `data/interim`.
81
-
82
- 6. **Train the Model:**
83
- ```bash
84
- python3 scripts/train_transformer.py
85
- ```
86
- This step will fine-tune DistilBERT on your dataset. Once completed, it will save the model and tokenizer in `models/transformer_model`.
87
-
88
- ## Inference
89
-
90
- After training, you can run inference on a custom review:
91
 
92
  ```bash
93
- python3 scripts/inference.py
94
  ```
95
 
96
- You’ll be prompted to enter a review. The script will then display the predicted sentiment and associated probabilities.
97
 
98
- ## Example
99
 
100
- **Running Inference:**
101
 
102
  ```bash
103
- (venv) $ python3 scripts/inference.py
104
  ```
105
 
106
- You might see something like:
107
-
108
  ```
109
  Steam Review Sentiment Inference
110
-
111
- **Welcome!**
112
  This tool uses a fine-tuned DistilBERT model to predict whether a given Steam review is *Positive* or *Negative*.
113
 
114
- Please enter the Steam review text (This game is amazing!): This game sucks!
115
 
116
  Loading model and tokenizer...
117
-
118
  Running inference...
119
  Inference Result
120
  Predicted Sentiment: Negative
121
  Sentiment Probabilities:
122
- Positive: 0.0224
123
- Negative: 0.9776
124
  ```
125
 
126
- ## Troubleshooting
127
 
128
- - **Model Directory Not Found:** Ensure that you have trained the model and that `models/transformer_model` exists.
129
- - **CUDA/GPUs:** If you have a GPU and want to speed up training or inference, ensure PyTorch is installed with CUDA support and that `torch.cuda.is_available()` returns `True`.
130
- - **Missing Dependencies:** If you encounter `ModuleNotFoundError`, install missing packages with `poetry add <package>` or `pip install <package>` depending on your setup.
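The CUDA check described in the troubleshooting list above can be sketched as a small helper. This is an illustration only: `pick_device` is a hypothetical name, not a function from the repository's scripts, and it assumes that falling back to the CPU is acceptable when PyTorch is missing or no GPU is visible.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    # Hypothetical helper; the repository's scripts may select the device differently.
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

If this prints `cpu` on a machine that has a GPU, the installed PyTorch build most likely lacks CUDA support.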
 
 
 
 
 
 
 
 
 
 
 
 
 
131
 
132
  ## License
133
 
134
- This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
135
 
136
- ## Acknowledgments
137
 
138
- - [Hugging Face](https://huggingface.co/) for Transformers & Datasets libraries.
139
- - [Rich](https://github.com/Textualize/rich) for enhanced CLI UI.
140
- - [Steam](https://store.steampowered.com/) for the dataset.
141
- - The open-source community for tools and inspiration.
 
1
+ ---
2
+ language: en
3
+ license: mit
4
+ datasets:
5
+ - steam_reviews
6
+ tags:
7
+ - sentiment-analysis
8
+ - text-classification
9
+ - transformers
10
+ - distilbert
11
+ - pytorch
12
+ metrics:
13
+ - accuracy
14
+ widget:
15
+ - text: This game blew my mind! Loved every minute.
16
+ library_name: transformers
17
+ pipeline_tag: text-classification
18
+ model_name: distilbert-base-uncased-steam-sentiment
19
+ base_model:
20
+ - distilbert/distilbert-base-uncased
21
+ ---
22
+ ```yaml
23
+ ---
24
+ language: en
25
+ license: mit
26
+ datasets:
27
+ - steam_reviews
28
+ tags:
29
+ - sentiment-analysis
30
+ - text-classification
31
+ - transformers
32
+ - distilbert
33
+ - pytorch
34
+ metrics:
35
+ - accuracy
36
+ widget:
37
+ - text: "This game blew my mind! Loved every minute."
38
+ library_name: transformers
39
+ pipeline_tag: text-classification
40
+ model_name: distilbert-base-uncased-steam-sentiment
41
+ ---
42
+ ```
43
 
44
+ # DistilBERT for Steam Reviews Sentiment Analysis
45
 
46
+ This repository provides a DistilBERT-based model fine-tuned on a dataset of Steam reviews to classify reviews as **Positive** or **Negative**. It is efficient and fast, making it ideal for large-scale or real-time applications.
47
 
48
+ ## Model Description
49
 
50
+ - **Base Model:** [DistilBERT-base-uncased](https://huggingface.co/distilbert-base-uncased)
51
+ - **Task:** Binary sentiment classification
52
+ - **Trained On:** A large collection of user reviews from Steam
53
+ - **Performance:** ~89% accuracy on the test set
54
 
55
+ This model is specifically trained on Steam reviews, where language can be raw and sometimes offensive. It may also work on other short text snippets like movie reviews, but please note that performance might degrade outside the gaming domain.
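For readers unfamiliar with the metric quoted above, accuracy is the fraction of test reviews whose predicted label matches the true label. A minimal sketch follows; the label lists are made-up illustrative values (1 = Positive, 0 = Negative), not results from this model, and `accuracy` is a hypothetical helper rather than the repository's evaluation code.

```python
def accuracy(predictions: list[int], references: list[int]) -> float:
    """Fraction of positions where the predicted label equals the true label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up labels for ten reviews: 1 = Positive, 0 = Negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

print(f"{accuracy(y_pred, y_true):.2f}")  # 9 of 10 labels match, so this prints 0.90
```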
56
+
57
+ ## Use Cases
58
+
59
+ - **Game Recommendation Systems:** Identify user sentiment towards titles to refine recommendation algorithms.
60
+ - **Community Management:** Spot negative feedback early and improve customer support responses.
61
+ - **Market Research & Insights:** Understand what features or aspects of a product users love or dislike.
62
+
63
+ ## Installation Requirements
64
+
65
+ ### Python & Environment Setup
66
 
67
+ - **Python version:** 3.10 or later recommended.
68
+ - **Package Manager:** [Poetry](https://python-poetry.org/) recommended, or you may use `pip`.
69
 
70
+ ### Necessary Libraries
71
+
72
+ - [transformers](https://github.com/huggingface/transformers) (for loading and using the model)
73
+ - [torch](https://pytorch.org/) (for model inference and tensor operations)
74
+ - [rich](https://github.com/Textualize/rich) (for a more appealing command-line UI)
75
+ - [evaluate](https://github.com/huggingface/evaluate) (optional, for metrics if needed)
76
+ - [scikit-learn](https://scikit-learn.org/) (optional, if you want to train or evaluate metrics locally)
77
+
78
+ **Install with Poetry:**
79
+ ```bash
80
+ poetry install
81
+ poetry shell
82
+ ```
83
+
84
+ If using pip:
85
  ```bash
86
+ pip install torch transformers rich
87
  ```
88
 
89
+ ## Model Files
90
+
91
+ After placing the model and tokenizer files in the repository root, you should have:
92
+ - `config.json`
93
+ - `model.safetensors` (or `pytorch_model.bin` if you used that format)
94
+ - `special_tokens_map.json`
95
+ - `tokenizer_config.json`
96
+ - `tokenizer.json`
97
+ - `vocab.txt`
98
+ - `training_args.bin` (optional, stores training parameters)
99
+ - `README.md` (this file)
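A quick way to confirm that the files listed above are in place before loading the model is sketched below. `missing_model_files` is a hypothetical helper, not part of the repository; it treats `model.safetensors` and `pytorch_model.bin` as alternatives and skips `training_args.bin`, which the list marks as optional.

```python
from pathlib import Path

# File names taken from the list above; the weights may be in either format.
REQUIRED_FILES = [
    "config.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "vocab.txt",
]
WEIGHT_FILES = ["model.safetensors", "pytorch_model.bin"]


def missing_model_files(model_dir: str = ".") -> list[str]:
    """Return the expected model/tokenizer files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Only one of the two weight formats needs to be present.
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


print(missing_model_files("."))
```

An empty list means every expected file was found in the directory.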
100
 
101
+ ## Running Inference
102
 
103
+ We provide an `inference.py` script that:
104
+ - Prompts the user for a review string.
105
+ - Loads the model and tokenizer directly from the current directory.
106
+ - Uses the model to predict whether the review is Positive or Negative.
107
+ - Displays probabilities and predictions using a rich UI.
108
 
109
+ ### Example Inference
110
+
111
+ **Usage:**
112
  ```bash
113
+ python inference.py
114
  ```
115
 
116
+ **Example Output:**
 
117
  ```
118
  Steam Review Sentiment Inference
119
+ Welcome!
 
120
  This tool uses a fine-tuned DistilBERT model to predict whether a given Steam review is *Positive* or *Negative*.
121
 
122
+ Please enter the Steam review text (This game is amazing!): This game is boring and repetitive
123
 
124
  Loading model and tokenizer...
 
125
  Running inference...
126
  Inference Result
127
  Predicted Sentiment: Negative
128
  Sentiment Probabilities:
129
+ Positive: 0.1234
130
+ Negative: 0.8766
131
  ```
132
 
133
+ ### Code Snippet for Direct Inference
134
+
135
+ If you want to run inference programmatically (without the script):
136
+
137
+ ```python
138
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
139
+ import torch
140
+
141
+ model_name = "./" # assuming model files are in current directory
142
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
143
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
144
 
145
+ review_text = "I absolutely loved this game!"
146
+ inputs = tokenizer(review_text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
147
+ with torch.no_grad():
148
+     outputs = model(**inputs)
149
+ probs = torch.softmax(outputs.logits, dim=1)
150
+ predicted_class = torch.argmax(probs, dim=1).item()
151
+
152
+ sentiment = "Positive" if predicted_class == 1 else "Negative"
153
+ print(sentiment, probs.tolist())
154
+ ```
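To make the last steps of the snippet concrete, the sketch below reproduces the softmax and argmax arithmetic in plain Python, without loading the model. The logits are made-up numbers, and the label order (index 0 = Negative, index 1 = Positive) is the same assumption the snippet makes; if the model's `config.json` defines an `id2label` mapping, that mapping should be treated as authoritative.

```python
import math

# Assumed label order, mirroring the snippet's `predicted_class == 1` check.
LABELS = ["Negative", "Positive"]


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]  # made-up model outputs for one review
probs = softmax(logits)
# argmax: index of the largest probability
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print(LABELS[predicted_class], [round(p, 4) for p in probs])
```

Because softmax preserves the ordering of the logits, the predicted class is the same whether argmax is applied to the logits or to the probabilities; the probabilities are only needed for display.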
155
+
156
+ ## Limitations & Biases
157
+
158
+ - The model is trained on Steam reviews, where language can be harsh or contain slurs. It may inherit biases from the data.
159
+ - Not guaranteed to understand sarcasm, humor, or context unrelated to gaming.
160
+ - Results outside the gaming domain might be less accurate.
161
 
162
  ## License
163
 
164
+ This project is released under the [MIT License](./LICENSE).
165
+
166
+ ## Contact & Feedback
167
+
168
+ If you have suggestions, want to contribute, or encounter issues, feel free to open a discussion or contact Ericson Willians ([email protected]). Your feedback is appreciated!
169
 
170
+ ---
171
 
172
+ With this setup, you can easily integrate this sentiment analysis model into your pipelines, dashboards, or research projects. Enjoy exploring the sentiment of Steam reviews!