---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: imdb-distilbert-funetuned
  results: []
datasets:
- ajaykarthick/imdb-movie-reviews
language:
- en
library_name: transformers
pipeline_tag: text-classification
---

# DistilBERT IMDb Sentiment Classifier

## Model Description

This is a fine-tuned version of [DistilBERT](https://huggingface.co/distilbert-base-uncased) for sentiment analysis on the IMDb movie review dataset. DistilBERT is a smaller, faster, and lighter variant of BERT that retains most of BERT's natural language understanding performance.

The model classifies movie reviews as either **positive** or **negative**. It is best suited to movie reviews; similar text such as customer feedback, social media posts, or product reviews may also work, subject to the caveats under Limitations.

## Intended Use

This model is intended for text classification, specifically binary sentiment analysis: it labels a piece of text as having either positive or negative sentiment.

### Use Cases

- **Movie review sentiment analysis**
- **Customer feedback analysis**
- **Social media sentiment monitoring**
- **Product review classification**

## How to Use

Here is how you can use this model with the Hugging Face `transformers` library:

```python
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "Ashaduzzaman/imdb-distilbert-funetuned"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)

# Example text
text = "The movie was absolutely fantastic! The acting was superb and the story was gripping."

# Tokenize (truncating inputs longer than the model's maximum length) and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
logits = outputs.logits
predictions = torch.softmax(logits, dim=1)

# Get the predicted label (index 0 = negative, index 1 = positive)
predicted_label = torch.argmax(predictions, dim=1).item()
labels = ["Negative", "Positive"]
print(f"Predicted sentiment: {labels[predicted_label]}")
```

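The post-processing step in the example above (logits, then softmax, then argmax) can be reproduced without any libraries, which may help when inspecting raw model outputs. The following is a minimal sketch: the helper name `interpret_logits` is hypothetical, and the label order follows the `labels` list above, which assumes index 0 is negative and index 1 is positive.

```python
import math

def interpret_logits(logits, labels=("Negative", "Positive")):
    """Convert raw classifier logits into a (label, confidence) pair."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the highest probability selects the label.
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Example logits chosen for illustration only, not real model output.
label, confidence = interpret_logits([-2.0, 3.0])
print(f"{label} ({confidence:.3f})")
```

The returned confidence is the softmax probability of the winning class, so for two classes it always lies between 0.5 and 1.0.
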
## Training Data

This model was trained on the IMDb movie review dataset, a large dataset for binary sentiment classification. It contains 50,000 highly polarized movie reviews and is balanced, with 25,000 positive and 25,000 negative reviews.

## Training Procedure

The model was fine-tuned on the IMDb dataset with the following configuration:

- **Optimizer**: AdamW (Adam with betas=(0.9, 0.999) and epsilon=1e-08)
- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Epochs**: 2
- **Max Sequence Length**: 512 tokens

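As a sanity check, the batch size and epoch count listed above are consistent with the step counts reported in the training results table. The sketch below assumes a training split of 25,000 reviews (the standard IMDb train split; this card does not state the split size), a single device, and no gradient accumulation.

```python
import math

# Assumption: 25,000 training examples (standard IMDb train split; not stated in this card).
train_examples = 25_000
batch_size = 16  # from the configuration list
epochs = 2       # from the configuration list

# The final, partially filled batch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

Under these assumptions the values match the `Step` column of the results table (1563 after epoch 1, 3126 after epoch 2).
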
### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 0.2239        | 1.0   | 1563 | 0.2026          | 0.9227   |
| 0.1468        | 2.0   | 3126 | 0.2319          | 0.9320   |

Final validation metrics (epoch 2):

- **Loss:** 0.2319
- **Accuracy:** 0.9320

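In the table, validation loss rises from epoch 1 to epoch 2 while accuracy still improves, so which epoch counts as "best" depends on the selection metric. The sketch below uses only the numbers from the table; the `results` structure and the `best_epoch` helper are hypothetical and not part of the training code.

```python
# Per-epoch metrics copied from the training results table.
results = [
    {"epoch": 1, "train_loss": 0.2239, "val_loss": 0.2026, "accuracy": 0.9227},
    {"epoch": 2, "train_loss": 0.1468, "val_loss": 0.2319, "accuracy": 0.9320},
]

def best_epoch(rows, metric, higher_is_better):
    """Return the epoch number with the best value of `metric`."""
    pick = max if higher_is_better else min
    return pick(rows, key=lambda row: row[metric])["epoch"]

by_accuracy = best_epoch(results, "accuracy", higher_is_better=True)
by_val_loss = best_epoch(results, "val_loss", higher_is_better=False)
print(by_accuracy, by_val_loss)
```

The final metrics reported in this card correspond to epoch 2, the epoch with the higher accuracy.
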
## Limitations

- The model was trained only on the IMDb dataset, so its effectiveness may be reduced on other domains or types of text.
- Sentiment detection is binary (positive or negative). Neutral sentiment and more nuanced emotions are not captured.
- The model may not perform well on text that is highly sarcastic, contains slang, or is very short (e.g., one-word reviews).

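Because the model has no neutral class, one possible workaround is to treat low-confidence predictions as undecided instead of forcing a label. The sketch below is not part of this model: the helper name `label_with_abstain` and the 0.75 threshold are illustrative assumptions, and any real threshold would need tuning on held-out data.

```python
def label_with_abstain(prob_positive, threshold=0.75):
    """Map P(positive) to a label, abstaining when neither class is confident.

    `threshold` is a hypothetical cutoff, not a value from this model card.
    """
    if not 0.0 <= prob_positive <= 1.0:
        raise ValueError("prob_positive must be a probability in [0, 1]")
    if prob_positive >= threshold:
        return "Positive"
    if 1.0 - prob_positive >= threshold:
        return "Negative"
    return "Uncertain"

# Illustrative probabilities, not real model outputs.
for p in (0.97, 0.55, 0.10):
    print(p, label_with_abstain(p))
```

A low softmax probability is only a rough proxy for neutrality; it does not turn the classifier into a three-class model.
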
## Ethical Considerations

- **Bias**: The model may reflect biases present in the IMDb dataset. Users should be cautious about applying it to sensitive applications.
- **Content**: Because the training data consists of movie reviews, the model might not generalize well to text outside this context.

## Acknowledgments

- The original [DistilBERT](https://huggingface.co/distilbert-base-uncased) model was developed by Hugging Face.
- The IMDb dataset is provided by Stanford and can be found [here](https://ai.stanford.edu/~amaas/data/sentiment/).

## Framework versions

- Transformers 4.42.4
- PyTorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1