# 🧠 Keyphrase Extraction with BERT (Fine-Tuned on `midas/inspec`)
This repository contains a complete pipeline to **fine-tune BERT** for **Keyphrase Extraction** using the [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text.
---
## 🔧 Features
- ✅ Preprocessed dataset with BIO-tagged tokens
- ✅ Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers
- ✅ Token-label alignment
- ✅ Evaluation using `seqeval` metrics (Precision, Recall, F1)
- ✅ Inference pipeline to extract keyphrases
- ✅ CUDA-enabled for GPU acceleration
---
## 📂 Dataset
**Source:** [`midas/inspec`](https://huggingface.co/datasets/midas/inspec)
- Fields:
- `document`: List of tokenized words (already split)
- `doc_bio_tags`: BIO-format labels for keyphrases
- Splits:
- `train`: 1000 samples
- `validation`: 500 samples
- `test`: 500 samples
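
A quick way to inspect these fields (a minimal sketch; the `extraction` configuration is the one that carries `document` and `doc_bio_tags`):

```python
from datasets import load_dataset

# Load the token-level extraction configuration of midas/inspec
dataset = load_dataset("midas/inspec", "extraction")

sample = dataset["train"][0]
print(sample["document"][:10])      # first ten tokens of the abstract
print(sample["doc_bio_tags"][:10])  # matching BIO labels ("B", "I", "O")
```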
---
## 🚀 Setup & Installation
```bash
git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec
pip install -r requirements.txt
```
### `requirements.txt`
```text
datasets
transformers
evaluate
seqeval
torch
```
---
## 🧪 Training
```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)
```
1. Load and preprocess data with aligned BIO labels (see the alignment sketch below)
2. Fine-tune `bert-base-cased` on the dataset
3. Evaluate and save model artifacts
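
Because BERT's WordPiece tokenizer splits words into subword pieces, the dataset's word-level BIO tags must be realigned to the tokenized output. A minimal sketch of that step, using the standard `-100` convention so that special tokens and continuation pieces are ignored by the loss (the `tag2id` mapping is an assumption of this example):

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tag2id = {"B": 0, "I": 1, "O": 2}  # dataset BIO tags -> label ids

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["document"], truncation=True, is_split_into_words=True
    )
    all_labels = []
    for i, tags in enumerate(examples["doc_bio_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)  # special tokens are ignored by the loss
            elif word_id != previous:
                labels.append(tag2id[tags[word_id]])
            else:
                labels.append(-100)  # label only the first piece of each word
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
```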
### Training Script Overview
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("keyphrase-bert-inspec")
```
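
The overview above assumes that `model`, `training_args`, `data_collator`, and `compute_metrics` are already defined. A minimal sketch of those pieces (the hyperparameters and the IOB-style label names are illustrative assumptions, not the tuned values):

```python
import numpy as np
import evaluate

# IOB-style names let seqeval (and later the NER pipeline) group spans
id2label = {0: "B-KEY", 1: "I-KEY", 2: "O"}
label2id = {v: k for k, v in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=3, id2label=id2label, label2id=label2id
)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
training_args = TrainingArguments(
    output_dir="keyphrase-bert-inspec",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop the -100 positions before scoring with seqeval
    true_labels = [
        [id2label[l] for l in row if l != -100] for row in labels
    ]
    true_preds = [
        [id2label[p] for p, l in zip(p_row, l_row) if l != -100]
        for p_row, l_row in zip(predictions, labels)
    ]
    scores = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": scores["overall_precision"],
        "recall": scores["overall_recall"],
        "f1": scores["overall_f1"],
        "accuracy": scores["overall_accuracy"],
    }
```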
---
## 📊 Evaluation Metrics
Validation-split scores reported by `seqeval`:
```python
{
"precision": 0.84,
"recall": 0.81,
"f1": 0.825,
"accuracy": 0.88
}
```
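
For a quick check against the held-out test split, `trainer.evaluate` can be pointed at it directly (a sketch, assuming the `trainer` and `tokenized_datasets` objects from the training step):

```python
test_metrics = trainer.evaluate(tokenized_datasets["test"])
print(test_metrics)
```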
---
## 🔍 Inference Example
```python
from transformers import pipeline

# Load the fine-tuned model as a token-classification pipeline
ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple",  # merge subword pieces into full spans
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

print("🟢 Extracted Keyphrases:")
for r in results:
    print(f"- {r['word']} (score: {r['score']:.2f})")
```
### Sample Output
```
🟢 Extracted Keyphrases:
- Information-based semantics (score: 0.94)
- philosophy of mind (score: 0.91)
```
---
## 💾 Model Artifacts
After training, the model and tokenizer are saved as:
```
keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
├── special_tokens_map.json
└── vocab.txt
```
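
These artifacts can be reloaded directly from the local directory saved above:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("keyphrase-bert-inspec")
tokenizer = AutoTokenizer.from_pretrained("keyphrase-bert-inspec")
```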
---
## 📌 Future Improvements
- Add postprocessing to group fragmented tokens
- Use a larger dataset (like `scientific_keyphrases`)
- Convert to a web app using Gradio or Streamlit
---
## 👨‍🔬 Author
**Your Name**
GitHub: [@your-username](https://github.com/your-username)
Contact: [email protected]
---