# 🧠 Keyphrase Extraction with BERT (Fine-Tuned on `midas/inspec`)
This repository contains a complete pipeline for fine-tuning BERT for keyphrase extraction on the `midas/inspec` dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text.
## 🔧 Features
- ✅ Preprocessed dataset with BIO-tagged tokens
- ✅ Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers
- ✅ Token-label alignment
- ✅ Evaluation using `seqeval` metrics (Precision, Recall, F1)
- ✅ Inference pipeline to extract keyphrases
- ✅ CUDA-enabled for GPU acceleration
## 📂 Dataset
Source: `midas/inspec` (Hugging Face Hub)

- Fields:
  - `document`: list of tokenized words (already split)
  - `doc_bio_tags`: BIO-format labels for keyphrases
- Splits:
  - `train`: 1000 samples
  - `validation`: 500 samples
  - `test`: 500 samples
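To get a feel for these fields, the dataset can be loaded directly from the Hugging Face Hub. A minimal sketch; the `extraction` config name is assumed here, as it is the one that exposes `doc_bio_tags`:

```python
from datasets import load_dataset

# Load midas/inspec; the "extraction" config carries document/doc_bio_tags
dataset = load_dataset("midas/inspec", "extraction")

sample = dataset["train"][0]
print(sample["document"][:8])      # pre-tokenized words
print(sample["doc_bio_tags"][:8])  # matching BIO labels, e.g. "B", "I", "O"
```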
## 🚀 Setup & Installation
```bash
git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec
pip install -r requirements.txt
```
**requirements.txt**

```
datasets
transformers
evaluate
seqeval
```
## 🧪 Training
```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
```
- Load and preprocess data with aligned BIO labels (see the alignment sketch below)
- Fine-tune `bert-base-cased` on the dataset
- Evaluate and save model artifacts
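The alignment helper itself is not shown in this README; below is a minimal sketch of the standard recipe, assuming the tag strings in `doc_bio_tags` are `"B"`, `"I"`, and `"O"`, and that only the first subword of each word keeps a real label (the rest get `-100`, which the loss ignores):

```python
label2id = {"B": 0, "I": 1, "O": 2}  # assumed tag set of doc_bio_tags

def tokenize_and_align_labels(examples):
    # Words are already split, so tell the tokenizer not to re-split them
    tokenized = tokenizer(
        examples["document"], truncation=True, is_split_into_words=True
    )
    all_labels = []
    for i, tags in enumerate(examples["doc_bio_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, prev_word = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)            # [CLS]/[SEP]: ignored by the loss
            elif word_id != prev_word:
                labels.append(label2id[tags[word_id]])
            else:
                labels.append(-100)            # label only the first subword
            prev_word = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
```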
**Training Script Overview:**
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("keyphrase-bert-inspec")
```
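The `Trainer` call references several objects constructed earlier in the script. A plausible minimal setup is sketched below; the hyperparameters are illustrative, not necessarily those used to produce the reported metrics:

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    TrainingArguments,
)

label_list = ["B", "I", "O"]  # assumed tag set of doc_bio_tags

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={label: i for i, label in enumerate(label_list)},
)

# Dynamically pads both inputs and labels to the longest sequence per batch
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="keyphrase-bert-inspec",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)
```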
## 📊 Evaluation Metrics
```json
{
  "precision": 0.84,
  "recall": 0.81,
  "f1": 0.825,
  "accuracy": 0.88
}
```
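These numbers come from the `compute_metrics` callback passed to the `Trainer`. A minimal sketch of how it could be wired up with `evaluate`/`seqeval`; mapping the bare `B`/`I` tags to a dummy `KEY` entity type is an assumption made here, since `seqeval` expects `B-TYPE`/`I-TYPE` tags:

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

# seqeval parses "B-TYPE"/"I-TYPE" tags, so the dataset's bare B/I labels
# are mapped to a single dummy entity type ("KEY") for scoring
id2tag = {0: "B-KEY", 1: "I-KEY", 2: "O"}

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Keep only positions with a real label (drop the -100 special tokens/subwords)
    refs = [[id2tag[l] for l in row if l != -100] for row in labels]
    preds = [
        [id2tag[p] for p, l in zip(p_row, l_row) if l != -100]
        for p_row, l_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=preds, references=refs)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```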
## 🔍 Inference Example
```python
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple",
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

for r in results:
    print(f"{r['word']} ({r['entity_group']}) - {r['score']:.2f}")
```
**Sample Output**

```
🟢 Extracted Keyphrases:
- Information-based semantics (score: 0.94)
- philosophy of mind (score: 0.91)
```
## 💾 Model Artifacts
After training, the model and tokenizer are saved as:
```
keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
└── vocab.txt
```
## 📌 Future Improvements
- Add postprocessing to group fragmented tokens
- Use a larger dataset (like `scientific_keyphrases`)
- Convert to a web app using Gradio or Streamlit (see the sketch below)
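As a sketch of the web-app idea, a minimal Gradio wrapper around the inference pipeline could look like this; the layout and labels are placeholders:

```python
import gradio as gr
from transformers import pipeline

extractor = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple",
)

def extract_keyphrases(text):
    # Format each aggregated span as "phrase (score: 0.93)"
    results = extractor(text)
    return "\n".join(f"{r['word']} (score: {r['score']:.2f})" for r in results)

demo = gr.Interface(
    fn=extract_keyphrases,
    inputs=gr.Textbox(lines=4, label="Scientific text"),
    outputs=gr.Textbox(label="Extracted keyphrases"),
    title="Keyphrase Extraction with BERT",
)
demo.launch()
```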
## 👨‍🔬 Author
Your Name
GitHub: @your-username
Contact: [email protected]