🧠 Keyphrase Extraction with BERT (Fine-Tuned on midas/inspec)

This repository contains a complete pipeline for fine-tuning BERT on keyphrase extraction with the midas/inspec dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text.


🔧 Features

  • ✅ Preprocessed dataset with BIO-tagged tokens
  • ✅ Fine-tuning BERT (bert-base-cased) using Hugging Face Transformers
  • ✅ Token-label alignment
  • ✅ Evaluation using seqeval metrics (Precision, Recall, F1)
  • ✅ Inference pipeline to extract keyphrases
  • ✅ CUDA-enabled for GPU acceleration

📂 Dataset

Source: midas/inspec

  • Fields:
    • document: List of tokenized words (already split)
    • doc_bio_tags: BIO-format labels for keyphrases
  • Splits:
    • train: 1000 samples
    • validation: 500 samples
    • test: 500 samples
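
A quick way to load the splits and inspect one sample (this sketch assumes the dataset's "extraction" config, which is the one exposing doc_bio_tags; check the dataset card if the config name differs):

from datasets import load_dataset

dataset = load_dataset("midas/inspec", "extraction")
print(dataset)                        # DatasetDict with train/validation/test splits

sample = dataset["train"][0]
print(sample["document"][:8])         # first few pre-tokenized words
print(sample["doc_bio_tags"][:8])     # their BIO labels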

🚀 Setup & Installation

git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec

pip install -r requirements.txt

requirements.txt

datasets
transformers
evaluate
seqeval
torch

🧪 Training

The training pipeline:

  1. Load and preprocess data with aligned BIO labels (see the alignment sketch below)
  2. Fine-tune bert-base-cased on the dataset
  3. Evaluate and save model artifacts

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
)
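
Step 1 is the subtle part: the dataset is labeled per word, but BERT splits words into subwords, so the BIO tags must be re-aligned. A minimal sketch of that alignment (the function name and the -100 masking convention follow common Hugging Face practice; they are not verbatim from this repo's script):

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label2id = {"O": 0, "B": 1, "I": 2}   # raw tags as stored in doc_bio_tags

def tokenize_and_align_labels(batch):
    tokenized = tokenizer(batch["document"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(batch["doc_bio_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, prev = [], None
        for wid in word_ids:
            if wid is None:
                labels.append(-100)                  # special tokens: ignored by the loss
            elif wid != prev:
                labels.append(label2id[tags[wid]])   # first subword carries the word's tag
            else:
                labels.append(-100)                  # later subwords are masked out
            prev = wid
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

dataset = load_dataset("midas/inspec", "extraction")
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)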

Training Script Overview (hyperparameter values are illustrative, not tuned):

label_list = ["O", "B-KEY", "I-KEY"]   # readable names; the -KEY suffix lets seqeval group chunks
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_list), id2label=dict(enumerate(label_list))
)
training_args = TrainingArguments(
    output_dir="keyphrase-bert-inspec",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("keyphrase-bert-inspec")
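
The compute_metrics callback passed to the Trainer above can be built on seqeval through the evaluate library; a minimal sketch, assuming the label_list and -100 masking from the snippets above:

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # drop the -100 positions (special tokens and non-first subwords)
    true_preds = [
        [label_list[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(preds, labels)
    ]
    true_labels = [
        [label_list[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(preds, labels)
    ]
    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }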

📊 Evaluation Metrics

{
  "precision": 0.84,
  "recall": 0.81,
  "f1": 0.825,
  "accuracy": 0.88
}

🔍 Inference Example

from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple"
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

print("🟢 Extracted Keyphrases:")
for r in results:
    print(f" - {r['word']} (score: {r['score']:.2f})")

Sample Output

🟢 Extracted Keyphrases:
 - Information-based semantics (score: 0.94)
 - philosophy of mind (score: 0.91)
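
For GPU inference (the CUDA support listed under Features), pass a device index when constructing the pipeline; this is standard pipeline usage rather than anything specific to this repo:

ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple",
    device=0,  # first CUDA device; omit to stay on CPU
)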

💾 Model Artifacts

After training, the model and tokenizer are saved as:

keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
└── vocab.txt
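
The saved directory can be reloaded directly (recent transformers versions write model.safetensors in place of pytorch_model.bin):

from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("keyphrase-bert-inspec")
tokenizer = AutoTokenizer.from_pretrained("keyphrase-bert-inspec")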

📌 Future Improvements

  • Add postprocessing to group fragmented tokens
  • Use a larger dataset (like scientific_keyphrases)
  • Convert to a web app using Gradio or Streamlit

👨‍🔬 Author

Your Name
GitHub: @your-username
Contact: [email protected]