# 🧠 Keyphrase Extraction with BERT (Fine-Tuned on `midas/inspec`)
This repository contains a complete pipeline to **fine-tune BERT** for **Keyphrase Extraction** using the [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text.
---
## 🔧 Features
- ✅ Preprocessed dataset with BIO-tagged tokens
- ✅ Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers
- ✅ Token-label alignment
- ✅ Evaluation using `seqeval` metrics (Precision, Recall, F1)
- ✅ Inference pipeline to extract keyphrases
- ✅ CUDA-enabled for GPU acceleration
---
## 📂 Dataset
**Source:** [`midas/inspec`](https://huggingface.co/datasets/midas/inspec)
- Fields:
- `document`: List of tokenized words (already split)
- `doc_bio_tags`: BIO-format labels for keyphrases
- Splits:
- `train`: 1000 samples
- `validation`: 500 samples
- `test`: 500 samples
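
A quick way to inspect these fields (a minimal sketch; the `extraction` configuration is the one that carries `document` and `doc_bio_tags`):

```python
from datasets import load_dataset

# Load the token-level extraction configuration of midas/inspec
dataset = load_dataset("midas/inspec", "extraction")

sample = dataset["train"][0]
print(sample["document"][:10])      # first ten tokens of the abstract
print(sample["doc_bio_tags"][:10])  # matching BIO labels ("B", "I", "O")
```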
---
## 🚀 Setup & Installation
```bash
git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec
pip install -r requirements.txt
```
### `requirements.txt`
```text
datasets
transformers
evaluate
seqeval
torch
```
---
## 🧪 Training
```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)
```
1. Load and preprocess data with aligned BIO labels (see the alignment sketch below)
2. Fine-tune `bert-base-cased` on the dataset
3. Evaluate and save model artifacts
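
Because BERT's WordPiece tokenizer splits words into subword pieces, the dataset's word-level BIO tags must be realigned to the tokenized output. A minimal sketch of that step, using the standard `-100` convention so that special tokens and continuation pieces are ignored by the loss (the `tag2id` mapping is an assumption of this example):

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tag2id = {"B": 0, "I": 1, "O": 2}  # dataset BIO tags -> label ids

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["document"], truncation=True, is_split_into_words=True
    )
    all_labels = []
    for i, tags in enumerate(examples["doc_bio_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)  # special tokens are ignored by the loss
            elif word_id != previous:
                labels.append(tag2id[tags[word_id]])
            else:
                labels.append(-100)  # label only the first piece of each word
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
```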
### Training Script Overview
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("keyphrase-bert-inspec")
```
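
The overview above assumes that `model`, `training_args`, `data_collator`, and `compute_metrics` are already defined. A minimal sketch of those pieces (the hyperparameters and the IOB-style label names are illustrative assumptions, not the tuned values):

```python
import numpy as np
import evaluate

# IOB-style names let seqeval (and later the NER pipeline) group spans
id2label = {0: "B-KEY", 1: "I-KEY", 2: "O"}
label2id = {v: k for k, v in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=3, id2label=id2label, label2id=label2id
)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
training_args = TrainingArguments(
    output_dir="keyphrase-bert-inspec",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop the -100 positions before scoring with seqeval
    true_labels = [
        [id2label[l] for l in row if l != -100] for row in labels
    ]
    true_preds = [
        [id2label[p] for p, l in zip(p_row, l_row) if l != -100]
        for p_row, l_row in zip(predictions, labels)
    ]
    scores = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": scores["overall_precision"],
        "recall": scores["overall_recall"],
        "f1": scores["overall_f1"],
        "accuracy": scores["overall_accuracy"],
    }
```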
---
## 📊 Evaluation Metrics
Validation-split scores reported by `seqeval`:
```python
{
"precision": 0.84,
"recall": 0.81,
"f1": 0.825,
"accuracy": 0.88
}
```
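
For a quick check against the held-out test split, `trainer.evaluate` can be pointed at it directly (a sketch, assuming the `trainer` and `tokenized_datasets` objects from the training step):

```python
test_metrics = trainer.evaluate(tokenized_datasets["test"])
print(test_metrics)
```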
---
## 🔍 Inference Example
```python
from transformers import pipeline

# Load the fine-tuned model as a token-classification pipeline
ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple",  # merge subword pieces into full spans
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

print("🟢 Extracted Keyphrases:")
for r in results:
    print(f"- {r['word']} (score: {r['score']:.2f})")
```
### Sample Output
```
🟢 Extracted Keyphrases:
- Information-based semantics (score: 0.94)
- philosophy of mind (score: 0.91)
```
---
## 💾 Model Artifacts
After training, the model and tokenizer are saved as:
```
keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
├── special_tokens_map.json
└── vocab.txt
```
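
These artifacts can be reloaded directly from the local directory saved above:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("keyphrase-bert-inspec")
tokenizer = AutoTokenizer.from_pretrained("keyphrase-bert-inspec")
```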
---
## 📌 Future Improvements
- Add postprocessing to group fragmented tokens
- Use a larger dataset (like `scientific_keyphrases`)
- Convert to a web app using Gradio or Streamlit
---
## 👨‍🔬 Author
**Your Name**
GitHub: [@your-username](https://github.com/your-username)
Contact: [email protected]
---