🧠 NERClassifier-BERT-CoNLL2003
A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). This model is ideal for information extraction, document tagging, and question answering systems.
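For a quick sanity check, the standard 🤗 token-classification pipeline can be pointed at this checkpoint. This is a minimal sketch; the `aggregation_strategy="simple"` option, which merges subword pieces into whole entity spans, is a suggestion and not something specified elsewhere in this card.

```python
from transformers import pipeline

# Load the fine-tuned checkpoint into the generic token-classification pipeline.
ner = pipeline(
    "token-classification",
    model="AventIQ-AI/ner_bert_conll2003",
    aggregation_strategy="simple",  # group subword pieces into entity spans
)

print(ner("Barack Obama visited Google in California."))
```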
✨ Model Highlights
📌 Based on bert-base-cased (by Google)
🔍 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC
💾 Available in both full and quantized versions for fast inference
🧠 Intended Uses
• Resume and document parsing
• News article analysis
• Question answering pipelines
• Chatbots and virtual assistants
• Information retrieval and tagging
🚫 Limitations
• Trained on English-only NER data (CoNLL-2003)
• May not perform well on informal text (e.g., tweets, slang)
• Entity boundaries may be misaligned with subword tokenization
• Limited performance on extremely long sequences (>128 tokens)
🏋️‍♂️ Training Details
| Field | Value |
|---|---|
| Base Model | bert-base-cased |
| Dataset | CoNLL-2003 |
| Framework | PyTorch with 🤗 Transformers |
| Epochs | 5 |
| Batch Size | 16 |
| Max Length | 128 tokens |
| Optimizer | AdamW |
| Loss | CrossEntropyLoss (token-level) |
| Device | Trained on CUDA-enabled GPU |
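The exact training script is not part of this card; the sketch below shows how a comparable fine-tuning run could be set up with 🤗 Transformers and the hyperparameters above. The learning rate, output directory, and dataset loading via `load_dataset("conll2003")` are assumptions for illustration only.

```python
# Illustrative fine-tuning sketch (assumed setup, not the exact training script).
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

dataset = load_dataset("conll2003")                                # assumed dataset source
label_list = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(label_list))

def tokenize_and_align_labels(examples):
    # Tokenize into subwords; label only the first subword of each word, ignore the rest (-100).
    tokenized = tokenizer(examples["tokens"], truncation=True, max_length=128,
                          is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous, label_ids = None, []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                label_ids.append(-100)        # special tokens / extra subwords ignored by the loss
            else:
                label_ids.append(tags[word_id])
            previous = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

args = TrainingArguments(
    output_dir="ner-bert-conll2003",      # assumed
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=5e-5,                   # assumed; not stated in this card
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()   # Trainer uses AdamW and token-level cross-entropy by default
```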
📊 Evaluation Metrics
| Metric | Score |
|---|---|
| Accuracy | 0.98 |
| F1-Score | 0.97 |
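The evaluation script behind these numbers is not included here. As a reference, span-level NER metrics are commonly computed with the `seqeval` library, as in this small sketch with made-up label sequences:

```python
# Hypothetical scoring example with seqeval (pip install seqeval); sequences are illustrative.
from seqeval.metrics import accuracy_score, f1_score

true_labels = [["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]]
pred_labels = [["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]]

print("Accuracy:", accuracy_score(true_labels, pred_labels))
print("F1-Score:", f1_score(true_labels, pred_labels))
```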
🔎 Label Mapping
| Label ID | Entity Type |
|---|---|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
| 7 | B-MISC |
| 8 | I-MISC |
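The same mapping ships in the model configuration, so it can be read programmatically instead of hard-coding the table above (the usage code below relies on `model.config.id2label`):

```python
# Inspect the id2label mapping stored in the fine-tuned model's config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("AventIQ-AI/ner_bert_conll2003")
print(config.id2label)  # expected to match the table above, e.g. {0: "O", 1: "B-PER", ...}
```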
🚀 Usage
```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_tokens(text):
    # Tokenize, run the model, and map each subword token to its predicted label.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    return list(zip(tokens, labels))

# Test example
print(predict_tokens("Barack Obama visited Google in California."))
```
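The output above is per subword (including [CLS] and [SEP]). To address the subword-alignment limitation noted earlier, one common workaround is to keep only the prediction for the first subword of each word via `word_ids()`. The helper below is a sketch of that idea (the function name is ours, not part of the released code) and reuses the `tokenizer` and `model` loaded above:

```python
def predict_words(text):
    # Aggregate subword predictions to word level: keep each word's first-subword label.
    words = text.split()
    inputs = tokenizer(words, return_tensors="pt", truncation=True, max_length=128,
                       is_split_into_words=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)[0]
    word_ids = inputs.word_ids(batch_index=0)

    results, seen = [], set()
    for idx, word_id in enumerate(word_ids):
        if word_id is None or word_id in seen:   # skip special tokens and non-first subwords
            continue
        seen.add(word_id)
        results.append((words[word_id], model.config.id2label[predictions[idx].item()]))
    return results

print(predict_words("Barack Obama visited Google in California."))
```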
🧩 Quantization
Post-training static quantization was applied using PyTorch to reduce model size and improve inference performance on edge devices.
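For illustration only, the sketch below applies PyTorch's post-training dynamic quantization to the linear layers; the published quantized weights were produced with static quantization, which additionally requires calibration data, so treat this as a rough approximation rather than the actual procedure:

```python
import torch
from transformers import BertForTokenClassification

# Illustrative only: dynamic int8 quantization of the fine-tuned model's linear layers.
model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized_model)  # Linear layers are replaced with dynamically quantized versions
```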
🗂 Repository Structure
.
├── model/ # Quantized model files
├── tokenizer_config/ # Tokenizer and vocab files
├── model.safetensors # Fine-tuned model in safetensors format
├── README.md # Model card
🤝 Contributing
Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.