🧠 NERClassifier-BERT-CoNLL2003
A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). This model is ideal for information extraction, document tagging, and question answering systems.
---
✨ Model Highlights
📌 Based on bert-base-cased (by Google)
🔍 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC
💾 Available in both full and quantized versions for fast inference
---
🧠 Intended Uses
• Resume and document parsing
• News article analysis
• Question answering pipelines
• Chatbots and virtual assistants
• Information retrieval and tagging
---
🚫 Limitations
• Trained on English-only NER data (CoNLL-2003)
• May not perform well on informal text (e.g., tweets, slang)
• Entity boundaries may be misaligned with subword tokenization
• Sequences longer than 128 tokens are truncated, so entities beyond that limit are missed
---
🏋️‍♂️ Training Details
| Field | Value |
| -------------- | ------------------------------ |
| **Base Model** | `bert-base-cased` |
| **Dataset** | CoNLL-2003 |
| **Framework** | PyTorch with 🤗 Transformers |
| **Epochs** | 5 |
| **Batch Size** | 16 |
| **Max Length** | 128 tokens |
| **Optimizer** | AdamW |
| **Loss** | CrossEntropyLoss (token-level) |
| **Device** | Trained on CUDA-enabled GPU |
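
The exact training script is not part of this repository. The sketch below shows how a comparable fine-tuning run could be set up with 🤗 Transformers' `Trainer` using the hyperparameters listed above; the dataset loading and the `tokenize_and_align_labels` helper are illustrative, not the original code.

```python
# Illustrative fine-tuning setup matching the table above (not the original training script).
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForTokenClassification,
                          TrainingArguments, Trainer, DataCollatorForTokenClassification)

dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(label_list))

def tokenize_and_align_labels(batch):
    # Tokenize into subwords and align word-level NER tags to the first subword;
    # the remaining subwords get -100 so they are ignored by the token-level loss.
    tokenized = tokenizer(batch["tokens"], is_split_into_words=True,
                          truncation=True, max_length=128)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous, ids = None, []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                ids.append(-100)
            else:
                ids.append(tags[word_id])
            previous = word_id
        labels.append(ids)
    tokenized["labels"] = labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

args = TrainingArguments(output_dir="ner-bert-conll2003",
                         num_train_epochs=5, per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized_dataset["train"],
                  eval_dataset=tokenized_dataset["validation"],
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```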
---
📊 Evaluation Metrics
| Metric | Score |
| ----------------------------------------------- | ----- |
| Accuracy | 0.98 |
| F1-Score | 0.97 |
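
The evaluation procedure is not included, and the table does not state whether the scores are token-level or entity-level. One common way to compute such metrics for CoNLL-style tags is the `seqeval` package, sketched below with placeholder label sequences.

```python
# Hypothetical metric computation with seqeval (not the original evaluation code).
from seqeval.metrics import accuracy_score, f1_score

# y_true / y_pred are lists of label sequences, one per sentence.
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"]]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
```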
---
🔎 Label Mapping
| Label ID | Entity Type |
| -------- | ----------- |
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
| 7 | B-MISC |
| 8 | I-MISC |
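
For reference, the table above corresponds to the dictionary below; `model.config.id2label` in the checkpoint is the authoritative mapping and should agree with it.

```python
# Label mapping from the table above; model.config.id2label should match this.
id2label = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
    7: "B-MISC", 8: "I-MISC",
}
label2id = {label: idx for idx, label in id2label.items()}
```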
---
🚀 Usage
```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch
model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()
def predict_tokens(text):
    # Tokenize the input and run a forward pass without gradient tracking
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs).logits
    # Pick the highest-scoring label for every (subword) token
    predictions = torch.argmax(outputs, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    return list(zip(tokens, labels))
# Test example
print(predict_tokens("Barack Obama visited Google in California."))
```
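
Because BERT splits words into subword pieces, the raw token/label pairs above can be awkward to consume directly. A convenient alternative is the 🤗 `pipeline` API, which groups subwords back into whole-entity spans; the snippet below is a sketch using `aggregation_strategy="simple"`.

```python
# Optional: aggregate subword predictions into entity spans with the pipeline API.
from transformers import pipeline

ner = pipeline("token-classification",
               model="AventIQ-AI/ner_bert_conll2003",
               aggregation_strategy="simple")

for entity in ner("Barack Obama visited Google in California."):
    # Each item carries the grouped entity text, its label, and a confidence score.
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```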
---
🧩 Quantization
Post-training static quantization was applied with PyTorch to reduce the model size and speed up inference on edge devices.
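
The exact quantization recipe is not included in this repository. As a rough illustration of post-training quantization in PyTorch, the sketch below applies dynamic quantization to the linear layers; this is a simpler variant than the static quantization described above and is shown only to convey the general idea.

```python
# Illustrative post-training quantization (dynamic variant) of the linear layers.
# Note: the model card states static quantization was used; this is a simplified stand-in.
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```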
---
🗂 Repository Structure
```
.
├── model/ # Quantized model files
├── tokenizer_config/ # Tokenizer and vocab files
├── model.safetensors      # Fine-tuned model in safetensors format
└── README.md              # Model card
```
---
🤝 Contributing
Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.