🧠 NERClassifier-BERT-CoNLL2003
A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). This model is ideal for information extraction, document tagging, and question answering systems.
---
✨ Model Highlights
📌 Based on bert-base-cased (by Google)
🔍 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC
💾 Available in both full and quantized versions for fast inference
---
🧠 Intended Uses
• Resume and document parsing
• News article analysis
• Question answering pipelines
• Chatbots and virtual assistants
• Information retrieval and tagging
---
🚫 Limitations
• Trained on English-only NER data (CoNLL-2003)
• May not perform well on informal text (e.g., tweets, slang)
• Entity boundaries may be misaligned with subword tokenization
• Sequences longer than 128 tokens are truncated, so entities beyond that limit are missed
---
🏋️‍♂️ Training Details
| Field | Value |
| -------------- | ------------------------------ |
| **Base Model** | `bert-base-cased` |
| **Dataset** | CoNLL-2003 |
| **Framework** | PyTorch with 🤗 Transformers |
| **Epochs** | 5 |
| **Batch Size** | 16 |
| **Max Length** | 128 tokens |
| **Optimizer** | AdamW |
| **Loss** | CrossEntropyLoss (token-level) |
| **Device** | Trained on CUDA-enabled GPU |
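
The exact training script is not part of this repository. The sketch below shows how a comparable fine-tuning run could be set up with 🤗 Transformers' `Trainer` using the hyperparameters listed above; the dataset loading and the `tokenize_and_align_labels` helper are illustrative, not the original code.

```python
# Illustrative fine-tuning setup matching the table above (not the original training script).
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForTokenClassification,
                          TrainingArguments, Trainer, DataCollatorForTokenClassification)

dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(label_list))

def tokenize_and_align_labels(batch):
    # Tokenize into subwords and align word-level NER tags to the first subword;
    # the remaining subwords get -100 so they are ignored by the token-level loss.
    tokenized = tokenizer(batch["tokens"], is_split_into_words=True,
                          truncation=True, max_length=128)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous, ids = None, []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                ids.append(-100)
            else:
                ids.append(tags[word_id])
            previous = word_id
        labels.append(ids)
    tokenized["labels"] = labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

args = TrainingArguments(output_dir="ner-bert-conll2003",
                         num_train_epochs=5, per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized_dataset["train"],
                  eval_dataset=tokenized_dataset["validation"],
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```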
---
📊 Evaluation Metrics
| Metric | Score |
| ----------------------------------------------- | ----- |
| Accuracy | 0.98 |
| F1-Score | 0.97 |
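
The evaluation procedure is not included, and the table does not state whether the scores are token-level or entity-level. One common way to compute such metrics for CoNLL-style tags is the `seqeval` package, sketched below with placeholder label sequences.

```python
# Hypothetical metric computation with seqeval (not the original evaluation code).
from seqeval.metrics import accuracy_score, f1_score

# y_true / y_pred are lists of label sequences, one per sentence.
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"]]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
```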
---
🔎 Label Mapping
| Label ID | Entity Type |
| -------- | ----------- |
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
| 7 | B-MISC |
| 8 | I-MISC |
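
For reference, the table above corresponds to the dictionary below; `model.config.id2label` in the checkpoint is the authoritative mapping and should agree with it.

```python
# Label mapping from the table above; model.config.id2label should match this.
id2label = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
    7: "B-MISC", 8: "I-MISC",
}
label2id = {label: idx for idx, label in id2label.items()}
```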
---
🚀 Usage
```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch
model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()
def predict_tokens(text):
    # Tokenize the input and run a forward pass without gradient tracking
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs).logits
    # Pick the highest-scoring label for every (subword) token
    predictions = torch.argmax(outputs, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    return list(zip(tokens, labels))
# Test example
print(predict_tokens("Barack Obama visited Google in California."))
```
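
Because BERT splits words into subword pieces, the raw token/label pairs above can be awkward to consume directly. A convenient alternative is the 🤗 `pipeline` API, which groups subwords back into whole-entity spans; the snippet below is a sketch using `aggregation_strategy="simple"`.

```python
# Optional: aggregate subword predictions into entity spans with the pipeline API.
from transformers import pipeline

ner = pipeline("token-classification",
               model="AventIQ-AI/ner_bert_conll2003",
               aggregation_strategy="simple")

for entity in ner("Barack Obama visited Google in California."):
    # Each item carries the grouped entity text, its label, and a confidence score.
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```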
---
🧩 Quantization
Post-training static quantization was applied with PyTorch to reduce the model size and speed up inference on edge devices.
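
The exact quantization recipe is not included in this repository. As a rough illustration of post-training quantization in PyTorch, the sketch below applies dynamic quantization to the linear layers; this is a simpler variant than the static quantization described above and is shown only to convey the general idea.

```python
# Illustrative post-training quantization (dynamic variant) of the linear layers.
# Note: the model card states static quantization was used; this is a simplified stand-in.
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```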
---
🗂 Repository Structure
```
.
├── model/ # Quantized model files
├── tokenizer_config/ # Tokenizer and vocab files
├── model.safetensors      # Fine-tuned model in safetensors format
└── README.md              # Model card
```
---
🤝 Contributing
Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.