---
library_name: span-marker
tags:
  - span-marker
  - token-classification
  - ner
  - named-entity-recognition
  - muppet-roberta-large-ner
datasets:
  - DFKI-SLT/few-nerd
metrics:
  - precision
  - recall
  - f1
widget:
  - text: "His name was Radu-Sebastian Amarie, and was building IJW trying to figure out how to properly extract entities from raw data. He's from Romania and he's eager to watch Dune 2."
    example_title: "Random few NERD examples."
  - text: "The Alabama Supreme Court effectively halted in vitro fertilization at several state hospitals and caused a massive nationwide backlash when it ruled last week in a wrongful death case that frozen embryos used in IVF are considered people. Dr. Paula Amato, the president of the American Society for Reproductive Medicine, said in a press release it was a mistake to conflate frozen fertilized eggs with embryos developing within a mother."
    example_title: "News"
pipeline_tag: token-classification
license: cc-by-sa-4.0
language:
  - en
model-index:
  - name: >-
      SpanMarker w. facebook/muppet-roberta-large on finegrained, supervised FewNERD
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          name: finegrained, supervised FewNERD
          type: DFKI-SLT/few-nerd
          config: supervised
          split: test
          revision: 6f0944f5a1d47c359b4f5de03ed1d58c98f297b5
        metrics:
          - type: f1
            value: 0.705678
            name: F1
          - type: precision
            value: 0.701648
            name: Precision
          - type: recall
            value: 0.709755
            name: Recall
---

# SpanMarker

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model for Named Entity Recognition, trained on the [DFKI-SLT/few-nerd](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset.
Training took approximately 8 hours on an Nvidia 4090; the checkpoint that was finally chosen comes from the first half of the training run.

## Training and Validation Metrics

![Training and validation metrics](eek/span-marker-muppet-roberta-large-fewnerd-fine-super/blob/main/training_validation_metrics.png)

The published model corresponds to the checkpoint at step 25000.

## Test Set Evaluation

Test-set results for a few manually selected checkpoints, identified by training step as in the plot above. Checkpoint 25000, the published model, has the highest F1.

| Checkpoint (step) | Precision | Recall | F1 | Accuracy | Runtime (s) | Samples/s |
|------------------:|----------:|---------:|---------:|---------:|------------:|----------:|
| 17000 | 0.706066 | 0.691239 | 0.698574 | 0.926213 | 335.172 | 123.474 |
| 18000 | 0.695331 | 0.700382 | 0.697847 | 0.926372 | 301.435 | 137.293 |
| 19000 | 0.70618 | 0.693775 | 0.699923 | 0.926492 | 301.032 | 137.477 |
| 20000 | 0.700665 | 0.701572 | 0.701118 | 0.927128 | 299.706 | 138.085 |
| 21000 | 0.706467 | 0.695591 | 0.700987 | 0.926318 | 299.62 | 138.125 |
| 22000 | 0.698079 | 0.710756 | 0.704361 | 0.928094 | 300.041 | 137.931 |
| 24000 | 0.709286 | 0.695769 | 0.702463 | 0.926329 | 300.339 | 137.794 |
| 25000 | 0.701648 | 0.709755 | 0.705678 | 0.92792 | 299.905 | 137.994 |
| 26000 | 0.702509 | 0.708147 | 0.705317 | 0.927998 | 301.161 | 137.418 |
| 27000 | 0.707315 | 0.698796 | 0.703029 | 0.926493 | 299.692 | 138.092 |

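
The F1 column is the harmonic mean of precision and recall, which can be checked directly against the table. The sketch below recomputes F1 for three of the rows and picks the checkpoint with the highest value. The `f1_score` helper and the `rows` dictionary are names introduced here for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) copied from three rows of the test-set table.
rows = {
    22000: (0.698079, 0.710756),
    25000: (0.701648, 0.709755),
    26000: (0.702509, 0.708147),
}

f1_by_checkpoint = {step: f1_score(p, r) for step, (p, r) in rows.items()}
best_step = max(f1_by_checkpoint, key=f1_by_checkpoint.get)

for step, f1 in f1_by_checkpoint.items():
    print(f"{step}: F1 = {f1:.6f}")
print("best checkpoint:", best_step)
```

The recomputed values agree with the reported F1 column to within rounding, which is a quick sanity check that the three metrics in each row belong together.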

## Model Details

### Model Description
- **Model Type:** SpanMarker
- **Encoder:** [muppet-roberta-large](https://huggingface.co/facebook/muppet-roberta-large)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 6 words
- **Training Dataset:** [DFKI-SLT/few-nerd](https://huggingface.co/datasets/DFKI-SLT/few-nerd)
- **Language:** en
- **License:** cc-by-sa-4.0
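
To make the maximum entity length concrete: a span-based tagger treats word spans as entity candidates, and with a limit of 6 words no longer span can be predicted. The sketch below enumerates candidate spans under that limit. It is a simplified illustration, not SpanMarker's actual implementation; `candidate_spans` is a hypothetical helper, and whitespace splitting stands in for real tokenization.

```python
from typing import List, Tuple


def candidate_spans(words: List[str], max_entity_length: int = 6) -> List[Tuple[int, int]]:
    """Return (start, end) word-index pairs, end exclusive, for every span
    of at most `max_entity_length` words."""
    spans = []
    for start in range(len(words)):
        longest_end = min(start + max_entity_length, len(words))
        for end in range(start + 1, longest_end + 1):
            spans.append((start, end))
    return spans


words = "His name was Radu-Sebastian Amarie .".split()
spans = candidate_spans(words)
print(len(words), "words ->", len(spans), "candidate spans")
print([" ".join(words[s:e]) for s, e in spans[:3]])
```

Capping the span length keeps the number of candidates roughly linear in sentence length instead of quadratic, at the cost of missing entities longer than the cap.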

### Useful Links

- Training used the SpanMarker `Trainer`, available here: [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("eek/span-marker-muppet-roberta-large-fewnerd-fine-super")
# Run inference; the result is a list with one dictionary per detected entity
entities = model.predict("His name was Radu.")
print(entities)
```
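
`model.predict` returns a list of dictionaries, one per detected entity. The sketch below shows one way to post-process such a list. It assumes each dictionary carries `span`, `label`, and `score` keys; the sample predictions, including their labels and scores, are made up for illustration and are not real model output. `filter_entities` is a hypothetical helper.

```python
from typing import Dict, List


def filter_entities(entities: List[Dict], min_score: float = 0.5) -> List[Dict]:
    """Keep predictions whose confidence is at least `min_score`, highest first."""
    kept = [e for e in entities if e["score"] >= min_score]
    return sorted(kept, key=lambda e: e["score"], reverse=True)


# Illustrative stand-in for the output of `model.predict(...)`; not real model output.
sample_entities = [
    {"span": "Radu", "label": "person-other", "score": 0.91},
    {"span": "Romania", "label": "location-GPE", "score": 0.97},
    {"span": "IJW", "label": "organization-other", "score": 0.42},
]

for entity in filter_entities(sample_entities, min_score=0.5):
    print(f'{entity["span"]!r:12} {entity["label"]:20} {entity["score"]:.2f}')
```

A score threshold like this trades recall for precision, so the value of `min_score` is best chosen on held-out data rather than fixed in advance.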

The model can also be used in spaCy through the [SpanMarker integration](https://spacy.io/universe/project/span_marker).

```python
import spacy

# Load a spaCy pipeline without its own NER component, then add SpanMarker in its place
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker", config={"model": "eek/span-marker-muppet-roberta-large-fewnerd-fine-super"})

text = """Cleopatra VII, also known as Cleopatra the Great, was the last active ruler of the \
Ptolemaic Kingdom of Egypt. She was born in 69 BCE and ruled Egypt from 51 BCE until her \
death in 30 BCE."""
doc = nlp(text)
print([(entity, entity.label_) for entity in doc.ents])
```

## Training Details

### Framework Versions
- Python: 3.10.13
- SpanMarker: 1.5.0
- Transformers: 4.36.2
- PyTorch: 2.2.1+cu121
- Datasets: 2.18.0
- Tokenizers: 0.15.2
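
To compare a local environment against the versions listed above, the installed package versions can be read with the standard library. This is a minimal sketch: the distribution names in `EXPECTED` (for example `span_marker` and `torch`) are assumed to be the PyPI names of the listed frameworks, and `installed_versions` is a helper introduced here.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Versions listed in this card, keyed by assumed PyPI distribution name.
EXPECTED = {
    "span_marker": "1.5.0",
    "transformers": "4.36.2",
    "torch": "2.2.1+cu121",
    "datasets": "2.18.0",
    "tokenizers": "0.15.2",
}


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, have in installed_versions(EXPECTED).items():
    status = "ok" if have == EXPECTED[name] else "differs"
    print(f"{name}: expected {EXPECTED[name]}, installed {have} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; the check only shows where an environment departs from the one used for training.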

### Training Arguments

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="models/span-marker-muppet-roberta-large-fewnerd-fine-super",
    learning_rate=1e-5,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=8,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=1000,
    eval_steps=500,
    push_to_hub=False,
    logging_steps=50,
    fp16=True,
    warmup_ratio=0.1,
    dataloader_num_workers=1,
    load_best_model_at_end=True,
)
```
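
A few quantities follow from these arguments by simple arithmetic. The sketch below assumes a single GPU (the card mentions one Nvidia 4090) and that checkpoint numbers count optimizer steps, which is how the Hugging Face `Trainer` numbers its checkpoints. The variable names are illustrative.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_devices = 1  # assumption: training ran on a single GPU

# Samples contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Training samples processed by the time checkpoint 25000 was saved.
published_step = 25_000
samples_seen = published_step * effective_batch_size

# With eval_steps=500 and save_steps=1000, every saved checkpoint
# coincides with an evaluation, as load_best_model_at_end requires.
eval_steps, save_steps = 500, 1000
evaluations_per_save = save_steps // eval_steps

print("effective batch size:", effective_batch_size)
print("samples seen at step 25000:", samples_seen)
print("evaluations per saved checkpoint:", evaluations_per_save)
```

Here a "sample" is whatever unit the data loader yields, which need not correspond one-to-one with sentences in the raw dataset.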

## Thanks

Thanks to Tom Aarsen for the SpanMarker library.

### BibTeX
```bibtex
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```

## Model Card Authors

- [Radu-Sebastian Amarie](https://huggingface.co/eek)