alecocc committed (verified) · Commit 6fdb5de · 1 Parent(s): 9e10748

Update README.md

Files changed (1)
  1. README.md +5 -12
README.md CHANGED
@@ -5,6 +5,8 @@ language:
 base_model:
 - google-bert/bert-base-cased
 pipeline_tag: token-classification
+datasets:
+- disi-unibo-nlp/PileUncopyrighted-NER-BIO
 ---
 
 # ZeroNER: Fueling Zero-Shot Named Entity Recognition via Entity Type Descriptions
@@ -70,17 +72,9 @@ We have created a free [Google Colab notebook](https://colab.research.google.com
 
 
 ## 📥 Training Data (Unfiltered)
-The model is trained on synthetic annotations generated by LLaMA-3.1-8B-instruct on [Pile Uncopyrighted](https://huggingface.co/datasets/monology/pile-uncopyrighted)
-Download the raw distillation data (BIO format) here: [link to raw data](https://drive.google.com/file/d/1slUHvSIP0yrzNJBIJivBRWe0Z10fjlM1/view?usp=sharing)
+The model is trained on synthetic annotations generated by LLaMA-3.1-8B-instruct over the [Pile Uncopyrighted](https://huggingface.co/datasets/monology/pile-uncopyrighted) dataset.
 
-Then, you can load the data into a dataset using the following code:
-```python
-# pip install datasets
-from datasets import Dataset
-ds = Dataset.from_json('pretrain_data.jsonl')
-```
-
-*⚠️ Note: This dataset is **unfiltered** and contains noisy annotations as well as type leakage with respect to the benchmark entity types. A cleaned, benchmark-safe version will be released along with the official code.*
+The resulting automatically annotated dataset, [PileUncopyrighted-NER-BIO](https://huggingface.co/datasets/disi-unibo-nlp/PileUncopyrighted-NER-BIO), follows the BIO format and was used as the training source for this model.
 
 ## 📊 Performance
 
@@ -117,5 +111,4 @@ If you use ZeroNER in your research, please cite:
 ISBN = "979-8-89176-256-5",
 abstract = "What happens when a named entity recognition (NER) system encounters entities it has never seen before? In practical applications, models must generalize to unseen entity types where labeled training data is either unavailable or severely limited{---}a challenge that demands zero-shot learning capabilities. While large language models (LLMs) offer extensive parametric knowledge, they fall short in cost-effectiveness compared to specialized small encoders. Existing zero-shot methods predominantly adopt a relaxed definition of the term with potential leakage issues and rely on entity type names for generalization, overlooking the value of richer descriptions for disambiguation. In this work, we introduce ZeroNER, a description-driven framework that enhances hard zero-shot NER in low-resource settings. By leveraging general-domain annotations and entity type descriptions with LLM supervision, ZeroNER enables a BERT-based student model to successfully identify unseen entity types. Evaluated on three real-world benchmarks, ZeroNER consistently outperforms LLMs by up to 16{\%} in F1 score, and surpasses lightweight baselines that use type names alone. Our analysis further reveals that LLMs derive significant benefits from incorporating type descriptions in the prompts."
 }
-```
-
+```
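
This commit replaces the old Google Drive dump (and its `Dataset.from_json` loading snippet) with the Hub-hosted [PileUncopyrighted-NER-BIO](https://huggingface.co/datasets/disi-unibo-nlp/PileUncopyrighted-NER-BIO) dataset. A minimal sketch of loading it with the 🤗 `datasets` library follows; the `"train"` split name is an assumption, since the commit does not document the dataset's schema, so inspect the dataset page before relying on it.

```python
# pip install datasets
from datasets import load_dataset

# Load the BIO-formatted distillation data from the Hugging Face Hub.
# NOTE: the "train" split name is an assumption, not documented in this
# commit; verify it on the dataset page before use.
ds = load_dataset("disi-unibo-nlp/PileUncopyrighted-NER-BIO", split="train")

print(ds.column_names)  # inspect the actual schema (tokens / BIO tags)
print(ds[0])            # one record: tokens paired with B-/I-/O labels
```

Compared with the removed `Dataset.from_json('pretrain_data.jsonl')` snippet, loading from the Hub ties the training data to a versioned dataset repo rather than an unversioned file download.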