disi-unibo-nlp/zeroner-base

Token Classification

alecocc committed 17 days ago

Commit 49419fa · verified · 1 parent: 99485ac

Update README.md

Files changed (1)

README.md +9 -1

@@ -72,7 +72,15 @@ We have created a free [Google Colab notebook](https://colab.research.google.com
 ## 📥 Training Data (Unfiltered)
 The model is trained on synthetic annotations generated by LLaMA-3.1-8B-instruct on [Pile Uncopyrighted](https://huggingface.co/datasets/monology/pile-uncopyrighted)
 Download the raw distillation data (BIO format) here: [link to raw data](https://drive.google.com/file/d/1slUHvSIP0yrzNJBIJivBRWe0Z10fjlM1/view?usp=sharing)
-*⚠️ Note: This dataset is **unfiltered**, and may contain type leakage with respect to the benchmark entity types. A cleaned, benchmark-safe version will be released along with the official code.*
 ## 📊 Performance

 ## 📥 Training Data (Unfiltered)
 The model is trained on synthetic annotations generated by LLaMA-3.1-8B-instruct on [Pile Uncopyrighted](https://huggingface.co/datasets/monology/pile-uncopyrighted)
 Download the raw distillation data (BIO format) here: [link to raw data](https://drive.google.com/file/d/1slUHvSIP0yrzNJBIJivBRWe0Z10fjlM1/view?usp=sharing)
+You can load the data into a dataset using the following code:
+```python
+# pip install datasets
+from datasets import Dataset
+ds = Dataset.from_json('pretrain_data.jsonl')
+```
+*⚠️ Note: This dataset is **unfiltered** and contains noisy annotations as well as type leakage with respect to the benchmark entity types. A cleaned, benchmark-safe version will be released along with the official code.*
 ## 📊 Performance
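
The note added in this commit warns about type leakage with respect to the benchmark entity types. As a rough illustration of what a leakage check over BIO-tagged records could look like, here is a minimal sketch. It is not the authors' filtering procedure, which the README says will ship with the official code. The field names `tokens` and `labels`, the helper names, and the exact-match comparison on lowercased type names are all assumptions made for illustration; the actual schema of `pretrain_data.jsonl` should be inspected first.

```python
from typing import Dict, List, Set


def entity_types(bio_tags: List[str]) -> Set[str]:
    """Collect the entity types that appear in a BIO tag sequence."""
    types = set()
    for tag in bio_tags:
        if tag != "O" and "-" in tag:
            # "B-person" / "I-person" -> "person" (lowercased for comparison)
            types.add(tag.split("-", 1)[1].lower())
    return types


def is_benchmark_safe(
    record: Dict[str, List[str]],
    benchmark_types: Set[str],
    tag_field: str = "labels",  # hypothetical field name, adjust to the real schema
) -> bool:
    """Return True if none of the record's entity types is a benchmark type."""
    return entity_types(record[tag_field]).isdisjoint(benchmark_types)


# Toy records standing in for rows of the distillation data (assumed layout).
records = [
    {"tokens": ["Marie", "Curie", "won"], "labels": ["B-person", "I-person", "O"]},
    {"tokens": ["Aspirin", "helps"], "labels": ["B-Drug", "O"]},
]
benchmark_types = {"person"}  # hypothetical benchmark type set

# Keep only records that share no entity type with the benchmark.
safe = [r for r in records if is_benchmark_safe(r, benchmark_types)]
```

If the loaded records do follow this layout, the same predicate could be passed to `ds.filter(lambda r: is_benchmark_safe(r, benchmark_types))` on the `Dataset` created by the snippet in the diff. Exact string matching will miss paraphrased type names (for example `person` versus `people`), so this is only a first-pass check.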