Commit 49419fa (verified), committed by alecocc · 1 Parent(s): 99485ac

Update README.md

Files changed (1)
  1. README.md +9 -1
README.md CHANGED
@@ -72,7 +72,15 @@ We have created a free [Google Colab notebook](https://colab.research.google.com
  ## 📥 Training Data (Unfiltered)
  The model is trained on synthetic annotations generated by LLaMA-3.1-8B-instruct on [Pile Uncopyrighted](https://huggingface.co/datasets/monology/pile-uncopyrighted)
  Download the raw distillation data (BIO format) here: [link to raw data](https://drive.google.com/file/d/1slUHvSIP0yrzNJBIJivBRWe0Z10fjlM1/view?usp=sharing)
- *⚠️ Note: This dataset is **unfiltered**, and may contain type leakage with respect to the benchmark entity types. A cleaned, benchmark-safe version will be released along with the official code.*
+
+ You can load the data into a dataset using the following code:
+ ```python
+ # pip install datasets
+ from datasets import Dataset
+ ds = Dataset.from_json('pretrain_data.jsonl')
+ ```
+
+ *⚠️ Note: This dataset is **unfiltered** and contains noisy annotations as well as type leakage with respect to the benchmark entity types. A cleaned, benchmark-safe version will be released along with the official code.*

  ## 📊 Performance
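
Since the README describes the raw distillation data as being in BIO format, a reader may want to turn tag sequences back into entity spans after loading. The following is a minimal sketch of that decoding step, not the authors' code. It assumes each record can be reduced to two parallel lists, one of tokens and one of BIO tags such as `B-person` / `I-person` / `O`; the function name `bio_to_spans`, the example sentence, and the tag names are hypothetical, and the actual field names in `pretrain_data.jsonl` are not stated in the README.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Assumption: `tokens` and `tags` are parallel lists. An I- tag that does
    not continue an open span of the same type is treated as outside ("O").
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the span that is currently open, if any.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # "O" or a dangling I- tag ends any open span.
            flush()
    flush()
    return spans


# Hypothetical example; real records may use different entity types.
example_tokens = ["Ada", "Lovelace", "visited", "London", "."]
example_tags = ["B-person", "I-person", "O", "B-location", "O"]
print(bio_to_spans(example_tokens, example_tags))
# [('Ada Lovelace', 'person'), ('London', 'location')]
```

Dropping dangling `I-` tags is one possible policy; because the README warns that the annotations are noisy, a stricter or more lenient rule may suit a given use better.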