Update README.md
README.md
CHANGED
@@ -72,7 +72,15 @@ We have created a free [Google Colab notebook](https://colab.research.google.com
 ## 📥 Training Data (Unfiltered)
 The model is trained on synthetic annotations generated by LLaMA-3.1-8B-instruct on [Pile Uncopyrighted](https://huggingface.co/datasets/monology/pile-uncopyrighted)
 Download the raw distillation data (BIO format) here: [link to raw data](https://drive.google.com/file/d/1slUHvSIP0yrzNJBIJivBRWe0Z10fjlM1/view?usp=sharing)
-
+
+You can load the data into a dataset using the following code:
+```python
+# pip install datasets
+from datasets import Dataset
+ds = Dataset.from_json('pretrain_data.jsonl')
+```
+
+*⚠️ Note: This dataset is **unfiltered** and contains noisy annotations as well as type leakage with respect to the benchmark entity types. A cleaned, benchmark-safe version will be released along with the official code.*
 
 ## 📊 Performance
 
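The README text added in this diff describes the raw data as BIO-formatted but does not show the record layout. As a rough illustration of what working with BIO tags usually involves, here is a minimal sketch that groups tagged tokens into entity spans. It assumes tags of the form `B-<type>`, `I-<type>`, and `O`. The helper name `bio_to_spans` is hypothetical, and because the field names in `pretrain_data.jsonl` are not stated in the source, the function takes plain Python lists rather than dataset rows.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Hypothetical helper: assumes tags look like "B-person", "I-person", "O".
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if there is one.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        # Split only on the first hyphen so types like "first-name" survive.
        prefix, _, label = tag.partition("-")
        if prefix == "B" and label:
            flush()
            current_tokens, current_type = [token], label
        elif prefix == "I" and label and label == current_type:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # this sketch treats both as outside any entity.
            flush()

    flush()
    return spans


example_tokens = ["Barack", "Obama", "visited", "Paris", "."]
example_tags = ["B-person", "I-person", "O", "B-location", "O"]
print(bio_to_spans(example_tokens, example_tags))
# [('Barack Obama', 'person'), ('Paris', 'location')]
```

Dropping stray `I-` tags is only one possible policy; with noisy annotations, as the note in the diff warns, it may be preferable to start a new entity on them instead.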