---
base_model: Tasmay-Tib/sarvam-entity-normalisation-llama-3.1-8b
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- gguf
license: apache-2.0
language:
- en
---
# Uploaded model
- **Developed by:** Tasmay-Tib
- **License:** apache-2.0
- **GGUF version of model:** Tasmay-Tib/sarvam-entity-normalisation-llama-3.1-8b
- **Finetuned originally from model:** unsloth/meta-llama-3.1-8b-bnb-4bit
Quantisation: `4-bit`, `Q4_K_M`
As stated, this is the GGUF version of the model `Tasmay-Tib/sarvam-entity-normalisation-llama-3.1-8b`, which is a fine-tune of `unsloth/meta-llama-3.1-8b-bnb-4bit` on the dataset `Tasmay-Tib/sarvam-entity-recognition-gemini-2.0-flash-thinking-01-21-distill-1600`.
To find out more, refer to the original model link below.
Original (non-gguf) bnb model link: [Hugging Face](https://huggingface.co/Tasmay-Tib/sarvam-entity-normalisation-llama-3.1-8b)
Model checkpoint zip (if you need it, though prefer using the model directly from HF): [Google Drive](https://drive.google.com/file/d/14xQg7Hr4BB9fFgpJdI_3vwJStPfloCRa/view?usp=sharing)
Run inference with the model using the script below.
First, install unsloth:
```bash
!pip install unsloth # for colab / jupyter notebooks
```
For the terminal, use:
```bash
pip install unsloth
```
now run:
```python
data_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
## Instruction:
Normalize entities in a given sentence, including dates (various formats), currencies (multiple symbols and notations), and scientific units (single and compound). Convert them into their full, standardized textual representations in the same language.
### Example Input:
15/03/1990 को, वैज्ञानिक ने $120 में 500mg यौगिक का एक नमूना खरीदा।
### Example Response:
पंद्रह मार्च उन्नीस सौ नब्बे को, वैज्ञानिक ने एक सौ बीस अमेरिकी डॉलर में पाँच सौ मिलीग्राम यौगिक का एक नमूना खरीदा।
Just as entities like dates, currencies, and scientific units have been normalized into simple terms, you must do the same. Do not leave any entity un-normalised.
## Input:
{}
## Response:
{}"""
max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage.
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "sarvam-entity-normalisation-llama-3.1-8b", # local checkpoint directory, if you trained it yourself
    model_name = "Tasmay-Tib/sarvam-entity-normalisation-llama-3.1-8b", # fine-tuned model on Hugging Face
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
    [
        data_prompt.format(
            "सूर्य का तापमान लगभग 5500°C है।", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
```
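Since this repo hosts the GGUF weights, they can also be run directly with `llama-cpp-python` instead of unsloth. Below is a minimal sketch, reusing the `data_prompt` template from the script above; the `repo_id` and GGUF filename pattern are assumptions, so check this repo's file list for the exact names:
```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Download and load the Q4_K_M GGUF straight from the Hub.
# NOTE: repo_id and filename are assumptions -- check this repo's "Files" tab.
llm = Llama.from_pretrained(
    repo_id="Tasmay-Tib/sarvam-entity-normalisation-llama-3.1-8b-gguf",
    filename="*Q4_K_M.gguf",  # glob pattern matching the quantised file
    n_ctx=2048,
)

prompt = data_prompt.format("सूर्य का तापमान लगभग 5500°C है।", "")
out = llm(prompt, max_tokens=128, stop=["<|end_of_text|>"])  # llama-3.1 base EOS
print(out["choices"][0]["text"])
```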
GGUF conversion and saving notebook: [Colab](https://colab.research.google.com/drive/18iuIWcybaLSxJwRifQVGpLg4OY0jUpjF?usp=sharing)
Model training script can be found here: [Colab](https://colab.research.google.com/drive/16_c-qOG-4iaHVIcr0BQjRUtK7YFYG99L?usp=sharing)
Wandb Plots: [Weights and Biases](https://api.wandb.ai/links/tasmaytibrewal-iit-kharagpur/4rdl0shl)
Dataset link: [Hugging Face](https://huggingface.co/datasets/Tasmay-Tib/sarvam-entity-recognition-gemini-2.0-flash-thinking-01-21-distill-1600)
Model inference script can be found here: [Colab](https://colab.research.google.com/drive/1TLwc1UXJ1nIY0eVpd6u7nq-uXp4oAPZi?usp=sharing)
Reproduction notebook can be found here: [Colab](https://colab.research.google.com/drive/1oE_0v3zjO2z19hOktWle1pmAyCp_3qgx?usp=sharing)
Model predictions can be found in the files of this repo, the non-GGUF repo, and the HF dataset, named as:
- `eval_data_001_predictions.csv` and `eval_data_001_predictions_excel.csv`
- `train_data_001_predictions.csv` and `train_data_001_predictions_excel.csv`
- `data_001_predictions.csv` and `data_001_predictions_excel.csv`
Predictions were made with the bnb (non-GGUF) model on an L4 GPU, using unsloth's library.
Notebook used for creating the prediction is here: [Colab](https://colab.research.google.com/drive/1lzhRDCB3bFIOYhfo9cjulEYigNxdbR2i?usp=sharing)
The files are also viewable via the following links:
- `eval_data_001_predictions.csv` (`utf-8` encoded): [Google Drive](https://drive.google.com/file/d/1U5mVmjGCh1fApBD-zH5SUezv5YP8s-le/view?usp=sharing)
- `eval_data_001_predictions_excel.csv` (`utf-8-sig` encoded): [Google Drive](https://drive.google.com/file/d/1LAdfANg4IPozTeMjTnoG2fyJa2Edd9hY/view?usp=sharing)
- `train_data_001_predictions.csv` (`utf-8` encoded): [Google Drive](https://drive.google.com/file/d/1-aSzGuMDN_Lc6K-8p28iwcvPmv-fE_wJ/view?usp=sharing)
- `train_data_001_predictions_excel.csv` (`utf-8-sig` encoded): [Google Drive](https://drive.google.com/file/d/11muwFOTfaM5dg-QZP0gZ6xwITUqA9Eaa/view?usp=sharing)
- `data_001_predictions.csv` (`utf-8` encoded): [Google Drive](https://drive.google.com/file/d/1h73ePJZE0o3x_QMtxtGxUCjxVU3zSzTe/view?usp=sharing)
- `data_001_predictions_excel.csv` (`utf-8-sig` encoded): [Google Drive](https://drive.google.com/file/d/1M1ic-RhragAMaV8CoiuMrG445h6uPgj9/view?usp=sharing)
`eval_data_001_predictions_excel.csv`, `train_data_001_predictions_excel.csv` and `data_001_predictions_excel.csv` are intended for viewing in Excel, since they are `utf-8-sig` encoded, which Excel decodes correctly.
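For reference, the paired files differ only in their CSV encoding; here is a minimal pandas sketch of producing the Excel-friendly copy (filenames taken from the list above, the exact script used is not shown here):
```python
import pandas as pd

# Read the plain UTF-8 predictions file.
df = pd.read_csv("eval_data_001_predictions.csv", encoding="utf-8")

# "utf-8-sig" prepends a BOM, so Excel auto-detects the encoding and
# renders the Indic scripts correctly when the file is double-clicked.
df.to_csv("eval_data_001_predictions_excel.csv", index=False, encoding="utf-8-sig")
```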

This is a comprehensive README for introduction purposes; a detailed writeup will follow, containing detailed explanations and intuitive deep-dives.
I completed the task with three methods:
1. Agentic method (a single agent, run iteratively with a fixed chain of scripted user responses fed back to the model recursively, framing entity normalisation as part of a storyline in which the model had to transform the sentence given by the other character in the story)
- Only base models were used (since instruct models would do this by default)
- Performance was poor for sarvam-1-2b and Qwen-2.5-3b, with little scope for improvement
- unsloth's quantised and optimised Llama-3.1-8b-bnb-4bit was used; despite being a larger model, it needed less memory and inference time than the other two, thanks to unsloth's inference engine
- The Llama model performed best, and with some careful prompt engineering the outputs were quite good even in a single-prompt format
2. Training-based method (SFT with PEFT, using the unsloth and trl libraries)
- Llama-3.1-8b-bnb-4bit was used again, for its optimised training engine and higher performance
- Synthetic training data was generated with Gemini in Google AI Studio
- The model was obtained after various adjustments, optimisations, bug fixes and hyper-parameter tuning
- A total of 46 runs were made (40 minor runs, 1 crashed major run, 4 complete major runs, 1 final reproducibility run)
- The resulting model is highly performant on the metrics and datasets, though a number of drawbacks and shortcomings were found due to dataset issues (discussed later)
3. Algorithmic technique for entity normalisation
- Probably the most interesting of the three
- Highly performant: fast, almost always correct, deterministic, and easily improvable by adding more matching combinations
- Essentially regex and pattern matching on steroids (a minimal sketch follows below)
- Includes a language recogniser, a custom tokeniser, script recognition, and vowel/nasalised-consonant/other character detection
- Backed by extensive tables of month names, currency patterns, date logic, special-symbol logic and scientific-symbol data
- So thorough that it felt like complete overkill; it is a large enough piece of work that it needs to be described in detail in the writeup
- Reliable enough that it was also used as the final filtering step in the agentic method, to avoid that method's major shortcomings
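To give a flavour of that approach (the real implementation is far more extensive and will be covered in the writeup), here is a deliberately tiny detection sketch; the patterns and the lookup scope are illustrative assumptions, not the actual tables:
```python
import re

# Tiny illustrative patterns -- the real version has large per-language tables
# of month names, currency patterns, date logic and scientific-unit data.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
CURRENCY_RE = re.compile(r"[$₹]\s?\d+")
UNIT_RE = re.compile(r"\d+\s?(?:mg|kg|km/h|°C)")

def detect_entities(sentence: str) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs found in the sentence."""
    found = [("date", m.group(0)) for m in DATE_RE.finditer(sentence)]
    found += [("currency", m.group(0)) for m in CURRENCY_RE.finditer(sentence)]
    found += [("unit", m.group(0)) for m in UNIT_RE.finditer(sentence)]
    return found

print(detect_entities("15/03/1990 को, वैज्ञानिक ने $120 में 500mg यौगिक का एक नमूना खरीदा।"))
# [('date', '15/03/1990'), ('currency', '$120'), ('unit', '500mg')]
```
The real pipeline goes further and converts each match into its fully written-out form in the detected language.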
## Onto method 2 (method 1 and 3 will be described in the detailed writeup):
Model chosen: `sarvam_training_run_main_5`, at checkpoint step `20`.
Model Metrics at checkpoint:
- `train_loss`: 0.101
- `eval_loss`: 0.11551
- `cer`: 0.12292
- `wer`: 0.09581
- `bleu`: 0.87392
- `chrf`: 94.0154
- `chrf++`: 93.78756
- `custom_metric` (`squared_eval_to_train_loss_ratio` = `eval_loss`<sup>`2`</sup>/`train_loss`): 0.1312
This is a custom metric I like; I do not know whether it already exists elsewhere. It minimises both the eval loss and the ratio of eval loss to train loss (which signals overfitting).
It mostly tracks the best performance across metrics: when this metric is good, the other metrics listed above are usually at their best as well.
This was found to be consistent across the 46 training runs.
One major drawback is that it can go wrong on sudden spikes in the train loss.
An improvement is to use `eval_loss`<sub>`i`</sub><sup>`2`</sup> / min(`train_loss`<sub>`j`</sub>) for `j` ranging from `1` to `i`.
This is often a better estimate; here `eval_loss`<sub>`x`</sub> and `train_loss`<sub>`x`</sub> denote the respective losses at `step = x`.
An even better estimate ranges `j` from `max(0, i-k)` to `i`, where `k` is a hyper-parameter chosen by the user based on the volatility of the training run and the total number of steps.
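As a concrete illustration of the metric and its windowed variant (the function names here are mine, not from the training script):
```python
def custom_metric(eval_loss: float, train_loss: float) -> float:
    """squared_eval_to_train_loss_ratio = eval_loss**2 / train_loss."""
    return eval_loss ** 2 / train_loss

def windowed_custom_metric(eval_losses, train_losses, i, k=None):
    """Divide by the minimum train loss over a trailing window instead,
    which is more robust to sudden spikes in the train loss."""
    lo = 0 if k is None else max(0, i - k)
    return eval_losses[i] ** 2 / min(train_losses[lo : i + 1])

# Chosen checkpoint (step 20): eval_loss = 0.11551, train_loss = 0.101
print(custom_metric(0.11551, 0.101))  # ≈ 0.132 with these rounded losses;
                                      # small differences vs. 0.1312 are rounding
```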
Validation Plot for chosen model:

While the crashed run 3 was not replicated on further tries, the chosen brown run was easily recoverable, and with fewer epochs and faster learning-rate decay the runs proved to be more stable later on.
Model inference script can be found here: [Colab](https://colab.research.google.com/drive/1TLwc1UXJ1nIY0eVpd6u7nq-uXp4oAPZi?usp=sharing)
Dataset: generated synthetically using `gemini-2.0-flash-thinking-01-21`, in a single chat-like format; `1600` queries in total (`1185` train + `415` val).
Dataset link: [Hugging Face](https://huggingface.co/datasets/Tasmay-Tib/sarvam-entity-recognition-gemini-2.0-flash-thinking-01-21-distill-1600)
Queries were generated in a single chat to minimise repetition. Chat link: [Google AI Studio](https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221N42mDVHBEHtaKOgkIZS4LcNTLYTLbiaM%22%5D,%22action%22:%22open%22,%22userId%22:%22107745987607842002805%22,%22resourceKeys%22:%7B%7D%7D&usp=sharing)
Gemini thinking model chosen because:
- Better prompt adherence compared to most open-source models
- Extremely large context window, supporting very long query-generation chats (~`200k` tokens)
- Being a thinking model, its outputs tend to be better aligned with the user's instructions and higher in quality
- Thinking models have high output token limits (~`65,000` tokens), allowing larger batches of queries per generation
- Queries were generated in JSON format and later converted to a pandas DataFrame
- Gemini models tend to perform better at writing and multilingual tasks, having a broader vocabulary
- Gemini is free on Google AI Studio
The dataset contains various languages:
- `Hindi`
- `Tamil`
- `Telugu`
- `Kannada`
- `Malayalam`
- `Odia`
- `Bengali`
- `Gujarati`
- `Punjabi`
- `Marathi`
The dataset is also generated across a range of temperatures (for higher content diversity while maintaining instruction following). Temperatures: `0.1`, `0.4`, `0.7`, `1`.
The dataset also covers a range of domains:
- `Scientific`
- `Medical`
- `Financial`
- `Literature`
- `General`
- `Technical`
- `Academic`
- `News`
- `Legal`
- `Geography`
The dataset split is done in an approximate 3:1 ratio (3 for train, 1 for eval), such that the classes, and the sub-combinations of classes across the three categories (domain, language and temperature), remain balanced.
This means the train and val sets both contain nearly equal proportions of languages, domains and temperature ranges, and a similar distribution of each language's samples across temperature or domain (or any other combination of the three).
Likewise, the distribution over temperature for a given language and domain (or any other combination of the three) is similar between the train set and the val set.
The split is done such that for class combinations with fewer than 4 samples, at least one sample goes to each side; for the rest, counts are rounded to the nearest integer. Hence the slight discrepancy: 400 -> 415 and 1200 -> 1185.
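A rough sketch of the kind of stratified split described above (the column names, the 25% eval fraction and the exact rounding rule are assumptions, not the actual split code):
```python
import pandas as pd

def stratified_split(df: pd.DataFrame, eval_frac: float = 0.25, seed: int = 42):
    """Sample the eval set per (language, domain, temperature) group so that
    both sides keep the same class mix; tiny groups still give up one eval row."""
    eval_parts = []
    for _, group in df.groupby(["language", "domain", "temperature"]):
        n_eval = max(1, round(len(group) * eval_frac))
        eval_parts.append(group.sample(n=n_eval, random_state=seed))
    eval_df = pd.concat(eval_parts)
    train_df = df.drop(eval_df.index)
    return train_df, eval_df
```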
Problems identified in the dataset (for future work):
- Hallucinations and abrupt endings in long sentences (`200+` tokens)
- Problems with decimal numbers
- Issues with some very rare or unusual date formats
- Since it was fine-tuned with an instruction, it occasionally (though rarely) hallucinated new instructions and sentences after giving the output, instead of emitting an `EOS` token
- (Quirk: does not work on English sentences, since they were not part of the train set)
- Complex unit handling (hallucinations for rarer units)
- Wrong number understanding (occasionally slips to the nearest common number, say `55` for `54`)
Solution: make the dataset larger (~10k queries), with highly diverse scenarios and forms, rare number and unit occurrences, longer sentences, no instruction tuning, etc.
Not implemented due to shortage of time (though I am trying it now).
Prompt format for model training:
```txt
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
## Instruction:
Normalize entities in a given sentence, including dates (various formats), currencies (multiple symbols and notations), and scientific units (single and compound). Convert them into their full, standardized textual representations in the same language.
### Example Input:
15/03/1990 को, वैज्ञानिक ने $120 में 500mg यौगिक का एक नमूना खरीदा।
### Example Response:
पंद्रह मार्च उन्नीस सौ नब्बे को, वैज्ञानिक ने एक सौ बीस अमेरिकी डॉलर में पाँच सौ मिलीग्राम यौगिक का एक नमूना खरीदा।
Just as entities like dates, currencies, and scientific units have been normalized into simple terms, you must do the same. Do not leave any entity un-normalised.
## Input:
{}
## Response:
{}
```
Here the template string was formatted with the input sentence inserted and the response left blank for the model to generate. During training, the actual responses were placed in the response region, followed by an `EOS` token indicating the end of the sequence.
Use the same prompt format to convert the dataset into usable training data (inspired by the Alpaca prompt and dataset in unsloth's notebook).
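A short sketch of how that conversion typically looks with an Alpaca-style formatting function (reusing `data_prompt` and `tokenizer` from the inference script above; the split name is an assumption, so check the dataset card):
```python
from datasets import load_dataset

EOS_TOKEN = tokenizer.eos_token  # appended so the model learns when to stop

def formatting_prompts_func(examples):
    texts = []
    for inp, out in zip(examples["input"], examples["output"]):
        # Insert the sentence and its normalised target into the template,
        # then terminate with EOS so generation ends after the response.
        texts.append(data_prompt.format(inp, out) + EOS_TOKEN)
    return {"text": texts}

dataset = load_dataset(
    "Tasmay-Tib/sarvam-entity-recognition-gemini-2.0-flash-thinking-01-21-distill-1600",
    split="train",
)
dataset = dataset.map(formatting_prompts_func, batched=True)
```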
Prompt used for query gen:
```txt
Objective:
Generate high-quality synthetic training data to train a text normalization model for Indic languages. The model must normalize specific entities—dates, currencies, and scientific units—across multiple Indian languages while preserving all other words in the sentence unchanged.
Languages Covered:
Hindi
Tamil
Telugu
Kannada
Malayalam
Odia
Bengali
Gujarati
Punjabi
Marathi
Entity Normalization Requirements:
Dates:
Convert various date formats (e.g., "15/03/1990", "2nd Jan 2022", "March 15, 1990") into fully written-out formats in the target language.
Currencies:
Convert numerical currency values (e.g., "$120", "₹500", "CNY 2500", "700 yen") into their fully spelled-out equivalents in the target language. Convert all of the top used currencies. Atleast the top 10 currencies. These include USD, INR, EUR, JPY, GBP, AUD, CAD, CHF, CNH, HKD, NZD). Use different forms of each currency (with symbol, in short, in full, etc.) as part of the input.
Scientific Units:
Convert units (e.g., "500mg", "10 km/h", "20°C"), and all other type of units, seconds, distance, weight, temperature, velocity, etc. into fully written-out equivalents in the target language. Also in the inputs, use different unit types of each category say g, kg, mg, tonnes, lbs for weight.
Important: Only modify the specified entities (dates, currencies, scientific units). Do not add, remove, or change any other words in the sentence.
Sentence Diversity & Domain Distribution:
Sentence Types:
Include sentences with multiple entities, single entities, and sentences with no entities to maintain contextual diversity.
Domains:
Ensure an equal distribution of examples across these four domains:
News
Medical
Financial
Scientific
Legal
Academic
Literature
Technical
General (normal conversational)
Miscellaneous
Style:
Vary the tone and style (from formal to conversational) while maintaining correctness in all languages.
Incorporate real-world scenarios such as news articles, medical records, financial transactions, and scientific reports.
Data Volume & Output Format:
Volume:
Generate at least 400 sentences per language to ensure robust training (for initial tests 100 examples, at least 10 examples per language can be generated).
Output Format:
Each example must be provided in JSON format with the following keys:
"sl. no.": Serial number of the current example (integer number, e.g., 1 , 193, 1202, etc.)
"language": Name of the target language (e.g., "Hindi", "Tamil", etc.)
"input": The original sentence containing the entities in non-normalized form.
"output": The normalized sentence with dates, currencies, and scientific units fully written-out.
"domain": The domain of the sentence (e.g., "news", "medical", "financial", "scientific").
Example Format:
{
"sl. no.": 1,
"language": "Hindi",
"input": "15/03/1990 को, वैज्ञानिक ने $120 में 500mg यौगिक का एक नमूना खरीदा।",
"output": "पंद्रह मार्च उन्नीस सौ नब्बे को, वैज्ञानिक ने एक सौ बीस डॉलर में पाँच सौ मिलीग्राम यौगिक का एक नमूना खरीदा।",
"domain": "scientific"
}
Additional Instructions:
Linguistic Inclusivity:
Use standard written forms and be as inclusive as possible for each of the 10 Indic languages.
Do Not Overthink:
Generate a large number of diverse examples without overcomplicating the process.
No External Tools or Formats:
Do not use any canvas-based formats; provide the output solely in JSON.
Your task is to generate synthetic training examples that strictly adhere to the above guidelines. Do not repeat similar sentences. Generate different sentences, use different vocabulary, different set of words and different phrases. generate semantically different sentences as well, with different meanings. You may include multiple entities in single sentences and along with that you may include multi-sentence examples as well, entirely upto you. Now you may go on to generate the 100 initial examples.
```
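Since each batch comes back in the JSON format above, converting it to a pandas DataFrame (as mentioned earlier) is straightforward; a minimal sketch, assuming the raw generations were saved to a hypothetical local file `generated_batches.json`:
```python
import json

import pandas as pd

# Hypothetical file holding a JSON list of generated examples, each with the
# keys "sl. no.", "language", "input", "output" and "domain".
with open("generated_batches.json", encoding="utf-8") as f:
    examples = json.load(f)

df = pd.DataFrame(examples)
df = df.rename(columns={"sl. no.": "sl_no"}).sort_values("sl_no")
print(df[["language", "domain", "input", "output"]].head())
```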
A multi-step prompting process was used: a few hundred queries were generated at a time (400 for each temperature), and the model was continuously guided and checked to ensure output diversity, quality and consistency.
One of these prompts:
```txt
keep continuing, i am considering till 1062, continue from 1063 again, generate till 1200, remember to maintain a balance across language, style, category, types of entities, forms of entities etc. also remember to not generate similar examples.
```
Pros:
- The dataset turned out well, as per the instructions
- The instructions ensured coverage of a range of domains, languages, currencies, common formats, etc., so the data was not limited to a single class
- Several generation temperatures were used to ensure variance in the overall data distribution
- The instructions explicitly asked to maintain class balance and avoid redundancy and repetition, and a large-context model was used to ensure the examples stayed distinct
Drawbacks:
- The data generation style could have been more formal (low impact), since the examples turned out to be decent
- Aspects like complex sentences, longer sentences, decimal numbers, rarer numbers and rarer currencies should have been addressed in the prompt, for a better-quality and more diverse dataset
- A larger dataset should have been created (~10k samples at least)
## Reproducibility run:
A reproducibility run was performed successfully and easily, confirming the model's quality. A reproducibility run with a higher learning-rate decay would have been even more stable.
The best checkpoint for this run was again estimated to be around steps 18-20.
Reproduction notebook can be found here: [Colab](https://colab.research.google.com/drive/1oE_0v3zjO2z19hOktWle1pmAyCp_3qgx?usp=sharing)
WandB report: [Weights and Biases](https://api.wandb.ai/links/tasmaytibrewal-iit-kharagpur/ayu5wh5v)
Validation plot (with the selected run and the crashed run):

This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)