In this experiment, I trained a tokenizer that supports multiple Indian languages and merged it with the Llama-3 tokenizer.

STEP 1:

I sampled data from the multilingual (7 Indic languages) aloobun/dhpileIN dataset and trained a SentencePiece tokenizer on it.
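
A minimal sketch of this step is below. The streaming access pattern, the "text" column name, the sample size, and the training hyperparameters are my assumptions for illustration, not the exact values used:

```python
import sentencepiece as spm
from datasets import load_dataset

# Stream the dataset so the sampled corpus never has to fit in memory.
ds = load_dataset("aloobun/dhpileIN", split="train", streaming=True)

# SentencePiece trains from raw text files, so dump a sampled subset
# to disk, one line per record.
with open("indic_corpus.txt", "w", encoding="utf-8") as f:
    for i, row in enumerate(ds):
        if i >= 500_000:  # illustrative sample size
            break
        f.write(row["text"].strip() + "\n")

spm.SentencePieceTrainer.train(
    input="indic_corpus.txt",
    model_prefix="indic_sp",
    vocab_size=32000,           # illustrative
    model_type="bpe",
    character_coverage=0.9995,  # keep rare Indic codepoints in the vocab
)
```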

STEP 2:

I evaluated the tokenizer's performance on:

  • Unicode coverage.
  • Token distribution.
  • Tokenization complexity across different scripts.
  • Encoding and decoding capabilities.
  • Edge cases (e.g., special characters, numbers).

STEP 2.1:

The first test gives detailed results on the tokenizer's Unicode coverage, token distribution visualization, and tokenization complexity across scripts.
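
A rough sketch of the Unicode-coverage part of that test, assuming the trained model was saved as indic_sp.model (the ranges are the standard Unicode blocks for the seven scripts):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="indic_sp.model")

SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),  # Hindi
    "Bengali":    (0x0980, 0x09FF),
    "Gujarati":   (0x0A80, 0x0AFF),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
}

for script, (lo, hi) in SCRIPT_RANGES.items():
    chars = [chr(cp) for cp in range(lo, hi + 1)]
    # Treat a codepoint as covered if encoding it never falls back to <unk>.
    covered = sum(1 for c in chars if sp.unk_id() not in sp.encode(c))
    print(f"{script}: {covered}/{len(chars)} codepoints covered")
```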

STEP 2.2:

The second script tests the encoding and decoding capabilities.
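
A minimal sketch of the round-trip test, again assuming indic_sp.model; only the Bengali and Hindi samples are shown, reconstructed from the token strings below:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="indic_sp.model")

samples = {
    "Bengali": "আমি বাংলাদেশ থেকে এসেছি। কলকাতা একটি সুন্দর শহর।",
    "Hindi": "नमस्ते, मैं भारत से हूँ। दिल्ली बहुत बड़ा शहर है।",
}

for lang, text in samples.items():
    ids = sp.encode(text)
    pieces = sp.encode(text, out_type=str)
    print(f"{lang} Analysis:")
    print(f"Original Text Length: {len(text)} characters")
    print(f"Token IDs Count: {len(ids)}")
    print(f"Token Strings: {pieces}")
    # Round-trip: decoding the ids must reproduce the input exactly.
    print(f"Text Reconstruction: {sp.decode(ids) == text}\n")
```

Example output: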

Bengali Analysis:
Original Text Length: 48 characters
Token IDs Count: 11
Token Strings: ['▁আমি', '▁বাংলাদেশ', '▁থেকে', '▁এসে', 'ছি', '।', '▁কলকাতা', '▁একটি', '▁সুন্দর', '▁শহর', '।']
Text Reconstruction: True

Hindi Analysis:
Original Text Length: 49 characters
Token IDs Count: 15
Token Strings: ['▁नम', 'स्ते', ',', '▁मैं', '▁भारत', '▁से', '▁हू', 'ँ', '।', '▁दिल्ली', '▁बहुत', '▁बड़ा', '▁शहर', '▁है', '।']
Text Reconstruction: True

Kannada Analysis:
Original Text Length: 53 characters
Token IDs Count: 13
Token Strings: ['▁ನಾನು', '▁ಬೆಂಗಳೂರಿ', 'ನಿಂದ', '▁ಬಂದ', 'ಿದ್ದೇನೆ', '।', '▁ಕನ್ನಡ', '▁ಒಂದು', '▁ಸೋ', 'ಂಪ', 'ಿನ', '▁ಭಾಷೆ', '।']
Text Reconstruction: True

Malayalam Analysis:
Original Text Length: 47 characters
Token IDs Count: 15
Token Strings: ['▁ഞ', 'ാ', 'ൻ', '▁കേരള', 'ത്തി', 'ൽ', '▁നിന്നാണ്', '.', '▁കൊച്ചി', '▁ഒരു', '▁സുന്ദ', 'ര', '▁നഗ', 'രം', '.']
Text Reconstruction: True

Telugu Analysis:
Original Text Length: 53 characters
Token IDs Count: 10
Token Strings: ['▁నేను', '▁తెలంగాణ', '▁నుంచి', '▁వచ్చ', 'ాను', '.', '▁హైదరాబాద్', '▁అద్భుతమైన', '▁నగరం', '.']
Text Reconstruction: True

Tamil Analysis:
Original Text Length: 54 characters
Token IDs Count: 13
Token Strings: ['▁நான்', '▁தமிழ்நா', 'ட்டை', 'ச்', '▁சேர்ந்த', 'வன்', '.', '▁சென்னை', '▁ஒரு', '▁பெரிய', '▁நக', 'ரம்', '.']
Text Reconstruction: True

Gujarati Analysis:
Original Text Length: 50 characters
Token IDs Count: 12
Token Strings: ['▁હું', '▁ગુજરાત', '▁થી', '▁આવ્યો', '▁છું', '।', '▁અમદાવાદ', '▁એક', '▁સુંદર', '▁શહેર', '▁છે', '।']
Text Reconstruction: True

STEP 3:

This script merges the trained SentencePiece vocabulary into the Llama-3 tokenizer, extending it with the new Indic tokens.

The script ensures (sketched below):

  • No duplicate tokens are added.
  • Tokens aren't excessively long.
  • New tokens are correctly integrated.
  • Token ID mappings stay consistent.
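
A rough sketch of that merge loop, using transformers' add_tokens. The model id, the length cap, and the "▁"-to-space conversion are assumptions; the real script may integrate the vocabularies differently:

```python
import sentencepiece as spm
from transformers import AutoTokenizer

llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
sp = spm.SentencePieceProcessor(model_file="indic_sp.model")

existing = set(llama_tok.get_vocab())
new_tokens = []
for i in range(sp.get_piece_size()):
    if sp.is_unknown(i) or sp.is_control(i):
        continue  # skip <unk>, <s>, </s>
    token = sp.id_to_piece(i).replace("▁", " ")  # word-boundary marker
    if token in existing:
        continue  # no duplicate tokens
    if len(token) > 16:
        continue  # illustrative length cap
    new_tokens.append(token)

added = llama_tok.add_tokens(new_tokens)
print(f"Added {added} new tokens")
llama_tok.save_pretrained("IN-Llama-3-Tokenizer")
```

Note that pairing the merged tokenizer with the Llama-3 model also requires resizing the embedding matrix, e.g. model.resize_token_embeddings(len(llama_tok)).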

I feel there is some unnecessary bloat in the script, such as extra token validation and redundant test methods. I'm still working on improvements and will update as soon as I make progress.

Here's a comparison of subword fertility scores (average tokens per word; lower is better) between sarvam-1 and this tokenizer.

Language    sarvam-1    IN-Llama-3-Tokenizer
Bengali     1.7         3.52
Gujarati    2.784313    3.588235
Hindi       1.583333    2.933333
Kannada     2.571428    3.976190
Malayalam   3.487804    4.365853
Tamil       2.767441    3.860465
Telugu      2.372093    3.511627
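
A minimal sketch of how these scores can be computed; the evaluation sentences and the tokenizer path are illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("IN-Llama-3-Tokenizer")

def fertility(sentences):
    """Average number of tokens per whitespace-separated word."""
    words = sum(len(s.split()) for s in sentences)
    tokens = sum(len(tok.tokenize(s)) for s in sentences)
    return tokens / words

hindi_sample = ["नमस्ते, मैं भारत से हूँ।", "दिल्ली बहुत बड़ा शहर है।"]
print(f"Hindi fertility: {fertility(hindi_sample):.6f}")
```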