# Byte Fallback BPE Tokenizer

- Trained using huggingface/tokenizers
- Vocab size: 72,000

## Training Args
### gpt-4-turbo regex (you can use your own, but this works fine)

```python
pat_str = "|".join(
    [
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""\p{N}{1,3}""",
        r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
        r"""\s*[\r\n]+""",
        r"""\s+(?!\S)""",
        r"""\s+""",
    ]
)
```
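This pattern relies on `\p{…}` Unicode property classes, so it needs the third-party `regex` package rather than the stdlib `re`. A quick sanity check (the sample text below is illustrative): the alternatives cover letters, digit runs, punctuation, and whitespace, so the matches should tile the input exactly.

```python
import regex  # third-party `regex` module; stdlib `re` has no \p{...} classes

pat_str = "|".join(
    [
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""\p{N}{1,3}""",
        r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
        r"""\s*[\r\n]+""",
        r"""\s+(?!\S)""",
        r"""\s+""",
    ]
)

text = "Bangalore won 2024 matches, आर अश्विन!"
pieces = regex.findall(pat_str, text)

# The alternatives cover every character class, so the pieces tile the input:
assert "".join(pieces) == text
# Digit runs are capped at 3 digits per piece, so "2024" splits as "202" + "4":
assert "202" in pieces and "4" in pieces
```

Capping digit runs at three (`\p{N}{1,3}`) keeps long numbers from becoming single rare tokens, which tends to help arithmetic-heavy corpora.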
```python
from tokenizers import Tokenizer, Regex, decoders, models, normalizers, pre_tokenizers

# Initialize tokenizer with byte fallback (no UNK token needed)
tokenizer = Tokenizer(models.BPE(
    byte_fallback=True,
    unk_token=None,
    fuse_unk=False
))

# Pre-tokenizer for multilingual support
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(
        pattern=Regex(pat_str),
        behavior="isolated",
        invert=False
    ),
    pre_tokenizers.ByteLevel(
        add_prefix_space=False,
        trim_offsets=True,
        use_regex=False  # the Split above already applied our regex
    )
])

# Normalizer (modified for Indic languages)
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),  # Safer than NFKC for Indic scripts
])

# Decoder must mirror the ByteLevel pre-tokenizer settings
tokenizer.decoder = decoders.ByteLevel(
    add_prefix_space=False  # Must match pre-tokenizer settings
)
```
```python
from tokenizers import trainers

# Optimized trainer configuration
trainer = trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE,            # 72000
    special_tokens=SPECIAL_TOKENS,
    min_frequency=1,                  # low threshold helps low-resource languages
    show_progress=True,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    max_token_length=24,
    continuing_subword_prefix=""
)
```
```python
from datasets import load_dataset

def get_corpus():
    # Load and shuffle the full dataset, keep non-empty rows,
    # and repeat 3x (matching the "* 3" in the training composition)
    dataset = load_dataset(DATASET_NAME, split="train")
    shuffled = dataset.shuffle(seed=42)
    return [text for text in shuffled[TEXT_COLUMN] if text.strip()] * 3
```
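Putting the pieces together, training plus a round-trip check looks roughly like this. `VOCAB_SIZE`, the tiny corpus, and the simplified ByteLevel-only pre-tokenizer are illustrative stand-ins, not the exact production run:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

VOCAB_SIZE = 1000  # stand-in; the real run used 72000
SPECIAL_TOKENS = ["<|begin_of_text|>", "<|end_of_text|>", "<|pad|>"]

tokenizer = Tokenizer(models.BPE(byte_fallback=True, unk_token=None, fuse_unk=False))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE,
    special_tokens=SPECIAL_TOKENS,
    min_frequency=1,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    show_progress=False,
)

# Stand-in corpus; the real run streams get_corpus()
corpus = ["Bangalore and Chennai have faced each other in 33 matches."] * 10
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Byte-level round-trip should be lossless
enc = tokenizer.encode("Chennai have won 21 matches")
assert tokenizer.decode(enc.ids) == "Chennai have won 21 matches"
```

Saving with `tokenizer.save("tokenizer.json")` then produces the single-file artifact that `tokenizers` and `transformers` can both load.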
## Special Tokens

```python
{'bos_token': '<|begin_of_text|>', 'eos_token': '<|end_of_text|>', 'pad_token': '<|pad|>'}
```
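These same strings are passed to the trainer as `special_tokens`, which places them at the first vocabulary ids in the order given. A small sketch (toy corpus and vocab size are illustrative):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

SPECIAL_TOKENS = ["<|begin_of_text|>", "<|end_of_text|>", "<|pad|>"]

tok = Tokenizer(models.BPE(byte_fallback=True, unk_token=None, fuse_unk=False))
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=300,  # toy size for the sketch
    special_tokens=SPECIAL_TOKENS,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    show_progress=False,
)
tok.train_from_iterator(["hello world"] * 5, trainer=trainer)

# Special tokens occupy the first ids, in the order they were given
assert tok.token_to_id("<|begin_of_text|>") == 0
assert tok.token_to_id("<|pad|>") == 2
```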
## Training Composition

| Domain | Size | Dataset(s) |
|---|---|---|
| Maths | 550M × 3 | `aluncstokes/mathpile_arxiv_subset` |
| Code | 800M × 3 | `codeparrot/github-code` |
| Hinglish | 250M × 3 | `Abhishekcr448/Hinglish-Everyday-Conversations-1M`, `Maihar/hinglish-80k` |
| English | 2,000M × 3 | `allenai/c4` (config `"en"`) |
| Hindi | 2,200M × 3 | `aloobun/dhpileIN` (`data_dir='hi'`) |
## Evals

### Tokenization Efficiency (lower is better)

# | Tokenizer | English | Hindi | Tamil | Bengali | Malayalam | Telugu | Gujarati | Punjabi | Code_Python | Code_Java | C++ | Math |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | deepseek-ai/DeepSeek-R1 (128k) | 338874 | 22855 | 48957 | 39617 | 73928 | 40345 | 101020 | 79172 | 5231 | 2224 | 7055 | 5376 |
1 | unsloth/phi-4 (100k) | 308645 | 40456 | 59750 | 116122 | 149889 | 48689 | 118335 | 87413 | 4809 | 2110 | 6529 | 5573 |
2 | deepseek-ai/DeepSeek-R1-Distill-Llama-8B (128k) | 308512 | 21110 | 59625 | 115138 | 149883 | 48661 | 118061 | 86765 | 4809 | 2111 | 6530 | 5574 |
3 | unsloth/gemma-2-9b-it(256k) | 323335 | 15916 | 53913 | 53402 | 57219 | 47610 | 107925 | 87222 | 5948 | 2569 | 8639 | 5871 |
4 | Ornaments/72k-Bilingual-BBPE-TK-SPM (72k) (Old) | 366710 | 11447 | 61408 | 94191 | 97207 | 50229 | 117874 | 90045 | 8201 | 4000 | 13706 | 5585 |
5 | Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity (72k) | 330830 | 10318 | 59089 | 93740 | 92655 | 44975 | 109411 | 87922 | 7819 | 3743 | 12953 | 5253 |
**6** | **Ornaments/72k-TK-BBPE-HF (72k)** | 321274 | 10813 | 67585 | 159985 | 193813 | 55654 | 134397 | 97063 | 5225 | 2263 | 7090 | 5150 |
7 | nvidia/Nemotron-4-Mini-Hindi-4B-Instruct (256k) | 332271 | 14327 | 55473 | 36615 | 45783 | 48270 | 160115 | 117174 | 6186 | 2732 | 8861 | 6136 |
8 | sarvamai/OpenHathi-7B-Hi-v0.1-Base (48k) | 370133 | 15633 | 67845 | 120340 | 105953 | 68315 | 159122 | 113817 | 6595 | 2792 | 9233 | 6223 |
9 | sarvamai/sarvam-1 (68k) | 385386 | 11257 | 61396 | 27348 | 31822 | 51463 | 119666 | 103344 | 7331 | 3068 | 9724 | 6864 |
## Encode-Decode
- Hindi
Input : ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
Tokens: ['à¤ĭ', 'त', 'à¥ģà¤°à¤¾à¤ľ', 'Ġà¤Ĺायà¤ķ', 'वाड़', 'Ġ(', 'à¤ķपà¥įतान', '),', 'Ġडà¥ĩ', 'व', 'à¥ĭन', 'Ġà¤ķà¥īन', 'वà¥ĩ', ',', 'Ġरà¤ļ', 'िन', 'Ġरविà¤Ĥदà¥įर', ',', 'Ġराहà¥ģल', 'Ġतà¥įरिप', 'à¤¾à¤łà¥Ģ', ',', 'Ġशिवम', 'Ġदà¥ģबà¥ĩ', ',', 'Ġरविà¤Ĥदà¥įर', 'Ġà¤ľà¤¡à¥ĩà¤ľà¤¾', ',', 'Ġà¤ıमà¤ıस', 'Ġधà¥ĭनà¥Ģ', 'Ġ(', 'व', 'िà¤ķà¥ĩà¤Ł', 'à¤ķà¥Ģपर', '),', 'Ġà¤Ĩर', 'Ġà¤ħशà¥įविन', ',', 'Ġमà¥Ģ', 'थ', 'ाशा', 'Ġपथ', 'िर', 'ाना', ',', 'Ġà¤ĸलà¥Ģल', 'Ġà¤ħहमद', ',', 'Ġनà¥Ĥर', 'Ġà¤ħहमद', '।']
Encoded: [38659, 299, 21358, 15506, 7249, 509, 28249, 1222, 2308, 357, 1731, 8940, 2506, 14, 17890, 504, 19058, 14, 4384, 9183, 7568, 14, 18827, 13293, 14, 19058, 13516, 14, 17978, 12756, 509, 357, 3072, 14080, 1222, 2215, 17009, 14, 7584, 942, 22395, 11558, 647, 901, 14, 39383, 6593, 14, 25750, 6593, 337]
Len Tokens 51
Decoded: ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
- English
Input : Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
Tokens: ['B', 'ang', 'alore', 'Ġand', 'ĠChennai', 'Ġhave', 'Ġfaced', 'Ġeach', 'Ġother', 'Ġin', 'Ġ', '33', 'Ġmatches', 'Ġin', 'ĠIPL', '.', 'ĠOut', 'Ġof', 'Ġthese', 'Ġ', '33', 'Ġgames', ',', 'ĠBangalore', 'Ġhave', 'Ġwon', 'Ġ', '11', 'Ġwhereas', 'ĠChennai', 'Ġhave', 'Ġcome', 'Ġout', 'Ġvict', 'orious', 'Ġon', 'Ġ', '21', 'Ġoccasion', '.', 'Ġ', '1', 'Ġmatch', 'Ġended', 'Ġwithout', 'Ġa', 'Ġresult', '.']
Encoded: [36, 951, 30658, 364, 45274, 688, 20861, 1993, 1101, 360, 223, 3276, 15006, 360, 11519, 16, 7921, 368, 1576, 223, 3276, 5013, 14, 45076, 688, 4896, 223, 1281, 21170, 45274, 688, 3051, 892, 9592, 29166, 462, 223, 2428, 13344, 16, 223, 19, 5359, 12784, 2752, 284, 1899, 16]
Len Tokens 48
Decoded: Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
- Math
Input:

```latex
% Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi
```
Tokens: ['%', 'ĠChange', 'Ġthe', 'Ġfont', 'Ġif', 'Ġyou', 'Ġwant', 'Ġto', ',', 'Ġdepending', 'Ġon', 'Ġwhether', 'Ċ', '%', "Ġyou're", 'Ġusing', 'Ġpd', 'fl', 'ate', 'x', 'Ġor', 'Ġx', 'el', 'ate', 'x', '/l', 'ual', 'ate', 'x', 'Ċ', '%', 'ĠWH', 'EN', 'ĠCOMP', 'IL', 'ING', 'ĠWITH', 'ĠX', 'EL', 'ATE', 'X', 'ĠPLEASE', 'ĠUSE', 'Ċ', '%', 'Ġx', 'el', 'ate', 'x', 'Ġ-', 'shell', '-', 'escape', 'Ġ-', 'output', '-d', 'river', '="', 'xd', 'v', 'ip', 'df', 'mx', 'Ġ-', 'z', 'Ġ', '0', '"', 'Ġsample', '.', 'tex', 'Ċ', '\\', 'ift', 'utex', 'Ċ', 'Ġ', 'Ġ%', 'ĠIf', 'Ġusing', 'Ġx', 'el', 'ate', 'x', 'Ġor', 'Ġl', 'ual', 'ate', 'x', ':Ċ', 'Ġ', 'Ġ\\', 'set', 'main', 'font', '{R', 'ob', 'oto', 'ĠSl', 'ab', '}Ċ', 'Ġ', 'Ġ\\', 'sets', 'ans', 'font', '{L', 'ato', '}Ċ', 'Ġ', 'Ġ\\', 'renewcommand', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}Ċ', '\\', 'else', 'Ċ', 'Ġ', 'Ġ%', 'ĠIf', 'Ġusing', 'Ġpd', 'fl', 'ate', 'x', ':Ċ', 'Ġ', 'Ġ\\', 'us', 'ep', 'ackage', '[', 'rm', ']{', 'rob', 'oto', '}Ċ', 'Ġ', 'Ġ\\', 'us', 'ep', 'ackage', '[', 'defaults', 'ans', ']{', 'l', 'ato', '}Ċ', 'Ġ', 'Ġ%', 'Ġ\\', 'us', 'ep', 'ackage', '{s', 'ources', 'ans', 'pro', '}Ċ', 'Ġ', 'Ġ\\', 'renewcommand', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}Ċ', '\\fi', 'Ċ']
Encoded: [7, 20642, 307, 10013, 803, 449, 1654, 349, 14, 11248, 462, 4806, 201, 7, 7412, 2233, 34245, 6404, 520, 90, 578, 2163, 395, 520, 90, 21145, 1316, 520, 90, 201, 7, 20360, 2167, 49037, 5195, 4249, 25624, 2712, 7413, 7119, 58, 65107, 22822, 201, 7, 2163, 395, 520, 90, 904, 47931, 15, 38885, 904, 9854, 3209, 11707, 772, 27503, 88, 1056, 8772, 44531, 904, 92, 223, 18, 4, 10164, 16, 8774, 201, 62, 3113, 17783, 201, 223, 3259, 1783, 2233, 2163, 395, 520, 90, 578, 390, 1316, 520, 90, 1215, 223, 514, 1292, 7517, 5685, 4020, 1216, 6289, 11833, 483, 612, 223, 514, 8645, 820, 5685, 6459, 10542, 612, 223, 514, 67762, 676, 34277, 7107, 4403, 5765, 7107, 612, 62, 7583, 201, 223, 3259, 1783, 2233, 34245, 6404, 520, 90, 1215, 223, 514, 447, 1057, 14270, 61, 1876, 6592, 20636, 6289, 612, 223, 514, 447, 1057, 14270, 61, 71659, 820, 6592, 78, 10542, 612, 223, 3259, 514, 447, 1057, 14270, 6170, 4113, 820, 1387, 612, 223, 514, 67762, 676, 34277, 7107, 4403, 5765, 7107, 612, 68146, 201]
Len Tokens 177
Decoded:

```latex
% Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi
```
- Code
Input:

```python
class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
```
Tokens: ['class', 'ĠSentence', 'P', 'iece', 'Un', 'ig', 'ram', 'Token', 'izer', '(Base', 'Token', 'izer', '):Ċ', 'ĠĠĠ', 'Ġ"""', 'Sentence', 'P', 'iece', 'ĠUn', 'ig', 'ram', 'ĠToken', 'izer', 'ĊĊ', 'ĠĠĠ', 'ĠRep', 'resents', 'Ġthe', 'ĠUn', 'ig', 'ram', 'Ġalgorithm', ',', 'Ġwith', 'Ġthe', 'Ġpret', 'oken', 'ization', 'Ġused', 'Ġby', 'ĠSentence', 'P', 'iece', 'Ċ', 'ĠĠĠ', 'Ġ"""ĊĊ', 'ĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__', '(Ċ', 'ĠĠĠĠĠĠĠ', 'Ġself', ',Ċ', 'ĠĠĠĠĠĠĠ', 'Ġvoc', 'ab', ':', 'ĠOptional', '[', 'List', '[T', 'uple', '[str', ',', 'Ġfloat', ']]', ']', 'Ġ=', 'ĠNone', ',Ċ', 'ĠĠĠĠĠĠĠ', 'Ġreplacement', ':', 'Ġstr', 'Ġ=', 'Ġ"', 'âĸ', 'ģ', '",Ċ', 'ĠĠĠĠĠĠĠ', 'Ġadd', '_prefix', '_space', ':', 'Ġbool', 'Ġ=', 'ĠTrue', ',Ċ', 'ĠĠĠ', 'Ġ):Ċ', 'ĠĠĠĠĠĠĠ', 'Ġif', 'Ġvoc', 'ab', 'Ġis', 'Ġnot', 'ĠNone', ':Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġ#', 'ĠLet', 'ĠUn', 'ig', 'ram', '(', '..', ')', 'Ġfail', 'Ġif', 'Ġonly', 'Ġone', 'Ġof', 'Ġthem', 'Ġis', 'ĠNone', 'Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġtoken', 'izer', 'Ġ=', 'ĠToken', 'izer', '(', 'Un', 'ig', 'ram', '(v', 'oc', 'ab', '))Ċ', 'ĠĠĠĠĠĠĠ', 'Ġelse', ':Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġtoken', 'izer', 'Ġ=', 'ĠToken', 'izer', '(', 'Un', 'ig', 'ram', '())']
Encoded: [2805, 45192, 50, 15717, 5091, 436, 1293, 13735, 7625, 62228, 13735, 7625, 2818, 413, 8533, 63588, 50, 15717, 2644, 436, 1293, 37433, 7625, 1025, 413, 4954, 13180, 307, 2644, 436, 1293, 11436, 14, 505, 307, 5992, 6907, 2920, 1909, 679, 45192, 50, 15717, 201, 413, 25641, 413, 1333, 4304, 3747, 1614, 3873, 545, 1572, 740, 545, 25497, 483, 28, 22800, 61, 3754, 42378, 15732, 27446, 14, 10809, 17233, 63, 532, 5200, 740, 545, 13804, 28, 2030, 532, 698, 27234, 226, 3288, 545, 1290, 31498, 34542, 28, 7817, 532, 9402, 740, 413, 42359, 545, 803, 25497, 483, 429, 696, 5200, 1215, 829, 1769, 4983, 2644, 436, 1293, 10, 879, 11, 6312, 803, 1407, 963, 368, 1212, 429, 5200, 201, 829, 15025, 7625, 532, 37433, 7625, 10, 5091, 436, 1293, 8425, 1287, 483, 4095, 545, 2589, 1215, 829, 15025, 7625, 532, 37433, 7625, 10, 5091, 436, 1293, 9066]
Len Tokens 146
Decoded:

```python
class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
```
- Emoji
Input : 😜🫤☹️😖🤢🤮😇🐻❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑🦲👨🚒👨🚀
Tokens: ['ðŁ', 'ĺ', 'ľ', 'ðŁ', '«', '¤', 'â', 'ĺ', '¹', 'ï¸ı', 'ðŁ', 'ĺ', 'ĸ', 'ðŁ', '¤¢', 'ðŁ', '¤®', 'ðŁ', 'ĺ', 'ĩ', 'ðŁ', 'IJ', '»', 'âĢ', 'į', 'â', 'Ŀ', 'Ħ', 'ï¸ı', 'ðŁ', '¦', 'Ħ', 'ðŁ', 'IJ', '¾', 'ðŁ', 'IJ', '½', 'ðŁ', 'IJ', 'į', 'ðŁ', '¦', 'ŀ', 'ðŁ', '¦', 'IJ', 'ðŁ', '¦', '¿', 'ðŁ', '¤', '´', 'ðŁ', '§', 'ij', 'âĢ', 'į', 'ðŁ', '¦', '²', 'ðŁ', 'ij', '¨', 'âĢ', 'į', 'ðŁ', 'ļ', 'Ĵ', 'ðŁ', 'ij', '¨', 'âĢ', 'į', 'ðŁ', 'ļ', 'Ģ']
Encoded: [17635, 249, 253, 17635, 107, 100, 161, 249, 120, 67378, 17635, 249, 247, 17635, 6387, 17635, 326, 17635, 249, 232, 17635, 241, 122, 461, 238, 161, 254, 229, 67378, 17635, 102, 229, 17635, 241, 125, 17635, 241, 124, 17635, 241, 238, 17635, 102, 255, 17635, 102, 241, 17635, 102, 126, 17635, 100, 115, 17635, 103, 242, 461, 238, 17635, 102, 113, 17635, 242, 104, 461, 238, 17635, 251, 243, 17635, 242, 104, 461, 238, 17635, 251, 225]
Len Tokens 77
Decoded: 😜🫤☹️😖🤢🤮😇🐻❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑🦲👨🚒👨🚀
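The emoji survive because of the byte-level alphabet: every UTF-8 byte has its own token, so even symbols never seen during training round-trip losslessly (they just cost one token per byte). A small sketch with a toy tokenizer trained on ASCII only (corpus and vocab size are illustrative):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tok = Tokenizer(models.BPE(byte_fallback=True, unk_token=None, fuse_unk=False))
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=300,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # all 256 bytes
    show_progress=False,
)
tok.train_from_iterator(["plain ascii text only"] * 5, trainer=trainer)

enc = tok.encode("😜🦄")  # never seen in training
assert tok.decode(enc.ids) == "😜🦄"  # lossless round-trip via byte tokens
assert len(enc.ids) == 8             # 4 UTF-8 bytes per emoji, no merges learned
```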
- Sanskrit
Input : ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ.
Tokens: ['à¥', 'IJ', 'Ġतà¥įर', 'à¥įयम', 'à¥įब', 'à¤ķà¤Ĥ', 'Ġय', 'à¤ľà¤¾à¤®', 'हà¥ĩ', 'Ġसà¥ģ', 'à¤Ĺन', 'à¥įध', 'िà¤Ĥ', 'Ġपà¥ģषà¥įà¤Ł', 'िव', 'रà¥įध', 'नम', 'à¥į', 'Ġà¤īरà¥įव', 'ार', 'à¥ģà¤ķ', 'म', 'िव', 'Ġबन', 'à¥įध', 'नान', 'à¥įम', 'à¥ĥतà¥įय', 'à¥ĭ', 'रà¥įम', 'à¥ģ', 'à¤ķà¥įष', 'à¥Ģय', 'Ġमाम', 'à¥ĥत', 'ात', 'à¥į', 'Ġà¥IJ', '.', 'Ġ']
Encoded: [261, 241, 5148, 1385, 2474, 69046, 452, 13431, 24956, 1196, 9464, 1074, 571, 56898, 616, 3985, 12134, 270, 19111, 315, 704, 327, 616, 854, 1074, 54741, 632, 15421, 282, 760, 304, 625, 2095, 1061, 1583, 471, 270, 21199, 16, 223]
Len Tokens 40
Decoded: ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ.