Byte Fallback BPE Tokenizer

  • Trained using huggingface/tokenizers
  • Vocab size: 72,000

Training Args



# Imports used throughout the training script below
from tokenizers import Tokenizer, Regex, models, pre_tokenizers, normalizers, trainers, decoders
from datasets import load_dataset

# GPT-4o-style (tiktoken o200k_base) pre-tokenization regex
# (you can use your own, but this one works well)
pat_str = "|".join(
    [
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""\p{N}{1,3}""",
        r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
        r"""\s*[\r\n]+""",
        r"""\s+(?!\S)""",
        r"""\s+""",
    ]
)


# Initialize a byte-fallback BPE model (no UNK token required)
tokenizer = Tokenizer(models.BPE(
    byte_fallback=True,
    unk_token=None,
    fuse_unk=False
))

# Pre-tokenizer: regex split, then byte-level mapping, for multilingual support
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(
        pattern=Regex(pat_str),
        behavior="isolated",
        invert=False
    ),
    pre_tokenizers.ByteLevel(
        add_prefix_space=False,
        trim_offsets=True,
        use_regex=False
    )
])
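
To sanity-check the split behavior before training, the pre-tokenizer can be previewed directly; a minimal sketch (the sample string is arbitrary):

# Inspect how mixed-script text is segmented before any BPE merges apply;
# pre_tokenize_str returns (piece, (start, end)) pairs.
print(tokenizer.pre_tokenizer.pre_tokenize_str("IPL 2024: धोनी scored 33 runs!"))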

# Normalizer (kept minimal for Indic scripts)
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),  # Safer than NFKC for Indic scripts
])
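
NFC matters here because NFKC additionally applies compatibility folding, which rewrites characters irreversibly, while NFC only performs canonical (re)composition. A small illustration with Python's standard unicodedata (the sample characters are arbitrary):

import unicodedata

s = "ﬁ ①"  # ligature 'fi' and circled digit one
print(unicodedata.normalize("NFC", s))   # unchanged: only canonical composition
print(unicodedata.normalize("NFKC", s))  # 'fi 1': lossy compatibility folding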

# Decoder must mirror the ByteLevel pre-tokenizer settings
tokenizer.decoder = decoders.ByteLevel(
    add_prefix_space=False  # must match the pre-tokenizer
)



# Trainer configuration
VOCAB_SIZE = 72_000  # matches the 72k vocab above
SPECIAL_TOKENS = ["<|begin_of_text|>", "<|end_of_text|>", "<|pad|>"]
trainer = trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE,
    special_tokens=SPECIAL_TOKENS,
    min_frequency=1,  # Lower frequency for low-resource languages
    show_progress=True,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    max_token_length=24,
    continuing_subword_prefix=""
)

def get_corpus():
    # Load the full dataset, shuffle, and drop empty rows; the corpus is
    # repeated 3x (the "× 3" in the composition below). DATASET_NAME and
    # TEXT_COLUMN are set per source corpus.
    dataset = load_dataset(DATASET_NAME, split="train")
    shuffled = dataset.shuffle(seed=42)
    return [text for text in shuffled[TEXT_COLUMN] if text.strip()] * 3
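
With the trainer and corpus in place, training and serialization follow the standard tokenizers flow; a minimal sketch (the output filename is arbitrary):

# Train from the in-memory corpus and write a single-file tokenizer
tokenizer.train_from_iterator(get_corpus(), trainer=trainer)
tokenizer.save("tokenizer.json")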

Special Tokens

{'bos_token': '<|begin_of_text|>', 'eos_token': '<|end_of_text|>', 'pad_token': '<|pad|>'}
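
For use with transformers, the trained file can be wrapped in a PreTrainedTokenizerFast that carries these special tokens; a sketch, assuming the tokenizer.json saved above:

from transformers import PreTrainedTokenizerFast

# Attach bos/eos/pad when wrapping the raw tokenizer for transformers
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    bos_token="<|begin_of_text|>",
    eos_token="<|end_of_text|>",
    pad_token="<|pad|>",
)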

Training Composition (see the sampling sketch after this list):

  • Maths: 550 M × 3 (aluncstokes/mathpile_arxiv_subset)

  • Code: 800 M × 3 (codeparrot/github-code)

  • Hinglish: 250 M × 3 (Abhishekcr448/Hinglish-Everyday-Conversations-1M, Maihar/hinglish-80k)

  • English: 2 000 M × 3 ("allenai/c4", "en")

  • Hindi: 2 200 M × 3 (aloobun/dhpileIN, data_dir='hi')
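
A sketch of how such a mixture could be sampled with datasets, abridged to two of the five sources; interleave_datasets, the column names, and the probabilities (proportional to the 2 000 M / 2 200 M shares above) are illustrative, not the exact recipe used:

from datasets import load_dataset, interleave_datasets

# Stream two sources and sample in proportion to their share of the mix;
# the "text" column name for dhpileIN is an assumption about its schema.
english = load_dataset("allenai/c4", "en", split="train", streaming=True)
hindi = load_dataset("aloobun/dhpileIN", data_dir="hi", split="train", streaming=True)

mix = interleave_datasets(
    [english.select_columns(["text"]), hindi.select_columns(["text"])],
    probabilities=[0.476, 0.524],  # 2000/4200 and 2200/4200
    seed=42,
)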

Evals

Tokenization Efficiency (token counts; lower is better)

| Tokenizer | English | Hindi | Tamil | Bengali | Malayalam | Telugu | Gujarati | Punjabi | Code_Python | Code_Java | C++ | Math |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 (128k) | 338874 | 22855 | 48957 | 39617 | 73928 | 40345 | 101020 | 79172 | 5231 | 2224 | 7055 | 5376 |
| unsloth/phi-4 (100k) | 308645 | 40456 | 59750 | 116122 | 149889 | 48689 | 118335 | 87413 | 4809 | 2110 | 6529 | 5573 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B (128k) | 308512 | 21110 | 59625 | 115138 | 149883 | 48661 | 118061 | 86765 | 4809 | 2111 | 6530 | 5574 |
| unsloth/gemma-2-9b-it (256k) | 323335 | 15916 | 53913 | 53402 | 57219 | 47610 | 107925 | 87222 | 5948 | 2569 | 8639 | 5871 |
| Ornaments/72k-Bilingual-BBPE-TK-SPM (72k, old) | 366710 | 11447 | 61408 | 94191 | 97207 | 50229 | 117874 | 90045 | 8201 | 4000 | 13706 | 5585 |
| Ornaments/72k-Bilingual-BBPE-TK-SPM-Identity (72k) | 330830 | 10318 | 59089 | 93740 | 92655 | 44975 | 109411 | 87922 | 7819 | 3743 | 12953 | 5253 |
| **Ornaments/72k-TK-BBPE-HF (72k, this tokenizer)** | 321274 | 10813 | 67585 | 159985 | 193813 | 55654 | 134397 | 97063 | 5225 | 2263 | 7090 | 5150 |
| nvidia/Nemotron-4-Mini-Hindi-4B-Instruct (256k) | 332271 | 14327 | 55473 | 36615 | 45783 | 48270 | 160115 | 117174 | 6186 | 2732 | 8861 | 6136 |
| sarvamai/OpenHathi-7B-Hi-v0.1-Base (48k) | 370133 | 15633 | 67845 | 120340 | 105953 | 68315 | 159122 | 113817 | 6595 | 2792 | 9233 | 6223 |
| sarvamai/sarvam-1 (68k) | 385386 | 11257 | 61396 | 27348 | 31822 | 51463 | 119666 | 103344 | 7331 | 3068 | 9724 | 6864 |
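
The counts above can be reproduced by encoding one fixed text sample per column with each tokenizer and comparing lengths; a minimal sketch (the sample path and model list are placeholders):

from transformers import AutoTokenizer

def token_count(model_id: str, text: str) -> int:
    # Fewer tokens on the same text means better compression for that script
    tok = AutoTokenizer.from_pretrained(model_id)
    return len(tok.encode(text, add_special_tokens=False))

sample_hi = open("samples/hindi.txt", encoding="utf-8").read()  # placeholder path
for model_id in ["deepseek-ai/DeepSeek-R1", "sarvamai/sarvam-1"]:
    print(model_id, token_count(model_id, sample_hi))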

Encode-Decode
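
Each example below is a plain encode/decode round trip; a minimal sketch of how they can be generated, assuming the trained tokenizer.json:

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
enc = tok.encode("नमस्ते world 123")  # any sample text
print("Tokens :", enc.tokens)
print("Encoded:", enc.ids)
print("Len Tokens", len(enc.ids))
print("Decoded:", tok.decode(enc.ids))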

  • Hindi
Input  : ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
Tokens: ['à¤ĭ', 'त', 'à¥ģà¤°à¤¾à¤ľ', 'Ġà¤Ĺायà¤ķ', 'वाड़', 'Ġ(', 'à¤ķपà¥įतान', '),', 'Ġडà¥ĩ', 'व', 'à¥ĭन', 'Ġà¤ķà¥īन', 'वà¥ĩ', ',', 'Ġरà¤ļ', 'िन', 'Ġरविà¤Ĥदà¥įर', ',', 'Ġराहà¥ģल', 'Ġतà¥įरिप', 'à¤¾à¤łà¥Ģ', ',', 'Ġशिवम', 'Ġदà¥ģबà¥ĩ', ',', 'Ġरविà¤Ĥदà¥įर', 'Ġà¤ľà¤¡à¥ĩà¤ľà¤¾', ',', 'Ġà¤ıमà¤ıस', 'Ġधà¥ĭनà¥Ģ', 'Ġ(', 'व', 'िà¤ķà¥ĩà¤Ł', 'à¤ķà¥Ģपर', '),', 'Ġà¤Ĩर', 'Ġà¤ħशà¥įविन', ',', 'Ġमà¥Ģ', 'थ', 'ाशा', 'Ġपथ', 'िर', 'ाना', ',', 'Ġà¤ĸलà¥Ģल', 'Ġà¤ħहमद', ',', 'Ġनà¥Ĥर', 'Ġà¤ħहमद', '।']
Encoded: [38659, 299, 21358, 15506, 7249, 509, 28249, 1222, 2308, 357, 1731, 8940, 2506, 14, 17890, 504, 19058, 14, 4384, 9183, 7568, 14, 18827, 13293, 14, 19058, 13516, 14, 17978, 12756, 509, 357, 3072, 14080, 1222, 2215, 17009, 14, 7584, 942, 22395, 11558, 647, 901, 14, 39383, 6593, 14, 25750, 6593, 337]
Len Tokens 51
Decoded: ऋतुराज गायकवाड़ (कप्तान), डेवोन कॉनवे, रचिन रविंद्र, राहुल त्रिपाठी, शिवम दुबे, रविंद्र जडेजा, एमएस धोनी (विकेटकीपर), आर अश्विन, मीथाशा पथिराना, खलील अहमद, नूर अहमद।
  • English
Input  : Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
Tokens: ['B', 'ang', 'alore', 'Ġand', 'ĠChennai', 'Ġhave', 'Ġfaced', 'Ġeach', 'Ġother', 'Ġin', 'Ġ', '33', 'Ġmatches', 'Ġin', 'ĠIPL', '.', 'ĠOut', 'Ġof', 'Ġthese', 'Ġ', '33', 'Ġgames', ',', 'ĠBangalore', 'Ġhave', 'Ġwon', 'Ġ', '11', 'Ġwhereas', 'ĠChennai', 'Ġhave', 'Ġcome', 'Ġout', 'Ġvict', 'orious', 'Ġon', 'Ġ', '21', 'Ġoccasion', '.', 'Ġ', '1', 'Ġmatch', 'Ġended', 'Ġwithout', 'Ġa', 'Ġresult', '.']
Encoded: [36, 951, 30658, 364, 45274, 688, 20861, 1993, 1101, 360, 223, 3276, 15006, 360, 11519, 16, 7921, 368, 1576, 223, 3276, 5013, 14, 45076, 688, 4896, 223, 1281, 21170, 45274, 688, 3051, 892, 9592, 29166, 462, 223, 2428, 13344, 16, 223, 19, 5359, 12784, 2752, 284, 1899, 16]
Len Tokens 48
Decoded: Bangalore and Chennai have faced each other in 33 matches in IPL. Out of these 33 games, Bangalore have won 11 whereas Chennai have come out victorious on 21 occasion. 1 match ended without a result.
  • Math
Input  : % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi

Tokens: ['%', 'ĠChange', 'Ġthe', 'Ġfont', 'Ġif', 'Ġyou', 'Ġwant', 'Ġto', ',', 'Ġdepending', 'Ġon', 'Ġwhether', 'Ċ', '%', "Ġyou're", 'Ġusing', 'Ġpd', 'fl', 'ate', 'x', 'Ġor', 'Ġx', 'el', 'ate', 'x', '/l', 'ual', 'ate', 'x', 'Ċ', '%', 'ĠWH', 'EN', 'ĠCOMP', 'IL', 'ING', 'ĠWITH', 'ĠX', 'EL', 'ATE', 'X', 'ĠPLEASE', 'ĠUSE', 'Ċ', '%', 'Ġx', 'el', 'ate', 'x', 'Ġ-', 'shell', '-', 'escape', 'Ġ-', 'output', '-d', 'river', '="', 'xd', 'v', 'ip', 'df', 'mx', 'Ġ-', 'z', 'Ġ', '0', '"', 'Ġsample', '.', 'tex', 'Ċ', '\\', 'ift', 'utex', 'Ċ', 'Ġ', 'Ġ%', 'ĠIf', 'Ġusing', 'Ġx', 'el', 'ate', 'x', 'Ġor', 'Ġl', 'ual', 'ate', 'x', ':Ċ', 'Ġ', 'Ġ\\', 'set', 'main', 'font', '{R', 'ob', 'oto', 'ĠSl', 'ab', '}Ċ', 'Ġ', 'Ġ\\', 'sets', 'ans', 'font', '{L', 'ato', '}Ċ', 'Ġ', 'Ġ\\', 'renewcommand', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}Ċ', '\\', 'else', 'Ċ', 'Ġ', 'Ġ%', 'ĠIf', 'Ġusing', 'Ġpd', 'fl', 'ate', 'x', ':Ċ', 'Ġ', 'Ġ\\', 'us', 'ep', 'ackage', '[', 'rm', ']{', 'rob', 'oto', '}Ċ', 'Ġ', 'Ġ\\', 'us', 'ep', 'ackage', '[', 'defaults', 'ans', ']{', 'l', 'ato', '}Ċ', 'Ġ', 'Ġ%', 'Ġ\\', 'us', 'ep', 'ackage', '{s', 'ources', 'ans', 'pro', '}Ċ', 'Ġ', 'Ġ\\', 'renewcommand', '{\\', 'family', 'default', '}{\\', 'sf', 'default', '}Ċ', '\\fi', 'Ċ']
Encoded: [7, 20642, 307, 10013, 803, 449, 1654, 349, 14, 11248, 462, 4806, 201, 7, 7412, 2233, 34245, 6404, 520, 90, 578, 2163, 395, 520, 90, 21145, 1316, 520, 90, 201, 7, 20360, 2167, 49037, 5195, 4249, 25624, 2712, 7413, 7119, 58, 65107, 22822, 201, 7, 2163, 395, 520, 90, 904, 47931, 15, 38885, 904, 9854, 3209, 11707, 772, 27503, 88, 1056, 8772, 44531, 904, 92, 223, 18, 4, 10164, 16, 8774, 201, 62, 3113, 17783, 201, 223, 3259, 1783, 2233, 2163, 395, 520, 90, 578, 390, 1316, 520, 90, 1215, 223, 514, 1292, 7517, 5685, 4020, 1216, 6289, 11833, 483, 612, 223, 514, 8645, 820, 5685, 6459, 10542, 612, 223, 514, 67762, 676, 34277, 7107, 4403, 5765, 7107, 612, 62, 7583, 201, 223, 3259, 1783, 2233, 34245, 6404, 520, 90, 1215, 223, 514, 447, 1057, 14270, 61, 1876, 6592, 20636, 6289, 612, 223, 514, 447, 1057, 14270, 61, 71659, 820, 6592, 78, 10542, 612, 223, 3259, 514, 447, 1057, 14270, 6170, 4113, 820, 1387, 612, 223, 514, 67762, 676, 34277, 7107, 4403, 5765, 7107, 612, 68146, 201]
Len Tokens 177
Decoded: % Change the font if you want to, depending on whether
% you're using pdflatex or xelatex/lualatex
% WHEN COMPILING WITH XELATEX PLEASE USE
% xelatex -shell-escape -output-driver="xdvipdfmx -z 0" sample.tex
\iftutex
  % If using xelatex or lualatex:
  \setmainfont{Roboto Slab}
  \setsansfont{Lato}
  \renewcommand{\familydefault}{\sfdefault}
\else
  % If using pdflatex:
  \usepackage[rm]{roboto}
  \usepackage[defaultsans]{lato}
  % \usepackage{sourcesanspro}
  \renewcommand{\familydefault}{\sfdefault}
\fi
  • Code
Input  : class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
Tokens: ['class', 'ĠSentence', 'P', 'iece', 'Un', 'ig', 'ram', 'Token', 'izer', '(Base', 'Token', 'izer', '):Ċ', 'ĠĠĠ', 'Ġ"""', 'Sentence', 'P', 'iece', 'ĠUn', 'ig', 'ram', 'ĠToken', 'izer', 'ĊĊ', 'ĠĠĠ', 'ĠRep', 'resents', 'Ġthe', 'ĠUn', 'ig', 'ram', 'Ġalgorithm', ',', 'Ġwith', 'Ġthe', 'Ġpret', 'oken', 'ization', 'Ġused', 'Ġby', 'ĠSentence', 'P', 'iece', 'Ċ', 'ĠĠĠ', 'Ġ"""ĊĊ', 'ĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__', '(Ċ', 'ĠĠĠĠĠĠĠ', 'Ġself', ',Ċ', 'ĠĠĠĠĠĠĠ', 'Ġvoc', 'ab', ':', 'ĠOptional', '[', 'List', '[T', 'uple', '[str', ',', 'Ġfloat', ']]', ']', 'Ġ=', 'ĠNone', ',Ċ', 'ĠĠĠĠĠĠĠ', 'Ġreplacement', ':', 'Ġstr', 'Ġ=', 'Ġ"', 'âĸ', 'ģ', '",Ċ', 'ĠĠĠĠĠĠĠ', 'Ġadd', '_prefix', '_space', ':', 'Ġbool', 'Ġ=', 'ĠTrue', ',Ċ', 'ĠĠĠ', 'Ġ):Ċ', 'ĠĠĠĠĠĠĠ', 'Ġif', 'Ġvoc', 'ab', 'Ġis', 'Ġnot', 'ĠNone', ':Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġ#', 'ĠLet', 'ĠUn', 'ig', 'ram', '(', '..', ')', 'Ġfail', 'Ġif', 'Ġonly', 'Ġone', 'Ġof', 'Ġthem', 'Ġis', 'ĠNone', 'Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġtoken', 'izer', 'Ġ=', 'ĠToken', 'izer', '(', 'Un', 'ig', 'ram', '(v', 'oc', 'ab', '))Ċ', 'ĠĠĠĠĠĠĠ', 'Ġelse', ':Ċ', 'ĠĠĠĠĠĠĠĠĠĠĠ', 'Ġtoken', 'izer', 'Ġ=', 'ĠToken', 'izer', '(', 'Un', 'ig', 'ram', '())']
Encoded: [2805, 45192, 50, 15717, 5091, 436, 1293, 13735, 7625, 62228, 13735, 7625, 2818, 413, 8533, 63588, 50, 15717, 2644, 436, 1293, 37433, 7625, 1025, 413, 4954, 13180, 307, 2644, 436, 1293, 11436, 14, 505, 307, 5992, 6907, 2920, 1909, 679, 45192, 50, 15717, 201, 413, 25641, 413, 1333, 4304, 3747, 1614, 3873, 545, 1572, 740, 545, 25497, 483, 28, 22800, 61, 3754, 42378, 15732, 27446, 14, 10809, 17233, 63, 532, 5200, 740, 545, 13804, 28, 2030, 532, 698, 27234, 226, 3288, 545, 1290, 31498, 34542, 28, 7817, 532, 9402, 740, 413, 42359, 545, 803, 25497, 483, 429, 696, 5200, 1215, 829, 1769, 4983, 2644, 436, 1293, 10, 879, 11, 6312, 803, 1407, 963, 368, 1212, 429, 5200, 201, 829, 15025, 7625, 532, 37433, 7625, 10, 5091, 436, 1293, 8425, 1287, 483, 4095, 545, 2589, 1215, 829, 15025, 7625, 532, 37433, 7625, 10, 5091, 436, 1293, 9066]
Len Tokens 146
Decoded: class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            # Let Unigram(..) fail if only one of them is None
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())
  • Emoji
Input  : 😜🫤☹️😖🤢🤮😇🐻‍❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑‍🦲👨‍🚒👨‍🚀
Tokens: ['ðŁ', 'ĺ', 'ľ', 'ðŁ', '«', '¤', 'â', 'ĺ', '¹', 'ï¸ı', 'ðŁ', 'ĺ', 'ĸ', 'ðŁ', '¤¢', 'ðŁ', '¤®', 'ðŁ', 'ĺ', 'ĩ', 'ðŁ', 'IJ', '»', 'âĢ', 'į', 'â', 'Ŀ', 'Ħ', 'ï¸ı', 'ðŁ', '¦', 'Ħ', 'ðŁ', 'IJ', '¾', 'ðŁ', 'IJ', '½', 'ðŁ', 'IJ', 'į', 'ðŁ', '¦', 'ŀ', 'ðŁ', '¦', 'IJ', 'ðŁ', '¦', '¿', 'ðŁ', '¤', '´', 'ðŁ', '§', 'ij', 'âĢ', 'į', 'ðŁ', '¦', '²', 'ðŁ', 'ij', '¨', 'âĢ', 'į', 'ðŁ', 'ļ', 'Ĵ', 'ðŁ', 'ij', '¨', 'âĢ', 'į', 'ðŁ', 'ļ', 'Ģ']
Encoded: [17635, 249, 253, 17635, 107, 100, 161, 249, 120, 67378, 17635, 249, 247, 17635, 6387, 17635, 326, 17635, 249, 232, 17635, 241, 122, 461, 238, 161, 254, 229, 67378, 17635, 102, 229, 17635, 241, 125, 17635, 241, 124, 17635, 241, 238, 17635, 102, 255, 17635, 102, 241, 17635, 102, 126, 17635, 100, 115, 17635, 103, 242, 461, 238, 17635, 102, 113, 17635, 242, 104, 461, 238, 17635, 251, 243, 17635, 242, 104, 461, 238, 17635, 251, 225]
Len Tokens 77
Decoded: 😜🫤☹️😖🤢🤮😇🐻‍❄️🦄🐾🐽🐍🦞🦐🦿🤴🧑‍🦲👨‍🚒👨‍🚀
  • Sanskrit
Input  : ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ. 
Tokens: ['à¥', 'IJ', 'Ġतà¥įर', 'à¥įयम', 'à¥įब', 'à¤ķà¤Ĥ', 'Ġय', 'à¤ľà¤¾à¤®', 'हà¥ĩ', 'Ġसà¥ģ', 'à¤Ĺन', 'à¥įध', 'िà¤Ĥ', 'Ġपà¥ģषà¥įà¤Ł', 'िव', 'रà¥įध', 'नम', 'à¥į', 'Ġà¤īरà¥įव', 'ार', 'à¥ģà¤ķ', 'म', 'िव', 'Ġबन', 'à¥įध', 'नान', 'à¥įम', 'à¥ĥतà¥įय', 'à¥ĭ', 'रà¥įम', 'à¥ģ', 'à¤ķà¥įष', 'à¥Ģय', 'Ġमाम', 'à¥ĥत', 'ात', 'à¥į', 'Ġà¥IJ', '.', 'Ġ']
Encoded: [261, 241, 5148, 1385, 2474, 69046, 452, 13431, 24956, 1196, 9464, 1074, 571, 56898, 616, 3985, 12134, 270, 19111, 315, 704, 327, 616, 854, 1074, 54741, 632, 15421, 282, 760, 304, 625, 2095, 1061, 1583, 471, 270, 21199, 16, 223]
Len Tokens 40
Decoded: ॐ त्र्यम्बकं यजामहे सुगन्धिं पुष्टिवर्धनम् उर्वारुकमिव बन्धनान्मृत्योर्मुक्षीय मामृतात् ॐ. 