This is an updated version of https://huggingface.co/LnL-AI/dbrx-base-tokenizer. It completes the tokenizer's vocabulary with extra unused tokens so that config.vocab_size == tokenizer.vocab_size, which was not the case in the original model. This makes the tokenizer compatible with llama.cpp.
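
As a rough illustration of what completing a vocabulary means, the sketch below gives every unused id below a target size a placeholder token. The toy vocabulary, the `fill_vocab` helper, and the `<|filler_N|>` naming are hypothetical and are not taken from this repository; only the goal of matching the vocabulary size to the configured size comes from the description above.

```python
def fill_vocab(vocab, target_size, name_format="<|filler_{}|>"):
    """Return a copy of vocab (token -> id) where every unused id
    in range(target_size) is assigned a placeholder token."""
    used_ids = set(vocab.values())
    completed = dict(vocab)
    filler_index = 0
    for token_id in range(target_size):
        if token_id not in used_ids:
            completed[name_format.format(filler_index)] = token_id
            filler_index += 1
    return completed


# Toy data (hypothetical): ids 2, 3 and 5 are missing from the vocabulary.
toy_vocab = {"hello": 0, "world": 1, "<|eos|>": 4}
toy_config_vocab_size = 6

completed_vocab = fill_vocab(toy_vocab, toy_config_vocab_size)

# After filling, the vocabulary size matches the configured size.
print(len(completed_vocab) == toy_config_vocab_size)
```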

Why should you use this and not the tiktoken tokenizer included in the original model?

1. This tokenizer has been validated against the https://huggingface.co/datasets/xn dataset (all languages) to be encode/decode compatible with the dbrx-base tiktoken tokenizer.
2. The original tokenizer pads the vocabulary to the correct size with <extra_N> tokens, but the encoder never uses them.
3. The original tokenizer uses eos as the pad token, which may lead trainers to mask out the eos token, so the model never learns to output eos.
4. This tokenizer has a complete vocabulary.
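
The eos/pad problem can be sketched in a few lines of plain Python. This is an illustration, not the code of any particular trainer: `mask_padding` is a hypothetical helper, `EOS_ID` is an assumed value chosen for the example, and `-100` is the default `ignore_index` of PyTorch's cross-entropy loss. The pad id 100256 and the token ids for 'hello world' are the ones shown elsewhere in this README.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(input_ids, pad_id):
    """Build labels by replacing every position equal to pad_id with IGNORE_INDEX."""
    return [IGNORE_INDEX if token == pad_id else token for token in input_ids]


EOS_ID = 100257  # assumption for illustration only
PAD_ID = 100256  # <|pad|> id from this tokenizer

sequence = [15339, 1917, EOS_ID]  # 'hello world' followed by eos

# Case 1: eos doubles as the pad token, so masking padding also masks the real eos.
padded_shared = sequence + [EOS_ID] * 2
labels_shared = mask_padding(padded_shared, pad_id=EOS_ID)

# Case 2: a dedicated pad token keeps the real eos in the labels.
padded_distinct = sequence + [PAD_ID] * 2
labels_distinct = mask_padding(padded_distinct, pad_id=PAD_ID)

print(labels_shared)
print(labels_distinct)
```

In the first case no eos label survives, so the loss never rewards the model for ending a sequence; in the second case the eos label is kept and only the padding is ignored.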

Modified from the original code at https://huggingface.co/Xenova/dbrx-instruct-tokenizer

Changes:
1. Remove non-base model tokens
2. Keep/Add `<|pad|>` special token to make sure padding can be differentiated from eos/bos.
3. Expose 15 unused/reserved `<|extra_N|>` for use
4. Expose 75 more unused/reserved `<|extra_added_N|>` tokens

# pad token
"100256": {
  "content": "<|pad|>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},

# 15 unused/reserved extra tokens
"<|extra_0|>": 100261
"<|extra_1|>": 100262
...
"<|extra_14|>": 100275

# 75 unused/reserved "extra" extra tokens after the EOS token
"<|extra_added_0|>": 100277
"<|extra_added_1|>": 100278
...
"<|extra_added_74|>": 100351
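
The two reserved ranges listed above can be reproduced with a short sketch, which is handy for checking the arithmetic of the elided entries. The `reserved_tokens` helper is hypothetical; the token names, starting ids, and counts are taken from the listing above. The implied vocabulary size is computed under the assumption that ids start at 0 and have no gaps.

```python
def reserved_tokens(prefix, first_id, count):
    """Map '<|prefix_i|>' to first_id + i for i in 0..count-1."""
    return {f"<|{prefix}_{i}|>": first_id + i for i in range(count)}


# 15 unused/reserved extra tokens
extra = reserved_tokens("extra", 100261, 15)

# 75 unused/reserved tokens after the EOS token
extra_added = reserved_tokens("extra_added", 100277, 75)

# Assumption: ids are contiguous from 0, so the size is the highest id plus one.
implied_vocab_size = max(extra_added.values()) + 1

print(extra["<|extra_14|>"], extra_added["<|extra_added_74|>"], implied_vocab_size)
```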

DBRX Instruct Tokenizer

A 🤗-compatible version of the DBRX Instruct tokenizer (adapted from databricks/dbrx-instruct). This means it can be used with Hugging Face libraries including Transformers, Tokenizers, and Transformers.js.

Example usage:

Transformers/Tokenizers

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/dbrx-instruct-tokenizer')
assert tokenizer.encode('hello world') == [15339, 1917]

Transformers.js

import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/dbrx-instruct-tokenizer');
const tokens = tokenizer.encode('hello world'); // [15339, 1917]