---
library_name: transformers
tags:
- transformers.js
- tokenizers
---

This is an updated version of [Xenova/dbrx-instruct-tokenizer](https://huggingface.co/Xenova/dbrx-instruct-tokenizer) which completes the tokenizer's vocabulary with extra unused tokens to ensure that `config.vocab_size == tokenizer.vocab_size`, which was [not the case](https://huggingface.co/databricks/dbrx-base/discussions/18) in the original model. This makes the tokenizer compatible with llama.cpp.

## Why should you use this and not the tiktoken included in the original model?

1. This tokenizer is validated against the https://huggingface.co/datasets/xn dataset (all languages) to be encode/decode compatible with the dbrx-base tiktoken tokenizer.
2. The original tokenizer pads the vocabulary to the correct size with `` tokens, but the encoder never uses them.
3. The original tokenizer uses eos as the pad token, which may cause trainers to mask out the eos token so that the model never learns to output eos.
4. This tokenizer has a complete vocabulary.

Modified from the original code at https://huggingface.co/Xenova/dbrx-instruct-tokenizer.

Changes:

1. Remove non-base-model tokens.
2. Keep/add the `<|pad|>` special token to make sure padding can be differentiated from eos/bos.
3. Expose 15 unused/reserved `<|extra_N|>` tokens for use.
4. Expose 75 more unused/reserved `<|extra_added_N|>` tokens.

```jsonc
// pad token
"100256": {
  "content": "<|pad|>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},

// 15 unused/reserved extra tokens
"<|extra_0|>": 100261
"<|extra_1|>": 100262
...
"<|extra_14|>": 100275

// 75 unused/reserved "extra" extra tokens after the EOS token
"<|extra_added_0|>": 100277
"<|extra_added_1|>": 100278
...
"<|extra_added_74|>": 100351
```

# DBRX Instruct Tokenizer

A 🤗-compatible version of the **DBRX Instruct** tokenizer (adapted from [databricks/dbrx-instruct](https://huggingface.co/databricks/dbrx-instruct)). This means it can be used with Hugging Face libraries including [Transformers](https://github.com/huggingface/transformers), [Tokenizers](https://github.com/huggingface/tokenizers), and [Transformers.js](https://github.com/xenova/transformers.js).

## Example usage:

### Transformers/Tokenizers

```py
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/dbrx-instruct-tokenizer')
assert tokenizer.encode('hello world') == [15339, 1917]
```

### Transformers.js

```js
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/dbrx-instruct-tokenizer');
const tokens = tokenizer.encode('hello world'); // [15339, 1917]
```
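
### Padding and reserved tokens

Since this tokenizer defines a dedicated `<|pad|>` token (id 100256) that is distinct from eos, and the reserved `<|extra_N|>` / `<|extra_added_N|>` tokens are already part of the vocabulary, both can be used without resizing the model's embedding matrix. The snippet below is a minimal sketch: it assumes you load this repository's files (substitute the actual repo id or a local path) and that the pad token is registered in the tokenizer config as described above.

```py
from transformers import GPT2TokenizerFast

# Substitute this repository's id or a local path to the tokenizer files.
tokenizer = GPT2TokenizerFast.from_pretrained('path/to/this-tokenizer')

# The pad token is separate from eos, so padded positions can be masked out
# during training without also masking the eos token.
print(tokenizer.pad_token, tokenizer.pad_token_id)  # expected: <|pad|> 100256

# Batch-encode with padding; the attention mask marks the padded positions.
batch = tokenizer(['hello world', 'a slightly longer example sentence'], padding=True)
print(batch['input_ids'])
print(batch['attention_mask'])

# The reserved tokens already have fixed ids, so they can be repurposed as
# custom control tokens without calling resize_token_embeddings on the model.
print(tokenizer.convert_tokens_to_ids('<|extra_0|>'))  # expected: 100261
```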