Chat Template Tokenizer for c4ai-command-r-v01

This repository contains a fast tokenizer for CohereForAI/c4ai-command-r-v01 that ships with a chat template. The tokenizer was created by overwriting the string values of the original tokens at IDs 255000 (<|START_OF_TURN_TOKEN|>) and 255001 (<|END_OF_TURN_TOKEN|>) with the role tokens <|SYSTEM_TOKEN|>, <|USER_TOKEN|> and <|CHATBOT_TOKEN|>.

No new tokens were added during that process to ensure that the original model's embedding doesn't need to be modified.
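The idea of renaming token strings without adding tokens can be illustrated on a toy vocabulary. The sketch below is not the script used to build this tokenizer; the function `rename_tokens`, the toy vocabulary, and the marker names are all hypothetical. It only shows that replacing the string of an existing entry leaves every ID and the vocabulary size unchanged, which is why the embedding matrix can stay as it is.

```python
# Toy illustration only: NOT the procedure used to build this tokenizer.
# All names below (rename_tokens, toy_vocab, the marker strings) are hypothetical.

def rename_tokens(vocab, renames):
    """Return a copy of `vocab` (token string -> ID) with some strings renamed.

    IDs are never changed and no entries are added or removed.
    """
    if len(set(renames.values())) != len(renames):
        raise ValueError("two tokens would be renamed to the same string")
    for old, new in renames.items():
        if old not in vocab:
            raise KeyError(f"token to rename not found: {old!r}")
        if new in vocab:
            raise ValueError(f"new token string already exists: {new!r}")
    # Keep each ID, swap only the string where a rename is requested.
    return {renames.get(token, token): token_id for token, token_id in vocab.items()}


toy_vocab = {"hello": 0, "world": 1, "<|OLD_MARKER_A|>": 2, "<|OLD_MARKER_B|>": 3}
renamed = rename_tokens(
    toy_vocab,
    {"<|OLD_MARKER_A|>": "<|NEW_MARKER_A|>", "<|OLD_MARKER_B|>": "<|NEW_MARKER_B|>"},
)

# Same number of entries, same set of IDs: nothing for the embedding layer to resize.
print(len(renamed) == len(toy_vocab))
print(sorted(renamed.values()) == sorted(toy_vocab.values()))
```

Because the set of IDs is identical before and after, a model that indexes its embeddings by ID sees no structural change; only the text that maps to those IDs differs.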

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")

messages = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(chatml)

# <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|>
# <|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>I'm doing great. How can I help you today?<|END_OF_TURN_TOKEN|>
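To make the turn layout in the printed output easier to follow, here is a plain-Python sketch that builds the same shape of string. It is an approximation, not the Jinja template stored in this tokenizer: the helper `render_turns` and the `ROLE_TOKENS` mapping are hypothetical, turns are assumed to be concatenated with no separator, and the behaviour for `add_generation_prompt=True` (opening an empty chatbot turn) is an assumption. The real template may also add a BOS token or differ in whitespace.

```python
# Hypothetical re-implementation of the turn layout, for illustration only.
# The real formatting is defined by the tokenizer's chat template.

ROLE_TOKENS = {
    "system": "<|SYSTEM_TOKEN|>",
    "user": "<|USER_TOKEN|>",
    "assistant": "<|CHATBOT_TOKEN|>",
}

START = "<|START_OF_TURN_TOKEN|>"
END = "<|END_OF_TURN_TOKEN|>"


def render_turns(messages, add_generation_prompt=False):
    """Wrap each message as START + role token + content + END."""
    parts = []
    for message in messages:
        role_token = ROLE_TOKENS[message["role"]]  # KeyError on unknown roles
        parts.append(START + role_token + message["content"] + END)
    if add_generation_prompt:
        # Assumption: leave an open chatbot turn for the model to complete.
        parts.append(START + ROLE_TOKENS["assistant"])
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

rendered = render_turns(messages)
print(rendered)
```

For real use, rely on `tokenizer.apply_chat_template` as shown above rather than on this helper.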

Test

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("prince-canuma/c4ai-command-r-v01-tokenizer-chat-template")
original_tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

# Print the special tokens of both tokenizers for comparison
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)

# Check that both vocabularies have the same size
assert len(tokenizer) == len(original_tokenizer), "The tokenizers do not have the same vocabulary size"
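The length check confirms that no tokens were added, but it does not show which token strings differ. A possible follow-up, sketched here on toy dictionaries so it runs without downloading anything, is to compare the two vocabularies ID by ID. The helper `changed_tokens` and the toy data are hypothetical and not part of this repository.

```python
# Hypothetical helper: list the IDs whose token string differs between two
# vocabularies given as {token string: ID} dictionaries.

def changed_tokens(original_vocab, new_vocab):
    """Return {id: (original string, new string)} for every ID that differs."""
    original_by_id = {token_id: token for token, token_id in original_vocab.items()}
    new_by_id = {token_id: token for token, token_id in new_vocab.items()}
    all_ids = sorted(set(original_by_id) | set(new_by_id))
    return {
        token_id: (original_by_id.get(token_id), new_by_id.get(token_id))
        for token_id in all_ids
        if original_by_id.get(token_id) != new_by_id.get(token_id)
    }


# Toy data standing in for two real vocabularies.
toy_original = {"hello": 0, "world": 1, "<|OLD_MARKER|>": 2}
toy_modified = {"hello": 0, "world": 1, "<|NEW_MARKER|>": 2}

diff = changed_tokens(toy_original, toy_modified)
print(diff)
```

With the tokenizers loaded above, the same function could be called as `changed_tokens(original_tokenizer.get_vocab(), tokenizer.get_vocab())`; only the renamed entries should appear in the result, assuming no other differences between the two vocabularies.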