Getting Started with Chat Templates for Text LLMs
An increasingly common use case for LLMs is chat. In a chat context, rather than continuing a single string of text (as is the case with a standard language model), the model instead continues a conversation that consists of one or more messages, each of which includes a role, like “user” or “assistant”, as well as message text.
Much like tokenization, different models expect very different input formats for chat. This is the reason we added chat templates as a feature. Chat templates are part of the tokenizer for text-only LLMs, or of the processor for multimodal LLMs. They specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects.
This page covers the basic usage of chat templates with text-only LLMs. For multimodal models, we have a dedicated documentation page that covers how to work with image, video and audio inputs in your templates.
Let’s make this concrete with a quick example using the mistralai/Mistral-7B-Instruct-v0.1 model:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
>>> chat = [
... {"role": "user", "content": "Hello, how are you?"},
... {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
... {"role": "user", "content": "I'd like to show off how chat templating works!"},
... ]
>>> tokenizer.apply_chat_template(chat, tokenize=False)
"<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"
Notice how the tokenizer has added the control tokens [INST] and [/INST] to indicate the start and end of user messages (but not assistant messages!), and the entire chat is condensed into a single string. If we use tokenize=True, which is the default setting, that string will also be tokenized for us.
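For reference, a quick sketch of that default behaviour (the exact token IDs will depend on the tokenizer's vocabulary):
>>> token_ids = tokenizer.apply_chat_template(chat)  # tokenize=True is the default
>>> tokenizer.decode(token_ids)  # should recover (approximately) the formatted string shown above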
Now, try the same code, but swap in the HuggingFaceH4/zephyr-7b-beta model instead, and you should get:
<|user|> Hello, how are you?</s> <|assistant|> I'm doing great. How can I help you today?</s> <|user|> I'd like to show off how chat templating works!</s>
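For reference, the snippet that produces this output differs from the Mistral example only in the checkpoint name:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
>>> tokenizer.apply_chat_template(chat, tokenize=False)  # same `chat` list as above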
Both Zephyr and Mistral-Instruct were fine-tuned from the same base model, Mistral-7B-v0.1. However, they were trained with totally different chat formats. Without chat templates, you would have to write manual formatting code for each model, and it’s very easy to make minor errors that hurt performance! Chat templates handle the details of formatting for you, allowing you to write universal code that works for any model.
How do I use chat templates?
As you can see in the example above, chat templates are easy to use. Simply build a list of messages, with role and content keys, and then pass it to the apply_chat_template() method of your tokenizer (or of your processor, for multimodal models). Once you do that, you’ll get output that’s ready to go! When using chat templates as input for model generation, it’s also a good idea to use add_generation_prompt=True to add a generation prompt.
Here’s an example of preparing input for model.generate(), using Zephyr again:
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint) # You may want to use bfloat16 and/or move to GPU here
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))
This will yield a string in the input format that Zephyr expects.
<|system|> You are a friendly chatbot who always responds in the style of a pirate</s> <|user|> How many helicopters can a human eat in one sitting?</s> <|assistant|>
Now that our input is formatted correctly for Zephyr, we can use the model to generate a response to the user’s question:
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
This will yield:
<|system|> You are a friendly chatbot who always responds in the style of a pirate</s> <|user|> How many helicopters can a human eat in one sitting?</s> <|assistant|> Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all.
Arr, ‘twas easy after all!
Is there an automated pipeline for chat?
Yes, there is! Our text generation pipelines support chat inputs, which makes it easy to use chat models. In the past, we used to use a dedicated “ConversationalPipeline” class, but this has now been deprecated and its functionality has been merged into the TextGenerationPipeline. Let’s try the Zephyr example again, but this time using a pipeline:
from transformers import pipeline
pipe = pipeline("text-generation", "HuggingFaceH4/zephyr-7b-beta")
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
print(pipe(messages, max_new_tokens=128)[0]['generated_text'][-1]) # Print the assistant's response
{'role': 'assistant', 'content': "Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all."}
The pipeline will take care of all the details of tokenization and calling apply_chat_template for you - once the model has a chat template, all you need to do is initialize the pipeline and pass it the list of messages!
What are “generation prompts”?
You may have noticed that the apply_chat_template method has an add_generation_prompt argument. This argument tells the template to add tokens that indicate the start of a bot response. For example, consider the following chat:
messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"}
]
Here’s what this will look like without a generation prompt, for a model that uses standard “ChatML” formatting:
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
"""
And here’s what it looks like with a generation prompt:
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
"""
Note that this time, we’ve added the tokens that indicate the start of a bot response. This ensures that when the model generates text it will write a bot response instead of doing something unexpected, like continuing the user’s message. Remember, chat models are still just language models - they’re trained to continue text, and chat is just a special kind of text to them! You need to guide them with appropriate control tokens, so they know what they’re supposed to be doing.
Not all models require generation prompts. Some models, like LLaMA, don’t have any special tokens before bot responses. In these cases, the add_generation_prompt argument will have no effect. The exact effect that add_generation_prompt has will depend on the template being used.
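If you want to check this for a particular checkpoint, you can simply compare the two outputs. Here’s a sketch, using a LLaMA-style chat checkpoint purely as an illustration (substitute any checkpoint you have access to):
from transformers import AutoTokenizer

# Illustrative checkpoint; any template with no assistant-start tokens behaves this way
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
messages = [{"role": "user", "content": "Hi there!"}]

without_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
with_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(without_prompt == with_prompt)  # True when the template adds nothing before bot responses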
What does “continue_final_message” do?
When passing a list of messages to apply_chat_template or TextGenerationPipeline, you can choose to format the chat so the model will continue the final message in the chat instead of starting a new one. This is done by removing any end-of-sequence tokens that indicate the end of the final message, so that the model will simply extend the final message when it begins to generate text. This is useful for “prefilling” the model’s response. Here’s an example:
chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

formatted_chat = tokenizer.apply_chat_template(chat, tokenize=True, return_dict=True, return_tensors="pt", continue_final_message=True)
model.generate(**formatted_chat)
The model will generate text that continues the JSON string, rather than starting a new message. This approach can be very useful for improving the accuracy of the model’s instruction-following when you know how you want it to start its replies.
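As a rough sketch of inspecting the result, you can decode just the newly generated tokens to see how the model continued the prefilled '{"name": "' string (the slicing below assumes the tensor inputs from the snippet above):
outputs = model.generate(**formatted_chat, max_new_tokens=32)

# Everything after the prompt length is newly generated, i.e. the continuation of the prefill
new_tokens = outputs[0][formatted_chat["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))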
Because add_generation_prompt adds the tokens that start a new message, and continue_final_message removes any end-of-message tokens from the final message, it does not make sense to use them together. As a result, you’ll get an error if you try!
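For example, a call like this sketch should raise an error rather than return a formatted chat (the exact exception and message depend on your transformers version):
# Contradictory flags: start a new message and continue the final one at the same time
tokenizer.apply_chat_template(chat, add_generation_prompt=True, continue_final_message=True)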
The default behaviour of TextGenerationPipeline is to set add_generation_prompt=True so that it starts a new message. However, if the final message in the input chat has the “assistant” role, it will assume that this message is a prefill and switch to continue_final_message=True instead, because most models do not support multiple consecutive assistant messages. You can override this behaviour by explicitly passing the continue_final_message argument when calling the pipeline.
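For instance, if your chat ends with an assistant message but you want a brand-new reply instead of a prefill continuation, you can pass the argument explicitly. A sketch, reusing the pipe object from the pipeline example above:
prefill_chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

# Default: the trailing assistant message is treated as a prefill and continued
continued = pipe(prefill_chat, max_new_tokens=32)

# Override: force a fresh assistant message instead of continuing the prefill
fresh = pipe(prefill_chat, max_new_tokens=32, continue_final_message=False)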
Can I use chat templates in training?
Yes! This is a good way to ensure that the chat template matches the tokens the model sees during training. We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you can simply continue like any other language model training task. When training, you should usually set add_generation_prompt=False, because the added tokens that prompt an assistant response will not be helpful during training. Let’s see an example:
from transformers import AutoTokenizer
from datasets import Dataset
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]
dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])
And we get:
<|user|> Which is bigger, the moon or the sun?</s> <|assistant|> The sun.</s>
From here, just continue training like you would with a standard language modelling task, using the formatted_chat column.
By default, some tokenizers add special tokens like <bos> and <eos> to text they tokenize. Chat templates should already include all the special tokens they need, and so additional special tokens will often be incorrect or duplicated, which will hurt model performance.
Therefore, if you format text with apply_chat_template(tokenize=False), you should set the argument add_special_tokens=False when you tokenize that text later. If you use apply_chat_template(tokenize=True), you don’t need to worry about this!
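As a minimal sketch of that second step, assuming the dataset and tokenizer from the training example above:
# The formatted_chat strings already contain every special token the template needs,
# so we disable the tokenizer's own special-token insertion to avoid duplicates
tokenized = tokenizer(dataset["formatted_chat"][0], add_special_tokens=False, return_tensors="pt")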