Fix the missing `{% generation %}` keyword when using `tokenizer.apply_chat_template(..., return_assistant_tokens_mask=True)`
#112 opened by lllIIIlIlIk
from transformers import AutoTokenizer

# Local checkpoint path from the original report; adjust for your environment.
tokenizer = AutoTokenizer.from_pretrained("/opt/tiger/gpt-oss-20b")
messages = [
    {
        "role": "user",
        "content": "hi"
    },
    {
        "role": "assistant",
        "thinking": "think a moment",
        "content": "Hello"
    }
]

# Render the conversation as text and drop everything up to the first <|end|>
# (the preamble that precedes the user turn).
print(tokenizer.apply_chat_template(messages, tokenize=False).split('<|end|>', 1)[1])

processed = tokenizer.apply_chat_template(
    messages,
    reasoning_effort="high",
    return_assistant_tokens_mask=True,
    return_dict=True,
)

# 200007 is the token id of <|end|>; skip the same preamble in the tokenized output.
first_end = processed["input_ids"].index(200007) + 1
print(processed['input_ids'][first_end:])
print(processed['attention_mask'][first_end:])
print(processed['assistant_masks'][first_end:])
Original output (the first line is the warning logged by transformers; the assistant mask is all zeros):
return_assistant_tokens_mask==True but chat template does not contain `{% generation %}` keyword.
<|start|>user<|message|>hi<|end|><|start|>assistant<|channel|>analysis<|message|>think a moment<|end|><|start|>assistant<|channel|>final<|message|>Hello<|return|>
[200006, 1428, 200008, 3686, 200007, 200006, 173781, 200005, 35644, 200008, 49631, 261, 4205, 200007, 200006, 173781, 200005, 17196, 200008, 13225, 200002]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
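The warning above is triggered because the template text contains no `{% generation %}` marker, so no token can be attributed to the assistant. As a rough illustration only, the sketch below checks a template string for such a marker. The regular expression and the `has_generation_block` helper are assumptions made for this example, not the check that transformers itself performs, and the two toy templates are hypothetical rather than the gpt-oss template.

```python
import re

# Assumption: a loose pattern that accepts `{% generation %}` as well as the
# whitespace-trimming form `{%- generation -%}`.
GENERATION_TAG = re.compile(r"\{%-?\s*generation\s*-?%\}")


def has_generation_block(chat_template: str) -> bool:
    """Return True if the template text contains a generation marker."""
    return GENERATION_TAG.search(chat_template) is not None


# Hypothetical toy templates, used only to exercise the helper.
plain_template = "{% for m in messages %}{{ m['content'] }}{% endfor %}"
marked_template = (
    "{% for m in messages %}"
    "{% generation %}{{ m['content'] }}{% endgeneration %}"
    "{% endfor %}"
)

print(has_generation_block(plain_template))
print(has_generation_block(marked_template))
```

With a real tokenizer, the same helper could be applied to `tokenizer.chat_template` before requesting the assistant mask.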
Output after the fix (the assistant mask now covers both assistant turns):
<|start|>user<|message|>hi<|end|><|start|>assistant<|channel|>analysis<|message|>think a moment<|end|><|start|>assistant<|channel|>final<|message|>Hello<|return|>
[200006, 1428, 200008, 3686, 200007, 200006, 173781, 200005, 35644, 200008, 49631, 261, 4205, 200007, 200006, 173781, 200005, 17196, 200008, 13225, 200002]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
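A common reason to request the assistant mask is to restrict a fine-tuning loss to assistant tokens. The sketch below shows one way this might be done with the values printed after the fix. The `build_labels` helper is hypothetical and not part of this PR. The value `-100` is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(input_ids, assistant_masks, ignore_index=IGNORE_INDEX):
    """Keep the token id where the mask is 1; use ignore_index elsewhere."""
    if len(input_ids) != len(assistant_masks):
        raise ValueError("input_ids and assistant_masks must have the same length")
    return [
        token if mask == 1 else ignore_index
        for token, mask in zip(input_ids, assistant_masks)
    ]


# Values copied from the output after the fix (preamble already sliced off).
input_ids = [200006, 1428, 200008, 3686, 200007, 200006, 173781, 200005, 35644,
             200008, 49631, 261, 4205, 200007, 200006, 173781, 200005, 17196,
             200008, 13225, 200002]
assistant_masks = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

labels = build_labels(input_ids, assistant_masks)
print(labels)
```

In this output the mask starts at the `<|start|>` token (200006) of the first assistant message, so the assistant header tokens are included in the labels along with the message content.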