Tokenizer Mismatch
#28 by clpoehl
```
2025-08-05 11:45:54,728 - INFO - Vocab size: 150000
2025-08-05 11:45:54,729 - INFO - Cutting vocab to first 130072 tokens.
2025-08-05 11:45:55,168 - INFO - Tokenizer IDs -> EOT: None, EOS: None, UNK: 0, PAD: 0
2025-08-05 11:45:55,342 - INFO - Model-declared vocab_size: 131072
```
Does anyone know why I get this tokenizer mismatch when using the mistral_common tokenizer?
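For context, here is a minimal sketch of how I'm comparing the two sides. It assumes a Tekken-based v3 tokenizer and uses a placeholder model id (neither is confirmed by the logs above, so adjust both to whatever is actually being loaded). If I read mistral_common right, the tokenizer's `n_words` already includes the reserved special-token slots, so it should be comparable to the model-declared `vocab_size` directly:

```python
# Minimal sketch for comparing tokenizer-side and model-side vocab sizes.
# Assumptions: a Tekken-based v3 tokenizer and a placeholder model id;
# neither is confirmed by the logs in this post.
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoConfig

# Tokenizer side: load the bundled v3 Tekken tokenizer
# (or MistralTokenizer.from_file("tekken.json") for a local file).
mistral_tok = MistralTokenizer.v3(is_tekken=True)
inner = mistral_tok.instruct_tokenizer.tokenizer

# n_words is the effective vocab size, special-token slots included.
print("tokenizer n_words:", inner.n_words)
print("bos_id / eos_id:", inner.bos_id, inner.eos_id)

# Model side: vocab_size as declared in the checkpoint's config.json.
# Placeholder model id; swap in the checkpoint actually being loaded.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
print("model vocab_size:", cfg.vocab_size)
```

If both numbers print as 131072, the sizes may actually agree, and the 130072 in the log would just be the regular vocab after the 1,000 reserved special-token slots are subtracted, though I'm not certain that's what the loader is complaining about.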