---
license: mit
base_model:
- meta-llama/Llama-3.2-3B-Instruct
---

# Llama3-2-3B-IT-Byte 🔢 

__[Llama3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) transferred to byte-level tokenization via [cross-tokenizer distillation](https://arxiv.org/abs/2503.20083).__

__🚧This model is intended as a proof of concept that pretrained (subword-based) models can be quickly & effectively transferred to the byte level. It is not optimized for production use (in particular, it is not optimized for speed)!🚧__

## Benchmarks

Llama3-2-3B-IT-Byte performs competitively, although it was trained on only 1.3B bytes (≈328M subword tokens in total).

|                                   | MMLU | BoolQ | PiQA  | IFEval | ARC-C | Avg. |
|-----------------------------------|------|-------|-------|--------|-------|------|
| [EvaByte-6.5B-SFT](https://huggingface.co/EvaByte/EvaByte-SFT)                  | 49.5 | 79.5* | 74.1* | 60.2   | 64.6* | 65.6 |
| [Llama3.2-3B-Instruct (original)](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)   | 62.4 | 78.8  | 76.9  | 76.6   | 43.9  | 67.7 |
| [Gemma2-2B-IT (original)](https://huggingface.co/google/gemma-2-2b-it)            | 56.9 | 83.8  | 79.6  | 62.5   | 50.4  | 66.6 |
| __Llama3-2-3B-IT-Byte (this model)__              | __57.0__ | __76.6__  | __73.6__  | __58.8__   | __39.8__  | __61.2__ |
| [Gemma2-2B-IT-Byte](https://huggingface.co/benjamin/Gemma2-2B-IT-Byte) | 51.0 | 80.5  | 71.5  | 51.9   | 38.2  | 58.6 |

<small>*Numbers from EvaByte-6.5B (Base) since they are not reported for the SFT model.</small>
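
The Avg. column is the unweighted mean of the five benchmark scores; for example, for this model:

```python
# Unweighted mean of the five benchmark scores for Llama3-2-3B-IT-Byte
scores = [57.0, 76.6, 73.6, 58.8, 39.8]  # MMLU, BoolQ, PiQA, IFEval, ARC-C
print(round(sum(scores) / len(scores), 1))  # 61.2, as reported in the table
```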

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("benjamin/Llama3-2-3B-IT-Byte")
print("Vocab Size:", len(tokenizer))  # 256 bytes + some special tokens

device = "cuda"
# trust_remote_code is required because the model ships custom modeling code
model = AutoModelForCausalLM.from_pretrained(
    "benjamin/Llama3-2-3B-IT-Byte", trust_remote_code=True
)
model = model.to(device)

# Build a chat prompt; it is encoded as byte-level tokens
tokens = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello, how are you doing?"}], return_tensors="pt"
)
# Stop generation at the end-of-turn token
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
out = model.generate(tokens.to(model.device), eos_token_id=eot_id)
print(tokenizer.decode(out[0]))
```
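
Because the tokenizer operates on raw UTF-8 bytes, the number of tokens for a plain string should match its byte length (up to special tokens). A minimal sanity check, continuing from the snippet above:

```python
# Byte-level tokenization: expect one token per UTF-8 byte (excluding special tokens)
text = "Hello, how are you doing? 🔢"
ids = tokenizer(text, add_special_tokens=False).input_ids
print(len(ids), len(text.encode("utf-8")))  # expected to be equal
```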

## Training

This model has been trained using [`tokenkit`](https://github.com/bminixhofer/tokenkit) with the following command:

```bash
python3 scripts/cross_tokenizer_distill.py \
    --config=configs/cross_tokenizer_distill.yaml \
    --overrides \
    losses=[sft,alm_unconstrained,alm_latents] \
    multitask_aggregation_fn=approx_gradmag_preserve_mag \
    alm_mode=merge_by_space_prob+append_space \
    tokenizer_pair_bias_threshold=0.1 \
    max_student_length=2048 \
    steps=20000 \
    eval_interval=20000 \
    save_interval=20000 \
    optimizer.learning_rate=3.e-5 \
    optimizer.weight_decay=0.0 \
    optimizer.max_grad_norm=null \
    optimizer.grad_acc_steps=1 \
    train_model_mode=full \
    expand_input_ids=true \
    output_embeddings_mode=untie \
    eval.tasks=[arc_easy,arc_challenge,piqa,boolq,arithmetic,mmlu,ifeval,agieval_en,agieval_cn] \
    data.batch_size=32 \
    student.pretrained_model_name_or_path=benjamin/Llama-3.2-3B-Instruct-flax \
    student.tokenizer_name=meta-llama/Llama-3.2-3B-Instruct:source=Llama3 \
    target_tokenizer_name=meta-llama/Llama-3.2-3B-Instruct:source=Llama3:target=Llama3:conversion=byte \
    n_model_parallel=4 \
    n_data_parallel=4 \
    data.num_workers=16 \
    num_workers=16 \
    name=llama3_to_byte_20k
```

Training took ~26 hours on a TPU v4-32.

## Future Work

The current version of this model was trained for 20k steps with 32×2048 bytes per batch (= 1.3B bytes ≈ 328M subword tokens in total). We did not expect it to perform as well as it does after such a short training run. We plan to train a new version for more steps (you can also do so yourself using [`tokenkit`](https://github.com/bminixhofer/tokenkit)).
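
As a quick sanity check on these numbers (assuming every batch is fully packed to the maximum sequence length, and using the ≈4 bytes per subword token ratio implied by the figures above):

```python
# Training budget implied by the command above (20k steps, batch size 32, max length 2048)
steps, batch_size, max_length = 20_000, 32, 2_048
total_bytes = steps * batch_size * max_length
print(total_bytes)          # 1310720000 ≈ 1.3B bytes
print(total_bytes / 328e6)  # ≈ 4.0 bytes per subword token
```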

To preserve efficiency, we would have to add (a combination of) [BLT-style hierarchical processing](https://arxiv.org/abs/2412.09871), [attention approximations](https://hkunlp.github.io/blog/2025/evabyte/), and [self-speculative decoding](https://arxiv.org/abs/2309.08168).

## Acknowledgments

Training was enabled by Cloud TPUs from Google’s TPU Research Cloud (TRC).