# DaisyCore — daisy_milli

## Model Description

DaisyCore is a transformer with 26 layers, 14 attention heads, and a model dimension of 1,792. It uses block-causal sliding-window attention (window size 2,048) with the standard attention implementation.
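A minimal sketch of the causal sliding-window constraint, using the window size of 2,048 from this card. The "block-causal" aspect (how attention is restricted at block/document boundaries) is model-specific and omitted here; this shows only the causal + sliding-window rule.

```python
# Causal sliding-window attention: a query at position q may attend to a key
# at position k only if k is not in the future and within the window.
# The block-boundary masking used by DaisyCore is an implementation detail
# not covered by this sketch.
def allowed(q: int, k: int, window_size: int = 2048) -> bool:
    """True if query position q may attend to key position k."""
    return k <= q and q - k < window_size

# A token attends to itself and up to window_size - 1 previous tokens.
print(allowed(3000, 3000))  # -> True  (self)
print(allowed(3000, 1000))  # -> True  (3000 - 1000 = 2000 < 2048)
print(allowed(3000, 500))   # -> False (3000 - 500 = 2500 >= 2048)
print(allowed(500, 600))    # -> False (non-causal)
```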
## Architecture

| Property | Value |
|---|---|
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | jonathanmiddleton/daisy |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |
## Training Progress

| Metric | Value |
|---|---|
| Checkpoint Step | 34,625 |
| Tokens Processed | 108.53B (108,527,616,000) |
| Target Tokens | 150.00B (150,000,000,000) |
| Progress | 72.4% |
| Best Validation Loss | 2.07058 |
| Evaluations Performed | 772 |
| Saved | 2026-03-03 19:55 UTC |
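The Progress row follows directly from the token counts above:

```python
# Recompute training progress from Tokens Processed and Target Tokens.
tokens_processed = 108_527_616_000
target_tokens = 150_000_000_000

progress = tokens_processed / target_tokens
print(f"{progress:.1%}")  # -> 72.4%
```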
## Training Configuration

### Optimizers

| Optimizer | Parameter Group | Learning Rate |
|---|---|---|
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |
### Schedule & Regularization

| Parameter | Value |
|---|---|
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule — begin_after_fraction | 0.0 |
| LR Schedule — cooldown_fraction | 0.0 |
| LR Schedule — floor | 0.0 |
| LR Schedule — phases | `[{'progress': 0.0, 'scale': 1.0}, {'progress': 0.5766, 'scale': 0.23492}, {'progress': 1.0, 'scale': 0.15}]` |
| LR Schedule — warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 3 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |
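A hedged sketch of what the `n_phase_linear` schedule likely computes: linear interpolation of the LR scale between the listed `(progress, scale)` breakpoints. This is an interpretation based on the schedule's name and phase list; the actual training code defines the exact behavior (warmup, cooldown, and floor are all 0.0 for this run, so they are omitted).

```python
# Piecewise-linear LR scale over training progress in [0, 1], interpolating
# between the phase breakpoints from the Schedule table. This is a sketch of
# the presumed "n_phase_linear" semantics, not the project's actual code.
PHASES = [
    {"progress": 0.0, "scale": 1.0},
    {"progress": 0.5766, "scale": 0.23492},
    {"progress": 1.0, "scale": 0.15},
]

def lr_scale(progress: float, phases=PHASES) -> float:
    """LR scale at a given training progress, linearly interpolated."""
    progress = min(max(progress, 0.0), 1.0)
    for lo, hi in zip(phases, phases[1:]):
        if progress <= hi["progress"]:
            t = (progress - lo["progress"]) / (hi["progress"] - lo["progress"])
            return lo["scale"] + t * (hi["scale"] - lo["scale"])
    return phases[-1]["scale"]

print(lr_scale(0.0))     # start of run: scale 1.0
print(lr_scale(0.5766))  # second breakpoint: scale ~0.23492
print(lr_scale(1.0))     # end of run: scale ~0.15
```

Multiplying each optimizer's base learning rate by this scale gives the effective LR at any point in the run.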
## Training Data

| Type | Sequence Length | Path |
|---|---|---|
| fineweb-edu-dedup | 16,384 | `data/fineweb-edu-dedup/fineweb-edu-dedup_jonathanmiddleton_daisy_train_*.bin` |
## Checkpoint Provenance

- Resumed from: JonathanMiddleton/daisy-milli-base-v18
## All Hyperparameters

| Parameter | Value |
|---|---|
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | models.daisy.daisy_core.DaisyCore |
| target_tokens | 100000000000 |
| full_window_target_tokens | 3000000000 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | False |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18d |
| wandb_group | pretrain |
| resume_checkpoint | JonathanMiddleton/daisy-milli-base-v18 |
| resume_target_tokens_override | 150000000000 |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | `{"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 1.0}, {"progress": 0.5766, "scale": 0.23492}, {"progress": 1.0, "scale": 0.15}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}}` |
| grad_acc_steps | 3 |
| val_loss_every_tokens | 196608000 |
| checkpoint_warmup_tokens | 93000000000 |
| checkpoint_per_n_tokens | 393215999 |
| save_checkpoint | True |
| benchmarks_frequency | 4 |
| mmlu_cache_bin_path | data/mmlu_cache/mmlu_cache.bin |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |
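The checkpointing parameters imply a rough cadence. Assuming checkpoints are written only after `checkpoint_warmup_tokens` and then about every `checkpoint_per_n_tokens` (this reading of the config is an assumption; the training code defines the exact behavior), the run admits at most:

```python
# Upper bound on periodic checkpoints for this run, under the assumed
# semantics: no checkpoints before the warmup token count, then one every
# checkpoint_per_n_tokens until the (overridden) target of 150B tokens.
checkpoint_warmup_tokens = 93_000_000_000
checkpoint_per_n_tokens = 393_215_999
target_tokens = 150_000_000_000  # resume_target_tokens_override

remaining = target_tokens - checkpoint_warmup_tokens
max_checkpoints = remaining // checkpoint_per_n_tokens
print(max_checkpoints)  # -> 144
```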