# DaisyCore — daisy_milli

## Model Description

A DaisyCore transformer with 26 layers, 14 attention heads, and a model dimension of 1,792. It uses block-causal sliding-window attention (window size 2,048) with the standard attention implementation.

## Architecture

| Property | Value |
| --- | --- |
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | jonathanmiddleton/daisy |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |
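The parameter counts above are internally consistent; a quick sanity check (the factor of 15 embedding tables is an observation from the arithmetic, plausibly untied input/output embeddings plus value embeddings, not a claim from this card):

```python
total = 2_323_120_245
non_embed = 1_001_914_485
embed = 1_321_205_760
assert non_embed + embed == total            # counts add up exactly

num_heads, head_dim, model_dim = 14, 128, 1_792
assert num_heads * head_dim == model_dim     # heads span the full model width

vocab = 49_152
per_table = vocab * model_dim                # 88,080,384 params per vocab×dim table
assert embed == 15 * per_table               # embedding params form 15 such tables
```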

## Training Progress

| Metric | Value |
| --- | --- |
| Checkpoint Step | 34,625 |
| Tokens Processed | 108.53B (108,527,616,000) |
| Target Tokens | 150.00B (150,000,000,000) |
| Progress | 72.4% |
| Best Validation Loss | 2.07058 |
| Evaluations Performed | 772 |
| Saved | 2026-03-03 19:55 UTC |
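The progress figures check out arithmetically (`val_loss_every_tokens` is taken from the hyperparameter table further down):

```python
tokens_processed = 108_527_616_000
target_tokens = 150_000_000_000

progress = tokens_processed / target_tokens
assert round(100 * progress, 1) == 72.4      # matches the reported 72.4%

val_every = 196_608_000                      # val_loss_every_tokens
assert tokens_processed % val_every == 0     # checkpoint falls exactly on a
                                             # validation boundary (552 intervals)
```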

## Training Configuration

### Optimizers

| Optimizer | Parameter Group | Learning Rate |
| --- | --- | --- |
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |
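One way the four groups might be carved out of the parameter set; the routing rules below are illustrative guesses (only the group names and learning rates come from this card):

```python
LEARNING_RATES = {
    "head_params": 0.003216,           # AdamW
    "embed_params": 0.1865,            # AdamW
    "scalar_params": 0.02099,          # AdamW
    "hidden_matrix_params": 0.025,     # Muon
}

def param_group(name: str, ndim: int) -> str:
    """Route a parameter to an optimizer group by name and tensor rank.
    These rules are assumptions for illustration, not taken from the card."""
    if "lm_head" in name:
        return "head_params"
    if "embed" in name:
        return "embed_params"
    if ndim < 2:
        return "scalar_params"
    return "hidden_matrix_params"      # 2-D hidden weights go to Muon
```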

### Schedule & Regularization

| Parameter | Value |
| --- | --- |
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule — begin_after_fraction | 0.0 |
| LR Schedule — cooldown_fraction | 0.0 |
| LR Schedule — floor | 0.0 |
| LR Schedule — phases | `[{'progress': 0.0, 'scale': 1.0}, {'progress': 0.5766, 'scale': 0.23492}, {'progress': 1.0, 'scale': 0.15}]` |
| LR Schedule — warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 3 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |
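Assuming `n_phase_linear` means linear interpolation of the LR multiplier between the listed (progress, scale) breakpoints (an interpretation, since the schedule's semantics aren't spelled out here), it can be sketched as:

```python
PHASES = [(0.0, 1.0), (0.5766, 0.23492), (1.0, 0.15)]

def lr_scale(progress: float, phases=PHASES) -> float:
    """Piecewise-linear LR multiplier over training progress in [0, 1]."""
    if progress <= phases[0][0]:
        return phases[0][1]
    for (p0, s0), (p1, s1) in zip(phases, phases[1:]):
        if progress <= p1:
            t = (progress - p0) / (p1 - p0)
            return s0 + t * (s1 - s0)
    return phases[-1][1]
```

Under this reading, the checkpoint's 72.4% progress sits partway down the second linear segment, between scale 0.23492 and the final 0.15.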

## Training Data

| Type | Sequence Length | Path |
| --- | --- | --- |
| fineweb-edu-dedup | 16,384 | `data/fineweb-edu-dedup/fineweb-edu-dedup_jonathanmiddleton_daisy_train_*.bin` |
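The path points at flat `.bin` token shards. A minimal reader sketch, assuming each shard is simply a headerless stream of native-endian uint16 token ids (the actual shard format is not documented here):

```python
import glob
from array import array

def iter_token_shards(pattern: str):
    """Yield (path, tokens) for each shard matching a glob pattern.
    Assumes headerless native-endian uint16 token ids per file."""
    for path in sorted(glob.glob(pattern)):
        tokens = array("H")              # uint16 covers vocab_size 49,152
        with open(path, "rb") as f:
            tokens.frombytes(f.read())
        yield path, tokens
```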

## Checkpoint Provenance

- Resumed from: JonathanMiddleton/daisy-milli-base-v18

## All Hyperparameters

| Parameter | Value |
| --- | --- |
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | models.daisy.daisy_core.DaisyCore |
| target_tokens | 100000000000 |
| full_window_target_tokens | 3000000000 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | False |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18d |
| wandb_group | pretrain |
| resume_checkpoint | JonathanMiddleton/daisy-milli-base-v18 |
| resume_target_tokens_override | 150000000000 |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | `{"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 1.0}, {"progress": 0.5766, "scale": 0.23492}, {"progress": 1.0, "scale": 0.15}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}}` |
| grad_acc_steps | 3 |
| val_loss_every_tokens | 196608000 |
| checkpoint_warmup_tokens | 93000000000 |
| checkpoint_per_n_tokens | 393215999 |
| save_checkpoint | True |
| benchmarks_frequency | 4 |
| mmlu_cache_bin_path | data/mmlu_cache/mmlu_cache.bin |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |
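A few of the token-count hyperparameters relate cleanly to one another (arithmetic observations only; the card itself does not state these relationships):

```python
seq_len = 16_384
val_every = 196_608_000                  # val_loss_every_tokens
assert val_every == 12_000 * seq_len     # validation every 12,000 full sequences

ckpt_every = 393_215_999                 # checkpoint_per_n_tokens
assert ckpt_every == 2 * val_every - 1   # just under two validation intervals
```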