pikoGPT-51M (base)

The second model in the "piko" family, my take on training smaller GPT-2-like models. Not fine-tuned, just a base model.

Training

Trained on a single RTX 3090 for ~30k steps with Karpathy's train_gpt2.py script from the llm.c repo. The dataset is edu_fineweb10B from the same repo. The model reached a validation loss of ~3.57.

Optimizations

Compared to the pathfinder 16M variant, this model:

  • is trained in bfloat16 (vs. float32 for the 16M model)
  • has the vocabulary size bumped to 50304 (from 50257 in GPT-2 and pikoGPT-16M), following Karpathy's rule of "nice" numbers; see the sketch below
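
The padded size is just the GPT-2 vocabulary rounded up to the next multiple of 128, which keeps the embedding and LM-head matrices at GPU-friendly dimensions. A minimal sketch of the rounding (my own illustration, not the exact llm.c code):

def pad_vocab(vocab_size: int, multiple: int = 128) -> int:
    # round up to the nearest multiple, e.g. 50257 -> 50304
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab(50257))  # 50304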

Model file

This repo contains a .pt checkpoint file with the following structure:

{
    'step': step,
    'config': asdict(model.config),
    'model_state_dict': model.state_dict(),
}
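
For reference, a checkpoint with this structure would typically be written like so (a sketch, not the exact code from the training script; the file name is hypothetical):

from dataclasses import asdict
import torch

torch.save({
    'step': step,                             # training step at save time
    'config': asdict(model.config),           # GPTConfig as a plain dict
    'model_state_dict': model.state_dict(),   # weights, possibly with torch.compile's "_orig_mod." prefix
}, 'checkpoint.pt')                           # hypothetical file name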

To load the model, you can use the following piece of code (not very pretty, I know):

import torch
# GPT and GPTConfig are the classes from llm.c's train_gpt2.py; adjust the import to where you keep them
from train_gpt2 import GPT, GPTConfig

checkpoint = torch.load(path, weights_only=True)

config = GPTConfig(**checkpoint['config'])
model = GPT(config)

state_dict = checkpoint['model_state_dict']
any_key = next(iter(state_dict))
if any_key.startswith("_orig_mod."):
    # torch.compile wraps the module and prefixes every key with "_orig_mod.";
    # strip the prefix so the keys match an uncompiled GPT instance
    state_dict = {k[len("_orig_mod."):]: v for k, v in state_dict.items()}

model.load_state_dict(state_dict)
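
Once the weights are loaded, you can sample from the model the usual way. A rough sketch, assuming a nanoGPT-style forward that returns (logits, loss) and the GPT-2 tokenizer from tiktoken:

import tiktoken
import torch

enc = tiktoken.get_encoding("gpt2")
model.eval()

tokens = torch.tensor([enc.encode("Once upon a time")], dtype=torch.long)
with torch.no_grad():
    for _ in range(50):
        logits, _ = model(tokens)
        # only the first 50257 logits are real GPT-2 tokens; the rest is vocab padding
        probs = torch.softmax(logits[:, -1, :50257], dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)

print(enc.decode(tokens[0].tolist()))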
