---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
license: mit
datasets:
- kjj0/fineweb100B-gpt2
language:
- en
---
|
|
|
A 3.2B-parameter base model trained for ~64B tokens on the FineWeb dataset.
|
|
|
Uses the GPT-2 tokenizer from tiktoken.
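
As a reference, the same GPT-2 BPE can be exercised directly through tiktoken, separate from the `transformers`-based usage below (a minimal sketch):

```
# Minimal sketch: encode/decode text with the GPT-2 BPE via tiktoken.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("The future of AI is")
print(ids)              # token ids drawn from the 50257-entry vocabulary
print(enc.decode(ids))  # round-trips back to the original string
```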
|
|
|
[wandb training metrics](https://api.wandb.ai/links/teammapo-mapo-labs/zooq3iig)

- Note: batch size increased from 8 to 512 at step 2,160,000
- Final checkpoint: step 2,187,000, val_loss: 2.7489
- Trained on an 8xH100 80GB node using data parallel (see the sketch after this list)
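
For context, a data-parallel run of this kind can be launched roughly as follows. This is an illustrative DistributedDataParallel skeleton (started with `torchrun --nproc_per_node=8`), not the actual training script; the model and batch here are stand-ins.

```
# Illustrative DDP skeleton for an 8-GPU node; not the actual training script.
# Launch: torchrun --nproc_per_node=8 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(8192, 8192).cuda(local_rank)  # stand-in for the 3.2B transformer
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for step in range(10):
        x = torch.randn(8, 8192, device=f"cuda:{local_rank}")  # stand-in batch
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced across the 8 ranks here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```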
|
|
|
Model config: |
|
```
"d_head": 128,
"d_model": 8192,
"n_heads": 64,
"n_layers": 3,
"n_vocab": 50257
```
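
As a rough sanity check on the 3.2B figure, here is a back-of-the-envelope parameter count from this config. It assumes a standard GPT-style block with a 4x MLP expansion and untied input/output embeddings, and ignores biases and norms; these are assumptions, not details taken from the repo.

```
# Rough parameter estimate from the config above.
# Assumptions: 4x MLP expansion, untied embedding/unembedding, biases/norms ignored.
d_model, n_layers, n_vocab = 8192, 3, 50257

attn = 4 * d_model * d_model        # q, k, v, and output projections
mlp = 8 * d_model * d_model         # up- and down-projections with 4x expansion
per_layer = attn + mlp              # ~0.8B parameters per layer
embeddings = 2 * n_vocab * d_model  # token embedding + LM head (untied)

total = n_layers * per_layer + embeddings
print(f"~{total / 1e9:.2f}B parameters")  # ~3.24B
```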
|
|
|
Usage: |
|
```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("michaelbzhu/test-3.2B-base", trust_remote_code=True)
model = model.cuda()
tokenizer = AutoTokenizer.from_pretrained("michaelbzhu/test-3.2B-base", trust_remote_code=True)

prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

# Sample 20 tokens, one at a time, from the softmax distribution over the vocabulary
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits[0, -1, :]
        next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1).unsqueeze(0)
        input_ids = torch.cat([input_ids, next_token], dim=1)

print(tokenizer.decode(input_ids[0]))
```
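
The loop above samples each next token from the unmodified softmax distribution (temperature 1); taking the argmax of the logits instead gives deterministic, greedy decoding.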
|
|
|
Eval: |
|
```
$ lm_eval --model hf \
    --model_args pretrained=michaelbzhu/test-3.2B-base,trust_remote_code=True \
    --tasks mmlu_college_medicine,hellaswag,lambada_openai,arc_easy,winogrande,arc_challenge,openbookqa \
    --device cuda:0 \
    --batch_size 16
```

| Tasks            | Version | Filter | n-shot | Metric     |   |   Value |   | Stderr |
|------------------|--------:|--------|-------:|------------|---|--------:|---|-------:|
| arc_challenge    |       1 | none   |      0 | acc        | ↑ |  0.2363 | ± | 0.0124 |
|                  |         | none   |      0 | acc_norm   | ↑ |  0.2637 | ± | 0.0129 |
| arc_easy         |       1 | none   |      0 | acc        | ↑ |  0.5758 | ± | 0.0101 |
|                  |         | none   |      0 | acc_norm   | ↑ |  0.4996 | ± | 0.0103 |
| hellaswag        |       1 | none   |      0 | acc        | ↑ |  0.3827 | ± | 0.0049 |
|                  |         | none   |      0 | acc_norm   | ↑ |  0.4846 | ± | 0.0050 |
| lambada_openai   |       1 | none   |      0 | acc        | ↑ |  0.4238 | ± | 0.0069 |
|                  |         | none   |      0 | perplexity | ↓ | 14.7850 | ± | 0.4335 |
| college_medicine |       1 | none   |      0 | acc        | ↑ |  0.2370 | ± | 0.0324 |
| openbookqa       |       1 | none   |      0 | acc        | ↑ |  0.2180 | ± | 0.0185 |
|                  |         | none   |      0 | acc_norm   | ↑ |  0.3180 | ± | 0.0208 |
| winogrande       |       1 | none   |      0 | acc        | ↑ |  0.5367 | ± | 0.0140 |
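
The same evaluation can also be driven from Python; a minimal sketch, assuming the `simple_evaluate` entry point of lm-evaluation-harness 0.4.x:

```
# Sketch: run a subset of the tasks above programmatically with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=michaelbzhu/test-3.2B-base,trust_remote_code=True",
    tasks=["arc_easy", "hellaswag"],
    device="cuda:0",
    batch_size=16,
)
print(results["results"])
```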