Fine-tuning 120b on 8 H100s getting CUDA OOM error

#117
by jinxu88 - opened

I am using the script from this repo (https://github.com/huggingface/gpt-oss-recipes/blob/main/README.md) for fine-tuning gpt-oss-120b. The OpenAI blog (https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) mentions it is doable on a single H100, but I keep getting OOM. Has anyone successfully fine-tuned it on H100?

The example you linked seems to SFT the 20b model, not the 120b...

I am interested in hearing whether anyone has managed a successful fine-tune run of the 120b on H100, as the model card quoted below suggests is possible. The GitHub link was only provided as a reference; it did not work for the 120b model on H100.

Model card mentioned:

Fine-tuning
Both gpt-oss models can be fine-tuned for a variety of specialized use cases.
This larger model gpt-oss-120b can be fine-tuned on a single H100 node, whereas the smaller gpt-oss-20b can even be fine-tuned on consumer hardware.

Pretty sure it meant QLoRA with DeepSpeed ZeRO-3 and other memory optimizations, rather than full fine-tuning; a rough sketch of that setup is below.
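For what it's worth, here is a minimal sketch of what such a QLoRA + ZeRO-3 run could look like. This is not the recipes-repo script: the dataset, LoRA target modules, and hyperparameters are illustrative assumptions, and it presumes recent transformers/peft/trl/accelerate versions that support 4-bit bitsandbytes quantization together with DeepSpeed ZeRO-3.

```python
# sft.py -- hypothetical QLoRA-style SFT sketch for gpt-oss-120b.
# Intended to be launched under DeepSpeed ZeRO-3, e.g.:
#   accelerate launch --config_file zero3.yaml sft.py
# where zero3.yaml is an accelerate config with ZeRO stage 3 enabled.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

model_id = "openai/gpt-oss-120b"

# 4-bit NF4 quantization keeps the frozen base weights small enough to shard
# across the node. Assumption: your stack supports bnb 4-bit with ZeRO-3.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# LoRA adapter on the attention projections only; the target module names are
# an assumption and may need adjusting for the gpt-oss architecture.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Any chat-formatted SFT dataset works here; this one is just an example.
dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")

training_args = SFTConfig(
    output_dir="gpt-oss-120b-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,  # trades extra compute for a big activation-memory saving
    bf16=True,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```

The key memory levers are the 4-bit base weights, LoRA instead of full-parameter updates, gradient checkpointing, a per-device batch size of 1 with gradient accumulation, and ZeRO-3 sharding the remaining states across the 8 GPUs. If you still OOM, shortening the maximum sequence length is usually the next thing to try.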
