Problems with Metal

#42
by Thalesian - opened

I'm working on getting a version of this working with Metal (macOS), but I'm not sure it's supported. I followed the instructions on the GitHub repo:

```
conda create -n gpt-oss python=3.12
conda activate gpt-oss
pip install --upgrade pip setuptools wheel
git clone https://github.com/openai/gpt-oss.git
GPTOSS_BUILD_METAL=1 pip install -e ".[metal]"
```
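
For what it's worth, a quick sanity check that the Metal build actually produced an importable extension (the `gptoss` module and `Model` class names are the ones that show up in the traceback below, not something I found in the docs):

```
# sanity check that the Metal build produced an importable extension;
# the gptoss module and Model class names come from the traceback below
from gptoss import Model

print("gptoss Metal extension imported OK")
```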
I pre-downloaded model.bin into the metal folder, then tried to run generation following these recommendations:
```
python gpt_oss/metal/examples/generate.py gpt-oss-20b/metal/model.bin -p "why did the chicken cross the road?"
```
```
(gpt-oss) @Mac gpt-oss % python gpt_oss/metal/examples/generate.py gpt_oss/metal/model.bin -p "why did the chicken cross the road?"
Error: failed to create Metal compute pipeline state for function gptoss_f32_bf16w_unembedding: Unsupported float atomic operation for given target.
Traceback (most recent call last):
  File "/Users//GitHub/gpt-oss/gpt_oss/metal/examples/generate.py", line 34, in <module>
    main(sys.argv[1:])
  File "/Users//GitHub/gpt-oss/gpt_oss/metal/examples/generate.py", line 18, in main
    model = Model(options.model)
            ^^^^^^^^^^^^^^^^^^^^
SystemError: <class 'gptoss.Model'> returned NULL without setting an exception
```
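
The failure happens during model creation, before any generation runs. Here's a minimal repro that isolates just that step, using only the `gptoss.Model` constructor that generate.py itself calls (per the traceback above):

```
import sys
from gptoss import Model  # same constructor call generate.py makes

try:
    model = Model(sys.argv[1])  # path to model.bin
    print("model created OK")
except SystemError as err:
    # the underlying Metal pipeline error is printed to stderr by the extension
    print(f"model creation failed: {err}")
```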
As far as I can tell I've done everything correctly, but no go so far. Has anyone successfully deployed this using Metal? I've tried the recommended pre-compiled Metal model.bin and I've recompiled it from safetensors, but no dice.

Note: Ollama does work with full Metal support. I'd still like to run this from the command line, though, since the model is very fast on a Mac with enough RAM.

I opted to use venv and pyenv instead of Conda, and the reference command successfully loaded the model once I specified the context length. I set it to 2048, since I only have 16 GB of RAM on an M4; without that, the model wouldn't load. Even after loading, though, I didn't get any response. The example code also lacks any performance optimizations, so it ran very slowly; I eventually had to kill the process before it finished. For comparison, I was able to get 22 tokens per second using the same model in LM Studio.
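
In case it's useful for comparing against LM Studio's 22 tokens/s, here's a rough timing wrapper; the generation call itself is just a placeholder for whatever loop you're running:

```
import time

n_tokens = 0
t0 = time.perf_counter()
# ... run your generation loop here, incrementing n_tokens per emitted token ...
elapsed = time.perf_counter() - t0
print(f"{n_tokens / elapsed:.1f} tokens/s over {elapsed:.2f}s")
```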

Please check that your path to model.bin is correct. If you just copy and paste from the README, the path is wrong.
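
You can rule that out with plain Python before suspecting the Metal build. Note the README uses `gpt-oss-20b/metal/model.bin` while the failing run above used `gpt_oss/metal/model.bin`, so set the path below to wherever the file actually is:

```
import os

model_path = "gpt-oss-20b/metal/model.bin"  # adjust to your actual location
if os.path.isfile(model_path):
    print(f"found model.bin ({os.path.getsize(model_path) / 1e9:.1f} GB)")
else:
    raise SystemExit(f"no file at {model_path!r} - fix the path before loading")
```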
