Is CPU offloading possible?
I only have 8 GB of VRAM; is it possible to try the model on cuda+cpu?
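In case it helps, CPU offloading in Transformers usually goes through Accelerate's device_map="auto" with per-device memory caps. Below is a minimal sketch; the model id and memory caps are assumptions, and I have not verified that offloading works with this checkpoint's MXFP4 weights.

```python
# Rough sketch, not verified on this checkpoint: split the model between GPU and
# CPU with Accelerate's device_map="auto" and explicit per-device memory caps.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",                      # keep the checkpoint's native dtype
    device_map="auto",                       # let Accelerate place layers on GPU and CPU
    max_memory={0: "7GiB", "cpu": "48GiB"},  # cap GPU 0 below 8 GB; the rest spills to RAM
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```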
Ollama returns Error 500
Same question here; my GPU is not being utilized at all in Ollama.
I am facing the same issue and I think I have an explanation. Based on the log below you can see that:
- the model requires 12.2 GiB of memory to run fully on the GPU
- my GPU only has 6.8 GiB of available VRAM (out of 8 in total)
- the model's repeating layers alone need 10.7 GiB (memory.weights.repeating="10.7 GiB") <--- this is the core problem
Ollama concluded that it could not offload even a single layer to the GPU (likely because even the partial compute graph needs 8.0 GiB, more than the 6.8 GiB available), so it fell back to running the model entirely on the CPU (0/25 layers offloaded).
Raw log below:
time=2025-08-06T09:24:13.355+02:00 level=INFO source=server.go:135 msg="system memory" total="63.9 GiB" free="54.2 GiB" free_swap="57.4 GiB"
time=2025-08-06T09:24:13.355+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[6.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="12.2 GiB" memory.required.partial="0 B" memory.required.kv="492.0 MiB" memory.required.allocations="[0 B]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="4.0 GiB" memory.graph.partial="8.0 GiB"
time=2025-08-06T09:24:13.414+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\Users\*****\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --model D:\AI\models\blobs\sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 16384 --batch-size 512 --threads 6 --no-mmap --parallel 1 --port 50090"
time=2025-08-06T09:24:13.419+02:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-06T09:24:13.419+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-06T09:24:13.423+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
time=2025-08-06T09:24:13.453+02:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-06T09:24:13.462+02:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:50090"
time=2025-08-06T09:24:13.508+02:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
time=2025-08-06T09:24:13.675+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from C:\Users\*****\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\*****\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-08-06T09:24:17.944+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-08-06T09:24:18.030+02:00 level=INFO source=ggml.go:367 msg="offloading 0 repeating layers to GPU"
time=2025-08-06T09:24:18.030+02:00 level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
time=2025-08-06T09:24:18.030+02:00 level=INFO source=ggml.go:378 msg="offloaded 0/25 layers to GPU"
time=2025-08-06T09:24:18.030+02:00 level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="12.8 GiB"
time=2025-08-06T09:24:18.107+02:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-06T09:24:18.107+02:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="4.0 GiB"
time=2025-08-06T09:24:24.193+02:00 level=INFO source=server.go:637 msg="llama runner started in 10.77 seconds"
[GIN] 2025/08/06 - 09:36:55 | 200 | 12m41s | 192.168.1.250 | POST "/api/generate"
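If you still want to try forcing a partial offload, Ollama exposes a num_gpu option to request a fixed number of layers on the GPU. Here is a hedged sketch against the local /api/generate endpoint; the model tag and layer count are assumptions, and with only ~6.8 GiB of free VRAM the request may still be refused or run out of memory.

```python
# Hedged sketch: ask Ollama to keep a fixed number of layers on the GPU via the
# num_gpu option. With ~6.8 GiB of free VRAM this may still fall back to CPU or OOM.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss:20b",     # assumed model tag
        "prompt": "Hello",
        "stream": False,
        "options": {"num_gpu": 8},  # request 8 of the 25 layers on the GPU
    },
    timeout=600,
)
print(resp.json()["response"])
```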
My GPU has 16 GB of VRAM, and I am trying to run GPT-OSS 20B in 4-bit bnb, but Transformers insists on offloading it to the CPU.
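One way to stop Accelerate from spilling to the CPU is to pin the whole device map to GPU 0, so the load fails loudly instead of silently offloading. A rough sketch below, assuming a bitsandbytes NF4 load; whether bnb 4-bit actually supports this checkpoint is an open question for me.

```python
# Rough sketch under the assumption that bitsandbytes NF4 loading works for this
# checkpoint: pin every module to cuda:0 so Accelerate cannot offload to the CPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",            # assumed model id
    quantization_config=bnb_config,
    device_map={"": 0},              # force all weights onto GPU 0; raises OOM instead of offloading
)
```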