Is CPU offloading possible?
I only have 8 GB of VRAM; is it possible to try the model on cuda+cpu?
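In case it helps, CPU offloading in Transformers usually goes through Accelerate's device_map="auto" with per-device memory caps. Below is a minimal sketch; the model id and memory caps are assumptions, and I have not verified that offloading works with this checkpoint's MXFP4 weights.

```python
# Rough sketch, not verified on this checkpoint: split the model between GPU and
# CPU with Accelerate's device_map="auto" and explicit per-device memory caps.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",                      # keep the checkpoint's native dtype
    device_map="auto",                       # let Accelerate place layers on GPU and CPU
    max_memory={0: "7GiB", "cpu": "48GiB"},  # cap GPU 0 below 8 GB; the rest spills to RAM
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```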
Ollama returns Error 500
Same question here; my GPU is not being utilized at all in Ollama.
I am facing the same issue and I think I have an explanation. Based on the log below you can see that:
- the model requires 12.2 GiB of memory to run fully on the GPU
- my GPU only has 6.8 GiB of available VRAM (out of 8 in total)
- the model's repeating layers alone need 10.7 GiB (memory.weights.repeating="10.7 GiB") <--- this is the core problem
Ollama concluded that it could not offload even a single layer to the GPU (likely because even the partial compute graph needs 8.0 GiB, more than the 6.8 GiB available), so it fell back to running the model entirely on the CPU (0/25 layers offloaded).
Raw log below:
time=2025-08-06T09:24:13.355+02:00 level=INFO source=server.go:135 msg="system memory" total="63.9 GiB" free="54.2 GiB" free_swap="57.4 GiB"
time=2025-08-06T09:24:13.355+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[6.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="12.2 GiB" memory.required.partial="0 B" memory.required.kv="492.0 MiB" memory.required.allocations="[0 B]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="4.0 GiB" memory.graph.partial="8.0 GiB"
time=2025-08-06T09:24:13.414+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\Users\*****\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --model D:\AI\models\blobs\sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 16384 --batch-size 512 --threads 6 --no-mmap --parallel 1 --port 50090"
time=2025-08-06T09:24:13.419+02:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-06T09:24:13.419+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-06T09:24:13.423+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
time=2025-08-06T09:24:13.453+02:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-06T09:24:13.462+02:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:50090"
time=2025-08-06T09:24:13.508+02:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
time=2025-08-06T09:24:13.675+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from C:\Users\*****\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\*****\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-08-06T09:24:17.944+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-08-06T09:24:18.030+02:00 level=INFO source=ggml.go:367 msg="offloading 0 repeating layers to GPU"
time=2025-08-06T09:24:18.030+02:00 level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
time=2025-08-06T09:24:18.030+02:00 level=INFO source=ggml.go:378 msg="offloaded 0/25 layers to GPU"
time=2025-08-06T09:24:18.030+02:00 level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="12.8 GiB"
time=2025-08-06T09:24:18.107+02:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-06T09:24:18.107+02:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="4.0 GiB"
time=2025-08-06T09:24:24.193+02:00 level=INFO source=server.go:637 msg="llama runner started in 10.77 seconds"
[GIN] 2025/08/06 - 09:36:55 | 200 | 12m41s | 192.168.1.250 | POST "/api/generate"
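If you still want to try forcing a partial offload, Ollama exposes a num_gpu option to request a fixed number of layers on the GPU. Here is a hedged sketch against the local /api/generate endpoint; the model tag and layer count are assumptions, and with only ~6.8 GiB of free VRAM the request may still be refused or run out of memory.

```python
# Hedged sketch: ask Ollama to keep a fixed number of layers on the GPU via the
# num_gpu option. With ~6.8 GiB of free VRAM this may still fall back to CPU or OOM.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss:20b",     # assumed model tag
        "prompt": "Hello",
        "stream": False,
        "options": {"num_gpu": 8},  # request 8 of the 25 layers on the GPU
    },
    timeout=600,
)
print(resp.json()["response"])
```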
My GPU has 16 GB of VRAM, and I am trying to run GPT-OSS 20B in 4-bit bnb, but Transformers insists on offloading it to the CPU.
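One way to stop Accelerate from spilling to the CPU is to pin the whole device map to GPU 0, so the load fails loudly instead of silently offloading. A rough sketch below, assuming a bitsandbytes NF4 load; whether bnb 4-bit actually supports this checkpoint is an open question for me.

```python
# Rough sketch under the assumption that bitsandbytes NF4 loading works for this
# checkpoint: pin every module to cuda:0 so Accelerate cannot offload to the CPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",            # assumed model id
    quantization_config=bnb_config,
    device_map={"": 0},              # force all weights onto GPU 0; raises OOM instead of offloading
)
```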