RTX 5000 can't use this model...

#41
by arvis - opened

Got an error on vllm serve:
(APIServer pid=1334738) Value error, The quantization method mxfp4 is not supported for the current GPU. Minimum capability: 80. Current capability: 75. [type=value_error, input_value=ArgsKwargs((), {'model_co...additional_config': {}}), input_type=ArgsKwargs]

vllm serve /home/models/gpt-oss-20b
INFO 08-06 03:36:40 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=1334738) INFO 08-06 03:36:46 [api_server.py:1787] vLLM API server version 0.10.2.dev2+gf5635d62e.d20250806
(APIServer pid=1334738) INFO 08-06 03:36:46 [utils.py:326] non-default args: {'model_tag': '/home/models/gpt-oss-20b', 'model': '/home/models/gpt-oss-20b'}
(APIServer pid=1334738) INFO 08-06 03:36:53 [config.py:726] Resolved architecture: GptOssForCausalLM
(APIServer pid=1334738) ERROR 08-06 03:36:53 [config.py:123] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/models/gpt-oss-20b'. Use repo_type argument if needed., retrying 1 of 2
(APIServer pid=1334738) ERROR 08-06 03:36:55 [config.py:121] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/models/gpt-oss-20b'. Use repo_type argument if needed.
(APIServer pid=1334738) INFO 08-06 03:36:55 [config.py:3628] Downcasting torch.float32 to torch.float16.
(APIServer pid=1334738) INFO 08-06 03:36:55 [config.py:1759] Using max model len 131072
(APIServer pid=1334738) WARNING 08-06 03:36:55 [config.py:1198] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
(APIServer pid=1334738) WARNING 08-06 03:36:55 [arg_utils.py:1771] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
(APIServer pid=1334738) WARNING 08-06 03:36:55 [arg_utils.py:1555] The model has a long context length (131072). This may cause OOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
(APIServer pid=1334738) INFO 08-06 03:36:56 [config.py:244] Overriding cuda graph sizes to [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024]
(APIServer pid=1334738) Traceback (most recent call last):
(APIServer pid=1334738) File "/home/gorilla/vllm_env/.venv/bin/vllm", line 10, in <module>
(APIServer pid=1334738) sys.exit(main())
(APIServer pid=1334738) ^^^^^^
(APIServer pid=1334738) File "/home/gorilla/vllm_env/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=1334738) args.dispatch_function(args)
(APIServer pid=1334738) File "/home/gorilla/vllm_env/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=1334738) uvloop.run(run_server(args))
(APIServer pid=1334738) File "/home/gorilla/vllm_env/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1334738) return __asyncio.run(
(APIServer pid=1334738) ^^^^^^^^^^^^^^
(APIServer pid=1334738) File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1334738) return runner.run(main)
(APIServer pid=1334738) ^^^^^^^^^^^^^^^^
(APIServer pid=1334738) File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1334738) return self._loop.run_until_complete(task)
(APIServer pid=1334738) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1334738) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1334738) File "/home/gorilla/vllm_env/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1334738) return await main
(APIServer pid=1334738) ^^^^^^^^^^
(APIServer pid=1334738) File "/home/gorilla/vllm_env/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1827, in run_server
(APIServer pid=1334738) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1334738) File "/home/gorilla/vllm_env/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1847, in run_server_worker
(APIServer pid=1334738) async with build_async_engine_client(
(APIServer pid=1334738) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1334738) File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1334738) return await anext(self.gen)
(APIServer pid=1334738) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1334738) File "/home/gorilla/vllm_env/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 167, in build_async_engine_client
(APIServer pid=1334738) async with build_async_engine_client_from_engine_args(
(APIServer pid=1334738) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1334738) File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1334738) return await anext(self.gen)
(APIServer pid=1334738) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1334738) File "/home/gorilla/vllm_env/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 193, in build_async_engine_client_from_engine_args
(APIServer pid=1334738) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1334738) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1334738) File "/home/gorilla/vllm_env/.venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1339, in create_engine_config
(APIServer pid=1334738) config = VllmConfig(
(APIServer pid=1334738) ^^^^^^^^^^^
(APIServer pid=1334738) File "/home/gorilla/vllm_env/.venv/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 127, in __init__
(APIServer pid=1334738) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=1334738) pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig
(APIServer pid=1334738) Value error, The quantization method mxfp4 is not supported for the current GPU. Minimum capability: 80. Current capability: 75. [type=value_error, input_value=ArgsKwargs((), {'model_co...additional_config': {}}), input_type=ArgsKwargs]
(APIServer pid=1334738) For further information visit https://errors.pydantic.dev/2.12/v/value_error
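
The root cause is in the final validation error: vLLM's mxfp4 quantization path requires compute capability 80 (Ampere or newer), and this GPU reports 75 (Turing). You can confirm what your card reports with a quick check; a minimal sketch using PyTorch's public CUDA API:

```python
import torch

# Reproduce vLLM's capability gate: mxfp4 needs compute capability >= 80
# (i.e. a major.minor of 8.0 or newer).
major, minor = torch.cuda.get_device_capability(0)
capability = major * 10 + minor
print(f"{torch.cuda.get_device_name(0)}: capability {capability}")
if capability < 80:
    print("mxfp4 kernels are unsupported on this GPU; "
          "a dequantized (bf16/fp16) path or a newer GPU is needed.")
```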

Correct, this has something to do with Python/PyTorch or something along those lines. Things like this and Stable Diffusion/Automatic1111 (webui) don't work either.

It's because the libraries they're using don't yet support RTX 5000-series cards whose architecture (or something like that) is SM_120, and my RTX 5090 obviously is SM_120.
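
If you suspect that kind of support gap, you can check which CUDA architectures your installed PyTorch wheel actually ships kernels for; a minimal sketch, assuming a standard CUDA build of PyTorch:

```python
import torch

# Architectures this PyTorch build was compiled for, e.g. ['sm_80', 'sm_90', ...].
# If 'sm_120' is absent, the wheel predates Blackwell support and a newer
# PyTorch build (one targeting a recent CUDA toolkit) is needed for an RTX 5090.
print(torch.cuda.get_arch_list())
print(torch.version.cuda)  # CUDA toolkit version the wheel targets
```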

With transformers main, it should even work on a T4! Please try the following Google Colab: https://colab.research.google.com/drive/15DJv6QWgc49MuC7dlNS9ifveXBDjCWO5?usp=sharing
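
For reference, a minimal sketch of loading the checkpoint through transformers instead of vLLM, using the local path from the log above (that transformers falls back to dequantized weights on GPUs without mxfp4 kernels is my assumption, based on the T4 claim):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/models/gpt-oss-20b"  # local checkpoint from the log above

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",  # assumption: on GPUs without mxfp4 kernels,
    device_map="auto",   # weights are dequantized to a wider dtype at load
)

messages = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```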
