RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3!
#17, opened by hoseongahn
A RuntimeError occurs when I execute the code using 4 V100 (16GB) GPUs.
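For context, a rough estimate suggests the weights cannot fit on a single 16 GB card, so the model has to be sharded across the GPUs. The sketch below assumes a nominal 24 billion parameters (read off the model name, not measured) at 2 bytes per parameter for bfloat16, and it ignores activations and the KV cache.

```python
# Nominal figures: 24e9 parameters is an assumption taken from the model name;
# bfloat16 stores each parameter in 2 bytes.
NUM_PARAMS = 24e9
BYTES_PER_PARAM = 2
GPU_MEMORY_GB = 16   # per V100, as stated in the report
NUM_GPUS = 4

weights_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9
total_gb = GPU_MEMORY_GB * NUM_GPUS

fits_on_one_gpu = weights_gb <= GPU_MEMORY_GB
fits_when_sharded = weights_gb <= total_gb

print(f"weights: ~{weights_gb:.0f} GB, one GPU: {GPU_MEMORY_GB} GB, all GPUs: {total_gb} GB")
print(f"fits on one GPU: {fits_on_one_gpu}, fits when sharded: {fits_when_sharded}")
```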
error:
/.venv/lib/python3.10/site-packages/transformers/generation/utils.py:2505: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Traceback (most recent call last):
  File "/main.py", line 43, in <module>
    outputs = model.generate(**inputs, max_new_tokens=500)
  File "/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2633, in generate
    result = self._sample(
  File "/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 3614, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 175, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/.venv/lib/python3.10/site-packages/transformers/utils/generic.py", line 961, in wrapper
    output = func(self, *args, **kwargs)
  File "/.venv/lib/python3.10/site-packages/transformers/models/voxtral/modeling_voxtral.py", line 512, in forward
    inputs_embeds[audio_token_mask] = audio_embeds
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3!
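The failing line is a masked assignment, which PyTorch only allows when the destination and the source tensor are on the same device. Below is a minimal sketch of that operation with the source moved to the destination's device first. `scatter_audio_embeds` is a hypothetical helper and the shapes are toy values; this is not the library's actual code or an official fix. It falls back to the CPU when fewer than two GPUs are visible, in which case the `.to()` calls do nothing.

```python
import torch

def scatter_audio_embeds(inputs_embeds, audio_token_mask, audio_embeds):
    """Write audio embeddings into the masked positions of inputs_embeds."""
    # Align the mask and the source with the destination's device (and dtype)
    # before the in-place masked assignment.
    audio_token_mask = audio_token_mask.to(inputs_embeds.device)
    audio_embeds = audio_embeds.to(device=inputs_embeds.device, dtype=inputs_embeds.dtype)
    inputs_embeds[audio_token_mask] = audio_embeds
    return inputs_embeds

# Use two distinct GPUs if available; otherwise keep everything on the CPU.
multi_gpu = torch.cuda.device_count() >= 2
dst = torch.device("cuda:0") if multi_gpu else torch.device("cpu")
src = torch.device("cuda:1") if multi_gpu else torch.device("cpu")

inputs_embeds = torch.zeros(1, 4, 3, device=dst)                 # (batch, seq, hidden)
audio_token_mask = torch.tensor([[False, True, True, False]])    # two audio positions
audio_embeds = torch.ones(2, 3, device=src)                      # one row per audio position

out = scatter_audio_embeds(inputs_embeds, audio_token_mask, audio_embeds)
print(out.device, out.sum().item())
```

Without the two `.to()` calls, the same assignment with `dst` and `src` on different GPUs raises the "Expected all tensors to be on the same device" error shown above.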
run command: CUDA_VISIBLE_DEVICES=4,5,6,7 uv run main.py
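Note that the device indices in the error are relative to `CUDA_VISIBLE_DEVICES`: inside the process the visible GPUs are renumbered from 0, so `cuda:0` and `cuda:3` here are not physical GPUs 0 and 3. A small sketch of the mapping, assuming the variable holds a comma-separated list of integer IDs as in the run command (`physical_gpu` is a hypothetical helper, and UUID-style entries are not handled):

```python
def physical_gpu(visible_devices: str, logical_index: int) -> int:
    """Map a logical cuda:<index> to the physical GPU ID listed in CUDA_VISIBLE_DEVICES."""
    ids = [int(part) for part in visible_devices.split(",")]
    return ids[logical_index]

# Value taken from the run command in this report.
visible = "4,5,6,7"
mapping = {f"cuda:{i}": physical_gpu(visible, i) for i in range(4)}
print(mapping)
```

With this setting, the two devices named in the error correspond to physical GPUs 4 and 7.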
code:
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {"type": "text", "text": "Describe briefly what you can hear."},
        ],
    },
    {
        "role": "assistant",
        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
        ],
    },
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to("cuda", dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
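To see where `device_map="auto"` actually placed each part of the model, the `hf_device_map` attribute that Accelerate attaches to the loaded model can be summarized per top-level component. The sketch below runs on a made-up map so it works without GPUs; `devices_by_component` is a hypothetical helper, and the module names and placements in `example_map` are illustrative assumptions, not the real layout from this run.

```python
from collections import defaultdict

def devices_by_component(device_map):
    """Group a {module_name: device} map by top-level module name."""
    grouped = defaultdict(set)
    for module_name, device in device_map.items():
        top_level = module_name.split(".")[0]
        grouped[top_level].add(device)
    # Sort by string form so mixed entries such as 0 and "cpu" do not break sorting.
    return {name: sorted(devices, key=str) for name, devices in grouped.items()}

# Illustrative map only; on a real run, pass model.hf_device_map instead.
example_map = {
    "audio_tower": 0,
    "multi_modal_projector": 0,
    "language_model.model.embed_tokens": 0,
    "language_model.model.layers.0": 0,
    "language_model.model.layers.39": 3,
    "language_model.lm_head": 3,
}

summary = devices_by_component(example_map)
for component, devices in summary.items():
    print(component, devices)
```

On the real model, `devices_by_component(model.hf_device_map)` would show whether the components that produce `inputs_embeds` and `audio_embeds` ended up on different GPUs.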
packages:
$ uv pip list
Package Version
------------------------- ------------
accelerate 1.9.0
annotated-types 0.7.0
attrs 25.3.0
audioread 3.0.1
certifi 2025.7.14
cffi 1.17.1
charset-normalizer 3.4.2
decorator 5.2.1
filelock 3.13.1
fsspec 2024.6.1
hf-xet 1.1.5
huggingface-hub 0.34.1
idna 3.10
jinja2 3.1.4
joblib 1.5.1
jsonschema 4.25.0
jsonschema-specifications 2025.4.1
lazy-loader 0.4
librosa 0.11.0
llvmlite 0.44.0
markupsafe 2.1.5
mistral-common 1.8.2
mpmath 1.3.0
msgpack 1.1.1
networkx 3.3
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu11 11.11.3.6
nvidia-cuda-cupti-cu11 11.8.87
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cudnn-cu11 9.1.0.70
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.3.0.86
nvidia-cusolver-cu11 11.4.1.48
nvidia-cusparse-cu11 11.7.5.86
nvidia-nccl-cu11 2.21.5
nvidia-nvtx-cu11 11.8.86
packaging 25.0
pillow 11.3.0
platformdirs 4.3.8
pooch 1.8.2
psutil 7.0.0
pycountry 24.6.1
pycparser 2.22
pydantic 2.11.7
pydantic-core 2.33.2
pydantic-extra-types 2.10.5
pyyaml 6.0.2
referencing 0.36.2
regex 2024.11.6
requests 2.32.4
rpds-py 0.26.0
safetensors 0.5.3
scikit-learn 1.7.1
scipy 1.15.3
sentencepiece 0.2.0
setuptools 70.2.0
soundfile 0.13.1
soxr 0.5.0.post1
sympy 1.13.3
threadpoolctl 3.6.0
tiktoken 0.9.0
tokenizers 0.21.2
torch 2.7.1+cu118
torchaudio 2.7.1+cu118
torchvision 0.22.1+cu118
tqdm 4.67.1
transformers 4.54.0.dev0
triton 3.3.1
typing-extensions 4.14.1
typing-inspection 0.4.1
urllib3 2.5.0