Information on how to get it working on a 3090

#63
by TheBigBlockPC - opened

For the kernels, you need to run:

pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

You need to install transformers using this command:

pip install git+https://github.com/huggingface/transformers.git 

Update triton as the last step, because pip sometimes just downgrades it.
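
As a quick sanity check once everything is installed (a sketch; the 3.4 threshold matches the working versions reported further down), you can verify that pip did not silently downgrade triton:

import triton
from packaging.version import Version

# Older triton versions make transformers fall back to dequantizing the weights to bf16.
assert Version(triton.__version__) >= Version("3.4"), (
    f"triton {triton.__version__} is too old, re-run the triton upgrade as the last step"
)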

3090 is not enough, I think.

Do you have this problem?

RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
module 'torch' has no attribute 'uint64'

I didn't have such an error. Try updating PyTorch and installing transformers with the provided command.
I have:
torch version: 2.7.1
transformers version: 4.56.0.dev0
triton version: 3.4.0
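
To compare your own environment against this working setup, a minimal check (a sketch) is:

import torch, transformers, triton

print("torch version:", torch.__version__)
print("transformers version:", transformers.__version__)
print("triton version:", triton.__version__)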

3090 is not enough, I think.

On my machine it definitely runs on a single 3090, but you need to update the packages for it to work. If it doesn't run, check the terminal output: if there is a message that it's dequantizing to bf16, you need to update triton and install the triton_kernels package.
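
A quick way to check for that fallback before loading the model (a sketch; it assumes the kernels package installs an importable triton_kernels module, as in the command above):

import triton

print("triton:", triton.__version__)  # should be 3.4.0 or newer
try:
    import triton_kernels  # noqa: F401
    print("triton_kernels found, the quantized weights can be used directly")
except ImportError:
    print("triton_kernels missing, transformers will dequantize to bf16 and need far more VRAM")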

You're a life saver! I got pretty far, but I didn't know how to install the triton_kernels, and your instruction shows exactly that. Thanks :)

Broad steps involved:

  • Conda environment with Python 3.12 (PyTorch recommends this Python version)
  • Torch nightly build with a CUDA 12.9 backend (see the official PyTorch site)
  • Hugging Face transformers as in the OP's post
  • Upgrade triton from v3.3 to v3.4 with pip
  • Triton kernels as in the OP's post

Then it can be loaded in Python with the usual transformers model/tokenizer classes. You can get the basic code from the official repo.
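
For reference, a minimal loading sketch (assuming the checkpoint in question is openai/gpt-oss-20b and that accelerate is installed for device_map="auto"):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumption: adjust to the checkpoint you are loading

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # keep the checkpoint's dtype/quantization where supported
    device_map="auto",   # place the model on the available GPU
)

messages = [{"role": "user", "content": "Say hello in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))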

PS: Got it working on my RTX 4080.

Yeah, it works with transformers. Now that the quantization supposedly works on older hardware, it should work with vLLM too, right?

With transformers main, it should even work on a T4! Please try the following Google Colab: https://colab.research.google.com/drive/15DJv6QWgc49MuC7dlNS9ifveXBDjCWO5?usp=sharing

I didn't test it on vLLM, I only used transformers.

Couldn't get it to work with vLLM for now. It's basically the same issue as with transformers: no triton 3.4.0, but if you upgrade triton then you also need to upgrade PyTorch to 2.8.0, which in turn is incompatible with vLLM, etc. This will sort itself out over time, when the modules get updated. But for now, I'll stick with what I got working: transformers :)

It annoys me, though, when official instructions don't work: the official vLLM install docs provide the following steps to install vLLM against an existing PyTorch install:

git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py
pip install -r requirements-build.txt # -> This file does not exist, so it fails and the next step will fail as well (not sure if it is related, but it is likely)
pip install -e . --no-build-isolation

Well, I kind of got around the triton problem using the instructions above, but I don't think this quantization method is particularly practical for our GPUs lol.
All the other requantized models don't work so far either, but I still somehow got it running using the lmsys bf16 upcast.
HOWEVER, that one had missing w2_bias keys (which, for testing, I just skipped loading), and it DID infer (on two RTX 3090s), but it is producing just gibberish.
Honestly I'd just use transformers if it weren't so damn slow (even the gibberish-producing bf16 vLLM version I got working now is just sooo nicely fast).
Anyone got any more insider clues on this? :)
Currently trying other bf16 upcasts to test whether they maybe just forgot to include those bias keys.

Do you have this problem?

RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
module 'torch' has no attribute 'uint64'

Today, I also ran into this error. I think that the root cause lies in how the transformers package internally depends on the safetensors library.

Here’s the problem:

  • safetensors tries to use torch.uint64 (and similar types like torch.uint16, torch.uint32).
  • These types are only available starting from PyTorch 2.4 (I think).
  • If you’re using an older PyTorch version (torch < 2.4), torch.uint64 simply doesn’t exist.

Why this is suddenly affecting people:
Just a couple of days ago, safetensors released a new version ... I think they just started using torch.uint64 there.
And transformers will just pull in the latest version of safetensors.

This means that anyone installing or updating packages will now get this error unless they’re on a new enough PyTorch version.
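
You can confirm whether this is your situation with a quick check (a sketch):

import torch

print("torch version:", torch.__version__)
# On affected (older) PyTorch builds this prints False, and safetensors crashes on import.
print("has torch.uint64:", hasattr(torch, "uint64"))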

Solution:

  • Pin safetensors to a version "<0.6.0", for example: pip install "safetensors<0.6.0"
  • Or, if you're using Poetry, add: safetensors = "<0.6.0"
