Error when trying chat demo

#1
by ilintar - opened

I've tried running the chat demo with your model, but I'm getting the following:

```
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
```
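(For context: this device-side assert usually fires inside `torch.multinomial` when the sampling distribution contains NaN, Inf, or negative values. A minimal hypothetical repro, not the demo code itself:)

```python
import torch

# Hypothetical minimal repro of the same assert: torch.multinomial on CUDA
# rejects probability tensors containing NaN, Inf, or negative entries.
probs = torch.tensor([0.5, float("nan"), 0.25], device="cuda")
torch.multinomial(probs, num_samples=1)   # triggers the device-side assert
torch.cuda.synchronize()                  # the error surfaces on synchronization
```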

Any ideas?

Thank you for bringing up the problem. I see that you are not using the chat demo provided by the original model creator. You can find the official chat demo for Dream 7B in this repository: HKUNLP/Dream.

Note that Dream is a diffusion model, which differs from a traditional autoregressive (AR) model. Its token generation follows a different mechanism than AR models like Llama, where the next token is sampled from a probability distribution. This difference in token generation is likely the cause of the error you encountered.
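To illustrate the difference, here is a schematic sketch (not Dream's actual implementation; `mask_id` and the confidence-based unmasking rule are simplifications): an AR model appends one sampled token per step, while a diffusion LM fills a block of masked positions over several refinement steps.

```python
import torch

def ar_generate(model, ids, n_new):
    # Autoregressive: sample one next token at a time from the softmax.
    for _ in range(n_new):
        logits = model(ids).logits[:, -1, :]
        next_id = torch.multinomial(logits.softmax(-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids

def diffusion_generate(model, ids, n_new, steps, mask_id):
    # Diffusion-style: start fully masked, unmask high-confidence slots per step.
    out = torch.cat([ids, torch.full((ids.size(0), n_new), mask_id,
                                     dtype=ids.dtype)], dim=-1)
    for _ in range(steps):
        probs = model(out).logits.softmax(-1)
        conf, pred = probs.max(-1)
        masked = out.eq(mask_id)
        if not masked.any():
            break
        k = max(int(masked.sum()) // steps, 1)
        idx = (conf * masked).topk(k, dim=-1).indices  # most confident masked slots
        out.scatter_(1, idx, pred.gather(1, idx))
    return out
```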

That's the thing - I was using the chat demo from the original model creator. I took their demo_multiturn_chat.py, replaced the model-loading parts with the quantized ones, and I'm getting this error.
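For concreteness, the swap described above presumably looked something like this (repo ids are placeholders; the upstream demo loads Dream via AutoModel with trust_remote_code, per the HKUNLP/Dream README):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# The original demo_multiturn_chat.py loads the full-precision checkpoint, e.g.:
# model = AutoModel.from_pretrained("Dream-org/Dream-v0-Instruct-7B", ...)
# Hypothetical swap to a quantized checkpoint (repo id is a placeholder):
quant_path = "path/to/quantized-dream-7b"
model = AutoModel.from_pretrained(quant_path, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
```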

I have replicated the error you brought up on an RTX A6000 machine. It seems that the official chat demo cannot handle models that are not full precision. I have also noticed that the quantized model has other issues, such as repeating the same token over and over. A newly quantized 4-bit model, Dream-v0-Instruct-7B-4bit, has been uploaded along with a sample script for multi-round conversation. In my test run, the model loaded successfully and consumed around 9 GB of VRAM. I would be very grateful if you could give it a try and share your feedback on the model's performance.
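A minimal multi-round loop against the 4-bit checkpoint might look as follows, assuming it keeps the upstream Dream generation interface (`diffusion_generate` with a `steps` argument); the exact repo id and sample script may differ:

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "Dream-v0-Instruct-7B-4bit"  # repo id as mentioned above; owner prefix may vary
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

history = []
while True:
    history.append({"role": "user", "content": input("user> ")})
    inputs = tokenizer.apply_chat_template(history, return_tensors="pt",
                                           return_dict=True,
                                           add_generation_prompt=True).to("cuda")
    out = model.diffusion_generate(inputs.input_ids,
                                   attention_mask=inputs.attention_mask,
                                   max_new_tokens=256, steps=256,
                                   temperature=0.2, top_p=0.95,
                                   return_dict_in_generate=True)
    reply = tokenizer.decode(out.sequences[0, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)
    print("assistant>", reply)
    history.append({"role": "assistant", "content": reply})
```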

Thanks, I'll give it a try and let you know how it goes.

Happy to report that the new version works :> Unfortunately, I have a 10 GB VRAM card, so the demo only just fits - when I tried extending the context even a little, I got OOM errors. I'm guessing you had to dequantize some layers to get it working, so the model got pretty big. Nevertheless, the results seem pretty interesting (inference is quite slow at 512 steps, though - probably too slow for typical assistant usage).
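If it helps anyone else on a ~10 GB card: continuing the sketch above, lowering `max_new_tokens` and `steps` is the obvious knob to try (untested here; quality will likely drop as steps shrink):

```python
# Fewer new tokens and fewer denoising steps cut both VRAM growth and latency;
# parameter names follow the upstream Dream interface, effects are untested.
out = model.diffusion_generate(inputs.input_ids,
                               attention_mask=inputs.attention_mask,
                               max_new_tokens=128, steps=128,
                               temperature=0.2, top_p=0.95,
                               return_dict_in_generate=True)
```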
