Dream-VL 7B
Dream-VL 7B is an open diffusion vision-language model trained on the 12M multimodal instruction samples of the MAmmoTH-VL-Instruct-12M dataset. The model takes language instructions and images as input and generates language outputs.
All Dream-VL checkpoints, as well as our training codebase, are released under the Apache 2.0 license.
For full details, please read our blog and paper (pending).
Model Summary
- Model type: Vision-language (language, image => language)
- Language(s) (NLP): en
- License: apache-2.0
- Finetuned from: Dream-7B, with a Qwen2ViT vision backbone
- Pretraining Dataset: MAmmoTH-VL-Instruct-12M
- Repository: https://github.com/DreamLM/DreamVLX
- Project Page & Videos: https://hkunlp.github.io/blog/2025/dream-vlx
Getting Started
import torch
from transformers import AutoProcessor, AutoModel
model_name = "Dream-org/Dream-VL-7B"
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to('cuda')
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True
)
####### Method 1: pass a PIL image to the processor directly
from PIL import Image
import requests
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image"},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(text)
inputs = processor(
    text=[text], images=[image], padding=True, return_tensors="pt"
)
####### Method 2: use qwen_vl_utils
# messages = [
# {
# "role": "user",
# "content": [
# {
# "type": "image",
# "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
# },
# {"type": "text", "text": "Describe this image."},
# ],
# }
# ]
# text = processor.apply_chat_template(
# messages, tokenize=False, add_generation_prompt=True
# )
# from qwen_vl_utils import process_vision_info
# image_inputs, video_inputs = process_vision_info(messages)
# inputs = processor(
# text=[text],
# images=image_inputs,
# videos=video_inputs,
# padding=True,
# return_tensors="pt",
# )
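# Move every tensor to the GPU and split off input_ids; the remaining processor
# outputs (attention_mask, image features, etc.) are forwarded to diffusion_generate via **inputs.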
inputs = inputs.to("cuda")
input_ids = inputs.pop("input_ids")
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=128,
    output_history=True,           # record intermediate denoising states in output.history
    return_dict_in_generate=True,
    steps=128,                     # number of diffusion denoising steps
    temperature=0.1,
    top_p=1,
    alg="maskgit_plus",            # token-unmasking (remasking) strategy
    alg_temp=0,                    # randomness of the unmasking order
    use_cache=False,
    **inputs
)
# Strip the prompt tokens and decode only the newly generated text.
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]
for j in range(len(generations)):
    print("output:", j, generations[j].split(processor.tokenizer.eos_token)[0])
# output: The image depicts a serene beach scene featuring a young woman and a golden retriever.
# The woman, dressed in a plaid shirt and dark pants, is seated on the sandy shore, smiling warmly at the camera.
# The golden retriever, adorned with a colorful harness, sits attentively beside her, its gaze fixed on the woman.
# The background reveals the vast expanse of the ocean, with waves gently kissing the shore. The sky above is a clear blue, suggesting a sunny day.
# The overall atmosphere exudes a sense of peace and companionship between the woman and her dog.
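Since output_history=True and return_dict_in_generate=True are set, the returned object also carries the intermediate denoising states. Below is a minimal sketch of how they could be inspected, assuming output.history is a list of token-id tensors with one full sequence per diffusion step (as in the Dream-7B text model); verify the exact structure against your checkpoint.
# Sketch (assumption): walk the recorded denoising trajectory and decode the
# partially unmasked continuation at each diffusion step.
for step, h in enumerate(output.history):
    partial = processor.tokenizer.decode(h[0][len(input_ids[0]):].cpu().tolist())
    print(f"step {step}:", partial.split(processor.tokenizer.eos_token)[0])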
Citation
BibTeX:
@article{ye2025dreamvla,
title={Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone},
author={Ye, Jiacheng and Gong, Shansan and Gao, Jiahui and Fan, Junming and Wu, Shuang and Bi, Wei and Bai, Haoli and Shang, Lifeng and Kong, Lingpeng},
journal={arXiv preprint},
year={2025}
}