UForm

Pocket-Sized Multimodal AI
For Content Understanding and Generation

Description

UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:

  1. CLIP-like ViT-H/14 (vision encoder)
  2. Qwen1.5-0.5B-Chat (language model)

The model was pre-trained on an internal image captioning dataset and fine-tuned on public instruction datasets: SVIT, LVIS, and VQA datasets. Training took one day on a DGX-H100 with 8x H100 GPUs. Thanks to Nebius.ai for providing the compute 🤗

Usage

The generative model can caption images and answer questions about them. It is also suitable for multimodal chat.

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# trust_remote_code=True is required because the model ships custom code.
model = AutoModel.from_pretrained("unum-cloud/uform-gen2-qwen-500m", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("unum-cloud/uform-gen2-qwen-500m", trust_remote_code=True)

prompt = "Question or Instruction"
image = Image.open("image.jpg")

# Tokenize the prompt and preprocess the image into model-ready tensors.
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
    # Greedy decoding, limited to 256 new tokens.
    output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=256,
        eos_token_id=151645,
        pad_token_id=processor.tokenizer.pad_token_id
    )

# The output starts with the prompt tokens; decode only what follows them.
prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
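
The slice `output[:, prompt_len:]` in the snippet above suggests that the generated sequence begins with the prompt tokens. The following is a minimal sketch of that post-processing step on plain Python lists, so it runs without the model. The helper name `strip_prompt` and the toy token IDs are hypothetical; only the EOS ID 151645 comes from the snippet. Cutting at the first EOS token is an extra step the snippet does not perform.

```python
EOS_TOKEN_ID = 151645  # value passed as eos_token_id in the usage snippet


def strip_prompt(sequences, prompt_len, eos_token_id=EOS_TOKEN_ID):
    """Drop the echoed prompt and everything from the first EOS token onward.

    Hypothetical helper for illustration; not part of the model's API.
    """
    completions = []
    for seq in sequences:
        new_tokens = list(seq[prompt_len:])  # keep only tokens after the prompt
        if eos_token_id in new_tokens:
            new_tokens = new_tokens[: new_tokens.index(eos_token_id)]
        completions.append(new_tokens)
    return completions


# Toy batch: three prompt tokens followed by generated tokens (IDs are made up).
batch = [[11, 12, 13, 501, 502, EOS_TOKEN_ID]]
print(strip_prompt(batch, prompt_len=3))  # prints [[501, 502]]
```

With real tensors, the same idea applies to `output` before calling `processor.batch_decode`.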

You can check examples of different prompts in our demo space.

Evaluation

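| Model                | LLM Size | SQA  | MME    | MMBench | Average¹ |
|----------------------|----------|------|--------|---------|----------|
| UForm-Gen2-Qwen-500m | 0.5B     | 45.5 | 880.1  | 42.0    | 29.31    |
| MobileVLM v2         | 1.4B     | 52.1 | 1302.8 | 57.7    | 36.81    |
| LLaVA-Phi            | 2.7B     | 68.4 | 1335.1 | 59.8    | 42.95    |

¹MME scores were divided by 2000 before averaging.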
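
The averaging rule in the footnote can be checked with a few lines of arithmetic. This sketch assumes the Average column is the plain mean of SQA, MME divided by 2000, and MMBench. The card does not spell out the formula beyond the footnote, but this reading matches the reported values to within 0.01. The function name `mme_adjusted_average` is hypothetical.

```python
def mme_adjusted_average(sqa, mme, mmbench, mme_divisor=2000):
    """Mean of the three scores after dividing MME by mme_divisor.

    Assumed formula; the divisor of 2000 comes from the table footnote.
    """
    return (sqa + mme / mme_divisor + mmbench) / 3


# (SQA, MME, MMBench) copied from the evaluation table.
scores = {
    "UForm-Gen2-Qwen-500m": (45.5, 880.1, 42.0),
    "MobileVLM v2": (52.1, 1302.8, 57.7),
    "LLaVA-Phi": (68.4, 1335.1, 59.8),
}

for name, (sqa, mme, mmbench) in scores.items():
    print(f"{name}: {mme_adjusted_average(sqa, mme, mmbench):.3f}")
```

The last digit may differ from the table depending on whether values are rounded or truncated to two decimals.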
