---
library_name: transformers
tags:
- multimodal
- multilingual
- vlm
- translation
language:
- en
- de
- nl
- es
- fr
- pt
- uk
- hi
- zh
- ru
- cs
- ko
- ja
- it
- pl
- ro
- nb
- nn
base_model:
- Unbabel/Tower-Plus-9B
pipeline_tag: image-text-to-text
license: cc-by-nc-sa-4.0
---

# Model Card for TowerVision

TowerVision Logo

TowerVision is a family of open-source multilingual vision-language models optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. **TowerVision excels particularly on multimodal multilingual translation benchmarks and culturally-aware tasks**, demonstrating strong performance across **20 languages and dialects**.

This model card covers the TowerVision family, including the 2B and 9B parameter versions, each available in an instruction-tuned (it) and a pretrained (pt) variant; the latter has not undergone instruction tuning.

- **Model Family**: TowerVision (2B, 9B variants)
- **Context length**: 8192 tokens
- **Languages**: 20+ languages spanning European, Asian, and other language families

🌟 Try TowerVision: [Project Page](https://guilhermeviveiros.github.io/TowerVision.io/) | [Code Repository](https://github.com/GuilhermeViveiros/LLaVA-NeXT)

## Available Models

| Model | Parameters | HF Link |
|-------|------------|---------|
| TowerVision-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVision-2B) |
| TowerVision-2B-pt | 2B | [🤗 utter-project/TowerVision-2B-pt](https://huggingface.co/utter-project/TowerVision-2B-pt) |
| TowerVision-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B) |
| TowerVision-9B-pt | 9B | [🤗 utter-project/TowerVision-9B-pt](https://huggingface.co/utter-project/TowerVision-9B-pt) |

## How to Use TowerVision

When using the model, make sure your prompt is formatted correctly. We also recommend running in **bfloat16** rather than **fp32** or **fp16**.

### Quick Start with Transformers

```python
from transformers import (
    LlavaNextProcessor,
    LlavaNextForConditionalGeneration,
)
import requests
from PIL import Image

model_id = "utter-project/TowerVision-2B"  # or any other variant


def prepare_prompt(query):
    # "<image>" marks where the image is inserted in the prompt
    conversation = [
        {
            "role": "user",
            "content": f"<image>\n{query}"
        }
    ]
    # Format the message with the TowerVision chat template
    prompt = processor.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True
    )
    return prompt


# we recommend using "bfloat16" as torch_dtype
kwargs = {
    "torch_dtype": "bfloat16",
    "device_map": "auto",
}

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, **kwargs)

# Image URL
img_url = "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f"
image = Image.open(requests.get(img_url, stream=True).raw)

# Multilingual prompts - TowerVision supports 20+ languages!
prompt = prepare_prompt("Is this person really big, or is this building just super small?")

# Prepare inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)

# Generate response ids
gen_tokens = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
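Since TowerVision is multilingual, the same helper can be reused with prompts in any of the supported languages. A minimal follow-up sketch (the Portuguese wording below is illustrative, not taken from the original examples):

```python
# Reuses processor, model, image, and prepare_prompt from the Quick Start above.
# Illustrative Portuguese prompt ("Where was this photo taken?").
prompt_pt = prepare_prompt("Onde foi tirada esta fotografia?")
inputs_pt = processor(text=prompt_pt, images=image, return_tensors="pt").to(model.device)
gen_tokens_pt = model.generate(**inputs_pt, max_new_tokens=512)
print(processor.tokenizer.decode(gen_tokens_pt[0][inputs_pt.input_ids.shape[1]:], skip_special_tokens=True))
```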
### Batch Inference with Transformers

For processing multiple images and prompts simultaneously:
```python
# Reuses the imports and model_id from the Quick Start example above.
def prepare_prompts(queries):
    prompts = []
    for query in queries:
        conversation = [
            {
                "role": "user",
                "content": f"<image>\n{query}"
            }
        ]
        # Format each message with the TowerVision chat template
        prompt = processor.apply_chat_template(
            conversation,
            tokenize=False,
            add_generation_prompt=True
        )
        prompts.append(prompt)
    return prompts


# we recommend using "bfloat16" as torch_dtype
kwargs = {
    "torch_dtype": "bfloat16",
    "device_map": "auto",
}

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, **kwargs)

# Sample images and queries for batch processing
img_urls = [
    "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f",
    "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f",
]
queries = [
    "Is this person really big, or is this building just super small?",
    "Where was this photo taken?"
]
batch_size = 2

# Load images
images = []
for url in img_urls[:batch_size]:
    image = Image.open(requests.get(url, stream=True).raw)
    images.append(image)

# Prepare prompts
prompts = prepare_prompts(queries[:batch_size])

# Prepare batch inputs
inputs = processor(
    text=prompts,
    images=images,
    return_tensors="pt",
    padding=True
).to(model.device)

# Generate response ids for the batch
gen_tokens = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode responses
print(f"Batch processing {len(images)} images:")
print("-" * 50)
for i in range(len(images)):
    input_length = inputs.input_ids[i].shape[0]
    response = processor.tokenizer.decode(
        gen_tokens[i][input_length:],
        skip_special_tokens=True
    )
    print(f"Response: {response}")
    print("-" * 50)
```
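Depending on the tokenizer's defaults, you may also want to force left padding for batched generation; this is a general recommendation for decoder-only models (it keeps each generated continuation adjacent to its prompt), not something specific to TowerVision:

```python
# Assumption: only needed if the tokenizer defaults to right padding.
# Set this before calling the processor with padding=True.
processor.tokenizer.padding_side = "left"
```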
### Pipeline Usage
```python
from transformers import pipeline
from PIL import Image
import requests

pipe = pipeline(
    model="utter-project/TowerVision-9B",
    task="image-text-to-text",
    device_map="auto",
    dtype="bfloat16"
)


def prepare_prompt(query):
    conversation = [
        {
            "role": "user",
            "content": f"<image>\n{query}"
        }
    ]
    # Format the message with the TowerVision chat template
    return pipe.processor.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True
    )


img_url = "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f"
image = Image.open(requests.get(img_url, stream=True).raw)

text = prepare_prompt("Is this person really big, or is this building just super small?")

outputs = pipe(text=text, images=image, max_new_tokens=300, return_full_text=False)
print(outputs)
```
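To see which vision encoder and language backbone a given checkpoint combines (described under Model Details below), you can inspect its configuration without downloading the weights. A minimal sketch, assuming the checkpoint follows the LLaVA-NeXT config layout used in the examples above:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("utter-project/TowerVision-9B")

# LLaVA-NeXT-style configs expose the two sub-configurations separately.
print(config.model_type)                # overall multimodal architecture
print(config.vision_config.model_type)  # vision encoder (a SigLIP 2 variant, per the model card)
print(config.text_config.model_type)    # language backbone (Tower-Plus based, per the model card)
```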
## Model Details

**Input**: The model accepts text and images as input.

**Output**: The model generates text in multiple languages.

**Model Architecture**: TowerVision pairs a multilingual language model based on [Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-9B) (2B and 9B parameters) with the [SigLIP 2 (so400m-patch14-384)](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder, connected through a multimodal adapter for vision-language understanding.

**Recommended Precision**: We recommend `bfloat16` precision for optimal performance and memory efficiency when running TowerVision models.

**Languages Covered**: The model has been trained on **20 languages and dialects**:

- **European languages**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian (Bokmål & Nynorsk)
- **Asian languages**: Chinese (Simplified & Traditional), Japanese, Korean, Hindi
- **Other languages**: Russian, Ukrainian

**Key Strengths**:

- **🏆 Exceptional performance on culturally-aware benchmarks**, with deep understanding of cultural contexts and visual nuances
- **🌐 State-of-the-art results on multimodal multilingual translation benchmarks**, enabling seamless cross-lingual visual communication
- **📊 Strong cross-lingual transfer capabilities** across diverse vision-language tasks

## Training Data

TowerVision models are trained on **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:

| Dataset | Samples | HF Link | Status |
|---------|---------|---------|--------|
| VisionBlocks | 6.31M | [🤗 utter-project/VisionBlocks](https://huggingface.co/datasets/utter-project/VisionBlocks) | Coming Soon |

### Dataset Statistics

- **Total samples**: 6.31M
- **Created by our team**: 1.21M samples (~19%)
- **Human-collected/external**: 5.10M samples (~81%)

### Dataset Composition Overview

**VisionBlocks** contains samples across multiple categories, with both English-only (63.1%) and multilingual (36.9%) data:

- **Chart/Plot Reasoning**: DVQA, ChartQA, PlotQA, TabMWP (~405K samples)
- **General VQA**: VQAv2, RLAIF-4V (~488K samples)
- **Document VQA**: DocVQA, TextVQA, ST-VQA, PixMo-Docs (~46K samples)
- **Reasoning/Knowledge**: A-OKVQA, OKVQA, AI2D, ScienceQA (~29K samples)
- **Multilingual/Cultural**: Pangea-Cultural, Pangea-Multi, PixMo-Cap-Translated, CulturalGround datasets (~1.6M samples)
- **Specialized VQA**: IconQA, InfographicVQA, Stratos (~34K samples)
- **Counting/Math**: TallyQA, PixMo-Count (~107K samples)
- **Vision/Text**: VBlocks-PixMo collections, EuroBlocks-SFT (~2.2M samples)
- **Video/Text**: LLaVA-Video collections (~1.4M samples)

**Collection Types**: Human-annotated, synthetically generated, and professionally translated data, ensuring high quality and cultural diversity across 20+ languages.
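Once the VisionBlocks dataset is publicly available on the Hub, it should be loadable with the `datasets` library. A minimal sketch, assuming a standard `train` split (split and field names are assumptions until the dataset card is finalized):

```python
from datasets import load_dataset

# Assumption: a "train" split exists; stream to avoid downloading all 6.31M samples at once.
vb = load_dataset("utter-project/VisionBlocks", split="train", streaming=True)
print(next(iter(vb)).keys())  # inspect the available fields
```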
## Evaluation

All evaluations were conducted using [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).

### Multiple Purpose Multimodal Benchmarks

TowerVision demonstrates strong performance across diverse multimodal evaluation benchmarks.

*Figure: Multiple Purpose Multimodal Benchmarks results.*

### Multimodal Multilingual Translation Tasks

TowerVision excels particularly on multimodal multilingual translation benchmarks, demonstrating state-of-the-art cross-lingual visual communication capabilities.

*Figure: Multimodal Multilingual Translation results.*

### Supported Languages Performance

✅ **Fully Supported**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian, Chinese, Japanese, Korean, Hindi, Russian, Ukrainian

📊 **Benchmark Coverage**: Our models are evaluated across diverse multilingual vision-language tasks, demonstrating strong cross-lingual transfer capabilities and exceptional performance on culturally-aware benchmarks.

## Citation

If you find TowerVision useful in your research, please consider citing the following paper:

```bibtex
@article{towervision2025,
  title={Understanding and Improving Multilinguality in Vision-Language Models},
  author={[Authors to be added]},
  journal={[Journal to be added]},
  year={2025},
  note={Paper in preparation}
}
```

## Model Card Contact

For errors or additional questions about details in this model card, contact the research team.

## Acknowledgments

TowerVision builds upon the excellent work of:

- **[LLaVA-NeXT](https://github.com/GuilhermeViveiros/LLaVA-NeXT)** for the foundational vision-language architecture
- **[Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-9B)** language models for multilingual capabilities
- **[SigLIP 2](https://huggingface.co/google/siglip2-so400m-patch14-384)** for robust vision encoding
- The broader multilingual NLP and multimodal communities