|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- multimodal |
|
|
- multilingual |
|
|
- vlm |
|
|
- translation |
|
|
language: |
|
|
- en |
|
|
- de |
|
|
- nl |
|
|
- es |
|
|
- fr |
|
|
- pt |
|
|
- uk |
|
|
- hi |
|
|
- zh |
|
|
- ru |
|
|
- cs |
|
|
- ko |
|
|
- ja |
|
|
- it |
|
|
- pl |
|
|
- ro |
|
|
- nb |
|
|
- nn |
|
|
base_model: |
|
|
- Unbabel/Tower-Plus-9B |
|
|
pipeline_tag: image-text-to-text |
|
|
license: cc-by-nc-sa-4.0 |
|
|
--- |
|
|
|
|
|
# Model Card for TowerVision |
|
|
|
|
|
<p align="left"> |
|
|
<img src="Tower.png" alt="TowerVision Logo" width="300"> |
|
|
</p> |
|
|
|
|
|
TowerVision is a family of open-source multilingual vision-language models optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, and question answering. **TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally-aware tasks**, demonstrating exceptional performance across **20 languages and dialects**.
|
|
|
|
|
This model card covers the TowerVision family: the 2B and 9B parameter models, each available as an instruction-tuned (it) variant and as a pretrained (pt) variant that has not undergone instruction tuning.
|
|
|
|
|
- **Model Family**: TowerVision (2B, 9B variants) |
|
|
- **Context length**: 8192 tokens |
|
|
- **Languages**: 20 languages and dialects spanning European, Asian, and other language families
|
|
|
|
|
<span style="font-size: 1.2em;"><strong>🌟 Try TowerVision</strong></span>: [Project Page](https://guilhermeviveiros.github.io/TowerVision.io/) | [Code Repository](https://github.com/GuilhermeViveiros/LLaVA-NeXT) |
|
|
|
|
|
## Available Models |
|
|
|
|
|
|
|
|
|
|
| Model | Parameters | HF Link | |
|
|
|-------|------------|---------| |
|
|
| TowerVision-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVision-2B) |
|
|
| TowerVision-2B-pt | 2B | [🤗 utter-project/TowerVision-2B-pt](https://huggingface.co/utter-project/TowerVision-2B-pt) |
|
|
| TowerVision-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B) |
|
|
| TowerVision-9B-pt | 9B | [🤗 utter-project/TowerVision-9B-pt](https://huggingface.co/utter-project/TowerVision-9B-pt) |
|
|
|
|
|
## How to Use TowerVision |
|
|
|
|
|
When using the model, make sure your prompt is formatted correctly!

Also, we recommend running in **bfloat16** rather than **fp16** or **fp32**.
|
|
|
|
|
### Quick Start with Transformers |
|
|
|
|
|
<details open> |
|
|
<summary>Click to expand/collapse code</summary> |
|
|
|
|
|
```python |
|
|
from transformers import ( |
|
|
LlavaNextProcessor, |
|
|
LlavaNextForConditionalGeneration |
|
|
) |
|
|
import requests |
|
|
from PIL import Image |
|
|
|
|
|
model_id = "utter-project/TowerVision-2B" # or any other variant |
|
|
|
|
|
def prepare_prompt(query): |
|
|
conversation = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": f"<image>\n{query}" |
|
|
} |
|
|
] |
|
|
|
|
|
# Format message with the towervision chat template |
|
|
prompt = processor.apply_chat_template( |
|
|
conversation, |
|
|
tokenize=False, |
|
|
add_generation_prompt=True |
|
|
) |
|
|
|
|
|
return prompt |
|
|
|
|
|
# we recommend using "bfloat16" as torch_dtype |
|
|
kwargs = { |
|
|
"torch_dtype": "bfloat16", |
|
|
"device_map": "auto", |
|
|
} |
|
|
processor = LlavaNextProcessor.from_pretrained(model_id) |
|
|
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, **kwargs) |
|
|
|
|
|
# img url |
|
|
img_url = "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f" |
|
|
image = Image.open(requests.get(img_url, stream=True).raw) |
|
|
|
|
|
# Multilingual prompts - TowerVision supports 20+ languages! |
|
|
prompt = prepare_prompt("Is this person really big, or is this building just super small?") |
|
|
|
|
|
# Prepare inputs |
|
|
inputs = processor( |
|
|
text=prompt, images=image, return_tensors="pt" |
|
|
).to(model.device) |
|
|
|
|
|
# Generate response ids |
|
|
gen_tokens = model.generate(**inputs, max_new_tokens=512) |
|
|
# Decode response |
|
|
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
### Batch Inference with Transformers |
|
|
|
|
|
For processing multiple images and prompts simultaneously: |
|
|
|
|
|
<details> |
|
|
<summary>Click to expand/collapse code</summary> |
|
|
|
|
|
```python |
|
|
# Re-stated here so this snippet runs on its own (same setup as the Quick Start)
import requests
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "utter-project/TowerVision-2B"  # or any other variant

def prepare_prompts(queries):
|
|
prompts = [] |
|
|
for query in queries: |
|
|
conversation = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": f"<image>\n{query}" |
|
|
} |
|
|
] |
|
|
|
|
|
# Format message with the towervision chat template |
|
|
prompt = processor.apply_chat_template( |
|
|
conversation, |
|
|
tokenize=False, |
|
|
add_generation_prompt=True |
|
|
) |
|
|
prompts.append(prompt) |
|
|
return prompts |
|
|
|
|
|
# we recommend using "bfloat16" as torch_dtype |
|
|
kwargs = { |
|
|
"torch_dtype": "bfloat16", |
|
|
"device_map": "auto", |
|
|
} |
|
|
processor = LlavaNextProcessor.from_pretrained(model_id) |
|
|
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, **kwargs) |
|
|
|
|
|
# Sample images and queries for batch processing |
|
|
img_urls = [ |
|
|
"https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f", |
|
|
"https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f", |
|
|
] |
|
|
|
|
|
queries = [ |
|
|
"Is this person really big, or is this building just super small?", |
|
|
"Where was this photo taken?" |
|
|
] |
|
|
|
|
|
# Number of image/query pairs to process in this batch
batch_size = 2

# Load images
images = []
for url in img_urls[:batch_size]:
|
|
image = Image.open(requests.get(url, stream=True).raw) |
|
|
images.append(image) |
|
|
|
|
|
# Prepare prompts |
|
|
prompts = prepare_prompts(queries[:batch_size]) |
|
|
|
|
|
# Prepare batch inputs |
|
|
inputs = processor( |
|
|
text=prompts, |
|
|
images=images, |
|
|
return_tensors="pt", |
|
|
padding=True |
|
|
).to(model.device) |
|
|
|
|
|
# Generate response ids for batch |
|
|
gen_tokens = model.generate(**inputs, max_new_tokens=512, do_sample=False) |
|
|
|
|
|
# Decode responses |
|
|
print(f"Batch processing {len(images)} images:") |
|
|
print("-" * 50) |
|
|
|
|
|
for i in range(len(images)): |
|
|
input_length = inputs.input_ids[i].shape[0] |
|
|
response = processor.tokenizer.decode( |
|
|
gen_tokens[i][input_length:], |
|
|
skip_special_tokens=True |
|
|
) |
|
|
print(f"Response: {response}") |
|
|
print("-" * 50) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
### Pipeline Usage |
|
|
|
|
|
<details>

<summary>Click to expand/collapse code</summary>
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
from PIL import Image |
|
|
import requests |
|
|
|
|
|
|
|
|
pipe = pipeline( |
|
|
model="utter-project/TowerVision-9B", |
|
|
task="image-text-to-text", |
|
|
device_map="auto", |
|
|
dtype="bfloat16" |
|
|
) |
|
|
|
|
|
def prepare_prompt(query): |
|
|
conversation = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": f"<image>\n{query}" |
|
|
} |
|
|
] |
|
|
|
|
|
# Format message with the towervision chat template |
|
|
return pipe.processor.apply_chat_template( |
|
|
conversation, |
|
|
tokenize=False, |
|
|
add_generation_prompt=True |
|
|
) |
|
|
|
|
|
|
|
|
img_url = "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f" |
|
|
image = Image.open(requests.get(img_url, stream=True).raw) |
|
|
text = prepare_prompt("Is this person really big, or is this building just super small?") |
|
|
|
|
|
outputs = pipe(text=text, images=image, max_new_tokens=300, return_full_text=False) |
|
|
print(outputs) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
## Model Details |
|
|
|
|
|
**Input**: The model accepts text and images as input.

**Output**: The model generates text in multiple languages.
|
|
|
|
|
**Model Architecture**: TowerVision pairs a multilingual language model based on [Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-9B) (2B and 9B parameters) with the [SigLIP 2 so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder through a multimodal adapter for vision-language understanding.
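
For a quick look at how these components are assembled, the Hub configuration exposes the text and vision sub-configurations. A minimal sketch (the attribute names follow the standard LLaVA-NeXT config layout in `transformers`):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("utter-project/TowerVision-9B")

print(config.model_type)                # composite model type, e.g. "llava_next"
print(config.text_config.model_type)    # language backbone (Tower-Plus based)
print(config.vision_config.model_type)  # vision encoder (SigLIP 2 based)
print(config.vision_config.image_size)  # base input resolution of the vision encoder
```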
|
|
|
|
|
**Recommended Precision**: We recommend using `bfloat16` precision for optimal performance and memory efficiency when running TowerVision models. |
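
The dtype can also be passed as an explicit `torch.dtype` object instead of the string used in the examples above; a minimal sketch:

```python
import torch
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "utter-project/TowerVision-9B"  # or any other variant

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # recommended precision
    device_map="auto",
)
```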
|
|
|
|
|
**Languages Covered**: The model has been trained on **20 languages and dialects**: |
|
|
- **European languages**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian (Bokmål & Nynorsk) |
|
|
- **Asian languages**: Chinese (Simplified & Traditional), Japanese, Korean, Hindi |
|
|
- **Other languages**: Russian, Ukrainian |
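
Prompts can be written directly in any of these languages using the same chat template as above; a minimal, self-contained sketch (the German query is only an illustration):

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "utter-project/TowerVision-2B"  # or any other variant

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Any supported language works in the prompt; German is used here as an example.
conversation = [{"role": "user", "content": "<image>\nWas ist auf diesem Bild zu sehen?"}]
prompt = processor.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)

img_url = "https://cms.mistral.ai/assets/a10b924e-56b3-4359-bf6c-571107811c8f"
image = Image.open(requests.get(img_url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
gen_tokens = model.generate(**inputs, max_new_tokens=256)
print(processor.tokenizer.decode(
    gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
))
```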
|
|
|
|
|
**Key Strengths**: |
|
|
- **🏆 Exceptional performance on culturally-aware benchmarks** with deep understanding of cultural contexts and visual nuances |
|
|
- **🌐 State-of-the-art results on multimodal multilingual translation benchmarks**, enabling seamless cross-lingual visual communication |
|
|
- **📊 Strong cross-lingual transfer capabilities** across diverse vision-language tasks |
|
|
|
|
|
## Training Data |
|
|
|
|
|
TowerVision models are trained on **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories: |
|
|
|
|
|
| Dataset | Samples | HF Link | Status |
|
|
|---------|---------|---------|-------| |
|
|
| VisionBlocks | 6.31M | [🤗 utter-project/VisionBlocks](https://huggingface.co/datasets/utter-project/VisionBlocks) | Coming Soon | |
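
Once released, the dataset should be loadable with 🤗 Datasets; a minimal sketch (the `train` split name is an assumption until the dataset is published):

```python
from datasets import load_dataset

# Hypothetical once VisionBlocks is public; the "train" split name is an assumption.
ds = load_dataset("utter-project/VisionBlocks", split="train")
print(ds)            # schema and number of rows
print(ds[0].keys())  # field names of a single sample
```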
|
|
|
|
|
### Dataset Statistics |
|
|
- **Total samples**: 6.31M |
|
|
- **Created by our team**: 1.21M samples (~19%) |
|
|
- **Human-collected/external**: 5.10M samples (~81%) |
|
|
|
|
|
### Dataset Composition Overview |
|
|
|
|
|
**VisionBlocks** contains samples across multiple categories with both English-only (63.1%) and multilingual (36.9%) data: |
|
|
|
|
|
- **Chart/Plot Reasoning**: DVQA, ChartQA, PlotQA, TabMWP (~405K samples) |
|
|
- **General VQA**: VQAv2, RLAIF-4V (~488K samples) |
|
|
- **Document VQA**: DocVQA, TextVQA, ST-VQA, PixMo-Docs (~46K samples) |
|
|
- **Reasoning/Knowledge**: A-OKVQA, OKVQA, AI2D, ScienceQA (~29K samples) |
|
|
- **Multilingual/Cultural**: Pangea-Cultural, Pangea-Multi, PixMo-Cap-Translated, CulturalGround datasets (~1.6M samples) |
|
|
- **Specialized VQA**: IconQA, InfographicVQA, Stratos (~34K samples) |
|
|
- **Counting/Math**: TallyQA, PixMo-Count (~107K samples) |
|
|
- **Vision/Text**: VBlocks-PixMo collections, EuroBlocks-SFT (~2.2M samples) |
|
|
- **Video/Text**: LLaVA-Video collections (~1.4M samples) |
|
|
|
|
|
**Collection Types**: Human-annotated, synthetically generated, and professionally translated data ensuring high quality and cultural diversity across 20+ languages. |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
All evaluations were conducted using [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval). |
|
|
|
|
|
### Multiple Purpose Multimodal Benchmarks |
|
|
|
|
|
TowerVision demonstrates strong performance across diverse multimodal evaluation benchmarks: |
|
|
|
|
|
<img src="mc-eval1.png" alt="Multiple Purpose Multimodal Benchmarks Results" width="600"> |
|
|
|
|
|
### Multimodal Multilingual Translation Tasks |
|
|
|
|
|
TowerVision excels particularly in multimodal multilingual translation benchmarks, demonstrating state-of-the-art cross-lingual visual communication capabilities: |
|
|
|
|
|
<img src="mc-eval2.png" alt="Multimodal Multilingual Translation Results" width="600"> |
|
|
|
|
|
### Supported Languages Performance |
|
|
|
|
|
✅ **Fully Supported**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian, Chinese, Japanese, Korean, Hindi, Russian, Ukrainian |
|
|
|
|
|
📊 **Benchmark Coverage**: Our models are evaluated across diverse multilingual vision-language tasks, demonstrating strong cross-lingual transfer capabilities and exceptional performance in culturally-aware benchmarks. |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find TowerVision useful in your research, please consider citing the following paper: |
|
|
|
|
|
```bibtex |
|
|
@article{towervision2025, |
|
|
title={Understanding and Improving Multilinguality in Vision-Language Models}, |
|
|
author={[Authors to be added]}, |
|
|
journal={[Journal to be added]}, |
|
|
year={2025}, |
|
|
note={Paper in preparation} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For errors or additional questions about details in this model card, contact the research team. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
TowerVision builds upon the excellent work of: |
|
|
- **[LLaVA-NeXT](https://github.com/GuilhermeViveiros/LLaVA-NeXT)** for the foundational vision-language architecture |
|
|
- **[Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-9B)** language models for multilingual capabilities |
|
|
- **[SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384)** for robust vision encoding |
|
|
- The broader multilingual NLP and multimodal communities |