Add initial model card for OpenVision 2 with metadata and links
This pull request adds a comprehensive initial model card for the OpenVision 2 model to its Hugging Face repository. This update improves the documentation and discoverability of the model by:
- Adding `pipeline_tag: image-text-to-text` to the metadata, which helps users discover the model at https://huggingface.co/models?pipeline_tag=image-text-to-text.
- Including `library_name: open_clip` in the metadata, based on the `open_clip_config.json` file; this signals compatibility with the `open_clip` library and enables automated usage snippets (a minimal sketch follows this list).
- Linking directly to the official paper: [OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning](https://huggingface.co/papers/2509.01644).
- Adding a link to the official project page: https://ucsc-vlaa.github.io/OpenVision2/.
- Including a link to the GitHub repository for the code: https://github.com/UCSC-VLAA/OpenVision/blob/main/src/main_openvision2.py.
- Providing the paper's abstract, model details, and a citation guide in the markdown content.
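
As a rough illustration of the kind of `open_clip` usage snippet this metadata enables, here is a minimal sketch. The repository id below is a placeholder, and since OpenVision 2 drops the text encoder, only image-feature extraction is shown; neither detail is confirmed by this PR.

```python
# Hypothetical sketch of loading OpenVision 2 through open_clip's hf-hub integration.
# NOTE: the repo id below is a placeholder, not the actual repository name.
import torch
from PIL import Image
import open_clip

# create_model_and_transforms returns (model, train_preprocess, eval_preprocess)
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:UCSC-VLAA/openvision2-vit-l-14"  # placeholder repo id
)
model.eval()

# OpenVision 2 is a vision-only, generatively pretrained encoder, so we only
# extract image features here; no text tower is assumed.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

print(image_features.shape)
```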
@@ -0,0 +1,33 @@
+---
+pipeline_tag: image-text-to-text
+library_name: open_clip
+---
+
+# OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
+
+This repository hosts the **OpenVision 2** model, a family of generative pretrained visual encoders designed for efficient multimodal learning. OpenVision 2 offers a simplified architecture and loss design, enhancing training efficiency while maintaining competitive performance across various multimodal benchmarks.
+
+- 📚 [Paper](https://huggingface.co/papers/2509.01644)
+- 🌐 [Project Page](https://ucsc-vlaa.github.io/OpenVision2/)
+- 💻 [GitHub Repository](https://github.com/UCSC-VLAA/OpenVision/blob/main/src/main_openvision2.py)
+
+## Abstract
+This paper provides a simplification on OpenVision's architecture and loss design for enhancing its training efficiency. Following the prior vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs like LLaVA, our changes are straightforward: we remove the text encoder (and therefore the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 competitively matches the original model's performance on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h), and memory usage by about 1.8x (from 24.5GB to 13.8GB, equivalently allowing the maximum batch size to grow from 2k to 8k). This superior training efficiency also allows us to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We hold a strong belief that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.
+
+## Model Details
+OpenVision 2 achieves superior training efficiency by simplifying its architecture, removing the text encoder, and relying solely on a captioning loss. This generative-only paradigm significantly reduces training time and memory footprint, enabling the development of larger, more powerful vision encoders for multimodal tasks. For instance, using ViT-L/14, OpenVision 2 achieves a 1.5x reduction in training time and a 1.8x reduction in memory usage compared to its predecessor.
+
+## Usage
+For detailed instructions on how to use OpenVision 2, please refer to the [GitHub repository](https://github.com/UCSC-VLAA/OpenVision/blob/main/src/main_openvision2.py).
+
+## Citation
+If you find OpenVision 2 useful in your research, please cite the original paper:
+
+```bibtex
+@article{paper_title,
+  title={OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning},
+  author={Anonymous},
+  journal={arXiv preprint arXiv:2509.01644},
+  year={2025}
+}
+```