- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (Paper 2502.14786, published 3 days ago)
- AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding (Paper 2502.01341, published 20 days ago)
- BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks (Paper 2412.04626, published Dec 5, 2024)
- ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild (Paper 2407.04172, published Jul 4, 2024)