SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • arXiv:2502.14786 • Published Feb 20, 2025 • 100 upvotes
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm Paper • arXiv:2502.12513 • Published Feb 2025 • 15 upvotes
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Paper • arXiv:2502.11089 • Published Feb 16, 2025 • 133 upvotes
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning Paper • arXiv:2411.18203 • Published Nov 27, 2024 • 34 upvotes
Multimodal Autoregressive Pre-training of Large Vision Encoders Paper • arXiv:2411.14402 • Published Nov 21, 2024 • 43 upvotes
ORID: Organ-Regional Information Driven Framework for Radiology Report Generation Paper • arXiv:2411.13025 • Published Nov 20, 2024 • 2 upvotes
Unicom: Universal and Compact Representation Learning for Image Retrieval Paper • arXiv:2304.05884 • Published Apr 12, 2023 • 2 upvotes
RWKV-CLIP: A Robust Vision-Language Representation Learner Paper • arXiv:2406.06973 • Published Jun 11, 2024 • 1 upvote
High-Fidelity Facial Albedo Estimation via Texture Quantization Paper • arXiv:2406.13149 • Published Jun 19, 2024 • 2 upvotes
Multi-label Cluster Discrimination for Visual Representation Learning Paper • arXiv:2407.17331 • Published Jul 24, 2024 • 2 upvotes
Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension Paper • arXiv:2410.14332 • Published Oct 18, 2024 • 1 upvote
CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination Paper • arXiv:2408.09441 • Published Aug 18, 2024 • 2 upvotes
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption Paper • arXiv:2308.08428 • Published Aug 16, 2023 • 1 upvote