- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 26
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 42
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 22
Collections including paper arxiv:2405.10300

- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
  Paper • 2405.08748 • Published • 24
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
  Paper • 2405.10300 • Published • 28
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
  Paper • 2405.09818 • Published • 131
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
  Paper • 2405.11143 • Published • 37

- ibm-granite/granite-34b-code-instruct-8k
  Text Generation • Updated • 1.89k • 74
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
  Paper • 2405.10300 • Published • 28
- rain1011/pyramid-flow-sd3
  Text-to-Video • Updated • 817
- answerdotai/ModernBERT-large
  Fill-Mask • Updated • 667k • 354

- No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
  Paper • 2404.04125 • Published • 28
- Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
  Paper • 2404.08197 • Published • 29
- Probing the 3D Awareness of Visual Foundation Models
  Paper • 2404.08636 • Published • 13
- AM-RADIO: Agglomerative Model -- Reduce All Domains Into One
  Paper • 2312.06709 • Published • 1

- LocalMamba: Visual State Space Model with Windowed Selective Scan
  Paper • 2403.09338 • Published • 8
- GiT: Towards Generalist Vision Transformer through Universal Language Interface
  Paper • 2403.09394 • Published • 26
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
  Paper • 2402.19479 • Published • 33
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
  Paper • 2405.10300 • Published • 28

- YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
  Paper • 2402.13616 • Published • 47
- Vision Transformers Need Registers
  Paper • 2309.16588 • Published • 78
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
  Paper • 2405.10300 • Published • 28
- SAMPart3D: Segment Any Part in 3D Objects
  Paper • 2411.07184 • Published • 26