- MIO: A Foundation Model on Multimodal Tokens
  Paper • 2409.17692 • Published • 53
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  Paper • 2010.11929 • Published • 7
- Going deeper with Image Transformers
  Paper • 2103.17239 • Published
- Training data-efficient image transformers & distillation through attention
  Paper • 2012.12877 • Published • 2
Collections including paper arxiv:2111.06377
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  Paper • 2403.09611 • Published • 126
- Evolutionary Optimization of Model Merging Recipes
  Paper • 2403.13187 • Published • 52
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
  Paper • 2402.03766 • Published • 14
- LLM Agent Operating System
  Paper • 2403.16971 • Published • 65
- Masked Autoencoders Are Scalable Vision Learners
  Paper • 2111.06377 • Published • 3
- Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
  Paper • 2311.00430 • Published • 58
- distil-whisper/distil-large-v2
  Automatic Speech Recognition • Updated • 466k • 505
- Seven Failure Points When Engineering a Retrieval Augmented Generation System
  Paper • 2401.05856 • Published • 2