- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 26
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 42
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 22

Collections
Collections including paper arxiv:2406.18521

- Multimodal Clembench
  🏆 Space • Explore and compare models on a leaderboard and plots • 3

- SEED-Bench Leaderboard
  🏆 Space • 81
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  Paper • 2311.16502 • Published • 35
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
  Paper • 2409.02813 • Published • 29

- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
  Paper • 2406.18521 • Published • 29
- We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
  Paper • 2407.01284 • Published • 77
- ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
  Paper • 2407.04172 • Published • 23

- VoCo-LLaMA: Towards Vision Compression with Large Language Models
  Paper • 2406.12275 • Published • 30
- TroL: Traversal of Layers for Large Language and Vision Models
  Paper • 2406.12246 • Published • 35
- Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
  Paper • 2406.15334 • Published • 9
- Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
  Paper • 2406.12742 • Published • 15

- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
  Paper • 2401.14405 • Published • 13
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
  Paper • 2406.18521 • Published • 29
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
  Paper • 2408.12590 • Published • 36
- Law of Vision Representation in MLLMs
  Paper • 2408.16357 • Published • 93

- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 26
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
  Paper • 2404.16790 • Published • 8
- Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
  Paper • 2405.07990 • Published • 20
- MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
  Paper • 2406.09411 • Published • 20

- FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
  Paper • 2403.06775 • Published • 4
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  Paper • 2010.11929 • Published • 7
- Data Incubation -- Synthesizing Missing Data for Handwriting Recognition
  Paper • 2110.07040 • Published • 2
- A Mixture of Expert Approach for Low-Cost Customization of Deep Neural Networks
  Paper • 1811.00056 • Published • 2