Collections

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 26

Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 13

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 42

EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 22

Can Large Language Models Understand Context?

OLMo: Accelerating the Science of Language Models

Self-Rewarding Language Models

SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

yale-nlp/MMVU

yale-nlp/MMVU-evaluation-results

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Evolving Deeper LLM Thinking

PaSa: An LLM Agent for Comprehensive Academic Paper Search

Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Are Vision-Language Models Truly Understanding Multi-vision Sensor?

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

VLSBench: Unveiling Visual Leakage in Multimodal Safety

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

BLINK: Multimodal Large Language Models Can See but Not Perceive

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

GAIA: a benchmark for General AI Assistants

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

BLINK: Multimodal Large Language Models Can See but Not Perceive

RULER: What's the Real Context Size of Your Long-Context Language Models?