Collections

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 26

Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 13

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 42

EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 22

Personalized Visual Instruction Tuning

Paper • 2410.07113 • Published Oct 9, 2024 • 70

CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation

SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

Differential Transformer

Rethinking Data Selection at Scale: Random Selection is Almost All You Need

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Emergent properties with repeated examples

Baichuan-Omni Technical Report

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors

Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

LLMs + Persona-Plug = Personalized LLMs

PDFTriage: Question Answering over Long, Structured Documents

Adapting Large Language Models via Reading Comprehension

Table-GPT: Table-tuned GPT for Diverse Table Tasks

Context-Aware Meta-Learning

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

MAVIS: Mathematical Visual Instruction Tuning

Kvasir-VQA: A Text-Image Pair GI Tract Dataset

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

An Introduction to Vision-Language Modeling

Matryoshka Multimodal Models

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

What Matters in Transformers? Not All Attention is Needed