-
Rho-1: Not All Tokens Are What You Need
Paper • 2404.07965 • Published • 90 -
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
Paper • 2404.10667 • Published • 18 -
Instruction-tuned Language Models are Better Knowledge Learners
Paper • 2402.12847 • Published • 26 -
DoRA: Weight-Decomposed Low-Rank Adaptation
Paper • 2402.09353 • Published • 26
Collections
Discover the best community collections!
Collections including paper arxiv:2501.09038
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 26 -
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper • 2404.12803 • Published • 30 -
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper • 2404.13013 • Published • 31 -
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper • 2404.06512 • Published • 30
-
GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 192 -
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Paper • 2311.16502 • Published • 35 -
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 26 -
RULER: What's the Real Context Size of Your Long-Context Language Models?
Paper • 2404.06654 • Published • 35
-
High-Quality Image Restoration Following Human Instructions
Paper • 2401.16468 • Published • 13 -
Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding
Paper • 2401.15708 • Published • 12 -
Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
Paper • 2401.14688 • Published • 13 -
TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts
Paper • 2401.14828 • Published • 9
-
DocGraphLM: Documental Graph Language Model for Information Extraction
Paper • 2401.02823 • Published • 36 -
Understanding LLMs: A Comprehensive Overview from Training to Inference
Paper • 2401.02038 • Published • 63 -
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 181 -
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
Paper • 2309.01131 • Published • 1
-
GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
Paper • 2312.04557 • Published • 13 -
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
Paper • 2312.04410 • Published • 15 -
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
Paper • 2312.04461 • Published • 62 -
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
Paper • 2401.02955 • Published • 22