Collections
Collections including paper arxiv:2404.12390

- Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion
  Paper • 2310.03502 • Published • 78
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation
  Paper • 2404.07448 • Published • 12
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
  Paper • 2404.07973 • Published • 32
- COCONut: Modernizing COCO Segmentation
  Paper • 2404.08639 • Published • 29

- AQuA: A Benchmarking Tool for Label Quality Assessment
  Paper • 2306.09467 • Published • 1
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
  Paper • 2404.07972 • Published • 48
- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 26
- Vision language models are blind
  Paper • 2407.06581 • Published • 83

- Can large language models explore in-context?
  Paper • 2403.15371 • Published • 32
- LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
  Paper • 2403.15042 • Published • 26
- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 26
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
  Paper • 2404.12387 • Published • 39

- TinyLLaVA: A Framework of Small-scale Large Multimodal Models
  Paper • 2402.14289 • Published • 19
- ImageBind: One Embedding Space To Bind Them All
  Paper • 2305.05665 • Published • 5
- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 181
- Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
  Paper • 2206.02770 • Published • 3

- AtP*: An efficient and scalable method for localizing LLM behaviour to components
  Paper • 2403.00745 • Published • 13
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 609
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
  Paper • 2402.16840 • Published • 24
- LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
  Paper • 2402.13753 • Published • 116

- CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation
  Paper • 2401.12208 • Published • 22
- λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
  Paper • 2402.05195 • Published • 19
- PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
  Paper • 2402.10896 • Published • 16
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
  Paper • 2402.11690 • Published • 10