Collections

Collections including paper arxiv:2311.05437

- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 181
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
  Paper • 2401.00849 • Published • 17
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 50
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
  Paper • 2311.00571 • Published • 40

- Communicative Agents for Software Development
  Paper • 2307.07924 • Published • 5
- Self-Refine: Iterative Refinement with Self-Feedback
  Paper • 2303.17651 • Published • 2
- ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
  Paper • 2312.10003 • Published • 39
- ReAct: Synergizing Reasoning and Acting in Language Models
  Paper • 2210.03629 • Published • 18

- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
  Paper • 2402.17177 • Published • 87
- Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
  Paper • 2403.13248 • Published • 78
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 50
- UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models
  Paper • 2409.20551 • Published • 15

- Visual In-Context Prompting
  Paper • 2311.13601 • Published • 19
- Textbooks Are All You Need
  Paper • 2306.11644 • Published • 142
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework
  Paper • 2308.08155 • Published • 7
- LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models
  Paper • 2303.02927 • Published • 3

- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
  Paper • 2401.01885 • Published • 28
- Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance
  Paper • 2401.15687 • Published • 23
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
  Paper • 2312.17172 • Published • 28
- MouSi: Poly-Visual-Expert Vision-Language Models
  Paper • 2401.17221 • Published • 9

- Visual Instruction Tuning
  Paper • 2304.08485 • Published • 13
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 50
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 37
- Aligning Large Multimodal Models with Factually Augmented RLHF
  Paper • 2309.14525 • Published • 30

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 146
- ReFT: Reasoning with Reinforced Fine-Tuning
  Paper • 2401.08967 • Published • 30
- Tuning Language Models by Proxy
  Paper • 2401.08565 • Published • 23
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 69

- Llava
  Space • 57 • Chat with LLaVA using images and text
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 50
- Ziya-VL: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning
  Paper • 2310.08166 • Published • 1
- OOTDiffusion
  Space • 1.01k • High-quality virtual try-on ~ Your cyber fitting room