Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2404.06212

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 181
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 17
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 50
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 40

image llm works

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Paper • 2404.19752 • Published Apr 30, 2024 • 24
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Paper • 2404.16821 • Published Apr 25, 2024 • 56
MoAI: Mixture of All Intelligence for Large Language and Vision Models

Paper • 2403.07508 • Published Mar 12, 2024 • 75
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Paper • 2403.09611 • Published Mar 14, 2024 • 126

Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

Paper • 2404.08197 • Published Apr 12, 2024 • 29
OmniFusion Technical Report

Paper • 2404.06212 • Published Apr 9, 2024 • 75

OmniFusion Technical Report

Paper • 2404.06212 • Published Apr 9, 2024 • 75
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Paper • 2404.06512 • Published Apr 9, 2024 • 30

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Paper • 2404.07143 • Published Apr 10, 2024 • 106
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

Paper • 2404.05961 • Published Apr 9, 2024 • 65
OmniFusion Technical Report

Paper • 2404.06212 • Published Apr 9, 2024 • 75

OmniFusion Technical Report

Paper • 2404.06212 • Published Apr 9, 2024 • 75

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

Paper • 2404.05961 • Published Apr 9, 2024 • 65
OmniFusion Technical Report

Paper • 2404.06212 • Published Apr 9, 2024 • 75
Adapting LLaMA Decoder to Vision Transformer

Paper • 2404.06773 • Published Apr 10, 2024 • 18
BRAVE: Broadening the visual encoding of vision-language models

Paper • 2404.07204 • Published Apr 10, 2024 • 19

Papers - Image - LLaVA

OmniFusion Technical Report

Paper • 2404.06212 • Published Apr 9, 2024 • 75
On Speculative Decoding for Multimodal Large Language Models

Paper • 2404.08856 • Published Apr 13, 2024 • 14
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Paper • 2404.16994 • Published Apr 25, 2024 • 36

Papers - Skoltech

OmniFusion Technical Report

Paper • 2404.06212 • Published Apr 9, 2024 • 75
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

Paper • 2310.03502 • Published Oct 5, 2023 • 78

Papers - Sber AI

OmniFusion Technical Report

Paper • 2404.06212 • Published Apr 9, 2024 • 75
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

Paper • 2310.03502 • Published Oct 5, 2023 • 78

Previous
1
2
Next

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs