[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding
AI & ML interests: Computer Vision
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  Paper • 2504.10479 • Published • 280
- OpenGVLab/InternVL3-1B
  Image-Text-to-Text • 0.9B • Updated • 96.6k • 69
- OpenGVLab/InternVL3-2B
  Image-Text-to-Text • 2B • Updated • 48.8k • 36
- OpenGVLab/InternVL3-8B
  Image-Text-to-Text • 8B • Updated • 312k • 93
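The InternVL3 checkpoints listed above are hosted on the Hugging Face Hub. A minimal loading sketch follows, assuming the `transformers` library is installed; InternVL repositories ship custom modeling code, so `trust_remote_code=True` is typically required. The `internvl3_repo_id` helper is illustrative, not part of any official API.

```python
def internvl3_repo_id(size: str) -> str:
    """Map a size tag ("1B", "2B", "8B") to its Hub repo id, per the listing above."""
    sizes = {"1B", "2B", "8B"}
    if size not in sizes:
        raise ValueError(f"unknown InternVL3 size tag: {size}")
    return f"OpenGVLab/InternVL3-{size}"


def load_internvl3(size: str = "1B"):
    """Download and instantiate an InternVL3 checkpoint.

    Heavy call: fetches multi-GB weights from the Hub; shown for illustration.
    """
    from transformers import AutoModel, AutoTokenizer

    repo = internvl3_repo_id(size)
    # trust_remote_code is needed because InternVL defines its own model class.
    tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    model = AutoModel.from_pretrained(repo, trust_remote_code=True)
    return tokenizer, model
```

Calling `load_internvl3("8B")` would resolve to `OpenGVLab/InternVL3-8B`; the download itself is deferred until the function is invoked.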
A Pioneering Monolithic MLLM
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
  Paper • 2410.08202 • Published • 4
- Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
  Paper • 2507.12566 • Published • 14
- OpenGVLab/Mono-InternVL-2B
  Image-Text-to-Text • 3B • Updated • 12.3k • 36
- OpenGVLab/Mono-InternVL-2B-S1-1
  Image-Text-to-Text • 3B • Updated • 8
[NeurIPS 2024 Spotlight (Ranking Top 10), TPAMI 2025] Parameter-Inverted Image Pyramid Networks
- Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
  Paper • 2501.07783 • Published • 7
- OpenGVLab/PIIP
  Object Detection • Updated • 5
- OpenGVLab/PIIP-LLaVA_CLIP-BL_512-256_7B
  Image-Text-to-Text • 7B • Updated • 2
- OpenGVLab/PIIP-LLaVA_ConvNeXt-B_CLIP-L_640-224_7B
  Image-Text-to-Text • 7B • Updated • 6
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
  Paper • 2303.16727 • Published
- OpenGVLab/VideoMAEv2-Base
  Video Classification • 0.1B • Updated • 5.72k • 7
- OpenGVLab/VideoMAEv2-Large
  Video Classification • 0.3B • Updated • 22k • 1
- OpenGVLab/VideoMAEv2-Huge
  Video Classification • 0.6B • Updated • 134 • 1
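VideoMAE V2's dual masking drops most tokens on both the encoder and decoder side. As a rough illustration of why that pays off, the sketch below computes the spatiotemporal token count for a VideoMAE-style input and how many tokens survive masking. The defaults (16-frame clip, tubelet size 2, 16×16 patches at 224×224, 90% encoder masking) are the commonly used VideoMAE configuration and are assumptions here; individual checkpoints may differ.

```python
def videomae_token_count(frames: int = 16, side: int = 224,
                         tubelet: int = 2, patch: int = 16) -> int:
    """Total spatiotemporal tokens for a cubic-patch video encoder:
    (frames / tubelet) temporal slots times (side / patch)^2 spatial patches."""
    return (frames // tubelet) * (side // patch) ** 2


def visible_tokens(total: int, mask_ratio: float) -> int:
    """Tokens the encoder actually processes after random masking."""
    return total - int(total * mask_ratio)


total = videomae_token_count()      # 8 * 14 * 14 = 1568 tokens per clip
encoder_sees = visible_tokens(total, 0.90)  # ~10% of tokens reach the encoder
```

With these assumed defaults the encoder attends over 157 tokens instead of 1568, which is the main source of VideoMAE's pre-training efficiency.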
Better than InternVL 2.0
- InternVL (492)
  ⚡Chat with an AI that understands text and images
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  Paper • 2412.05271 • Published • 160
- OpenGVLab/InternVL2_5-78B
  Image-Text-to-Text • 78B • Updated • 418 • 192
- OpenGVLab/InternVL2_5-78B-AWQ
  Image-Text-to-Text • Updated • 26 • 14
Expanding Performance Boundaries of Open-Source MLLM
Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  Paper • 2312.14238 • Published • 20
- OpenGVLab/InternViT-6B-224px
  Image Feature Extraction • Updated • 337 • 24
- OpenGVLab/InternVL-14B-224px
  Image Feature Extraction • 14B • Updated • 352 • 35
- OpenGVLab/InternVL-Chat-V1-2-Plus
  Image-Text-to-Text • 40B • Updated • 29 • 34
Adaptation Models for Specific Domains
- OpenGVLab/Mini-InternVL2-4B-DA-DriveLM
  Image-Text-to-Text • 4B • Updated • 190 • 3
- OpenGVLab/Mini-InternVL2-4B-DA-Medical
  Image-Text-to-Text • 4B • Updated • 12 • 5
- OpenGVLab/Mini-InternVL2-4B-DA-BDD
  Image-Text-to-Text • 4B • Updated • 4
- OpenGVLab/Mini-InternVL2-2B-DA-DriveLM
  Image-Text-to-Text • 2B • Updated • 58
Chat-Centric Video Understanding
A Large-Scale Video-Text Dataset
Improved Baselines with Pyramid Vision Transformer
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
- OpenGVLab/InternVideo2_5_Chat_8B
  Video-Text-to-Text • 8B • Updated • 31.9k • 78
- OpenGVLab/InternVL_2_5_HiCo_R16
  Video-Text-to-Text • 8B • Updated • 1.74k • 4
- OpenGVLab/InternVL_2_5_HiCo_R64
  Video-Text-to-Text • 8B • Updated • 119 • 3
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
  Paper • 2501.12386 • Published • 1
Faster and more powerful VideoChat.
- OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448
  Video-Text-to-Text • 2B • Updated • 767 • 24
- OpenGVLab/VideoChat-Flash-Qwen2-7B_res224
  Video-Text-to-Text • 8B • Updated • 76 • 7
- OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
  Video-Text-to-Text • 8B • Updated • 3.57k • 12
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
  Paper • 2501.00574 • Published • 6
Enhancing the Reasoning Ability of MLLMs via Mixed Preference Optimization
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
  Paper • 2411.10442 • Published • 87
- OpenGVLab/InternVL2_5-78B-MPO
  Image-Text-to-Text • 78B • Updated • 554 • 54
- OpenGVLab/InternVL2_5-38B-MPO
  Image-Text-to-Text • 38B • Updated • 25.1k • 20
- OpenGVLab/InternVL2_5-26B-MPO
  Image-Text-to-Text • 26B • Updated • 80 • 14
A Pioneering Open-Source Alternative to GPT-4V
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  Paper • 2404.16821 • Published • 58
- OpenGVLab/InternVL-Chat-V1-5
  Image-Text-to-Text • 26B • Updated • 2.31k • 413
- OpenGVLab/InternViT-6B-448px-V1-5
  Image Feature Extraction • 6B • Updated • 397 • 78
- OpenGVLab/InternViT-300M-448px
  Image Feature Extraction • 0.3B • Updated • 6.72k • 55
Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
InternVideo2
- InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
  Paper • 2403.15377 • Published • 27
- OpenGVLab/InternVideo2-Chat-8B
  Video-Text-to-Text • 8B • Updated • 448 • 23
- OpenGVLab/InternVideo2_chat_8B_HD
  Video-Text-to-Text • 8B • Updated • 111 • 18
- OpenGVLab/InternVideo2_Chat_8B_InternLM2_5
  Video-Text-to-Text • 9B • Updated • 21 • 7
State Space Model for Efficient Video Understanding
A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
  Paper • 2211.05778 • Published
- OpenGVLab/internimage_t_1k_224
  Image Classification • 0.0B • Updated • 76 • 1
- OpenGVLab/internimage_s_1k_224
  Image Classification • 0.1B • Updated • 5 • 1
- OpenGVLab/internimage_b_1k_224
  Image Classification • 0.1B • Updated • 42 • 1