2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper β’ 2501.00958 β’ Published Jan 1 β’ 99
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis Paper β’ 2412.19723 β’ Published Dec 27, 2024 β’ 82
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper β’ 2412.04467 β’ Published Dec 5, 2024 β’ 107
PaliGemma 2: A Family of Versatile VLMs for Transfer Paper β’ 2412.03555 β’ Published Dec 4, 2024 β’ 128
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper β’ 2411.17465 β’ Published Nov 26, 2024 β’ 80
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives Paper β’ 2501.04003 β’ Published Jan 7 β’ 25
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper β’ 2501.00599 β’ Published Dec 31, 2024 β’ 41
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs Paper β’ 2501.06186 β’ Published Jan 10 β’ 61
DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests Paper β’ 2501.04671 β’ Published Jan 8
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering Paper β’ 2502.03628 β’ Published 18 days ago β’ 11