Submitted by Senqiao 107 VisionZip: Longer is Better but Not Necessary in Vision Language Models · 7 authors 13
Submitted by ranpox 64 Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction · 9 authors 6
Submitted by jiuhai 60 Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion · 7 authors 4
Submitted by JeffreyXiang 60 Structured 3D Latents for Scalable and Versatile 3D Generation · 9 authors 8
Submitted by Zhoues 38 Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection · 8 authors 3
Submitted by Crayon-Shinchan 23 AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models · 8 authors 2
Submitted by jsingh 23 Negative Token Merging: Image-based Adversarial Feature Guidance · 10 authors 6
Submitted by dvilasuero 18 Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation · 23 authors 2
Submitted by leo1117 18 Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis · 8 authors 2
Submitted by Franck-Dernoncourt 14 Personalized Multimodal Large Language Models: A Survey · 27 authors 2
Submitted by BryanW 14 HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing · 7 authors 2
Submitted by jacklishufan 13 OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows · 7 authors 2
Submitted by ltzheng 10 MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation · 9 authors 2
Submitted by akhaliq 10 Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement · 20 authors 2
Submitted by kpzhang996 9 ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality · 7 authors 2
Submitted by james371507 8 4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion · 10 authors 3
Submitted by JungleGym 7 p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay · 6 authors 2
Submitted by russwang 7 Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension · 9 authors 2
Submitted by haoningwu 6 MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities · 5 authors 2
Submitted by ethanbradley 5 SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction · 4 authors 2
Submitted by liujch1998 3 Establishing Task Scaling Laws via Compute-Efficient Model Ladders · 12 authors 2