Submitted by Zery 65 X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models · 9 authors 2
Submitted by dbaranchuk 35 Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis · 5 authors 3
Submitted by akhaliq 26 FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait · 3 authors 6
Submitted by wren93 26 VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation · 5 authors 2
Submitted by caizhongang 23 SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters · 10 authors 2
Submitted by HYeungLee 20 TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video · 6 authors 2
Submitted by kpzhang996 18 GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation · 18 authors 2
Submitted by mpatel57 16 Steering Rectified Flow Models in the Vector Field for Controlled Image Generation · 4 authors 8
Submitted by BK-Lee 15 VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models · 5 authors 2
Submitted by rubenohana 14 The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning · 26 authors 2
Submitted by atcbosselut 11 INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge · 59 authors 2
Submitted by BestWishYsh 11 WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model · 7 authors 2
Submitted by Cakeyan 9 Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation · 6 authors 2
Submitted by jomat 8 Art-Free Generative Models: Art Creation Without Graphic Art Knowledge · 5 authors 3
Submitted by ryokamoi 8 VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information · 5 authors 2
Submitted by zhangysk 6 PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos · 10 authors 2
Submitted by yanxi-chen 6 A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models · 5 authors 2
Submitted by amanchadha 4 Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting · 8 authors 2
Submitted by ftaioli 4 Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input · 7 authors 2
Submitted by hyzhou404 3 HUGSIM: A Real-Time, Photo-Realistic and Closed-Loop Simulator for Autonomous Driving · 9 authors 2
Submitted by amanchadha 2 Improving speaker verification robustness with synthetic emotional utterances · 6 authors 2
Submitted by callmesan 1 Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning · 3 authors 2