Self-Rewarding Vision-Language Model via Reasoning Decomposition · 11 authors · submitted by wyu1
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning · 11 authors · submitted by Zery
Analysing Chain of Thought Dynamics: Active Guidance or Unfaithful Post-hoc Rationalisation? · 4 authors · submitted by XingweiT
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies · 10 authors · submitted by Liang-ZX
MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation · 8 authors · submitted by zParquet
AudioStory: Generating Long-Form Narrative Audio with Large Language Models · 7 authors · submitted by wybertwang
Predicting the Order of Upcoming Tokens Improves Language Modeling · 3 authors · submitted by zaydzuhri
Gaze into the Heart: A Multi-View Video Dataset for rPPG and Health Biomarkers Estimation · 7 authors · submitted by blinoff
Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents · 6 authors · submitted by Jungang
MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment · 5 authors · submitted by taesiri
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models · 4 authors · submitted by lilvjosephtang
DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis · 7 authors · submitted by taesiri
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference · 11 authors · submitted by taesiri