Collections
Discover the best community collections!
Collections including paper arxiv:2508.20072
-
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Paper • 2507.01925 • Published • 37 -
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Paper • 2507.04447 • Published • 43 -
A Survey on Vision-Language-Action Models for Autonomous Driving
Paper • 2506.24044 • Published • 14 -
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
Paper • 2507.10548 • Published • 36
-
Unified Vision-Language-Action Model
Paper • 2506.19850 • Published • 27 -
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Paper • 2506.01844 • Published • 131 -
3D-VLA: A 3D Vision-Language-Action Generative World Model
Paper • 2403.09631 • Published • 10 -
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
Paper • 2312.14457 • Published • 1
-
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Paper • 2508.09789 • Published • 5 -
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper • 2508.13186 • Published • 17 -
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
Paper • 2508.04038 • Published • 1 -
Prompt Orchestration Markup Language
Paper • 2508.13948 • Published • 45
-
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Paper • 2507.01925 • Published • 37 -
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
Paper • 2507.16746 • Published • 34 -
MolmoAct: Action Reasoning Models that can Reason in Space
Paper • 2508.07917 • Published • 41 -
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Paper • 2508.20072 • Published • 28
-
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
Paper • 2311.12631 • Published • 15 -
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 56 -
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Paper • 2504.01956 • Published • 41 -
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
Paper • 2506.23219 • Published • 7
-
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Paper • 2508.09789 • Published • 5 -
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper • 2508.13186 • Published • 17 -
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
Paper • 2508.04038 • Published • 1 -
Prompt Orchestration Markup Language
Paper • 2508.13948 • Published • 45
-
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Paper • 2507.01925 • Published • 37 -
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Paper • 2507.04447 • Published • 43 -
A Survey on Vision-Language-Action Models for Autonomous Driving
Paper • 2506.24044 • Published • 14 -
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
Paper • 2507.10548 • Published • 36
-
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Paper • 2507.01925 • Published • 37 -
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
Paper • 2507.16746 • Published • 34 -
MolmoAct: Action Reasoning Models that can Reason in Space
Paper • 2508.07917 • Published • 41 -
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Paper • 2508.20072 • Published • 28
-
Unified Vision-Language-Action Model
Paper • 2506.19850 • Published • 27 -
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Paper • 2506.01844 • Published • 131 -
3D-VLA: A 3D Vision-Language-Action Generative World Model
Paper • 2403.09631 • Published • 10 -
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
Paper • 2312.14457 • Published • 1
-
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
Paper • 2311.12631 • Published • 15 -
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 56 -
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Paper • 2504.01956 • Published • 41 -
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
Paper • 2506.23219 • Published • 7