Collections
Collections including paper arxiv:2508.21112
-
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforcement Learning
Paper • 2508.21113 • Published • 103 -
Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Paper • 2508.16949 • Published • 22 -
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
Paper • 2508.21112 • Published • 72 -
UItron: Foundational GUI Agent with Advanced Perception and Planning
Paper • 2508.21767 • Published • 12
-
Group Sequence Policy Optimization
Paper • 2507.18071 • Published • 294 -
LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
Paper • 2507.15758 • Published • 34 -
Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
Paper • 2508.09726 • Published • 13 -
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
Paper • 2508.10975 • Published • 56
-
GRUtopia: Dream General Robots in a City at Scale
Paper • 2407.10943 • Published • 26 -
Make-An-Agent: A Generalizable Policy Network Generator with Behavior-Prompted Diffusion
Paper • 2407.10973 • Published • 11 -
Cross Anything: General Quadruped Robot Navigation through Complex Terrains
Paper • 2407.16412 • Published • 6 -
RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands
Paper • 2408.11048 • Published • 4
-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 45 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 24
-
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Paper • 2508.09789 • Published • 5 -
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper • 2508.13186 • Published • 17 -
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
Paper • 2508.04038 • Published • 1 -
Prompt Orchestration Markup Language
Paper • 2508.13948 • Published • 46
-
Gemini Robotics: Bringing AI into the Physical World
Paper • 2503.20020 • Published • 28 -
Magma: A Foundation Model for Multimodal AI Agents
Paper • 2502.13130 • Published • 58 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 51 -
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper • 2410.23218 • Published • 51
-
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
Paper • 2311.12631 • Published • 15 -
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 56 -
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Paper • 2504.01956 • Published • 41 -
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
Paper • 2506.23219 • Published • 7