Collections
Discover the best community collections!
Collections including paper arxiv:2501.12326
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents
  Paper • 2501.12326 • Published • 51
- MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
  Paper • 2401.04081 • Published • 70
- Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
  Paper • 2501.16295 • Published • 8
- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 24
- MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
  Paper • 2501.02955 • Published • 40
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 99
- MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
  Paper • 2501.12380 • Published • 82
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
  Paper • 2501.09781 • Published • 25
- Video Creation by Demonstration
  Paper • 2412.09551 • Published • 9
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
  Paper • 2412.07589 • Published • 45
- Unraveling the Complexity of Memory in RL Agents: An Approach for Classification and Evaluation
  Paper • 2412.06531 • Published • 71
- APOLLO: SGD-like Memory, AdamW-level Performance
  Paper • 2412.05270 • Published • 38
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
  Paper • 2412.04454 • Published • 64
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
  Paper • 2410.05243 • Published • 19
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents
  Paper • 2501.12326 • Published • 51