AI-paper
Describe What You See with Multimodal Large Language Models to Enhance
Video Recommendations
Paper
• 2508.09789
• Published
• 5
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper
• 2508.13186
• Published
• 19
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval
Driven LLM Agents
Paper
• 2508.04038
• Published
• 1
Prompt Orchestration Markup Language
Paper
• 2508.13948
• Published
• 48
MultiRef: Controllable Image Generation with Multiple Visual References
Paper
• 2508.06905
• Published
• 21
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
Paper
• 2508.14041
• Published
• 59
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent
Distillation and Agentic RL
Paper
• 2508.13167
• Published
• 129
Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic
Thought Reward
Paper
• 2508.12800
• Published
• 6
Copyright Protection for Large Language Models: A Survey of Methods,
Challenges, and Trends
Paper
• 2508.11548
• Published
• 5
Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
Paper
• 2508.08777
• Published
• 15
Training-Free Text-Guided Color Editing with Multi-Modal Diffusion
Transformer
Paper
• 2508.09131
• Published
• 16
MCP-Universe: Benchmarking Large Language Models with Real-World Model
Context Protocol Servers
Paper
• 2508.14704
• Published
• 43
From AI for Science to Agentic Science: A Survey on Autonomous
Scientific Discovery
Paper
• 2508.14111
• Published
• 33
RynnEC: Bringing MLLMs into Embodied World
Paper
• 2508.14160
• Published
• 20
Perception, Reason, Think, and Plan: A Survey on Large Multimodal
Reasoning Models
Paper
• 2505.04921
• Published
• 186
Evolving Deeper LLM Thinking
Paper
• 2501.09891
• Published
• 115
A Survey on Large Language Model Benchmarks
Paper
• 2508.15361
• Published
• 20
Deep Think with Confidence
Paper
• 2508.15260
• Published
• 90
ReFocus: Visual Editing as a Chain of Thought for Structured Image
Understanding
Paper
• 2501.05452
• Published
• 15
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal
Large Language Models
Paper
• 2504.15279
• Published
• 78
Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
Paper
• 2406.14562
• Published
• 28
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
• 2501.06186
• Published
• 65
Thinking with Generated Images
Paper
• 2505.22525
• Published
• 15
ChartMuseum: Testing Visual Reasoning Capabilities of Large
Vision-Language Models
Paper
• 2505.13444
• Published
• 17
We-Math: Does Your Large Multimodal Model Achieve Human-like
Mathematical Reasoning?
Paper
• 2407.01284
• Published
• 81
ComposeAnything: Composite Object Priors for Text-to-Image Generation
Paper
• 2505.24086
• Published
• 5
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and
Future Frontiers
Paper
• 2506.23918
• Published
• 90
Visual Planning: Let's Think Only with Images
Paper
• 2505.11409
• Published
• 57
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
• 2407.07053
• Published
• 47
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Paper
• 2403.12884
• Published
• 1
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography
Paper
• 2504.10090
• Published
Visual Programming: Compositional visual reasoning without training
Paper
• 2211.11559
• Published
• 1
ExoViP: Step-by-step Verification and Exploration with Exoskeleton
Modules for Compositional Visual Reasoning
Paper
• 2408.02210
• Published
• 9
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Paper
• 2412.18072
• Published
• 18
Intern-S1: A Scientific Multimodal Foundation Model
Paper
• 2508.15763
• Published
• 269
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Paper
• 2504.06261
• Published
• 110
Star Attention: Efficient LLM Inference over Long Sequences
Paper
• 2411.17116
• Published
• 53
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday
Home Clusters
Paper
• 2504.08791
• Published
• 139
LLM Inference Unveiled: Survey and Roofline Model Insights
Paper
• 2402.16363
• Published
• 4
Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled
Architectures
Paper
• 2504.11750
• Published
Efficient Diffusion Models: A Comprehensive Survey from Principles to
Practices
Paper
• 2410.11795
• Published
• 18
Generative AI for Character Animation: A Comprehensive Survey of
Techniques, Applications, and Future Directions
Paper
• 2504.19056
• Published
• 18
Personalized Image Generation with Deep Generative Models: A Decade
Survey
Paper
• 2502.13081
• Published
Diffusion Models: A Comprehensive Survey of Methods and Applications
Paper
• 2209.00796
• Published
An Empirical Study of GPT-4o Image Generation Capabilities
Paper
• 2504.05979
• Published
• 64
ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation
Paper
• 2502.09411
• Published
• 22
A survey of Generative AI Applications
Paper
• 2306.02781
• Published
Text-to-image Diffusion Models in Generative AI: A Survey
Paper
• 2303.07909
• Published
Multi-Agent Collaboration Mechanisms: A Survey of LLMs
Paper
• 2501.06322
• Published
• 1
Multi-Agent Collaboration via Evolving Orchestration
Paper
• 2505.19591
• Published
• 7
GenMAC: Compositional Text-to-Video Generation with Multi-Agent
Collaboration
Paper
• 2412.04440
• Published
• 22
AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose
Task Solving
Paper
• 2506.12508
• Published
• 1
Internet of Agents: Weaving a Web of Heterogeneous Agents for
Collaborative Intelligence
Paper
• 2407.07061
• Published
• 28
VideoTetris: Towards Compositional Text-to-Video Generation
Paper
• 2406.04277
• Published
• 25
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video
Generation
Paper
• 2407.14505
• Published
• 26
DreamRunner: Fine-Grained Storytelling Video Generation with
Retrieval-Augmented Motion Adaptation
Paper
• 2411.16657
• Published
• 19
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Paper
• 2411.10818
• Published
• 26
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Paper
• 2312.14125
• Published
• 47
PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
Paper
• 2504.03664
• Published
FlexInfer: Breaking Memory Constraint via Flexible and Efficient
Offloading for On-Device LLM Inference
Paper
• 2503.03777
• Published
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs
Paper
• 2503.16163
• Published
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
Paper
• 2502.12574
• Published
• 13
Seesaw: High-throughput LLM Inference via Model Re-sharding
Paper
• 2503.06433
• Published
MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving
Under Resource Constraints
Paper
• 2504.09345
• Published
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
• 2504.10479
• Published
• 306
MV-RAG: Retrieval Augmented Multiview Diffusion
Paper
• 2508.16577
• Published
• 38
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance
for Text-to-Image Generation
Paper
• 2508.18032
• Published
• 41
PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent
LLMs
Paper
• 2508.17188
• Published
• 17
Explain Before You Answer: A Survey on Compositional Visual Reasoning
Paper
• 2508.17298
• Published
• 4
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
Paper
• 2508.16153
• Published
• 160
AgentScope 1.0: A Developer-Centric Framework for Building Agentic
Applications
Paper
• 2508.16279
• Published
• 53
CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
Paper
• 2508.15774
• Published
• 20
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Paper
• 2508.19652
• Published
• 84
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding
in Vision-Language-Action Policies
Paper
• 2508.20072
• Published
• 32
AudioStory: Generating Long-Form Narrative Audio with Large Language
Models
Paper
• 2508.20088
• Published
• 21
MotionFlux: Efficient Text-Guided Motion Generation through Rectified
Flow Matching and Preference Alignment
Paper
• 2508.19527
• Published
• 10
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and
Disaggregated LLM Inference
Paper
• 2508.19559
• Published
• 6
Mixture of Contexts for Long Video Generation
Paper
• 2508.21058
• Published
• 35
rStar2-Agent: Agentic Reasoning Technical Report
Paper
• 2508.20722
• Published
• 117
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable
Text-to-Image Reinforcement Learning
Paper
• 2508.20751
• Published
• 89
AWorld: Orchestrating the Training Recipe for Agentic AI
Paper
• 2508.20404
• Published
• 38
Dress&Dance: Dress up and Dance as You Like It - Technical Preview
Paper
• 2508.21070
• Published
• 6
ROSE: Remove Objects with Side Effects in Videos
Paper
• 2508.18633
• Published
• 7
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for
General Robot Control
Paper
• 2508.21112
• Published
• 77
A.S.E: A Repository-Level Benchmark for Evaluating Security in
AI-Generated Code
Paper
• 2508.18106
• Published
• 348
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs
via Bi-Mode Annealing and Reinforce Learning
Paper
• 2508.21113
• Published
• 110
AHELM: A Holistic Evaluation of Audio-Language Models
Paper
• 2508.21376
• Published
• 9
Morae: Proactively Pausing UI Agents for User Choices
Paper
• 2508.21456
• Published
• 5
UItron: Foundational GUI Agent with Advanced Perception and Planning
Paper
• 2508.21767
• Published
• 12
Efficient Code Embeddings from Code Generation Models
Paper
• 2508.21290
• Published
• 19
TiKMiX: Take Data Influence into Dynamic Mixture for Language Model
Pre-training
Paper
• 2508.17677
• Published
• 14
CLIPSym: Delving into Symmetry Detection with CLIP
Paper
• 2508.14197
• Published
• 8
A Survey of Scientific Large Language Models: From Data Foundations to
Agent Frontiers
Paper
• 2508.21148
• Published
• 140
Continual Learning for Large Language Models: A Survey
Paper
• 2402.01364
• Published
• 1
Continual Learning with Pre-Trained Models: A Survey
Paper
• 2401.16386
• Published
• 1
Continual Learning: Applications and the Road Forward
Paper
• 2311.11908
• Published
• 1
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Paper
• 2509.02547
• Published
• 230
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn
Tool-Integrated Reasoning
Paper
• 2509.02479
• Published
• 84
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long
Video Understanding
Paper
• 2508.21496
• Published
• 55
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
Paper
• 2509.01055
• Published
• 79
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models
for Document Conversion
Paper
• 2509.01215
• Published
• 51
GenCompositor: Generative Video Compositing with Diffusion Transformer
Paper
• 2509.02460
• Published
• 26
OpenVision 2: A Family of Generative Pretrained Visual Encoders for
Multimodal Learning
Paper
• 2509.01644
• Published
• 34
Mixture of Global and Local Experts with Diffusion Transformer for
Controllable Face Generation
Paper
• 2509.00428
• Published
• 18
From Editor to Dense Geometry Estimator
Paper
• 2509.04338
• Published
• 96
Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from
Vector Drawings
Paper
• 2508.18733
• Published
• 10
Towards a Unified View of Large Language Model Post-Training
Paper
• 2509.04419
• Published
• 76
RedStone: Curating General, Code, Math, and QA Data for Large Language
Models
Paper
• 2412.03398
• Published
• 2
RecAgent: A Novel Simulation Paradigm for Recommender Systems
Paper
• 2306.02552
• Published
• 1
Adversarial Data Collection: Human-Collaborative Perturbations for
Efficient and Robust Robotic Imitation Learning
Paper
• 2503.11646
• Published
• 34
How do language models learn facts? Dynamics, curricula and
hallucinations
Paper
• 2503.21676
• Published
• 1
Investigating Multi-source Active Learning for Natural Language
Inference
Paper
• 2302.06976
• Published
Targeted Data Acquisition for Evolving Negotiation Agents
Paper
• 2106.07728
• Published
UniVerse-1: Unified Audio-Video Generation via Stitching of Experts
Paper
• 2509.06155
• Published
• 14
Revolutionizing Reinforcement Learning Framework for Diffusion Large
Language Models
Paper
• 2509.06949
• Published
• 56
Reinforced Visual Perception with Tools
Paper
• 2509.01656
• Published
• 32
Reinforcement Learning Foundations for Deep Research Systems: A Survey
Paper
• 2509.06733
• Published
• 32
Visual Representation Alignment for Multimodal Large Language Models
Paper
• 2509.07979
• Published
• 84
F1: A Vision-Language-Action Model Bridging Understanding and Generation
to Actions
Paper
• 2509.06951
• Published
• 32
A Survey of Reinforcement Learning for Large Reasoning Models
Paper
• 2509.08827
• Published
• 190
EnvX: Agentize Everything with Agentic AI
Paper
• 2509.08088
• Published
• 8
HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI
Assistants
Paper
• 2509.08494
• Published
• 3
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action
Model
Paper
• 2509.09372
• Published
• 246
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal
Conditioning
Paper
• 2509.08519
• Published
• 128
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Paper
• 2509.09674
• Published
• 80
Kling-Avatar: Grounding Multimodal Instructions for Cascaded
Long-Duration Avatar Animation Synthesis
Paper
• 2509.09595
• Published
• 48
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Paper
• 2509.09676
• Published
• 35
Visual Programmability: A Guide for Code-as-Thought in Chart
Understanding
Paper
• 2509.09286
• Published
• 11
Agentic Software Engineering: Foundational Pillars and a Research
Roadmap
Paper
• 2509.06216
• Published
• 8
AI Agentic Programming: A Survey of Techniques, Challenges, and
Opportunities
Paper
• 2508.11126
• Published
Agentic AI Frameworks: Architectures, Protocols, and Design Challenges
Paper
• 2508.10146
• Published
Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question
Answering with LLMs
Paper
• 2509.15020
• Published
• 4
Developer-LLM Conversations: An Empirical Study of Interactions and
Generated Code Quality
Paper
• 2509.10402
• Published
• 6
Unleashing the Potential of Multimodal LLMs for Zero-Shot
Spatio-Temporal Video Grounding
Paper
• 2509.15178
• Published
• 6
RecoWorld: Building Simulated Environments for Agentic Recommender
Systems
Paper
• 2509.10397
• Published
• 7
MultiEdit: Advancing Instruction-based Image Editing on Diverse and
Challenging Tasks
Paper
• 2509.14638
• Published
• 13
AToken: A Unified Tokenizer for Vision
Paper
• 2509.14476
• Published
• 36
FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial
Search and Reasoning
Paper
• 2509.13160
• Published
• 29
Understand Before You Generate: Self-Guided Training for Autoregressive
Image Generation
Paper
• 2509.15185
• Published
• 29
Evolving Language Models without Labels: Majority Drives Selection,
Novelty Promotes Variation
Paper
• 2509.15194
• Published
• 33
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform
Data
Paper
• 2509.15221
• Published
• 111
FlowRL: Matching Reward Distributions for LLM Reasoning
Paper
• 2509.15207
• Published
• 116
Reasoning over Boundaries: Enhancing Specification Alignment via
Test-time Deliberation
Paper
• 2509.14760
• Published
• 53
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid
Vision Tokenizer
Paper
• 2509.16197
• Published
• 58
Latent Zoning Network: A Unified Principle for Generative Modeling,
Representation Learning, and Classification
Paper
• 2509.15591
• Published
• 45
Lynx: Towards High-Fidelity Personalized Video Generation
Paper
• 2509.15496
• Published
• 13
LIMI: Less is More for Agency
Paper
• 2509.17567
• Published
• 104
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion
Transformer Models
Paper
• 2509.17627
• Published
• 66
Qwen3-Omni Technical Report
Paper
• 2509.17765
• Published
• 149
OnePiece: Bringing Context Engineering and Reasoning to Industrial
Cascade Ranking System
Paper
• 2509.18091
• Published
• 34
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning
for Video LLMs
Paper
• 2509.18056
• Published
• 27
GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric
Reasoning
Paper
• 2509.17437
• Published
• 17
EpiCache: Episodic KV Cache Management for Long Conversational Question
Answering
Paper
• 2509.17396
• Published
• 19
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering
Tasks?
Paper
• 2509.16941
• Published
• 21
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning
Models on Automatically Verifiable Textual and Visual Questions
Paper
• 2509.17177
• Published
• 13
Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from
Token and Parameter Levels
Paper
• 2509.16596
• Published
• 14
Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning
Paper
• 2509.18083
• Published
• 5
Understanding Embedding Scaling in Collaborative Filtering
Paper
• 2509.15709
• Published
• 5
ContextFlow: Training-Free Video Object Editing via Adaptive Context
Enrichment
Paper
• 2509.17818
• Published
• 8
AuditoryBench++: Can Language Models Understand Auditory Knowledge
without Hearing?
Paper
• 2509.17641
• Published
• 4
DIWALI - Diversity and Inclusivity aWare cuLture specific Items for
India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian
Context
Paper
• 2509.17399
• Published
• 2
When Big Models Train Small Ones: Label-Free Model Parity Alignment for
Efficient Visual Question Answering using Small VLMs
Paper
• 2509.16633
• Published
• 2
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and
Training Recipe
Paper
• 2509.18154
• Published
• 55
Hyper-Bagel: A Unified Acceleration Framework for Multimodal
Understanding and Generation
Paper
• 2509.18824
• Published
• 23
What Characterizes Effective Reasoning? Revisiting Length, Review, and
Structure of CoT
Paper
• 2509.19284
• Published
• 23
VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via
Travel Video Itinerary Reconstruction
Paper
• 2509.19002
• Published
• 3
Video models are zero-shot learners and reasoners
Paper
• 2509.20328
• Published
• 100
SIM-CoT: Supervised Implicit Chain-of-Thought
Paper
• 2509.20317
• Published
• 42
EmbeddingGemma: Powerful and Lightweight Text Representations
Paper
• 2509.20354
• Published
• 47
EditVerse: Unifying Image and Video Editing and Generation with
In-Context Learning
Paper
• 2509.20360
• Published
• 18
PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video
Generation
Paper
• 2509.20358
• Published
• 15
Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal
Understanding and Generation
Paper
• 2509.19244
• Published
• 12
Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just
What They Say
Paper
• 2509.21164
• Published
• 9
VCRL: Variance-based Curriculum Reinforcement Learning for Large
Language Models
Paper
• 2509.19803
• Published
• 120
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
Paper
• 2509.21320
• Published
• 101
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and
Open Resources
Paper
• 2509.21268
• Published
• 104
Tree Search for LLM Agent Reinforcement Learning
Paper
• 2509.21240
• Published
• 92
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Paper
• 2509.20427
• Published
• 82
AutoIntent: AutoML for Text Classification
Paper
• 2509.21138
• Published
• 36
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
Paper
• 2509.21117
• Published
• 30
Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web
Reconnaissance, Tool Generation, and Task Execution
Paper
• 2509.21072
• Published
• 15
Does FLUX Already Know How to Perform Physically Plausible Image
Composition?
Paper
• 2509.21278
• Published
• 16
Thinking Augmented Pre-training
Paper
• 2509.20186
• Published
• 23
Understanding the Thinking Process of Reasoning Models: A Perspective
from Schoenfeld's Episode Theory
Paper
• 2509.14662
• Published
• 13
SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
Paper
• 2509.21318
• Published
• 11
Interactive Recommendation Agent with Active User Commands
Paper
• 2509.21317
• Published
• 7
UserRL: Training Interactive User-Centric Agent via Reinforcement
Learning
Paper
• 2509.19736
• Published
• 12
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for
Video Temporal Reasoning
Paper
• 2509.21113
• Published
• 6
SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and
Self-Reflective Agent
Paper
• 2509.20414
• Published
• 10
Thinking While Listening: Simple Test Time Scaling For Audio
Classification
Paper
• 2509.19676
• Published
• 5
When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks
Silently Undermine Validity
Paper
• 2509.20293
• Published
• 8
Discrete Diffusion for Reflective Vision-Language-Action Models in
Autonomous Driving
Paper
• 2509.20109
• Published
• 4
CompLLM: Compression for Long Context Q&A
Paper
• 2509.19228
• Published
• 10
Blueprints of Trust: AI System Cards for End to End Transparency and
Governance
Paper
• 2509.20394
• Published
• 3
StyleBench: Evaluating thinking styles in Large Language Models
Paper
• 2509.20868
• Published
• 4
OverLayBench: A Benchmark for Layout-to-Image Generation with Dense
Overlaps
Paper
• 2509.19282
• Published
• 8
LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale
Diffusion Transformer
Paper
• 2509.22414
• Published
• 22
UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
Paper
• 2509.21760
• Published
• 15
VoiceAssistant-Eval: Benchmarking AI Assistants across Listening,
Speaking, and Viewing
Paper
• 2509.22651
• Published
• 23
Variational Reasoning for Language Models
Paper
• 2509.22637
• Published
• 69
LongLive: Real-time Interactive Long Video Generation
Paper
• 2509.22622
• Published
• 188
A Survey of Interactive Generative Video
Paper
• 2504.21853
• Published
• 46
Evaluating Very Long-Term Conversational Memory of LLM Agents
Paper
• 2402.17753
• Published
• 19
VBench: Comprehensive Benchmark Suite for Video Generative Models
Paper
• 2311.17982
• Published
• 9
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic
Faithfulness
Paper
• 2503.21755
• Published
• 33
VBench++: Comprehensive and Versatile Benchmark Suite for Video
Generative Models
Paper
• 2411.13503
• Published
• 34
DreamBench++: A Human-Aligned Benchmark for Personalized Image
Generation
Paper
• 2406.16855
• Published
• 57
VCBench: Benchmarking LLMs in Venture Capital
Paper
• 2509.14448
• Published
AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection
Paper
• 2504.20865
• Published
ConsumerBench: Benchmarking Generative AI Applications on End-User
Devices
Paper
• 2506.17538
• Published
• 7
Benchmarking AI Models in Software Engineering: A Review, Search Tool,
and Enhancement Protocol
Paper
• 2503.05860
• Published
• 11
MERA Code: A Unified Framework for Evaluating Code Generation Across
Tasks
Paper
• 2507.12284
• Published
• 12
Benchmarking Neural Network Training Algorithms
Paper
• 2306.07179
• Published
• 24
SpreadsheetBench: Towards Challenging Real World Spreadsheet
Manipulation
Paper
• 2406.14991
• Published
• 2
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM
Evaluation
Paper
• 2506.00482
• Published
• 8
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls
and Complex Instructions
Paper
• 2406.15877
• Published
• 48
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Paper
• 2407.18961
• Published
• 40
ImgEdit: A Unified Image Editing Dataset and Benchmark
Paper
• 2505.20275
• Published
• 18
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image
Generation
Paper
• 2504.02782
• Published
• 57
7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models
Paper
• 2508.12919
• Published
• 1
Instruction-Following Evaluation in Function Calling for Large Language
Models
Paper
• 2509.18420
• Published
• 2
MinerU2.5: A Decoupled Vision-Language Model for Efficient
High-Resolution Document Parsing
Paper
• 2509.22186
• Published
• 146
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient
SpeechLLMs
Paper
• 2509.22220
• Published
• 65
RealUnify: Do Unified Models Truly Benefit from Unification? A
Comprehensive Benchmark
Paper
• 2509.24897
• Published
• 46
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation
and Editing
Paper
• 2509.24900
• Published
• 53
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture,
Training and Dataset
Paper
• 2505.09568
• Published
• 99
Unified Multimodal Understanding and Generation Models: Advances,
Challenges, and Opportunities
Paper
• 2505.02567
• Published
• 80
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for
Large Language Models
Paper
• 2406.12644
• Published
• 5
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing
via Compositional Dependencies
Paper
• 2506.12830
• Published
CompBench: Benchmarking Complex Instruction-guided Image Editing
Paper
• 2505.12200
• Published
Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought
Imagination
Paper
• 2509.01986
• Published
• 5
GenEval: An Object-Focused Framework for Evaluating Text-to-Image
Alignment
Paper
• 2310.11513
• Published
• 1
Visual Jigsaw Post-Training Improves MLLMs
Paper
• 2509.25190
• Published
• 37
SANA-Video: Efficient Video Generation with Block Linear Diffusion
Transformer
Paper
• 2509.24695
• Published
• 46
Democratizing AI scientists using ToolUniverse
Paper
• 2509.23426
• Published
• 40
EasySteer: A Unified Framework for High-Performance and Extensible LLM
Steering
Paper
• 2509.25175
• Published
• 31
Towards Personalized Deep Research: Benchmarks and Evaluations
Paper
• 2509.25106
• Published
• 30
VideoScore2: Think before You Score in Generative Video Evaluation
Paper
• 2509.22799
• Published
• 26
MMPB: It's Time for Multi-Modal Personalization
Paper
• 2509.22820
• Published
• 15
Personalization of Large Language Models: A Survey
Paper
• 2411.00027
• Published
• 33
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Paper
• 2509.25161
• Published
• 26
HunyuanImage 3.0 Technical Report
Paper
• 2509.23951
• Published
• 25
PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on
Structured Images
Paper
• 2509.25185
• Published
• 5
Local Success Does Not Compose: Benchmarking Large Language Models for
Compositional Formal Verification
Paper
• 2509.23061
• Published
• 7
UniVid: The Open-Source Unified Video Model
Paper
• 2509.24200
• Published
• 5
PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation
Paper
• 2509.23338
• Published
• 8
BPMN Assistant: An LLM-Based Approach to Business Process Modeling
Paper
• 2509.24592
• Published
• 3
Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large
Language Models
Paper
• 2509.23233
• Published
• 4
Advancing Reference-free Evaluation of Video Captions with Factual
Analysis
Paper
• 2509.16538
• Published
• 1
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP
Use
Paper
• 2509.24002
• Published
• 176
OceanGym: A Benchmark Environment for Underwater Embodied Agents
Paper
• 2509.26536
• Published
• 36
DC-VideoGen: Efficient Video Generation with Deep Compression Video
Autoencoder
Paper
• 2509.25182
• Published
• 39
Learning to See Before Seeing: Demystifying LLM Visual Priors from
Language Pre-training
Paper
• 2509.26625
• Published
• 43
VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in
Real-world Applications
Paper
• 2509.26490
• Published
• 20
dParallel: Learnable Parallel Decoding for dLLMs
Paper
• 2509.26488
• Published
• 19
DA^2: Depth Anything in Any Direction
Paper
• 2509.26618
• Published
• 26
TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics
Paper
• 2509.26329
• Published
• 3
Video Object Segmentation-Aware Audio Generation
Paper
• 2509.26604
• Published
• 1
BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source
Software
Paper
• 2509.25248
• Published
• 3
Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional
Video Generation
Paper
• 2509.26555
• Published
• 1
Regression Language Models for Code
Paper
• 2509.26476
• Published
• 17
The Pitfalls of KV Cache Compression
Paper
• 2510.00231
• Published
• 6
Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
Paper
• 2509.26539
• Published
• 10
LayerD: Decomposing Raster Graphic Designs into Layers
Paper
• 2509.25134
• Published
• 2
Improving Editability in Image Generation with Layer-wise Memory
Paper
• 2505.01079
• Published
• 29
Generative Image Layer Decomposition with Visual Effects
Paper
• 2411.17864
• Published
Edit Transfer: Learning Image Editing via Vision In-Context Relations
Paper
• 2503.13327
• Published
• 29
Text2Layer: Layered Image Generation using Latent Diffusion Model
Paper
• 2307.09781
• Published
• 16
Code2Video: A Code-centric Paradigm for Educational Video Generation
Paper
• 2510.01174
• Published
• 35
GEM: A Gym for Agentic LLMs
Paper
• 2510.01051
• Published
• 90
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model
Responses
Paper
• 2510.00232
• Published
• 16
In-Place Feedback: A New Paradigm for Guiding LLMs in Multi-Turn
Reasoning
Paper
• 2510.00777
• Published
• 2
An Empirical Study of Testing Practices in Open Source AI Agent
Frameworks and Agentic Applications
Paper
• 2509.19185
• Published
• 4
Can Large Multimodal Models Uncover Deep Semantics Behind Images?
Paper
• 2402.11281
• Published
• 1
Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models
Paper
• 2509.25162
• Published
• 3
BindWeave: Subject-Consistent Video Generation via Cross-Modal
Integration
Paper
• 2510.00438
• Published
• 10
BatonVoice: An Operationalist Framework for Enhancing Controllable
Speech Synthesis with Linguistic Intelligence from LLMs
Paper
• 2509.26514
• Published
• 4
Eliciting Secret Knowledge from Language Models
Paper
• 2510.01070
• Published
• 6
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Paper
• 2510.02283
• Published
• 96
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world
Markets?
Paper
• 2510.02209
• Published
• 56
BloombergGPT: A Large Language Model for Finance
Paper
• 2303.17564
• Published
• 30
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Paper
• 2510.01284
• Published
• 37
A Rigorous Benchmark with Multidimensional Evaluation for Deep Research
Agents: From Answers to Reports
Paper
• 2510.02190
• Published
• 19
VIRTUE: Visual-Interactive Text-Image Universal Embedder
Paper
• 2510.00523
• Published
• 7
Breaking the Modality Barrier: Universal Embedding Learning with
Multimodal LLMs
Paper
• 2504.17432
• Published
• 40
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper
• 2411.04997
• Published
• 39
Veagle: Advancements in Multimodal Representation Learning
Paper
• 2403.08773
• Published
• 10
CoDA: Agentic Systems for Collaborative Data Visualization
Paper
• 2510.03194
• Published
• 30
SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?
Paper
• 2510.03120
• Published
• 7
Paper2Video: Automatic Video Generation from Scientific Papers
Paper
• 2510.05096
• Published
• 119
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Paper
• 2510.05094
• Published
• 38
Agentic Context Engineering: Evolving Contexts for Self-Improving
Language Models
Paper
• 2510.04618
• Published
• 129
Hybrid Architectures for Language Models: Systematic Analysis and Design
Insights
Paper
• 2510.04800
• Published
• 37
Cache-to-Cache: Direct Semantic Communication Between Large Language
Models
Paper
• 2510.03215
• Published
• 98
Ming-UniVision: Joint Image Understanding and Generation with a Unified
Continuous Tokenizer
Paper
• 2510.06590
• Published
• 77
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal
Generation and Understanding
Paper
• 2510.06308
• Published
• 55
SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
Paper
• 2510.06917
• Published
• 35
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with
Holistic Platform and Adaptive Hybrid Policy Optimization
Paper
• 2510.08540
• Published
• 109
MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Paper
• 2510.07310
• Published
• 36
RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
Paper
• 2510.06710
• Published
• 42
Vibe Checker: Aligning Code Evaluation with Human Preference
Paper
• 2510.07315
• Published
• 34
Online Generic Event Boundary Detection
Paper
• 2510.06855
• Published
• 4
Bridging Text and Video Generation: A Survey
Paper
• 2510.04999
• Published
• 6
U-Bench: A Comprehensive Understanding of U-Net through 100-Variant
Benchmarking
Paper
• 2510.07041
• Published
• 4
DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for
Autonomous Travel Planning Agents
Paper
• 2509.21842
• Published
• 3
Agent Learning via Early Experience
Paper
• 2510.08558
• Published
• 273
UniVideo: Unified Understanding, Generation, and Editing for Videos
Paper
• 2510.08377
• Published
• 81
UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video
Super-Resolution
Paper
• 2510.08143
• Published
• 20
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
Paper
• 2510.03663
• Published
• 16
NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM
Agents
Paper
• 2510.07172
• Published
• 28
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal
Patches via In-Context Conditioning
Paper
• 2510.08555
• Published
• 64
Recycling Pretrained Checkpoints: Orthogonal Growth of
Mixture-of-Experts for Efficient Large Language Model Pre-Training
Paper
• 2510.08008
• Published
• 6
Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs
Paper
• 2510.07429
• Published
• 4
Beyond Turn Limits: Training Deep Search Agents with Dynamic Context
Window
Paper
• 2510.08276
• Published
• 10
SciVideoBench: Benchmarking Scientific Video Reasoning in Large
Multimodal Models
Paper
• 2510.08559
• Published
• 9
Character Mixing for Video Generation
Paper
• 2510.05093
• Published
• 7
WithAnyone: Towards Controllable and ID Consistent Image Generation
Paper
• 2510.14975
• Published
• 85
From Pixels to Words -- Towards Native Vision-Language Primitives at
Scale
Paper
• 2510.14979
• Published
• 67
Attention Is All You Need for KV Cache in Diffusion LLMs
Paper
• 2510.14973
• Published
• 42
LLM-guided Hierarchical Retrieval
Paper
• 2510.13217
• Published
• 21
Qwen3Guard Technical Report
Paper
• 2510.14276
• Published
• 15
Learning an Image Editing Model without Image Editing Pairs
Paper
• 2510.14978
• Published
• 9
pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
Paper
• 2510.14974
• Published
• 10
RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval
Augmented Generation Systems
Paper
• 2510.13910
• Published
• 2
DeepAgent: A General Reasoning Agent with Scalable Toolsets
Paper
• 2510.21618
• Published
• 101
Video-As-Prompt: Unified Semantic Control for Video Generation
Paper
• 2510.20888
• Published
• 50
UI-Ins: Enhancing GUI Grounding with Multi-Perspective
Instruction-as-Reasoning
Paper
• 2510.20286
• Published
• 24
From Denoising to Refining: A Corrective Framework for Vision-Language
Diffusion Model
Paper
• 2510.19871
• Published
• 30
RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via
Hierarchical Model Merging
Paper
• 2510.20479
• Published
• 12
Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
Paper
• 2510.13251
• Published
• 14
Model Merging with Functional Dual Anchors
Paper
• 2510.21223
• Published
• 13
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via
Data Alignment and Test-Time Scaling
Paper
• 2510.20206
• Published
• 12
Paper
• 2510.18212
• Published
• 36
Visual Diffusion Models are Geometric Solvers
Paper
• 2510.21697
• Published
• 20
AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research
Suite
Paper
• 2510.21652
• Published
• 4
ARC-Encoder: learning compressed text representations for large language
models
Paper
• 2510.20535
• Published
• 8
Taming Modality Entanglement in Continual Audio-Visual Segmentation
Paper
• 2510.17234
• Published
• 5
MemOS: A Memory OS for AI System
Paper
• 2507.03724
• Published
• 159
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world
APIs
Paper
• 2307.16789
• Published
• 102
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
Paper
• 2304.08244
• Published
• 1
ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models
in Multi-Hop Tool Use
Paper
• 2501.02506
• Published
• 10
WebShop: Towards Scalable Real-World Web Interaction with Grounded
Language Agents
Paper
• 2207.01206
• Published
• 3
GAIA: a benchmark for General AI Assistants
Paper
• 2311.12983
• Published
• 245
Task Vectors are Cross-Modal
Paper
• 2410.22330
• Published
• 11
In-Context Learning Creates Task Vectors
Paper
• 2310.15916
• Published
• 44
Group Relative Attention Guidance for Image Editing
Paper
• 2510.24657
• Published
• 26
OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
Paper
• 2510.24563
• Published
• 23
WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling
Info-Rich Seeking
Paper
• 2510.24697
• Published
• 21
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing
Actions
Paper
• 2510.10666
• Published
• 28
WideSearch: Benchmarking Agentic Broad Info-Seeking
Paper
• 2508.07999
• Published
• 110
SealQA: Raising the Bar for Reasoning in Search-Augmented Language
Models
Paper
• 2506.01062
• Published
• 5
Routing Matters in MoE: Scaling Diffusion Transformers with Explicit
Routing Guidance
Paper
• 2510.24711
• Published
• 20
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
Paper
• 2510.22373
• Published
• 15
PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text
Embedding
Paper
• 2510.22264
• Published
• 2
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with
the MME-CoF Benchmark
Paper
• 2510.26802
• Published
• 34
AMO-Bench: Large Language Models Still Struggle in High School Math
Competitions
Paper
• 2510.26768
• Published
• 34
The Era of Agentic Organization: Learning to Organize with Language
Models
Paper
• 2510.26658
• Published
• 29
OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal
Document Layout Generation
Paper
• 2510.26213
• Published
• 10
Magentic Marketplace: An Open-Source Environment for Studying Agentic
Markets
Paper
• 2510.25779
• Published
• 11
CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Paper
• 2510.26160
• Published
• 17
ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Paper
• 2510.26781
• Published
• 1
Emu3.5: Native Multimodal Models are World Learners
Paper
• 2510.26583
• Published
• 111
The End of Manual Decoding: Towards Truly End-to-End Language Models
Paper
• 2510.26697
• Published
• 117
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement
Learning
Paper
• 2510.23473
• Published
• 85
JanusCoder: Towards a Foundational Visual-Programmatic Interface for
Code Intelligence
Paper
• 2510.23538
• Published
• 97
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic,
and Long-Horizon Task Execution
Paper
• 2510.25726
• Published
• 46
VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context
Learning
Paper
• 2510.25772
• Published
• 33
The Principles of Diffusion Models
Paper
• 2510.21890
• Published
• 62
RegionE: Adaptive Region-Aware Generation for Efficient Image Editing
Paper
• 2510.25590
• Published
• 28
Multimodal Spatial Reasoning in the Large Model Era: A Survey and
Benchmarks
Paper
• 2510.25760
• Published
• 17
SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In
Text-only LLMs
Paper
• 2510.25092
• Published
• 8
Reasoning Language Model Inference Serving Unveiled: An Empirical Study
Paper
• 2510.18672
• Published
• 8
InteractComp: Evaluating Search Agents With Ambiguous Queries
Paper
• 2510.24668
• Published
• 98
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization
Formats
Paper
• 2510.25602
• Published
• 78
ThinkMorph: Emergent Properties in Multimodal Interleaved
Chain-of-Thought Reasoning
Paper
• 2510.27492
• Published
• 86
Defeating the Training-Inference Mismatch via FP16
Paper
• 2510.26788
• Published
• 31
Revisiting Multimodal Positional Encoding in Vision-Language Models
Paper
• 2510.23095
• Published
• 22
Higher-order Linear Attention
Paper
• 2510.27258
• Published
• 15
The Denario project: Deep knowledge AI agents for scientific discovery
Paper
• 2510.26887
• Published
• 8
UniLumos: Fast and Unified Image and Video Relighting with
Physics-Plausible Feedback
Paper
• 2511.01678
• Published
• 38
Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open
Language Foundation
Paper
• 2510.22115
• Published
• 85
The Underappreciated Power of Vision Models for Graph Structural
Understanding
Paper
• 2510.24788
• Published
• 36
UniREditBench: A Unified Reasoning-based Image Editing Benchmark
Paper
• 2511.01295
• Published
• 39
ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool
Use
Paper
• 2510.27363
• Published
• 23
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal
Generation
Paper
• 2511.01163
• Published
• 32
Towards Universal Video Retrieval: Generalizing Video Embedding via
Synthesized Multimodal Pyramid Curriculum
Paper
• 2510.27571
• Published
• 19
TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images
Reasoning
Paper
• 2511.01833
• Published
• 16
LongCat-Flash-Omni Technical Report
Paper
• 2511.00279
• Published
• 26
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement
Reading with MeasureBench
Paper
• 2510.26865
• Published
• 12
Actial: Activate Spatial Reasoning Ability of Multimodal Large Language
Models
Paper
• 2511.01618
• Published
• 11
Trove: A Flexible Toolkit for Dense Retrieval
Paper
• 2511.01857
• Published
• 12
Towards Robust Mathematical Reasoning
Paper
• 2511.01846
• Published
• 10
MotionStream: Real-Time Video Generation with Interactive Motion
Controls
Paper
• 2511.01266
• Published
• 31
UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings
Paper
• 2511.00405
• Published
• 6
Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
Paper
• 2511.01617
• Published
• 3
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual
Representation
Paper
• 2511.02778
• Published
• 102
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for
Visual Chain-of-Thought
Paper
• 2511.02779
• Published
• 59
LTD-Bench: Evaluating Large Language Models by Letting Them Draw
Paper
• 2511.02347
• Published
• 9
TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
Paper
• 2511.02832
• Published
• 10
Can Visual Input Be Compressed? A Visual Token Compression Benchmark for
Large Multimodal Models
Paper
• 2511.02650
• Published
• 10
CodeClash: Benchmarking Goal-Oriented Software Engineering
Paper
• 2511.00839
• Published
• 10
iFlyBot-VLA Technical Report
Paper
• 2511.01914
• Published
• 7
TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning
in Tabular Data
Paper
• 2511.02219
• Published
• 2
LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for
LLMs in Chinese Context
Paper
• 2511.02366
• Published
• 4
VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation
Models
Paper
• 2511.02712
• Published
• 5
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive
Capacity
Paper
• 2511.03146
• Published
• 8
TabTune: A Unified Library for Inference and Fine-Tuning Tabular
Foundation Models
Paper
• 2511.02802
• Published
• 16
V-Thinker: Interactive Thinking with Images
Paper
• 2511.04460
• Published
• 97
Thinking with Video: Video Generation as a Promising Multimodal
Reasoning Paradigm
Paper
• 2511.04570
• Published
• 240
Scaling Agent Learning via Experience Synthesis
Paper
• 2511.03773
• Published
• 82
NVIDIA Nemotron Nano V2 VL
Paper
• 2511.03929
• Published
• 30
GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
Paper
• 2511.04307
• Published
• 15
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable
Non-Visual Shortcuts
Paper
• 2511.04655
• Published
• 8
Diffusion Language Models are Super Data Learners
Paper
• 2511.03276
• Published
• 129
A Survey of LLM-Driven AI Agent Communication: Protocols, Security
Risks, and Defense Countermeasures
Paper
• 2506.19676
• Published
MCP-AgentBench: Evaluating Real-World Language Agent Performance with
MCP-Mediated Tools
Paper
• 2509.09734
• Published
• 16
DeepEyesV2: Toward Agentic Multimodal Model
Paper
• 2511.05271
• Published
• 45
Paper
• 2511.05491
• Published
• 52
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Paper
• 2511.04962
• Published
• 57
Towards Mitigating Hallucinations in Large Vision-Language Models by
Refining Textual Embeddings
Paper
• 2511.05017
• Published
• 9
Paper
• 2511.05369
• Published
• 10
Real-Time Reasoning Agents in Evolving Environments
Paper
• 2511.04898
• Published
• 13
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
Paper
• 2511.11002
• Published
• 4
Experience-Guided Adaptation of Inference-Time Reasoning Strategies
Paper
• 2511.11519
• Published
• 4
SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?
Paper
• 2511.06090
• Published
• 5
Generating an Image From 1,000 Words: Enhancing Text-to-Image With
Structured Captions
Paper
• 2511.06876
• Published
• 28
Agentic Refactoring: An Empirical Study of AI Coding Agents
Paper
• 2511.04824
• Published
• 5
Motif 2 12.7B technical report
Paper
• 2511.07464
• Published
• 39
Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
Paper
• 2511.08633
• Published
• 55
Optimizing Diversity and Quality through Base-Aligned Model Collaboration
Paper
• 2511.05650
• Published
• 6
Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
Paper
• 2511.07885
• Published
• 10
Walking the Tightrope of LLMs for Software Development: A Practitioners' Perspective
Paper
• 2511.06428
• Published
• 5
Adaptive Multi-Agent Response Refinement in Conversational Systems
Paper
• 2511.08319
• Published
• 42
Music Flamingo: Scaling Music Understanding in Audio Language Models
Paper
• 2511.10289
• Published
• 17
Depth Anything 3: Recovering the Visual Space from Any Views
Paper
• 2511.10647
• Published
• 99
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Paper
• 2511.08521
• Published
• 38
Instella: Fully Open Language Models with Stellar Performance
Paper
• 2511.10628
• Published
• 5
WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance
Paper
• 2511.12997
• Published
• 11
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Paper
• 2511.09611
• Published
• 70
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
Paper
• 2511.13704
• Published
• 43
Workload Schedulers -- Genesis, Algorithms and Differences
Paper
• 2511.10258
• Published
• 2
SAM 3D: 3Dfy Anything in Images
Paper
• 2511.16624
• Published
• 113
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Paper
• 2511.16334
• Published
• 93
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Paper
• 2511.15705
• Published
• 97
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
Paper
• 2511.17490
• Published
• 22
WorldGen: From Text to Traversable and Interactive 3D Worlds
Paper
• 2511.16825
• Published
• 24
OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists
Paper
• 2511.16931
• Published
• 8
RynnVLA-002: A Unified Vision-Language-Action and World Model
Paper
• 2511.17502
• Published
• 28
SAM 3: Segment Anything with Concepts
Paper
• 2511.16719
• Published
• 129
O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
Paper
• 2511.13593
• Published
• 27
In-Video Instructions: Visual Signals as Generative Control
Paper
• 2511.19401
• Published
• 32
HunyuanVideo 1.5 Technical Report
Paper
• 2511.18870
• Published
• 28
Controllable Layer Decomposition for Reversible Multi-Layer Image Generation
Paper
• 2511.16249
• Published
• 9
TradingAgents: Multi-Agents LLM Financial Trading Framework
Paper
• 2412.20138
• Published
• 18
Canvas-to-Image: Compositional Image Generation with Multimodal Controls
Paper
• 2511.21691
• Published
• 36
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Paper
• 2511.18538
• Published
• 299
DeepCode: Open Agentic Coding
Paper
• 2512.07921
• Published
• 33
RecGPT-V2 Technical Report
Paper
• 2512.14503
• Published
• 18
Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search from Task-Centric Views
Paper
• 2512.12980
• Published
• 28
Paper
• 2512.13961
• Published
• 29
Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets
Paper
• 2512.15110
• Published
• 10
SemanticGen: Video Generation in Semantic Space
Paper
• 2512.20619
• Published
• 93
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Paper
• 2512.20618
• Published
• 55
Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
Paper
• 2512.24615
• Published
• 119
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
Paper
• 2512.24330
• Published
• 35
ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas
Paper
• 2601.21558
• Published
• 58
THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
Paper
• 2601.23143
• Published
• 38
FireRed-Image-Edit-1.0 Technical Report
Paper
• 2602.13344
• Published
• 4
Discovering Multiagent Learning Algorithms with Large Language Models
Paper
• 2602.16928
• Published
• 12
MMA: Multimodal Memory Agent
Paper
• 2602.16493
• Published
• 7
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Paper
• 2602.12670
• Published
• 51
Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models
Paper
• 2602.15772
• Published
• 6
On Surprising Effectiveness of Masking Updates in Adaptive Optimizers
Paper
• 2602.15322
• Published
• 9
GLM-5: from Vibe Coding to Agentic Engineering
Paper
• 2602.15763
• Published
• 94