AI-paper - a shankars Collection

Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

Paper • 2508.09789 • Published Aug 13, 2025 • 5

MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

Paper • 2508.13186 • Published Aug 14, 2025 • 19

ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

Paper • 2508.04038 • Published Aug 6, 2025 • 1

Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

Paper • 2508.13167 • Published Aug 6, 2025 • 129

Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward

Paper • 2508.12800 • Published Aug 18, 2025 • 6

Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends

Paper • 2508.11548 • Published Aug 15, 2025 • 5

Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge

Paper • 2508.08777 • Published Aug 12, 2025 • 15

Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Paper • 2508.09131 • Published Aug 12, 2025 • 16

MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

Paper • 2508.14704 • Published Aug 20, 2025 • 43

From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery

Paper • 2508.14111 • Published Aug 18, 2025 • 33

RynnEC: Bringing MLLMs into Embodied World

Paper • 2508.14160 • Published Aug 19, 2025 • 20

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Paper • 2505.04921 • Published May 8, 2025 • 186

Evolving Deeper LLM Thinking

Paper • 2501.09891 • Published Jan 17, 2025 • 115

A Survey on Large Language Model Benchmarks

Paper • 2508.15361 • Published Aug 21, 2025 • 20

Deep Think with Confidence

Paper • 2508.15260 • Published Aug 21, 2025 • 90

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Paper • 2501.05452 • Published Jan 9, 2025 • 15

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Paper • 2504.15279 • Published Apr 21, 2025 • 78

Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Paper • 2406.14562 • Published Jun 20, 2024 • 28

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Paper • 2501.06186 • Published Jan 10, 2025 • 65

Thinking with Generated Images

Paper • 2505.22525 • Published May 28, 2025 • 15

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Paper • 2505.13444 • Published May 19, 2025 • 17

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Paper • 2407.01284 • Published Jul 1, 2024 • 81

ComposeAnything: Composite Object Priors for Text-to-Image Generation

Paper • 2505.24086 • Published May 30, 2025 • 5

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Paper • 2506.23918 • Published Jun 30, 2025 • 90

Visual Planning: Let's Think Only with Images

Paper • 2505.11409 • Published May 16, 2025 • 57

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Paper • 2407.07053 • Published Jul 9, 2024 • 47

HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

Paper • 2403.12884 • Published Mar 19, 2024 • 1

CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography

Paper • 2504.10090 • Published Apr 14, 2025

Visual Programming: Compositional visual reasoning without training

Paper • 2211.11559 • Published Nov 18, 2022 • 1

ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

Paper • 2408.02210 • Published Aug 5, 2024 • 9

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Paper • 2412.18072 • Published Dec 24, 2024 • 18

Intern-S1: A Scientific Multimodal Foundation Model

Paper • 2508.15763 • Published Aug 21, 2025 • 269

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

Paper • 2504.06261 • Published Apr 8, 2025 • 110

Star Attention: Efficient LLM Inference over Long Sequences

Paper • 2411.17116 • Published Nov 26, 2024 • 53

PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

Paper • 2504.08791 • Published Apr 7, 2025 • 139

LLM Inference Unveiled: Survey and Roofline Model Insights

Paper • 2402.16363 • Published Feb 26, 2024 • 4

Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures

Paper • 2504.11750 • Published Apr 16, 2025

Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices

Paper • 2410.11795 • Published Oct 15, 2024 • 18

Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions

Paper • 2504.19056 • Published Apr 27, 2025 • 18

Personalized Image Generation with Deep Generative Models: A Decade Survey

Paper • 2502.13081 • Published Feb 18, 2025

Diffusion Models: A Comprehensive Survey of Methods and Applications

Paper • 2209.00796 • Published Sep 2, 2022

An Empirical Study of GPT-4o Image Generation Capabilities

Paper • 2504.05979 • Published Apr 8, 2025 • 64

ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation

Paper • 2502.09411 • Published Feb 13, 2025 • 22

A survey of Generative AI Applications

Paper • 2306.02781 • Published Jun 5, 2023

Text-to-image Diffusion Models in Generative AI: A Survey

Paper • 2303.07909 • Published Mar 14, 2023

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

Paper • 2501.06322 • Published Jan 10, 2025 • 1

Multi-Agent Collaboration via Evolving Orchestration

Paper • 2505.19591 • Published May 26, 2025 • 7

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Paper • 2412.04440 • Published Dec 5, 2024 • 22

AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving

Paper • 2506.12508 • Published Jun 14, 2025 • 1

Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

Paper • 2407.07061 • Published Jul 9, 2024 • 28

VideoTetris: Towards Compositional Text-to-Video Generation

Paper • 2406.04277 • Published Jun 6, 2024 • 25

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

Paper • 2407.14505 • Published Jul 19, 2024 • 26

DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

Paper • 2411.16657 • Published Nov 25, 2024 • 19

FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations

Paper • 2411.10818 • Published Nov 16, 2024 • 26

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Paper • 2312.14125 • Published Dec 21, 2023 • 47

PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices

Paper • 2504.03664 • Published Mar 15, 2025

FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference

Paper • 2503.03777 • Published Mar 4, 2025

SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

Paper • 2503.16163 • Published Mar 20, 2025

HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading

Paper • 2502.12574 • Published Feb 18, 2025 • 13

Seesaw: High-throughput LLM Inference via Model Re-sharding

Paper • 2503.06433 • Published Mar 9, 2025

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints

Paper • 2504.09345 • Published Apr 12, 2025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14, 2025 • 306

MV-RAG: Retrieval Augmented Multiview Diffusion

Paper • 2508.16577 • Published Aug 22, 2025 • 38

Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Paper • 2508.18032 • Published Aug 25, 2025 • 41

PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs

Paper • 2508.17188 • Published Aug 24, 2025 • 17

Explain Before You Answer: A Survey on Compositional Visual Reasoning

Paper • 2508.17298 • Published Aug 24, 2025 • 4

AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs

Paper • 2508.16153 • Published Aug 22, 2025 • 160

AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

Paper • 2508.16279 • Published Aug 22, 2025 • 53

CineScale: Free Lunch in High-Resolution Cinematic Visual Generation

Paper • 2508.15774 • Published Aug 21, 2025 • 20

Self-Rewarding Vision-Language Model via Reasoning Decomposition

Paper • 2508.19652 • Published Aug 27, 2025 • 84

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Paper • 2508.20072 • Published Aug 27, 2025 • 32

AudioStory: Generating Long-Form Narrative Audio with Large Language Models

Paper • 2508.20088 • Published Aug 27, 2025 • 21

MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment

Paper • 2508.19527 • Published Aug 27, 2025 • 10

Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

Paper • 2508.19559 • Published Aug 27, 2025 • 6

Mixture of Contexts for Long Video Generation

Paper • 2508.21058 • Published Aug 28, 2025 • 35

rStar2-Agent: Agentic Reasoning Technical Report

Paper • 2508.20722 • Published Aug 28, 2025 • 117

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Paper • 2508.20751 • Published Aug 28, 2025 • 89

AWorld: Orchestrating the Training Recipe for Agentic AI

Paper • 2508.20404 • Published Aug 28, 2025 • 38

Dress&Dance: Dress up and Dance as You Like It - Technical Preview

Paper • 2508.21070 • Published Aug 28, 2025 • 6

ROSE: Remove Objects with Side Effects in Videos

Paper • 2508.18633 • Published Aug 26, 2025 • 7

EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

Paper • 2508.21112 • Published Aug 28, 2025 • 77

A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Paper • 2508.18106 • Published Aug 25, 2025 • 348

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

Paper • 2508.21113 • Published Aug 28, 2025 • 110

AHELM: A Holistic Evaluation of Audio-Language Models

Paper • 2508.21376 • Published Aug 29, 2025 • 9

Morae: Proactively Pausing UI Agents for User Choices

Paper • 2508.21456 • Published Aug 29, 2025 • 5

UItron: Foundational GUI Agent with Advanced Perception and Planning

Paper • 2508.21767 • Published Aug 29, 2025 • 12

Efficient Code Embeddings from Code Generation Models

Paper • 2508.21290 • Published Aug 29, 2025 • 19

TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training

Paper • 2508.17677 • Published Aug 25, 2025 • 14

CLIPSym: Delving into Symmetry Detection with CLIP

Paper • 2508.14197 • Published Aug 19, 2025 • 8

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

Paper • 2508.21148 • Published Aug 28, 2025 • 140

Continual Learning for Large Language Models: A Survey

Paper • 2402.01364 • Published Feb 2, 2024 • 1

Continual Learning with Pre-Trained Models: A Survey

Paper • 2401.16386 • Published Jan 29, 2024 • 1

Continual Learning: Applications and the Road Forward

Paper • 2311.11908 • Published Nov 20, 2023 • 1

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Paper • 2509.02547 • Published Sep 2, 2025 • 230

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Paper • 2509.02479 • Published Sep 2, 2025 • 84

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

Paper • 2508.21496 • Published Aug 29, 2025 • 55

VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

Paper • 2509.01055 • Published Sep 1, 2025 • 79

POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Paper • 2509.01215 • Published Sep 1, 2025 • 51

GenCompositor: Generative Video Compositing with Diffusion Transformer

Paper • 2509.02460 • Published Sep 2, 2025 • 26

OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

Paper • 2509.01644 • Published Sep 1, 2025 • 34

Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation

Paper • 2509.00428 • Published Aug 30, 2025 • 18

From Editor to Dense Geometry Estimator

Paper • 2509.04338 • Published Sep 4, 2025 • 96

Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings

Paper • 2508.18733 • Published Aug 26, 2025 • 10

Towards a Unified View of Large Language Model Post-Training

Paper • 2509.04419 • Published Sep 4, 2025 • 76

RedStone: Curating General, Code, Math, and QA Data for Large Language Models

Paper • 2412.03398 • Published Dec 4, 2024 • 2

RecAgent: A Novel Simulation Paradigm for Recommender Systems

Paper • 2306.02552 • Published Jun 5, 2023 • 1

Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning

Paper • 2503.11646 • Published Mar 14, 2025 • 34

How do language models learn facts? Dynamics, curricula and hallucinations

Paper • 2503.21676 • Published Mar 27, 2025 • 1

Investigating Multi-source Active Learning for Natural Language Inference

Paper • 2302.06976 • Published Feb 14, 2023

Targeted Data Acquisition for Evolving Negotiation Agents

Paper • 2106.07728 • Published Jun 14, 2021

UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

Paper • 2509.06155 • Published Sep 7, 2025 • 14

Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

Paper • 2509.06949 • Published Sep 8, 2025 • 56

Reinforced Visual Perception with Tools

Paper • 2509.01656 • Published Sep 1, 2025 • 32

Reinforcement Learning Foundations for Deep Research Systems: A Survey

Paper • 2509.06733 • Published Sep 8, 2025 • 32

Visual Representation Alignment for Multimodal Large Language Models

Paper • 2509.07979 • Published Sep 9, 2025 • 84

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Paper • 2509.06951 • Published Sep 8, 2025 • 32

A Survey of Reinforcement Learning for Large Reasoning Models

Paper • 2509.08827 • Published Sep 10, 2025 • 190

EnvX: Agentize Everything with Agentic AI

Paper • 2509.08088 • Published Sep 9, 2025 • 8

HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants

Paper • 2509.08494 • Published Sep 10, 2025 • 3

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Paper • 2509.09372 • Published Sep 11, 2025 • 246

HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Paper • 2509.08519 • Published Sep 10, 2025 • 128

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Paper • 2509.09674 • Published Sep 11, 2025 • 80

Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Paper • 2509.09595 • Published Sep 11, 2025 • 48

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Paper • 2509.09676 • Published Sep 11, 2025 • 35

Visual Programmability: A Guide for Code-as-Thought in Chart Understanding

Paper • 2509.09286 • Published Sep 11, 2025 • 11

Agentic Software Engineering: Foundational Pillars and a Research Roadmap

Paper • 2509.06216 • Published Sep 7, 2025 • 8

AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities

Paper • 2508.11126 • Published Aug 15, 2025

Agentic AI Frameworks: Architectures, Protocols, and Design Challenges

Paper • 2508.10146 • Published Aug 13, 2025

Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs

Paper • 2509.15020 • Published Sep 18, 2025 • 4

Developer-LLM Conversations: An Empirical Study of Interactions and Generated Code Quality

Paper • 2509.10402 • Published Sep 12, 2025 • 6

Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

Paper • 2509.15178 • Published Sep 18, 2025 • 6

RecoWorld: Building Simulated Environments for Agentic Recommender Systems

Paper • 2509.10397 • Published Sep 12, 2025 • 7

MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks

Paper • 2509.14638 • Published Sep 18, 2025 • 13

AToken: A Unified Tokenizer for Vision

Paper • 2509.14476 • Published Sep 17, 2025 • 36

FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

Paper • 2509.13160 • Published Sep 16, 2025 • 29

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

Paper • 2509.15185 • Published Sep 18, 2025 • 29

Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Paper • 2509.15194 • Published Sep 18, 2025 • 33

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Paper • 2509.15221 • Published Sep 18, 2025 • 111

FlowRL: Matching Reward Distributions for LLM Reasoning

Paper • 2509.15207 • Published Sep 18, 2025 • 116

Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation

Paper • 2509.14760 • Published Sep 18, 2025 • 53

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Paper • 2509.16197 • Published Sep 19, 2025 • 58

Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

Paper • 2509.15591 • Published Sep 19, 2025 • 45

Lynx: Towards High-Fidelity Personalized Video Generation

Paper • 2509.15496 • Published Sep 19, 2025 • 13

LIMI: Less is More for Agency

Paper • 2509.17567 • Published Sep 22, 2025 • 104

OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Paper • 2509.17627 • Published Sep 22, 2025 • 66

Qwen3-Omni Technical Report

Paper • 2509.17765 • Published Sep 22, 2025 • 149

OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System

Paper • 2509.18091 • Published Sep 22, 2025 • 34

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Paper • 2509.18056 • Published Sep 22, 2025 • 27

GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning

Paper • 2509.17437 • Published Sep 22, 2025 • 17

EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

Paper • 2509.17396 • Published Sep 22, 2025 • 19

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Paper • 2509.16941 • Published Sep 21, 2025 • 21

FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

Paper • 2509.17177 • Published Sep 21, 2025 • 13

Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels

Paper • 2509.16596 • Published Sep 20, 2025 • 14

Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning

Paper • 2509.18083 • Published Sep 22, 2025 • 5

Understanding Embedding Scaling in Collaborative Filtering

Paper • 2509.15709 • Published Sep 19, 2025 • 5

ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment

Paper • 2509.17818 • Published Sep 22, 2025 • 8

AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

Paper • 2509.17641 • Published Sep 22, 2025 • 4

DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context

Paper • 2509.17399 • Published Sep 22, 2025 • 2

When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs

Paper • 2509.16633 • Published Sep 20, 2025 • 2

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Paper • 2509.18154 • Published Sep 16, 2025 • 55

Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

Paper • 2509.18824 • Published Sep 23, 2025 • 23

What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT

Paper • 2509.19284 • Published Sep 23, 2025 • 23

VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Paper • 2509.19002 • Published Sep 23, 2025 • 3

Video models are zero-shot learners and reasoners

Paper • 2509.20328 • Published Sep 24, 2025 • 100

SIM-CoT: Supervised Implicit Chain-of-Thought

Paper • 2509.20317 • Published Sep 24, 2025 • 42

EmbeddingGemma: Powerful and Lightweight Text Representations

Paper • 2509.20354 • Published Sep 24, 2025 • 47

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Paper • 2509.20360 • Published Sep 24, 2025 • 18

PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

Paper • 2509.20358 • Published Sep 24, 2025 • 15

Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

Paper • 2509.19244 • Published Sep 23, 2025 • 12

Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say

Paper • 2509.21164 • Published Sep 25, 2025 • 9

VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

Paper • 2509.19803 • Published Sep 24, 2025 • 120

SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

Paper • 2509.21320 • Published Sep 25, 2025 • 101

MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Paper • 2509.21268 • Published Sep 25, 2025 • 104

Tree Search for LLM Agent Reinforcement Learning

Paper • 2509.21240 • Published Sep 25, 2025 • 92

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Paper • 2509.20427 • Published Sep 24, 2025 • 82

AutoIntent: AutoML for Text Classification

Paper • 2509.21138 • Published Sep 25, 2025 • 36

TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

Paper • 2509.21117 • Published Sep 25, 2025 • 30

Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution

Paper • 2509.21072 • Published Sep 25, 2025 • 15

Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Paper • 2509.21278 • Published Sep 25, 2025 • 16

Thinking Augmented Pre-training

Paper • 2509.20186 • Published Sep 24, 2025 • 23

Understanding the Thinking Process of Reasoning Models: A Perspective from Schoenfeld's Episode Theory

Paper • 2509.14662 • Published Sep 18, 2025 • 13

SD3.5-Flash: Distribution-Guided Distillation of Generative Flows

Paper • 2509.21318 • Published Sep 25, 2025 • 11

Interactive Recommendation Agent with Active User Commands

Paper • 2509.21317 • Published Sep 25, 2025 • 7

UserRL: Training Interactive User-Centric Agent via Reinforcement Learning

Paper • 2509.19736 • Published Sep 24, 2025 • 12

MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning

Paper • 2509.21113 • Published Sep 25, 2025 • 6

SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent

Paper • 2509.20414 • Published Sep 24, 2025 • 10

Thinking While Listening: Simple Test Time Scaling For Audio Classification

Paper • 2509.19676 • Published Sep 24, 2025 • 5

When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

Paper • 2509.20293 • Published Sep 24, 2025 • 8

Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

Paper • 2509.20109 • Published Sep 24, 2025 • 4

CompLLM: Compression for Long Context Q&A

Paper • 2509.19228 • Published Sep 23, 2025 • 10

Blueprints of Trust: AI System Cards for End to End Transparency and Governance

Paper • 2509.20394 • Published Sep 23, 2025 • 3

StyleBench: Evaluating thinking styles in Large Language Models

Paper • 2509.20868 • Published Sep 25, 2025 • 4

OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps

Paper • 2509.19282 • Published Sep 23, 2025 • 8

LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

Paper • 2509.22414 • Published Sep 26, 2025 • 22

UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models

Paper • 2509.21760 • Published Sep 26, 2025 • 15

VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing

Paper • 2509.22651 • Published Sep 26, 2025 • 23

Variational Reasoning for Language Models

Paper • 2509.22637 • Published Sep 26, 2025 • 69

LongLive: Real-time Interactive Long Video Generation

Paper • 2509.22622 • Published Sep 26, 2025 • 188

A Survey of Interactive Generative Video

Paper • 2504.21853 • Published Apr 30, 2025 • 46

Evaluating Very Long-Term Conversational Memory of LLM Agents

Paper • 2402.17753 • Published Feb 27, 2024 • 19

VBench: Comprehensive Benchmark Suite for Video Generative Models

Paper • 2311.17982 • Published Nov 29, 2023 • 9

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

Paper • 2503.21755 • Published Mar 27, 2025 • 33

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

Paper • 2411.13503 • Published Nov 20, 2024 • 34

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Paper • 2406.16855 • Published Jun 24, 2024 • 57

VCBench: Benchmarking LLMs in Venture Capital

Paper • 2509.14448 • Published Sep 17, 2025

AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection

Paper • 2504.20865 • Published Apr 29, 2025

ConsumerBench: Benchmarking Generative AI Applications on End-User Devices

Paper • 2506.17538 • Published Jun 21, 2025 • 7

Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol

Paper • 2503.05860 • Published Mar 7, 2025 • 11

MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

Paper • 2507.12284 • Published Jul 16, 2025 • 12

Benchmarking Neural Network Training Algorithms

Paper • 2306.07179 • Published Jun 12, 2023 • 24

SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

Paper • 2406.14991 • Published Jun 21, 2024 • 2

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

Paper • 2506.00482 • Published May 31, 2025 • 8

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Paper • 2406.15877 • Published Jun 22, 2024 • 48

MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

Paper • 2407.18961 • Published Jul 18, 2024 • 40

ImgEdit: A Unified Image Editing Dataset and Benchmark

Paper • 2505.20275 • Published May 26, 2025 • 18

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

Paper • 2504.02782 • Published Apr 3, 2025 • 57

7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models

Paper • 2508.12919 • Published Aug 18, 2025 • 1

Instruction-Following Evaluation in Function Calling for Large Language Models

Paper • 2509.18420 • Published Sep 22, 2025 • 2

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Paper • 2509.22186 • Published Sep 26, 2025 • 146

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

Paper • 2509.22220 • Published Sep 26, 2025 • 65

RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

Paper • 2509.24897 • Published Sep 29, 2025 • 46

OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

Paper • 2509.24900 • Published Sep 29, 2025 • 53

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Paper • 2505.09568 • Published May 14, 2025 • 99

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

Paper • 2505.02567 • Published May 5, 2025 • 80

Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models

Paper • 2406.12644 • Published Jun 18, 2024 • 5

ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies

Paper • 2506.12830 • Published Jun 15, 2025

CompBench: Benchmarking Complex Instruction-guided Image Editing

Paper • 2505.12200 • Published May 18, 2025

Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination

Paper • 2509.01986 • Published Sep 2, 2025 • 5

GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

Paper • 2310.11513 • Published Oct 17, 2023 • 1

Visual Jigsaw Post-Training Improves MLLMs

Paper • 2509.25190 • Published Sep 29, 2025 • 37

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Paper • 2509.24695 • Published Sep 29, 2025 • 46

Democratizing AI scientists using ToolUniverse

Paper • 2509.23426 • Published Sep 27, 2025 • 40

EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering

Paper • 2509.25175 • Published Sep 29, 2025 • 31

Towards Personalized Deep Research: Benchmarks and Evaluations

Paper • 2509.25106 • Published Sep 29, 2025 • 30

VideoScore2: Think before You Score in Generative Video Evaluation

Paper • 2509.22799 • Published Sep 26, 2025 • 26

MMPB: It's Time for Multi-Modal Personalization

Paper • 2509.22820 • Published Sep 26, 2025 • 15

Personalization of Large Language Models: A Survey

Paper • 2411.00027 • Published Oct 29, 2024 • 33

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Paper • 2509.25161 • Published Sep 29, 2025 • 26

HunyuanImage 3.0 Technical Report

Paper • 2509.23951 • Published Sep 28, 2025 • 25

PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

Paper • 2509.25185 • Published Sep 29, 2025 • 5

Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification

Paper • 2509.23061 • Published Sep 27, 2025 • 7

UniVid: The Open-Source Unified Video Model

Paper • 2509.24200 • Published Sep 29, 2025 • 5

PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation

Paper • 2509.23338 • Published Sep 27, 2025 • 8

BPMN Assistant: An LLM-Based Approach to Business Process Modeling

Paper • 2509.24592 • Published Sep 29, 2025 • 3

Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models

Paper • 2509.23233 • Published Sep 27, 2025 • 4

Advancing Reference-free Evaluation of Video Captions with Factual Analysis

Paper • 2509.16538 • Published Sep 20, 2025 • 1

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Paper • 2509.24002 • Published Sep 28, 2025 • 176

OceanGym: A Benchmark Environment for Underwater Embodied Agents

Paper • 2509.26536 • Published Sep 30, 2025 • 36

DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder

Paper • 2509.25182 • Published Sep 29, 2025 • 39

Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training

Paper • 2509.26625 • Published Sep 30, 2025 • 43

VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

Paper • 2509.26490 • Published Sep 30, 2025 • 20

dParallel: Learnable Parallel Decoding for dLLMs

Paper • 2509.26488 • Published Sep 30, 2025 • 19

DA²: Depth Anything in Any Direction

Paper • 2509.26618 • Published Sep 30, 2025 • 26

TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

Paper • 2509.26329 • Published Sep 30, 2025 • 3

Video Object Segmentation-Aware Audio Generation

Paper • 2509.26604 • Published Sep 30, 2025 • 1

BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software

Paper • 2509.25248 • Published Sep 27, 2025 • 3

Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation

Paper • 2509.26555 • Published Sep 30, 2025 • 1

Regression Language Models for Code

Paper • 2509.26476 • Published Sep 30, 2025 • 17

The Pitfalls of KV Cache Compression

Paper • 2510.00231 • Published Sep 30, 2025 • 6

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Paper • 2509.26539 • Published Sep 30, 2025 • 10

LayerD: Decomposing Raster Graphic Designs into Layers

Paper • 2509.25134 • Published Sep 29, 2025 • 2

Improving Editability in Image Generation with Layer-wise Memory

Paper • 2505.01079 • Published May 2, 2025 • 29

Generative Image Layer Decomposition with Visual Effects

Paper • 2411.17864 • Published Nov 26, 2024

Edit Transfer: Learning Image Editing via Vision In-Context Relations

Paper • 2503.13327 • Published Mar 17, 2025 • 29

Text2Layer: Layered Image Generation using Latent Diffusion Model

Paper • 2307.09781 • Published Jul 19, 2023 • 16

Code2Video: A Code-centric Paradigm for Educational Video Generation

Paper • 2510.01174 • Published Oct 1, 2025 • 35

GEM: A Gym for Agentic LLMs

Paper • 2510.01051 • Published Oct 1, 2025 • 90

BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses

Paper • 2510.00232 • Published Sep 30, 2025 • 16

In-Place Feedback: A New Paradigm for Guiding LLMs in Multi-Turn Reasoning

Paper • 2510.00777 • Published Oct 1, 2025 • 2

An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

Paper • 2509.19185 • Published Sep 23, 2025 • 4

Can Large Multimodal Models Uncover Deep Semantics Behind Images?

Paper • 2402.11281 • Published Feb 17, 2024 • 1

Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models

Paper • 2509.25162 • Published Sep 29, 2025 • 3

BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

Paper • 2510.00438 • Published Oct 1, 2025 • 10

BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

Paper • 2509.26514 • Published Sep 30, 2025 • 4

Eliciting Secret Knowledge from Language Models

Paper • 2510.01070 • Published Oct 1, 2025 • 6

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Paper • 2510.02283 • Published Oct 2, 2025 • 96

StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

Paper • 2510.02209 • Published Oct 2, 2025 • 56

BloombergGPT: A Large Language Model for Finance

Paper • 2303.17564 • Published Mar 30, 2023 • 30

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Paper • 2510.01284 • Published Sep 30, 2025 • 37

A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports

Paper • 2510.02190 • Published Oct 2, 2025 • 19

VIRTUE: Visual-Interactive Text-Image Universal Embedder

Paper • 2510.00523 • Published Oct 1, 2025 • 7

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Paper • 2504.17432 • Published Apr 24, 2025 • 40

LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation

Paper • 2411.04997 • Published Nov 7, 2024 • 39

Veagle: Advancements in Multimodal Representation Learning

Paper • 2403.08773 • Published Jan 18, 2024 • 10

CoDA: Agentic Systems for Collaborative Data Visualization

Paper • 2510.03194 • Published Oct 3, 2025 • 30

SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

Paper • 2510.03120 • Published Oct 3, 2025 • 7

Paper2Video: Automatic Video Generation from Scientific Papers

Paper • 2510.05096 • Published Oct 6, 2025 • 119

VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

Paper • 2510.05094 • Published Oct 6, 2025 • 38

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Paper • 2510.04618 • Published Oct 6, 2025 • 129

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

Paper • 2510.04800 • Published Oct 6, 2025 • 37

Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Paper • 2510.03215 • Published Oct 3, 2025 • 98

Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Paper • 2510.06590 • Published Oct 8, 2025 • 77

Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

Paper • 2510.06308 • Published Oct 7, 2025 • 55

SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Paper • 2510.06917 • Published Oct 8, 2025 • 35

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

Paper • 2510.08540 • Published Oct 9, 2025 • 109

MATRIX: Mask Track Alignment for Interaction-aware Video Generation

Paper • 2510.07310 • Published Oct 8, 2025 • 36

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

Paper • 2510.06710 • Published Oct 8, 2025 • 42

Vibe Checker: Aligning Code Evaluation with Human Preference

Paper • 2510.07315 • Published Oct 8, 2025 • 34

Online Generic Event Boundary Detection

Paper • 2510.06855 • Published Oct 8, 2025 • 4

Bridging Text and Video Generation: A Survey

Paper • 2510.04999 • Published Oct 6, 2025 • 6

U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking

Paper • 2510.07041 • Published Oct 8, 2025 • 4

DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents

Paper • 2509.21842 • Published Sep 26, 2025 • 3

Agent Learning via Early Experience

Paper • 2510.08558 • Published Oct 9, 2025 • 273

UniVideo: Unified Understanding, Generation, and Editing for Videos

Paper • 2510.08377 • Published Oct 9, 2025 • 81

UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

Paper • 2510.08143 • Published Oct 9, 2025 • 20

UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG

Paper • 2510.03663 • Published Oct 4, 2025 • 16

NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Paper • 2510.07172 • Published Oct 8, 2025 • 28

VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

Paper • 2510.08555 • Published Oct 9, 2025 • 64

Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training

Paper • 2510.08008 • Published Oct 9, 2025 • 6

Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs

Paper • 2510.07429 • Published Oct 8, 2025 • 4

Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window

Paper • 2510.08276 • Published Oct 9, 2025 • 10

SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Paper • 2510.08559 • Published Oct 9, 2025 • 9

Character Mixing for Video Generation

Paper • 2510.05093 • Published Oct 6, 2025 • 7

WithAnyone: Towards Controllable and ID Consistent Image Generation

Paper • 2510.14975 • Published Oct 16, 2025 • 85

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Paper • 2510.14979 • Published Oct 16, 2025 • 67

Attention Is All You Need for KV Cache in Diffusion LLMs

Paper • 2510.14973 • Published Oct 16, 2025 • 42

LLM-guided Hierarchical Retrieval

Paper • 2510.13217 • Published Oct 15, 2025 • 21

Qwen3Guard Technical Report

Paper • 2510.14276 • Published Oct 16, 2025 • 15

Learning an Image Editing Model without Image Editing Pairs

Paper • 2510.14978 • Published Oct 16, 2025 • 9

pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation

Paper • 2510.14974 • Published Oct 16, 2025 • 10

RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

Paper • 2510.13910 • Published Oct 15, 2025 • 2

DeepAgent: A General Reasoning Agent with Scalable Toolsets

Paper • 2510.21618 • Published Oct 24, 2025 • 101

Video-As-Prompt: Unified Semantic Control for Video Generation

Paper • 2510.20888 • Published Oct 23, 2025 • 50

UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

Paper • 2510.20286 • Published Oct 23, 2025 • 24

From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

Paper • 2510.19871 • Published Oct 22, 2025 • 30

RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging

Paper • 2510.20479 • Published Oct 23, 2025 • 12

Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

Paper • 2510.13251 • Published Oct 15, 2025 • 14

Model Merging with Functional Dual Anchors

Paper • 2510.21223 • Published Oct 24, 2025 • 13

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Paper • 2510.20206 • Published Oct 23, 2025 • 12

A Definition of AGI

Paper • 2510.18212 • Published Oct 21, 2025 • 36

Visual Diffusion Models are Geometric Solvers

Paper • 2510.21697 • Published Oct 24, 2025 • 20

AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Paper • 2510.21652 • Published Oct 24, 2025 • 4

ARC-Encoder: learning compressed text representations for large language models

Paper • 2510.20535 • Published Oct 23, 2025 • 8

Taming Modality Entanglement in Continual Audio-Visual Segmentation

Paper • 2510.17234 • Published Oct 20, 2025 • 5

MemOS: A Memory OS for AI System

Paper • 2507.03724 • Published Jul 4, 2025 • 159

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Paper • 2307.16789 • Published Jul 31, 2023 • 102

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Paper • 2304.08244 • Published Apr 14, 2023 • 1

ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

Paper • 2501.02506 • Published Jan 5, 2025 • 10

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents

Paper • 2207.01206 • Published Jul 4, 2022 • 3

GAIA: a benchmark for General AI Assistants

Paper • 2311.12983 • Published Nov 21, 2023 • 245

Task Vectors are Cross-Modal

Paper • 2410.22330 • Published Oct 29, 2024 • 11

In-Context Learning Creates Task Vectors

Paper • 2310.15916 • Published Oct 24, 2023 • 44

Group Relative Attention Guidance for Image Editing

Paper • 2510.24657 • Published Oct 28, 2025 • 26

OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents

Paper • 2510.24563 • Published Oct 28, 2025 • 23

WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking

Paper • 2510.24697 • Published Oct 28, 2025 • 21

BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions

Paper • 2510.10666 • Published Oct 12, 2025 • 28

WideSearch: Benchmarking Agentic Broad Info-Seeking

Paper • 2508.07999 • Published Aug 11, 2025 • 110

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Paper • 2506.01062 • Published Jun 1, 2025 • 5

Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

Paper • 2510.24711 • Published Oct 28, 2025 • 20

VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Paper • 2510.22373 • Published Oct 25, 2025 • 15

PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding

Paper • 2510.22264 • Published Oct 25, 2025 • 2

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Paper • 2510.26802 • Published Oct 30, 2025 • 34

AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

Paper • 2510.26768 • Published Oct 30, 2025 • 34

The Era of Agentic Organization: Learning to Organize with Language Models

Paper • 2510.26658 • Published Oct 30, 2025 • 29

OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation

Paper • 2510.26213 • Published Oct 30, 2025 • 10

Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets

Paper • 2510.25779 • Published Oct 27, 2025 • 11

CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Paper • 2510.26160 • Published Oct 30, 2025 • 17

ChartAB: A Benchmark for Chart Grounding & Dense Alignment

Paper • 2510.26781 • Published Oct 30, 2025 • 1

Emu3.5: Native Multimodal Models are World Learners

Paper • 2510.26583 • Published Oct 30, 2025 • 111

The End of Manual Decoding: Towards Truly End-to-End Language Models

Paper • 2510.26697 • Published Oct 30, 2025 • 117

Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

Paper • 2510.23473 • Published Oct 27, 2025 • 85

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

Paper • 2510.23538 • Published Oct 27, 2025 • 97

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Paper • 2510.25726 • Published Oct 29, 2025 • 46

VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

Paper • 2510.25772 • Published Oct 29, 2025 • 33

The Principles of Diffusion Models

Paper • 2510.21890 • Published Oct 24, 2025 • 62

RegionE: Adaptive Region-Aware Generation for Efficient Image Editing

Paper • 2510.25590 • Published Oct 29, 2025 • 28

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Paper • 2510.25760 • Published Oct 29, 2025 • 17

SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs

Paper • 2510.25092 • Published Oct 29, 2025 • 8

Reasoning Language Model Inference Serving Unveiled: An Empirical Study

Paper • 2510.18672 • Published Oct 21, 2025 • 8

InteractComp: Evaluating Search Agents With Ambiguous Queries

Paper • 2510.24668 • Published Oct 28, 2025 • 98

INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Paper • 2510.25602 • Published Oct 29, 2025 • 78

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Paper • 2510.27492 • Published Oct 30, 2025 • 86

Defeating the Training-Inference Mismatch via FP16

Paper • 2510.26788 • Published Oct 30, 2025 • 31

Revisiting Multimodal Positional Encoding in Vision-Language Models

Paper • 2510.23095 • Published Oct 27, 2025 • 22

Higher-order Linear Attention

Paper • 2510.27258 • Published Oct 31, 2025 • 15

The Denario project: Deep knowledge AI agents for scientific discovery

Paper • 2510.26887 • Published Oct 30, 2025 • 8

UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback

Paper • 2511.01678 • Published Nov 3, 2025 • 38

Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

Paper • 2510.22115 • Published Oct 25, 2025 • 85

The Underappreciated Power of Vision Models for Graph Structural Understanding

Paper • 2510.24788 • Published Oct 27, 2025 • 36

UniREditBench: A Unified Reasoning-based Image Editing Benchmark

Paper • 2511.01295 • Published Nov 3, 2025 • 39

ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

Paper • 2510.27363 • Published Oct 31, 2025 • 23

ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

Paper • 2511.01163 • Published Nov 3, 2025 • 32

Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

Paper • 2510.27571 • Published Oct 31, 2025 • 19

TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

Paper • 2511.01833 • Published Nov 3, 2025 • 16

LongCat-Flash-Omni Technical Report

Paper • 2511.00279 • Published Oct 31, 2025 • 26

Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

Paper • 2510.26865 • Published Oct 30, 2025 • 12

Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

Paper • 2511.01618 • Published Nov 3, 2025 • 11

Trove: A Flexible Toolkit for Dense Retrieval

Paper • 2511.01857 • Published Nov 3, 2025 • 12

Towards Robust Mathematical Reasoning

Paper • 2511.01846 • Published Nov 3, 2025 • 10

MotionStream: Real-Time Video Generation with Interactive Motion Controls

Paper • 2511.01266 • Published Nov 3, 2025 • 31

UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

Paper • 2511.00405 • Published Nov 1, 2025 • 6

Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

Paper • 2511.01617 • Published Nov 3, 2025 • 3

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Paper • 2511.02778 • Published Nov 4, 2025 • 102

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

Paper • 2511.02779 • Published Nov 4, 2025 • 59

LTD-Bench: Evaluating Large Language Models by Letting Them Draw

Paper • 2511.02347 • Published Nov 4, 2025 • 9

TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System

Paper • 2511.02832 • Published Nov 4, 2025 • 10

Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

Paper • 2511.02650 • Published Nov 4, 2025 • 10

CodeClash: Benchmarking Goal-Oriented Software Engineering

Paper • 2511.00839 • Published Nov 2, 2025 • 10

iFlyBot-VLA Technical Report

Paper • 2511.01914 • Published Nov 1, 2025 • 7

TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning in Tabular Data

Paper • 2511.02219 • Published Nov 4, 2025 • 2

LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context

Paper • 2511.02366 • Published Nov 4, 2025 • 4

VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

Paper • 2511.02712 • Published Nov 4, 2025 • 5

MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Paper • 2511.03146 • Published Nov 5, 2025 • 8

TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models

Paper • 2511.02802 • Published Nov 4, 2025 • 16

V-Thinker: Interactive Thinking with Images

Paper • 2511.04460 • Published Nov 6, 2025 • 97

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Paper • 2511.04570 • Published Nov 6, 2025 • 240

Scaling Agent Learning via Experience Synthesis

Paper • 2511.03773 • Published Nov 5, 2025 • 82

NVIDIA Nemotron Nano V2 VL

Paper • 2511.03929 • Published Nov 6, 2025 • 30

GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents

Paper • 2511.04307 • Published Nov 6, 2025 • 15

Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Paper • 2511.04655 • Published Nov 6, 2025 • 8

Diffusion Language Models are Super Data Learners

Paper • 2511.03276 • Published Nov 5, 2025 • 129

A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures

Paper • 2506.19676 • Published Jun 24, 2025

MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools

Paper • 2509.09734 • Published Sep 10, 2025 • 16

DeepEyesV2: Toward Agentic Multimodal Model

Paper • 2511.05271 • Published Nov 7, 2025 • 45

Visual Spatial Tuning

Paper • 2511.05491 • Published Nov 7, 2025 • 52

Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Paper • 2511.04962 • Published Nov 7, 2025 • 57

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

Paper • 2511.05017 • Published Nov 7, 2025 • 9

Dense Motion Captioning

Paper • 2511.05369 • Published Nov 7, 2025 • 10

Real-Time Reasoning Agents in Evolving Environments

Paper • 2511.04898 • Published Nov 7, 2025 • 13

EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

Paper • 2511.11002 • Published Nov 14, 2025 • 4

Experience-Guided Adaptation of Inference-Time Reasoning Strategies

Paper • 2511.11519 • Published Nov 14, 2025 • 4

SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Paper • 2511.06090 • Published Nov 8, 2025 • 5

Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions

Paper • 2511.06876 • Published Nov 10, 2025 • 28

Agentic Refactoring: An Empirical Study of AI Coding Agents

Paper • 2511.04824 • Published Nov 6, 2025 • 5

Motif 2 12.7B technical report

Paper • 2511.07464 • Published Nov 7, 2025 • 39

Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

Paper • 2511.08633 • Published Nov 9, 2025 • 55

Optimizing Diversity and Quality through Base-Aligned Model Collaboration

Paper • 2511.05650 • Published Nov 7, 2025 • 6

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

Paper • 2511.07885 • Published Nov 11, 2025 • 10

Walking the Tightrope of LLMs for Software Development: A Practitioners' Perspective

Paper • 2511.06428 • Published Nov 9, 2025 • 5

Adaptive Multi-Agent Response Refinement in Conversational Systems

Paper • 2511.08319 • Published Nov 11, 2025 • 42

Music Flamingo: Scaling Music Understanding in Audio Language Models

Paper • 2511.10289 • Published Nov 13, 2025 • 17

Depth Anything 3: Recovering the Visual Space from Any Views

Paper • 2511.10647 • Published Nov 13, 2025 • 99

UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

Paper • 2511.08521 • Published Nov 11, 2025 • 38

Instella: Fully Open Language Models with Stellar Performance

Paper • 2511.10628 • Published Nov 13, 2025 • 5

WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance

Paper • 2511.12997 • Published Nov 17, 2025 • 11

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Paper • 2511.09611 • Published Nov 12, 2025 • 70

TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

Paper • 2511.13704 • Published Nov 17, 2025 • 43

Workload Schedulers -- Genesis, Algorithms and Differences

Paper • 2511.10258 • Published Nov 13, 2025 • 2

SAM 3D: 3Dfy Anything in Images

Paper • 2511.16624 • Published Nov 20, 2025 • 113

OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

Paper • 2511.16334 • Published Nov 20, 2025 • 93

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

Paper • 2511.15705 • Published Nov 19, 2025 • 97

Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Paper • 2511.17490 • Published Nov 21, 2025 • 22

WorldGen: From Text to Traversable and Interactive 3D Worlds

Paper • 2511.16825 • Published Nov 20, 2025 • 24

OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists

Paper • 2511.16931 • Published Nov 21, 2025 • 8

RynnVLA-002: A Unified Vision-Language-Action and World Model

Paper • 2511.17502 • Published Nov 21, 2025 • 28

SAM 3: Segment Anything with Concepts

Paper • 2511.16719 • Published Nov 20, 2025 • 129

O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents

Paper • 2511.13593 • Published Nov 17, 2025 • 27

In-Video Instructions: Visual Signals as Generative Control

Paper • 2511.19401 • Published Nov 24, 2025 • 32

HunyuanVideo 1.5 Technical Report

Paper • 2511.18870 • Published Nov 24, 2025 • 28

Controllable Layer Decomposition for Reversible Multi-Layer Image Generation

Paper • 2511.16249 • Published Nov 20, 2025 • 9

TradingAgents: Multi-Agents LLM Financial Trading Framework

Paper • 2412.20138 • Published Dec 28, 2024 • 18

Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Paper • 2511.21691 • Published Nov 26, 2025 • 36

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Paper • 2511.18538 • Published Nov 23, 2025 • 299

DeepCode: Open Agentic Coding

Paper • 2512.07921 • Published Dec 8, 2025 • 33

RecGPT-V2 Technical Report

Paper • 2512.14503 • Published Dec 16, 2025 • 18

Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search from Task-Centric Views

Paper • 2512.12980 • Published Dec 15, 2025 • 28

Olmo 3

Paper • 2512.13961 • Published Dec 15, 2025 • 29

Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

Paper • 2512.15110 • Published Dec 17, 2025 • 10

SemanticGen: Video Generation in Semantic Space

Paper • 2512.20619 • Published Dec 23, 2025 • 93

LongVideoAgent: Multi-Agent Reasoning with Long Videos

Paper • 2512.20618 • Published Dec 23, 2025 • 55

Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization

Paper • 2512.24615 • Published Dec 31, 2025 • 119

SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Paper • 2512.24330 • Published Dec 30, 2025 • 35

ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Paper • 2601.21558 • Published 27 days ago • 58

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Paper • 2601.23143 • Published 26 days ago • 38

FireRed-Image-Edit-1.0 Technical Report

Paper • 2602.13344 • Published 13 days ago • 4

Discovering Multiagent Learning Algorithms with Large Language Models

Paper • 2602.16928 • Published 7 days ago • 12

MMA: Multimodal Memory Agent

Paper • 2602.16493 • Published 7 days ago • 7

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Paper • 2602.12670 • Published 12 days ago • 51

Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models

Paper • 2602.15772 • Published 8 days ago • 6

On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

Paper • 2602.15322 • Published 8 days ago • 9

GLM-5: from Vibe Coding to Agentic Engineering

Paper • 2602.15763 • Published 8 days ago • 94