Abstract
Agentic reinforcement learning reframes large language models as autonomous decision-making agents operating in temporally extended, partially observable environments, using reinforcement learning to strengthen capabilities such as planning, tool use, and reasoning.
The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM-RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes the conceptual shift by contrasting the degenerate single-step Markov decision processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.
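To make this contrast concrete, the sketch below gives one minimal formalization; the symbols ($\rho$, $\Omega$, $O$, $T$, $\gamma$) are illustrative choices and may differ from the survey's own notation. Conventional LLM-RL reduces to a single-step MDP in which a prompt $s \sim \rho$ elicits one complete response $a$ and the episode terminates:

\[
J_{\text{LLM-RL}}(\theta) \;=\; \mathbb{E}_{s \sim \rho,\; a \sim \pi_\theta(\cdot \mid s)}\big[\, r(s, a) \,\big], \qquad T = 1.
\]

Agentic RL instead unfolds as a POMDP $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma, \Omega, O \rangle$: the agent receives partial observations $o_t \sim O(\cdot \mid s_t)$ of a latent environment state $s_t$, conditions each action on the interaction history, and maximizes discounted return over a horizon $T > 1$:

\[
a_t \sim \pi_\theta(\cdot \mid o_{\le t}, a_{< t}), \qquad
J_{\text{Agentic}}(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\Big[ \sum_{t=0}^{T-1} \gamma^{\,t}\, r(s_t, a_t) \Big].
\]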
Community
This survey has charted the emergence of Agentic Reinforcement Learning (Agentic RL), a paradigm that elevates LLMs from passive text generators to autonomous, decision-making agents situated in complex, dynamic worlds. Our journey began by formalizing this conceptual shift, distinguishing the temporally extended, partially observable Markov decision processes (POMDPs) that characterize Agentic RL from the single-step decision processes of conventional LLM-RL. From this foundation, we constructed a comprehensive twofold taxonomy to systematically map the field: one centered on core agentic capabilities (planning, tool use, memory, reasoning, self-improvement, and perception, among others) and the other on their application across a diverse array of task domains. Throughout this analysis, our central thesis has been that RL provides the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. By consolidating the landscape of open-source environments, benchmarks, and frameworks, we have also provided a practical compendium to ground and accelerate future research in this burgeoning field.
Thanks for the insightful work!