
OpenEnv Integration for Training LLMs with Environments

OpenEnv is an open-source framework from Meta’s PyTorch team for defining, deploying, and interacting with environments in reinforcement learning (RL) and agentic workflows. It offers Gymnasium-style APIs (e.g., reset() and step()) to interface with environments in a standard manner, and supports running these environments as backend servers (for example, via HTTP or containerised execution). You can find a collection of ready-to-use OpenEnv environments on the Hugging Face Hub.
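As a minimal illustration of that API, here is what a single interaction with the Echo environment (used later in this guide) looks like, assuming OpenEnv is installed and Docker is available locally:

from envs.echo_env import EchoEnv, EchoAction

# Pull the environment from the Hub and start it as a local Docker container
env = EchoEnv.from_hub("openenv/echo-env")

result = env.reset()                            # start a new episode
result = env.step(EchoAction(message="hello"))  # act in the environment
print(result.reward, result.done)               # reward and termination flag returned by the env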

In this guide, we’ll focus on how to integrate OpenEnv with TRL, but feel free to explore the links above to dive deeper into OpenEnv itself.

You can explore ready-to-use example scripts and notebooks in the Examples Overview.

Explore the OpenEnv docs for more details.

Installation

To use OpenEnv with TRL, install the framework:

pip install git+https://github.com/meta-pytorch/OpenEnv.git

Using rollout_func with OpenEnv environments

TRL’s GRPOTrainer supports custom rollout logic through the rollout_func argument. This lets you override the trainer’s default text-generation loop and directly interact with OpenEnv environments — for instance, to compute environment-driven rewards instead of relying solely on model-based signals.

Rollout Function Signature

A rollout function must have the following signature:

def rollout_func(
    prompts: list[str],
    trainer: GRPOTrainer,
) -> dict[str, list]:
    """
    Custom rollout function for generation and reward computation.

    Args:
        prompts: List of prompts routed to the current process
        trainer: Active GRPOTrainer (gives access to tokenizer, config and helper utilities)

    Returns:
        Dictionary containing:
        - prompt_ids: List of token IDs for each prompt
        - completion_ids: List of token IDs for each completion
        - logprobs: List of log probabilities for each token
        - Any additional fields are forwarded to reward functions as kwargs
    """
    pass

Any extra fields in the returned dictionary (beyond the required three) are automatically forwarded to your reward functions. This makes it easy to propagate signals such as environment rewards or auxiliary metrics from the rollout step.
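Schematically, an extra field such as env_reward travels from the rollout to the reward function like this (a minimal sketch with dummy values; the full Echo example below shows a complete version):

def rollout_func(prompts, trainer):
    # Dummy values for illustration; a real rollout generates these with vLLM and the environment
    prompt_ids = [[101, 102] for _ in prompts]
    completion_ids = [[201, 202] for _ in prompts]
    logprobs = [[-0.1, -0.2] for _ in prompts]
    env_rewards = [1.0 for _ in prompts]  # e.g. collected from env.step()
    return {
        "prompt_ids": prompt_ids,
        "completion_ids": completion_ids,
        "logprobs": logprobs,
        "env_reward": env_rewards,  # extra field: forwarded to reward functions
    }

def reward_from_env(completions, **kwargs):
    # The extra field arrives here as a keyword argument with the same name
    return [float(r) for r in kwargs.get("env_reward", [0.0] * len(completions))]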

Integration pattern

The typical pattern when combining OpenEnv with TRL looks like this:

  1. Start or connect to an OpenEnv environment (e.g., a Dockerized env or HTTP endpoint).
  2. Generate completions from your model using trl.experimental.openenv.generate_rollout_completions, which works with both colocated vLLM and vLLM running in server mode.
  3. Step through the environment using each completion to compute rewards or metrics.
  4. Add environment results (e.g., env_reward) to the rollout result dict.
  5. Access those rewards inside your reward function via **kwargs.

By using OpenEnv in this loop, you can:

  • Train with realistic or interactive feedback (not just static reward functions).
  • Plug in custom simulators, web APIs, or evaluators as environments.
  • Pass structured reward signals back into RL training seamlessly.

vLLM Modes

TRL supports two vLLM execution modes for generation:

  • colocate mode (default): vLLM runs in the same process as training. Requires 1 GPU. Use trl.experimental.openenv.generate_rollout_completions for generation.
  • server mode: vLLM runs as a separate server process. Requires at least 2 GPUs (one for vLLM server, one for training), but is highly scalable:
    • You can allocate multiple GPUs to the vLLM server for tensor parallelism (faster inference)
    • You can run multiple training processes that share the same vLLM server
    • You can use different GPU types for inference vs training (e.g., A100 for vLLM, H100 for training)
    • The vLLM server can serve multiple experiments simultaneously
    • Use trl.experimental.openenv.generate_rollout_completions, which communicates with the server at the configured vllm_server_base_url

Configure the mode via GRPOConfig:

# Colocate mode (1 GPU)
args = GRPOConfig(
    use_vllm=True,
    vllm_mode="colocate",
    # ... other args
)

# Server mode (2+ GPUs, scalable)
args = GRPOConfig(
    use_vllm=True,
    vllm_mode="server",
    vllm_server_base_url="http://localhost:8000",
    # ... other args
)

# Example: Start vLLM server with multiple GPUs for tensor parallelism
# CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen3-1.7B --tensor-parallel-size 4

Running the Environments

You can run OpenEnv environments in three different ways:

  • Load the environment from the Hugging Face Hub and run it locally as a Docker container.
  • Connect to a hosted environment running as a Space on the Hugging Face Hub.
  • Launch the environment directly in Python using Uvicorn.

Load from Hugging Face Hub (recommended)

You can use the from_hub method to load the environment from the Hub. This method will automatically start a Docker container for the environment on your local machine. Here, openenv/echo-env is the repo_id of the Space on the Hub.

env = EchoEnv.from_hub("openenv/echo-env")

If you want to launch the environment manually, you can use the following command to pull and run the Docker container:

docker run -d -p 8001:8000 --platform linux/amd64  registry.hf.space/openenv-echo-env:latest

And then you can connect to the environment using the following code:

env = EchoEnv(base_url="http://0.0.0.0:8001")

Here, the container’s port 8000 is mapped to host port 8001 so that port 8000 stays free for a vLLM server; adjust the port mapping to suit your local machine.

You can find the Docker container for any space on the hub.

  • Open the space page on the hub.
  • Click the ⋮ (three dots) menu.
  • Select “Run locally.”
  • Copy and execute the provided command in your terminal.


You can also use the Docker option with from_docker_image by providing the image name. For more details, refer to the official OpenEnv documentation.
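For instance, something along these lines, assuming the image has already been pulled locally (the image name and tag are illustrative; use the one shown for your Space):

from envs.echo_env import EchoEnv

# Start the environment from a local Docker image instead of fetching it from the Hub
env = EchoEnv.from_docker_image("echo-env:latest")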

Environments Catalog

Environment development is active and evolving. The best way to explore the current catalog of maintained environments is by visiting the official OpenEnv catalog.

Custom environments are also supported. To learn how to create your own, check out the guide on Building Your Own Environment with OpenEnv.

Environments are tightly integrated with the Hub, allowing you to push new environments directly so the community can easily pull, reuse, and adapt them for their own use cases.

A simple example

You can explore more ready-to-use example scripts in the examples/scripts/openenv/ directory.

The echo.py script demonstrates a minimal, end-to-end integration between TRL and OpenEnv. In this example, the Echo environment rewards completions based on their text length, encouraging the model to generate longer outputs. This pattern can be extended to any custom environment that provides structured feedback or task-based rewards:

from datasets import Dataset
from envs.echo_env import EchoEnv, EchoAction
from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openenv import generate_rollout_completions

# Create HTTP client for Echo Environment
client = EchoEnv.from_hub("openenv/echo-env")

"""
Alternatively, you can start the environment manually with Docker and connect to it:

# Step 1: Start the Echo environment
docker run -d -p 8001:8001 registry.hf.space/openenv-echo-env:latest

# Step 2: Connect the client to the running container
client = EchoEnv(base_url="http://0.0.0.0:8001")
"""

def rollout_func(prompts: list[str], trainer: GRPOTrainer):
    # 1. Generate completions using TRL's helper (works for colocated vLLM)
    outputs = generate_rollout_completions(trainer, prompts)
    tokenizer = trainer.processing_class
    completions_text = [
        tokenizer.decode(out["completion_ids"], skip_special_tokens=True) for out in outputs
    ]

    # 2. Step through the environment to get rewards
    client.reset()
    env_rewards = []
    for msg in completions_text:
        env_result = client.step(EchoAction(message=msg))
        env_rewards.append(env_result.reward)

    # 3. Add environment rewards as extra field
    return {
        "prompt_ids": [out["prompt_ids"] for out in outputs],
        "completion_ids": [out["completion_ids"] for out in outputs],
        "logprobs": [out["logprobs"] for out in outputs],
        "env_reward": env_rewards,
    }

def reward_from_env(completions, **kwargs):
    """Extract environment rewards passed via rollout_func kwargs."""
    env_rewards = kwargs.get("env_reward", [])
    return [float(reward) for reward in env_rewards] if env_rewards else [0.0] * len(completions)

dataset = Dataset.from_dict({"prompt": ["You are an AI that interacts with an *Echo* environment. Word to echo:"] * 64})

# Setup trainer with custom rollout
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_from_env,
    train_dataset=dataset,
    rollout_func=rollout_func,  # Use custom rollout
    args=GRPOConfig(
        use_vllm=True,
        vllm_mode="colocate",  # Use colocate mode (default)
        num_train_epochs=1,
        num_generations=8,
        max_completion_length=2048,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
    ),
)
trainer.train()

That’s it! Now that you’ve seen the full example, let’s unpack how the main pieces fit together.

  1. Environment Client: EchoEnv implements an HTTP interface to interact with the environment server.
  2. Custom rollout: The rollout_func generates completions and steps through the environment to collect rewards.
  3. Extra fields: The rollout adds env_reward to the result dictionary, which is automatically passed to reward functions.
  4. Reward function: Extracts env_reward from kwargs to apply environment-computed rewards during training.

The trainer-aware rollout hook works in both vLLM server and colocate modes. Use trl.experimental.openenv.generate_rollout_completions so you reuse TRL’s sampling configuration automatically.

Running the Example

You can run the example in either colocate mode (1 GPU) or server mode (2 GPUs):


Colocate mode (1 GPU, recommended)

python examples/scripts/openenv/echo.py --vllm-mode colocate

This runs vLLM in the same process as training, requiring only a single GPU.

Alternatively, you can manually start the Echo environment in a Docker container before running the training:

# Launch the Echo environment
docker run -d -p 8001:8001 registry.hf.space/openenv-echo-env:latest

Then, initialize the client using:

client = EchoEnv(base_url="http://0.0.0.0:8001")

instead of:

client = EchoEnv.from_docker_image("echo-env:latest")

Below is the reward curve from training:

Advanced Example

Let’s level this up a bit by training a model to interact with a more complex environment. We’ll use the word-guessing game Wordle from the TextArena environment.


You can explore the notebook version of this example here.

The TextArena Environment

TextArena is an open-source collection of competitive text-based games, such as Wordle, Snake, and Tic-Tac-Toe, designed to evaluate reasoning skills in LLMs. Research has shown that such games improve model performance on reasoning tasks.


We will use the TextArena environment to train a model to play Wordle. The environment is a simple text-based environment that lets the model interact with the game by making guesses and receiving feedback on them.

Wordle

Wordle is a useful game for training because it requires the model to reason about both the word and the feedback provided by the environment. It is also a purely language-based game that requires no external tools or knowledge. Furthermore, we found that models from 1 billion parameters and up are able to improve at Wordle while needing only 8 tokens to generate a guess, which makes the game a good benchmark for experimenting with reinforcement learning environments without significant compute requirements.

How does Wordle work? Wordle is a word guessing game where the player has to guess a 5-letter word. The player can make 6 guesses, and for each guess, the environment will provide feedback on the correctness of the guess. The player wins if they guess the word in 6 guesses or fewer. It challenges the model to generate words that are likely to be correct, and to learn from the feedback provided by the environment.

For example, if the wordle environment returns the following feedback:

G U E S S
X G Y X X

The model has guessed the word “GUESS” and the environment has responded with the letters X, G, and Y, which correspond to the blank, green, and yellow colors in the original game. From this feedback, the model should learn that the word “GUESS” is incorrect: the letter “E” is in the word but in the wrong position, and the letter “U” is correct and in the correct position.

In the TextArena environment, a reward is only given when the model wins the game: 1.0 for a win, and 0.0 otherwise. This sparse signal is hard for the model to learn from, so we have added a number of custom reward functions to the script to help the model learn to play the game. The extensible nature of reward_funcs and rollout_func lets you add any custom reward function you want to the script.
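To make the shaping concrete, here is a minimal sketch of how feedback like the example above could be turned into dense per-guess signals. The actual script uses its own helpers (extract_wordle_feedback and extract_feedback_counts, shown in the rollout below), so treat this purely as an illustration:

def count_feedback_letters(feedback: str) -> tuple[int, int]:
    """Count green (G) and yellow (Y) marks in a feedback line such as 'X G Y X X'."""
    marks = feedback.split()
    return marks.count("G"), marks.count("Y")

green_count, yellow_count = count_feedback_letters("X G Y X X")
green_score = green_count / 5.0    # fraction of letters in the correct position
yellow_score = yellow_count / 5.0  # fraction of letters in the word but misplaced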

Rollout Function

The rollout function runs one full Wordle episode, prompting the model for a guess each turn and capturing both environment rewards and auxiliary signals such as letter coverage and repetition penalties.

def rollout_once(
    trainer: GRPOTrainer,
    env: TextArenaEnv,
    tokenizer: AutoTokenizer,
    dataset_prompt: str,
    system_prompt: str,
    max_turns: int,
) -> dict[str, list]:
    result = env.reset()
    observation = result.observation

    prompt_ids: list[int] = []
    completion_ids: list[int] = []
    logprobs: list[float] = []
    raw_rewards: list[float] = []
    green_scores: list[float] = []
    yellow_scores: list[float] = []
    repetition_scores: list[float] = []
    correct_scores: list[float] = []
    guess_counts: dict[str, int] = {}

    for _turn in range(max_turns):
        # when the game is over the environment will return a done=True
        if result.done:
            break

        # set up the prompt for the model
        base_prompt = observation.prompt or dataset_prompt
        user_prompt = make_user_prompt(base_prompt, observation.messages)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
        prompt_text = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
            enable_thinking=False,
        )

        # Generate completion using trainer (works for both colocate and server modes)
        rollout_outputs = generate_rollout_completions(trainer, [prompt_text])[0]
        prompt_ids.extend(rollout_outputs["prompt_ids"])
        completion_ids.extend(rollout_outputs["completion_ids"])
        logprobs.extend(rollout_outputs["logprobs"])
        completion_text = rollout_outputs.get("text") or tokenizer.decode(
            rollout_outputs["completion_ids"], skip_special_tokens=True
        )

        # extract the guess from the completion
        guess = extract_guess(completion_text)

        # step the environment with the guess
        result = env.step(TextArenaAction(message=guess))
        raw_rewards.append(float(result.reward or 0.0))
        observation = result.observation
        correct_score = float(result.reward or 0.0)
        feedback = extract_wordle_feedback(observation)

        # Update guess counts
        previous_occurrences = guess_counts.get(guess, 0)
        repetition_score = scale_repetition_score(previous_occurrences, len(guess_counts))
        guess_counts[guess] = previous_occurrences + 1

        # calculate custom reward signals from the feedback
        if not feedback:
            green_score = 0.0
            yellow_score = 0.0
        else:
            green_count, yellow_count = extract_feedback_counts(feedback)
            green_score = green_count / 5.0
            yellow_score = yellow_count / 5.0

        repetition_scores.append(repetition_score)
        green_scores.append(green_score)
        yellow_scores.append(yellow_score)
        correct_scores.append(correct_score)

    correct_reward_value = correct_scores[-1] if correct_scores else (raw_rewards[-1] if raw_rewards else 0.0)

    return {
        "prompt_ids": prompt_ids,
        "completion_ids": completion_ids,
        "logprobs": logprobs,
        "raw_rewards": raw_rewards,
        "correct_reward": correct_reward_value,
        "green_reward": green_scores[-1] if green_scores else 0.0,
        "yellow_reward": yellow_scores[-1] if yellow_scores else 0.0,
        "repetition_reward": repetition_scores[-1] if repetition_scores else 0.0,
    }

The environment’s own reward signal is based only on completing the game. We found that most models struggle to ever win, so we added a number of custom reward functions to the script that let the model improve incrementally: at first it learns to cover new letters and avoid repeating guesses, and as it improves, it learns to win the game.

Reward Functions

We log four reward streams that encourage the model to solve the puzzle, cover new letters, and avoid repeating guesses:

  • reward_correct: final win/loss signal from the environment.
  • reward_greens: density of green letters in the last feedback.
  • reward_yellows: density of yellow letters in the last feedback.
  • reward_repetition: penalty for guessing the same token multiple times.

from typing import Dict, List, Optional


def reward_correct(completions: List[str], **kwargs: Optional[Dict]) -> List[float]:
    rewards = kwargs.get("correct_reward") if kwargs else None
    return [float(r) for r in rewards] if rewards is not None else [0.0] * len(completions)


def reward_greens(completions: List[str], **kwargs: Optional[Dict]) -> List[float]:
    rewards = kwargs.get("green_reward") if kwargs else None
    return [float(r) for r in rewards] if rewards is not None else [0.0] * len(completions)


def reward_yellows(completions: List[str], **kwargs: Optional[Dict]) -> List[float]:
    rewards = kwargs.get("yellow_reward") if kwargs else None
    return [float(r) for r in rewards] if rewards is not None else [0.0] * len(completions)


def reward_repetition(completions: List[str], **kwargs: Optional[Dict]) -> List[float]:
    rewards = kwargs.get("repetition_reward") if kwargs else None
    return [float(r) for r in rewards] if rewards is not None else [0.0] * len(completions)

Training the Model

The training script wires the custom rollout and rewards into GRPOTrainer. The CLI exposes the configuration used during development as defaults, so you can override endpoints or hyperparameters at launch time.
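The grpo_config object passed to the trainer below is built from those CLI arguments. As a rough sketch of what it might contain (the cli_args attribute names and values here are illustrative placeholders, not the script’s exact defaults):

grpo_config = GRPOConfig(
    use_vllm=True,
    vllm_mode=cli_args.vllm_mode,  # "colocate" or "server"
    num_generations=cli_args.num_generations,
    max_completion_length=cli_args.max_completion_length,  # Wordle guesses only need a few tokens
    per_device_train_batch_size=cli_args.per_device_train_batch_size,
    gradient_accumulation_steps=cli_args.gradient_accumulation_steps,
)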

parser = argparse.ArgumentParser()
# ... add CLI arguments with sensible defaults ...
cli_args = parser.parse_args()

trainer = GRPOTrainer(
    model=cli_args.model_id,
    processing_class=tokenizer,
    reward_funcs=[
        reward_correct,
        reward_greens,
        reward_yellows,
        reward_repetition,
    ],
    train_dataset=dataset,
    args=grpo_config,
    rollout_func=lambda prompts, trainer: rollout_func(
        env=env,
        tokenizer=tokenizer,
        prompts=prompts,
        trainer=trainer,
        cli_args=cli_args,
        system_prompt=system_prompt,
    ),
)
trainer.train()

Running the Advanced Example

You can run the Wordle example in either colocate mode (1 GPU) or server mode (2 GPUs):


Colocate mode (1 GPU, recommended)

python examples/scripts/openenv/wordle.py --vllm-mode colocate

This runs vLLM in the same process as training, requiring only a single GPU.

You can also manually start the TextArena environment in a Docker container before running the training:

# Launch the TextArena environment
docker run -d -p 8001:8001 registry.hf.space/burtenshaw-textarena:latest

Then connect to it using --env-mode docker-local --env-host localhost --env-port 8001.

Results

The resulting model improves its performance on the game, both by reducing the number of repetitions and by increasing the number of correct guesses. However, the Qwen3-1.7B model we trained is not able to consistently win the game. The following reward curve shows the coverage of the model’s guesses and the coverage of correct Y and G letters.

We experimented with larger models like gpt-oss-20b and found that they were able to consistently win the game. However, this requires a lot of compute to train the model. Why not try this out yourself?
