> ## Documentation Index
> Fetch the complete documentation index at: https://mintlify.com/primeintellect-ai/verifiers/llms.txt
> Use this file to discover all available pages before exploring further.

# GymEnv

> Universal runner for Gym-compatible environments

# GymEnv

A universal adapter for running OpenAI Gym-compatible environments with language models.

<Warning>
  GymEnv is experimental and subject to breaking changes. The API may change in future releases.
</Warning>

## Overview

`GymEnv` bridges the gap between Gym's step-based API and Verifiers' message-based rollout system. It:

* Converts Gym observations to text prompts
* Parses model completions into actions
* Manages episode lifecycle (reset/step/done)
* Computes episodic rewards

## Inheritance

```
Environment
└── MultiTurnEnv
    └── GymEnv
```

## Constructor

```python theme={null}
GymEnv(
    env_cls: type[StepResetEnv],
    env_kwargs: dict[str, Any] | None = None,
    action_parser: Callable[[str], Any] | None = None,
    obs_to_text: Callable[[Any], str] | None = None,
    num_train_episodes: int = 1000,
    num_eval_episodes: int = 20,
    max_episode_steps: int | None = None,
    seed: int = 0,
    system_prompt: str | None = None,
    few_shot: list[dict[str, Any]] | None = None,
    parser: vf.Parser | None = None,
    rubric: Rubric | None = None,
    message_type: MessageType = "chat",
)
```

<ParamField path="env_cls" type="type[StepResetEnv]" required>
  Gym environment class with `reset(seed)` and `step(action)` methods.
</ParamField>

<ParamField path="env_kwargs" type="dict[str, Any] | None" default="None">
  Keyword arguments passed to `env_cls()` constructor.
</ParamField>

<ParamField path="action_parser" type="Callable[[str], Any] | None" default="None">
  Function to parse model output into an action. Defaults to identity (string actions).
</ParamField>

<ParamField path="obs_to_text" type="Callable[[Any], str] | None" default="None">
  Function to convert observations to text. Defaults to `str(obs)`.
</ParamField>

<ParamField path="num_train_episodes" type="int" default="1000">
  Number of episodes in training dataset.
</ParamField>

<ParamField path="num_eval_episodes" type="int" default="20">
  Number of episodes in eval dataset.
</ParamField>

<ParamField path="max_episode_steps" type="int | None" default="None">
  Maximum steps per episode. If None, uses 1000.
</ParamField>

<ParamField path="seed" type="int" default="0">
  Random seed for episode generation.
</ParamField>

<ParamField path="system_prompt" type="str | None">
  System prompt explaining the task.
</ParamField>

<ParamField path="rubric" type="Rubric | None">
  Custom rubric for scoring. Defaults to `EpisodicSumRubric()`.
</ParamField>

## StepResetEnv Protocol

Gym environments must implement:

```python theme={null}
class StepResetEnv(Protocol):
    def reset(self, seed: int) -> obs | tuple[obs, dict]:
        """Reset environment with seed."""
        ...
    
    def step(self, action) -> tuple[obs, reward, done, info] | tuple[obs, reward, done, truncated, info]:
        """Take action and return new state."""
        ...
```

Supports both old (4-tuple) and new (5-tuple) Gym APIs.

## Example Usage

### CartPole Example

```python theme={null}
import verifiers as vf
import gymnasium as gym

def load_environment():
    def action_parser(text: str) -> int:
        """Parse '0' or '1' from model output."""
        text = text.strip().lower()
        if '0' in text:
            return 0
        elif '1' in text:
            return 1
        else:
            raise ValueError(f"Invalid action: {text}")
    
    def obs_to_text(obs) -> str:
        """Format CartPole observation."""
        cart_pos, cart_vel, pole_angle, pole_vel = obs
        return f"""CartPole State:
- Cart position: {cart_pos:.3f}
- Cart velocity: {cart_vel:.3f}
- Pole angle: {pole_angle:.3f}
- Pole velocity: {pole_vel:.3f}

Choose action (0=left, 1=right):"""
    
    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "CartPole-v1"},
        action_parser=action_parser,
        obs_to_text=obs_to_text,
        num_train_episodes=100,
        num_eval_episodes=10,
        max_episode_steps=500,
        system_prompt="You are controlling a CartPole. Balance the pole by moving left (0) or right (1).",
    )

# Run evaluation
env = load_environment()
results = await env.evaluate(
    client=vf.ClientConfig(api_key="..."),
    model="gpt-4",
    num_examples=10
)

print(f"Average episode reward: {results['metadata']['avg_reward']}")
```

### Custom Text Game

```python theme={null}
import verifiers as vf

class TextAdventure:
    """Simple text-based game."""
    
    def __init__(self):
        self.location = "start"
        self.inventory = []
        self.steps = 0
    
    def reset(self, seed: int):
        self.location = "start"
        self.inventory = []
        self.steps = 0
        return "You are in a dark room. Exits: north, south", {}
    
    def step(self, action: str):
        self.steps += 1
        
        action = action.lower().strip()
        
        if action == "north":
            self.location = "treasure_room"
            obs = "You found treasure! You win!"
            reward = 1.0
            done = True
        elif action == "south":
            obs = "You fell into a pit. Game over."
            reward = 0.0
            done = True
        else:
            obs = f"Invalid action '{action}'. Try: north, south"
            reward = 0.0
            done = False
        
        truncated = self.steps >= 10
        return obs, reward, done, truncated, {}

def load_environment():
    return vf.GymEnv(
        env_cls=TextAdventure,
        num_train_episodes=50,
        num_eval_episodes=10,
        max_episode_steps=10,
        system_prompt="Navigate the dungeon by typing commands.",
    )
```

### With Custom Parser

```python theme={null}
import verifiers as vf
import gymnasium as gym
import re

def load_environment():
    def action_parser(text: str) -> int:
        """Extract numeric action from verbose output."""
        # Model might say "I choose action 2"
        match = re.search(r'\b([0-3])\b', text)
        if match:
            return int(match.group(1))
        raise ValueError(f"No valid action in: {text}")
    
    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "LunarLander-v2"},
        action_parser=action_parser,
        num_train_episodes=100,
        max_episode_steps=1000,
    )
```

## Built-in Rubric

### EpisodicSumRubric

Default rubric that sums step rewards:

```python theme={null}
class EpisodicSumRubric(Rubric):
    def __init__(self, weight: float = 1.0, **kwargs):
        super().__init__(funcs=[sum_step_rewards], weights=[weight], **kwargs)
```

Accesses `state["trajectory"]` to sum per-step rewards from the environment.

## Key Methods

### gym\_to\_hf

```python theme={null}
def gym_to_hf(self) -> tuple[Dataset, Dataset | None]
```

Generates HuggingFace datasets by running `reset()` on each episode:

```python theme={null}
# Each row:
{
    "question": obs_to_text(initial_obs),
    "answer": str(seed),  # Stored for reproducibility
}
```

### obs\_to\_text

```python theme={null}
def obs_to_text(self, obs: Any) -> str
```

Converts observation to text. Override for custom formatting:

```python theme={null}
class CustomGymEnv(vf.GymEnv):
    def obs_to_text(self, obs):
        # Custom observation rendering
        return f"Custom format: {obs}"
```

### env\_response

```python theme={null}
async def env_response(
    self,
    messages: vf.Messages,
    state: State,
    **kwargs
) -> vf.Messages | str
```

Executes `env.step(action)` and returns observation as user message.

## State Keys

GymEnv adds:

<ParamField path="gym_env" type="StepResetEnv">
  Active Gym environment instance (created per rollout).
</ParamField>

<ParamField path="gym_done" type="bool">
  Whether episode has terminated.
</ParamField>

<ParamField path="trajectory[i]['reward']" type="float">
  Step-level reward from `env.step()`.
</ParamField>

<ParamField path="trajectory[i]['extras']['gym_info']" type="dict">
  Info dict returned by `env.step()`.
</ParamField>

## Error Handling

Action parsing errors:

* Set `state["gym_done"] = True`
* Return error message to model
* Assign 0.0 reward to that step

```python theme={null}
# If action_parser raises:
"Action Parsing Error: Invalid action: 'foo'"
```

## Stop Conditions

Episode ends when:

1. Gym returns `done=True` or `truncated=True`
2. `max_episode_steps` is reached
3. Action parsing fails

## Advanced: Custom Reward

```python theme={null}
import verifiers as vf
import gymnasium as gym

def custom_reward(state: vf.State) -> float:
    """Bonus for efficiency."""
    total_reward = sum(
        step.get("reward", 0.0) for step in state["trajectory"]
    )
    num_steps = len(state["trajectory"])
    efficiency_bonus = max(0, 1.0 - num_steps / 100)
    return total_reward + efficiency_bonus

def load_environment():
    return vf.GymEnv(
        env_cls=gym.make,
        env_kwargs={"id": "CartPole-v1"},
        rubric=vf.Rubric(custom_reward),
    )
```

## Limitations

* **Text-only**: Observation must be convertible to text (no vision support)
* **Synchronous**: Gym envs are not async
* **Single episode per rollout**: Each rollout is one episode

## When to Use

Use GymEnv for:

* Existing Gym environments
* Sequential decision-making tasks
* Reinforcement learning benchmarks
* Text-based games

For tool-based tasks, use [ToolEnv](/api/tool-env) instead.

## See Also

* [MultiTurnEnv](/api/multi-turn-env) - Parent class
* [Rubric](/api/rubric) - Custom reward functions
* [Gymnasium Documentation](https://gymnasium.farama.org/) - Gym API reference
