# Reasoning Gym Integration

> Use Reasoning Gym procedural datasets in Verifiers environments

The `ReasoningGymEnv` integration wraps [Reasoning Gym](https://github.com/reasoning-gym/reasoning-gym) procedural datasets for use in Verifiers environments.

Reasoning Gym provides a collection of procedurally generated reasoning tasks designed to test various cognitive abilities of language models.

## Features

* **Procedural generation** - Effectively unlimited task variations, selected by seed
* **Multiple datasets** - Access to all Reasoning Gym tasks
* **Composite datasets** - Mix multiple tasks with custom weights
* **Automatic scoring** - Uses task-specific scoring from Reasoning Gym
* **XML formatting** - Built-in parser for structured outputs

## Installation

Install with Reasoning Gym support:

```bash theme={null}
uv add 'verifiers[rg]'
```

The `rg` extra installs the `reasoning-gym` package alongside Verifiers.

## Quick Start

<Steps>
  <Step title="Create environment">
    Create a basic Reasoning Gym environment:

    ```python theme={null}
    import verifiers as vf
    from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv

    def load_environment():
        return ReasoningGymEnv(
            gym="arc_1d",
            num_train_examples=1000,
            num_eval_examples=100,
            seed=0,
        )
    ```
  </Step>

  <Step title="Evaluate">
    Run an evaluation:

    ```bash theme={null}
    prime eval run my-rg-env -m openai/gpt-4.1-mini -n 20
    ```
  </Step>
</Steps>

## Available Datasets

Reasoning Gym provides tasks across multiple categories:

### Pattern Recognition

* `arc_1d` - ARC-like 1D pattern completion
* `pattern_induction` - Identify and continue patterns

### Mathematics

* `elementary_algebra` - Solve algebraic equations
* `number_theory` - Number properties and relationships
* `arithmetic` - Basic arithmetic operations

### Logic

* `propositional_logic` - Logical reasoning with propositions
* `spatial_reasoning` - Reason about spatial relationships

### Other

* `zebra_puzzle` - Logic grid puzzles
* `word_problems` - Text-based reasoning

See the [Reasoning Gym repository](https://github.com/reasoning-gym/reasoning-gym) for the full list of datasets and their exact registered names.

## Configuration

### Single Dataset

```python theme={null}
env = ReasoningGymEnv(
    gym="arc_1d",
    num_train_examples=2000,
    num_eval_examples=500,
    seed=0,
)
```

### Multiple Datasets

Combine multiple datasets with equal weights:

```python theme={null}
env = ReasoningGymEnv(
    gym=["arc_1d", "elementary_algebra", "propositional_logic"],
    num_train_examples=3000,  # about 1000 per dataset on average
    num_eval_examples=300,    # about 100 per dataset on average
    seed=0,
)
```

### Weighted Composite

Mix datasets with custom weights:

```python theme={null}
env = ReasoningGymEnv(
    gym=[
        {"name": "arc_1d", "weight": 0.5, "config": {}},
        {"name": "elementary_algebra", "weight": 0.3, "config": {}},
        {"name": "propositional_logic", "weight": 0.2, "config": {}},
    ],
    num_train_examples=1000,
    seed=0,
)
```
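
To get a feel for what the weights mean, the sketch below draws task names in proportion to their weights using only the standard library. It illustrates weighted mixing in general and is not Reasoning Gym's actual sampling code; `sample_mix` is a hypothetical helper.

```python
import random
from collections import Counter

def sample_mix(weights: dict[str, float], n: int, seed: int = 0) -> Counter:
    """Draw n task names with probability proportional to each weight."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    names = list(weights)
    draws = rng.choices(names, weights=[weights[k] for k in names], k=n)
    return Counter(draws)

weights = {"arc_1d": 0.5, "elementary_algebra": 0.3, "propositional_logic": 0.2}
counts = sample_mix(weights, n=1000, seed=0)

# Expected counts are weight * n; actual counts fluctuate around them.
for name, weight in weights.items():
    print(f"{name}: expected ~{weight * 1000:.0f}, drawn {counts[name]}")
```

Under this kind of sampling the per-task counts are approximate rather than exact, which matters most for small evaluation sets.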

### Custom Parser

By default, `ReasoningGymEnv` uses `XMLParser` with `<think>` and `<answer>` fields. Override with a custom parser:

```python theme={null}
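import verifiers as vf
from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv

# The field named by `answer_field` is the one passed to the task scorer.
custom_parser = vf.XMLParser(
    fields=["reasoning", "solution"],
    answer_field="solution",
)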

env = ReasoningGymEnv(
    gym="arc_1d",
    parser=custom_parser,
    num_train_examples=1000,
)
```

### Custom System Prompt

```python theme={null}
from reasoning_gym.utils import SYSTEM_PROMPTS

env = ReasoningGymEnv(
    gym="arc_1d",
    system_prompt=SYSTEM_PROMPTS["chain_of_thought"],
    num_train_examples=1000,
)
```

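Available system prompts (the keys may differ between `reasoning-gym` versions, so inspect `SYSTEM_PROMPTS.keys()` to confirm):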

* `"default"` - Basic reasoning prompt
* `"chain_of_thought"` - Encourage step-by-step reasoning
* `"concise"` - Encourage brief responses

## Scoring

Reasoning Gym tasks have built-in scoring functions. `ReasoningGymEnv` automatically:

1. Parses the model's answer field
2. Calls the task-specific `score_answer()` function
3. Returns a score (typically 0.0 or 1.0)

The format reward (XML compliance) is tracked separately with weight 0, so it is reported as a metric but does not change the final reward.
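
The weighting described above can be sketched as a weighted sum in which the format term has weight 0. The function and key names here are hypothetical and only illustrate the arithmetic; they are not the Verifiers rubric API.

```python
def combine_rewards(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-function scores; weight-0 terms are tracked but do not count."""
    return sum(weights[name] * scores[name] for name in scores)

# Assumed weights: the task score counts fully, the format score not at all.
weights = {"task_score": 1.0, "format_score": 0.0}

well_formatted_wrong = {"task_score": 0.0, "format_score": 1.0}
loosely_formatted_right = {"task_score": 1.0, "format_score": 0.2}

print(combine_rewards(well_formatted_wrong, weights))     # 0.0
print(combine_rewards(loosely_formatted_right, weights))  # 1.0
```

With this weighting, a perfectly formatted but incorrect response still earns no reward.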

## Full Example

```python theme={null}
import verifiers as vf
from verifiers.envs.integrations.reasoninggym_env import ReasoningGymEnv

def load_environment(
    gym: str | list = "arc_1d",
    num_train_examples: int = 2000,
    num_eval_examples: int = 500,
    seed: int = 0,
) -> vf.Environment:
    """Load a Reasoning Gym environment.
    
    Args:
        gym: Dataset name, list of names, or composite spec
        num_train_examples: Number of training examples
        num_eval_examples: Number of eval examples
        seed: Random seed for generation
    """
    return ReasoningGymEnv(
        gym=gym,
        num_train_examples=num_train_examples,
        num_eval_examples=num_eval_examples,
        seed=seed,
    )
```

With composite datasets:

```python theme={null}
def load_environment():
    return ReasoningGymEnv(
        gym=[
            {"name": "arc_1d", "weight": 0.4},
            {"name": "elementary_algebra", "weight": 0.3},
            {"name": "propositional_logic", "weight": 0.2},
            {"name": "zebra_puzzle", "weight": 0.1},
        ],
        num_train_examples=2000,
        num_eval_examples=400,
        seed=0,
    )
```

## Expected Format

Models should respond with XML-formatted answers:

```xml theme={null}
<think>
Let me analyze the pattern:
- First element: 1
- Second element: 2
- Third element: 4

This appears to be powers of 2.
</think>

<answer>
8
</answer>
```

The `<answer>` field is extracted and passed to the task scorer.

## Metrics

| Metric          | Meaning                          |
| --------------- | -------------------------------- |
| `reward`        | Task-specific score (typically 0.0 or 1.0) |
| `format_reward` | XML format compliance (weight 0) |

## Best Practices

<Note>
  Start with a single dataset to understand task difficulty before mixing multiple datasets.
</Note>

* **Validate baseline** - Test with a strong model first to ensure tasks are solvable
* **Match difficulty** - Mix tasks of similar difficulty for stable training
* **Use composite carefully** - Large differences in task difficulty can hurt training
* **Set appropriate seeds** - Different seeds generate different task variations
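
The role of the seed can be illustrated with a toy procedural generator. This is not a Reasoning Gym dataset; `make_tasks` is a hypothetical function that only shows the general idea that a seed fixes the generated tasks.

```python
import random

def make_tasks(seed: int, n: int = 3) -> list[dict[str, str]]:
    """Toy procedural generator: addition problems derived from a seed."""
    rng = random.Random(seed)
    tasks = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        tasks.append({"question": f"What is {a} + {b}?", "answer": str(a + b)})
    return tasks

# The same seed reproduces the same tasks exactly.
assert make_tasks(seed=0) == make_tasks(seed=0)

# A different seed gives a different, equally reproducible set.
print(make_tasks(seed=0)[0]["question"])
print(make_tasks(seed=1)[0]["question"])
```

Keeping the seed fixed across runs makes evaluation results comparable; changing it gives fresh task instances.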

## Comparison with Raw Reasoning Gym

**Using Reasoning Gym directly:**

```python theme={null}
import reasoning_gym as rg

dataset = rg.create_dataset("arc_1d", size=100, seed=0)
for entry in dataset:
    question = entry["question"]
    # Placeholder: replace with your model's parsed answer to `question`.
    answer = entry["answer"]
    score = dataset.score_answer(answer=answer, entry=entry)
```

**Using ReasoningGymEnv:**

```python theme={null}
env = ReasoningGymEnv(gym="arc_1d", num_train_examples=100, seed=0)
# Handles dataset creation, parsing, scoring automatically
```

## Examples

See the [reasoning-gym-env](https://github.com/PrimeIntellect-ai/verifiers/tree/main/environments/reasoning_gym_env) example in the Verifiers repository.

## Further Reading

* [Reasoning Gym Repository](https://github.com/reasoning-gym/reasoning-gym)
* [Reasoning Gym Paper](https://arxiv.org/abs/2505.24760)
