
# RL Training Guide

> Train models with your Verifiers environments using RL

Verifiers environments are designed for reinforcement learning training. This guide covers training with Hosted Training (recommended), the open-source `prime-rl` trainer, and prompt optimization with GEPA.

## Training Options

Three primary approaches:

| Method              | Best For                                   | Infrastructure             |
| ------------------- | ------------------------------------------ | -------------------------- |
| **Hosted Training** | Production training, no GPU management     | Managed by Prime Intellect |
| **prime-rl**        | Self-hosted, large-scale training          | Your GPU cluster           |
| **GEPA**            | Prompt optimization (no gradient training) | CPU/single GPU             |

## Hosted Training

Hosted Training provides fully managed RL training infrastructure. You provide an environment and a config, and we handle the rest.

### Getting Started

<Steps>
  ### Setup workspace

  ```bash theme={null}
  prime lab setup
  ```

  This downloads example configs to `configs/rl/`.

  ### Choose a base config

  Example configs:

  * `gsm8k.toml` - Math reasoning
  * `math-python.toml` - Code-based math
  * `wordle.toml` - Game playing
  * `wiki-search.toml` - Tool use

  ### Configure your training run

  Edit or create a config:

  ```toml configs/rl/my-training.toml theme={null}
  model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
  max_steps = 500
  batch_size = 256
  rollouts_per_example = 8

  [sampling]
  max_tokens = 512

  [[env]]
  id = "my-environment"
  args = { difficulty = "medium" }

  [wandb]
  project = "my-project"
  name = "my-run"
  ```

  ### Submit training job

  Submit via the Prime Lab UI or CLI:

  ```bash theme={null}
  prime train submit configs/rl/my-training.toml
  ```
</Steps>

### Configuration Reference

```toml theme={null}
# Model and training
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"  # Base model
max_steps = 500                             # Training steps
batch_size = 256                            # Samples per gradient update
rollouts_per_example = 8                    # Rollouts per example for advantage estimation

# Optional: environment variables for API keys
# (top-level keys must appear before the first table header in TOML)
env_file = ["secrets.env"]

# Sampling parameters
[sampling]
max_tokens = 512
temperature = 1.0

# Environment configuration
[[env]]
id = "primeintellect/alphabet-sort"
args = { min_turns = 3, max_turns = 5 }

# W&B logging
[wandb]
project = "alphabet-sort"
name = "qwen3-30b-alphabet-sort"
```
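
As a rough illustration of how `batch_size` and `rollouts_per_example` interact, the sketch below assumes that `batch_size` counts individual rollouts, so each step draws `batch_size / rollouts_per_example` distinct examples. That reading is an assumption based on the comments in the config above, and `examples_per_step` is a hypothetical helper, not part of any Prime Intellect tool.

```python
def examples_per_step(batch_size: int, rollouts_per_example: int) -> int:
    """Distinct examples sampled per step, assuming batch_size counts rollouts."""
    if rollouts_per_example <= 0:
        raise ValueError("rollouts_per_example must be positive")
    if batch_size % rollouts_per_example != 0:
        raise ValueError("batch_size should be a multiple of rollouts_per_example")
    return batch_size // rollouts_per_example


# Values taken from the reference config above.
n_examples = examples_per_step(batch_size=256, rollouts_per_example=8)
total_rollouts = 256 * 500  # batch_size * max_steps
print(n_examples, total_rollouts)
```

Under this assumption, halving `rollouts_per_example` doubles the number of distinct examples seen per step while leaving the total rollout count unchanged.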

### Supported Models

Hosted Training currently supports:

* `Qwen/Qwen3-4B-Instruct-2507`
* `Qwen/Qwen3-4B-Thinking-2507`
* `Qwen/Qwen3-30B-A3B-Instruct-2507`
* `Qwen/Qwen3-30B-A3B-Thinking-2507`
* `Qwen/Qwen3-235B-A22B-Instruct-2507`
* `Qwen/Qwen3-235B-A22B-Thinking-2507`
* `PrimeIntellect/INTELLECT-3`

<Info>
  Hosted Training is currently in Private Beta. [Request access](https://form.typeform.com/to/iYn9UliG).
</Info>

### Environment Variables

For environments requiring API keys (e.g., judge models):

1. Create a secrets file:

```bash secrets.env theme={null}
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```

2. Reference in config:

```toml theme={null}
env_file = ["secrets.env"]
```

3. Or set via Lab UI when submitting the job
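
Before submitting, it can help to confirm that the secrets file defines every key the environment needs. The sketch below is a minimal `KEY=VALUE` parser written for illustration; it is not the loader Hosted Training uses, and `parse_env_text` and `missing_keys` are hypothetical helper names.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [k for k in required if not env.get(k)]


# Placeholder contents; in practice, read the text from secrets.env.
sample = "OPENAI_API_KEY=sk-example\n# a comment\nANTHROPIC_API_KEY=\n"
env = parse_env_text(sample)
print(missing_keys(env, ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]))
```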

## prime-rl: Self-Hosted Training

`prime-rl` is our production-ready async RL trainer for self-managed GPU infrastructure.

### Setup

<Steps>
  ### Clone and install

  ```bash theme={null}
  prime lab setup --prime-rl
  ```

  This:

  * Clones the `prime-rl` repository
  * Installs dependencies
  * Sets up example configs in `configs/prime-rl/`

  ### Configure training

  Example config:

  ```toml configs/prime-rl/my-training.toml theme={null}
  # Model and infrastructure
  model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
  num_gpus = 8
  tensor_parallel_size = 4

  # Training parameters
  max_steps = 1000
  batch_size = 512
  rollouts_per_example = 16
  learning_rate = 1e-5

  # Environment
  [env]
  id = "wiki-search"
  args = { max_turns = 10, judge_model = "gpt-4.1-mini" }

  # Sampling
  [sampling]
  max_tokens = 1024
  temperature = 1.0

  # W&B
  [wandb]
  project = "wiki-search"
  name = "qwen3-30b-wiki"
  ```

  ### Launch training

  ```bash theme={null}
  uv run prime-rl configs/prime-rl/my-training.toml
  ```

  This launches a tmux session with:

  * Inference server (vLLM)
  * Orchestrator (coordinates training)
  * Trainer (updates model weights)
</Steps>

### Key Features

* **Async rollout generation**: Non-blocking inference for maximum throughput
* **Continuous batching**: Efficient GPU utilization
* **In-flight weight updates**: Models update during rollout generation
* **Online difficulty filtering**: Focus on appropriately challenging examples
* **LoRA support**: Efficient fine-tuning for large models
* **MoE support**: Mixture-of-Experts architectures

### Configuration Options

```toml theme={null}
# Core training
max_steps = 1000
batch_size = 512
rollouts_per_example = 16
learning_rate = 1e-5

# LoRA (optional)
use_lora = true
lora_rank = 64
lora_alpha = 128

# Difficulty filtering (optional)
use_difficulty_filtering = true
difficulty_threshold = 0.3  # Min reward variance required

# Infrastructure (top-level keys must appear before the first table header in TOML)
num_gpus = 8
tensor_parallel_size = 4
pipeline_parallel_size = 1

# Sampling
[sampling]
max_tokens = 1024
temperature = 1.0
top_p = 0.9
```
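
For a sense of what the LoRA settings imply, the sketch below uses the standard LoRA formulation, in which a rank-`r` adapter on a `d_out × d_in` weight adds `r * (d_in + d_out)` trainable parameters and scales its update by `lora_alpha / lora_rank`. Whether `prime-rl` applies exactly this scaling is an assumption, and the 4096-wide layer is a made-up example.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / rank


full = 4096 * 4096                               # hypothetical square projection
extra = lora_extra_params(4096, 4096, rank=64)   # lora_rank from the config above
scale = lora_scaling(128, 64)                    # lora_alpha / lora_rank
print(extra, extra / full, scale)
```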

For full documentation: [prime-rl docs](https://docs.primeintellect.ai/prime-rl)

## GEPA: Prompt Optimization

GEPA (Genetic-Pareto) optimizes system prompts without gradient-based training, using a teacher LLM to iteratively improve prompts based on evaluation results.

### Basic Usage

```bash theme={null}
prime gepa run my-env --model google/gemini-3-flash-preview
```

This:

1. Runs initial evaluation with current prompt
2. Uses teacher model to propose improvements
3. Evaluates new prompts
4. Selects best prompts (Pareto frontier)
5. Repeats until budget exhausted
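
The loop above can be sketched in a few lines. This is a toy illustration of the propose, evaluate, select cycle, not GEPA's implementation: `evaluate` and `propose` are stand-ins for real rollouts and the teacher model, and the frontier rule (keep any prompt that is best on at least one example) is a simplified reading of step 4.

```python
import random


def evaluate(prompt: str, examples: list[str]) -> list[float]:
    """Stand-in scorer: 1.0 when the prompt mentions the example's keyword."""
    return [1.0 if ex in prompt else 0.0 for ex in examples]


def propose(prompt: str, examples: list[str], rng: random.Random) -> str:
    """Stand-in for the teacher model: appends a hint about one example."""
    return f"{prompt} {rng.choice(examples)}"


def pareto_frontier(scores: dict[str, list[float]]) -> set[str]:
    """Keep every prompt that achieves the best score on at least one example."""
    n = len(next(iter(scores.values())))
    keep: set[str] = set()
    for i in range(n):
        best = max(s[i] for s in scores.values())
        keep |= {p for p, s in scores.items() if s[i] == best}
    return keep


def optimize(seed_prompt: str, examples: list[str], max_calls: int, seed: int = 0) -> str:
    rng = random.Random(seed)
    scores = {seed_prompt: evaluate(seed_prompt, examples)}   # step 1
    calls = len(examples)
    while calls + len(examples) <= max_calls:                 # step 5: budget check
        parent = rng.choice(sorted(pareto_frontier(scores)))  # step 4: select
        child = propose(parent, examples, rng)                # step 2: propose
        scores[child] = evaluate(child, examples)             # step 3: evaluate
        calls += len(examples)
    # Return the prompt with the highest total score.
    return max(scores, key=lambda p: sum(scores[p]))


examples = ["alpha", "beta", "gamma"]
best = optimize("Solve the task.", examples, max_calls=60)
print(best)
```

Keeping prompts that win on any single example, rather than only the best on average, preserves candidates with complementary strengths for later rounds.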

### Configuration

| Flag                        | Description                         | Default           |
| --------------------------- | ----------------------------------- | ----------------- |
| `--model` / `-m`            | Model for rollouts                  | Required          |
| `--reflection-model` / `-M` | Teacher model for prompt refinement | Same as `--model` |
| `--max-calls` / `-B`        | Evaluation budget                   | 500               |
| `--num-train` / `-n`        | Training examples                   | 100               |
| `--num-val` / `-N`          | Validation examples                 | 50                |
| `--minibatch-size`          | Examples per reflection             | 3                 |
| `--perfect-score`           | Max reward (skip if achieved)       | None              |
| `--state-columns`           | Extra state fields for reflection   | None              |

### Example Workflow

<Steps>
  ### Run optimization

  ```bash theme={null}
  prime gepa run wordle \
    --model google/gemini-3-flash-preview \
    --reflection-model google/gemini-3-exp-ultra-preview \
    --max-calls 1000 \
    --num-train 200 \
    --num-val 100
  ```

  ### Check output

  Results saved to `environments/wordle/outputs/gepa/`:

  * `best_prompt.txt` - Optimized system prompt
  * `pareto_frontier.jsonl` - Best prompts per validation example
  * `metadata.json` - Run configuration and summary

  ### Use the optimized prompt

  Copy the best prompt to your environment:

  ```python theme={null}
  import verifiers as vf

  DEFAULT_SYSTEM_PROMPT = """<content from best_prompt.txt>"""

  def load_environment(
      system_prompt: str = DEFAULT_SYSTEM_PROMPT,
      **kwargs
  ):
      # `dataset` and `rubric` are built as in your existing environment code
      return vf.SingleTurnEnv(
          dataset=dataset,
          system_prompt=system_prompt,
          rubric=rubric,
      )
  ```

  ### Verify improvement

  ```bash theme={null}
  # Before optimization
  prime eval run wordle -m google/gemini-3-flash-preview -n 100

  # After optimization
  prime eval run wordle \
    -m google/gemini-3-flash-preview \
    -n 100 \
    -a '{"system_prompt": "<optimized prompt>"}'
  ```
</Steps>
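
Optimized prompts often contain quotes and newlines, which makes hand-writing the JSON for `-a` error-prone. One option is to build the argument with `json.dumps` and shell-quote it; the sketch below uses an invented prompt, and `build_env_args` is a hypothetical helper.

```python
import json
import shlex


def build_env_args(system_prompt: str) -> str:
    """Return a shell-quoted JSON string suitable for passing after -a."""
    payload = json.dumps({"system_prompt": system_prompt})
    return shlex.quote(payload)


# In practice, read the prompt text from best_prompt.txt in the GEPA output directory.
prompt = 'Think step by step.\nAnswer with "GUESS: <word>".'
print(build_env_args(prompt))
```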

### GEPA Configuration Files

Use TOML configs for reproducible optimization:

```toml configs/gepa/my-optimization.toml theme={null}
env_id = "my-env"
model = "google/gemini-3-flash-preview"
reflection_model = "google/gemini-3-exp-ultra-preview"
max_calls = 1000
num_train = 200
num_val = 100
minibatch_size = 5
perfect_score = 1.0
state_columns = ["parsed_answer", "tool_calls"]
```

Run:

```bash theme={null}
prime gepa run configs/gepa/my-optimization.toml
```

## RL Best Practices

### Before Training

<Steps>
  ### Evaluate baseline performance

  Run evaluation to establish baseline:

  ```bash theme={null}
  prime eval run my-env -m base-model -n 100 -r 5
  ```

  Target baselines:

  * **Too easy**: >80% success → task may be too simple
  * **Good range**: 10-70% success → ideal for RL
  * **Too hard**: Less than 5% success → a stronger base model may be needed

  ### Check reward diversity

  Ensure varied rewards within groups:

  ```bash theme={null}
  prime eval run my-env -m base-model -n 20 -r 8 -s
  ```

  Analyze results:

  ```python theme={null}
  import json
  import numpy as np

  with open("results.jsonl") as f:
      # Each line is one JSON object, so parse with json.loads (not json.load)
      rollouts = [json.loads(line) for line in f if line.strip()]

  # Group by example
  examples = {}
  for r in rollouts:
      ex = r["example_id"]
      if ex not in examples:
          examples[ex] = []
      examples[ex].append(r["reward"])

  # Check variance
  for ex, rewards in examples.items():
      print(f"Example {ex}: std={np.std(rewards):.3f}, rewards={rewards}")
  ```

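  If rewards barely vary within a group, the rollouts for that example give little signal for computing advantages; consider adjusting the reward function or the task difficulty.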

  ### Verify environment correctness

  ```bash theme={null}
  prime eval run my-env -m gpt-4.1-mini -n 5 -v
  ```

  Manually inspect:

  * Reward functions give expected scores
  * Stop conditions trigger correctly
  * Tool calls execute properly
  * Error handling works
</Steps>
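
To make the reward-diversity check in the steps above concrete: group-based RL methods typically score each rollout relative to the other rollouts for the same example, so a group whose rewards are all equal contributes no learning signal. The sketch below assumes a simple mean-centered advantage, which may differ from what a given trainer computes, and the reward values are invented.

```python
from statistics import fmean


def group_advantages(rewards: list[float]) -> list[float]:
    """Mean-centered advantages for one example's rollouts (assumed formulation)."""
    mean = fmean(rewards)
    return [r - mean for r in rewards]


def zero_signal_fraction(groups: dict[str, list[float]]) -> float:
    """Fraction of examples whose rollouts all received the same reward."""
    flat = [max(r) == min(r) for r in groups.values()]
    return sum(flat) / len(flat)


groups = {
    "ex0": [0.0, 0.0, 0.0, 0.0],  # always fails: every advantage is zero
    "ex1": [1.0, 1.0, 1.0, 1.0],  # always succeeds: every advantage is zero
    "ex2": [0.0, 1.0, 0.0, 1.0],  # mixed outcomes: nonzero advantages
}
print(zero_signal_fraction(groups))
print(group_advantages(groups["ex2"]))
```

A high zero-signal fraction suggests the task is too easy or too hard for the base model, or that the reward is too coarse.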

### Training Hyperparameters

#### For More Aggressive Training

⚠️ Higher risk of instability/collapse:

* Increase learning rate: `1e-5` → `1e-4` (LoRA), `1e-6` → `1e-5` (full)
* Decrease `rollouts_per_example`: `16` → `8`
* Decrease `batch_size`: `512` → `256`

#### For More Stable Training

✅ Slower progress but safer:

* Increase `rollouts_per_example`: `8` → `16` or `32`
* Increase `batch_size`: `256` → `512` or `1024`
* Use larger models: `4B` → `30B` or `235B`
* Enable difficulty filtering (prime-rl)

### During Training

**Monitor W\&B metrics**:

* `reward/mean` - Should increase steadily
* `reward/std` - Should remain stable (not collapse to 0)
* `policy/entropy` - Should decrease but not collapse
* `policy/kl` - Should stay within bounds

**Watch for instability**:

* Sudden reward drops
* Loss divergence
* Degenerate outputs (repetition, incoherence)
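
A lightweight check for the first symptom, sudden reward drops, compares the latest mean reward with a trailing average. The window size and drop threshold below are arbitrary illustrative values, not recommended settings, and `sudden_drop` is a hypothetical helper.

```python
from statistics import fmean


def sudden_drop(history: list[float], window: int = 5, max_drop: float = 0.2) -> bool:
    """True if the latest value is more than `max_drop` below the trailing mean."""
    if len(history) <= window:
        return False  # not enough data to judge
    baseline = fmean(history[-window - 1:-1])  # the `window` values before the latest
    return baseline - history[-1] > max_drop


steady = [0.30, 0.32, 0.35, 0.36, 0.38, 0.40, 0.41]
crashed = steady + [0.10]
print(sudden_drop(steady), sudden_drop(crashed))
```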

**Checkpoint frequently**:

```toml theme={null}
[checkpointing]
save_every_n_steps = 50
keep_n_checkpoints = 10
```

### Common Issues

#### OOM During Generation

* Reduce `rollouts_per_example`
* Reduce `batch_size`
* Use LoRA instead of full finetuning
* Increase `tensor_parallel_size`

#### Training Instability

* Decrease learning rate
* Increase `rollouts_per_example` (better advantage estimates)
* Increase `batch_size` (more stable gradients)
* Enable gradient clipping
* Use reward clipping/normalization

#### Slow Training

* Increase learning rate (if stable)
* Use continuous rewards instead of binary
* Enable online difficulty filtering
* Use appropriate task difficulty
* Check GPU utilization

#### Model Collapse

Symptoms: All outputs become identical, entropy → 0

Fixes:

* Restart from earlier checkpoint
* Decrease learning rate
* Increase KL penalty
* Increase entropy bonus
* Increase rollout diversity (temperature, top\_p)

## Advanced Topics

### Multi-Task Training

Train on multiple environments:

```toml theme={null}
[[env]]
id = "gsm8k"
weight = 1.0

[[env]]
id = "math-python"
weight = 1.0

[[env]]
id = "reverse-text"
weight = 0.5
```
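
One plausible reading of `weight` is a relative sampling proportion across environments. Under that assumption (the trainer's actual semantics may differ), the weights above normalize as shown below; `mixture_proportions` is a hypothetical helper.

```python
def mixture_proportions(weights: dict[str, float]) -> dict[str, float]:
    """Normalize relative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


weights = {"gsm8k": 1.0, "math-python": 1.0, "reverse-text": 0.5}
shares = mixture_proportions(weights)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```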

Or use `EnvGroup` in your environment:

```python theme={null}
import verifiers as vf

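def load_environment():
    # load_math_env, load_code_env, and load_reasoning_env are placeholders
    # for your own environment loaders
    math_env = load_math_env()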
    code_env = load_code_env()
    reasoning_env = load_reasoning_env()
    
    return vf.EnvGroup(
        envs=[math_env, code_env, reasoning_env],
        env_names=["math", "code", "reasoning"],
    )
```

### Curriculum Learning

Progressively increase difficulty:

```python theme={null}
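import verifiers as vf

# easy_dataset, medium_dataset, hard_dataset, and rubric are placeholders
# defined elsewhere in your environment module
def load_environment(difficulty_level: int = 1):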
    if difficulty_level == 1:
        dataset = easy_dataset
    elif difficulty_level == 2:
        dataset = medium_dataset
    else:
        dataset = hard_dataset
    
    return vf.SingleTurnEnv(dataset=dataset, rubric=rubric)
```

Update config between training runs:

```bash theme={null}
# Start easy
prime train submit configs/rl/my-env-level1.toml

# Progress to medium
prime train submit configs/rl/my-env-level2.toml

# Final hard
prime train submit configs/rl/my-env-level3.toml
```
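
When to move between levels is left open above. One simple heuristic, borrowing the "too easy" threshold from the baseline guidance earlier in this guide, is to advance once the evaluated success rate exceeds 80%. The function below is a sketch of that heuristic, not a trainer feature; `next_level` and its defaults are hypothetical.

```python
def next_level(current_level: int, success_rate: float,
               advance_above: float = 0.8, max_level: int = 3) -> int:
    """Advance one level once the success rate exceeds the threshold."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if success_rate > advance_above and current_level < max_level:
        return current_level + 1
    return current_level


print(next_level(1, 0.85), next_level(2, 0.55), next_level(3, 0.95))
```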

### Continuous Rewards

Prefer continuous over binary rewards:

```python theme={null}
from difflib import SequenceMatcher

# `parser` and `exact_match` are assumed to be defined elsewhere in your environment.

# Binary (less informative)
async def binary_reward(completion, answer) -> float:
    return 1.0 if exact_match(completion, answer) else 0.0

# Continuous (more informative)
async def continuous_reward(completion, answer) -> float:
    # parse_answer may return None when nothing can be parsed
    response = parser.parse_answer(completion) or ""
    return SequenceMatcher(None, response, answer).ratio()
```

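Continuous rewards distinguish near-misses from complete failures, so rollouts for the same example are less likely to receive identical rewards. This gives the trainer a more informative learning signal.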
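
The difference is easiest to see on near-miss answers. The sketch below scores plain strings with an exact-match reward and with `difflib.SequenceMatcher`, skipping the parser step; the example strings are invented.

```python
from difflib import SequenceMatcher


def binary_score(response: str, answer: str) -> float:
    """1.0 only for an exact match."""
    return 1.0 if response == answer else 0.0


def continuous_score(response: str, answer: str) -> float:
    """Similarity ratio in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, response, answer).ratio()


answer = "apple banana cherry"
responses = ["apple banana cherry", "apple banana cherri", "zebra"]
for r in responses:
    print(f"{r!r}: binary={binary_score(r, answer)}, "
          f"continuous={continuous_score(r, answer):.2f}")
```

The binary reward treats the one-character typo and the unrelated answer identically, while the similarity ratio ranks the typo far above the unrelated answer.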

### Chat Template Issues

<Warning>
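  **Non-Increasing Chat Templates**: Some models (Qwen3, DeepSeek-R1) use chat templates that strip `<think>` sections from earlier turns in multi-turn conversations. The tokenized context at a later turn then no longer extends the context from the previous turn, which violates the increasing-context requirement for RL.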

  Use modified versions with fixed templates: [Modified Models](https://huggingface.co/collections/willcb/qwen3-68434f4883925bfdb4570ee5)
</Warning>
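
The requirement can be checked mechanically: tokenize the conversation after each turn and confirm that every earlier token sequence is a prefix of the next one. The sketch below works on plain lists of token IDs so that it runs without a tokenizer; the IDs are made up, and in practice they would come from applying the model's chat template to the conversation at each turn.

```python
from typing import Optional


def is_prefix(shorter: list[int], longer: list[int]) -> bool:
    """True if `longer` starts with every token of `shorter`."""
    return longer[:len(shorter)] == shorter


def first_violation(turn_tokens: list[list[int]]) -> Optional[int]:
    """Index of the first turn that does not extend the previous one, else None."""
    for i in range(1, len(turn_tokens)):
        if not is_prefix(turn_tokens[i - 1], turn_tokens[i]):
            return i
    return None


# Context grows monotonically: each turn extends the previous one.
increasing = [[1, 2, 3], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6]]
# Tokens 9, 9 (standing in for a <think> block) vanish at the next turn.
stripped = [[1, 2, 9, 9, 3], [1, 2, 3, 4, 5]]
print(first_violation(increasing), first_violation(stripped))
```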

## Other Trainers

Verifiers environments work with multiple training frameworks:

### Tinker

[Tinker](https://thinkingmachines.ai/tinker/) supports Verifiers via recipes:

```bash theme={null}
git clone https://github.com/thinking-machines-lab/tinker-cookbook
cd tinker-cookbook/recipes/verifiers_rl
# Follow setup instructions
```

### SkyRL

[SkyRL](https://github.com/NovaSky-AI/SkyRL) integrates Verifiers:

```bash theme={null}
git clone https://github.com/NovaSky-AI/SkyRL
cd SkyRL/skyrl-train/integrations/verifiers
# Follow setup instructions
```

### rLLM

[rLLM](https://github.com/rllm-project/rllm) supports both verl and Tinker backends:

```bash theme={null}
pip install rllm
# See documentation: https://rllm-project.readthedocs.io/en/latest/examples/verifiers/
```

## Next Steps

* **Evaluation**: Monitor training progress with evaluations → [Evaluation Guide](/guides/evaluation)
* **Environment improvements**: Iterate on reward functions and task design
* **Scaling**: Move from small experiments to full training runs
* **Model selection**: Experiment with different base models