# Prime-RL Integration

> Train models with the prime-rl framework using Verifiers environments

The [`prime-rl`](https://github.com/PrimeIntellect-ai/prime-rl) trainer is a production-ready async RL training framework that supports large-scale multi-node training, agentic rollouts with Verifiers environments, Mixture-of-Experts (MoE) models, LoRA adapters, and training algorithms including SFT and online distillation.

We recommend using `prime-rl` for training with Verifiers environments on self-managed GPU infrastructure.

## Features

The default configuration distills best practices from our research team's experience and the broader community into a stable, easy-to-use recipe:

* **Async rollout generation** with continuous batching
* **Online difficulty filtering** to ensure training diversity
* **In-flight weight updates** for faster convergence
* **Importance sampling** and logprob clipping for stability
* **Multi-node training** with distributed data parallelism
* **LoRA and full finetuning** support
* **MoE model support** for efficient scaling
* **SFT and online distillation** in addition to RL

## Setup

<Steps>
  <Step title="Install prime-rl">
    Set up your workspace for training with `prime-rl`:

    ```bash
    prime lab setup --prime-rl
    ```

    This will:

    * Clone and install the `prime-rl` trainer and its dependencies
    * Set up a default TOML config for training
    * Configure the included `wiki-search` environment for 8 GPUs
  </Step>

  <Step title="Configure your training">
    Edit the generated config file at `configs/prime-rl/wiki-search.toml`:

    ```toml
    model = "Qwen/Qwen3-4B-Instruct-2507"
    max_steps = 500
    batch_size = 256
    rollouts_per_example = 8

    [sampling]
    max_tokens = 512

    [[env]]
    id = "primeintellect/wiki-search"
    args = { max_turns = 5 }

    [wandb]
    project = "wiki-search"
    name = "qwen3-4b-wiki-search"
    ```

    Key parameters:

    * `model` - Model to train (a Hugging Face model ID or a local path)
    * `max_steps` - Number of training steps
    * `batch_size` - Rollouts per training batch
    * `rollouts_per_example` - Number of rollouts sampled per dataset example, used for advantage estimation
    * `env.id` - Environment to train on (local or from Environments Hub)
    * `env.args` - Environment-specific arguments passed to `load_environment()`
  </Step>

  <Step title="Start training">
    Launch training with:

    ```bash
    uv run prime-rl configs/prime-rl/wiki-search.toml
    ```

    This launches a tmux session with separate panes for:

    * **Trainer** - Handles gradient updates and optimization
    * **Orchestrator** - Manages rollout generation and batching
    * **Inference server** - Serves the model for rollout generation
  </Step>
</Steps>
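
To make the relationship between `batch_size` and `rollouts_per_example` concrete, here is a minimal sketch. It assumes that `batch_size` counts rollouts (as described above) and that advantages are computed against the mean reward of each example's rollout group; that baseline is a common choice, not a confirmed description of what `prime-rl` does internally. The function names are hypothetical.

```python
from statistics import mean


def examples_per_batch(batch_size: int, rollouts_per_example: int) -> int:
    """Number of distinct dataset examples covered by one training batch."""
    # Assumption: batch_size is expected to be a multiple of rollouts_per_example.
    if batch_size % rollouts_per_example != 0:
        raise ValueError("batch_size should be a multiple of rollouts_per_example")
    return batch_size // rollouts_per_example


def group_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout relative to the mean reward of its group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


print(examples_per_batch(256, 8))              # 32 examples per batch
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
```

With the default config, each batch of 256 rollouts therefore covers 32 dataset examples. Note that a group in which every rollout earns the same reward yields all-zero advantages under this scheme.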

## Training Configuration

### Model Selection

```toml
model = "Qwen/Qwen3-4B-Instruct-2507"
```

The `model` field accepts any Hugging Face model ID or a local checkpoint path. To train LoRA adapters instead of full weights, add a `[lora]` section:

```toml
[lora]
enabled = true
r = 64
alpha = 16
dropout = 0.05
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj"]
```

### Environment Configuration

Train on a single environment:

```toml
[[env]]
id = "primeintellect/math-python"
args = { max_turns = 10, difficulty = "hard" }
```

Or train on multiple environments:

```toml
[[env]]
id = "primeintellect/math-python"
args = { max_turns = 10 }
weight = 0.5

[[env]]
id = "primeintellect/gsm8k"
args = { max_turns = 5 }
weight = 0.3

[[env]]
id = "primeintellect/wiki-search"
args = { max_turns = 8 }
weight = 0.2
```

The `weight` parameter controls the sampling probability for each environment.
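
As an illustration of weighted sampling, the sketch below draws environment IDs in proportion to the weights from the config above. It assumes weights are treated as relative probabilities; whether `prime-rl` samples per example or per batch is not specified here, and `sample_env` is a hypothetical helper.

```python
import random
from collections import Counter

# Weights copied from the multi-environment config above.
env_weights = {
    "primeintellect/math-python": 0.5,
    "primeintellect/gsm8k": 0.3,
    "primeintellect/wiki-search": 0.2,
}


def sample_env(weights: dict[str, float], rng: random.Random) -> str:
    """Pick one environment ID with probability proportional to its weight."""
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(sample_env(env_weights, rng) for _ in range(10_000))
for env_id, n in counts.most_common():
    print(f"{env_id}: {n / 10_000:.2f}")
```

Over many draws, the observed fractions should approach 0.5, 0.3, and 0.2.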

### Sampling Configuration

```toml
[sampling]
max_tokens = 512
temperature = 0.7
top_p = 0.9
stop = ["<|endoftext|>"]
```

### Training Hyperparameters

```toml
learning_rate = 1e-5  # 1e-5 for LoRA, 1e-6 for full finetuning
max_steps = 1000
batch_size = 256
rollouts_per_example = 8
gradient_accumulation_steps = 1
warmup_steps = 100
```

### Online Difficulty Filtering

Filter out rollout groups that carry little learning signal, such as groups whose rewards barely vary or whose task is too easy or too hard:

```toml
[difficulty_filter]
enabled = true
min_reward_variance = 0.1  # Require some diversity in rewards
max_reward = 0.95  # Skip groups that are too easy
min_reward = 0.05  # Skip groups that are too hard
```
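
The following sketch shows one way these three thresholds could be applied to a single rollout group. It assumes `min_reward` and `max_reward` bound the group's mean reward and that variance is the population variance of the group's rewards; the actual `prime-rl` logic may differ, and `keep_group` is a hypothetical name.

```python
from statistics import mean, pvariance


def keep_group(
    rewards: list[float],
    min_reward: float = 0.05,
    max_reward: float = 0.95,
    min_reward_variance: float = 0.1,
) -> bool:
    """Return True if a rollout group passes all three thresholds."""
    avg = mean(rewards)
    if avg < min_reward or avg > max_reward:
        return False  # too hard or too easy on average
    return pvariance(rewards) >= min_reward_variance


print(keep_group([1.0] * 8))       # False: every rollout solved the task
print(keep_group([0.0] * 8))       # False: no rollout solved the task
print(keep_group([1.0, 0.0] * 4))  # True: mixed outcomes
```

Under these assumptions, a group with binary rewards is kept as long as at least one rollout succeeds and at least one fails within a group of 8.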

### Weights & Biases Integration

```toml
[wandb]
project = "my-training-runs"
name = "qwen3-4b-math-rl"
entity = "my-team"  # optional
```

## Multi-Node Training

For distributed training across multiple nodes:

1. Set up `prime-rl` on each node
2. Use the same training config on every node
3. Launch with distributed settings:

```bash
# Node 0 (master)
export MASTER_ADDR=node0.example.com
export MASTER_PORT=29500
export WORLD_SIZE=4
export RANK=0
uv run prime-rl configs/prime-rl/wiki-search.toml

# Node 1
export MASTER_ADDR=node0.example.com
export MASTER_PORT=29500
export WORLD_SIZE=4
export RANK=1
uv run prime-rl configs/prime-rl/wiki-search.toml

# ... and so on for nodes 2-3
```
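
Mismatched environment variables are an easy mistake when launching by hand. The sketch below is a hypothetical pre-flight check (not part of `prime-rl`) that validates the four variables used above before a node is started.

```python
import os

REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def check_distributed_env(env: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    if problems:
        return problems
    for name in ("MASTER_PORT", "WORLD_SIZE", "RANK"):
        if not env[name].isdecimal():
            problems.append(f"{name} must be a non-negative integer, got {env[name]!r}")
    if problems:
        return problems
    world_size, rank = int(env["WORLD_SIZE"]), int(env["RANK"])
    if world_size < 1:
        problems.append("WORLD_SIZE must be at least 1")
    elif rank >= world_size:
        problems.append(f"RANK {rank} must be less than WORLD_SIZE {world_size}")
    return problems


if __name__ == "__main__":
    # Print any problems found in the current shell environment.
    for problem in check_distributed_env(dict(os.environ)):
        print(problem)
```

Running this on each node after the `export` lines and before `uv run prime-rl` catches missing variables and out-of-range ranks early.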

## Monitoring Training

Training metrics are logged to Weights & Biases:

* `train/reward` - Average reward per rollout
* `train/loss` - Policy gradient loss
* `train/learning_rate` - Current learning rate
* `train/kl_divergence` - KL divergence from reference policy
* `rollout/mean_length` - Average rollout length
* `rollout/generation_time` - Time to generate rollouts

## Best Practices

<Note>
  Before training, validate your environment with `prime eval run` to ensure:

  * Baseline performance is > 0% (task isn't too hard)
  * Baseline performance is \< 80% (task isn't too easy)
  * Rewards show diversity across rollouts
</Note>
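
The three checks in the note can be expressed as a small helper. This is a sketch that assumes you have already collected per-rollout rewards in the range 0 to 1 from an evaluation run; how those rewards are exported is not covered here, and `baseline_is_trainable` is a hypothetical name.

```python
from statistics import mean, pstdev


def baseline_is_trainable(
    rewards: list[float], low: float = 0.0, high: float = 0.8
) -> bool:
    """Check that the mean reward lies strictly between low and high and that rewards vary."""
    avg = mean(rewards)
    return low < avg < high and pstdev(rewards) > 0.0


print(baseline_is_trainable([0.0, 0.0, 0.0, 0.0]))  # False: task looks too hard
print(baseline_is_trainable([1.0, 1.0, 1.0, 0.9]))  # False: task looks too easy
print(baseline_is_trainable([0.0, 1.0, 0.5, 0.0]))  # True: usable learning signal
```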

### For Faster Training

* Increase `learning_rate` (for example, from 1e-5 toward 1e-4 for LoRA)
* Decrease `rollouts_per_example` (to 4-8)
* Decrease `batch_size` (to 128-256)
* Use smaller models

### For More Stable Training

* Increase `rollouts_per_example` (to 16-32)
* Increase `batch_size` (to 512-1024)
* Use larger models (14B+)
* Enable online difficulty filtering
* Use KL penalty:

```toml
[kl_penalty]
enabled = true
target_kl = 0.01
beta = 0.1
```

## Common Issues

### OOM During Generation

* Reduce `rollouts_per_example` or `micro_batch_size`
* Use LoRA instead of full finetuning
* Ensure the vLLM inference server has sufficient GPU memory

### Training Instability

* Decrease learning rate
* Increase `rollouts_per_example`
* Increase `batch_size`
* Enable KL penalty

### Slow Training

* Increase learning rate
* Use continuous rewards (not sparse binary rewards)
* Enable online difficulty filtering
* Use easier tasks or more capable models

## Further Documentation

For advanced configuration options and troubleshooting, see the [prime-rl documentation](https://docs.primeintellect.ai/prime-rl).
