
# Running Evaluations

> Evaluate environments with models and analyze results

The `prime eval run` command executes rollouts against any supported model provider and reports aggregate metrics. Use it to test environments during development, benchmark models, and validate training progress.

## Quick Start

<Steps>
  ### Install your environment

  ```bash theme={null}
  prime env install my-env
  ```

  ### Run evaluation

  ```bash theme={null}
  prime eval run my-env -m gpt-4.1-mini -n 10
  ```

  This runs 10 examples with 3 rollouts each (default) for a total of 30 rollouts.

  ### View results

  Expected output:

  ```
  Loading environment: my-env
  Dataset: 10 examples, 3 rollouts per example = 30 total rollouts
  Model: gpt-4.1-mini

  Progress: ████████████████████ 30/30 (100%)

  Results:
    Reward: 0.73 ± 0.15
    correct_answer: 0.73 ± 0.15
    num_turns: 1.0 ± 0.0
    response_length: 142.3 ± 45.2
  ```
</Steps>

## Basic Usage

The basic command structure:

```bash theme={null}
prime eval run <env-id> [options]
```

### Essential Options

| Flag                     | Short | Description                | Default        |
| ------------------------ | ----- | -------------------------- | -------------- |
| `--model`                | `-m`  | Model to evaluate          | `gpt-4.1-mini` |
| `--num-examples`         | `-n`  | Number of dataset examples | `5`            |
| `--rollouts-per-example` | `-r`  | Rollouts per example       | `3`            |
| `--save-results`         | `-s`  | Save results to disk       | `false`        |
| `--verbose`              | `-v`  | Enable debug logging       | `false`        |

### Examples

<CodeGroup>
  ```bash Basic Eval theme={null}
  prime eval run gsm8k -m gpt-4.1-mini -n 20
  ```

  ```bash Save Results theme={null}
  prime eval run math-python -m gpt-4.1-mini -n 100 -r 5 -s
  ```

  ```bash Debug Mode theme={null}
  prime eval run my-env -m gpt-4.1-mini -n 2 -v
  ```

  ```bash Custom State Columns theme={null}
  prime eval run my-env -m gpt-4.1-mini -n 10 -s \
    -C "judge_response,parsed_answer,reasoning"
  ```
</CodeGroup>

## Model Configuration

### Using Model Aliases

Define model endpoints in `configs/endpoints.toml`:

```toml configs/endpoints.toml theme={null}
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-235b-i"
model = "qwen/qwen3-235b-a22b-instruct-2507"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"

[[endpoint]]
endpoint_id = "claude-sonnet"
model = "claude-sonnet-4-5-20250929"
url = "https://api.anthropic.com"
key = "ANTHROPIC_API_KEY"
api_client_type = "anthropic_messages"
```

Then use the alias directly:

```bash theme={null}
prime eval run my-env -m qwen3-235b-i
```

### Direct Configuration

Override configuration without using aliases:

```bash theme={null}
prime eval run my-env \
  -m custom-model \
  -b https://my-api.example.com/v1 \
  -k MY_API_KEY
```

### Sampling Parameters

Control model generation:

```bash theme={null}
# Temperature and max tokens
prime eval run my-env -m gpt-4.1-mini -T 0.7 -t 1024

# Additional parameters via JSON
prime eval run my-env -m gpt-4.1-mini \
  -S '{"temperature": 0.7, "top_p": 0.9, "presence_penalty": 0.1}'
```
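
When the sampling arguments come from a script, hand-writing the JSON inside shell quotes is easy to get wrong. The sketch below builds the same command with `json.dumps` and `shlex.quote`. It uses only the flags shown above; the helper name `build_eval_command` is hypothetical and not part of the CLI.

```python
import json
import shlex


def build_eval_command(env_id: str, model: str, sampling_args: dict) -> str:
    """Return a shell-safe `prime eval run` command string (hypothetical helper)."""
    parts = [
        "prime", "eval", "run", env_id,
        "-m", model,
        "-S", json.dumps(sampling_args),  # serialize the dict as a JSON object
    ]
    # Quote each argument so the JSON survives shell parsing intact.
    return " ".join(shlex.quote(p) for p in parts)


cmd = build_eval_command(
    "my-env",
    "gpt-4.1-mini",
    {"temperature": 0.7, "top_p": 0.9, "presence_penalty": 0.1},
)
print(cmd)
```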

## Environment Configuration

### Passing Arguments to load\_environment()

Use `--env-args` (`-a`) to pass a JSON object whose keys match the parameters of `load_environment()`:

```python my_env.py theme={null}
def load_environment(
    difficulty: str = "easy",
    num_examples: int = -1,
    seed: int = 42,
):
    # Your environment implementation
    ...
```

```bash theme={null}
prime eval run my-env \
  -a '{"difficulty": "hard", "num_examples": 100, "seed": 123}'
```
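
The following sketch shows how such a JSON object could map onto the function signature. It assumes the CLI parses the JSON and forwards it as keyword arguments; the actual loading code may differ, and the `load_environment` body here is only a stand-in.

```python
import json


def load_environment(difficulty: str = "easy", num_examples: int = -1, seed: int = 42):
    # Stand-in for a real environment: just echo the resolved settings.
    return {"difficulty": difficulty, "num_examples": num_examples, "seed": seed}


raw = '{"difficulty": "hard", "num_examples": 100}'
env_args = json.loads(raw)          # JSON object -> Python dict
env = load_environment(**env_args)  # omitted keys keep their defaults
print(env)
```

Keys that do not match a parameter name would raise a `TypeError` in this sketch, so it is worth checking spelling against the function signature.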

### Overriding Environment Constructor Args

Use `--extra-env-kwargs` (`-x`) to pass a JSON object of arguments directly to the environment constructor:

```bash theme={null}
# Override max_turns
prime eval run my-env -x '{"max_turns": 20}'

# Override multiple parameters
prime eval run my-env \
  -x '{"max_turns": 15, "system_prompt": "Custom prompt"}'
```

## Evaluation Scope

### Examples and Rollouts

Control the evaluation size:

```bash theme={null}
# 50 examples × 5 rollouts = 250 total rollouts
prime eval run my-env -n 50 -r 5
```

Why multiple rollouts per example?

* Measure variance in model responses
* Compute pass\@k metrics
* Enable advantage-based RL training
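
To illustrate the first point, the sketch below groups rewards by example and reports the per-example mean and standard deviation. The reward values are made up for illustration and do not come from a real run.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Illustrative (example_id, reward) pairs: 2 examples x 3 rollouts.
records = [(0, 1.0), (0, 0.0), (0, 1.0), (1, 1.0), (1, 1.0), (1, 1.0)]

# Collect the rewards that belong to each example.
by_example = defaultdict(list)
for example_id, reward in records:
    by_example[example_id].append(reward)

# Per-example (mean, population standard deviation).
summary = {
    ex: (mean(rewards), pstdev(rewards)) for ex, rewards in by_example.items()
}
for ex, (mu, sigma) in summary.items():
    print(f"example {ex}: mean={mu:.2f} std={sigma:.2f}")
```

With a single rollout per example, the spread for example 0 would be invisible: the run would report either 0.0 or 1.0 depending on which sample happened to be drawn.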

### Concurrency

Control parallel execution:

```bash theme={null}
# 64 concurrent requests
prime eval run my-env -m gpt-4.1-mini -n 100 -c 64

# Separate generation and scoring concurrency
prime eval run my-env -n 100 \
  --max-concurrent-generation 32 \
  --max-concurrent-scoring 64
```
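
Conceptually, a concurrency limit caps how many requests are in flight at once. The sketch below shows the idea with `asyncio.Semaphore`. It is an illustration, not the CLI's actual implementation, and `fake_rollout` is a placeholder for a model request.

```python
import asyncio


async def run_with_limit(num_rollouts: int, max_concurrent: int) -> int:
    """Run placeholder rollouts and return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_rollout() -> None:
        nonlocal in_flight, peak
        async with semaphore:          # wait for a free slot
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for a model request
            in_flight -= 1

    await asyncio.gather(*(fake_rollout() for _ in range(num_rollouts)))
    return peak


peak = asyncio.run(run_with_limit(num_rollouts=20, max_concurrent=4))
print(f"peak concurrent rollouts: {peak}")
```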

## Saving and Resuming

### Saving Results

Enable checkpointing with `-s`:

```bash theme={null}
prime eval run my-env -m gpt-4.1-mini -n 100 -s
```

Results are saved to:

```
./environments/my_env/outputs/evals/my-env--openai--gpt-4.1-mini/{run_id}/
├── results.jsonl      # One rollout per line
└── metadata.json      # Configuration and metrics
```

### Resuming Evaluations

<Steps>
  ### Start a long evaluation

  ```bash theme={null}
  prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s
  ```

  ### If interrupted, resume from checkpoint

  ```bash theme={null}
  # Auto-detect latest incomplete run
  prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s --resume

  # Or specify exact path
  prime eval run math-python -m gpt-4.1-mini -n 500 -r 3 -s \
    --resume ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/abc12345
  ```
</Steps>

When resuming:

* Existing completed rollouts are loaded
* Only incomplete rollouts are executed
* Results are appended to the existing checkpoint
* If all rollouts are already complete, the command returns immediately

<Warning>
  **Configuration compatibility**: When resuming, use the same configuration (model, env-args, rollouts-per-example). Mismatches can lead to undefined behavior.
</Warning>
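
The resume behavior can be pictured as a difference between planned and saved rollouts. The sketch below is an assumption about the general idea, not the actual checkpoint logic: it assumes each saved line carries an `example_id` field and simply counts completed rollouts per example. The helper name `remaining_rollouts` is hypothetical.

```python
import json
from collections import Counter


def remaining_rollouts(saved_lines, num_examples, rollouts_per_example):
    """Map example_id -> number of rollouts still to run (hypothetical helper)."""
    done = Counter(
        json.loads(line)["example_id"] for line in saved_lines if line.strip()
    )
    return {
        ex: rollouts_per_example - done[ex]
        for ex in range(num_examples)
        if done[ex] < rollouts_per_example
    }


# Illustrative checkpoint: example 0 finished, example 1 has one rollout saved.
saved = [
    '{"example_id": 0, "reward": 1.0}',
    '{"example_id": 0, "reward": 0.0}',
    '{"example_id": 1, "reward": 1.0}',
]
todo = remaining_rollouts(saved, num_examples=3, rollouts_per_example=2)
print(todo)
```

This also shows why a changed `rollouts-per-example` value is risky on resume: the same checkpoint would yield a different set of remaining rollouts.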

### Saving Custom State Columns

Save environment-specific state fields:

```bash theme={null}
prime eval run my-env -s -C "judge_response,parsed_answer,reasoning_trace"
```

Default columns: `query`, `completion`, `expected_answer`, `reward`, `error`

Access in results.jsonl:

```json theme={null}
{
  "example_id": 0,
  "reward": 0.8,
  "completion": [...],
  "judge_response": "Correct",
  "parsed_answer": "42",
  "reasoning_trace": "First I..."
}
```

## Multi-Environment Evaluation

Evaluate multiple environments with a single command using TOML configs.

### Basic Multi-Env Config

```toml configs/eval/my-benchmark.toml theme={null}
# Global defaults
model = "gpt-4.1-mini"
num_examples = 50
rollouts_per_example = 3

[[eval]]
env_id = "gsm8k"
num_examples = 100  # Override global

[[eval]]
env_id = "math-python"

[[eval]]
env_id = "reverse-text"
rollouts_per_example = 5  # Override global
```

Run all evaluations:

```bash theme={null}
prime eval run configs/eval/my-benchmark.toml
```

### Per-Environment Configuration

```toml configs/eval/detailed.toml theme={null}
model = "gpt-4.1-mini"

[[eval]]
env_id = "math-python"
num_examples = 50

[eval.env_args]
difficulty = "hard"
split = "test"

[[eval]]
env_id = "wiki-search"
num_examples = 30

[eval.env_args]
max_turns = 15
judge_model = "gpt-4.1-mini"
```

### Using Endpoint Registry

```toml configs/eval/multi-model.toml theme={null}
endpoints_path = "./configs/endpoints.toml"

[[eval]]
env_id = "gsm8k"
endpoint_id = "gpt-4.1-mini"
num_examples = 100

[[eval]]
env_id = "gsm8k"
endpoint_id = "qwen3-235b-i"
num_examples = 100

[[eval]]
env_id = "gsm8k"
endpoint_id = "claude-sonnet"
num_examples = 100
```

This runs the same environment with different models for comparison.

## Configuration Precedence

When using **CLI only**:

1. CLI arguments (highest priority)
2. Environment defaults from `pyproject.toml`
3. Built-in defaults (lowest priority)

When using **TOML config**:

1. Per-eval settings in `[[eval]]` sections (highest priority)
2. Global settings at top of TOML
3. Environment defaults from `pyproject.toml`
4. Built-in defaults (lowest priority)

<Info>
  When using a TOML config file, all CLI arguments are ignored.
</Info>
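
The TOML precedence order behaves like a layered lookup in which the first layer that defines a key wins. A minimal sketch using `collections.ChainMap`; the layer contents are illustrative, apart from the built-in defaults, which are taken from the options table in this guide.

```python
from collections import ChainMap

# Layers, lowest to highest priority. Values are illustrative.
builtin_defaults = {"num_examples": 5, "rollouts_per_example": 3}
env_defaults = {"num_examples": 20}                        # e.g. from pyproject.toml
global_settings = {"num_examples": 50, "model": "gpt-4.1-mini"}
per_eval = {"num_examples": 100}                           # one [[eval]] section

# ChainMap looks keys up left to right, so list the highest priority first.
resolved = dict(ChainMap(per_eval, global_settings, env_defaults, builtin_defaults))
print(resolved)
```

Here `num_examples` resolves to the per-eval value, `model` falls through to the global setting, and `rollouts_per_example` falls all the way through to the built-in default.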

## Output and Display

### Standard Output

Default display shows progress and summary:

```
Loading environment: my-env
Dataset: 50 examples, 3 rollouts per example = 150 total rollouts
Model: gpt-4.1-mini

Progress: ████████████████████ 150/150 (100%)

Results:
  Reward: 0.82 ± 0.11
  correct_answer: 0.82 ± 0.11
  num_turns: 6.3 ± 2.1
  total_tool_calls: 8.7 ± 3.2
  search_pages_calls: 2.1 ± 1.0
  read_section_calls: 6.6 ± 2.8
```

### Verbose Mode

Enable detailed logging:

```bash theme={null}
prime eval run my-env -m gpt-4.1-mini -n 2 -v
```

Shows:

* Model requests and responses
* Tool calls and results
* Reward function execution
* State updates
* Timing information

### TUI Mode

Use alternate screen mode for cleaner display:

```bash theme={null}
prime eval run my-env -m gpt-4.1-mini -n 100 -u
```

### Debug Mode

Disable the Rich display and use standard logging instead:

```bash theme={null}
prime eval run my-env -m gpt-4.1-mini -n 10 -d
```

Useful for:

* CI/CD environments
* Piping output to files
* Debugging display issues

## Advanced Features

### Independent Rollout Scoring

By default, rollouts are scored in groups (all rollouts for the same example together). This enables group-based reward functions and pass\@k metrics.

Score rollouts independently instead:

```bash theme={null}
prime eval run my-env -m gpt-4.1-mini -n 10 -i
```

Use when:

* You don't need group-based metrics
* You want faster completion (scoring happens as rollouts finish)
* Your reward functions don't use plural arguments (completions, prompts, etc.)

### Disable Interleaved Scoring

By default, scoring happens as rollouts complete. Disable to score all at once:

```bash theme={null}
prime eval run my-env -m gpt-4.1-mini -n 10 -N
```

### Automatic Retries

Enable retries for transient infrastructure errors:

```bash theme={null}
prime eval run my-env -m gpt-4.1-mini -n 100 --max-retries 3
```

Retries with exponential backoff on:

* Sandbox timeouts
* API failures
* Network errors
* Other `vf.InfraError` instances
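
For readers unfamiliar with the pattern, here is a sketch of retrying with exponential backoff. The `InfraError` class is a local stand-in rather than `vf.InfraError`, and the delay schedule (base delay doubled on each attempt) is an assumption, not the CLI's documented schedule.

```python
import time


class InfraError(Exception):
    """Local stand-in for a transient infrastructure error."""


def with_retries(fn, max_retries=3, base_delay=0.01, sleep=time.sleep):
    """Call fn, retrying on InfraError with delays base, 2*base, 4*base, ..."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except InfraError:
            if attempt == max_retries:
                raise                      # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)


calls = {"n": 0}


def flaky():
    # Fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise InfraError("transient failure")
    return "ok"


result = with_retries(flaky)
print(result, "after", calls["n"], "attempts")
```

Only the designated error type is retried in this sketch; any other exception propagates immediately, which keeps genuine bugs from being masked by retries.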

### Heartbeat Monitoring

Send heartbeat pings during evaluation:

```bash theme={null}
prime eval run my-env -m gpt-4.1-mini -n 1000 \
  --heartbeat-url https://status.example.com/ping/abc123
```

Useful for monitoring long-running evaluations.

### Push to Hugging Face Hub

Publish results to HF Hub:

```bash theme={null}
prime eval run my-env -m gpt-4.1-mini -n 100 -s \
  --save-to-hf-hub \
  --hf-hub-dataset-name my-org/my-env-results
```

## Analyzing Results

After saving results with `-s`, analyze the output files.

### Results Structure

```
outputs/evals/my-env--openai--gpt-4.1-mini/{run_id}/
├── results.jsonl      # One rollout per line
└── metadata.json      # Configuration and metrics
```

### Reading Results

```python theme={null}
import json

# Load metadata
with open("metadata.json") as f:
    metadata = json.load(f)
    print(f"Reward: {metadata['reward']} ± {metadata['reward_std']}")

# Load individual rollouts
rollouts = []
with open("results.jsonl") as f:
    for line in f:
        rollouts.append(json.loads(line))  # loads() parses a string; load() expects a file object

# Analyze
for rollout in rollouts:
    print(f"Example {rollout['example_id']}: reward={rollout['reward']}")
    print(f"Completion: {rollout['completion'][-1]['content'][:100]}...")
```

### Computing Metrics

```python theme={null}
import numpy as np

rewards = [r["reward"] for r in rollouts]
print(f"Mean reward: {np.mean(rewards):.3f}")
print(f"Median reward: {np.median(rewards):.3f}")
print(f"Min/Max: {np.min(rewards):.3f} / {np.max(rewards):.3f}")

# Pass@k
def pass_at_k(rollouts, k):
    examples = {}
    for r in rollouts:
        ex_id = r["example_id"]
        if ex_id not in examples:
            examples[ex_id] = []
        examples[ex_id].append(r["reward"])
    
    # An example counts as solved if any of its first k rollouts scores above 0.5
    successes = sum(max(ex_rewards[:k]) > 0.5 for ex_rewards in examples.values())
    return successes / len(examples)

print(f"Pass@1: {pass_at_k(rollouts, 1):.3f}")
print(f"Pass@3: {pass_at_k(rollouts, 3):.3f}")
```
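
The `pass_at_k` function above looks only at the first `k` rollouts of each example, so its value depends on rollout order. A common alternative is the unbiased estimator `1 - C(n - c, k) / C(n, k)`, where `n` is the number of rollouts for an example and `c` is how many of them passed. The sketch below is one way to compute it; it reuses the 0.5 reward threshold from above, and the function names and demo rewards are illustrative.

```python
from math import comb


def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Chance that k rollouts drawn without replacement from n include a pass, given c passes."""
    if k > n:
        raise ValueError("k cannot exceed the number of rollouts n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(rewards_by_example: dict, k: int, threshold: float = 0.5) -> float:
    """Average the per-example estimate over all examples."""
    scores = [
        pass_at_k_unbiased(len(r), sum(x > threshold for x in r), k)
        for r in rewards_by_example.values()
    ]
    return sum(scores) / len(scores)


# Illustrative rewards: 2 examples x 3 rollouts.
demo = {0: [1.0, 0.0, 0.0], 1: [0.0, 0.0, 0.0]}
print(f"pass@1: {mean_pass_at_k(demo, 1):.3f}")
print(f"pass@3: {mean_pass_at_k(demo, 3):.3f}")
```

To use it with saved results, group `rollout["reward"]` by `rollout["example_id"]` into a dict of lists and pass that dict as `rewards_by_example`.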

## Common Workflows

### Quick Environment Test

```bash theme={null}
# Fast sanity check during development
prime eval run my-env -m gpt-4.1-mini -n 2 -r 1 -v
```

### Benchmark Evaluation

```bash theme={null}
# Full evaluation with saves
prime eval run my-env -m gpt-4.1-mini -n 100 -r 5 -s
```

### Model Comparison

Create a config:

```toml configs/eval/compare.toml theme={null}
num_examples = 50
rollouts_per_example = 3

[[eval]]
env_id = "gsm8k"
endpoint_id = "gpt-4.1-mini"

[[eval]]
env_id = "gsm8k"
endpoint_id = "qwen3-235b-i"

[[eval]]
env_id = "gsm8k"
endpoint_id = "claude-sonnet"
```

Run:

```bash theme={null}
prime eval run configs/eval/compare.toml
```

### Pre-Training Baseline

```bash theme={null}
# Establish baseline before training
prime eval run my-env -m my-model -n 200 -r 5 -s
```

### Post-Training Validation

```bash theme={null}
# Evaluate checkpoints
for checkpoint in checkpoint_*.pt; do
    prime eval run my-env -m "$checkpoint" -n 100 -s
done
```

## Troubleshooting

### Slow Evaluation

* Increase concurrency: `-c 64`
* Reduce rollouts per example: `-r 1`
* Use a faster model for initial testing
* Check network latency to the API

### Out of Memory

* Reduce concurrency: `-c 16`
* Use smaller models
* Enable independent scoring: `-i`

### API Rate Limits

* Reduce concurrency: `-c 8`
* Use endpoint registry with multiple replicas
* Add retries: `--max-retries 3`

### Inconsistent Results

* Increase rollouts per example: `-r 10`
* Check the temperature setting: use `-T 0.0` for (near-)deterministic output
* Ensure environment is deterministic (check random seeds)

## Next Steps

* **Training**: Use evaluation results to guide RL training → [Training Guide](/guides/training)
* **Environment improvements**: Analyze failed rollouts to improve your environment
* **Model selection**: Compare models to choose the best starting point for training
