> ## Documentation Index
> Fetch the complete documentation index at: https://mintlify.com/primeintellect-ai/verifiers/llms.txt
> Use this file to discover all available pages before exploring further.

# prime eval run

> Run evaluations on environments with any model

## Overview

The `prime eval run` command executes rollouts against model APIs and reports aggregate metrics. It supports single-environment evaluations or multi-environment benchmark suites via TOML config files.

## Usage

```bash theme={null}
prime eval run <env_id_or_config> [OPTIONS]
```

## Arguments

<ParamField path="env_id_or_config" type="string" required>
  Either:

  * **Environment ID**: `gsm8k`, `primeintellect/math-python`
  * **TOML config path**: `configs/eval/benchmark.toml` (for multi-environment evals)
</ParamField>

## Model Configuration

<ParamField path="--model" type="string" default="openai/gpt-4.1-mini">
  Model name or endpoint alias from the registry.

  Aliases: `-m`
</ParamField>

<ParamField path="--api-base-url" type="string" default="https://api.pinference.ai/api/v1">
  API base URL. Overrides endpoint registry.

  Aliases: `-b`
</ParamField>

<ParamField path="--api-key-var" type="string" default="PRIME_API_KEY">
  Environment variable containing API key.

  Aliases: `-k`
</ParamField>

<ParamField path="--api-client-type" type="string" default="openai_chat_completions">
  Client type: `openai_chat_completions`, `openai_completions`, `openai_chat_completions_token`, or `anthropic_messages`.
</ParamField>

<ParamField path="--endpoints-path" type="string" default="./configs/endpoints.toml">
  Path to TOML endpoint registry.

  Aliases: `-e`
</ParamField>

<ParamField path="--provider" type="string">
  Provider shorthand (`prime`, `openai`, `anthropic`, `openrouter`, `deepseek`, `minimax`, `glm`, `local`, `vllm`).

  Aliases: `-p`
</ParamField>

<ParamField path="--header" type="string">
  Extra HTTP header (`Name: Value`). Repeatable.
</ParamField>

## Sampling Parameters

<ParamField path="--max-tokens" type="integer">
  Maximum tokens to generate.

  Aliases: `-t`
</ParamField>

<ParamField path="--temperature" type="float">
  Sampling temperature.

  Aliases: `-T`
</ParamField>

<ParamField path="--sampling-args" type="json">
  Additional sampling parameters as JSON object.

  Aliases: `-S`

  Example: `-S '{"top_p": 0.9, "frequency_penalty": 0.5}'`
</ParamField>

## Environment Configuration

<ParamField path="--env-args" type="json" default="{}">
  Arguments passed to `load_environment()` as JSON.

  Aliases: `-a`

  Example: `-a '{"difficulty": "hard"}'`
</ParamField>

<ParamField path="--extra-env-kwargs" type="json" default="{}">
  Arguments passed directly to environment constructor.

  Aliases: `-x`

  Example: `-x '{"max_turns": 20}'`
</ParamField>

<ParamField path="--env-dir-path" type="string" default="./environments">
  Base path for environment outputs.
</ParamField>

## Evaluation Scope

<ParamField path="--num-examples" type="integer" default="5">
  Number of dataset examples to evaluate.

  Aliases: `-n`
</ParamField>

<ParamField path="--rollouts-per-example" type="integer" default="3">
  Rollouts per example (for pass\@k metrics).

  Aliases: `-r`
</ParamField>

## Concurrency

<ParamField path="--max-concurrent" type="integer" default="32">
  Maximum concurrent requests (both generation and scoring).

  Aliases: `-c`
</ParamField>

<ParamField path="--max-concurrent-generation" type="integer">
  Concurrent generation requests (defaults to `--max-concurrent`).
</ParamField>

<ParamField path="--max-concurrent-scoring" type="integer">
  Concurrent scoring requests (defaults to `--max-concurrent`).
</ParamField>

<ParamField path="--no-interleave-scoring" type="flag">
  Disable interleaved scoring (score all rollouts after generation completes).

  Aliases: `-N`
</ParamField>

<ParamField path="--independent-scoring" type="flag">
  Score each rollout individually instead of by group.

  Aliases: `-i`
</ParamField>

<ParamField path="--max-retries" type="integer" default="0">
  Retries per rollout on transient infrastructure errors.
</ParamField>

## Output and Display

<ParamField path="--verbose" type="flag">
  Enable debug logging.

  Aliases: `-v`
</ParamField>

<ParamField path="--tui" type="flag">
  Use alternate screen mode (TUI) for live display.

  Aliases: `-u`
</ParamField>

<ParamField path="--debug" type="flag">
  Disable Rich display; use normal logging and tqdm progress.

  Aliases: `-d`
</ParamField>

<ParamField path="--save-results" type="flag">
  Save results to disk in `./outputs/evals/` or `./environments/*/outputs/evals/`.

  Aliases: `-s`
</ParamField>

<ParamField path="--state-columns" type="string">
  Extra state columns to save (comma-separated).

  Aliases: `-C`

  Example: `-C "judge_response,parsed_answer"`
</ParamField>

<ParamField path="--resume" type="string">
  Resume from a previous run. Optionally provide a path; if omitted, auto-detect latest incomplete run.

  Aliases: `-R`
</ParamField>

<ParamField path="--save-to-hf-hub" type="flag">
  Push results to Hugging Face Hub.

  Aliases: `-H`
</ParamField>

<ParamField path="--hf-hub-dataset-name" type="string">
  Dataset name for HF Hub upload.

  Aliases: `-D`
</ParamField>

<ParamField path="--heartbeat-url" type="string">
  Heartbeat URL for uptime monitoring.
</ParamField>

<ParamField path="--disable-env-server" type="flag">
  Do not start environment servers for OpenEnv environments.
</ParamField>

## Examples

### Basic Evaluation

```bash theme={null}
prime eval run gsm8k -m gpt-4.1-mini -n 10
```

**Output:**

```
╭──────────────── gsm8k | openai/gpt-4.1-mini ────────────────╮
│ Examples: 10 | Rollouts/ex: 3 | Total: 30 rollouts         │
│                                                              │
│ Progress  ████████████████████████████████  30/30  100%     │
│                                                              │
│ Avg reward:    0.867                                         │
│ Avg metrics:   accuracy=0.867                                │
│ Runtime:       12.3s                                         │
╰──────────────────────────────────────────────────────────────╯
```

### With Custom Sampling

```bash theme={null}
prime eval run math-python \
  -m gpt-4.1-mini \
  -n 50 -r 5 \
  -t 1024 -T 0.7 \
  -S '{"top_p": 0.95}'
```

### With Environment Arguments

```bash theme={null}
prime eval run my-env \
  -m gpt-4.1-mini \
  -a '{"difficulty": "hard", "split": "test"}'
```

### Save and Resume

```bash theme={null}
# Start evaluation with checkpointing
prime eval run gsm8k -n 1000 -s

# If interrupted, resume from checkpoint
prime eval run gsm8k -n 1000 -s --resume

# Resume from specific path
prime eval run gsm8k -n 1000 -s \
  --resume ./environments/gsm8k/outputs/evals/gsm8k--openai--gpt-4.1-mini/abc12345
```

### Using Anthropic API

```bash theme={null}
prime eval run gsm8k \
  -m claude-sonnet-4 \
  --api-client-type anthropic_messages \
  --api-base-url https://api.anthropic.com \
  --api-key-var ANTHROPIC_API_KEY
```

Or using the provider shorthand:

```bash theme={null}
prime eval run gsm8k -m claude-sonnet-4 -p anthropic
```

### Multi-Environment Benchmark

```bash theme={null}
prime eval run configs/eval/benchmark.toml
```

**configs/eval/benchmark.toml:**

```toml theme={null}
model = "openai/gpt-4.1-mini"
num_examples = 100

[[eval]]
env_id = "gsm8k"

[[eval]]
env_id = "math-python"
rollouts_per_example = 5

[[eval]]
env_id = "alphabet-sort"
num_examples = 50
```

### High Concurrency

```bash theme={null}
prime eval run gsm8k \
  -n 1000 -r 10 \
  -c 128 \
  --max-concurrent-generation 64 \
  --max-concurrent-scoring 64
```

### Debug Mode

```bash theme={null}
prime eval run gsm8k -n 5 --debug --verbose
```

## Configuration Files

### Endpoint Registry

Define model endpoints in `configs/endpoints.toml`:

```toml theme={null}
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-235b-i"
model = "qwen/qwen3-235b-a22b-instruct-2507"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"

# Multiple replicas for load balancing
[[endpoint]]
endpoint_id = "my-model"
model = "my-model"
url = "https://api1.example.com/v1"
key = "API_KEY"

[[endpoint]]
endpoint_id = "my-model"
model = "my-model"
url = "https://api2.example.com/v1"
key = "API_KEY"
```

Then reference by endpoint ID:

```bash theme={null}
prime eval run gsm8k -m qwen3-235b-i
```

### Multi-Environment Config

```toml theme={null}
# Global defaults
model = "openai/gpt-4.1-mini"
num_examples = 50
rollouts_per_example = 3

[[eval]]
env_id = "gsm8k"
num_examples = 100  # overrides global

[eval.env_args]
difficulty = "hard"

[[eval]]
env_id = "math-python"
# uses global defaults

[[eval]]
env_id = "alphabet-sort"
endpoint_id = "qwen3-235b-i"
num_examples = 25
```

## Results Output

With `--save-results`, outputs are saved to:

```
./outputs/evals/{env_id}--{model}/{run_id}/
├── results.jsonl      # One rollout per line
└── metadata.json      # Config and aggregate metrics
```

Or per-environment:

```
./environments/{env_id}/outputs/evals/{env_id}--{model}/{run_id}/
```

### results.jsonl Format

Each line contains one rollout:

```json theme={null}
{
  "prompt": [{"role": "user", "content": "What is 2+2?"}],
  "completion": [{"role": "assistant", "content": "4"}],
  "reward": 1.0,
  "metrics": {"accuracy": 1.0},
  "task": {"question": "What is 2+2?", "answer": "4"},
  "info": {},
  "timing": {"generation_ms": 123, "scoring_ms": 45}
}
```

### metadata.json Format

```json theme={null}
{
  "env_id": "gsm8k",
  "model": "openai/gpt-4.1-mini",
  "num_examples": 10,
  "rollouts_per_example": 3,
  "avg_reward": 0.867,
  "avg_metrics": {"accuracy": 0.867},
  "time_ms": 12345,
  "sampling_args": {"max_tokens": 512, "temperature": 0.7},
  "env_args": {},
  "date": "2026-03-03",
  "time": "14:23:45"
}
```

## Configuration Precedence

### CLI Mode

1. CLI flags
2. Environment defaults (from `pyproject.toml`)
3. Built-in defaults

### TOML Config Mode

1. Per-eval settings (`[[eval]]` sections)
2. Global settings (top of config file)
3. Environment defaults (from `pyproject.toml`)
4. Built-in defaults

## Environment Defaults

Environments can specify defaults in `pyproject.toml`:

```toml theme={null}
[tool.verifiers.eval]
num_examples = 100
rollouts_per_example = 5
```

These are used when higher-priority sources don't specify a value.

## Resuming Evaluations

Long evaluations can be resumed:

```bash theme={null}
# Start with checkpointing
prime eval run gsm8k -n 1000 -s

# Interrupt with Ctrl+C

# Resume automatically (finds latest incomplete run)
prime eval run gsm8k -n 1000 -s --resume
```

Resume requirements:

* Same `env_id`, `model`, and `rollouts_per_example`
* `num_examples` must be >= original target
* Results directory must contain valid `results.jsonl` and `metadata.json`

## Troubleshooting

### Environment Not Found

```
Missing environments. Install them first:
  prime env install gsm8k
```

**Solution:**

```bash theme={null}
prime env install gsm8k
```

### API Key Not Set

```
Error: Environment variable 'PRIME_API_KEY' not set
```

**Solution:**

```bash theme={null}
export PRIME_API_KEY="your-api-key"
```

### Rate Limit Errors

Reduce concurrency:

```bash theme={null}
prime eval run gsm8k -n 100 -c 8
```

Or add retries:

```bash theme={null}
prime eval run gsm8k -n 100 --max-retries 3
```

### Out of Memory

For large evaluations, enable checkpointing and reduce batch size:

```bash theme={null}
prime eval run gsm8k -n 1000 -s -c 16
```
