> ## Documentation Index > Fetch the complete documentation index at: https://mintlify.com/primeintellect-ai/verifiers/llms.txt > Use this file to discover all available pages before exploring further. # prime eval run > Run evaluations on environments with any model ## Overview The `prime eval run` command executes rollouts against model APIs and reports aggregate metrics. It supports single-environment evaluations or multi-environment benchmark suites via TOML config files. ## Usage ```bash theme={null} prime eval run [OPTIONS] ``` ## Arguments Either: * **Environment ID**: `gsm8k`, `primeintellect/math-python` * **TOML config path**: `configs/eval/benchmark.toml` (for multi-environment evals) ## Model Configuration Model name or endpoint alias from the registry. Aliases: `-m` API base URL. Overrides endpoint registry. Aliases: `-b` Environment variable containing API key. Aliases: `-k` Client type: `openai_chat_completions`, `openai_completions`, `openai_chat_completions_token`, or `anthropic_messages`. Path to TOML endpoint registry. Aliases: `-e` Provider shorthand (`prime`, `openai`, `anthropic`, `openrouter`, `deepseek`, `minimax`, `glm`, `local`, `vllm`). Aliases: `-p` Extra HTTP header (`Name: Value`). Repeatable. ## Sampling Parameters Maximum tokens to generate. Aliases: `-t` Sampling temperature. Aliases: `-T` Additional sampling parameters as JSON object. Aliases: `-S` Example: `-S '{"top_p": 0.9, "frequency_penalty": 0.5}'` ## Environment Configuration Arguments passed to `load_environment()` as JSON. Aliases: `-a` Example: `-a '{"difficulty": "hard"}'` Arguments passed directly to environment constructor. Aliases: `-x` Example: `-x '{"max_turns": 20}'` Base path for environment outputs. ## Evaluation Scope Number of dataset examples to evaluate. Aliases: `-n` Rollouts per example (for pass\@k metrics). Aliases: `-r` ## Concurrency Maximum concurrent requests (both generation and scoring). Aliases: `-c` Concurrent generation requests (defaults to `--max-concurrent`). Concurrent scoring requests (defaults to `--max-concurrent`). Disable interleaved scoring (score all rollouts after generation completes). Aliases: `-N` Score each rollout individually instead of by group. Aliases: `-i` Retries per rollout on transient infrastructure errors. ## Output and Display Enable debug logging. Aliases: `-v` Use alternate screen mode (TUI) for live display. Aliases: `-u` Disable Rich display; use normal logging and tqdm progress. Aliases: `-d` Save results to disk in `./outputs/evals/` or `./environments/*/outputs/evals/`. Aliases: `-s` Extra state columns to save (comma-separated). Aliases: `-C` Example: `-C "judge_response,parsed_answer"` Resume from a previous run. Optionally provide a path; if omitted, auto-detect latest incomplete run. Aliases: `-R` Push results to Hugging Face Hub. Aliases: `-H` Dataset name for HF Hub upload. Aliases: `-D` Heartbeat URL for uptime monitoring. Do not start environment servers for OpenEnv environments. ## Examples ### Basic Evaluation ```bash theme={null} prime eval run gsm8k -m gpt-4.1-mini -n 10 ``` **Output:** ``` ╭──────────────── gsm8k | openai/gpt-4.1-mini ────────────────╮ │ Examples: 10 | Rollouts/ex: 3 | Total: 30 rollouts │ │ │ │ Progress ████████████████████████████████ 30/30 100% │ │ │ │ Avg reward: 0.867 │ │ Avg metrics: accuracy=0.867 │ │ Runtime: 12.3s │ ╰──────────────────────────────────────────────────────────────╯ ``` ### With Custom Sampling ```bash theme={null} prime eval run math-python \ -m gpt-4.1-mini \ -n 50 -r 5 \ -t 1024 -T 0.7 \ -S '{"top_p": 0.95}' ``` ### With Environment Arguments ```bash theme={null} prime eval run my-env \ -m gpt-4.1-mini \ -a '{"difficulty": "hard", "split": "test"}' ``` ### Save and Resume ```bash theme={null} # Start evaluation with checkpointing prime eval run gsm8k -n 1000 -s # If interrupted, resume from checkpoint prime eval run gsm8k -n 1000 -s --resume # Resume from specific path prime eval run gsm8k -n 1000 -s \ --resume ./environments/gsm8k/outputs/evals/gsm8k--openai--gpt-4.1-mini/abc12345 ``` ### Using Anthropic API ```bash theme={null} prime eval run gsm8k \ -m claude-sonnet-4 \ --api-client-type anthropic_messages \ --api-base-url https://api.anthropic.com \ --api-key-var ANTHROPIC_API_KEY ``` Or using the provider shorthand: ```bash theme={null} prime eval run gsm8k -m claude-sonnet-4 -p anthropic ``` ### Multi-Environment Benchmark ```bash theme={null} prime eval run configs/eval/benchmark.toml ``` **configs/eval/benchmark.toml:** ```toml theme={null} model = "openai/gpt-4.1-mini" num_examples = 100 [[eval]] env_id = "gsm8k" [[eval]] env_id = "math-python" rollouts_per_example = 5 [[eval]] env_id = "alphabet-sort" num_examples = 50 ``` ### High Concurrency ```bash theme={null} prime eval run gsm8k \ -n 1000 -r 10 \ -c 128 \ --max-concurrent-generation 64 \ --max-concurrent-scoring 64 ``` ### Debug Mode ```bash theme={null} prime eval run gsm8k -n 5 --debug --verbose ``` ## Configuration Files ### Endpoint Registry Define model endpoints in `configs/endpoints.toml`: ```toml theme={null} [[endpoint]] endpoint_id = "gpt-4.1-mini" model = "gpt-4.1-mini" url = "https://api.openai.com/v1" key = "OPENAI_API_KEY" [[endpoint]] endpoint_id = "qwen3-235b-i" model = "qwen/qwen3-235b-a22b-instruct-2507" url = "https://api.pinference.ai/api/v1" key = "PRIME_API_KEY" # Multiple replicas for load balancing [[endpoint]] endpoint_id = "my-model" model = "my-model" url = "https://api1.example.com/v1" key = "API_KEY" [[endpoint]] endpoint_id = "my-model" model = "my-model" url = "https://api2.example.com/v1" key = "API_KEY" ``` Then reference by endpoint ID: ```bash theme={null} prime eval run gsm8k -m qwen3-235b-i ``` ### Multi-Environment Config ```toml theme={null} # Global defaults model = "openai/gpt-4.1-mini" num_examples = 50 rollouts_per_example = 3 [[eval]] env_id = "gsm8k" num_examples = 100 # overrides global [eval.env_args] difficulty = "hard" [[eval]] env_id = "math-python" # uses global defaults [[eval]] env_id = "alphabet-sort" endpoint_id = "qwen3-235b-i" num_examples = 25 ``` ## Results Output With `--save-results`, outputs are saved to: ``` ./outputs/evals/{env_id}--{model}/{run_id}/ ├── results.jsonl # One rollout per line └── metadata.json # Config and aggregate metrics ``` Or per-environment: ``` ./environments/{env_id}/outputs/evals/{env_id}--{model}/{run_id}/ ``` ### results.jsonl Format Each line contains one rollout: ```json theme={null} { "prompt": [{"role": "user", "content": "What is 2+2?"}], "completion": [{"role": "assistant", "content": "4"}], "reward": 1.0, "metrics": {"accuracy": 1.0}, "task": {"question": "What is 2+2?", "answer": "4"}, "info": {}, "timing": {"generation_ms": 123, "scoring_ms": 45} } ``` ### metadata.json Format ```json theme={null} { "env_id": "gsm8k", "model": "openai/gpt-4.1-mini", "num_examples": 10, "rollouts_per_example": 3, "avg_reward": 0.867, "avg_metrics": {"accuracy": 0.867}, "time_ms": 12345, "sampling_args": {"max_tokens": 512, "temperature": 0.7}, "env_args": {}, "date": "2026-03-03", "time": "14:23:45" } ``` ## Configuration Precedence ### CLI Mode 1. CLI flags 2. Environment defaults (from `pyproject.toml`) 3. Built-in defaults ### TOML Config Mode 1. Per-eval settings (`[[eval]]` sections) 2. Global settings (top of config file) 3. Environment defaults (from `pyproject.toml`) 4. Built-in defaults ## Environment Defaults Environments can specify defaults in `pyproject.toml`: ```toml theme={null} [tool.verifiers.eval] num_examples = 100 rollouts_per_example = 5 ``` These are used when higher-priority sources don't specify a value. ## Resuming Evaluations Long evaluations can be resumed: ```bash theme={null} # Start with checkpointing prime eval run gsm8k -n 1000 -s # Interrupt with Ctrl+C # Resume automatically (finds latest incomplete run) prime eval run gsm8k -n 1000 -s --resume ``` Resume requirements: * Same `env_id`, `model`, and `rollouts_per_example` * `num_examples` must be >= original target * Results directory must contain valid `results.jsonl` and `metadata.json` ## Troubleshooting ### Environment Not Found ``` Missing environments. Install them first: prime env install gsm8k ``` **Solution:** ```bash theme={null} prime env install gsm8k ``` ### API Key Not Set ``` Error: Environment variable 'PRIME_API_KEY' not set ``` **Solution:** ```bash theme={null} export PRIME_API_KEY="your-api-key" ``` ### Rate Limit Errors Reduce concurrency: ```bash theme={null} prime eval run gsm8k -n 100 -c 8 ``` Or add retries: ```bash theme={null} prime eval run gsm8k -n 100 --max-retries 3 ``` ### Out of Memory For large evaluations, enable checkpointing and reduce batch size: ```bash theme={null} prime eval run gsm8k -n 1000 -s -c 16 ```