> ## Documentation Index
> Fetch the complete documentation index at: https://mintlify.com/primeintellect-ai/verifiers/llms.txt
> Use this file to discover all available pages before exploring further.

# Rubrics

> Reward functions and scoring system for RL training

## Overview

Rubrics manage the scoring logic for rollouts, combining multiple reward functions into a final reward signal. Each rubric holds reward functions, computes weighted combinations, and tracks metrics for observability.

```python theme={null}
import verifiers as vf

async def correct_answer(completion, answer) -> float:
    response = completion[-1]["content"]
    return 1.0 if answer in response else 0.0

rubric = vf.Rubric(funcs=[correct_answer])
```

## Basic Reward Functions

Reward functions evaluate rollouts and return floats (typically 0.0 to 1.0). They request data by naming arguments:

```python theme={null}
async def exact_match(completion, answer) -> float:
    """Check if answer appears in completion."""
    response = completion[-1]["content"]
    return 1.0 if answer in response else 0.0

async def length_penalty(completion) -> float:
    """Penalize overly long responses."""
    response = completion[-1]["content"]
    return 1.0 if len(response) < 500 else 0.5

rubric = vf.Rubric(
    funcs=[exact_match, length_penalty],
    weights=[1.0, 0.1]  # exact_match weighted 10x more
)
```

### Available Arguments

Reward functions can request these standard arguments:

| Argument     | Type          | Description               |
| ------------ | ------------- | ------------------------- |
| `completion` | `Messages`    | Model's output messages   |
| `prompt`     | `Messages`    | Input messages            |
| `answer`     | `str`         | Ground truth from dataset |
| `info`       | `Info` (dict) | Metadata from dataset     |
| `state`      | `State`       | Full rollout state        |
| `task`       | `str`         | Task identifier           |

**Type signatures:**

```python theme={null}
from verifiers.types import Messages, Info, State

Messages = list[Message]  # list of chat messages
Info = dict[str, Any]
State = dict  # with additional input forwarding
```

### Argument Injection Pattern

The rubric uses introspection to inject only requested arguments:

```python theme={null}
# Only receives what it asks for
async def simple_reward(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

# Can request all available data
async def complex_reward(prompt, completion, answer, info, state) -> float:
    difficulty = info.get("difficulty", 1)
    tokens_used = state.get("usage", {}).get("total_tokens", 0)
    correct = answer in completion[-1]["content"]
    return float(correct) * (1.0 / difficulty) * (1.0 if tokens_used < 1000 else 0.5)

# Use **kwargs to accept everything
async def flexible_reward(**kwargs) -> float:
    completion = kwargs["completion"]
    answer = kwargs.get("answer", "")
    return 1.0 if answer in completion[-1]["content"] else 0.0
```

## Multiple Reward Functions

Combine reward functions with custom weights:

```python theme={null}
async def correctness(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

async def formatting(completion, parser) -> float:
    try:
        parser.parse(completion[-1]["content"])
        return 1.0
    except:
        return 0.0

async def conciseness(completion) -> float:
    length = len(completion[-1]["content"])
    return 1.0 if length < 200 else 0.5

rubric = vf.Rubric(
    funcs=[correctness, formatting, conciseness],
    weights=[1.0, 0.5, 0.1]
)
```

**Final reward computation:**

```python theme={null}
reward = (correctness * 1.0) + (formatting * 0.5) + (conciseness * 0.1)
```

### Adding Functions Dynamically

```python theme={null}
rubric = vf.Rubric()
rubric.add_reward_func(correctness, weight=1.0)
rubric.add_reward_func(formatting, weight=0.5)
```

## Execution Order and State Sharing

Reward functions execute sequentially in the order they're added. Since `state` is mutable, earlier functions can store computed values for later functions:

```python theme={null}
async def compute_similarity(completion, answer, state) -> float:
    """Compute and cache similarity score."""
    response = completion[-1]["content"]
    score = compute_embedding_similarity(response, answer)  # expensive
    state["similarity"] = score  # cache for other functions
    return score

async def similarity_threshold(state) -> float:
    """Use cached similarity without recomputing."""
    return 1.0 if state["similarity"] > 0.8 else 0.0

rubric = vf.Rubric(
    funcs=[compute_similarity, similarity_threshold],
    weights=[0.0, 1.0]  # log similarity (weight=0), reward threshold (weight=1)
)
```

**Execution flow:**

1. `compute_similarity` runs, stores `state["similarity"]`
2. `similarity_threshold` runs, reads cached value
3. Final reward = `0.0 * similarity + 1.0 * threshold`

## Group-Based Reward Functions

During evaluation and RL training, rollouts are organized into **groups** by `example_id`. Group reward functions operate on all rollouts for an example together:

```python theme={null}
async def diversity_bonus(completions) -> list[float]:
    """Reward unique responses within a group."""
    responses = [c[-1]["content"] for c in completions]
    unique = set(responses)
    return [0.2 if responses.count(r) == 1 else 0.0 for r in responses]

async def individual_correctness(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

rubric = vf.Rubric(
    funcs=[individual_correctness, diversity_bonus],
    weights=[1.0, 0.5]
)
```

### Detection

Group functions are detected by:

1. **Plural argument names**: `completions`, `prompts`, `answers`, `states`, `tasks`, `infos`
2. **Return type**: `list[float]` instead of `float`

### Available Group Arguments

| Argument      | Type             | Description              |
| ------------- | ---------------- | ------------------------ |
| `completions` | `list[Messages]` | All completions in group |
| `prompts`     | `list[Messages]` | All prompts in group     |
| `answers`     | `list[str]`      | All answers in group     |
| `states`      | `list[State]`    | All states in group      |
| `tasks`       | `list[str]`      | All task IDs in group    |
| `infos`       | `list[Info]`     | All info dicts in group  |

### Example: Relative Ranking

```python theme={null}
async def rank_by_length(completions) -> list[float]:
    """Reward shorter completions more within a group."""
    lengths = [len(c[-1]["content"]) for c in completions]
    max_len = max(lengths) if lengths else 1
    return [1.0 - (length / max_len) for length in lengths]
```

## Shared Objects

Rubrics can provide shared objects accessible to all reward functions via `class_objects`:

```python theme={null}
rubric = vf.Rubric(funcs=[my_reward_func])
rubric.add_class_object("my_helper", some_helper_object)

async def my_reward_func(completion, my_helper) -> float:
    # my_helper is injected by name
    return await my_helper.score(completion)
```

### Parsers

Parsers extract structured content from model responses:

```python theme={null}
parser = vf.XMLParser(fields=["reasoning", "answer"])
rubric = vf.Rubric(funcs=[my_reward_func], parser=parser)

async def my_reward_func(completion, answer, parser) -> float:
    parsed = parser.parse(completion[-1]["content"])
    return 1.0 if parsed.answer == answer else 0.0
```

**Built-in parsers:**

* `vf.Parser()` - Pass-through (no parsing)
* `vf.XMLParser(fields=[...])` - Extract XML tags
* `vf.ThinkParser()` - Extract content after `</think>`
* `vf.MaybeThinkParser()` - Handle optional `<think>` tags

### Judges (LLM-as-Judge)

`JudgeRubric` integrates LLM-based evaluation:

```python theme={null}
judge_rubric = vf.JudgeRubric(
    judge_model="gpt-4.1-mini",
    judge_prompt="""Is this answer correct?
Question: {question}
Answer: {response}
Ground Truth: {answer}

Respond with YES or NO."""
)

async def judge_correctness(prompt, completion, answer, judge) -> float:
    question = prompt[0]["content"]
    response = completion[-1]["content"]
    verdict = await judge(prompt, completion, answer)
    return 1.0 if "yes" in verdict.lower() else 0.0

judge_rubric.add_reward_func(judge_correctness)
```

**Built-in judge callable:**

```python theme={null}
# Signature
async def judge(
    prompt: Messages,
    completion: Messages,
    answer: str
) -> str:
    ...
```

**Exposed objects:**

* `judge` - Callable that formats prompt and calls judge model
* `judge_client` - Raw `AsyncOpenAI` client
* `judge_model` - Model name string
* `judge_prompt` - Template string
* `judge_sampling_args` - Sampling parameters dict

### Custom Shared Objects

Add domain-specific helpers:

```python theme={null}
class MathVerifier:
    async def verify(self, expression: str, expected: str) -> bool:
        # Symbolic math verification
        ...

verifier = MathVerifier()
rubric = vf.Rubric(funcs=[verify_answer])
rubric.add_class_object("verifier", verifier)

async def verify_answer(completion, answer, verifier) -> float:
    response = completion[-1]["content"]
    is_correct = await verifier.verify(response, answer)
    return 1.0 if is_correct else 0.0
```

## Rubric Groups

Combine multiple rubrics for heterogeneous scoring:

```python theme={null}
math_rubric = vf.MathRubric()  # symbolic math verification
judge_rubric = vf.JudgeRubric(judge_model="gpt-4.1-mini")
judge_rubric.add_reward_func(judge_correctness, weight=0.5)

combined = vf.RubricGroup(rubrics=[math_rubric, judge_rubric])
```

**Behavior:**

* All rubrics execute in parallel
* Final reward = sum of all rubric rewards
* Metrics from all rubrics are collected together

**Use cases:**

* Combining deterministic and LLM-based evaluation
* Multi-faceted scoring (correctness + style + efficiency)
* Environment-specific monitors + task-specific rewards

## Metrics and Monitor Rubrics

### Adding Metrics

Metrics are reward functions with `weight=0.0` (tracked but don't affect reward):

```python theme={null}
async def response_length(completion) -> float:
    return float(len(completion[-1]["content"]))

async def token_count(state) -> float:
    return float(state.get("usage", {}).get("total_tokens", 0))

rubric = vf.Rubric(funcs=[correctness])
rubric.add_metric(response_length)  # shorthand for weight=0
rubric.add_metric(token_count)
```

### Monitor Rubrics

Environments automatically include monitor rubrics for tracking metrics:

| Environment    | Tracked Metrics                                             |
| -------------- | ----------------------------------------------------------- |
| `MultiTurnEnv` | `num_turns`                                                 |
| `ToolEnv`      | `total_tool_calls`, per-tool counts                         |
| `SandboxEnv`   | `sandbox_ready_wait_time`, `sandbox_command_execution_time` |
| `PythonEnv`    | `python_ready_wait_time`                                    |

**Example monitor rubric:**

```python theme={null}
class MultiTurnMonitorRubric(vf.Rubric):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.add_metric(self.num_turns)
    
    async def num_turns(self, state: vf.State) -> int:
        return len(state["trajectory"])
```

### Custom Monitor Rubrics

Add environment-specific metrics:

```python theme={null}
class MyMonitorRubric(vf.Rubric):
    def __init__(self):
        super().__init__()
        self.add_metric(self.trajectory_length)
        self.add_metric(self.error_count)
    
    async def trajectory_length(self, state: vf.State) -> float:
        return float(len(state["trajectory"]))
    
    async def error_count(self, state: vf.State) -> float:
        errors = sum(1 for step in state["trajectory"] if step.get("error"))
        return float(errors)

env = vf.ToolEnv(dataset=dataset, tools=tools, rubric=rubric)
env.add_rubric(MyMonitorRubric())
```

## Built-in Rubrics

### MathRubric

Symbolic math verification using `math-verify`:

```python theme={null}
math_rubric = vf.MathRubric()

# Automatically includes:
# - Parser for \boxed{} answers
# - Symbolic equivalence checking
# - Normalization of mathematical expressions
```

**Usage:**

```python theme={null}
env = vf.SingleTurnEnv(
    dataset=math_dataset,
    rubric=math_rubric
)
```

### JudgeRubric

LLM-as-judge evaluation:

```python theme={null}
judge_rubric = vf.JudgeRubric(
    judge_model="gpt-4.1-mini",
    judge_prompt="""Evaluate if the response correctly answers the question.

Question: {question}
Response: {response}
Ground Truth: {answer}

Answer YES or NO.""",
    judge_sampling_args={"temperature": 0.0}
)

async def judge_score(prompt, completion, answer, judge) -> float:
    verdict = await judge(prompt, completion, answer)
    return 1.0 if "yes" in verdict.lower() else 0.0

judge_rubric.add_reward_func(judge_score)
```

## Scoring Lifecycle

### Individual Scoring

For rollouts scored independently:

```python theme={null}
# Called automatically after each rollout
state = await env.rollout(input, client, model, sampling_args)
await rubric.score_rollout(state)

# Sets state["reward"], state["metrics"]
```

### Group Scoring

For rollouts scored together (default for `evaluate()` and training):

```python theme={null}
# Generate group of rollouts
states = await asyncio.gather(*[
    env.rollout(input, client, model, sampling_args)
    for input in group_inputs
])

# Score group together
await rubric.score_group(states)

# Sets state["reward"], state["advantage"], state["metrics"] for all states
```

**Advantage computation:**

```python theme={null}
avg_reward = sum(state["reward"] for state in states) / len(states)
for state in states:
    state["advantage"] = state["reward"] - avg_reward
```

### Disabling Scoring

For pure generation without scoring:

```python theme={null}
env = vf.SingleTurnEnv(dataset=dataset, rubric=rubric, score_rollouts=False)

# Or dynamically:
env.set_score_rollouts(False)
```

## RolloutScore Type

Rubrics produce `RolloutScore` objects:

```python theme={null}
from verifiers.types import RolloutScore

class RolloutScore(TypedDict):
    reward: float
    metrics: dict[str, float]

# Example:
result = RolloutScore(
    reward=0.85,
    metrics={
        "correctness": 1.0,
        "formatting": 0.8,
        "conciseness": 0.5,
        "response_length": 342.0
    }
)
```

## Complete Example

```python theme={null}
import verifiers as vf
from datasets import Dataset

# Dataset
dataset = Dataset.from_list([
    {
        "question": "What is 2+2?",
        "answer": "4",
        "info": {"difficulty": 1, "category": "arithmetic"}
    },
    {
        "question": "What is the derivative of x^2?",
        "answer": "2x",
        "info": {"difficulty": 3, "category": "calculus"}
    },
])

# Reward functions
async def correctness(completion, answer, parser) -> float:
    parsed = parser.parse_answer(completion)
    return 1.0 if parsed == answer else 0.0

async def difficulty_bonus(state, info) -> float:
    """Bonus for harder problems."""
    if state["reward"] == 1.0:  # only if correct
        return info.get("difficulty", 1) * 0.1
    return 0.0

async def response_length(completion) -> float:
    return float(len(completion[-1]["content"]))

# Parser
parser = vf.XMLParser(fields=["reasoning", "answer"])

# Rubric
rubric = vf.Rubric(
    funcs=[correctness, difficulty_bonus],
    weights=[1.0, 1.0],
    parser=parser
)
rubric.add_metric(response_length)

# Environment
env = vf.SingleTurnEnv(
    dataset=dataset,
    parser=parser,
    rubric=rubric,
    system_prompt="Answer step by step in XML format."
)
```

<Note>
  When combining multiple reward functions, ensure weights are tuned to avoid any single function dominating the reward signal. Common practice is to normalize weights or use coefficients \< 1.0 for auxiliary rewards.
</Note>
