# JudgeRubric

> LLM-as-judge scoring rubric for evaluating model responses

## Overview

`JudgeRubric` extends `Rubric` to provide LLM-as-judge scoring. It uses a language model to evaluate whether responses are correct by comparing them against ground truth answers.

## Constructor

```python theme={null}
JudgeRubric(
    parser: Parser | None = None,
    parallelize_scoring: bool = False,
    judge_client: AsyncOpenAI | None = None,
    judge_model: str = "gpt-4.1-nano",
    judge_sampling_args: dict[str, Any] | None = None,
    judge_prompt: str = DEFAULT_JUDGE_PROMPT,
)
```

<ParamField path="parser" type="Parser | None" default="None">
  Parser for extracting answers from completions. Defaults to `vf.Parser()`.
</ParamField>

<ParamField path="parallelize_scoring" type="bool" default="False">
  Whether to parallelize judge API calls across multiple rollouts.
</ParamField>

<ParamField path="judge_client" type="AsyncOpenAI | None" default="None">
  OpenAI client for judge model calls. Defaults to `AsyncOpenAI()`, which reads the API key from the environment (`OPENAI_API_KEY`).
</ParamField>

<ParamField path="judge_model" type="str" default="gpt-4.1-nano">
  Model identifier for the judge. Can be any OpenAI-compatible model.
</ParamField>

<ParamField path="judge_sampling_args" type="dict[str, Any] | None" default="None">
  Additional sampling parameters for judge completions (e.g., `temperature`, `max_tokens`).
</ParamField>

<ParamField path="judge_prompt" type="str" default="DEFAULT_JUDGE_PROMPT">
  Template for judge prompts. Must include `{question}`, `{answer}`, and `{response}` placeholders.
</ParamField>
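Because a custom `judge_prompt` must carry all three placeholders, it can help to check a template before passing it to the constructor. The following is a minimal sketch using only the standard library; the helper `missing_placeholders` and the constant `REQUIRED_FIELDS` are hypothetical names for illustration and are not part of `verifiers`.

```python
from string import Formatter

# The three placeholder names the judge_prompt parameter requires.
REQUIRED_FIELDS = {"question", "answer", "response"}


def missing_placeholders(template: str) -> set[str]:
    """Return the required placeholder names that `template` does not contain."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None when a chunk is pure literal text.
    found = {field for _, field, _, _ in Formatter().parse(template) if field}
    return REQUIRED_FIELDS - found


good = "Q: {question}\nExpected: {answer}\nGot: {response}"
bad = "Q: {question}\nGot: {response}"

print(missing_placeholders(good))  # set()
print(missing_placeholders(bad))   # {'answer'}
```

An empty result means the template names every required field.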

## Default Judge Prompt

The default prompt template is:

````python theme={null}
DEFAULT_JUDGE_PROMPT = """Given a ground truth answer \
and a response, determine if the response is correct.

Question:
```
{question}
```

Ground truth answer:
```
{answer}
```

Response:
```
{response}
```

Respond either "yes" or "no" only."""
````

## Methods

### judge

```python theme={null}
async def judge(
    self,
    prompt: Messages,
    completion: Messages,
    answer: str,
    state: State | None = None,
) -> str
```

Calls the judge model to evaluate a response. If `state` is provided, the result is cached in `state["judge_response"]`.

<ParamField path="prompt" type="Messages">
  The input prompt (either a string or a list of message dicts).
</ParamField>

<ParamField path="completion" type="Messages">
  The model's completion to evaluate.
</ParamField>

<ParamField path="answer" type="str">
  Ground truth answer for comparison.
</ParamField>

<ParamField path="state" type="State | None" default="None">
  Optional state dict for caching judge responses.
</ParamField>

**Returns**: `str` - The judge model's response (typically "yes" or "no").

<Note>
  Judge responses are cached by prompt to avoid redundant API calls for the same evaluation.
</Note>

## Inherited Methods

All methods from `Rubric` are available:

* `add_reward_func(func, weight=1.0)`
* `add_metric(func, weight=0.0)`
* `score_rollout(state)`
* `score_group(states)`

See the [Rubric](/api/rubric) documentation for details.

## Class Objects

The following objects are automatically available to reward functions:

<ParamField path="judge" type="callable">
  The `judge()` method, callable as `judge(prompt, completion, answer, state=None)`.
</ParamField>

<ParamField path="judge_client" type="AsyncOpenAI">
  The OpenAI client instance.
</ParamField>

<ParamField path="judge_model" type="str">
  The judge model identifier.
</ParamField>

<ParamField path="judge_prompt" type="str">
  The judge prompt template.
</ParamField>

<ParamField path="judge_sampling_args" type="dict">
  Sampling arguments for judge calls.
</ParamField>

<ParamField path="parser" type="Parser">
  The parser instance.
</ParamField>
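One way to picture how these objects reach a reward function is name-based injection: the caller inspects the function's signature and passes only the arguments it names. The sketch below illustrates that idea with the standard library. `call_with_available` is a hypothetical helper, and the sketch makes no claim about how `verifiers` implements this internally.

```python
import inspect
from typing import Any, Callable


def call_with_available(func: Callable[..., Any], available: dict[str, Any]) -> Any:
    """Call `func`, passing only the objects whose names it asks for."""
    params = inspect.signature(func).parameters
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_kwargs:
        # A **kwargs parameter can absorb every available object.
        return func(**available)
    return func(**{name: value for name, value in available.items() if name in params})


def uses_model(judge_model, answer):
    return f"{judge_model} checks {answer!r}"


def uses_everything(completion, **kwargs):
    return sorted(kwargs)


available = {"judge_model": "gpt-4.1-nano", "answer": "Paris", "completion": "Paris"}

print(call_with_available(uses_model, available))       # gpt-4.1-nano checks 'Paris'
print(call_with_available(uses_everything, available))  # ['answer', 'judge_model']
```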

## Example Usage

### Basic Judge Scoring

```python theme={null}
import asyncio

import verifiers as vf
from openai import AsyncOpenAI

# Create judge rubric with custom model
rubric = vf.JudgeRubric(
    judge_client=AsyncOpenAI(api_key="sk-..."),
    judge_model="gpt-4o-mini",
    judge_sampling_args={"temperature": 0.0}
)

# Add custom reward function using the judge
async def judge_correctness(prompt, completion, answer, judge, state, **kwargs):
    """Use judge to determine correctness."""
    response = await judge(prompt, completion, answer, state)
    return 1.0 if "yes" in response.lower() else 0.0

rubric.add_reward_func(judge_correctness)

# Score a state
state = {
    "prompt": "What is the capital of France?",
    "completion": [{"role": "assistant", "content": "Paris"}],
    "answer": "Paris",
    "task": "qa",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}

async def main():
    # score_rollout is a coroutine, so it must run inside an event loop
    await rubric.score_rollout(state)
    print(f"Reward: {state['reward']}")  # 1.0 if judge says "yes"

asyncio.run(main())
```

### Custom Judge Prompt

```python theme={null}
custom_prompt = """Evaluate if the response correctly answers the question.

Question: {question}
Expected: {answer}
Got: {response}

Reply with CORRECT or INCORRECT."""

rubric = vf.JudgeRubric(
    judge_model="gpt-4o",
    judge_prompt=custom_prompt,
    judge_sampling_args={
        "temperature": 0.0,
        "max_tokens": 10
    }
)
```
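A prompt that asks for `CORRECT` or `INCORRECT` needs a matching reward function, and the parsing has one trap: `CORRECT` is a substring of `INCORRECT`, so a plain substring check would score both replies as correct. The sketch below shows one way to parse the verdict; `parse_verdict` and `judge_verdict` are hypothetical names, and treating unparseable replies as incorrect is an assumption.

```python
import re


def parse_verdict(judge_response: str) -> float:
    """Map a CORRECT/INCORRECT reply to 1.0/0.0."""
    # Match whole words only, so "INCORRECT" is never read as "CORRECT".
    match = re.search(r"\b(INCORRECT|CORRECT)\b", judge_response.upper())
    if match is None:
        return 0.0  # assumption: unparseable replies count as incorrect
    return 1.0 if match.group(1) == "CORRECT" else 0.0


# Reward function in the same style as the other examples on this page.
async def judge_verdict(judge, prompt, completion, answer, state, **kwargs):
    return parse_verdict(await judge(prompt, completion, answer, state))


print(parse_verdict("CORRECT"))     # 1.0
print(parse_verdict("Incorrect."))  # 0.0
```

`judge_verdict` can then be registered with `rubric.add_reward_func(judge_verdict)`.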

### Using Judge in Reward Functions

```python theme={null}
rubric = vf.JudgeRubric(judge_model="gpt-4o-mini")

# Access judge as a class object
async def strict_correctness(judge, prompt, completion, answer, state, **kwargs):
    """Strict yes/no scoring."""
    result = await judge(prompt, completion, answer, state)
    return 1.0 if result.strip().lower() == "yes" else 0.0

async def partial_credit(judge, prompt, completion, answer, state, **kwargs):
    """Partial credit; only meaningful if the judge prompt allows a "partially" reply."""
    result = await judge(prompt, completion, answer, state)
    if "yes" in result.lower():
        return 1.0
    elif "partially" in result.lower():
        return 0.5
    return 0.0

rubric.add_reward_func(strict_correctness)
rubric.add_metric(partial_credit, weight=0.0)  # Track but don't use for reward
```
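The effect of `weight=0.0` can be seen with a little arithmetic. The sketch below assumes the final reward is a weighted sum of the individual function scores. The helper `combine` and the score values are made up for illustration.

```python
def combine(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-function scores (assumed aggregation rule)."""
    return sum(weights[name] * score for name, score in scores.items())


# Hypothetical scores for one rollout.
scores = {"strict_correctness": 1.0, "partial_credit": 0.5}
weights = {"strict_correctness": 1.0, "partial_credit": 0.0}

reward = combine(scores, weights)
print(reward)  # 1.0
```

Under that assumption, the zero-weight metric is still computed and recorded, but it contributes nothing to the reward.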

### Error Handling

```python theme={null}
import verifiers as vf

rubric = vf.JudgeRubric(
    judge_model="gpt-4o",
    judge_sampling_args={
        "timeout": 30.0,  # 30 second timeout
    }
)

# `state` is a rollout state dict like the one in Basic Judge Scoring.
# This example catches RuntimeError and inspects its message, so no
# openai exception types need to be imported.
async def score_with_diagnostics(state):
    try:
        await rubric.score_rollout(state)
    except RuntimeError as e:
        if "rate limit" in str(e).lower():
            print("Reduce concurrency or wait before retrying")
        elif "timeout" in str(e).lower():
            print("Increase timeout in judge_sampling_args")
        raise
```
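Waiting before retrying can be automated with exponential backoff. The sketch below is a generic retry wrapper built on `asyncio`. `with_retries` is a hypothetical helper, and it assumes, as the example above does, that failures surface as `RuntimeError`.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    make_call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Retry `make_call` on RuntimeError, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")
```

Inside an async function this could be used as `await with_retries(lambda: rubric.score_rollout(state))`. The callable is re-invoked on every attempt, so each retry creates a fresh coroutine.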

## Notes

<Warning>
  Judge API calls can be slow and expensive. Consider:

  * Using cheaper/faster models like `gpt-4.1-nano` for high-throughput evaluations
  * Caching judge responses by passing the `state` parameter
  * Setting appropriate timeouts in `judge_sampling_args`
</Warning>
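When many rollouts are scored at once, capping the number of in-flight judge calls is a common way to stay under rate limits. The sketch below bounds concurrency with `asyncio.Semaphore`. `score_all` and its `score_one` argument are hypothetical; in practice `score_one` might wrap `rubric.score_rollout`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def score_all(
    states: Sequence[Any],
    score_one: Callable[[Any], Awaitable[Any]],
    max_concurrent: int = 4,
) -> list[Any]:
    """Run `score_one` on every state with at most `max_concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(state: Any) -> Any:
        async with semaphore:  # wait here while the limit is reached
            return await score_one(state)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(bounded(s) for s in states))
```

A lower `max_concurrent` trades throughput for fewer rate-limit errors.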

<Info>
  The `max_tokens` parameter is automatically converted to `max_completion_tokens` for compatibility with OpenAI's chat API.
</Info>
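The conversion amounts to renaming one key before the request is sent. The sketch below shows the idea. `normalize_sampling_args` is a hypothetical helper, and keeping an explicitly set `max_completion_tokens` is an assumption rather than documented library behavior.

```python
from typing import Any, Optional


def normalize_sampling_args(args: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Return a copy of `args` with `max_tokens` renamed to `max_completion_tokens`."""
    normalized = dict(args or {})
    if "max_tokens" in normalized:
        value = normalized.pop("max_tokens")
        # Assumption: an explicit max_completion_tokens takes precedence.
        normalized.setdefault("max_completion_tokens", value)
    return normalized


print(normalize_sampling_args({"temperature": 0.0, "max_tokens": 10}))
# {'temperature': 0.0, 'max_completion_tokens': 10}
```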

## See Also

* [Rubric](/api/rubric) - Base rubric class
* [MathRubric](/api/math-rubric) - Specialized math scoring
* [Parser](/api/parser) - Answer extraction
