# JudgeRubric

> LLM-as-judge scoring rubric for evaluating model responses

## Overview

`JudgeRubric` extends `Rubric` to provide LLM-as-judge scoring. It uses a language model to evaluate whether responses are correct by comparing them against ground truth answers.

## Constructor

```python
JudgeRubric(
    parser: Parser | None = None,
    parallelize_scoring: bool = False,
    judge_client: AsyncOpenAI | None = None,
    judge_model: str = "gpt-4.1-nano",
    judge_sampling_args: dict[str, Any] | None = None,
    judge_prompt: str = DEFAULT_JUDGE_PROMPT,
)
```

* `parser`: Parser for extracting answers from completions. Defaults to `vf.Parser()`.
* `parallelize_scoring`: Whether to parallelize judge API calls across multiple rollouts.
* `judge_client`: OpenAI client for judge model calls. Defaults to `AsyncOpenAI()` with environment API key.
* `judge_model`: Model identifier for the judge. Can be any OpenAI-compatible model.
* `judge_sampling_args`: Additional sampling parameters for judge completions (e.g., `temperature`, `max_tokens`).
* `judge_prompt`: Template for judge prompts. Must include `{question}`, `{answer}`, and `{response}` placeholders.

## Default Judge Prompt

The default prompt template is:

````python
DEFAULT_JUDGE_PROMPT = """Given a ground truth answer \
and a response, determine if the response is correct.

Question:
```
{question}
```

Ground truth answer:
```
{answer}
```

Response:
```
{response}
```

Respond either "yes" or "no" only."""
````

## Methods

### judge

```python
async def judge(
    self,
    prompt: Messages,
    completion: Messages,
    answer: str,
    state: State | None = None,
) -> str
```

Call the judge model to evaluate a response. Caches results in `state["judge_response"]` if state is provided.

* `prompt`: The input prompt (either string or list of message dicts).
* `completion`: The model's completion to evaluate.
* `answer`: Ground truth answer for comparison.
* `state`: Optional state dict for caching judge responses.

**Returns**: `str` - The judge model's response (typically "yes" or "no").

Judge responses are cached by prompt to avoid redundant API calls for the same evaluation.

## Inherited Methods

All methods from `Rubric` are available:

* `add_reward_func(func, weight=1.0)`
* `add_metric(func, weight=0.0)`
* `score_rollout(state)`
* `score_group(states)`

See the [Rubric](/api/rubric) documentation for details.

## Class Objects

The following objects are automatically available to reward functions:

* `judge`: The `judge()` method, callable as `judge(prompt, completion, answer, state=None)`.
* `judge_client`: The OpenAI client instance.
* `judge_model`: The judge model identifier.
* `judge_prompt`: The judge prompt template.
* `judge_sampling_args`: Sampling arguments for judge calls.
* `parser`: The parser instance.

## Example Usage

### Basic Judge Scoring

```python
import verifiers as vf
from openai import AsyncOpenAI

# Create judge rubric with custom model
rubric = vf.JudgeRubric(
    judge_client=AsyncOpenAI(api_key="sk-..."),
    judge_model="gpt-4o-mini",
    judge_sampling_args={"temperature": 0.0}
)

# Add custom reward function using the judge
async def judge_correctness(prompt, completion, answer, judge, state, **kwargs):
    """Use judge to determine correctness."""
    response = await judge(prompt, completion, answer, state)
    return 1.0 if "yes" in response.lower() else 0.0

rubric.add_reward_func(judge_correctness)

# Score a state
state = {
    "prompt": "What is the capital of France?",
    "completion": [{"role": "assistant", "content": "Paris"}],
    "answer": "Paris",
    "task": "qa",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}

await rubric.score_rollout(state)
print(f"Reward: {state['reward']}")  # 1.0 if judge says "yes"
```

### Custom Judge Prompt

```python
custom_prompt = """Evaluate if the response correctly answers the question.

Question: {question}
Expected: {answer}
Got: {response}

Reply with CORRECT or INCORRECT."""

rubric = vf.JudgeRubric(
    judge_model="gpt-4o",
    judge_prompt=custom_prompt,
    judge_sampling_args={
        "temperature": 0.0,
        "max_tokens": 10
    }
)
```

### Using Judge in Reward Functions

```python
rubric = vf.JudgeRubric(judge_model="gpt-4o-mini")

# Access judge as a class object
async def strict_correctness(judge, prompt, completion, answer, state, **kwargs):
    """Strict yes/no scoring."""
    result = await judge(prompt, completion, answer, state)
    return 1.0 if result.strip().lower() == "yes" else 0.0

async def partial_credit(judge, prompt, completion, answer, state, **kwargs):
    """Partial credit based on judge confidence."""
    result = await judge(prompt, completion, answer, state)
    if "yes" in result.lower():
        return 1.0
    elif "partially" in result.lower():
        return 0.5
    return 0.0

rubric.add_reward_func(strict_correctness)
rubric.add_metric(partial_credit, weight=0.0)  # Track but don't use for reward
```

### Error Handling

```python
from openai import RateLimitError, APITimeoutError

rubric = vf.JudgeRubric(
    judge_model="gpt-4o",
    judge_sampling_args={
        "timeout": 30.0,  # 30 second timeout
    }
)

try:
    await rubric.score_rollout(state)
except RuntimeError as e:
    if "rate limit" in str(e).lower():
        print("Reduce concurrency or wait before retrying")
    elif "timeout" in str(e).lower():
        print("Increase timeout in judge_sampling_args")
    raise
```

## Notes

Judge API calls can be slow and expensive. Consider:

* Using cheaper/faster models like `gpt-4.1-nano` for high-throughput evaluations
* Caching judge responses by passing `state` parameter
* Setting appropriate timeouts in `judge_sampling_args`

The `max_tokens` parameter is automatically converted to `max_completion_tokens` for compatibility with OpenAI's chat API.

## See Also

* [Rubric](/api/rubric) - Base rubric class
* [MathRubric](/api/math-rubric) - Specialized math scoring
* [Parser](/api/parser) - Answer extraction
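
The placeholder requirement on `judge_prompt` and the yes/no parsing used in the reward functions on this page can both be checked locally, before any judge call is made. The following is a minimal sketch using only the Python standard library. `missing_placeholders` and `parse_verdict` are hypothetical helper names chosen for illustration and are not part of `verifiers`; the strict-match rule in `parse_verdict` is one possible design, assumed here rather than taken from the library.

```python
from string import Formatter

# Placeholder names that the JudgeRubric docs say a judge prompt must include.
REQUIRED_FIELDS = {"question", "answer", "response"}


def missing_placeholders(template: str) -> set[str]:
    """Return the required placeholder names that the template lacks.

    Hypothetical helper: Formatter().parse yields one tuple per chunk of the
    template, with the field name in position 1 (None for plain text).
    """
    found = {name for _, name, _, _ in Formatter().parse(template) if name}
    return REQUIRED_FIELDS - found


def parse_verdict(judge_text: str) -> float:
    """Map a judge reply to 1.0 only when the whole reply is 'yes'.

    Hypothetical helper: surrounding whitespace, quotes, and a trailing
    period are ignored, but substring matches such as 'eyes' score 0.0.
    """
    cleaned = judge_text.strip().strip("\"'.").lower()
    return 1.0 if cleaned == "yes" else 0.0


# Usage: this template omits {response}, so the check reports it.
incomplete_prompt = "Question: {question}\nExpected: {answer}\nReply yes or no."
print(missing_placeholders(incomplete_prompt))  # {'response'}
print(parse_verdict(" Yes.\n"))  # 1.0
```

Matching the whole reply is stricter than the `"yes" in response.lower()` check in the basic example, which also accepts replies that merely contain those three letters. Which behavior is appropriate depends on how tightly the judge prompt constrains the reply.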