# Datasets

> How datasets work with HuggingFace integration and lazy loading

## Overview

Verifiers uses the HuggingFace `datasets` library for loading and manipulating training and evaluation data. Each dataset row becomes a **rollout** during generation or evaluation.

```python
from datasets import Dataset
import verifiers as vf

dataset = Dataset.from_list([
    {"prompt": [{"role": "user", "content": "What is 2+2?"}], "answer": "4"},
    {"prompt": [{"role": "user", "content": "What is 3*5?"}], "answer": "15"},
])

# `rubric` is assumed to be a vf.Rubric defined elsewhere
env = vf.SingleTurnEnv(dataset=dataset, rubric=rubric)
```

## Dataset Schema

### Required Columns

During initialization, the environment processes each dataset so that every row includes:

* `example_id` - Integer ID for grouping rollouts (auto-generated if missing)
* `prompt` - Messages to send to the model (list of chat messages)
* `task` - Task identifier for routing in `EnvGroup` (defaults to `env_id`)

### Optional Columns

* `answer` - Ground truth for scoring (string)
* `info` - Structured metadata (dict or JSON string)

```python
dataset = Dataset.from_list([
    {
        "prompt": [{"role": "user", "content": "Solve: 2x + 3 = 7"}],
        "answer": "x = 2",
        "info": '{"difficulty": 3, "topic": "algebra"}'
    },
])
```

<Note>
  When using `info`, prefer JSON strings if rows have different schemas (different fields or nested structures), because a `datasets` column must share one schema across all rows. The environment automatically parses JSON strings into dicts during rollout initialization.
</Note>
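
The parsing step described in the note can be pictured with a minimal sketch. `normalize_info` is a hypothetical helper written for illustration, not the library's actual function, and it assumes `info` is either a dict or a JSON-encoded object:

```python
import json
from typing import Any


def normalize_info(info: Any) -> dict:
    """Return `info` as a dict, decoding it first if it is a JSON string.

    Hypothetical helper for illustration; not part of the verifiers API.
    """
    if isinstance(info, str):
        info = json.loads(info)
    if not isinstance(info, dict):
        raise TypeError(f"info must decode to a dict, got {type(info).__name__}")
    return info


row = {"info": '{"difficulty": 3, "topic": "algebra"}'}
parsed = normalize_info(row["info"])
print(parsed["topic"])  # algebra
```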

## Building Prompts

### Direct Prompt Construction

Provide a `prompt` column with pre-formatted chat messages:

```python
dataset = Dataset.from_list([
    {
        "prompt": [
            {"role": "system", "content": "You are a math tutor."},
            {"role": "user", "content": "What is 2+2?"}
        ],
        "answer": "4"
    },
])
```

### Question-Based Construction

Use a `question` column and let the environment wrap it:

```python
dataset = Dataset.from_list([
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 3*5?", "answer": "15"},
])

# Automatically converts to:
# {"prompt": [{"role": "user", "content": "What is 2+2?"}], "answer": "4"}
```

### System Prompts and Few-Shot Examples

Add system prompts and few-shot examples via environment parameters:

```python
env = vf.SingleTurnEnv(
    dataset=dataset,
    system_prompt="You are a helpful math tutor.",
    few_shot=[
        {"role": "user", "content": "What is 1+1?"},
        {"role": "assistant", "content": "1+1 equals 2."}
    ],
    rubric=rubric
)
```

**Result:**

```python
[
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "1+1 equals 2."},
    {"role": "user", "content": "What is 2+2?"}
]
```

**Behavior:**

* If the dataset has a `prompt` column: `system_prompt` is prepended (unless a system message is already present) and `few_shot` is ignored
* If the dataset has a `question` column: both `system_prompt` and `few_shot` are applied
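
The two rules can be sketched in plain Python. `build_prompt` is a hypothetical function, not the environment's real code, and it assumes "already present" means the first message of an existing prompt has the `system` role:

```python
def build_prompt(row, system_prompt=None, few_shot=None):
    """Illustrative prompt assembly; a sketch, not the verifiers implementation."""
    if "prompt" in row:
        # Pre-formatted prompt: few_shot is ignored, system prompt added only if absent.
        messages = list(row["prompt"])
        has_system = bool(messages) and messages[0]["role"] == "system"
        if system_prompt and not has_system:
            messages.insert(0, {"role": "system", "content": system_prompt})
        return messages

    # Question column: system prompt, then few-shot turns, then the question.
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.extend(few_shot or [])
    messages.append({"role": "user", "content": row["question"]})
    return messages


shots = [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "1+1 equals 2."},
]
from_question = build_prompt({"question": "What is 2+2?"}, "You are a helpful math tutor.", shots)
from_prompt = build_prompt(
    {"prompt": [{"role": "user", "content": "What is 2+2?"}]},
    "You are a helpful math tutor.",
    shots,
)
print(len(from_question), len(from_prompt))  # 4 2
```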

## Dataset Builders (Lazy Loading)

For large datasets or when running multiple environment replicas, defer dataset loading with a `DatasetBuilder`: a zero-argument callable that returns a `Dataset`.

```python
from datasets import Dataset, load_dataset
import verifiers as vf

def get_dataset_builder(split: str = "train", seed: int = 42) -> vf.DatasetBuilder:
    """Returns a builder that lazily loads the dataset."""
    def build() -> Dataset:
        ds = load_dataset("gsm8k", "main", split=split)
        ds = ds.shuffle(seed=seed)
        return ds
    return build

def load_environment():
    dataset_builder = get_dataset_builder(split="train")
    eval_builder = get_dataset_builder(split="test")
    
    return vf.SingleTurnEnv(
        dataset=dataset_builder,      # built on first access
        eval_dataset=eval_builder,    # built on first access
        rubric=rubric,
    )
```

**When to use builders:**

* Dataset loading is expensive (e.g., downloading from HuggingFace Hub)
* You run multiple environment replicas and not every replica needs to hold the dataset
* You want to parameterize dataset creation without loading immediately

**Lazy loading behavior:**

```python
# Dataset not loaded yet
env = vf.SingleTurnEnv(dataset=dataset_builder, rubric=rubric)

# Triggers loading
train_ds = env.get_dataset()

# Already loaded, returns cached dataset
train_ds_again = env.get_dataset()
```

<Info>
  When a raw `Dataset` is passed (not a builder), it's loaded eagerly during environment initialization for backwards compatibility.
</Info>
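
The build-once-then-cache behavior can be illustrated without the library. `LazySource` below is a hypothetical class, not how the environment stores its datasets, and it uses a plain list as a stand-in for a `Dataset`:

```python
class LazySource:
    """Holds either a ready dataset or a zero-argument builder (illustration only)."""

    def __init__(self, source):
        if callable(source):
            # Builder: defer the work until the first call to get().
            self._builder = source
            self._dataset = None
        else:
            # Raw dataset: available immediately.
            self._builder = None
            self._dataset = source

    def get(self):
        if self._dataset is None:
            self._dataset = self._builder()
        return self._dataset


build_calls = []


def build():
    build_calls.append("called")
    return [{"question": "What is 2+2?", "answer": "4"}]


source = LazySource(build)
print(len(build_calls))  # 0, nothing has been built yet
first = source.get()
second = source.get()
print(len(build_calls))  # 1, the second call reused the cached result
```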

## Training vs Evaluation Datasets

Environments support separate datasets for training and evaluation:

```python
train_ds = Dataset.from_list([...])
eval_ds = Dataset.from_list([...])

env = vf.SingleTurnEnv(
    dataset=train_ds,
    eval_dataset=eval_ds,
    rubric=rubric
)

# Uses train_ds
train_results = await env.generate(train_ds, client=client, model="gpt-4")

# Uses eval_ds
eval_results = await env.evaluate(client=client, model="gpt-4")
```

**Fallback behavior:**

* If `eval_dataset` is not provided, `evaluate()` falls back to `dataset`
* If neither is provided, environment initialization raises `ValueError`
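
As a rough sketch of these fallback rules, assuming nothing about the real constructor beyond what is stated above, a hypothetical `resolve_sources` function could look like this:

```python
def resolve_sources(dataset=None, eval_dataset=None):
    """Return (train_source, eval_source) following the fallback rules.

    Hypothetical helper for illustration; not part of the verifiers API.
    """
    if dataset is None and eval_dataset is None:
        raise ValueError("Provide at least one of dataset or eval_dataset")
    # Evaluation falls back to the training data when no eval set is given.
    eval_source = eval_dataset if eval_dataset is not None else dataset
    return dataset, eval_source


train, evaluation = resolve_sources(dataset=["train-row"])
print(evaluation)  # ['train-row']
```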

## Dataset Access Methods

### get\_dataset()

Retrieve the training dataset with optional sampling:

```python
# Get full dataset
train_ds = env.get_dataset()

# Get first 100 examples
train_ds = env.get_dataset(n=100)

# Shuffle and get first 50
train_ds = env.get_dataset(n=50, seed=42)
```

**Type signature:**

```python
def get_dataset(self, n: int = -1, seed: int | None = None) -> Dataset:
    ...
```

### get\_eval\_dataset()

Retrieve the evaluation dataset:

```python
# Get full eval dataset
eval_ds = env.get_eval_dataset()

# Shuffle with seed 42, then take the first 20 examples
eval_ds = env.get_eval_dataset(n=20, seed=42)
```

**Type signature:**

```python
def get_eval_dataset(self, n: int = -1, seed: int | None = None) -> Dataset:
    ...
```
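
The `n` and `seed` semantics shared by both methods can be sketched on a plain list. This is an assumption-laden illustration: `select_rows` is hypothetical, it treats `n=-1` as "all rows", it shuffles before truncating, and it uses Python's `random` module, so the resulting order will not match what the library produces for the same seed:

```python
import random


def select_rows(rows, n=-1, seed=None):
    """Optionally shuffle, then keep the first n rows (sketch only)."""
    rows = list(rows)
    if seed is not None:
        # A fixed seed makes the shuffled order reproducible across calls.
        random.Random(seed).shuffle(rows)
    if n > 0:
        rows = rows[:n]
    return rows


data = list(range(10))
print(select_rows(data, n=3))  # [0, 1, 2]
print(select_rows(data, n=3, seed=42) == select_rows(data, n=3, seed=42))  # True
```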

## Dataset Formatting

The environment automatically formats datasets during initialization:

### Example ID Assignment

```python
# Input dataset
dataset = Dataset.from_list([
    {"question": "What is 2+2?"},
    {"question": "What is 3*5?"},
])

# After formatting
# example_id column added: [0, 1]
```

### Prompt Construction

```python
# Input with question column
dataset = Dataset.from_list([{"question": "What is 2+2?"}])

env = vf.SingleTurnEnv(
    dataset=dataset,
    system_prompt="You are helpful.",
    rubric=rubric
)

# Formatted dataset has prompt column:
# {
#     "prompt": [
#         {"role": "system", "content": "You are helpful."},
#         {"role": "user", "content": "What is 2+2?"}
#     ]
# }
```

### Task Assignment

```python
env = vf.SingleTurnEnv(
    dataset=dataset,
    env_id="math-qa",
    rubric=rubric
)

# Each row gets task="math-qa"
```
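
The ID and task defaults can be summarized in a small sketch over plain dicts. `format_rows` is a hypothetical stand-in for the formatting step; it assumes existing `example_id` and `task` values are left untouched:

```python
def format_rows(rows, env_id):
    """Fill in example_id and task where they are missing (sketch only)."""
    formatted = []
    for index, row in enumerate(rows):
        row = dict(row)  # copy so the caller's rows are not mutated
        row.setdefault("example_id", index)
        row.setdefault("task", env_id)
        formatted.append(row)
    return formatted


rows = [{"question": "What is 2+2?"}, {"question": "What is 3*5?"}]
for row in format_rows(rows, env_id="math-qa"):
    print(row["example_id"], row["task"])
# 0 math-qa
# 1 math-qa
```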

## Loading Example Datasets

Verifiers includes built-in example datasets:

```python
import verifiers as vf

# Load example dataset
dataset = vf.load_example_dataset("gsm8k-sample")

env = vf.SingleTurnEnv(dataset=dataset, rubric=rubric)
```

## Creating Datasets Programmatically

### From Lists

```python
from datasets import Dataset

data = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 3*5?", "answer": "15"},
]

dataset = Dataset.from_list(data)
```

### From HuggingFace Hub

```python
from datasets import load_dataset

dataset = load_dataset("gsm8k", "main", split="train")
```

### From Pandas DataFrame

```python
import pandas as pd
from datasets import Dataset

df = pd.DataFrame({
    "question": ["What is 2+2?", "What is 3*5?"],
    "answer": ["4", "15"]
})

dataset = Dataset.from_pandas(df)
```

### Using make\_dataset()

Static helper for creating datasets from rollout inputs:

```python
import verifiers as vf

inputs = [
    {"prompt": [{"role": "user", "content": "Hello"}], "answer": "Hi"},
    {"prompt": [{"role": "user", "content": "Goodbye"}], "answer": "Bye"},
]

dataset = vf.Environment.make_dataset(inputs)
```

## Dataset Transformations

Use the `datasets` library's `.map()` for preprocessing:

```python
from datasets import load_dataset

dataset = load_dataset("gsm8k", "main", split="train")

# Extract answer from GSM8K format
def extract_answer(example):
    answer_text = example["answer"].split("####")[1].strip()
    return {"answer": answer_text}

dataset = dataset.map(extract_answer)
```
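
Indexing with `split("####")[1]` raises `IndexError` on a row that lacks the marker. A slightly more defensive variant, offered as a sketch, uses `str.rpartition` and falls back to the whole string. `extract_final_answer` is a hypothetical name, and the sample text is made up for this example rather than taken from the dataset:

```python
def extract_final_answer(solution: str, marker: str = "####") -> str:
    """Return the text after the last marker, or the whole string if there is none."""
    # rpartition yields ("", "", solution) when the marker is absent,
    # so the last element is the right value in both cases.
    return solution.rpartition(marker)[2].strip()


sample = "There are 3 boxes of 4 pens, so 3 * 4 = 12 pens.\n#### 12"
print(extract_final_answer(sample))  # 12
print(extract_final_answer("12"))    # 12
```

Such a function could be wrapped in a mapper that returns `{"answer": extract_final_answer(example["answer"])}`.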

**Configure mapping parallelism:**

```python
env = vf.SingleTurnEnv(
    dataset=dataset,
    map_kwargs={"num_proc": 4},  # use 4 processes for .map()
    rubric=rubric
)
```

## Info Column Patterns

### Simple Metadata

```python
dataset = Dataset.from_list([
    {
        "question": "What is 2+2?",
        "answer": "4",
        "info": {"difficulty": 1, "category": "arithmetic"}
    },
])
```

### Heterogeneous Schemas (JSON Strings)

```python
dataset = Dataset.from_list([
    {
        "question": "Solve: 2x + 3 = 7",
        "info": '{"type": "algebra", "variables": ["x"]}'
    },
    {
        "question": "What is the capital of France?",
        "info": '{"type": "geography", "country": "France"}'
    },
])
```
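
When the metadata starts out as Python dicts, it can be serialized with `json.dumps` before the dataset is built. The sketch below uses a hypothetical `encode_info` helper on plain rows:

```python
import json


def encode_info(rows):
    """Serialize dict-valued info fields to JSON strings (sketch only)."""
    encoded = []
    for row in rows:
        row = dict(row)  # copy so the input rows stay unchanged
        if isinstance(row.get("info"), dict):
            row["info"] = json.dumps(row["info"], sort_keys=True)
        encoded.append(row)
    return encoded


rows = [
    {"question": "Solve: 2x + 3 = 7", "info": {"type": "algebra", "variables": ["x"]}},
    {"question": "What is the capital of France?", "info": {"type": "geography", "country": "France"}},
]
encoded = encode_info(rows)
print(encoded[0]["info"])  # {"type": "algebra", "variables": ["x"]}
```

Every `info` value is now a string, so the column has one uniform type even though the underlying fields differ per row.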

### Tool Definitions in Info

Store per-example tool definitions:

```python
dataset = Dataset.from_list([
    {
        "question": "Calculate the square root of 16",
        "info": {
            "tool_defs": [
                {
                    "name": "sqrt",
                    "description": "Calculate square root",
                    "parameters": {"type": "object", "properties": {"x": {"type": "number"}}}
                }
            ]
        }
    },
])
```

The environment automatically extracts and normalizes `info["tool_defs"]` during state initialization.

## Dataset Persistence

### Saving to Disk

```python
dataset.save_to_disk("./my_dataset")

# Load later
from datasets import load_from_disk
dataset = load_from_disk("./my_dataset")
```

### Pushing to HuggingFace Hub

```python
dataset.push_to_hub("username/dataset-name")

# Load later
from datasets import load_dataset
dataset = load_dataset("username/dataset-name")
```

## Rollouts Per Example

Generate multiple rollouts per dataset row for best-of-N sampling or diversity:

```python
# Dataset has 100 examples
dataset = Dataset.from_list([...])  # 100 rows

# 100 examples x 4 rollouts per example = 400 rollouts in total
results = await env.evaluate(
    client=client,
    model="gpt-4",
    num_examples=100,
    rollouts_per_example=4
)

# Results grouped by example_id for pass@k metrics
```

**Implementation:**

```python
# Environment internally repeats the dataset
if rollouts_per_example > 1:
    inputs = dataset.repeat(rollouts_per_example)
```
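
Grouping by `example_id` makes per-example metrics straightforward. The sketch below computes the fraction of examples with at least one successful rollout, which corresponds to pass@k when k equals the number of rollouts per example. The flat list of `{"example_id", "reward"}` dicts is an assumed shape chosen for illustration, not the actual structure of the results object:

```python
from collections import defaultdict


def any_correct_rate(rollouts, threshold=1.0):
    """Fraction of examples where some rollout reaches the reward threshold."""
    rewards_by_example = defaultdict(list)
    for rollout in rollouts:
        rewards_by_example[rollout["example_id"]].append(rollout["reward"])
    if not rewards_by_example:
        return 0.0
    solved = sum(1 for rewards in rewards_by_example.values() if max(rewards) >= threshold)
    return solved / len(rewards_by_example)


rollouts = [
    {"example_id": 0, "reward": 0.0},
    {"example_id": 0, "reward": 1.0},
    {"example_id": 1, "reward": 0.0},
    {"example_id": 1, "reward": 0.0},
]
print(any_correct_rate(rollouts))  # 0.5
```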

## Example: Complete Dataset Pipeline

```python
from datasets import load_dataset
import verifiers as vf

def get_dataset_builder():
    def build():
        # Load from HuggingFace
        ds = load_dataset("gsm8k", "main", split="train")
        
        # Extract the final answer and keep the full solution text in info.
        # Both fields are derived in one pass so that "original_answer"
        # still holds the unmodified GSM8K answer string.
        def preprocess(example):
            original = example["answer"]
            return {
                "answer": original.split("####")[1].strip(),
                "info": {"original_answer": original},
            }
        
        ds = ds.map(preprocess)
        
        return ds
    
    return build

def load_environment():
    dataset_builder = get_dataset_builder()
    
    async def correct_answer(completion, answer, parser) -> float:
        parsed = parser.parse_answer(completion)
        return 1.0 if parsed == answer else 0.0
    
    parser = vf.XMLParser(fields=["reasoning", "answer"])
    rubric = vf.Rubric(funcs=[correct_answer], parser=parser)
    
    return vf.SingleTurnEnv(
        dataset=dataset_builder,
        system_prompt="Solve the math problem step by step.",
        parser=parser,
        rubric=rubric,
        env_id="gsm8k"
    )
```

<Warning>
  When using `DatasetBuilder`, ensure the builder function is deterministic if you need reproducible dataset ordering across runs. Use fixed seeds for shuffling operations.
</Warning>
