> ## Documentation Index
> Fetch the complete documentation index at: https://mintlify.com/primeintellect-ai/verifiers/llms.txt
> Use this file to discover all available pages before exploring further.

# OpenAIChatCompletionsTokenClient

> vLLM token-level client with TITO (token-in, token-out) optimization

The `OpenAIChatCompletionsTokenClient` class extends `OpenAIChatCompletionsClient` to use vLLM's custom `/v1/chat/completions/tokens` endpoint for token-level prompt stitching (TITO - token-in, token-out) instead of message-level inference (MITO - message-in, token-out).

## Overview

This client optimizes multi-turn conversations by reusing tokenized prompts from previous turns rather than re-tokenizing the entire conversation history on each turn. It:

* Detects message-level prefix matches in the conversation trajectory
* Reuses token IDs from previous turns when possible
* Handles chat template suffix tokens correctly
* Falls back to standard message-based inference for the first turn or when multimodal content is present
* Automatically manages token stitching across truncated turns

<Note>
  This client requires a vLLM server with the custom `/v1/chat/completions/tokens` and `/tokenize` endpoints. It is designed for inference optimization and will fall back to standard behavior when necessary.
</Note>

## Type Aliases

Inherits from `OpenAIChatCompletionsClient`:

```python theme={null}
OpenAIChatMessage = ChatCompletionMessageParam
OpenAIChatMessages = list[OpenAIChatMessage]
OpenAIChatResponse = ChatCompletion
OpenAITool = ChatCompletionToolParam
```

Additional response type:

```python theme={null}
class TokenizeResponse(BaseModel):
    count: int
    max_model_len: int
    tokens: list[int]
    token_strs: Optional[list[str]] = None
```

## Class Definition

```python theme={null}
class OpenAIChatCompletionsTokenClient(OpenAIChatCompletionsClient)
```

Inherits all generic type parameters from `OpenAIChatCompletionsClient`:

* **ClientT**: `AsyncOpenAI`
* **MessagesT**: `OpenAIChatMessages`
* **ResponseT**: `OpenAIChatResponse`
* **ToolT**: `OpenAITool`

## Constructor

```python theme={null}
OpenAIChatCompletionsTokenClient(client_or_config: AsyncOpenAI | ClientConfig)
```

<ParamField path="client_or_config" type="AsyncOpenAI | ClientConfig" required>
  Either a pre-configured `AsyncOpenAI` client or a `ClientConfig` to create one. The base URL should point to a vLLM server.
</ParamField>

### Example

```python theme={null}
from verifiers.clients.openai_chat_completions_token_client import OpenAIChatCompletionsTokenClient
from verifiers.types import ClientConfig

# Using ClientConfig with vLLM server
client = OpenAIChatCompletionsTokenClient(
    ClientConfig(
        api_key="EMPTY",  # vLLM typically doesn't require real API key
        base_url="http://localhost:8000/v1"
    )
)

# Using pre-configured AsyncOpenAI client
from openai import AsyncOpenAI
client = OpenAIChatCompletionsTokenClient(
    AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
)
```

## Properties

### token\_client

```python theme={null}
@property
def token_client(self) -> AsyncOpenAI
```

Returns an `AsyncOpenAI` client with the base URL stripped of `/v1` suffix for accessing vLLM's `/tokenize` endpoint.

**Returns:** `AsyncOpenAI` instance configured for tokenization endpoints.

## Methods

### get\_native\_response

```python theme={null}
@handle_openai_overlong_prompt
async def get_native_response(
    self,
    prompt: OpenAIChatMessages,
    model: str,
    sampling_args: SamplingArgs,
    tools: list[OpenAITool] | None = None,
    **kwargs,
) -> OpenAIChatResponse
```

Calls the vLLM Chat Completions API, using token-level inference when possible.

<ParamField path="prompt" type="OpenAIChatMessages" required>
  List of OpenAI message parameters.
</ParamField>

<ParamField path="model" type="str" required>
  Model identifier hosted on vLLM.
</ParamField>

<ParamField path="sampling_args" type="SamplingArgs" required>
  Sampling parameters. `max_tokens` is automatically renamed to `max_completion_tokens`. `logprobs` is automatically set to `True` and `return_token_ids=True` is added to `extra_body`.
</ParamField>

<ParamField path="tools" type="list[OpenAITool] | None" default="None">
  Optional list of tools in OpenAI format.
</ParamField>

<ParamField path="kwargs" type="dict">
  Must include `state` (type `State`) for accessing trajectory and managing cached tokens.
</ParamField>

**Returns:** OpenAI `ChatCompletion` object.

**Behavior:**

* **First turn** (`len(state["trajectory"]) == 0`): Uses standard `/chat/completions` endpoint (MITO)
* **Multimodal content present**: Falls back to standard endpoint because vLLM's `/tokenize` doesn't run multimodal processor
* **Subsequent text-only turns**: Uses `/chat/completions/tokens` endpoint (TITO) with token stitching
* **No prefix match found**: Falls back to standard endpoint

### get\_prompt\_ids

```python theme={null}
async def get_prompt_ids(
    self,
    state: State,
    prompt_messages: OpenAIChatMessages,
    oai_tools: list[OpenAITool] | None,
) -> list[int] | None
```

Builds prompt token IDs by finding the longest message-level prefix match in the trajectory and stitching with new tokens.

<ParamField path="state" type="State" required>
  Current rollout state containing trajectory history.
</ParamField>

<ParamField path="prompt_messages" type="OpenAIChatMessages" required>
  Current prompt messages to convert to token IDs.
</ParamField>

<ParamField path="oai_tools" type="list[OpenAITool] | None" required>
  Tools in OpenAI format (affects tokenization).
</ParamField>

**Returns:** List of token IDs representing the full prompt, or `None` if no prefix match found.

**Algorithm:**

1. Scans trajectory backwards to find the step whose messages form the longest prefix of `prompt_messages`
2. Extracts token IDs from that step (prompt\_ids + completion\_ids)
3. Computes and appends chat template suffix tokens (e.g., EOM tokens)
4. Tokenizes the full prompt to derive environment response tokens
5. Returns `prev_turn_ids + suffix_ids + env_response_ids`

### tokenize

```python theme={null}
async def tokenize(
    self,
    messages: str | OpenAIChatMessages,
    tools: list[OpenAITool] | None,
    model: str,
    extra_kwargs: dict = {},
    **kwargs,
) -> list[int]
```

Tokenizes messages or text using the vLLM `/tokenize` API.

<ParamField path="messages" type="str | OpenAIChatMessages" required>
  Either a plain text string or a list of OpenAI message parameters.
</ParamField>

<ParamField path="tools" type="list[OpenAITool] | None" required>
  Optional tools (affects tokenization of messages).
</ParamField>

<ParamField path="model" type="str" required>
  Model identifier for tokenization.
</ParamField>

<ParamField path="extra_kwargs" type="dict" default="{}">
  Additional parameters for tokenization (e.g., `add_generation_prompt`).
</ParamField>

**Returns:** List of token IDs.

## Usage Example

```python theme={null}
import asyncio
import verifiers as vf
from verifiers.clients.openai_chat_completions_token_client import OpenAIChatCompletionsTokenClient
from verifiers.types import ClientConfig, SamplingArgs

async def main():
    # Initialize client pointing to vLLM server
    client = OpenAIChatCompletionsTokenClient(
        ClientConfig(
            api_key="EMPTY",
            base_url="http://localhost:8000/v1"
        )
    )
    
    # Create a simple environment
    def load_environment():
        dataset = vf.Environment.make_dataset([
            {"question": "What is 2+2?"}
        ])
        
        def correctness(completion: vf.Messages, **kwargs) -> float:
            text = vf.content_to_text(completion[-1].content)
            return 1.0 if "4" in text else 0.0
        
        return vf.SingleTurnEnv(
            dataset=dataset,
            rubric=vf.Rubric(correctness),
        )
    
    env = load_environment()
    
    # Run rollout - token client will automatically use TITO on subsequent turns
    state = await env.rollout(
        input={"question": "What is 2+2?"},
        client=client,
        model="meta-llama/Llama-3.1-8B-Instruct",
        sampling_args=SamplingArgs(temperature=0.0, max_tokens=100)
    )
    
    # Check token information
    if state["trajectory"]:
        first_turn = state["trajectory"][0]
        if first_turn["tokens"]:
            print(f"Prompt tokens: {len(first_turn['tokens']['prompt_ids'])}")
            print(f"Completion tokens: {len(first_turn['tokens']['completion_ids'])}")
    
    await client.close()

asyncio.run(main())
```

## Multi-Turn Example

```python theme={null}
import asyncio
import verifiers as vf
from verifiers.clients.openai_chat_completions_token_client import OpenAIChatCompletionsTokenClient
from verifiers.types import ClientConfig, SamplingArgs

async def main():
    client = OpenAIChatCompletionsTokenClient(
        ClientConfig(api_key="EMPTY", base_url="http://localhost:8000/v1")
    )
    
    # Multi-turn environment
    class CountingEnv(vf.MultiTurnEnv):
        def __init__(self):
            dataset = vf.Environment.make_dataset([{"start": 1}])
            super().__init__(
                dataset=dataset,
                rubric=vf.Rubric(lambda **kw: 1.0),
                max_turns=5
            )
        
        async def env_response(
            self, completion: vf.Messages, state: vf.State
        ) -> vf.Messages:
            turn = state["turn"]
            return [vf.UserMessage(content=f"What is {turn + 1} + 1?")]
        
        async def is_completed(self, state: vf.State) -> bool:
            return state["turn"] >= 5
    
    env = CountingEnv()
    
    state = await env.rollout(
        input={"start": 1},
        client=client,
        model="meta-llama/Llama-3.1-8B-Instruct",
        sampling_args=SamplingArgs(temperature=0.0, max_tokens=50)
    )
    
    # Turn 0: Uses MITO (first turn)
    # Turn 1-4: Uses TITO (reuses tokens from previous turns)
    
    print(f"Completed {len(state['trajectory'])} turns")
    for i, step in enumerate(state["trajectory"]):
        prompt_len = len(step["tokens"]["prompt_ids"]) if step["tokens"] else 0
        completion_len = len(step["tokens"]["completion_ids"]) if step["tokens"] else 0
        print(f"Turn {i}: prompt={prompt_len} tokens, completion={completion_len} tokens")
    
    await client.close()

asyncio.run(main())
```

## State Keys

The client manages these keys in `state`:

<ParamField path="_cached_suffix_ids" type="list[int]">
  Cached chat template suffix tokens computed once per rollout. Used to correctly handle message delimiter tokens across turns.
</ParamField>

## TITO vs MITO

### Message-In Token-Out (MITO)

Standard behavior:

* Sends full message history on each turn
* Server re-tokenizes everything
* Used for: first turn, multimodal content

### Token-In Token-Out (TITO)

Optimized behavior:

* Sends token IDs directly to skip re-tokenization
* Reuses cached tokens from previous turns
* Stitches new tokens for environment responses
* Used for: subsequent text-only turns

**Performance benefit:** TITO eliminates redundant tokenization overhead in multi-turn conversations, especially valuable for:

* Long conversation histories
* Models with complex chat templates
* High-throughput inference scenarios

## Multimodal Content Handling

The client automatically detects multimodal content (images, audio) and falls back to MITO:

```python theme={null}
def _has_multimodal_content(messages) -> bool:
    """Check if any message contains multimodal content."""
    for msg in messages:
        content = msg.get("content") if hasattr(msg, "get") else None
        if isinstance(content, list):
            for part in content:
                if hasattr(part, "get") and part.get("type") in (
                    "image_url",
                    "input_audio",
                ):
                    return True
    return False
```

**Reason:** vLLM ≤0.16's `/tokenize` endpoint doesn't run the multimodal processor, so image placeholders stay collapsed (1 token instead of N) and token-stitching produces broken prompts.

## Chat Template Suffix Tokens

The client handles chat template suffix tokens (e.g., EOM tokens, newlines) correctly:

1. Computes suffix tokens once using dummy messages
2. Caches them in `state["_cached_suffix_ids"]`
3. For each turn, finds the largest overlap between previous turn tokens and suffix tokens
4. Appends non-overlapping suffix tokens to handle truncated turns

This ensures that token stitching respects the chat template format even when turns are truncated mid-message.

## Fallback Conditions

The client falls back to standard MITO when:

1. **First turn**: `len(state["trajectory"]) == 0`
2. **Multimodal content**: Current or any previous turn contains images/audio
3. **No prefix match**: `get_prompt_ids()` returns `None`

## Error Handling

Inherits error handling from `OpenAIChatCompletionsClient`:

* Context length errors → `OverlongPromptError`
* Empty responses → `EmptyModelResponseError`
* Invalid responses → `InvalidModelResponseError`
* Authentication errors → Re-raised from provider

## See Also

* [OpenAIChatCompletionsClient](/api/openai-client) - Parent class with standard message-based inference
* [Client](/api/client) - Base client interface
* [State](/api/types/state) - State type with trajectory structure
* [Response](/api/types#response) - Response type
