LLM as a Judge

The judge() function is a multi-purpose tool that can be used for:

  • Hallucination Detection
  • Schema Validation
  • Multi-Response Accuracy
  • Guardrails Enforcement
  • Response Regeneration

Using the judge() function

Fact Checking

For quick fact-checking tasks, the judge() function can be used without any verifier schema or instructions.

import zyx

fact = "The capital of France is India"

zyx.judge(
    fact,
    process = "fact_check",
    model = "gpt-4o-mini"
)
FactCheckResult(
    is_accurate=False,
    explanation='The statement that the capital of France is India is incorrect. The capital of France is Paris, not India. India is a country
in South Asia, and it has its own capital, which is New Delhi.',
    confidence=1.0
)

Accuracy Judgement

You can use the judge() function to compare multiple responses to a given prompt and determine which one is the most accurate, helpful, and relevant.

import zyx

prompt = "Explain the theory of relativity"
responses = [
    "The theory of relativity describes how space and time are interconnected.",
    "Einstein's theory states that E=mc^2, which relates energy and mass."
]

zyx.judge(
    prompt,
    responses=responses,
    process="accuracy"
)
JudgmentResult(
    explanation='Response 1 provides a broader understanding of the theory of relativity by mentioning the interconnection of space and time,
which is fundamental to both the special and general theories of relativity. Response 2 focuses on the famous equation E=mc^2, which is a key
result of the theory but does not provide a comprehensive explanation of the overall theory. Therefore, while both responses are accurate,
Response 1 is more helpful and relevant as it captures a core aspect of the theory.',
    verdict='Response 1 is the most accurate, helpful, and relevant.'
)
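Internally, the accuracy process enumerates the candidate responses into a single numbered user message before asking the model for a verdict (see the source code at the bottom of this page). A minimal sketch of that formatting step:

```python
prompt = "Explain the theory of relativity"
responses = [
    "The theory of relativity describes how space and time are interconnected.",
    "Einstein's theory states that E=mc^2, which relates energy and mass.",
]

# Mirrors how judge() builds the judging message for process="accuracy".
user_message = f"Prompt: {prompt}\n\nResponses:\n"
for idx, response in enumerate(responses, 1):
    user_message += f"{idx}. {response}\n\n"

print(user_message)
```

The numbering is what lets the model refer to "Response 1" and "Response 2" in its verdict.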

Schema Validation

The judge() function can also be used to validate responses against a predefined schema or set of criteria.

import zyx

prompt = "Describe the water cycle"
response = "Water evaporates, forms clouds, and then falls as rain."
schema = "The response should include: 1) Evaporation, 2) Condensation, 3) Precipitation, 4) Collection"

result = zyx.judge(
    prompt,
    responses=response,
    process="validate",
    schema=schema
)

print(result)
ValidationResult(
    is_valid=False,
    explanation='The response does not include all required components of the water cycle as outlined in the schema. Specifically, it mentions
evaporation and precipitation, but it fails to mention condensation and collection.'
)
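Note that simple keyword matching is not enough for this kind of check: a naive substring test would not match "evaporates" against "Evaporation". The hypothetical local pre-check below (not part of zyx) shows why a semantic, LLM-based validator is useful here:

```python
required = ["evaporation", "condensation", "precipitation", "collection"]
response = "Water evaporates, forms clouds, and then falls as rain."

# A naive substring check reports every term as "missing", even though
# the response clearly paraphrases evaporation and precipitation.
missing = [term for term in required if term not in response.lower()]
print(missing)
```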

Response Regeneration

The module can also regenerate a corrected response when the original is judged inaccurate or incomplete. This is useful for producing high-quality responses for a given prompt based on the schema or instructions provided.

import zyx

prompt = "Explain photosynthesis"
responses = [
    "Photosynthesis is how plants make food.",
    "Plants use sunlight to convert CO2 and water into glucose and oxygen."
]

regenerated_response = zyx.judge(
    prompt,
    responses=responses,
    process="accuracy",
    regenerate=True,
    verbose=True
)

print(regenerated_response)
[09/29/24 00:42:45] INFO     judge - judge - Judging responses for prompt: Explain photosynthesis
...
[09/29/24 00:42:47] WARNING  judge - judge - Response is not accurate. Regenerating response.
...
RegeneratedResponse(
    response='Photosynthesis is a biochemical process used by plants...'
)

Guardrails

Finally, the judge() function can enforce guardrails on generated responses, helping ensure they are accurate, relevant, and appropriate for the given prompt. If a response violates the guardrails, it is always regenerated.

import zyx

prompt = "Describe the benefits of exercise"
responses = ["Exercise helps you lose weight and build muscle."]
guardrails = [
    "Ensure the response mentions mental health benefits.",
    "Include at least three distinct benefits of exercise.",
    "Avoid focusing solely on physical appearance."
]

result = zyx.judge(
    prompt,
    responses=responses,
    process="accuracy",
    guardrails=guardrails,
    verbose=True
)

print(result)
[09/29/24 00:50:30] INFO     judge - judge - Judging responses for prompt: Describe the benefits of exercise
...
[09/29/24 00:50:33] WARNING  judge - judge - Response violates guardrails. Regenerating response.
...
RegeneratedResponse(
    response="Exercise offers a multitude of benefits that extend beyond..."
)
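Conceptually, the guardrails pass runs after the main judging process: the result is checked against each guardrail, and regenerated on any violation. A rough sketch of that control flow, using stand-in functions in place of the actual LLM calls:

```python
guardrails = ["Ensure the response mentions mental health benefits."]

def violates(response: str, rules: list) -> bool:
    # Stand-in for the LLM guardrails check: here we simply flag
    # responses that never mention mental health.
    return "mental" not in response.lower()

def regenerate_stub(prompt: str, reason: str) -> str:
    # Stand-in for the regeneration LLM call.
    return ("Exercise lifts mood and supports mental health, builds "
            "strength, and improves cardiovascular fitness.")

response = "Exercise helps you lose weight and build muscle."
if violates(response, guardrails):
    response = regenerate_stub("Describe the benefits of exercise",
                               "missing mental health benefits")
print(response)
```

In the real function this logic lives in check_guardrails() and regenerate_response(), shown in the source code below.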

API Reference

Judge responses based on accuracy, validate against a schema, or fact-check a single response, with an option to regenerate an optimized response.

Example
>>> judge(
    prompt="Explain the concept of quantum entanglement.",
    responses=[
        "Quantum entanglement is a phenomenon where two particles become interconnected and their quantum states cannot be described independently.",
        "Quantum entanglement is when particles are really close to each other and move in the same way."
    ],
    process="accuracy",
    verbose=True
)

Accuracy Judgment:
Explanation: The first response is more accurate as it provides a clear definition of quantum entanglement.
Verdict: The first response is the most accurate.

Validation Result:
Is Valid: True
Explanation: The response adheres to the provided schema.

Fact-Check Result:
Is Accurate: True
Explanation: The response accurately reflects the fact that quantum entanglement occurs when two particles are separated by a large distance but still instantaneously affect each other's quantum states.
Confidence: 0.95

Regenerated Response:
Response: Quantum entanglement is a phenomenon where two particles become interconnected and their quantum states cannot be described independently.

Parameters:

prompt (str): The original prompt or question. Required.
responses (List[str]): List of responses to judge, validate, or fact-check. Default: None.
process (Literal['accuracy', 'validate', 'fact_check']): The type of verification to perform. Default: 'accuracy'.
schema (Optional[Union[str, dict]]): Schema for validation or fact-checking (optional for fact_check). Default: None.
regenerate (bool): Whether to regenerate an optimized response. Default: False.
model (str): The model to use for judgment. Default: 'gpt-4o-mini'.
api_key (Optional[str]): API key for the LLM service. Default: None.
base_url (Optional[str]): Base URL for the LLM service. Default: None.
temperature (float): Temperature for response generation. Default: 0.7.
mode (InstructorMode): Mode for the instructor. Default: 'markdown_json_mode'.
max_retries (int): Maximum number of retries for API calls. Default: 3.
organization (Optional[str]): Organization for the LLM service. Default: None.
client (Optional[Literal['openai', 'litellm']]): Client to use for API calls. Default: None.
verbose (bool): Whether to log verbose output. Default: False.
guardrails (Optional[Union[str, List[str]]]): Guardrails for content moderation. Default: None.

Returns:

Union[JudgmentResult, ValidationResult, RegeneratedResponse, FactCheckResult]: The result of the judgment, validation, fact-check, or regeneration.
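Because judge() returns a union of result models, calling code may want to branch on the concrete type. The sketch below uses minimal stand-in dataclasses (field names taken from the example outputs above; the real models ship with zyx) to illustrate the dispatch pattern:

```python
from dataclasses import dataclass

# Stand-ins for two of the result models; the real classes live in zyx.
@dataclass
class FactCheckResult:
    is_accurate: bool
    explanation: str
    confidence: float

@dataclass
class RegeneratedResponse:
    response: str

def summarize(result) -> str:
    # Branch on the concrete result type from the union return value.
    if isinstance(result, RegeneratedResponse):
        return result.response
    if isinstance(result, FactCheckResult):
        return f"accurate={result.is_accurate} (confidence {result.confidence})"
    return str(result)

print(summarize(FactCheckResult(False, "Paris, not India.", 1.0)))
```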

Source code in zyx/resources/completions/agents/judge.py
def judge(
    prompt: str,
    responses: Optional[Union[List[str], str]] = None,
    process: Literal["accuracy", "validate", "fact_check"] = "accuracy",
    schema: Optional[Union[str, dict]] = None,
    regenerate: bool = False,
    model: str = "gpt-4o-mini",
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    temperature: float = 0.7,
    mode: InstructorMode = "markdown_json_mode",
    max_retries: int = 3,
    organization: Optional[str] = None,
    client: Optional[Literal["openai", "litellm"]] = None,
    verbose: bool = False,
    guardrails: Optional[Union[str, List[str]]] = None,
) -> Union[JudgmentResult, ValidationResult, RegeneratedResponse, FactCheckResult]:
    """
    Judge responses based on accuracy, validate against a schema, or fact-check a single response,
    with an option to regenerate an optimized response.

    Example:
        ```python
        >>> judge(
            prompt="Explain the concept of quantum entanglement.",
            responses=[
                "Quantum entanglement is a phenomenon where two particles become interconnected and their quantum states cannot be described independently.",
                "Quantum entanglement is when particles are really close to each other and move in the same way."
            ],
            process="accuracy",
            verbose=True
        )

        Accuracy Judgment:
        Explanation: The first response is more accurate as it provides a clear definition of quantum entanglement.
        Verdict: The first response is the most accurate.

        Validation Result:
        Is Valid: True
        Explanation: The response adheres to the provided schema.

        Fact-Check Result:
        Is Accurate: True
        Explanation: The response accurately reflects the fact that quantum entanglement occurs when two particles are separated by a large distance but still instantaneously affect each other's quantum states.
        Confidence: 0.95

        Regenerated Response:
        Response: Quantum entanglement is a phenomenon where two particles become interconnected and their quantum states cannot be described independently.
        ```

    Args:
        prompt (str): The original prompt or question.
        responses (List[str]): List of responses to judge, validate, or fact-check.
        process (Literal["accuracy", "validate", "fact_check"]): The type of verification to perform.
        schema (Optional[Union[str, dict]]): Schema for validation or fact-checking (optional for fact_check).
        regenerate (bool): Whether to regenerate an optimized response.
        model (str): The model to use for judgment.
        api_key (Optional[str]): API key for the LLM service.
        base_url (Optional[str]): Base URL for the LLM service.
        temperature (float): Temperature for response generation.
        mode (InstructorMode): Mode for the instructor.
        max_retries (int): Maximum number of retries for API calls.
        organization (Optional[str]): Organization for the LLM service.
        client (Optional[Literal["openai", "litellm"]]): Client to use for API calls.
        verbose (bool): Whether to log verbose output.
        guardrails (Optional[Union[str, List[str]]]): Guardrails for content moderation.

    Returns:
        Union[JudgmentResult, ValidationResult, RegeneratedResponse, FactCheckResult]: The result of the judgment, validation, fact-check, or regeneration.
    """
    if verbose:
        logger.info(f"Judging responses for prompt: {prompt}")
        logger.info(f"process: {process}")
        logger.info(f"Regenerate: {regenerate}")

    if isinstance(responses, str):
        responses = [responses]

    completion_client = Client(
        api_key=api_key,
        base_url=base_url,
        organization=organization,
        provider=client,
        verbose=verbose,
    )

    if process == "accuracy":
        system_message = (
            "You are an impartial judge evaluating responses to a given prompt. "
            "Compare the responses and determine which one is the most accurate, helpful, and relevant. "
            "Provide a brief explanation for your decision and then state your verdict."
        )
        user_message = f"Prompt: {prompt}\n\nResponses:\n"
        for idx, response in enumerate(responses, 1):
            user_message += f"{idx}. {response}\n\n"

        result = completion_client.completion(
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message},
            ],
            model=model,
            response_model=JudgmentResult,
            mode=mode,
            max_retries=max_retries,
            temperature=temperature,
        )

        if regenerate:
            if verbose:
                logger.warning("Response is not accurate. Regenerating response.")

            system_message = (
                "Based on the judgment provided, generate an optimized response "
                "that addresses the prompt more effectively than the original responses."
            )
            user_message = f"Original prompt: {prompt}\n\nJudgment: {result.explanation}\n\nGenerate an optimized response:"

            regenerated = completion_client.completion(
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": user_message},
                ],
                model=model,
                response_model=RegeneratedResponse,
                mode=mode,
                max_retries=max_retries,
                temperature=temperature,
            )
            result = regenerated

    elif process == "validate":
        if not schema:
            raise ValueError("Schema is required for validation.")

        system_message = (
            "You are a validation expert. Your task is to determine if the given response "
            "matches the provided schema or instructions. Provide a detailed explanation "
            "of your validation process and state whether the response is valid or not."
        )
        user_message = f"Prompt: {prompt}\n\nResponse: {responses[0]}\n\nSchema/Instructions: {schema}"

        result = completion_client.completion(
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message},
            ],
            model=model,
            response_model=ValidationResult,
            mode=mode,
            max_retries=max_retries,
            temperature=temperature,
        )

        if regenerate and not result.is_valid:
            if verbose:
                logger.warning("Response is not valid. Regenerating response.")

            system_message = (
                "Based on the validation result, generate a new response that "
                "correctly adheres to the given schema or instructions."
            )
            user_message = f"Original prompt: {prompt}\n\nSchema/Instructions: {schema}\n\nGenerate a valid response:"

            regenerated = completion_client.completion(
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": user_message},
                ],
                model=model,
                response_model=RegeneratedResponse,
                mode=mode,
                max_retries=max_retries,
                temperature=temperature,
            )
            result = regenerated

    elif process == "fact_check":
        if responses is None:
            responses = [prompt]  # Use the prompt as the response for fact-checking
        elif len(responses) != 1:
            raise ValueError("Fact-check requires exactly one response.")

        system_message = (
            "You are a fact-checking expert. Your task is to determine if the given response "
            "is accurate based on the prompt and your knowledge. Provide a detailed explanation "
            "of your fact-checking process, state whether the response is accurate or not, "
            "and provide a confidence score between 0.0 and 1.0."
        )
        user_message = f"Prompt: {prompt}\n\nResponse to fact-check: {responses[0]}"
        if schema:
            user_message += f"\n\nAdditional fact-checking guidelines: {schema}"

        result = completion_client.completion(
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message},
            ],
            model=model,
            response_model=FactCheckResult,
            mode=mode,
            max_retries=max_retries,
            temperature=temperature,
        )

        if regenerate and not result.is_accurate:
            if verbose:
                logger.warning("Response is not accurate. Regenerating response.")

            system_message = (
                "Based on the fact-check result, generate a new response that "
                "is accurate and addresses the original prompt correctly."
            )
            user_message = f"Original prompt: {prompt}\n\nFact-check result: {result.explanation}\n\nGenerate an accurate response:"

            regenerated = completion_client.completion(
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": user_message},
                ],
                model=model,
                response_model=RegeneratedResponse,
                mode=mode,
                max_retries=max_retries,
                temperature=temperature,
            )
            result = regenerated

    else:
        raise ValueError(
            "Invalid process. Choose 'accuracy', 'validate', or 'fact_check'."
        )

    # Add guardrails check after the main process
    if guardrails:
        guardrails_result = check_guardrails(
            prompt,
            result,
            guardrails,
            completion_client,
            model,
            mode,
            max_retries,
            temperature,
            verbose,
        )
        if not guardrails_result.passed:
            if verbose:
                logger.warning("Response violates guardrails. Regenerating response.")
            result = regenerate_response(
                prompt,
                guardrails_result.explanation,
                completion_client,
                model,
                mode,
                max_retries,
                temperature,
            )

    return result