LLM as a Judge
The judge() function is a multi-purpose tool that can be used for:
- Hallucination Detection
- Schema Validation
- Multi-Response Accuracy
- Guardrails Enforcement
- Response Regeneration
Using the judge() function
Fact Checking
For most quick fact-checking tasks, the judge() function can be used without any verifier schema or instructions.
import zyx

fact = "The capital of France is India"

zyx.judge(
    fact,
    process="fact_check",
    model="gpt-4o-mini"
)
FactCheckResult(
    is_accurate=False,
    explanation='The statement that the capital of France is India is incorrect. The capital of France is Paris, not India. India is a country in South Asia, and it has its own capital, which is New Delhi.',
    confidence=1.0
)
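Because the result is a structured object, its fields can be checked programmatically. A minimal sketch of how the verdict might be used downstream, assuming the field names shown in the output above (is_accurate, confidence, explanation):

import zyx

fact = "The capital of France is India"

# Field names are taken from the example output above.
result = zyx.judge(
    fact,
    process="fact_check",
    model="gpt-4o-mini"
)

# Flag statements that are inaccurate or judged with low confidence.
if not result.is_accurate or result.confidence < 0.8:
    print(f"Flagged: {result.explanation}")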
Accuracy Judgment
You can use the judge() function to compare multiple responses to a given prompt and determine which one is the most accurate, helpful, and relevant.
import zyx

prompt = "Explain the theory of relativity"
responses = [
    "The theory of relativity describes how space and time are interconnected.",
    "Einstein's theory states that E=mc^2, which relates energy and mass."
]

zyx.judge(
    prompt,
    responses=responses,
    process="accuracy"
)
JudgmentResult(
    explanation='Response 1 provides a broader understanding of the theory of relativity by mentioning the interconnection of space and time, which is fundamental to both the special and general theories of relativity. Response 2 focuses on the famous equation E=mc^2, which is a key result of the theory but does not provide a comprehensive explanation of the overall theory. Therefore, while both responses are accurate, Response 1 is more helpful and relevant as it captures a core aspect of the theory.',
    verdict='Response 1 is the most accurate, helpful, and relevant.'
)
Schema Validation
The judge() function can also be used to validate responses against a predefined schema or set of criteria.
import zyx

prompt = "Describe the water cycle"
response = "Water evaporates, forms clouds, and then falls as rain."
schema = "The response should include: 1) Evaporation, 2) Condensation, 3) Precipitation, 4) Collection"

result = zyx.judge(
    prompt,
    responses=response,
    process="validate",
    schema=schema
)
print(result)
ValidationResult(
    is_valid=False,
    explanation='The response does not include all required components of the water cycle as outlined in the schema. Specifically, it mentions evaporation and precipitation, but it fails to mention condensation and collection.'
)
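Validation can also be combined with regeneration, so that a failing response is rewritten to satisfy the schema. A hedged sketch, assuming regenerate=True (documented in the API Reference below) applies to process="validate" as well:

import zyx

prompt = "Describe the water cycle"
response = "Water evaporates, forms clouds, and then falls as rain."
schema = "The response should include: 1) Evaporation, 2) Condensation, 3) Precipitation, 4) Collection"

# Assumption: regenerate=True triggers a rewrite when validation fails,
# mirroring the accuracy-based regeneration shown in the next section.
result = zyx.judge(
    prompt,
    responses=response,
    process="validate",
    schema=schema,
    regenerate=True
)
print(result)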
Response Regeneration
An important capability of this module is its ability to regenerate a corrected response when the original is judged inaccurate or incomplete. This is useful for producing high-quality responses for a given prompt based on the schema or instructions provided.
import zyx

prompt = "Explain photosynthesis"
responses = [
    "Photosynthesis is how plants make food.",
    "Plants use sunlight to convert CO2 and water into glucose and oxygen."
]

regenerated_response = zyx.judge(
    prompt,
    responses=responses,
    process="accuracy",
    regenerate=True,
    verbose=True
)
print(regenerated_response)
[09/29/24 00:42:45] INFO judge - judge - Judging responses for prompt: Explain photosynthesis
...
[09/29/24 00:42:47] WARNING judge - judge - Response is not accurate. Regenerating response.
...
RegeneratedResponse(
    response='Photosynthesis is a biochemical process used by plants...'
)
Guardrails
The last major capability of the judge() function is enforcing guardrails on generated responses. This helps ensure that responses are accurate, relevant, and appropriate for the given prompt.
If a response violates guardrails, it will always be regenerated.
import zyx

prompt = "Describe the benefits of exercise"
responses = ["Exercise helps you lose weight and build muscle."]

guardrails = [
    "Ensure the response mentions mental health benefits.",
    "Include at least three distinct benefits of exercise.",
    "Avoid focusing solely on physical appearance."
]

result = zyx.judge(
    prompt,
    responses=responses,
    process="accuracy",
    guardrails=guardrails,
    verbose=True
)
print(result)
[09/29/24 00:50:30] INFO judge - judge - Judging responses for prompt: Describe the benefits of exercise
...
[09/29/24 00:50:33] WARNING judge - judge - Response violates guardrails. Regenerating response.
...
RegeneratedResponse(
    response="Exercise offers a multitude of benefits that extend beyond..."
)
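The guardrails parameter also accepts a single string (its type is Optional[Union[str, List[str]]] in the parameter table below), which is convenient when only one rule is needed. A brief sketch under that reading:

import zyx

# A single guardrail passed as a plain string rather than a list.
result = zyx.judge(
    "Describe the benefits of exercise",
    responses=["Exercise helps you lose weight and build muscle."],
    process="accuracy",
    guardrails="Mention at least one mental health benefit."
)
print(result)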
API Reference
Judge responses based on accuracy, validate against a schema, or fact-check a single response, with an option to regenerate an optimized response.
Example
>>> judge(
        prompt="Explain the concept of quantum entanglement.",
        responses=[
            "Quantum entanglement is a phenomenon where two particles become interconnected and their quantum states cannot be described independently.",
            "Quantum entanglement is when particles are really close to each other and move in the same way."
        ],
        process="accuracy",
        verbose=True
    )
Accuracy Judgment:
    Explanation: The first response is more accurate as it provides a clear definition of quantum entanglement.
    Verdict: The first response is the most accurate.

Validation Result:
    Is Valid: True
    Explanation: The response adheres to the provided schema.

Fact-Check Result:
    Is Accurate: True
    Explanation: The response accurately reflects the fact that quantum entanglement occurs when two particles are separated by a large distance but still instantaneously affect each other's quantum states.
    Confidence: 0.95

Regenerated Response:
    Response: Quantum entanglement is a phenomenon where two particles become interconnected and their quantum states cannot be described independently.
Parameters:

Name | Type | Description | Default
---|---|---|---
prompt | str | The original prompt or question. | required
responses | List[str] | List of responses to judge, validate, or fact-check. | None
process | Literal['accuracy', 'validate', 'fact_check'] | The type of verification to perform. | 'accuracy'
schema | Optional[Union[str, dict]] | Schema for validation or fact-checking (optional for fact_check). | None
regenerate | bool | Whether to regenerate an optimized response. | False
model | str | The model to use for judgment. | 'gpt-4o-mini'
api_key | Optional[str] | API key for the LLM service. | None
base_url | Optional[str] | Base URL for the LLM service. | None
temperature | float | Temperature for response generation. | 0.7
mode | InstructorMode | Mode for the instructor. | 'markdown_json_mode'
max_retries | int | Maximum number of retries for API calls. | 3
organization | Optional[str] | Organization for the LLM service. | None
client | Optional[Literal['openai', 'litellm']] | Client to use for API calls. | None
verbose | bool | Whether to log verbose output. | False
guardrails | Optional[Union[str, List[str]]] | Guardrails for content moderation. | None
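The connection-related parameters make it possible to point the judge at any OpenAI-compatible endpoint. A minimal sketch using only parameters from the table above; the endpoint, key, and tuning values below are placeholders, not recommendations:

import zyx

result = zyx.judge(
    "Explain the greenhouse effect",
    responses=["Greenhouse gases trap heat in the atmosphere."],
    process="accuracy",
    model="gpt-4o-mini",
    base_url="http://localhost:8000/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",               # placeholder key
    temperature=0.2,
    max_retries=3,
    client="openai",
    verbose=True
)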
Returns:

Type | Description
---|---
Union[JudgmentResult, ValidationResult, RegeneratedResponse, FactCheckResult] | The result of the judgment, validation, fact-check, or regeneration.
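Because the return type depends on the process and on whether regeneration occurred, callers may want to branch on the concrete result. A sketch that avoids importing the result classes directly (their import path is not shown here) by checking the type name:

import zyx

result = zyx.judge(
    "Explain photosynthesis",
    responses=["Photosynthesis is how plants make food."],
    process="accuracy",
    regenerate=True
)

# Field names (response, verdict) are taken from the example outputs above.
if type(result).__name__ == "RegeneratedResponse":
    print(result.response)
else:
    print(result.verdict)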
Source code in zyx/resources/completions/agents/judge.py