Summary
Add a built-in LLM-based check that evaluates whether the retrieved context is relevant to the user question.
Motivation
Context relevance is a standard RAG evaluation metric. The RAG evaluation use case in the Giskard docs shows it as a custom implementation; a built-in check lowers the adoption barrier.
Edge cases
Multi-turn awareness: For dialogues, the last user message is often underspecified or refers to earlier turns. If the judge only sees the final exchange, it can mis-score relevance.
Example:
| Turn | User | Assistant | Context |
|------|------|-----------|---------|
| 1 | What is Giskard checks? | Giskard checks is ... | [Document about Giskard checks] |
| 2 | How do I install it? | ... | [pip install giskard-checks...] |
Judging Turn 2: the judge should recognize that "it" refers to Giskard checks. If the retrieved context is about pip install giskard-checks, it should be scored as relevant.
Evaluation scope: again for multi-turn dialogue, we should only evaluate the designated part of the conversation. If a prior message was irrelevant, it should not affect the result.
Example:
| Turn | User | Assistant |
|------|------|-----------|
| 1 | What is the best language? | You should try to cook lasagna |
| 2 | Is Python a language or an animal? | It's both |
Judging turn 2 should find the answer relevant, regardless of the irrelevant response produced in the previous turn.
Implementation Guide
- Context for the judge: the prompt must include the full trace, the specific query, and the retrieved_context (often a list of strings); a payload sketch follows this list.
- Domain context: optional context input providing high-level system behavior (e.g., "This bot only retrieves medical documentation").
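To make the input contract concrete, here is a rough sketch of the payload the judge could receive; all field names are illustrative assumptions, not the final API:

```python
# Hypothetical shape of the data handed to the judge prompt.
# All field names are illustrative assumptions, not the final API.
judge_payload = {
    "history": [  # full trace, every turn before the one under evaluation
        {"user": "What is Giskard checks?", "assistant": "Giskard checks is ..."},
    ],
    "query": "How do I install it?",  # the current user message
    "retrieved_context": [  # often a list of strings, sometimes a single string
        "pip install giskard-checks ...",
    ],
    "domain_context": "This bot only retrieves medical documentation",  # optional
}
```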
Steps
- Template: src/giskard/checks/prompts/judges/context_relevance.j2 (a content sketch follows these sub-items)
  - Inputs: conversation history, current query, retrieved context.
  - Task: Does the retrieved context contain the information necessary to answer the current query?
  - Consider: information density, presence of "noise," and query disambiguation via history.
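A minimal content sketch for the template, covering the inputs and task above; it is written as a Python string for readability, and the variable names (history, query, context_chunks, domain_context) and exact wording are assumptions, not the final template:

```python
# Hypothetical draft of context_relevance.j2, embedded as a Python string.
# Variable names and phrasing are assumptions for illustration only.
CONTEXT_RELEVANCE_TEMPLATE = """\
You are judging whether retrieved context is relevant to a user query.
{% if domain_context %}System description: {{ domain_context }}{% endif %}

Conversation history (use it only to disambiguate the current query):
{% for turn in history %}
User: {{ turn.inputs }}
Assistant: {{ turn.outputs }}
{% endfor %}

Current query: {{ query }}

Retrieved context:
{% for chunk in context_chunks %}
- {{ chunk }}
{% endfor %}

Does the retrieved context contain the information necessary to answer the
current query? Resolve pronouns such as "it" using the history. Tolerate some
noise, but score context that is entirely off-topic as not relevant.
Answer with a verdict (relevant / not relevant) and a short justification.
"""
```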
- Check: src/giskard/checks/judges/context_relevance.py (a class sketch follows these sub-items)
  - Subclass BaseLLMCheck, register as "context_relevance".
  - Support:
    - query: str | None = None
    - query_key: JSONPathStr = 'trace.last.inputs'
    - context_key: JSONPathStr = 'trace.last.metadata.context'
    - history: JSONPathStr = 'trace.interaction[:-1]'
    - domain_context: str | None = None (optional domain description)
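A minimal sketch of the check built from the field list above, assuming Python 3.10+; the BaseLLMCheck integration is indicated only in comments, and the JSONPathStr alias and normalize_context helper are assumptions for illustration, not the actual giskard-checks API:

```python
# Sketch only: the real class would subclass BaseLLMCheck and be registered
# under the name "context_relevance"; the pieces below are assumptions.
from dataclasses import dataclass

JSONPathStr = str  # assumed alias; the real type presumably validates JSONPath syntax


@dataclass
class ContextRelevance:
    query: str | None = None
    query_key: JSONPathStr = "trace.last.inputs"
    context_key: JSONPathStr = "trace.last.metadata.context"
    history: JSONPathStr = "trace.interaction[:-1]"
    domain_context: str | None = None

    @staticmethod
    def normalize_context(context: str | list[str]) -> list[str]:
        # The value at context_key may be a single string or a list of
        # strings; normalize to a list so the prompt can iterate over chunks.
        return [context] if isinstance(context, str) else list(context)
```

Keeping the list-vs-string normalization in one helper makes the list-handling test below easy to target in isolation.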
- Tests must include (a test sketch follows this list):
  - Standard RAG: a relevant chunk passes; an irrelevant chunk fails.
  - List handling: ensure it correctly processes a list of strings vs. a single string.
  - Multi-turn: the "How do I install it?" scenario described above.
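As a starting point, here is a pytest-style sketch of the multi-turn case, reusing the Scenario builder from the Example usage section below; run() and the passed attribute are assumed names for executing a scenario and reading a check result:

```python
# Hypothetical test sketch: assumes the Scenario builder shown under
# "Example usage"; run() and `passed` are assumed, not confirmed API.
from giskard.checks import ContextRelevance, Scenario


def test_multi_turn_pronoun_resolution():
    # "it" in turn 2 refers to Giskard checks from turn 1, so the judge must
    # read the history to see that the install context is relevant.
    scenario = (
        Scenario(name="multi_turn_context_relevance")
        .interact(
            inputs="What is Giskard checks?",
            outputs="Giskard checks is ...",
            metadata={"context": ["Document about Giskard checks"]},
        )
        .interact(
            inputs="How do I install it?",
            metadata={"context": ["pip install giskard-checks ..."]},
        )
        .check(ContextRelevance())
    )
    result = scenario.run()
    assert result.passed
```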
Example usage
```python
from giskard.checks import ContextRelevance, Scenario

scenario = (
    Scenario(name="retrieval_quality")
    .interact(
        inputs="What is Python?",
        outputs="Python is a language.",
        metadata={"context": ["Python is high-level..."]},
    )
    .interact(
        inputs="How do I install it?",
        metadata={"context": ["To install Python, use pyenv..."]},
    )
    .check(ContextRelevance())
)
```
Acceptance Criteria