Summary
Add a built-in LLM-based check that evaluates whether the retrieved context is relevant to the user question.
Motivation
Context relevance is a standard RAG evaluation metric. The RAG evaluation use case in the Giskard docs shows it as a custom implementation; a built-in check lowers the adoption barrier.
Edge cases
Multi-turn awareness: For dialogues, the last user message is often underspecified or refers to earlier turns. If the judge only sees the final exchange, it can mis-score relevance.
Example:
| Turn | User | Assistant | Context |
|------|------|-----------|---------|
| 1 | What is Giskard checks? | Giskard checks is ... | [Document about Giskard checks] |
| 2 | How do I install it? | ... | [pip install giskard-checks...] |
Judging Turn 2: the judge should recognize that "it" refers to Giskard checks. If the retrieved context is about pip install giskard-checks, it should be scored as relevant.
Evaluation scope: again for multi-turn dialogue, we should only evaluate the designated part of the conversation. If a prior message was irrelevant, it should not affect the result.
Example:
| Turn | User | Assistant |
|------|------|-----------|
| 1 | What is the best language? | You should try to cook lasagna |
| 2 | Is Python a language or an animal? | It's both |
Judging turn 2 should find the answer relevant, regardless of the irrelevant response produced in the previous turn.
Implementation Guide
- Context for the judge: the prompt must include the full trace, the specific query, and the retrieved_context (often a list of strings); a payload sketch follows this list.
- Domain context: optional context input providing high-level system behavior (e.g., "This bot only retrieves medical documentation").
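To make the input contract concrete, here is a rough sketch of the payload the judge could receive; all field names are illustrative assumptions, not the final API:

```python
# Hypothetical shape of the data handed to the judge prompt.
# All field names are illustrative assumptions, not the final API.
judge_payload = {
    "history": [  # full trace, every turn before the one under evaluation
        {"user": "What is Giskard checks?", "assistant": "Giskard checks is ..."},
    ],
    "query": "How do I install it?",  # the current user message
    "retrieved_context": [  # often a list of strings, sometimes a single string
        "pip install giskard-checks ...",
    ],
    "domain_context": "This bot only retrieves medical documentation",  # optional
}
```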
Steps
- Template: src/giskard/checks/prompts/judges/context_relevance.j2 (a content sketch follows these sub-items)
  - Inputs: conversation history, current query, retrieved context.
  - Task: Does the retrieved context contain the information necessary to answer the current query?
  - Consider: information density, presence of "noise," and query disambiguation via history.
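A minimal content sketch for the template, covering the inputs and task above; it is written as a Python string for readability, and the variable names (history, query, context_chunks, domain_context) and exact wording are assumptions, not the final template:

```python
# Hypothetical draft of context_relevance.j2, embedded as a Python string.
# Variable names and phrasing are assumptions for illustration only.
CONTEXT_RELEVANCE_TEMPLATE = """\
You are judging whether retrieved context is relevant to a user query.
{% if domain_context %}System description: {{ domain_context }}{% endif %}

Conversation history (use it only to disambiguate the current query):
{% for turn in history %}
User: {{ turn.inputs }}
Assistant: {{ turn.outputs }}
{% endfor %}

Current query: {{ query }}

Retrieved context:
{% for chunk in context_chunks %}
- {{ chunk }}
{% endfor %}

Does the retrieved context contain the information necessary to answer the
current query? Resolve pronouns such as "it" using the history. Tolerate some
noise, but score context that is entirely off-topic as not relevant.
Answer with a verdict (relevant / not relevant) and a short justification.
"""
```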
- Check: src/giskard/checks/judges/context_relevance.py (a class sketch follows these sub-items)
  - Subclass BaseLLMCheck, register as "context_relevance".
  - Support:
    - query: str | None = None
    - query_key: JSONPathStr = 'trace.last.inputs'
    - context_key: JSONPathStr = 'trace.last.metadata.context'
    - history: JSONPathStr = 'trace.interaction[:-1]'
    - domain_context: str | None = None (optional domain description)
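A minimal sketch of the check built from the field list above, assuming Python 3.10+; the BaseLLMCheck integration is indicated only in comments, and the JSONPathStr alias and normalize_context helper are assumptions for illustration, not the actual giskard-checks API:

```python
# Sketch only: the real class would subclass BaseLLMCheck and be registered
# under the name "context_relevance"; the pieces below are assumptions.
from dataclasses import dataclass

JSONPathStr = str  # assumed alias; the real type presumably validates JSONPath syntax


@dataclass
class ContextRelevance:
    query: str | None = None
    query_key: JSONPathStr = "trace.last.inputs"
    context_key: JSONPathStr = "trace.last.metadata.context"
    history: JSONPathStr = "trace.interaction[:-1]"
    domain_context: str | None = None

    @staticmethod
    def normalize_context(context: str | list[str]) -> list[str]:
        # The value at context_key may be a single string or a list of
        # strings; normalize to a list so the prompt can iterate over chunks.
        return [context] if isinstance(context, str) else list(context)
```

Keeping the list-vs-string normalization in one helper makes the list-handling test below easy to target in isolation.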
- Tests must include (a test sketch follows this list):
  - Standard RAG: a relevant chunk passes; an irrelevant chunk fails.
  - List handling: ensure it correctly processes a list of strings vs. a single string.
  - Multi-turn: the "How do I install it?" scenario described above.
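As a starting point, here is a pytest-style sketch of the multi-turn case, reusing the Scenario builder from the Example usage section below; run() and the passed attribute are assumed names for executing a scenario and reading a check result:

```python
# Hypothetical test sketch: assumes the Scenario builder shown under
# "Example usage"; run() and `passed` are assumed, not confirmed API.
from giskard.checks import ContextRelevance, Scenario


def test_multi_turn_pronoun_resolution():
    # "it" in turn 2 refers to Giskard checks from turn 1, so the judge must
    # read the history to see that the install context is relevant.
    scenario = (
        Scenario(name="multi_turn_context_relevance")
        .interact(
            inputs="What is Giskard checks?",
            outputs="Giskard checks is ...",
            metadata={"context": ["Document about Giskard checks"]},
        )
        .interact(
            inputs="How do I install it?",
            metadata={"context": ["pip install giskard-checks ..."]},
        )
        .check(ContextRelevance())
    )
    result = scenario.run()
    assert result.passed
```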
Example usage
```python
from giskard.checks import ContextRelevance, Scenario

scenario = (
    Scenario(name="retrieval_quality")
    .interact(
        inputs="What is Python?",
        outputs="Python is a language.",
        metadata={"context": ["Python is high-level..."]},
    )
    .interact(
        inputs="How do I install it?",
        metadata={"context": ["To install Python, use pyenv..."]},
    )
    .check(ContextRelevance())
)
```
Acceptance Criteria