
Add ContextRelevance check #2339

@kevinmessiaen

Description


Summary

Add a built-in LLM-based check that evaluates whether the retrieved context is relevant to the user question.

Motivation

Context relevance is a standard RAG evaluation metric. The RAG evaluation use case in the Giskard docs shows this as a custom implementation; a built-in check lowers the adoption barrier.

Edge case

Multi-turn awareness: For dialogues, the last user message is often underspecified or refers to earlier turns. If the judge only sees the final exchange, it can mis-score relevance.

Example:

| Turn | User | Assistant | Context |
|------|------|-----------|---------|
| 1 | What is Giskard checks? | Giskard checks is ... | [Document about Giskard checks] |
| 2 | How to install it | ... | [pip install giskard-checks...] |

Judging Turn 2: The judge should recognize that "it" refers to Giskard checks. If the retrieved context is about pip install giskard-checks, it should be scored as relevant.

Evaluation scope: Again for multi-turn dialogue, we should only evaluate the designated part of the conversation. If a prior message was irrelevant, it should not impact the result.

Example:

| Turn | User | Assistant |
|------|------|-----------|
| 1 | What is the best language? | You should try to cook lasagna |
| 2 | Is Python a language or an animal | It's both |

Judging turn 2 should return that the answer is relevant, regardless of the irrelevant answer produced in the previous turn.

Implementation Guide

  1. Context for the judge: The prompt must include the full Trace, the specific query, and the retrieved_context (often a list of strings).
  2. Domain context: Optional context input to provide high-level system behavior (e.g., "This bot only retrieves medical documentation").
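As a rough sketch of how the judge's context could be assembled from these inputs (the function name `build_judge_prompt` and the prompt wording are illustrative, not part of the proposal):

```python
def build_judge_prompt(history, query, retrieved_context, domain_context=None):
    """Assemble the judge's input: conversation history, the current query,
    the retrieved context (str or list[str]), and optional domain context."""
    # The issue notes retrieved_context is often a list of strings;
    # a bare string is treated as a one-element list.
    if isinstance(retrieved_context, str):
        retrieved_context = [retrieved_context]
    parts = []
    if domain_context:
        parts.append(f"Domain context: {domain_context}")
    if history:
        turns = "\n".join(
            f"User: {t['user']}\nAssistant: {t['assistant']}" for t in history
        )
        parts.append(f"Conversation so far:\n{turns}")
    parts.append(f"Current query: {query}")
    parts.append("Retrieved context:\n" + "\n---\n".join(retrieved_context))
    parts.append(
        "Does the retrieved context contain the information necessary "
        "to answer the current query?"
    )
    return "\n\n".join(parts)
```

Including the prior turns is what lets the judge resolve "it" in the multi-turn example above.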

Steps

  1. Template: src/giskard/checks/prompts/judges/context_relevance.j2
  • Inputs: conversation context (history), current query, retrieved context.
  • Task: Does the retrieved context contain information necessary to answer the current query?
  • Consider: Information density, presence of "noise," and query disambiguation via history.
  2. Check: src/giskard/checks/judges/context_relevance.py
  • Subclass BaseLLMCheck, register as "context_relevance".
  • Support:
    • query: str | None = None
    • query_key: JSONPathStr = 'trace.last.inputs'
    • context_key: JSONPathStr = 'trace.last.metadata.context'
    • history: JSONPathStr = 'trace.interaction[:-1]'
    • domain_context: str | None = None (Domain context)
  3. Tests must include:
  • Standard RAG: Relevant chunk passes; irrelevant chunk fails.
  • List handling: Ensure it correctly processes a list of strings vs. a single string.
  • Multi-turn: The "How do I install it?" scenario described above.

Example usage

from giskard.checks import ContextRelevance, Scenario

scenario = (
    Scenario(name="retrieval_quality")
    .interact(
        inputs="What is Python?", 
        outputs="Python is a language.",
        metadata={"context": ["Python is high-level..."]}
    )
    .interact(
        inputs="How do I install it?",
        metadata={"context": ["To install Python, use pyenv..."]}
    )
    .check(ContextRelevance())
)

Acceptance Criteria

  • Evaluates relevance of retrieved context to the final query using prior turns for disambiguation.
  • Supports JSONPath extraction for query and context (handling both str and list[str]).
  • Provides a clear reason (e.g., "Context contains installation instructions for the requested tool").
  • Tests cover: Relevant context; Irrelevant noise; Multi-turn pronoun resolution.
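The "both str and list[str]" requirement is small but easy to get wrong; a minimal normalization helper (the name `normalize_context` is illustrative) might look like:

```python
def normalize_context(value) -> list[str]:
    """Coerce the value extracted via JSONPath into a list of strings.

    Accepts a single string, a list/tuple of strings, or None (missing key),
    so downstream prompt-building can always iterate over chunks uniformly.
    """
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    raise TypeError(f"Unsupported context type: {type(value).__name__}")
```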
