Add contains_any and contains_all checksFeat/contains any all checks#2431
Add contains_any and contains_all checksFeat/contains any all checks#2431cosmicdk wants to merge 6 commits intoGiskard-AI:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces two new text matching checks, ContainsAny and ContainsAll, along with a shared base class ListStringMatching to support validation against lists of strings with Unicode normalization and case sensitivity. Comprehensive unit tests have been added to verify the new functionality. Feedback suggests aligning the default case_sensitive parameter with existing checks for consistency and optimizing the ContainsAll logic to identify matched and missing values in a single pass.
| description="Unicode normalization form to apply (NFC, NFD, NFKC, NFKD). Defaults to NFKC.", | ||
| ) | ||
| case_sensitive: bool = Field( | ||
| default=False, |
There was a problem hiding this comment.
The default value for case_sensitive is set to False, which is inconsistent with the existing StringMatching check in the same module (which defaults to True at line 222). To ensure a consistent user experience across similar text matching checks, it is recommended to align the default values unless there is a specific reason for this difference.
case_sensitive: bool = Field(
default=True,| missing_values = [ | ||
| value | ||
| for value, formatted_value in zip(self.values, formatted_values, strict=True) | ||
| if formatted_value not in formatted_text | ||
| ] | ||
| matched_values = [value for value in self.values if value not in missing_values] | ||
|
|
||
| details["matched_values"] = matched_values | ||
| details["missing_values"] = missing_values |
There was a problem hiding this comment.
The current implementation of ContainsAll.run performs two separate passes over the values and includes an O(N*M) check when calculating matched_values (where N is the number of values and M is the number of missing values). This can be optimized into a single pass that populates both lists simultaneously, improving performance and readability.
| missing_values = [ | |
| value | |
| for value, formatted_value in zip(self.values, formatted_values, strict=True) | |
| if formatted_value not in formatted_text | |
| ] | |
| matched_values = [value for value in self.values if value not in missing_values] | |
| details["matched_values"] = matched_values | |
| details["missing_values"] = missing_values | |
| matched_values = [] | |
| missing_values = [] | |
| for value, formatted_value in zip(self.values, formatted_values, strict=True): | |
| if formatted_value in formatted_text: | |
| matched_values.append(value) | |
| else: | |
| missing_values.append(value) | |
| details["matched_values"] = matched_values | |
| details["missing_values"] = missing_values |
Summary
Fixes #2361.
This PR adds two built-in text matching checks:
ContainsAnyregistered ascontains_anyContainsAllregistered ascontains_allThey allow users to validate whether an output contains at least one or all values from a provided list of strings.
Changes
giskard.checks.builtinandgiskard.checksValidation
tests/builtin/test_string_matching.py