Handle empty input inference#123763
Conversation
|
@elasticmachine update branch |
Mikep86
left a comment
There was a problem hiding this comment.
Good start to this. Can you add a highlighter test where empty input is part of multi-chunk input (ex: ["some test data", " ", "now with chunks"])? The highlighter uses the offsets at query time to recreate the chunks, so this is a good test that the chunk offsets are valid end-to-end.
...nference/src/main/java/org/elasticsearch/xpack/inference/mapper/SemanticTextFieldMapper.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...rence/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference.yml
Show resolved
Hide resolved
|
@elasticmachine update branch |
...nce/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml
Show resolved
Hide resolved
|
Pinging @elastic/search-eng (Team:SearchOrg) |
|
Pinging @elastic/search-relevance (Team:Search - Relevance) |
|
Hi @Samiul-TheSoccerFan, I've created a changelog YAML for you. |
| "Empty semantic_text field skips embedding generation": | ||
| - requires: | ||
| cluster_features: "semantic_text.handle_empty_input" | ||
| reason: skips generating embeddings when semantic_text field is contains empty or whitespace only input |
There was a problem hiding this comment.
Nitpick: We usually put the reason as when the fix was introduced e.g. 8.19.
There was a problem hiding this comment.
Do we only mention 8.19 or we should mention 9.1.0 as well?
Skips embedding generation when semantic_text is empty or contains only whitespace, effective from 8.19 and 9.1.0.. How about this one?
Mikep86
left a comment
There was a problem hiding this comment.
Great iterations on this 🙌 ! This is coming along nicely. There's a bug when using the legacy format, which you can replicate with:
PUT test-index
{
"settings": {
"index.mapping.semantic_text.use_legacy_format": true
},
"mappings": {
"properties": {
"inference": {
"type": "semantic_text"
}
}
}
}
POST test-index/_doc/1
{
"inference": " "
}
GET test-index/_doc/1
You get a parsing error, when you should get:
{
"_index": "test-index",
"_id": "1",
"_version": 1,
"_seq_no": 0,
"_primary_term": 1,
"found": true,
"_source": {
"inference": {
"text": " ",
"inference": {
"inference_id": ".elser-2-elasticsearch",
"model_settings": null,
"chunks": []
}
}
}
}
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...ain/java/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilter.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...nce/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml
Show resolved
Hide resolved
...nce/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml
Outdated
Show resolved
Hide resolved
...rence/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference.yml
Show resolved
Hide resolved
...nce/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml
Show resolved
Hide resolved
|
@elasticmachine update branch |
Mikep86
left a comment
There was a problem hiding this comment.
Great work, this is very close. Just a few things to clean up before this is good to merge :)
...ava/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
Outdated
Show resolved
Hide resolved
...e/src/yamlRestTest/resources/rest-api-spec/test/inference/30_semantic_text_inference_bwc.yml
Outdated
Show resolved
Hide resolved
...src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter_bwc.yml
Show resolved
Hide resolved
Mikep86
left a comment
There was a problem hiding this comment.
LGTM, thanks for the iterations!
💚 Backport successful
|
* Added check for blank string to skip generating embeddings with unit test * Adding yaml tests for skipping embedding generation * dynamic update not required if model_settings stays null * Updating node feature for handling empty input name and description * Update yaml tests with refresh=true * Update unit test to follow more accurate behavior * Added yaml tests for multu chunks * [CI] Auto commit changes from spotless * Adding highlighter yaml tests for empty input * Update docs/changelog/123763.yaml * Update changelog and test reason to have more polished documentation * adding input value into the response source and fixing unit tests by reformating * Adding highligher test for backward compatibility and refactor existing test * Added bwc tests for empty input and multi chunks * Removed reindex for empty input from bwc * [CI] Auto commit changes from spotless * Fixing yaml test * Update unit tests helper function to support both format * [CI] Auto commit changes from spotless * Adding cluster features for bwc * Centralize logic for assertInference helper --------- Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com> Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
* Added check for blank string to skip generating embeddings with unit test * Adding yaml tests for skipping embedding generation * dynamic update not required if model_settings stays null * Updating node feature for handling empty input name and description * Update yaml tests with refresh=true * Update unit test to follow more accurate behavior * Added yaml tests for multu chunks * [CI] Auto commit changes from spotless * Adding highlighter yaml tests for empty input * Update docs/changelog/123763.yaml * Update changelog and test reason to have more polished documentation * adding input value into the response source and fixing unit tests by reformating * Adding highligher test for backward compatibility and refactor existing test * Added bwc tests for empty input and multi chunks * Removed reindex for empty input from bwc * [CI] Auto commit changes from spotless * Fixing yaml test * Update unit tests helper function to support both format * [CI] Auto commit changes from spotless * Adding cluster features for bwc * Centralize logic for assertInference helper --------- Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com> Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
* Added check for blank string to skip generating embeddings with unit test * Adding yaml tests for skipping embedding generation * dynamic update not required if model_settings stays null * Updating node feature for handling empty input name and description * Update yaml tests with refresh=true * Update unit test to follow more accurate behavior * Added yaml tests for multu chunks * [CI] Auto commit changes from spotless * Adding highlighter yaml tests for empty input * Update docs/changelog/123763.yaml * Update changelog and test reason to have more polished documentation * adding input value into the response source and fixing unit tests by reformating * Adding highligher test for backward compatibility and refactor existing test * Added bwc tests for empty input and multi chunks * Removed reindex for empty input from bwc * [CI] Auto commit changes from spotless * Fixing yaml test * Update unit tests helper function to support both format * [CI] Auto commit changes from spotless * Adding cluster features for bwc * Centralize logic for assertInference helper --------- Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com> Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
This PR skips embedding generation when a sematic_text field is empty or with whitespace only.
doc2anddoc3will return without_inference_fieldsMulti field
doc1anddoc2will return without_inference_fields