fix(semantic highlighter): add vector similarity queries and bbq_disk support #138140
Conversation
|
Hi @mromaios, I've created a changelog YAML for you. |
...rence/src/main/java/org/elasticsearch/xpack/inference/highlight/SemanticTextHighlighter.java
Outdated
Show resolved
Hide resolved
...rence/src/main/java/org/elasticsearch/xpack/inference/highlight/SemanticTextHighlighter.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/elasticsearch/xpack/inference/highlight/SemanticTextHighlighterTests.java
Show resolved
Hide resolved
|
Pinging @elastic/search-relevance (Team:Search - Relevance) |
…om:mromaios/elasticsearch into fix_min_similarity_in_semantic_highlighter
Mikep86
left a comment
There was a problem hiding this comment.
Looks good overall, I like the approach! I think this is good to go with a little more work on the tests 🚀
| @Override | ||
| public void visitLeaf(Query query) { | ||
| private void visitLeaf(Query query, Float similarity) { | ||
| if (query instanceof KnnFloatVectorQuery knnQuery) { | ||
| queries.add(fieldType.createExactKnnQuery(VectorData.fromFloats(knnQuery.getTargetCopy()), null)); | ||
| queries.add(fieldType.createExactKnnQuery(VectorData.fromFloats(knnQuery.getTargetCopy()), similarity)); | ||
| } else if (query instanceof KnnByteVectorQuery knnQuery) { | ||
| queries.add(fieldType.createExactKnnQuery(VectorData.fromBytes(knnQuery.getTargetCopy()), null)); | ||
| queries.add(fieldType.createExactKnnQuery(VectorData.fromBytes(knnQuery.getTargetCopy()), similarity)); | ||
| } else if (query instanceof MatchAllDocsQuery) { | ||
| queries.add(new MatchAllDocsQuery()); | ||
| } else if (query instanceof DenseVectorQuery.Floats floatsQuery) { | ||
| queries.add(fieldType.createExactKnnQuery(VectorData.fromFloats(floatsQuery.getQuery()), null)); | ||
| queries.add(fieldType.createExactKnnQuery(VectorData.fromFloats(floatsQuery.getQuery()), similarity)); | ||
| } else if (query instanceof IVFKnnFloatVectorQuery ivfQuery) { | ||
| queries.add(fieldType.createExactKnnQuery(VectorData.fromFloats(ivfQuery.getQuery()), similarity)); | ||
| } else if (query instanceof RescoreKnnVectorQuery rescoreQuery) { | ||
| visitLeaf(rescoreQuery.innerQuery(), similarity); | ||
| } else if (query instanceof VectorSimilarityQuery similarityQuery) { | ||
| visitLeaf(similarityQuery.getInnerKnnQuery(), similarityQuery.getSimilarity()); | ||
| } | ||
| } | ||
|
|
||
| @Override | ||
| public void visitLeaf(Query query) { | ||
| visitLeaf(query, null); | ||
|
|
||
| } |
There was a problem hiding this comment.
Nice use of recursion to solve the problem here ❤️
.../src/test/java/org/elasticsearch/xpack/inference/highlight/SemanticTextHighlighterTests.java
Outdated
Show resolved
Hide resolved
| // TODO: figure out why this fails | ||
| // @SuppressWarnings("unchecked") | ||
| // public void testDenseVectorWithDiscBBQ() throws Exception { | ||
| // var mapperService = createDefaultMapperService(useLegacyFormat); | ||
| // Map<String, Object> queryMap = (Map<String, Object>) queries.get("dense_vector_1"); | ||
| // float[] vector = readDenseVector(queryMap.get("embeddings")); | ||
| // var fieldType = (SemanticTextFieldMapper.SemanticTextFieldType) | ||
| // mapperService.mappingLookup().getFieldType(SEMANTIC_FIELD_E5_DISK_BBQ); | ||
| // | ||
| // KnnVectorQueryBuilder knnQuery = new KnnVectorQueryBuilder( | ||
| // fieldType.getEmbeddingsField().fullPath(), | ||
| // vector, | ||
| // 10, | ||
| // 10, | ||
| // 10f, | ||
| // null, | ||
| // null | ||
| // ); | ||
| // NestedQueryBuilder nestedQueryBuilder = new NestedQueryBuilder(fieldType.getChunksField().fullPath(), knnQuery, ScoreMode.Max); | ||
| // var shardRequest = createShardSearchRequest(nestedQueryBuilder); | ||
| // var sourceToParse = new SourceToParse("0", readSampleDoc(useLegacyFormat), XContentType.JSON); | ||
| // | ||
| // String[] expectedPassages = ((List<String>) queryMap.get("expected_with_similarity_threshold")).toArray(String[]::new); | ||
| // assertHighlightOneDoc( | ||
| // mapperService, | ||
| // shardRequest, | ||
| // sourceToParse, | ||
| // SEMANTIC_FIELD_E5_DISK_BBQ, | ||
| // expectedPassages.length, | ||
| // HighlightBuilder.Order.SCORE, | ||
| // expectedPassages | ||
| // ); | ||
| // } |
There was a problem hiding this comment.
Any insights on the problems with this case?
There was a problem hiding this comment.
Yep, me 😓🤦
Sooooo, of course we don't create embeddings in the UT, but read them from the gzipped file.
var sourceToParse = new SourceToParse("0", readSampleDoc(useLegacyFormat), XContentType.JSON);
which I hadn't updated to include the new field.
kderusso
left a comment
There was a problem hiding this comment.
Looks good! On top of Mike's comments, I would love to see a yaml test added under 90_semantic_text_highlighter.yml as well as the associated BWC. Remember that for yaml tests we'll also need to make sure to create a feature/capability for Green CI 👍
…ec/test/inference/90_semantic_text_highlighter_bwc.yml Co-authored-by: Kathleen DeRusso <kathleen.derusso@elastic.co>
💔 Backport failedYou can use sqren/backport to manually backport by running |
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
… support (elastic#138140) (cherry picked from commit 8603933) # Conflicts: # x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/InferenceFeatures.java
… support (elastic#138140) (cherry picked from commit 8603933) # Conflicts: # x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/InferenceFeatures.java # x-pack/plugin/inference/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml
… support (elastic#138140) (cherry picked from commit 8603933) # Conflicts: # server/src/main/java/org/elasticsearch/search/vectors/IVFKnnFloatVectorQuery.java # x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/InferenceFeatures.java # x-pack/plugin/inference/src/yamlRestTest/resources/rest-api-spec/test/inference/90_semantic_text_highlighter.yml
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
Closes: #136056
This PR improves the handling of vector based queries in the SemanticTextHighlighter.
VectorSimilarityQuery(e.g. when asimilaritythreshold is set in aknnquery) as well as vector queries on abbq_diskdense vector type.