Skip to content

Push vector similarity functions using BlockDocValuesReader#137002

Merged
carlosdelest merged 80 commits intoelastic:mainfrom
carlosdelest:non-issue/esql-push-vector-similarity-blockloaders-function-config
Oct 30, 2025
Merged

Push vector similarity functions using BlockDocValuesReader#137002
carlosdelest merged 80 commits intoelastic:mainfrom
carlosdelest:non-issue/esql-push-vector-similarity-blockloaders-function-config

Conversation

@carlosdelest
Copy link
Member

@carlosdelest carlosdelest commented Oct 23, 2025

Pushes down vector similarity functions, and applies them when loading vectors.

This avoids the overhead of loading vectors into blocks, and calculate the functions afterwards in the compute engine. This directly calculates the similarity functions when loading the field, and loads the results directly into a block.

This approach consists on:

  • A BlockLoaderFunctionConfig marker interface, that is available on the BlockLoaderContext. This interface will be implemented by config objects, that will be used by the field types to instantiate the appropriate BlockLoader based on this config.
  • A BlockLoaderExpression interface. Functions that implement this interface will provide a BlockLoaderFunctionConfig.
  • An EsFunctionField that extends EsField. This field has a BlockLoaderFunctionConfig, which is the function configuration to extract from the field.
  • A PushDownVectorSimilarityFunctions local logical optimization rule, that replaces a vector similarity function applied to a field with a FieldAttribute that contains an EsFunctionField that will be used to extract the function result at field loading.

I've used a local logical optimization rule to push down. It looks cleaner and we don't have to deal with different cases when the attribute is sent to the local nodes or stays in the coordinator. Plus, it can use the mapping information to decide if a function can be pushed down or not.

Related: #103636
Previous work done on: #136739

elasticsearchmachine added 23 commits October 14, 2025 20:25
…ector-similarity-blockloaders

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
…ector-similarity-blockloaders-function-config

# Conflicts:
#	x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/vector/VectorSimilarityFunctionsIT.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
#	x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizerTests.java
elasticsearchmachine added 4 commits October 23, 2025 08:53
…-similarity-blockloaders-function-config' into non-issue/esql-push-vector-similarity-blockloaders-function-config
Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I'm excited. Looks great. I'd love to make the rule more generic, but that should probably wait for a follow up.

You may want to look into my comments about canonicalize - that could make this a little easier to reason about and feel fairly cheap to handle now.

I think we want a review from @alex-spies before we get this in.

I left a comment about another kind of query on the test that I think is important to address too. It's cool if we don't optimize this case now, but we shouldn't break it.


@Override
public MappedFieldType.BlockLoaderFunction<DenseVectorFieldMapper.VectorSimilarityFunctionConfig> getBlockLoaderFunction() {
Literal literal = (Literal) (left() instanceof Literal ? left() : right());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this return null if neither are Literal? I feel like letting the functions themselves decide when they should push is the right thing to do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like, we'll push stuff like ST_CENTROID and LENGTH and those don't have any literal checking. We might push DATE_PARSE which has who knows what rules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait. Ignore this. Let's get it in a follow up when we do LENGTH or something. It'll be really obvious what to do then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment why we can assume one side to be a literal? This reads like it has to cause a CCE in some cases; but I assume this never really happens because PushDownVectorSimilarityFunctions is required for this code to be relevant, and that checks for literals? If that's true, maybe let's even link to PushDownVectorSimilarityFunctions for ease of reading :)


@Override
protected LogicalPlan rule(Eval eval) {
// TODO Extend to WHERE if necessary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd leave that for a follow up. When someone needs it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 053ea74

// Only replace if exactly one side is a literal and the other a field attribute
if ((similarityFunction.left() instanceof Literal ^ similarityFunction.right() instanceof Literal) == false) {
return similarityFunction;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can all deal with the new interface, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait. Let's just do that in a follow up. It's a good thing to do, but this feels like quite a good step.

@@ -210,6 +206,22 @@ public void close() {
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you implement canonicalize you can rotate Literals to the right hand side. Then you don't need all the xor fun. That feels like it'd be easier to manage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can swap the order of the parameters, right? That doesn't change the meaning for any of these functions, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, I'll do that - it's easier to reason as you say, and may benefit other cases we do optimizations for in the evaluator

| eval s = v_dot_product(dense_vector, [0.1, 0.2, 0.3])
| sort s desc
| keep s
""";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a query like

            from test
            | eval s = v_dot_product(dense_vector, [0.1, 0.2, 0.3]) * 2 + 12
            | sort s desc
            | keep s

Does this get rewritten now? It doesn't have to in this PR, but I feel like we'd want it to eventually. And we do have to make sure not to break it in this PR either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about:

            from test
            | eval s = v_dot_product(dense_vector, [0.1, 0.2, 0.3]) + v_dot_product(dense_vector, [0.3, 0.4, 0.5])
            | sort s desc
            | keep s

I'm not sure precisely how we'd want to push these, but again, it shouldn't break.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

            from test
            | eval s = v_dot_product(dense_vector1, [0.1, 0.2, 0.3]) + v_dot_product(dense_vector2, [0.3, 0.4, 0.5])
            | sort s desc
            | keep s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And do we need this for the non-sorted case too? I'm not sure. I didn't read closely enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all those cases are handled, but you're right in that I need to add tests for it. Basically, each function in an expression is replaced and pushed individually. But I need to make explicit tests for it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's tested in CSV tests and in optimizer tests in 11a6df1

// A null value on the left or right vector. Similarity is null
builder.appendNull();
continue;
} else if (dimsLeft != dimsRight) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dimensions are checked on the similarity implementation now

if (blockLoaderFunction == null) {
return new BlockDocValuesReader.DenseVectorBlockLoader(name(), dims, this);
}
if (SIMILARITY_FUNCTION_NAME.equals(blockLoaderFunction.name()) == false) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super fan of using a name to identify the function. I think we could get away with checking the type of the config 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I think you can just:

if (blockLoaderFunction == null) {
                    return new BlockDocValuesReader.DenseVectorBlockLoader(name(), dims, this);
}
if (blockLoaderFunction instanceof DenseVectorSimilarityFunctionBlockLoader config) {
                  return new BlockDocValuesReader.DenseVectorSimilarityFunctionBlockLoader(
}
                    throw new UnsupportedOperationException("Unknown block loader function: " + blockLoaderFunction);

The only thing is - we need to make sure there's a nice toString for the config.

@@ -210,6 +206,22 @@ public void close() {
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, I'll do that - it's easier to reason as you say, and may benefit other cases we do optimizations for in the evaluator

| eval s = v_dot_product(dense_vector, [0.1, 0.2, 0.3])
| sort s desc
| keep s
""";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all those cases are handled, but you're right in that I need to add tests for it. Basically, each function in an expression is replaced and pushed individually. But I need to make explicit tests for it

nik9000 added a commit that referenced this pull request Oct 28, 2025
Creates a `BlockLoader` for pushing the `LENGTH` function down into the
loader for `keyword` fields. It takes advantage of the terms dictionary
so we only need to calculate the code point count once per unique term
loaded.

This `BlockLoader` implementation isn't plugged into the infrastructure
for emitting it because we're waiting on the infrastructure we've
started in #137002. We'll make a follow up PR to plug this in.

We're doing this mostly to demonstrate another function that we can push
into field loading, in addition to the vector similarity functions we're
building in #137002. We don't expect `LENGTH` to be a super hot
function. If it happens to be then this'll help.

Before we plug this in we'll have to figure out emitting warnings from
functions that we've fused to field loading. Because `LENGTH` can emit a
warning, specifically when it hits a multivalued field.
elasticsearchmachine added 7 commits October 29, 2025 10:49
@carlosdelest
Copy link
Member Author

OK, I think this is ready for another round!

Changes I made:

  • Made the pushdown rule to act on WHERE, EVAL and STATS
  • Take into account duplicate functions, so a single attribute is created
  • Serialization and equality for FunctionEsField
  • Fix intermediate projections (MV_EXPAND)
  • Added tests:
    • LOOKUP JOIN and ENRICH
    • Checking duplicate attributes
    • Functions being part of expressions
    • Add tests for FunctionEsField
    • Check missing and not indexed fields cases

I believe this is ready for merging - we can take a look into issues as separate PRs.

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for the iterations @carlosdelest !

I think there's still some minor work to do on the push down rule, but it's definitely very fine to do so in follow-ups! Please, proceed at your own discretion; I left some recommendations below.

private final transient MappedFieldType.BlockLoaderFunctionConfig functionConfig;

public FunctionEsField(EsField esField, Function function, MappedFieldType.BlockLoaderFunctionConfig functionConfig) {
public FunctionEsField(EsField esField, DataType dataType, MappedFieldType.BlockLoaderFunctionConfig functionConfig) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for the added tests!

Comment on lines +73 to +75
if (EsqlCapabilities.Cap.VECTOR_SIMILARITY_FUNCTIONS_PUSHDOWN.isEnabled()) {
rules.add(new PushDownVectorSimilarityFunctions());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice! That said, I was overly cautious about turning this off in Prod because I thought that the vec similarity functions are already live - which they aren't. So it's also fine to always enable PushDownVectorSimilarityFunctions.

@@ -19,7 +19,8 @@ public boolean exists(FieldName field) {

@Override
public boolean isIndexed(FieldName field) {
return true;
// Vector fields are not indexed, so vector similarity functions are be evaluated via evaluator instead of pushed down in CSV tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

return esRelation.withAttributes(updatedOutput.stream().toList());
});
// Transforms Projects so the new attribute is not discarded
transformedPlan = transformedPlan.transformDown(EsqlProject.class, esProject -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still walks the whole plan again on each Eval, Filter or Aggregate, which is a little more complex than needed (O(n^2)).

I think what you'll want to do (in a follow-up), is to just have extends Rule<LogicalPlan, LogicalPlan>, and then perform a single walk of the plan, like ResolveUnionTypes. Actually, it's probably a good idea to follow ResolveUnionTypes' general structure. The latter also checks for any already fused fields from previous passes, which we should likely also do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I'll do that in a follow up though - I want this to be merged so it unblocks other related work. Thanks for the pointers!

// Transforms Projects so the new attribute is not discarded
transformedPlan = transformedPlan.transformDown(EsqlProject.class, esProject -> {
List<NamedExpression> projections = new ArrayList<>(esProject.projections());
projections.addAll(addedAttrsList);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: this might run into problems in the future if data node plans ever get projects that aren't between the main EsRelation and the, say, Eval that we're currently trying to optimize. E.g. if we allow Project nodes in the right hand side of Joins.

It's fine for now, though :)

var nameId = new NameId();
var name = rawTemporaryName(fieldAttr.name(), similarityFunction.nodeName(), nameId.toString());
var functionAttr = new FieldAttribute(
var name = rawTemporaryName(fieldAttr.name(), similarityFunction.nodeName(), String.valueOf(arrayHashCode));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You just need the hash so that the name doesn't mess up the deduplication, right?

Maybe it's easier/safer to stuff the field attr with a truncated name into the IdIgnoringWrapper, and give it a unique suffix in the actual plan. OTOH, the chance of a hash collision causing a name conflict between two different fused field attributes is likely so tiny, it almost doesn't matter.

I wonder if we could just use functionEsField.hashCode, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You just need the hash so that the name doesn't mess up the deduplication, right?

The problem is that the query vector needs to be taken into account - we could have two similarity functions used on the same field, but with different vectors to compare to.

I used the vector hash to take that into account and deduplicate when both the similarity function and the vector are the same.

Copy link
Contributor

@alex-spies alex-spies Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First off: sorry, should've marked this as nit. It's a nit :) I'm just pointing it out because new folks coming across this piece of code may scratch their heads as to why a manual hash computation is added. The very slim chance of hash collisions is likely negligible.

The problem is that the query vector needs to be taken into account

I see that!

But since you use a Map<Attribute.IdIgnoringWrapper, Attribute> , the field attribute that you use as key is fully used for equality in the map, except for its NameId. Which means that, if two field attributes used as map keys have different function configurations, they should be recognized as different keys in the map, regardless of their names.

So, it should be fine to use non-unique names for the attributes that we use as keys (e.g. $$dense_vector_field$dotProduct$ - and just use unique names in the value that the key points to ($$dense_vector_field$dotProduct$1234)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, thanks for explaining @alex-spies !

I believe the name was being used as well on equality for IdIgnoringWrapper and that was the reason that threw me off. I'll take another look when doing the follow up 👍

assertThat(Expressions.names(eval.fields()), contains("s"));
var alias = as(eval.fields().getFirst(), Alias.class);
var literal = as(alias.child(), Literal.class);
assertThat(literal.value(), is(nullValue()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

Comment on lines +1371 to +1372
| keep *
| mv_expand keyword
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we'll implement pushdown past project for mv expand very very soon, making this test a bit pointless. We could parse the plan and manually place a custom unary plan node after the projection, if we want to make this robust.

Comment on lines +1423 to +1425
| eval s1 = v_dot_product(dense_vector, [1.0, 2.0, 3.0]), s2 = v_dot_product(dense_vector, [1.0, 2.0, 3.0]) * 2 / 3
| eval s3 = v_dot_product(dense_vector, [1.0, 2.0, 3.0]) + 5, r1 = v_dot_product(dense_vector, [4.0, 5.0, 6.0])
| eval r2 = v_dot_product(dense_vector, [4.0, 5.0, 6.0]) + v_cosine(dense_vector, [4.0, 5.0, 6.0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll see more interesting results when we use the same dot product in different commands. E.g. one EVAL, one WHERE, on STATS.

We combine subsequent EVALs, so you don't see the effect of the same similarity function occuring in different commands.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True! As I need to rework the pushddown rule in a follow up, and it will traverse the plan in one go, I'll update the test accordingly there 👍

…ector-similarity-blockloaders-function-config

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
@carlosdelest carlosdelest enabled auto-merge (squash) October 30, 2025 08:32
@carlosdelest carlosdelest merged commit d220b33 into elastic:main Oct 30, 2025
33 of 34 checks passed
fzowl pushed a commit to voyage-ai/elasticsearch that referenced this pull request Nov 3, 2025
We have a bunch of these as inner classes. Some of them are inner
classes of the field types they support - which is fine and good. But a
bunch of the shared ones were inner classes of `BlockDocValuesReader`
which made the class like a billion lines long. And it made it harder to
find the source code of the shared ones because they weren't in
alphabetical order.

This moves these block loaders to top level classes. And lets our IDE
put them in alphabetical order for us. Which is nice because we'll be
adding a bunch more of them as we move further on elastic#137002.
chrisparrinello pushed a commit to chrisparrinello/elasticsearch that referenced this pull request Nov 3, 2025
Creates a `BlockLoader` for pushing the `LENGTH` function down into the
loader for `keyword` fields. It takes advantage of the terms dictionary
so we only need to calculate the code point count once per unique term
loaded.

This `BlockLoader` implementation isn't plugged into the infrastructure
for emitting it because we're waiting on the infrastructure we've
started in elastic#137002. We'll make a follow up PR to plug this in.

We're doing this mostly to demonstrate another function that we can push
into field loading, in addition to the vector similarity functions we're
building in elastic#137002. We don't expect `LENGTH` to be a super hot
function. If it happens to be then this'll help.

Before we plug this in we'll have to figure out emitting warnings from
functions that we've fused to field loading. Because `LENGTH` can emit a
warning, specifically when it hits a multivalued field.
chrisparrinello pushed a commit to chrisparrinello/elasticsearch that referenced this pull request Nov 3, 2025
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 6, 2025
BASE=82df0825778dc24773236299d5ecff6ae48d6786
HEAD=23450dd9771a51939201e641b79ad16fee302ff7
Branch=main
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >non-issue :Search Relevance/ES|QL Search functionality in ES|QL Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.3.0

4 participants