Push vector similarity functions using BlockDocValuesReader #137002
Conversation
…ectorFieldMapper and BlockDocValuesReader
…ector-similarity-blockloaders # Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
…on interface to create block loader
…ector-similarity-blockloaders-function-config # Conflicts: # x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/vector/VectorSimilarityFunctionsIT.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java # x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizerTests.java
…-similarity-blockloaders-function-config' into non-issue/esql-push-vector-similarity-blockloaders-function-config
nik9000
left a comment
Now I'm excited. Looks great. I'd love to make the rule more generic, but that should probably wait for a follow up.
You may want to look into my comments about canonicalize - that could make this a little easier to reason about and feel fairly cheap to handle now.
I think we want a review from @alex-spies before we get this in.
I left a comment about another kind of query on the test that I think is important to address too. It's cool if we don't optimize this case now, but we shouldn't break it.
server/src/main/java/org/elasticsearch/index/mapper/MappedFieldType.java
@Override
public MappedFieldType.BlockLoaderFunction<DenseVectorFieldMapper.VectorSimilarityFunctionConfig> getBlockLoaderFunction() {
    Literal literal = (Literal) (left() instanceof Literal ? left() : right());
Could this return null if neither are Literal? I feel like letting the functions themselves decide when they should push is the right thing to do.
Like, we'll push stuff like ST_CENTROID and LENGTH and those don't have any literal checking. We might push DATE_PARSE which has who knows what rules.
Wait. Ignore this. Let's get it in a follow up when we do LENGTH or something. It'll be really obvious what to do then.
Can we add a comment why we can assume one side to be a literal? This reads like it has to cause a CCE in some cases; but I assume this never really happens because PushDownVectorSimilarityFunctions is required for this code to be relevant, and that checks for literals? If that's true, maybe let's even link to PushDownVectorSimilarityFunctions for ease of reading :)
@Override
protected LogicalPlan rule(Eval eval) {
    // TODO Extend to WHERE if necessary
I'd leave that for a follow up. When someone needs it.
// Only replace if exactly one side is a literal and the other a field attribute
if ((similarityFunction.left() instanceof Literal ^ similarityFunction.right() instanceof Literal) == false) {
    return similarityFunction;
}
This can all deal with the new interface, right?
Wait. Let's just do that in a follow up. It's a good thing to do, but this feels like quite a good step.
@@ -210,6 +206,22 @@ public void close() {
    }
}
If you implement canonicalize you can rotate Literals to the right hand side. Then you don't need all the xor fun. That feels like it'd be easier to manage.
You can swap the order of the parameters, right? That doesn't change the meaning for any of these functions, right?
Correct, I'll do that - it's easier to reason about, as you say, and may benefit other cases we do optimizations for in the evaluator
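The canonicalization idea discussed here can be sketched in isolation. This is a toy model, not the real ESQL classes: `Expr`, `Literal`, `FieldRef`, and `Similarity` stand in for `Expression`, `Literal`, a field attribute, and a commutative vector similarity function. Since swapping operands doesn't change the meaning of these functions, `canonicalize` can rotate any `Literal` to the right-hand side, so downstream rules only ever check `right() instanceof Literal` instead of the XOR check above.

```java
// Toy sketch of rotating a Literal operand to the right-hand side.
// All names here are simplified stand-ins for the real ESQL classes.
interface Expr {}

record Literal(Object value) implements Expr {}

record FieldRef(String name) implements Expr {}

record Similarity(Expr left, Expr right) implements Expr {
    // For commutative similarity functions, swapping operands preserves meaning,
    // so canonicalize moves any Literal to the right. Afterwards, rules only
    // need to check one side instead of using an XOR of both sides.
    Similarity canonicalize() {
        if (left instanceof Literal && (right instanceof Literal) == false) {
            return new Similarity(right, left);
        }
        return this;
    }
}
```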
| eval s = v_dot_product(dense_vector, [0.1, 0.2, 0.3])
| sort s desc
| keep s
""";
What about a query like
from test
| eval s = v_dot_product(dense_vector, [0.1, 0.2, 0.3]) * 2 + 12
| sort s desc
| keep s
Does this get rewritten now? It doesn't have to in this PR, but I feel like we'd want it to eventually. And we do have to make sure not to break it in this PR either way.
What about:
from test
| eval s = v_dot_product(dense_vector, [0.1, 0.2, 0.3]) + v_dot_product(dense_vector, [0.3, 0.4, 0.5])
| sort s desc
| keep s
I'm not sure precisely how we'd want to push these, but again, it shouldn't break.
from test
| eval s = v_dot_product(dense_vector1, [0.1, 0.2, 0.3]) + v_dot_product(dense_vector2, [0.3, 0.4, 0.5])
| sort s desc
| keep s
And do we need this for the non-sorted case too? I'm not sure. I didn't read closely enough.
I think all those cases are handled, but you're right in that I need to add tests for it. Basically, each function in an expression is replaced and pushed individually. But I need to make explicit tests for it
That's tested in CSV tests and in optimizer tests in 11a6df1
    // A null value on the left or right vector. Similarity is null
    builder.appendNull();
    continue;
} else if (dimsLeft != dimsRight) {
Dimensions are checked on the similarity implementation now
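A minimal kernel with the null handling from the snippet above might look like the following. This is an illustration, not the PR's actual block loader code; as the comment notes, in the PR the dimension check lives in the similarity implementation, so the explicit check here is just for the standalone sketch.

```java
// Sketch: compute a dot-product similarity row by row, emitting null when
// either vector is missing. Simplified stand-in for the real block loader.
class DotProductKernel {
    static Float dotProduct(float[] left, float[] right) {
        if (left == null || right == null) {
            // A null value on the left or right vector: similarity is null
            return null;
        }
        if (left.length != right.length) {
            // In the PR this is rejected by the similarity implementation itself
            throw new IllegalArgumentException("dimension mismatch: " + left.length + " != " + right.length);
        }
        float sum = 0f;
        for (int i = 0; i < left.length; i++) {
            sum += left[i] * right[i];
        }
        return sum;
    }
}
```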
if (blockLoaderFunction == null) {
    return new BlockDocValuesReader.DenseVectorBlockLoader(name(), dims, this);
}
if (SIMILARITY_FUNCTION_NAME.equals(blockLoaderFunction.name()) == false) {
I'm not a big fan of using a name to identify the function. I think we could get away with checking the type of the config 🤔
Yeah. I think you can just:
if (blockLoaderFunction == null) {
    return new BlockDocValuesReader.DenseVectorBlockLoader(name(), dims, this);
}
if (blockLoaderFunction instanceof DenseVectorSimilarityFunctionBlockLoader config) {
    return new BlockDocValuesReader.DenseVectorSimilarityFunctionBlockLoader(...);
}
throw new UnsupportedOperationException("Unknown block loader function: " + blockLoaderFunction);
The only thing is - we need to make sure there's a nice toString for the config.
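The type-based dispatch suggested here can be demonstrated with a toy version. `BlockLoaderConfig`, `SimilarityConfig`, and `LoaderDispatch` are simplified stand-ins for the mapper classes in the PR, and the returned strings merely label which loader would be built. Using a record for the config also gives the "nice toString" nik9000 asks for essentially for free:

```java
// Sketch: select a block loader by the config's type instead of its name string.
interface BlockLoaderConfig {}

// Records generate a readable toString, e.g. "SimilarityConfig[function=dotProduct]"
record SimilarityConfig(String function) implements BlockLoaderConfig {}

class LoaderDispatch {
    static String loaderFor(BlockLoaderConfig config) {
        if (config == null) {
            return "DenseVectorBlockLoader";           // plain vector loading
        }
        if (config instanceof SimilarityConfig sim) {  // type check replaces the name check
            return "DenseVectorSimilarityFunctionBlockLoader(" + sim + ")";
        }
        throw new UnsupportedOperationException("Unknown block loader function: " + config);
    }
}
```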
Creates a `BlockLoader` for pushing the `LENGTH` function down into the loader for `keyword` fields. It takes advantage of the terms dictionary, so we only need to calculate the code point count once per unique term loaded. This `BlockLoader` implementation isn't plugged into the infrastructure for emitting it because we're waiting on the infrastructure we've started in #137002. We'll make a follow up PR to plug this in. We're doing this mostly to demonstrate another function that we can push into field loading, in addition to the vector similarity functions we're building in #137002. We don't expect `LENGTH` to be a super hot function; if it happens to be, then this'll help. Before we plug this in we'll have to figure out emitting warnings from functions that we've fused to field loading, because `LENGTH` can emit a warning, specifically when it hits a multivalued field.
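The per-unique-term trick described above can be illustrated in isolation. This is a toy model, not the actual keyword `BlockLoader`: real doc values visit terms by ordinal, so a sorted list of terms stands in here, and the code point count is only recomputed when the term changes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration: compute LENGTH (code point count) once per distinct term when
// terms arrive in sorted (ordinal-like) order. Repeated terms reuse the cache.
class LengthPerUniqueTerm {
    static List<Integer> lengths(List<String> sortedTerms) {
        List<Integer> out = new ArrayList<>();
        String lastTerm = null;
        int lastLength = -1;
        for (String term : sortedTerms) {
            if (term.equals(lastTerm) == false) {
                // Only recompute when the term changes
                lastLength = term.codePointCount(0, term.length());
                lastTerm = term;
            }
            out.add(lastLength);
        }
        return out;
    }
}
```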
…ector-similarity-blockloaders-function-config # Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
OK, I think this is ready for another round! Changes I made:
I believe this is ready for merging - we can look into the remaining issues in separate PRs.
alex-spies
left a comment
Great, thanks for the iterations @carlosdelest !
I think there's still some minor work to do on the push down rule, but it's definitely fine to do that in follow-ups! Please proceed at your own discretion; I left some recommendations below.
private final transient MappedFieldType.BlockLoaderFunctionConfig functionConfig;

public FunctionEsField(EsField esField, Function function, MappedFieldType.BlockLoaderFunctionConfig functionConfig) {
public FunctionEsField(EsField esField, DataType dataType, MappedFieldType.BlockLoaderFunctionConfig functionConfig) {
Great, thanks for the added tests!
if (EsqlCapabilities.Cap.VECTOR_SIMILARITY_FUNCTIONS_PUSHDOWN.isEnabled()) {
    rules.add(new PushDownVectorSimilarityFunctions());
}
This is nice! That said, I was overly cautious about turning this off in Prod because I thought that the vec similarity functions are already live - which they aren't. So it's also fine to always enable PushDownVectorSimilarityFunctions.
@@ -19,7 +19,8 @@ public boolean exists(FieldName field) {

@Override
public boolean isIndexed(FieldName field) {
    return true;
    // Vector fields are not indexed, so vector similarity functions are evaluated via evaluator instead of pushed down in CSV tests

    return esRelation.withAttributes(updatedOutput.stream().toList());
});
// Transforms Projects so the new attribute is not discarded
transformedPlan = transformedPlan.transformDown(EsqlProject.class, esProject -> {
This still walks the whole plan again on each Eval, Filter or Aggregate, which is a little more complex than needed (O(n^2)).
I think what you'll want to do (in a follow-up) is to just extend Rule<LogicalPlan, LogicalPlan> and then perform a single walk of the plan, like ResolveUnionTypes. Actually, it's probably a good idea to follow ResolveUnionTypes' general structure. The latter also checks for any already fused fields from previous passes, which we should likely also do.
That makes sense. I'll do that in a follow up though - I want this to be merged so it unblocks other related work. Thanks for the pointers!
// Transforms Projects so the new attribute is not discarded
transformedPlan = transformedPlan.transformDown(EsqlProject.class, esProject -> {
    List<NamedExpression> projections = new ArrayList<>(esProject.projections());
    projections.addAll(addedAttrsList);
thought: this might run into problems in the future if data node plans ever get projects that aren't between the main EsRelation and the, say, Eval that we're currently trying to optimize. E.g. if we allow Project nodes in the right hand side of Joins.
It's fine for now, though :)
var nameId = new NameId();
var name = rawTemporaryName(fieldAttr.name(), similarityFunction.nodeName(), nameId.toString());
var functionAttr = new FieldAttribute(
var name = rawTemporaryName(fieldAttr.name(), similarityFunction.nodeName(), String.valueOf(arrayHashCode));
You just need the hash so that the name doesn't mess up the deduplication, right?
Maybe it's easier/safer to stuff the field attr with a truncated name into the IdIgnoringWrapper, and give it a unique suffix in the actual plan. OTOH, the chance of a hash collision causing a name conflict between two different fused field attributes is likely so tiny, it almost doesn't matter.
I wonder if we could just use functionEsField.hashCode, though.
You just need the hash so that the name doesn't mess up the deduplication, right?
The problem is that the query vector needs to be taken into account - we could have two similarity functions used on the same field, but with different vectors to compare to.
I used the vector hash to take that into account and deduplicate when both the similarity function and the vector are the same.
First off: sorry, should've marked this as nit. It's a nit :) I'm just pointing it out because new folks coming across this piece of code may scratch their heads as to why a manual hash computation is added. The very slim chance of hash collisions is likely negligible.
The problem is that the query vector needs to be taken into account
I see that!
But since you use a Map<Attribute.IdIgnoringWrapper, Attribute> , the field attribute that you use as key is fully used for equality in the map, except for its NameId. Which means that, if two field attributes used as map keys have different function configurations, they should be recognized as different keys in the map, regardless of their names.
So, it should be fine to use non-unique names for the attributes that we use as keys (e.g. $$dense_vector_field$dotProduct$) - and just use unique names in the value that the key points to ($$dense_vector_field$dotProduct$1234).
Ah, I see, thanks for explaining @alex-spies !
I believe the name was also being used for equality in IdIgnoringWrapper, and that's what threw me off. I'll take another look when doing the follow up 👍
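The deduplication behavior discussed in this thread can be modeled with a toy version. `FuseKey` here is a hypothetical stand-in for `Attribute.IdIgnoringWrapper`: equality is driven by field name plus function config (ignoring any unique id), so the same field/config pair always maps to one fused attribute, while the unique suffix lives only in the map's value.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: the key compares field + config (like IdIgnoringWrapper ignores
// NameId), so two uses of the same similarity function on the same field with
// the same query vector deduplicate to one fused attribute.
record FuseKey(String field, String config) {}

class FusedAttrDedup {
    private final Map<FuseKey, String> seen = new HashMap<>();
    private int counter = 0;

    String attrFor(String field, String config) {
        // e.g. key $$dense_vector$dotProduct$ -> value $$dense_vector$dotProduct$0
        return seen.computeIfAbsent(new FuseKey(field, config),
            k -> "$$" + k.field() + "$" + k.config() + "$" + counter++);
    }
}
```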
assertThat(Expressions.names(eval.fields()), contains("s"));
var alias = as(eval.fields().getFirst(), Alias.class);
var literal = as(alias.child(), Literal.class);
assertThat(literal.value(), is(nullValue()));
| keep *
| mv_expand keyword
nit: we'll implement pushdown past project for mv expand very very soon, making this test a bit pointless. We could parse the plan and manually place a custom unary plan node after the projection, if we want to make this robust.
| eval s1 = v_dot_product(dense_vector, [1.0, 2.0, 3.0]), s2 = v_dot_product(dense_vector, [1.0, 2.0, 3.0]) * 2 / 3
| eval s3 = v_dot_product(dense_vector, [1.0, 2.0, 3.0]) + 5, r1 = v_dot_product(dense_vector, [4.0, 5.0, 6.0])
| eval r2 = v_dot_product(dense_vector, [4.0, 5.0, 6.0]) + v_cosine(dense_vector, [4.0, 5.0, 6.0])
I think you'll see more interesting results when we use the same dot product in different commands, e.g. one EVAL, one WHERE, one STATS.
We combine subsequent EVALs, so you don't see the effect of the same similarity function occurring in different commands.
True! As I need to rework the pushdown rule in a follow up, and it will traverse the plan in one go, I'll update the test accordingly there 👍
…ector-similarity-blockloaders-function-config # Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
We have a bunch of these as inner classes. Some of them are inner classes of the field types they support - which is fine and good. But a bunch of the shared ones were inner classes of `BlockDocValuesReader` which made the class like a billion lines long. And it made it harder to find the source code of the shared ones because they weren't in alphabetical order. This moves these block loaders to top level classes. And lets our IDE put them in alphabetical order for us. Which is nice because we'll be adding a bunch more of them as we move further on elastic#137002.
BASE=82df0825778dc24773236299d5ecff6ae48d6786 HEAD=23450dd9771a51939201e641b79ad16fee302ff7 Branch=main
Pushes down vector similarity functions, and applies them when loading vectors.
This avoids the overhead of loading vectors into blocks and calculating the functions afterwards in the compute engine. Instead, the similarity functions are calculated directly when loading the field, and the results are loaded straight into a block.
This approach consists of:
- A `BlockLoaderFunctionConfig` marker interface, available on the `BlockLoaderContext`. This interface will be implemented by config objects that the field types use to instantiate the appropriate `BlockLoader`.
- A `BlockLoaderExpression` interface. Functions that implement this interface will provide a `BlockLoaderFunctionConfig`.
- An `EsFunctionField` that extends `EsField`. This field has a `BlockLoaderFunctionConfig`, which is the function configuration to extract from the field.
- A `PushDownVectorSimilarityFunctions` local logical optimization rule, which replaces a vector similarity function applied to a field with a `FieldAttribute` containing an `EsFunctionField` that will be used to extract the function result at field loading.

I've used a local logical optimization rule to push down. It looks cleaner and we don't have to deal with different cases when the attribute is sent to the local nodes or stays in the coordinator. Plus, it can use the mapping information to decide whether a function can be pushed down or not.
Related: #103636
Previous work done on: #136739