ESQL: Pragma to load from stored fields#122891
Conversation
This creates a `pragma` you can use to request that fields load from a
stored field rather than doc values. It implements that pragma for
`keyword` and number fields.
We expect that, for some disk configuration and some number of fields,
that it's faster to load those fields from _source or stored fields than
it is to use doc values. Our default is doc values and on my laptop it's
*always* faster to use doc values. But we don't ship my laptop to every
cluster.
This will let us experiment and debug slow queries by trying to load
fields a different way.
You access this pragma with:
```
curl -HContent-Type:application/json -XPOST localhost:9200/_query?pretty -d '{
"query": "FROM foo",
"pragma": {
"field_extract_preference": "PREFER_STORED"
}
}'
```
On a release build you'll need to add `"accept_pragma_risks": true`.
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
| List<Object[]> args = new ArrayList<>(); | ||
| for (boolean syntheticSource : new boolean[] { false, true }) { | ||
| for (MappedFieldType.FieldExtractPreference preference : PREFERENCES) { | ||
| args.add(new Object[] { new Params(syntheticSource, preference) }); |
There was a problem hiding this comment.
It seems to work pretty well really!
|
Thanks @lkts! I'd like to have a review from @craigtaverner before merging. |
|
Hi @nik9000, I've created a changelog YAML for you. |
…sql_pragma_load_source
…sql_pragma_load_source
nik9000
left a comment
There was a problem hiding this comment.
Found a fun thing - if we pragma a field over to loading from _source it'll load nested fields when it really shouldn't. Subfields of nested fields shouldn't really work in ESQL.
I don't think that needs to block merging this as pragmas can change behavior - but it is a bug we should fix. |
|
If we're going to engage this stuff by default we need to figure out the load-from-_source problem with nested - right now if you load from _source you get the nested sub-fields. We don't expect that to happen. |
craigtaverner
left a comment
There was a problem hiding this comment.
Nice feature. But I did have some comments I thought worth considering.
| return BlockSourceReader.lookupFromNorms(name()); | ||
| } | ||
| if (isIndexed() || isStored()) { | ||
| if (hasDocValues() == false && (isIndexed() || isStored())) { |
There was a problem hiding this comment.
I understand that this check is because previously we would never reach this code if hasDocValues was true, and now we might. I'm just wondering why we care to check this. If the request was to prefer_stored, then surely the existence of doc-values should not affect the decision taken on this line?
There was a problem hiding this comment.
This is a preflight check so we don't try to load from _source if the field isn't present. We only go if the field_names field has the field. Except we never make the field_names field if doc values are enabled.
I'll leave a comment.
| * loading many fields. The {@link MappedFieldType} can chose a different | ||
| * method to load the field if it needs to. | ||
| */ | ||
| PREFER_STORED; |
There was a problem hiding this comment.
I feel like the prefix PREFER_ duplicates the enum name. None of the other enum values have PREFER_ as a prefix, so perhaps this should not. The fact that this is a preference is already in the FieldExtractPreference.
There was a problem hiding this comment.
Sure. I feel like the other "preference"s were more like requirements. But I'll rename.
There was a problem hiding this comment.
The others were intended to be preferences. The previous default behaviour of loading doc-values if they exist, otherwise source, was swapped around for spatial types, and the DOC_VALUES preference was to swap that back, for performance reasons, but we could still load from source (just slower). Also the bounds extraction preference is just a performance optimization, because if we don't do that we still get the right answer, just much slower.
| Source source, | ||
| PhysicalPlan child, | ||
| List<Attribute> attributesToExtract, | ||
| MappedFieldType.FieldExtractPreference defaultPreference |
There was a problem hiding this comment.
This constructor should not have the field-extract preference, since that is local-physical planning only. It should set the default NONE on the call to this() below.
There was a problem hiding this comment.
Just tried to apply this and we do need it, at least we need it how things are shaped now. InsertFieldExtraction wants to make a FieldExtractExec on the data nodes with this already set.
| in.readNamedWriteable(PhysicalPlan.class), | ||
| in.readNamedWriteableCollectionAsList(Attribute.class) | ||
| in.readNamedWriteableCollectionAsList(Attribute.class), | ||
| MappedFieldType.FieldExtractPreference.NONE |
There was a problem hiding this comment.
Delete this line, and have the constructor set the default. See the comments on the lines a couple of lines below for the pattern we used for which parameters can be set during local physical node planning, versus those that happen in the coordinator node.
| PhysicalPlan child = randomChild(depth); | ||
| List<Attribute> attributesToExtract = randomFieldAttributes(1, 4, false); | ||
| return new FieldExtractExec(source, child, attributesToExtract); | ||
| return new FieldExtractExec(source, child, attributesToExtract, MappedFieldType.FieldExtractPreference.NONE); |
There was a problem hiding this comment.
Could revert this change, if we use the pattern described in the comments on FieldExtractPreference
This creates a `pragma` you can use to request that fields load from a
stored field rather than doc values. It implements that pragma for
`keyword` and number fields.
We expect that, for some disk configuration and some number of fields,
that it's faster to load those fields from _source or stored fields than
it is to use doc values. Our default is doc values and on my laptop it's
*always* faster to use doc values. But we don't ship my laptop to every
cluster.
This will let us experiment and debug slow queries by trying to load
fields a different way.
You access this pragma with:
```
curl -HContent-Type:application/json -XPOST localhost:9200/_query?pretty -d '{
"query": "FROM foo",
"pragma": {
"field_extract_preference": "STORED"
}
}'
```
On a release build you'll need to add `"accept_pragma_risks": true`.
This creates a `pragma` you can use to request that fields load from a
stored field rather than doc values. It implements that pragma for
`keyword` and number fields.
We expect that, for some disk configuration and some number of fields,
that it's faster to load those fields from _source or stored fields than
it is to use doc values. Our default is doc values and on my laptop it's
*always* faster to use doc values. But we don't ship my laptop to every
cluster.
This will let us experiment and debug slow queries by trying to load
fields a different way.
You access this pragma with:
```
curl -HContent-Type:application/json -XPOST localhost:9200/_query?pretty -d '{
"query": "FROM foo",
"pragma": {
"field_extract_preference": "STORED"
}
}'
```
On a release build you'll need to add `"accept_pragma_risks": true`.
This creates a
pragmayou can use to request that fields load from a stored field rather than doc values. It implements that pragma forkeywordand number fields.We expect that, for some disk configuration and some number of fields, that it's faster to load those fields from _source or stored fields than it is to use doc values. Our default is doc values and on my laptop it's always faster to use doc values. But we don't ship my laptop to every cluster.
This will let us experiment and debug slow queries by trying to load fields a different way.
You access this pragma with:
On a release build you'll need to add
"accept_pragma_risks": true.