Skip to content

ESQL: Pragma to load from stored fields#122891

Merged
nik9000 merged 23 commits intoelastic:mainfrom
nik9000:esql_pragma_load_source
Mar 12, 2025
Merged

ESQL: Pragma to load from stored fields#122891
nik9000 merged 23 commits intoelastic:mainfrom
nik9000:esql_pragma_load_source

Conversation

@nik9000
Copy link
Member

@nik9000 nik9000 commented Feb 18, 2025

This creates a pragma you can use to request that fields load from a stored field rather than doc values. It implements that pragma for keyword and number fields.

We expect that, for some disk configuration and some number of fields, that it's faster to load those fields from _source or stored fields than it is to use doc values. Our default is doc values and on my laptop it's always faster to use doc values. But we don't ship my laptop to every cluster.

This will let us experiment and debug slow queries by trying to load fields a different way.

You access this pragma with:

curl -HContent-Type:application/json -XPOST localhost:9200/_query?pretty -d '{
    "query": "FROM foo",
    "pragma": {
        "field_extract_preference": "PREFER_STORED"
    }
}'

On a release build you'll need to add "accept_pragma_risks": true.

This creates a `pragma` you can use to request that fields load from a
stored field rather than doc values. It implements that pragma for
`keyword` and number fields.

We expect that, for some disk configuration and some number of fields,
that it's faster to load those fields from _source or stored fields than
it is to use doc values. Our default is doc values and on my laptop it's
*always* faster to use doc values. But we don't ship my laptop to every
cluster.

This will let us experiment and debug slow queries by trying to load
fields a different way.

You access this pragma with:
```
curl -HContent-Type:application/json -XPOST localhost:9200/_query?pretty -d '{
    "query": "FROM foo",
    "pragma": {
        "field_extract_preference": "PREFER_STORED"
    }
}'
```

On a release build you'll need to add `"accept_pragma_risks": true`.
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 18, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Contributor

@lkts lkts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

List<Object[]> args = new ArrayList<>();
for (boolean syntheticSource : new boolean[] { false, true }) {
for (MappedFieldType.FieldExtractPreference preference : PREFERENCES) {
args.add(new Object[] { new Params(syntheticSource, preference) });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to work pretty well really!

@nik9000
Copy link
Member Author

nik9000 commented Feb 18, 2025

Thanks @lkts!

I'd like to have a review from @craigtaverner before merging.

@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

Copy link
Member Author

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a fun thing - if we pragma a field over to loading from _source it'll load nested fields when it really shouldn't. Subfields of nested fields shouldn't really work in ESQL.

@nik9000
Copy link
Member Author

nik9000 commented Mar 6, 2025

Found a fun thing - if we pragma a field over to loading from _source it'll load nested fields when it really shouldn't. Subfields of nested fields shouldn't really work in ESQL.

I don't think that needs to block merging this as pragmas can change behavior - but it is a bug we should fix.

@nik9000
Copy link
Member Author

nik9000 commented Mar 11, 2025

If we're going to engage this stuff by default we need to figure out the load-from-_source problem with nested - right now if you load from _source you get the nested sub-fields. We don't expect that to happen.

Copy link
Contributor

@craigtaverner craigtaverner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice feature. But I did have some comments I thought worth considering.

return BlockSourceReader.lookupFromNorms(name());
}
if (isIndexed() || isStored()) {
if (hasDocValues() == false && (isIndexed() || isStored())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that this check is because previously we would never reach this code if hasDocValues was true, and now we might. I'm just wondering why we care to check this. If the request was to prefer_stored, then surely the existence of doc-values should not affect the decision taken on this line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a preflight check so we don't try to load from _source if the field isn't present. We only go if the field_names field has the field. Except we never make the field_names field if doc values are enabled.

I'll leave a comment.

* loading many fields. The {@link MappedFieldType} can chose a different
* method to load the field if it needs to.
*/
PREFER_STORED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the prefix PREFER_ duplicates the enum name. None of the other enum values have PREFER_ as a prefix, so perhaps this should not. The fact that this is a preference is already in the FieldExtractPreference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I feel like the other "preference"s were more like requirements. But I'll rename.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The others were intended to be preferences. The previous default behaviour of loading doc-values if they exist, otherwise source, was swapped around for spatial types, and the DOC_VALUES preference was to swap that back, for performance reasons, but we could still load from source (just slower). Also the bounds extraction preference is just a performance optimization, because if we don't do that we still get the right answer, just much slower.

Source source,
PhysicalPlan child,
List<Attribute> attributesToExtract,
MappedFieldType.FieldExtractPreference defaultPreference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor should not have the field-extract preference, since that is local-physical planning only. It should set the default NONE on the call to this() below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tried to apply this and we do need it, at least we need it how things are shaped now. InsertFieldExtraction wants to make a FieldExtractExec on the data nodes with this already set.

in.readNamedWriteable(PhysicalPlan.class),
in.readNamedWriteableCollectionAsList(Attribute.class)
in.readNamedWriteableCollectionAsList(Attribute.class),
MappedFieldType.FieldExtractPreference.NONE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete this line, and have the constructor set the default. See the comments on the lines a couple of lines below for the pattern we used for which parameters can be set during local physical node planning, versus those that happen in the coordinator node.

PhysicalPlan child = randomChild(depth);
List<Attribute> attributesToExtract = randomFieldAttributes(1, 4, false);
return new FieldExtractExec(source, child, attributesToExtract);
return new FieldExtractExec(source, child, attributesToExtract, MappedFieldType.FieldExtractPreference.NONE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could revert this change, if we use the pattern described in the comments on FieldExtractPreference

@nik9000 nik9000 merged commit 50aaa1c into elastic:main Mar 12, 2025
17 checks passed
albertzaharovits pushed a commit to albertzaharovits/elasticsearch that referenced this pull request Mar 13, 2025
This creates a `pragma` you can use to request that fields load from a
stored field rather than doc values. It implements that pragma for
`keyword` and number fields.

We expect that, for some disk configuration and some number of fields,
that it's faster to load those fields from _source or stored fields than
it is to use doc values. Our default is doc values and on my laptop it's
*always* faster to use doc values. But we don't ship my laptop to every
cluster.

This will let us experiment and debug slow queries by trying to load
fields a different way.

You access this pragma with:
```
curl -HContent-Type:application/json -XPOST localhost:9200/_query?pretty -d '{
    "query": "FROM foo",
    "pragma": {
        "field_extract_preference": "STORED"
    }
}'
```

On a release build you'll need to add `"accept_pragma_risks": true`.
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Mar 13, 2025
This creates a `pragma` you can use to request that fields load from a
stored field rather than doc values. It implements that pragma for
`keyword` and number fields.

We expect that, for some disk configuration and some number of fields,
that it's faster to load those fields from _source or stored fields than
it is to use doc values. Our default is doc values and on my laptop it's
*always* faster to use doc values. But we don't ship my laptop to every
cluster.

This will let us experiment and debug slow queries by trying to load
fields a different way.

You access this pragma with:
```
curl -HContent-Type:application/json -XPOST localhost:9200/_query?pretty -d '{
    "query": "FROM foo",
    "pragma": {
        "field_extract_preference": "STORED"
    }
}'
```

On a release build you'll need to add `"accept_pragma_risks": true`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.1.0

4 participants