ESQL: Multiple patterns for grok command#136541
Conversation
…93/elasticsearch into flash1293/grok-multiple-patterns
|
Hmm, this fails BWC tests with a mixed cluster, which is expected since this is introducing a new syntax. I'm not sure how we handle this case, could you advise? |
luigidellaquila
left a comment
There was a problem hiding this comment.
Thanks @flash1293, LGTM
I left just a couple of minor observations
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/LogicalPlanBuilder.java
Outdated
Show resolved
Hide resolved
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Sorry, I missed this comment. |
|
Hi @flash1293, I've created a changelog YAML for you. |
| if (patterns.size() > 1) { | ||
| combinedPattern = ""; | ||
| for (int i = 0; i < patterns.size(); i++) { | ||
| String pattern = patterns.get(i); |
There was a problem hiding this comment.
A bit of an edge case, but what would happen if some pattern is invalid? For example:
- %{WORD:word}\
- a)
- %{WORD:word}
The final result would be: (?:%{WORD:word}\)|(?:a))|(?:%{WORD:word})
Which lead to unexpected patterns.
This is user-made, so I don't think this is very problematic, but this looks like a grok injection and the method combinePatterns() would actually be lying here.
So, solutions:
- Can we verify and warn on wrong patterns? Is that something we do in some other case?
- Can we sanitize or remove invalid patterns?
There was a problem hiding this comment.
Whatever the resolution, we'll need tests for this, both in Grok and in ESQL, depending on what we do
There was a problem hiding this comment.
That's a fun one - tested with ingest pipelines and this indeed works here:
POST /_ingest/pipeline/_simulate
{
"docs": [
{
"_source": {
"foo": "Test)x"
}
}
],
"pipeline": {
"processors": [
{
"grok": {
"field": "foo",
"patterns": [
"%{WORD:word}\\",
"x)"
]
}
}
]
}
}
should fail because %{WORD:word}\\ and x) are not valid patterns, but it does pass, no warnings or similar.
I don't feel strongly about this case, we could pass the individual patterns down further and then validate this in the Grok lib, but:
- It seems like a lot of work for an edge case
- It's existing behavior in ingest pipelines
- We would need to compile each regex individually, which sounds like it would add a bunch of additional computation we are avoiding now
I would lean towards leaving the current behavior, wdyt?
There was a problem hiding this comment.
As the code was centralized, I guess the pipeline uses (and already used) the same logic right?
Being pragmatic, validating multiple patterns should error (probably?), which means that validating a single pattern should error too. So technically this is existing behavior, and fixing would be a breaking change (Or a bug fix).
So yeah, let's just make an issue explaining this case (For ESQL at least), and continue with this
There was a problem hiding this comment.
As the code was centralized, I guess the pipeline uses (and already used) the same logic right?
Exactly!
So technically this is existing behavior, and fixing would be a breaking change
I never get tired of quoting Hyrums Law 🙂
Closes #132486
This PR adds the ability to specify multiple grok patterns as part of a single grok command. Consistent with the grok processor for ingest pipelines, they are tried in order, the first matching one is actually applied:
returns
It's not allowed to have different types for the same semantic names in different patterns:
returns
This can be considered syntactic sugar over a more complex manual pattern.