Skip to content

Eval REPLACE function#98909

Merged
astefan merged 18 commits intomainfrom
esql/replaceFunction
Sep 29, 2023
Merged

Eval REPLACE function#98909
astefan merged 18 commits intomainfrom
esql/replaceFunction

Conversation

@abatsakis
Copy link
Contributor

@abatsakis abatsakis commented Aug 26, 2023

The EVAL replace function that substitutes the new string for every occurrence of the regex in the source string.

@abatsakis abatsakis added Team:QL (Deprecated) Meta label for query languages team :Analytics/ES|QL AKA ESQL labels Aug 26, 2023
@cla-checker-service
Copy link

cla-checker-service bot commented Aug 26, 2023

❌ Author of the following commits did not sign a Contributor Agreement:
a0aacec, ff0117b, 95ca036, 26bc2fc, a4ff2c9

Please, read and sign the above mentioned agreement if you want to contribute to this project

@github-actions
Copy link
Contributor

Documentation preview:

@elasticsearchmachine elasticsearchmachine added v8.11.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Aug 26, 2023
@elasticsearchmachine
Copy link
Collaborator

@abatsakis please enable the option "Allow edits and access to secrets by maintainers" on your PR. For more information, see the documentation.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

@abatsakis abatsakis requested review from astefan and bpintea August 26, 2023 15:25
Copy link
Contributor

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment regarding a possible significant optimization.

}

@Evaluator
static BytesRef process(BytesRef str, BytesRef regex, BytesRef newStr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex is a very good candidate to be optimized as a @Fixed parameter.
You can have an additional process method like

@Evaluator(extraName = "Constant")
static BytesRef process(BytesRef str, @Fixed Pattern regex, BytesRef newStr) { ... }

(not sure it works in this order, maybe you'll have to switch second and third parameter)
and compile the regex Pattern only once in case the expression is foldable (that is likely the most common use case). It will make the execution much more efficient.
For a practical example see https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/date/DateFormat.java#L87

Copy link
Contributor

@abdonpijpelink abdonpijpelink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a small docs suggestion (every command on a new line; remove the semicolon).

Comment on lines 414 to 416
//tag::replaceString[]
ROW str = "Hello World" | EVAL str = REPLACE(str, "World", "Universe") | KEEP str;
// end::replaceString[]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//tag::replaceString[]
ROW str = "Hello World" | EVAL str = REPLACE(str, "World", "Universe") | KEEP str;
// end::replaceString[]
//tag::replaceString[]
ROW str = "Hello World"
| EVAL str = REPLACE(str, "World", "Universe")
| KEEP str
// end::replaceString[]
;
}

try {
return new BytesRef(str.utf8ToString().replaceAll(regex.pattern(), newStr.utf8ToString()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you have the compiled Pattern, you can use it for a more efficient replaceAll():

return new BytesRef(regex.matcher(str.utf8ToString()).replaceAll(newStr.utf8ToString()));
Alexandros Batsakis added 2 commits September 10, 2023 14:53
Copy link
Contributor

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Thanks for your patience @abatsakis, I just left one last small comment about tests, but for the rest the PR is 👍 👍 👍

;

replace complex
from employees | where emp_no <= 10010 | eval f_l = replace(replace(substring(last_name, 1, 20), "al", "AB"), "a", "Z") | keep emp_no, last_name, f_l;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a | sort emp_no somewhere in these two queries, otherwise they'll randomly fail in multi-node

@astefan
Copy link
Contributor

astefan commented Sep 27, 2023

@abatsakis I'll polish this one and push it over the finishing line, if that's ok with you.

@astefan astefan added the ES|QL-ui Impacts ES|QL UI label Sep 29, 2023
Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* and click {@code Build->Recompile 'FunctionName.java'} in IntelliJ. This should generate
* an {@link org.elasticsearch.compute.operator.EvalOperator.ExpressionEvaluator} implementation
* and click {@code Build->Recompile 'FunctionName.java'} in IntelliJ or run the
* {@code CsvTests} again.. This should generate an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* {@code CsvTests} again.. This should generate an
* {@code CsvTests} again. This should generate an
;

replaceRegex
from hosts | where host == "epsilon" | eval l1=replace(host_group, "\\s+", "") | sort l1 | keep l1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just in case any other change might be applied: keep host_group too.

@@ -0,0 +1,16 @@
[[esql-replace]]
=== `REPLACE`
This function substitutes the replacement string (3rd argument) for every occurrence of the regular expression (2nd argument) in the string (1st argument).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional alternative: The function substitutes in the string (1st argument) any match of the regular expression (2nd argument) with the replacement string (3rd argument).
I'd find it easier to "read the function" as it flows.

regexPattern = Pattern.compile(((BytesRef) regex.fold()).utf8ToString());
} catch (PatternSyntaxException pse) {
// TODO this is not right (inconsistent). See also https://github.com/elastic/elasticsearch/issues/100038
// this should generate a header warning and return null (as do the rest of this functionality in evaluators),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO throwing here (as it now happens) is what we should align to: the argument (the reg. expression) isn't correct and we should inform the user before execution; less so, but somewhat similar to a syntax error.
We could let that happen maybe even earlier (like when folding). So the try-catch could be dropped, IMO.
(But we can go ahead with merging this as is and revisiting when deciding on #100038).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am letting this one as is for now. I reckon we'll have a discussion about this general approach (not only here) and whatever we decide then we'll be applied everywhere. I don't like the inconsistent behavior, but I do also see a reasoning in throwing an exception here. I'm very much in favor of the consistency approach, though.

assertThat(process("I love cats and cats are amazing.", "\\bcats\\b", "dogs"), equalTo("I love dogs and dogs are amazing."));
}

public void testUnicode() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@astefan
Copy link
Contributor

astefan commented Sep 29, 2023

@elasticmachine update branch

@astefan astefan merged commit 2ccdae6 into main Sep 29, 2023
@astefan astefan deleted the esql/replaceFunction branch September 29, 2023 14:41
piergm pushed a commit to piergm/elasticsearch that referenced this pull request Oct 2, 2023
Co-authored-by: Alexandros Batsakis <abatsakis@splunk.com>
Co-authored-by: Andrei Stefan <andrei@elastic.co>
jakelandis pushed a commit to jakelandis/elasticsearch that referenced this pull request Oct 2, 2023
Co-authored-by: Alexandros Batsakis <abatsakis@splunk.com>
Co-authored-by: Andrei Stefan <andrei@elastic.co>
@nik9000 nik9000 mentioned this pull request Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >enhancement ES|QL-ui Impacts ES|QL UI external-contributor Pull request authored by a developer outside the Elasticsearch team Team:QL (Deprecated) Meta label for query languages team v8.11.0

7 participants