[Observability:Streams] Fix too_small zod error for ai pipeline suggestions that have empty string grok patterns#251113

Merged
couvq merged 5 commits into elastic:main from couvq:fix_ai_suggestion_too_small_error
Feb 17, 2026

Conversation

@couvq
Contributor

@couvq couvq commented Jan 30, 2026

Closes https://github.com/elastic/observability-error-backlog/issues/407
Closes https://github.com/elastic/observability-error-backlog/issues/452

Description

The suggestions pipeline was generating grok processors with empty-string patterns, which triggered a zod `too_small` validation error when generating a pipeline suggestion. This PR filters out any empty-string patterns, which resolves the error we have been seeing.
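The filter can be sketched roughly like this (a minimal sketch with illustrative names, not the actual Kibana code; zod's `z.string().min(1)` is what reports a `too_small` issue for an empty string):

```typescript
// Minimal sketch, assuming the suggestion schema validates each grok
// pattern with something like zod's z.string().min(1), which reports a
// `too_small` issue for an empty string. Names are illustrative only.
function sanitizeGrokPatterns(patterns: string[]): string[] {
  // Drop empty-string patterns before validation so the schema never sees them.
  return patterns.filter((pattern) => pattern.length > 0);
}

const suggested = ['%{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:message}', ''];
const clean = sanitizeGrokPatterns(suggested);
// clean keeps only the non-empty pattern
```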

Before

Screen.Recording.2026-01-30.at.9.22.55.AM.mov

After

Screen.Recording.2026-01-30.at.12.28.12.PM.mov
@couvq couvq changed the title fix too_small error for ai pipeline suggestions Jan 30, 2026
@couvq couvq added labels: backport:version, release_note:fix, Team:obs-onboarding, Feature:Streams, v9.4.0, v9.3.1 Jan 30, 2026
@couvq couvq marked this pull request as ready for review January 30, 2026 17:33
@couvq couvq requested review from a team as code owners January 30, 2026 17:33
@elasticmachine
Contributor

Pinging @elastic/obs-onboarding-team (Team:obs-onboarding)

@flash1293
Contributor

@couvq Sorry for the late reply - I tested this and it seems to work fine in the UI.

However, the evals return a zero score:

Screenshot 2026-02-06 at 12 10 30

I took a look at the traces and it can't figure out how to create a pipeline that's actually parsing something, so it just gives up. Which might be OK for the data at hand, but then it's not a good eval, since the expected thing happens, the score shouldn't be 0.

Actually for this data this is probably the better behavior than trying to invent a meaningless pipeline that breaks more than it actually does (a good test we are currently missing). I'd say we should change the eval to expect no pipeline (0 processing steps) in this case.

Wdyt @LucaWintergerst ? If you take a look at the sample data, what outcome would you expect from the LLM?

@LucaWintergerst
Contributor

I agree, not getting a result here is the better outcome we'd want to test for. If it does suggest one, that would indicate that it's very, very eager to do things even if it has very few good reasons to actually try processing things.

@couvq couvq force-pushed the fix_ai_suggestion_too_small_error branch from 4cd4c7b to ea59472 Compare February 15, 2026 21:21
@couvq
Contributor Author

couvq commented Feb 15, 2026

> @couvq Sorry for the late reply - I tested this and it seems to work fine in the UI.
>
> However, the evals return a zero score:
>
> Screenshot 2026-02-06 at 12 10 30
>
> I took a look at the traces and it can't figure out how to create a pipeline that's actually parsing something, so it just gives up. Which might be OK for the data at hand, but then it's not a good eval, since the expected thing happens, the score shouldn't be 0.
>
> Actually for this data this is probably the better behavior than trying to invent a meaningless pipeline that breaks more than it actually does (a good test we are currently missing). I'd say we should change the eval to expect no pipeline (0 processing steps) in this case.

@flash1293 I've added a commit to change the eval to expect no pipeline ea59472

@flash1293
Contributor

flash1293 commented Feb 16, 2026

@couvq it still returns a super low score, I think we need a bit of a deeper change here.
Screenshot 2026-02-16 at 10 01 28

Check how to run the evals locally to iterate: x-pack/platform/packages/shared/kbn-evals-suite-streams/README.md

@couvq
Contributor Author

couvq commented Feb 16, 2026

> @couvq it still returns a super low score, I think we need a bit of a deeper change here.
>
> Screenshot 2026-02-16 at 10 01 28
>
> Check how to run the evals locally to iterate: x-pack/platform/packages/shared/kbn-evals-suite-streams/README.md

@flash1293 Are we targeting to get pretty close to a 1.0 score ideally? Which model are you using?

@flash1293
Contributor

@couvq If we think the behavior is correct, then a good score would make sense, right? 1 is perfect. I'm using 4.5 sonnet

@couvq
Contributor Author

couvq commented Feb 17, 2026

@flash1293 I made some changes to the LLM prompt to add instructions on when not to create a pipeline, to handle this case. I also updated the eval logic to give a perfect score when no pipeline was generated and none was expected, and a 0 score when no pipeline was generated but one was expected. Now the LLM properly generates an empty pipeline and gets a perfect score on the new eval. How do you feel about this approach?
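The scoring rule described above can be sketched as follows (hypothetical names, not the actual eval suite code):

```typescript
// Hedged sketch of the described rule: an empty suggestion scores 1 when
// no pipeline was expected and 0 when processors were expected; any other
// case falls through to the regular quality scoring (null here).
// Interface and function names are illustrative, not from the PR.
interface PipelineEvalCase {
  expectedStepCount: number;
  generatedStepCount: number;
}

function scoreEmptyPipeline(c: PipelineEvalCase): number | null {
  if (c.generatedStepCount === 0) {
    return c.expectedStepCount === 0 ? 1 : 0;
  }
  return null; // defer to the normal quality scorer
}
```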

@flash1293
Contributor

@couvq Thanks a lot for this! Could you run the whole pipeline suggestion eval suite and paste the result here? Soon this should work automatically, but it doesn't yet.

@couvq
Contributor Author

couvq commented Feb 17, 2026

> @couvq Thanks a lot for this! Could you run the whole pipeline suggestion eval suite and paste the result here? Soon this should work automatically, but it doesn't yet.

      ╔══════════════════════════════════╤═══╤══════════════════════╤════════════════════════╗
      ║ Dataset                          │ # │ llm_pipeline_quality │ pipeline_quality_score ║
      ╟──────────────────────────────────┼───┼──────────────────────┼────────────────────────╢
      ║ Pipeline Suggestion - structured │ 1 │              mean: 1 │                mean: 1 ║
      ║                                  │   │            median: 1 │              median: 1 ║
      ║                                  │   │               std: 0 │                 std: 0 ║
      ║                                  │   │               min: 1 │                 min: 1 ║
      ║                                  │   │               max: 1 │                 max: 1 ║
      ╟──────────────────────────────────┼───┼──────────────────────┼────────────────────────╢
      ║ Pipeline Suggestion - HDFS       │ 1 │              mean: 1 │             mean: 0.95 ║
      ║                                  │   │            median: 1 │           median: 0.95 ║
      ║                                  │   │               std: 0 │                 std: 0 ║
      ║                                  │   │               min: 1 │              min: 0.95 ║
      ║                                  │   │               max: 1 │              max: 0.95 ║
      ╟──────────────────────────────────┼───┼──────────────────────┼────────────────────────╢
      ║ Overall                          │ 2 │              mean: 1 │             mean: 0.97 ║
      ║                                  │   │            median: 1 │           median: 0.97 ║
      ║                                  │   │               std: 0 │              std: 0.04 ║
      ║                                  │   │               min: 1 │              min: 0.95 ║
      ║                                  │   │               max: 1 │                 max: 1 ║
      ╚══════════════════════════════════╧═══╧══════════════════════╧════════════════════════╝

@flash1293 Looks like my changes broke 4 of the preexisting tests, as it is now a bit too eager to return an empty pipeline. I'll tweak the LLM prompt again to fix those.

@couvq
Contributor Author

couvq commented Feb 17, 2026

@flash1293 fixed and now all the evals run properly

 ═══ EVALUATION RESULTS ═══
      ╔══════════════════════════════════╤═══╤══════════════════════╤════════════════════════╗
      ║ Dataset                          │ # │ llm_pipeline_quality │ pipeline_quality_score ║
      ╟──────────────────────────────────┼───┼──────────────────────┼────────────────────────╢
      ║ Pipeline Suggestion - Apache     │ 1 │              mean: 1 │             mean: 0.98 ║
      ║                                  │   │            median: 1 │           median: 0.98 ║
      ║                                  │   │               std: 0 │                 std: 0 ║
      ║                                  │   │               min: 1 │              min: 0.98 ║
      ║                                  │   │               max: 1 │              max: 0.98 ║
      ╟──────────────────────────────────┼───┼──────────────────────┼────────────────────────╢
      ║ Pipeline Suggestion - OpenSSH    │ 1 │              mean: 1 │             mean: 0.88 ║
      ║                                  │   │            median: 1 │           median: 0.88 ║
      ║                                  │   │               std: 0 │                 std: 0 ║
      ║                                  │   │               min: 1 │              min: 0.88 ║
      ║                                  │   │               max: 1 │              max: 0.88 ║
      ╟──────────────────────────────────┼───┼──────────────────────┼────────────────────────╢
      ║ Pipeline Suggestion - structured │ 1 │              mean: 1 │                mean: 1 ║
      ║                                  │   │            median: 1 │              median: 1 ║
      ║                                  │   │               std: 0 │                 std: 0 ║
      ║                                  │   │               min: 1 │                 min: 1 ║
      ║                                  │   │               max: 1 │                 max: 1 ║
      ╟──────────────────────────────────┼───┼──────────────────────┼────────────────────────╢
      ║ Pipeline Suggestion - Spark      │ 1 │              mean: 1 │             mean: 0.97 ║
      ║                                  │   │            median: 1 │           median: 0.97 ║
      ║                                  │   │               std: 0 │                 std: 0 ║
      ║                                  │   │               min: 1 │              min: 0.97 ║
      ║                                  │   │               max: 1 │              max: 0.97 ║
      ╟──────────────────────────────────┼───┼──────────────────────┼────────────────────────╢
      ║ Pipeline Suggestion - HDFS       │ 1 │              mean: 1 │             mean: 0.95 ║
      ║                                  │   │            median: 1 │           median: 0.95 ║
      ║                                  │   │               std: 0 │                 std: 0 ║
      ║                                  │   │               min: 1 │              min: 0.95 ║
      ║                                  │   │               max: 1 │              max: 0.95 ║
      ╟──────────────────────────────────┼───┼──────────────────────┼────────────────────────╢
      ║ Pipeline Suggestion - Zookeeper  │ 1 │              mean: 1 │             mean: 0.95 ║
      ║                                  │   │            median: 1 │           median: 0.95 ║
      ║                                  │   │               std: 0 │                 std: 0 ║
      ║                                  │   │               min: 1 │              min: 0.95 ║
      ║                                  │   │               max: 1 │              max: 0.95 ║
      ╟──────────────────────────────────┼───┼──────────────────────┼────────────────────────╢
      ║ Overall                          │ 6 │              mean: 1 │             mean: 0.95 ║
      ║                                  │   │            median: 1 │           median: 0.95 ║
      ║                                  │   │               std: 0 │              std: 0.04 ║
      ║                                  │   │               min: 1 │              min: 0.88 ║
      ║                                  │   │               max: 1 │                 max: 1 ║
      ╚══════════════════════════════════╧═══╧══════════════════════╧════════════════════════╝
@flash1293
Contributor

@couvq I'm not sure how 9c7d4c9 discourages the LLM from committing an empty pipeline, can you explain?

@elasticmachine
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

@couvq
Contributor Author

couvq commented Feb 17, 2026

> @couvq I'm not sure how 9c7d4c9 discourages the LLM from committing an empty pipeline, can you explain?

@flash1293 The intention there is to explicitly discourage committing an empty pipeline when a parsing processor is provided. I added it because the 4 failing tests expected processors but the LLM was committing an empty pipeline.

Contributor

@flash1293 flash1293 left a comment


ooh, got it, LGTM

@couvq couvq merged commit 8405020 into elastic:main Feb 17, 2026
16 checks passed
@couvq
Contributor Author

couvq commented Feb 17, 2026

@flash1293 thanks for the thorough review!

@kibanamachine
Contributor

Starting backport for target branches: 9.3

https://github.com/elastic/kibana/actions/runs/22109625238

@kibanamachine
Contributor

💔 All backports failed

| Branch | Result |
| --- | --- |
| 9.3 | Backport failed because of merge conflicts |

Manual backport

To create the backport manually run:

`node scripts/backport --pr 251113`

Questions?

Please refer to the Backport tool documentation

patrykkopycinski pushed a commit to patrykkopycinski/kibana that referenced this pull request Feb 19, 2026
…stions that have empty string grok patterns (elastic#251113)

Closes elastic/observability-error-backlog#407
Closes elastic/observability-error-backlog#452

## Description
The suggestions pipeline was generating grok patterns that had empty
string patterns, leading to a `too_small` error when generating a
pipeline suggestion. This PR filters out any patterns that have empty
string inputs, which resolved the error we have been seeing.

## Before

https://github.com/user-attachments/assets/c8cdb277-d0f0-4272-b94d-0aa244c841a9

## After

https://github.com/user-attachments/assets/8864ad1a-51c9-4b6a-b11c-e3d48668a5ad
ersin-erdal pushed a commit to ersin-erdal/kibana that referenced this pull request Feb 19, 2026
…stions that have empty string grok patterns (elastic#251113)

@kibanamachine kibanamachine added the backport missing label Feb 19, 2026
@kibanamachine
Contributor

Friendly reminder: Looks like this PR hasn't been backported yet.
To create backports automatically, add a backport:* label, or prevent reminders by adding the backport:skip label.
You can also create backports manually by running `node scripts/backport --pr 251113` locally.
cc: @couvq

@couvq couvq added the backport:skip label and removed the backport missing and backport:version labels Feb 20, 2026