
Fix ML calendar event update scalability issues #136886

Merged
valeriy42 merged 23 commits into elastic:main from valeriy42:bugfix/limited-update-notification-queue
Nov 13, 2025

Conversation


@valeriy42 valeriy42 commented Oct 21, 2025

Fixes an issue where calendar events failed to update some jobs when a calendar was associated with a large number of jobs (>1000), due to queue capacity limits and sequential processing.

Problem: UpdateJobProcessNotifier has a 1000-item queue and processes updates sequentially. It uses offer() on the queue, which silently drops updates when the queue is full.
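For context, a bounded java.util.concurrent queue's offer() returns false rather than blocking when the queue is full, which is how updates could be lost without any error being raised. A minimal stand-alone sketch of that behavior (capacity 2 instead of 1000 for brevity; this is an illustration, not the actual notifier code):

```java
import java.util.concurrent.ArrayBlockingQueue;

public class OfferDropDemo {
    // Returns true iff the update was enqueued; false means it was dropped.
    static boolean submit(ArrayBlockingQueue<String> queue, String update) {
        return queue.offer(update);
    }

    public static void main(String[] args) {
        // Bounded queue analogous to the notifier's 1000-item queue.
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        System.out.println(submit(queue, "update-1")); // true
        System.out.println(submit(queue, "update-2")); // true
        // Queue full: offer() neither blocks nor throws, it just returns false,
        // so an unchecked return value silently loses the update.
        System.out.println(submit(queue, "update-3")); // false
    }
}
```

With more than 1000 pending per-job updates, every additional offer() fails the same way, which matches the symptom of only some jobs being updated.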

However, calendar/filter updates don't need ordering guarantees, so JobManager.submitJobEventUpdate() can bypass the queue entirely and avoid its capacity bottleneck.

Another problem is the "fire-and-forget" pattern: submitJobEventUpdate() returns immediately without waiting for the update to complete. I introduce RefCountingListener to track the calendar updates: a background thread updates the jobs and records succeeded, failed, and skipped updates, while the request returns immediately to prevent client timeouts.
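The actual change uses Elasticsearch's RefCountingListener; the underlying pattern can be sketched with plain java.util.concurrent primitives. All names below (submitJobEventUpdates, the counters) are illustrative, not the real API: fan out per-job updates in the background, count outcomes, and fire a completion callback once the last reference is released, while the caller returns at once.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class BackgroundUpdateDemo {
    // Simplified analogue of the ref-counting listener pattern: one "reference"
    // per job; the completion callback runs when the count reaches zero.
    static void submitJobEventUpdates(List<String> jobIds,
                                      ExecutorService executor,
                                      Runnable onAllDone,
                                      AtomicInteger succeeded,
                                      AtomicInteger failed) {
        AtomicInteger refs = new AtomicInteger(jobIds.size());
        for (String jobId : jobIds) {
            executor.execute(() -> {
                try {
                    // Placeholder for the real per-job process update.
                    if (!jobId.isEmpty()) succeeded.incrementAndGet();
                } catch (Exception e) {
                    failed.incrementAndGet();
                } finally {
                    if (refs.decrementAndGet() == 0) onAllDone.run();
                }
            });
        }
        // The caller returns here immediately; the API response is not blocked
        // on the per-job updates, which avoids client timeouts.
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        AtomicInteger ok = new AtomicInteger(), bad = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);
        submitJobEventUpdates(List.of("job-1", "job-2", "job-3"),
                executor, done::countDown, ok, bad);
        done.await();
        System.out.println("succeeded=" + ok.get() + " failed=" + bad.get());
        executor.shutdown();
    }
}
```

The key property is that per-job failures are counted and logged rather than either blocking the response or being dropped, which is what the succeeded/failed/skipped tracking in the PR provides.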

Finally, in case some job updates still fail, I enhanced logging throughout the update path to leave a trace for future diagnostics.

  • Refactor JobManager.submitJobEventUpdate() to bypass UpdateJobProcessNotifier queue
  • Use RefCountingListener for parallel calendar/filter updates
  • Add comprehensive logging throughout the system
  • Create CalendarScalabilityIT integration tests
  • Add helper methods to base test class

Fixes #129777

@valeriy42 valeriy42 added >bug v9.3.0 auto-backport Automatically create backport pull requests when merged v8.19.6 v9.1.6 v9.2.1 v8.19.7 v9.1.7 :ml Machine learning and removed v9.1.6 v8.19.6 labels Oct 21, 2025
@elasticsearchmachine

Hi @valeriy42, I've created a changelog YAML for you.

valeriy42 and others added 10 commits October 21, 2025 17:17
…g to API calls and processing job updates asynchronously in the background.
…e handling in JobManager to include skipped updates. Update logging to reflect skipped updates during background calendar processing.
…hods and updating job creation visibility. Enhance ScheduledEventsIT to verify asynchronous calendar updates and add a plugin for tracking UpdateProcessAction calls.
…the updated logging package. This change improves consistency and aligns with recent codebase updates.
@valeriy42 valeriy42 marked this pull request as ready for review October 22, 2025 13:54
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Oct 22, 2025
@elasticsearchmachine

Pinging @elastic/ml-core (Team:ML)

@davidkyle davidkyle self-requested a review October 22, 2025 14:18
@valeriy42 valeriy42 requested a review from benwtrent October 31, 2025 12:32
@valeriy42

@DonalEvans, @benwtrent, @davidkyle thank you for your comments. I introduced the suggested changes. Looking forward to your new feedback.


@davidkyle davidkyle left a comment


LGTM

@valeriy42 valeriy42 added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Nov 11, 2025
@valeriy42

The follow-up process optimization is captured in Issue #137872.

@benwtrent

I was only concerned about the noisy logging, so I shall remove my review request. If Dave K says it's good, it's good.

@benwtrent benwtrent removed their request for review November 11, 2025 13:43
@valeriy42 valeriy42 merged commit edddc82 into elastic:main Nov 13, 2025
35 checks passed
@valeriy42 valeriy42 deleted the bugfix/limited-update-notification-queue branch November 13, 2025 10:25
valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Nov 13, 2025
@elasticsearchmachine

💚 Backport successful to branches 9.2, 8.19, and 9.1
valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Nov 13, 2025
valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Nov 13, 2025
elasticsearchmachine pushed a commit that referenced this pull request Nov 13, 2025
elasticsearchmachine pushed a commit that referenced this pull request Nov 13, 2025
elasticsearchmachine pushed a commit that referenced this pull request Nov 13, 2025
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 13, 2025
…-json

* upstream/main: (158 commits), including:
  [ML] Fix ML calendar event update scalability issues (elastic#136886)
  ...

Labels

auto-backport Automatically create backport pull requests when merged >bug cloud-deploy Publish cloud docker image for Cloud-First-Testing :ml Machine learning Team:ML Meta label for the ML team v8.19.8 v9.1.8 v9.2.2 v9.3.0

5 participants