
Fix ML calendar event update scalability issues #136886

Merged
valeriy42 merged 23 commits into elastic:main from valeriy42:bugfix/limited-update-notification-queue
Nov 13, 2025

Conversation


@valeriy42 valeriy42 commented Oct 21, 2025

Fixes an issue where calendar events failed to update some jobs when a calendar was associated with a large number of jobs (>1000), due to queue capacity limits and sequential processing.

Problem: UpdateJobProcessNotifier has a 1000-item queue and processes updates sequentially. It uses offer() on the queue, which silently drops updates when the queue is full.
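For context, a bounded java.util.concurrent queue's offer() returns false rather than blocking when the queue is full, which is how updates could be lost without any error being raised. A minimal stand-alone sketch of that behavior (capacity 2 instead of 1000 for brevity; this is an illustration, not the actual notifier code):

```java
import java.util.concurrent.ArrayBlockingQueue;

public class OfferDropDemo {
    // Returns true iff the update was enqueued; false means it was dropped.
    static boolean submit(ArrayBlockingQueue<String> queue, String update) {
        return queue.offer(update);
    }

    public static void main(String[] args) {
        // Bounded queue analogous to the notifier's 1000-item queue.
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        System.out.println(submit(queue, "update-1")); // true
        System.out.println(submit(queue, "update-2")); // true
        // Queue full: offer() neither blocks nor throws, it just returns false,
        // so an unchecked return value silently loses the update.
        System.out.println(submit(queue, "update-3")); // false
    }
}
```

With more than 1000 pending per-job updates, every additional offer() fails the same way, which matches the symptom of only some jobs being updated.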

However, calendar/filter updates don't need ordering guarantees, so JobManager.submitJobEventUpdate() can bypass the queue entirely and avoid its capacity bottleneck.

Another problem is the "fire-and-forget" pattern: submitJobEventUpdate() returns immediately without waiting for the update to complete. I introduce RefCountingListener to track the calendar updates: a background thread updates the jobs and records succeeded, failed, and skipped updates, while the request returns immediately to prevent client timeouts.
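The actual change uses Elasticsearch's RefCountingListener; the underlying pattern can be sketched with plain java.util.concurrent primitives. All names below (submitJobEventUpdates, the counters) are illustrative, not the real API: fan out per-job updates in the background, count outcomes, and fire a completion callback once the last reference is released, while the caller returns at once.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class BackgroundUpdateDemo {
    // Simplified analogue of the ref-counting listener pattern: one "reference"
    // per job; the completion callback runs when the count reaches zero.
    static void submitJobEventUpdates(List<String> jobIds,
                                      ExecutorService executor,
                                      Runnable onAllDone,
                                      AtomicInteger succeeded,
                                      AtomicInteger failed) {
        AtomicInteger refs = new AtomicInteger(jobIds.size());
        for (String jobId : jobIds) {
            executor.execute(() -> {
                try {
                    // Placeholder for the real per-job process update.
                    if (!jobId.isEmpty()) succeeded.incrementAndGet();
                } catch (Exception e) {
                    failed.incrementAndGet();
                } finally {
                    if (refs.decrementAndGet() == 0) onAllDone.run();
                }
            });
        }
        // The caller returns here immediately; the API response is not blocked
        // on the per-job updates, which avoids client timeouts.
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        AtomicInteger ok = new AtomicInteger(), bad = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);
        submitJobEventUpdates(List.of("job-1", "job-2", "job-3"),
                executor, done::countDown, ok, bad);
        done.await();
        System.out.println("succeeded=" + ok.get() + " failed=" + bad.get());
        executor.shutdown();
    }
}
```

The key property is that per-job failures are counted and logged rather than either blocking the response or being dropped, which is what the succeeded/failed/skipped tracking in the PR provides.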

Finally, in case some job updates still fail, I enhanced logging throughout the update path to leave a trace for future diagnostics.

  • Refactor JobManager.submitJobEventUpdate() to bypass UpdateJobProcessNotifier queue
  • Use RefCountingListener for parallel calendar/filter updates
  • Add comprehensive logging throughout the system
  • Create CalendarScalabilityIT integration tests
  • Add helper methods to base test class

Fixes #129777

@valeriy42 valeriy42 added >bug v9.3.0 auto-backport Automatically create backport pull requests when merged v8.19.6 v9.1.6 v9.2.1 v8.19.7 v9.1.7 :ml Machine learning and removed v9.1.6 v8.19.6 labels Oct 21, 2025
@elasticsearchmachine

Hi @valeriy42, I've created a changelog YAML for you.

valeriy42 and others added 10 commits October 21, 2025 17:17
…g to API calls and processing job updates asynchronously in the background.
…e handling in JobManager to include skipped updates. Update logging to reflect skipped updates during background calendar processing.
…hods and updating job creation visibility. Enhance ScheduledEventsIT to verify asynchronous calendar updates and add a plugin for tracking UpdateProcessAction calls.
…the updated logging package. This change improves consistency and aligns with recent codebase updates.
@valeriy42 valeriy42 marked this pull request as ready for review October 22, 2025 13:54
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Oct 22, 2025
@elasticsearchmachine

Pinging @elastic/ml-core (Team:ML)

@davidkyle davidkyle self-requested a review October 22, 2025 14:18
@valeriy42 valeriy42 requested a review from benwtrent October 31, 2025 12:32
@valeriy42

@DonalEvans, @benwtrent, @davidkyle thank you for your comments. I introduced the suggested changes. Looking forward to your new feedback.


@davidkyle davidkyle left a comment


LGTM

@valeriy42 valeriy42 added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Nov 11, 2025
@valeriy42

The follow-up process optimization is captured in Issue #137872.

@benwtrent

I was only concerned about the noisy logging, so I shall remove my review request. If Dave K says it's good, it's good.

@benwtrent benwtrent removed their request for review November 11, 2025 13:43
@valeriy42 valeriy42 merged commit edddc82 into elastic:main Nov 13, 2025
35 checks passed
@valeriy42 valeriy42 deleted the bugfix/limited-update-notification-queue branch November 13, 2025 10:25
valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Nov 13, 2025
@elasticsearchmachine

💚 Backport successful to branches 9.2, 8.19, and 9.1
valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Nov 13, 2025
valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Nov 13, 2025
elasticsearchmachine pushed a commit that referenced this pull request Nov 13, 2025
elasticsearchmachine pushed a commit that referenced this pull request Nov 13, 2025
elasticsearchmachine pushed a commit that referenced this pull request Nov 13, 2025
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 13, 2025
…-json

* upstream/main: (158 commits), including:
  [ML] Fix ML calendar event update scalability issues (elastic#136886)
  ...

Labels

auto-backport Automatically create backport pull requests when merged >bug cloud-deploy Publish cloud docker image for Cloud-First-Testing :ml Machine learning Team:ML Meta label for the ML team v8.19.8 v9.1.8 v9.2.2 v9.3.0

5 participants