Skip to content

Reduce locking when persisting ML job statistics.#141519

Merged
jan-elastic merged 8 commits intoelastic:mainfrom
jan-elastic:reduce-DataCountsReporter-locking
Jan 30, 2026
Merged

Reduce locking when persisting ML job statistics.#141519
jan-elastic merged 8 commits intoelastic:mainfrom
jan-elastic:reduce-DataCountsReporter-locking

Conversation

@jan-elastic
Copy link
Contributor

@jan-elastic jan-elastic commented Jan 29, 2026

fixes #140511

@jan-elastic jan-elastic added >bug :ml Machine learning Team:ML Meta label for the ML team v9.4.0 labels Jan 29, 2026
@jan-elastic jan-elastic requested a review from valeriy42 January 29, 2026 14:09
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Copy link
Collaborator

Hi @jan-elastic, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Hi @jan-elastic, I've updated the changelog YAML for you.


public synchronized DataCounts runningTotalStats() {
public DataCounts runningTotalStats() {
totalRecordStats.calcProcessedFieldCount(getAnalysedFieldsPerRecord());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this "update the object upon read" is a bit strange.

I've considered to:

  • move analysedFieldsPerRecord to DataCounts
  • make the calcProcessedFieldCount private
  • call it when you call setXyz() (Xyz is something that influences processFieldCount)

What do you think about this?

I didn't do it, because I wanted the changes to be as minimal as possible.

Copy link
Contributor

@valeriy42 valeriy42 Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't make this change because this would increase the overhead of the method reportRecordWritten() by about 20%. This method lies on the hot path and is executed thousands of times per minute. It should be kept as fast as possible, even at the cost of a more cumbersome design.

synchronized (this) {
unreportedStats = statsToReport;
}
unreportedStats.set(statsToReport);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following race scenario between finishReporting and writeUnreportedCounts is possible:

Thread A (finishReporting):

1. unreportedStats.set(null)              // Clear unreported stats
2. statsToReport = runningTotalStats()
3. persistDataCounts(..., false) → returns false (previous persist in progress)

Thread B (writeUnreportedCounts):

4. getAndSet(null) → returns null         // Sees null, does nothing

Thread A:

5. unreportedStats.set(statsToReport)     // Sets stats that should have been persisted

Result: Stats are in unreportedStats but won't be persisted until the next writeUnreportedCounts() call

This is not very bad, since the next flushJob() will persist them (datafeeds flush regularly, typically every few minutes). But it's probably worth documenting at this point that we guarantee only eventual consistency.


public synchronized DataCounts runningTotalStats() {
public DataCounts runningTotalStats() {
totalRecordStats.calcProcessedFieldCount(getAnalysedFieldsPerRecord());
Copy link
Contributor

@valeriy42 valeriy42 Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't make this change because this would increase the overhead of the method reportRecordWritten() by about 20%. This method lies on the hot path and is executed thousands of times per minute. It should be kept as fast as possible, even at the cost of a more cumbersome design.

@valeriy42 valeriy42 self-requested a review January 30, 2026 08:52
@jan-elastic jan-elastic force-pushed the reduce-DataCountsReporter-locking branch from bcf196d to 8579849 Compare January 30, 2026 08:57
Copy link
Contributor Author

@jan-elastic jan-elastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test LGTM.

Copy link
Contributor

@valeriy42 valeriy42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for fixing this!

@jan-elastic jan-elastic force-pushed the reduce-DataCountsReporter-locking branch from 1f2078f to 1c74a01 Compare January 30, 2026 10:38
@valeriy42 valeriy42 added auto-backport Automatically create backport pull requests when merged v9.3.1 v8.19.12 v9.2.6 labels Jan 30, 2026
@jan-elastic jan-elastic merged commit 726299b into elastic:main Jan 30, 2026
35 checks passed
jan-elastic added a commit to jan-elastic/elasticsearch that referenced this pull request Jan 30, 2026
* Reduce locking when persisting ML job statistics.

* Update docs/changelog/141519.yaml

* Update docs/changelog/141519.yaml

* add unit test

* improved timeout exception handling.

* count down latch on exception

* fix javadoc in unittest

* [CI] Auto commit changes from spotless

---------

Co-authored-by: Valeriy Khakhutskyy <1292899+valeriy42@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
9.3
8.19
9.2
jan-elastic added a commit to jan-elastic/elasticsearch that referenced this pull request Jan 30, 2026
* Reduce locking when persisting ML job statistics.

* Update docs/changelog/141519.yaml

* Update docs/changelog/141519.yaml

* add unit test

* improved timeout exception handling.

* count down latch on exception

* fix javadoc in unittest

* [CI] Auto commit changes from spotless

---------

Co-authored-by: Valeriy Khakhutskyy <1292899+valeriy42@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
jan-elastic added a commit to jan-elastic/elasticsearch that referenced this pull request Jan 30, 2026
* Reduce locking when persisting ML job statistics.

* Update docs/changelog/141519.yaml

* Update docs/changelog/141519.yaml

* add unit test

* improved timeout exception handling.

* count down latch on exception

* fix javadoc in unittest

* [CI] Auto commit changes from spotless

---------

Co-authored-by: Valeriy Khakhutskyy <1292899+valeriy42@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
elasticsearchmachine pushed a commit that referenced this pull request Jan 30, 2026
* Reduce locking when persisting ML job statistics.

* Update docs/changelog/141519.yaml

* Update docs/changelog/141519.yaml

* add unit test

* improved timeout exception handling.

* count down latch on exception

* fix javadoc in unittest

* [CI] Auto commit changes from spotless

---------

Co-authored-by: Valeriy Khakhutskyy <1292899+valeriy42@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
jan-elastic added a commit that referenced this pull request Jan 30, 2026
* Reduce locking when persisting ML job statistics.

* Update docs/changelog/141519.yaml

* Update docs/changelog/141519.yaml

* add unit test

* improved timeout exception handling.

* count down latch on exception

* fix javadoc in unittest

* [CI] Auto commit changes from spotless

---------

Co-authored-by: Valeriy Khakhutskyy <1292899+valeriy42@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
elasticsearchmachine pushed a commit that referenced this pull request Jan 30, 2026
* Reduce locking when persisting ML job statistics.

* Update docs/changelog/141519.yaml

* Update docs/changelog/141519.yaml

* add unit test

* improved timeout exception handling.

* count down latch on exception

* fix javadoc in unittest

* [CI] Auto commit changes from spotless

---------

Co-authored-by: Valeriy Khakhutskyy <1292899+valeriy42@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
elasticsearchmachine pushed a commit that referenced this pull request Jan 30, 2026
* Reduce locking when persisting ML job statistics.

* Update docs/changelog/141519.yaml

* Update docs/changelog/141519.yaml

* add unit test

* improved timeout exception handling.

* count down latch on exception

* fix javadoc in unittest

* [CI] Auto commit changes from spotless

---------

Co-authored-by: Valeriy Khakhutskyy <1292899+valeriy42@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 1, 2026
* Reduce locking when persisting ML job statistics.

* Update docs/changelog/141519.yaml

* Update docs/changelog/141519.yaml

* add unit test

* improved timeout exception handling.

* count down latch on exception

* fix javadoc in unittest

* [CI] Auto commit changes from spotless

---------

Co-authored-by: Valeriy Khakhutskyy <1292899+valeriy42@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
breskeby added a commit that referenced this pull request Feb 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged >bug :ml Machine learning Team:ML Meta label for the ML team v8.19.12 v9.2.6 v9.3.1 v9.4.0

3 participants