Skip to content

Ensure queries return correctly during rolling upgrades of stateful cluster with RF 3 and only 3 nodes. #2503

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 22, 2020

Conversation

tomwilkie
Copy link
Contributor

@tomwilkie tomwilkie commented Apr 22, 2020

Signed-off-by: Tom Wilkie tom@grafana.com

What this PR does:

  • Use a real ring with mock KV when testing distributor. This is to tease out errors in the replication logic.
  • Extend the distributor tests to cover the case of RF=3 with 2 ingesters.
  • Fix ring.GetAll to correctly calculate maxErrors when RF=3 and 2 ingesters.

Which issue(s) this PR fixes:
Fixes #2504

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
This is to teast out errors in the replication logic.

Signed-off-by: Tom Wilkie <tom@grafana.com>
Signed-off-by: Tom Wilkie <tom@grafana.com>
@pull-request-size pull-request-size bot added size/L and removed size/M labels Apr 22, 2020
…RF=3 and #ingesters=2.

Signed-off-by: Tom Wilkie <tom@grafana.com>
Copy link
Contributor

@gouthamve gouthamve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow tricky bug, but the fix makes sense! Thanks for adding the test!

Signed-off-by: Tom Wilkie <tom@grafana.com>
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ring changes LGTM. Due to the change in the distributor tests prepare() we need to also fix the tests TestDistributor_PushIngestionRateLimiter, TestDistributor_Push_LabelRemoval, TestDistributor_Push_ShouldGuaranteeShardingTokenConsistencyOverTheTime. I've also left a couple of comments in the code.

Signed-off-by: Tom Wilkie <tom@grafana.com>
Signed-off-by: Tom Wilkie <tom@grafana.com>
Signed-off-by: Tom Wilkie <tom@grafana.com>
Signed-off-by: Tom Wilkie <tom@grafana.com>
…eding when it shouldn't

Signed-off-by: Tom Wilkie <tom@grafana.com>
@tomwilkie tomwilkie marked this pull request as ready for review April 22, 2020 19:00
Copy link
Contributor

@jtlisi jtlisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I added a minor suggestion but it's not consequential at all.

Co-Authored-By: Jacob Lisi <jacob.t.lisi@gmail.com>
Signed-off-by: Tom Wilkie <tom@grafana.com>
@tomwilkie tomwilkie force-pushed the rf-three-with-two-ingesters branch from 3eb03c0 to 2619f25 Compare April 22, 2020 19:54
@tomwilkie tomwilkie merged commit fa0fc64 into master Apr 22, 2020
@tomwilkie tomwilkie deleted the rf-three-with-two-ingesters branch April 22, 2020 21:00
@bboreham
Copy link
Contributor

Does this fix #1290 ?

gouthamve pushed a commit to gouthamve/cortex that referenced this pull request Apr 23, 2020
…luster with RF 3 and only 3 nodes. (cortexproject#2503)

* Use a real ring with mock KV when testing distributor.

This is to teast out errors in the replication logic.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Extend distributor test to cover the case RF=3 with 2 ingesters.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Ensure ring correctly calculates the number of allowed failures when RF=3 and #ingesters=2.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Add changelog and review feedback.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Refactor some distributor tests to try and get them to pass.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Speed up tests but polling more frequently.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Fix same bug on the write path.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Tidy up the distributor tests.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Make test correctly handle RF3 and 2 ingesters - previously was succeeding when it shouldn't

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Update pkg/ring/ring.go

Co-Authored-By: Jacob Lisi <jacob.t.lisi@gmail.com>
Signed-off-by: Tom Wilkie <tom@grafana.com>

Co-authored-by: Jacob Lisi <jacob.t.lisi@gmail.com>
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
gouthamve added a commit that referenced this pull request Apr 23, 2020
…luster with RF 3 and only 3 nodes. (#2503) (#2512)

* Use a real ring with mock KV when testing distributor.

This is to teast out errors in the replication logic.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Extend distributor test to cover the case RF=3 with 2 ingesters.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Ensure ring correctly calculates the number of allowed failures when RF=3 and #ingesters=2.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Add changelog and review feedback.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Refactor some distributor tests to try and get them to pass.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Speed up tests but polling more frequently.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Fix same bug on the write path.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Tidy up the distributor tests.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Make test correctly handle RF3 and 2 ingesters - previously was succeeding when it shouldn't

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Update pkg/ring/ring.go

Co-Authored-By: Jacob Lisi <jacob.t.lisi@gmail.com>
Signed-off-by: Tom Wilkie <tom@grafana.com>

Co-authored-by: Jacob Lisi <jacob.t.lisi@gmail.com>
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

Co-authored-by: Tom Wilkie <tomwilkie@users.noreply.github.com>
Co-authored-by: Jacob Lisi <jacob.t.lisi@gmail.com>
@bboreham
Copy link
Contributor

bboreham commented Apr 23, 2020

I can't see that this fixes #1290.

AFAICS r.strategy.ShouldExtendReplicaSet() during a rolling update will still end up with 4 ingesters being passed in to Filter() which then computes minSuccess=3, so we cannot tolerate a single failure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 participants