Add repair for checkpoint/WAL #2105

Merged: 5 commits merged into cortexproject:master from wal-repair on Mar 5, 2020

Conversation

@codesome (Contributor) commented Feb 10, 2020

@gouthamve suggested that I add this to the release, and that v0.6.2 would be cut with it.

Things changed during checkpointing: if the old checkpoint was X and the new one is Y, we used to delete the WAL segments before Y. In this PR, I have changed it to delete the WAL segments before X, so that we can still recover from checkpoint X even if checkpoint Y gets corrupted.
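
To make the change concrete, here is a minimal sketch (not the PR's actual code) of the new truncation rule; walTruncator and truncateAfterCheckpoint are illustrative names, and the only point is which segment index the truncation uses:

package walrepairsketch

// walTruncator stands in for the WAL implementation; Truncate drops all
// WAL segments with an index lower than the given one.
type walTruncator interface {
	Truncate(segmentIdx int) error
}

// truncateAfterCheckpoint is called once the new checkpoint Y has been written
// successfully. prevIdx is the segment index of the previous checkpoint X,
// newIdx the corresponding index for Y.
func truncateAfterCheckpoint(w walTruncator, prevIdx, newIdx int) error {
	// Before this PR (roughly): w.Truncate(newIdx), i.e. the segments before Y
	// were deleted, so a corrupt Y left nothing to recover from.
	// After this PR: only drop segments before X, keeping X usable as a
	// fallback if Y turns out to be corrupt.
	return w.Truncate(prevIdx)
}

The trade-off, under this reading, is that roughly one extra checkpoint's worth of WAL segments stays on disk between checkpoints in exchange for a usable fallback.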

How the replay works now with repair (see the sketch after this list):

  1. Attempt to replay checkpoint Y. If checkpoint Y is corrupt, delete it and recover from checkpoint X instead.
  2. If checkpoint X is also corrupt, it is a hard failure for now.
  3. Depending on which checkpoint was recovered, we start replaying the WAL from either segment X or segment Y.
  4. If the WAL is corrupt, we do the usual Prometheus repair, discarding everything after the corrupt record. If the repair fails, it is again a hard error.
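
A rough sketch of that replay order (ignoring the 0-or-1-checkpoint cases discussed later in this thread); every name below (replayCheckpoint, deleteCheckpoint, replayWALFrom, repairWAL) is an illustrative placeholder, not a function from this PR:

package walrepairsketch

// walRecovery wires together the pieces of the replay; every field is a
// placeholder for logic that lives elsewhere in the real code, not an API
// introduced by this PR.
type walRecovery struct {
	replayCheckpoint func(dir string) error       // load series records from a checkpoint directory
	deleteCheckpoint func(dir string) error       // remove a corrupt checkpoint from disk
	replayWALFrom    func(segment int) error      // replay WAL segments >= segment
	repairWAL        func(corruptErr error) error // Prometheus-style repair: truncate after the corrupt record
}

func (r *walRecovery) run(checkpointX, checkpointY string, segX, segY int) error {
	startSeg := segY

	if err := r.replayCheckpoint(checkpointY); err != nil {
		// 1. Checkpoint Y is corrupt: delete it and fall back to checkpoint X.
		if err := r.deleteCheckpoint(checkpointY); err != nil {
			return err
		}
		if err := r.replayCheckpoint(checkpointX); err != nil {
			// 2. Checkpoint X is corrupt too: hard failure.
			return err
		}
		startSeg = segX
	}

	// 3. Replay the WAL starting at the segment of whichever checkpoint was used.
	if err := r.replayWALFrom(startSeg); err != nil {
		// 4. Try the usual repair; if that also fails, it is again a hard error.
		return r.repairWAL(err)
	}
	return nil
}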

I will be adding a test for repair now.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
@codesome codesome marked this pull request as ready for review February 11, 2020 11:34
@codesome codesome force-pushed the wal-repair branch 2 times, most recently from 76c6088 to fbfc3d7 Compare February 12, 2020 11:23
@codesome codesome changed the base branch from release-0.6 to master February 12, 2020 11:23
@codesome (Contributor, Author):

I have rebased this PR and it is now pointing to master (following the discussion on Slack).

@codesome codesome force-pushed the wal-repair branch 2 times, most recently from a07998b to 930e545 Compare February 13, 2020 07:37
@codesome (Contributor, Author):

Following @pracucci's comments about having only one or no checkpoint on disk, and after fixing some code, I have now added the cases of 0 and 1 checkpoints to the unit test.

@gouthamve (Contributor) left a comment:


LGTM with a nit. Will approve after the comments are addressed.

Ganesh Vernekar added 2 commits March 2, 2020 20:19
@codesome (Contributor, Author) commented Mar 2, 2020:

@gouthamve done

sandeepsukhani added a commit to grafana/cortex that referenced this pull request Mar 3, 2020
@pracucci (Contributor) left a comment:


LGTM! It's a bit tricky to me, so my personal confidence level is not super high, but all in all LGTM.

stateCache []map[string]*userState, seriesCache []map[string]map[uint64]*memorySeries) (*userStates, int, error) {

	// Use a local userStates, so we don't need to worry about locking.
	userStates := newUserStates(ingester.limiter, ingester.cfg, ingester.metrics)
Contributor:

Isn't this going to affect real ingester metrics? Is that a problem?

Contributor Author:

I will check on that, good catch.

Contributor Author:

Cross-checked with how transfer handles them:

// TransferChunks receives all the chunks from another ingester.
func (i *Ingester) TransferChunks(stream client.Ingester_TransferChunksServer) error {
	fromIngesterID := ""
	seriesReceived := 0

	xfer := func() error {
		userStates := newUserStates(i.limiter, i.cfg, i.metrics)

		for {
			wireSeries, err := stream.Recv()
			if err == io.EOF {
				break
			}
			if err != nil {
				return errors.Wrap(err, "TransferChunks: Recv")
			}

			// We can't send "extra" fields with a streaming call, so we repeat
			// wireSeries.FromIngesterId and assume it is the same every time
			// round this loop.
			if fromIngesterID == "" {
				fromIngesterID = wireSeries.FromIngesterId
				level.Info(util.Logger).Log("msg", "processing TransferChunks request", "from_ingester", fromIngesterID)

				// Before transfer, make sure 'from' ingester is in correct state to call ClaimTokensFor later
				err := i.checkFromIngesterIsInLeavingState(stream.Context(), fromIngesterID)
				if err != nil {
					return errors.Wrap(err, "TransferChunks: checkFromIngesterIsInLeavingState")
				}
			}
			descs, err := fromWireChunks(wireSeries.Chunks)
			if err != nil {
				return errors.Wrap(err, "TransferChunks: fromWireChunks")
			}

			state, fp, series, err := userStates.getOrCreateSeries(stream.Context(), wireSeries.UserId, wireSeries.Labels, nil)
			if err != nil {
				return errors.Wrapf(err, "TransferChunks: getOrCreateSeries: user %s series %s", wireSeries.UserId, wireSeries.Labels)
			}
			prevNumChunks := len(series.chunkDescs)

			err = series.setChunks(descs)
			state.fpLocker.Unlock(fp) // acquired in getOrCreateSeries
			if err != nil {
				return errors.Wrapf(err, "TransferChunks: setChunks: user %s series %s", wireSeries.UserId, wireSeries.Labels)
			}

			seriesReceived++
			memoryChunks.Add(float64(len(series.chunkDescs) - prevNumChunks))
			receivedChunks.Add(float64(len(descs)))
		}

		if seriesReceived == 0 {
			level.Error(util.Logger).Log("msg", "received TransferChunks request with no series", "from_ingester", fromIngesterID)
			return fmt.Errorf("TransferChunks: no series")
		}

		if fromIngesterID == "" {
			level.Error(util.Logger).Log("msg", "received TransferChunks request with no ID from ingester")
			return fmt.Errorf("no ingester id")
		}

		if err := i.lifecycler.ClaimTokensFor(stream.Context(), fromIngesterID); err != nil {
			return errors.Wrap(err, "TransferChunks: ClaimTokensFor")
		}

		i.userStatesMtx.Lock()
		defer i.userStatesMtx.Unlock()
		i.userStates = userStates

		return nil
	}

	if err := i.transfer(stream.Context(), xfer); err != nil {
		return err
	}

	// Close the stream last, as this is what tells the "from" ingester that
	// it's OK to shut down.
	if err := stream.SendAndClose(&client.TransferChunksResponse{}); err != nil {
		level.Error(util.Logger).Log("msg", "Error closing TransferChunks stream", "from_ingester", fromIngesterID, "err", err)
		return err
	}
	level.Info(util.Logger).Log("msg", "Successfully transferred chunks", "from_ingester", fromIngesterID, "series_received", seriesReceived)

	return nil
}

Looks like it's the same here too (including the memory chunks metrics). So I guess all good here.

@gouthamve (Contributor):

Thanks Ganesh!

@gouthamve gouthamve merged commit 28362da into cortexproject:master Mar 5, 2020
@codesome codesome deleted the wal-repair branch March 5, 2020 10:16
@sandeepsukhani sandeepsukhani mentioned this pull request Mar 5, 2020