Skip to content

[memberlist] Add support to join memberlist via SRV records #2788

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 2, 2020

Conversation

periklis
Copy link
Contributor

@periklis periklis commented Jun 25, 2020

What this PR does:

This PR enables lookup for members to join via SRV records.

Which issue(s) this PR fixes:
Follow-up on discussion thread here

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

cc @pstibrany

@periklis periklis force-pushed the memberlist-srv-records branch from 1b6dace to 5457d8f Compare June 25, 2020 07:32
@@ -475,6 +487,28 @@ func (m *KV) joinMembersOnStartup(ctx context.Context, members []string) error {
return lastErr
}

// Provides a simple SRV lookup to join a memberlist cluster w/o knowning members' addresses upfront.
// Note: Overwrites flag `memberlist.join` if provided in addition to `memberlist.join-service`.
func (m *KV) lookupMembers(srv string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we could reuse https://cortexmetrics.io/docs/configuration/arguments/#dns-service-discovery instead of having another way to configure service discovery. The DNS service discovery we're currently using is based on Thanos one: you can see an example of its usage at pkg/querier/blocks_store_balanced_set.go.

Would it make sense to use it for gossip too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pracucci & @pstibrany Giving this a thought for the last couple of days, I believe it is a good idea to expand memberlist joins to use the same DNS service discovery in addition to our joinMembers flag. Although a simple net.lookupSRV is enough the use cases provided by Thanos' DNS provider make sense also for memberlists:

  • Joining a cluster means resolving the addresses of all members upfront. However, leaving the resolution mechanism to a third-party dependency is good from a separation of concerns perspective and adds testability. OTOH my simple net.lookupSRV would need a function pointer handle to make it testable, that's not a big deal though.
  • Re-Joining the cluster - e.g. periodically - should work the same as joining anew with the only difference of re-establishing connections.

AFAICS, the DNS service discovery mechanism is free of any caching of members and in addition to that the ultimate list of which members to connect to is maintained on memberlist side. In addition, reading through the hashicorp lib I don't see a blocker regarding how to resolve member addresses. WDYT?

@pracucci FWIW, using the DNS service disovery is not bound to DNS caching timeouts or so like that would demand to rejoin the cluster based on these timeouts right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS, the DNS service discovery mechanism is free of any caching of members and in addition to that the ultimate list of which members to connect to is maintained on memberlist side. In addition, reading through the hashicorp lib I don't see a blocker regarding how to resolve member addresses. WDYT?

The way the Thanos DNS provider works is that Addresses() returns the result of the last Resolve(). For this specific use case you could just call Resolve() and then Address() in sequence.

FWIW, using the DNS service disovery is not bound to DNS caching timeouts or so like that would demand to rejoin the cluster based on these timeouts right?

I'm not sure about the question. The DNS service discovery just uses the Go's DNS Lookup under the hood, so there's real difference compared to your current implementation, except that it would be more generic and we would reuse the same mechanism we have.

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern question is rather a concern of the sort: Using the thanos DNS provider do I need to implement re-joins based on DNS TTLs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't bring TTLs into the question here.

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic I'd like to see is that we use dns.NewProvider, pass configured JoinMembers to (*Provider).Resolve method, and use the result ((*Provider).Addresses()) to call memberlist.Join method. If RejoinInterval in memberlist client is configured, we would periodically repeat the process.

For this to work we perhaps don't even need extra options, as it should be backwards compatible.

What do you think, what are possible problems with such approach?

@@ -134,6 +135,7 @@ type KVConfig struct {
DeadNodeReclaimTime time.Duration `yaml:"dead_node_reclaim_time"`

// List of members to join
JoinService string `yaml:"join_service"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to use this as a boolean flag to enable SRV records lookup for names in join_members?

@@ -475,6 +487,28 @@ func (m *KV) joinMembersOnStartup(ctx context.Context, members []string) error {
return lastErr
}

// Provides a simple SRV lookup to join a memberlist cluster w/o knowning members' addresses upfront.
// Note: Overwrites flag `memberlist.join` if provided in addition to `memberlist.join-service`.
func (m *KV) lookupMembers(srv string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't bring TTLs into the question here.

@periklis
Copy link
Contributor Author

I think the logic I'd like to see is that we use dns.NewProvider, pass configured JoinMembers to (*Provider).Resolve method, and use the result ((*Provider).Addresses()) to call memberlist.Join method. If RejoinInterval in memberlist client is configured, we would periodically repeat the process.

For this to work we perhaps don't even need extra options, as it should be backwards compatible.

What do you think, what are possible problems with such approach?

Indeed the boolean flag with the present JoinMembers flag will be sufficient and backward compatible.

@pstibrany
Copy link
Contributor

Indeed the boolean flag with the present JoinMembers flag will be sufficient and backward compatible.

Maybe we can even do without it (I suggested bool flag before writing my other comment).

@periklis
Copy link
Contributor Author

Indeed the boolean flag with the present JoinMembers flag will be sufficient and backward compatible.

Maybe we can even do without it (I suggested bool flag before writing my other comment).

How would that work? Scan for dns(srv|srvnoa)+ prefixes in joinMembers? Try DNS and fallback on err?

@pstibrany
Copy link
Contributor

Maybe we can even do without it (I suggested bool flag before writing my other comment).

How would that work? Scan for dns(srv|srvnoa)+ prefixes in joinMembers? Try DNS and fallback on err?

Yes, or just look for ‘+’, which is not allowed in host names. If there is no +, we pass hostnames to memberlist directly. If there is, we do our own resolution.

@periklis
Copy link
Contributor Author

Good! I think we have a plan. I will amend this PR on Monday.

@periklis periklis force-pushed the memberlist-srv-records branch from 5457d8f to 85c3540 Compare June 29, 2020 08:54
@periklis
Copy link
Contributor Author

@pracucci @pstibrany PTAL the last version is based on the DNS service discovery mechanism. Btw. I am trying to figure out how to write a proper integration test for this.

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @periklis for working on this! The new design LGTM. I left few comments about implementation details.

CHANGELOG.md Outdated
@@ -159,6 +159,7 @@
* `cortex_querier_storegateway_refetches_per_query`
* [ENHANCEMENT] Delete requests can now be canceled #2555
* [ENHANCEMENT] Table manager can now provision tables for delete store #2546
* [ENHANCEMENT] Memberlist members can join cluster via SRV records. #2788
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this to the "master / unreleased" section above, please?

@@ -377,6 +382,10 @@ func (m *KV) starting(_ context.Context) error {
var errFailedToJoinCluster = errors.New("failed to join memberlist cluster on startup")

func (m *KV) running(ctx context.Context) error {
// Lookup SRV records for given service to discover members.
// Overwrite any given `memberlist.join` members if any found via SRV lookup.
m.discoverMembers()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we also call it before the call to m.memberlist.Join(m.cfg.JoinMembers) done in the loop below?

@@ -475,6 +484,35 @@ func (m *KV) joinMembersOnStartup(ctx context.Context, members []string) error {
return lastErr
}

// Provides a dns-based member disovery to join a memberlist cluster w/o knowning members' addresses upfront.
func (m *KV) discoverMembers() {
if len(m.cfg.JoinMembers) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can circuit break:

if len(m.cfg.JoinMembers) == 0 {
     return
}
Comment on lines 494 to 496
if i == 0 {
util.WarnExperimentalUse("DNS-based memberlist service discovery")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to mark it as experimental. We can get rid of it.

@@ -164,7 +165,7 @@ func (cfg *KVConfig) RegisterFlags(f *flag.FlagSet, prefix string) {
f.BoolVar(&cfg.RandomizeNodeName, prefix+"memberlist.randomize-node-name", true, "Add random suffix to the node name.")
f.DurationVar(&cfg.StreamTimeout, prefix+"memberlist.stream-timeout", 0, "The timeout for establishing a connection with a remote node, and for read/write operations. Uses memberlist LAN defaults if 0.")
f.IntVar(&cfg.RetransmitMult, prefix+"memberlist.retransmit-factor", 0, "Multiplication factor used when sending out messages (factor * log(N+1)).")
f.Var(&cfg.JoinMembers, prefix+"memberlist.join", "Other cluster members to join. Can be specified multiple times.")
f.Var(&cfg.JoinMembers, prefix+"memberlist.join", "Other cluster members to join. Can be specified multiple times. Or:\nEXPERIMENTAL: List in DNS Service Discovery format: https://cortexmetrics.io/docs/configuration/arguments/#dns-service-discovery")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to mark it experimental.

Suggested change
f.Var(&cfg.JoinMembers, prefix+"memberlist.join", "Other cluster members to join. Can be specified multiple times. Or:\nEXPERIMENTAL: List in DNS Service Discovery format: https://cortexmetrics.io/docs/configuration/arguments/#dns-service-discovery")
f.Var(&cfg.JoinMembers, prefix+"memberlist.join", "Other cluster members to join. It can be an IP, hostname or an entry specified in the DNS Service Discovery format (see https://cortexmetrics.io/docs/configuration/arguments/#dns-service-discovery for more details).")
util.WarnExperimentalUse("DNS-based memberlist service discovery")
}

err := m.provider.Resolve(context.TODO(), []string{member})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context.Background() instead of context.TODO()

@@ -475,6 +484,35 @@ func (m *KV) joinMembersOnStartup(ctx context.Context, members []string) error {
return lastErr
}

// Provides a dns-based member disovery to join a memberlist cluster w/o knowning members' addresses upfront.
func (m *KV) discoverMembers() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to not manipulate the config. I would suggest to pick in input m.cfg.JoinMembers and return a list of addresses and then passing these addresses to the m.joinMembersOnStartup() and m.memberlist.Join() instead of m.cfg.JoinMembers.

@periklis periklis force-pushed the memberlist-srv-records branch from 85c3540 to a086636 Compare June 30, 2020 14:56
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, left some small comments.

Comment on lines 385 to 388
// Lookup SRV records for given service to discover members.
// Overwrite any given `memberlist.join` members if any found via SRV lookup.
members := m.discoverMembers(m.cfg.JoinMembers)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be inside if len(m.cfg.JoinMembers) > 0 condition?

for _, member := range members {
if strings.Contains(member, "+") {

err := m.provider.Resolve(context.Background(), []string{member})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we resolve all "+" members at once, instead of one by one?

@periklis periklis force-pushed the memberlist-srv-records branch from a086636 to 98f117b Compare July 1, 2020 10:55
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @periklis for patiently addressing all comments. Very much appreciated! I have one last comment (I just realized it) and then we should be good to go to me.

func NewKV(cfg KVConfig) *KV {
cfg.TCPTransport.MetricsRegisterer = cfg.MetricsRegisterer
cfg.TCPTransport.MetricsNamespace = cfg.MetricsNamespace

mlkv := &KV{
cfg: cfg,
provider: dns.NewProvider(util.Logger, cfg.MetricsRegisterer, dns.GolangResolverType),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Prometheus registerer passed to the DNS provider should be something like this (to keep it consistent with other components and avoid "duplicated metrics registration"):

dnsProviderRegisterer := prometheus.WrapRegistererWithPrefix("cortex_", prometheus.WrapRegistererWith(prometheus.Labels{
		"name": "memberlist",
	}, cfg.MetricsRegisterer))

@pstibrany Make sense to you? Is it safe to use an hardcoded memberlist name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IC, the cfg.MetricsRegisterer is only the defaultRegisterer up in the stack pkg/cortex/modules.go for memberlist. It makes sense to me to separate the provider metrics per component.

@periklis periklis force-pushed the memberlist-srv-records branch from 98f117b to 14110ae Compare July 1, 2020 11:16
@periklis
Copy link
Contributor Author

periklis commented Jul 1, 2020

Thanks @periklis for patiently addressing all comments. Very much appreciated! I have one last comment (I just realized it) and then we should be good to go to me.

You are welcome! Although, I believe I should have catched most of them, if I put more focus time on this PR ;)

@periklis periklis force-pushed the memberlist-srv-records branch from 14110ae to 550ab54 Compare July 1, 2020 11:19
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Periklis for this work. I’ve left few small comments, but it’s almost ready.

func NewKV(cfg KVConfig) *KV {
cfg.TCPTransport.MetricsRegisterer = cfg.MetricsRegisterer
cfg.TCPTransport.MetricsNamespace = cfg.MetricsNamespace

mr := prometheus.WrapRegistererWithPrefix("cortex_", prometheus.WrapRegistererWith(prometheus.Labels{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We can use extprom.WrapRegistererWith from Thanos to avoid wrapping cfg.MetricsRegisterer when it is nil.

@@ -201,6 +202,9 @@ type KV struct {

cfg KVConfig

// dns discovery provider
provider *dns.Provider
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I would call this dnsProvider or resolver. provider sounds too generic,.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, i like to read members always with the given type without duplications between type and name, e.g. provider of type dns.Provider. I admit there is a small duplication here but it is prefixed with dns. so for me not an issue. Same for resolver, it duplicates method names: resolver.resolve(members). Later reminds me on the old Java dictatorship everything is a class, whereas if I can build something as a simple function why calling it via a class with the same name.

Calling out provider of type dns.Provider resolve(members) sounds ok. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal, my problem is that later in the code, m.provider seems too generic -- it's not clear what it actually provides without checking the type.

@@ -475,6 +489,33 @@ func (m *KV) joinMembersOnStartup(ctx context.Context, members []string) error {
return lastErr
}

// Provides a dns-based member disovery to join a memberlist cluster w/o knowning members' addresses upfront.
func (m *KV) discoverMembers(members []string) []string {
if len(m.cfg.JoinMembers) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be len(members).

@periklis
Copy link
Contributor Author

periklis commented Jul 2, 2020

@pstibrany Furthermore I missed to pass the context from running to discoverMembers.

@periklis periklis force-pushed the memberlist-srv-records branch from 550ab54 to 841b338 Compare July 2, 2020 07:01
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!


err := m.provider.Resolve(ctx, resolve)
if err != nil {
level.Error(util.Logger).Log("msg", "failed to resolve members from SRV records", "addrs", strings.Join(resolve, ","))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd drop from SRV records part from the message, resolver may support different ways of resolving.

This PR enables dns based discovery for members to join via SRV
records. Member lookup is only supported on memberlist initial
creation and not part of any periodic re-joins.

Signed-off-by: Periklis Tsirakidis <periklis@redhat.com>
@periklis periklis force-pushed the memberlist-srv-records branch from 841b338 to 474af35 Compare July 2, 2020 07:10
@pstibrany pstibrany merged commit 0ea5a8b into cortexproject:master Jul 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 participants