
Limit concurrent TLS handshakes#136386

Merged
DaveCTurner merged 17 commits into elastic:main from DaveCTurner:2025/10/10/tls-handshake-throttle
Oct 22, 2025

Conversation


@DaveCTurner DaveCTurner commented Oct 10, 2025

Introduces a per-event-loop limit on the number of TLS handshakes
running at once. When at the limit, subsequent TLS handshakes are
delayed and processed in LIFO order.

Closes ES-12457
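
The per-event-loop LIFO idea can be sketched outside Netty as a tiny admission queue (a hypothetical illustration with made-up names like `HandshakeThrottle`; this is not code from the actual change). New handshakes start immediately while under the limit; beyond it they are pushed onto a stack, so when a slot frees up the most recent arrival resumes first — oldest waiters are the ones most likely to have already timed out on the client side.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a per-event-loop TLS handshake throttle.
// Runs on a single event-loop thread, so no synchronization is needed.
class HandshakeThrottle {
    private final int maxInProgress;
    private int inProgress;
    // Delayed handshakes are processed in LIFO order: the newest waiter is
    // resumed first, since the oldest waiters are the most likely to have
    // already timed out on the client side.
    private final Deque<Runnable> delayed = new ArrayDeque<>();

    HandshakeThrottle(int maxInProgress) {
        this.maxInProgress = maxInProgress;
    }

    void startHandshake(Runnable handshake) {
        if (inProgress < maxInProgress) {
            inProgress++;
            handshake.run();
        } else {
            delayed.push(handshake); // LIFO: push onto the front of the deque
        }
    }

    void onHandshakeComplete() {
        Runnable next = delayed.poll(); // pops the most recently delayed handshake
        if (next != null) {
            next.run(); // the freed slot transfers directly to the next handshake
        } else {
            inProgress--;
        }
    }
}
```

With a limit of 1, starting handshakes a, b, c and completing two of them runs them in the order a, c, b — the LIFO behavior described above.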

@DaveCTurner DaveCTurner requested a review from mhl-b October 10, 2025 12:54
@DaveCTurner DaveCTurner added the >enhancement, :Distributed/Network (Http and internode communication implementations), and v9.3.0 labels Oct 10, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) label Oct 10, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine (Collaborator)

Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner DaveCTurner removed the request for review from mhl-b October 10, 2025 13:04
@DaveCTurner (Contributor, Author)

Oh well CI didn't like this very much at all :/

@DaveCTurner DaveCTurner marked this pull request as draft October 10, 2025 13:05
@mhl-b (Contributor) commented Oct 10, 2025

I think using a throttler per event loop is smart.

But an absolute number of in-flight handshakes looks hard to tune. Ultimately we want handshakes not to exhaust the CPU (stating the obvious), so the right number of handshakes derives from CPU usage. Why not use a metric that is a better proxy for CPU and requires less tuning? For example, how much time we spent handshaking in the last N seconds.

I think there are too many factors that can affect the number of connections: hardware, JDK version, CPU used for other needs, etc.
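
The time-based proxy suggested here could be sketched as a sliding window over recent handshake durations (hypothetical names and API; just an illustration of the idea, not anything in the PR):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the suggested alternative: measure how much time
// was spent handshaking in the trailing window, rather than counting
// in-flight handshakes. A throttler could compare this to a budget.
class HandshakeTimeWindow {
    private record Sample(long endMillis, long durationMillis) {}

    private final long windowMillis;
    private final Deque<Sample> samples = new ArrayDeque<>();

    HandshakeTimeWindow(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    void recordHandshake(long nowMillis, long durationMillis) {
        samples.addLast(new Sample(nowMillis, durationMillis));
    }

    // Total handshake time observed within the trailing window.
    long timeSpentHandshaking(long nowMillis) {
        while (!samples.isEmpty() && samples.peekFirst().endMillis < nowMillis - windowMillis) {
            samples.pollFirst(); // drop samples that fell out of the window
        }
        long total = 0;
        for (Sample s : samples) {
            total += s.durationMillis;
        }
        return total;
    }
}
```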

@DaveCTurner DaveCTurner marked this pull request as ready for review October 13, 2025 10:52
@DaveCTurner (Contributor, Author)

> Ultimately we want handshakes not to exhaust the CPU (stating the obvious)

It's not obvious, and indeed that's not actually the goal at all. We want to avoid the situation where we're doing work on TLS handshakes that ultimately time out, because that work is a pointless waste of CPU. I added a comment in ead6075 to describe things in more detail, and to give some justification behind the defaults I settled on.

You're right that the actual numbers depend on all sorts of factors, but a maximum handshake rate of 400Hz seems like a reasonable starting point. I can see some value in auto-tuning: these defaults assume we can achieve a handshake rate of at least 100Hz, but if the CPU is more than 75% busy with other things then that isn't a valid assumption. With a handshake rate lower than 100Hz we will start to hit client timeouts on the 1000 in_progress handshakes, which means we're back into wasting-CPU territory.

My hunch is that this isn't a case that happens in practice (ignoring bugs that just clog up the transport worker thread, which are a separate thing), and it adds quite some complexity, so I'd rather wait until we see evidence it's needed before doing it.

I could, however, be convinced that we want a different split than 1000+1000 by default. For instance, if we went with a 200+1800 split then we'd be able to cope with a 20Hz average handshake rate without any in_progress handshakes timing out.
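
The sizing arithmetic can be checked directly. Assuming (hypothetically; the timeout is not quoted in this thread, but it is consistent with the figures above) a client-side handshake timeout of roughly 10 seconds, a backlog of in_progress handshakes drains without timeouts only while backlog size divided by the achievable handshake rate stays under that timeout:

```java
// Hypothetical back-of-envelope check of the in_progress sizing discussion.
// The ~10s client-side timeout is an assumption, not a number from this PR.
class HandshakeBacklogMath {
    // Seconds needed to work through a full backlog of in-progress handshakes
    // at a given sustained handshake rate.
    static double drainSeconds(int inProgressLimit, double handshakeRateHz) {
        return inProgressLimit / handshakeRateHz;
    }
}
```

Both splits sit at the same edge: 1000 handshakes at 100Hz and 200 handshakes at 20Hz each drain in 10 seconds, which is why the smaller in_progress pool tolerates a proportionally lower sustained rate.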

@DaveCTurner DaveCTurner requested a review from mhl-b October 13, 2025 11:11
@mhl-b (Contributor) commented Oct 14, 2025

No questions about the implementation and reasoning; some nits, but those can wait. The 200+1800 split sounds good. But I'm still not very clear on how this solution solves our existing problems.

The whole design is plausible based on two assumptions: 2.5ms of CPU time per handshake, and transport threads being scheduled more or less evenly across CPUs. Both are reasonable when the system is not under stress, but neither is a strong guarantee. CPU time per handshake varies a lot as CPU usage approaches its limit; it would be interesting to see how you chose 2.5ms per handshake. Also, our over-threading can lead to multiple transport threads running on a single core, so a throttler per thread might not be as good as a global throttler, especially on smaller nodes (2 CPUs, ~100 threads).

My thinking about requirements: there are two parts, prevention and reaction. We try not to over-commit, but over-commitment can still happen due to lots of external factors, so we need to react when overload occurs. I would expect us to:

  • allow bursts of connections (burst credit), for example at node start-up
  • throttle a consistently high connection rate (once burst credit is exhausted)
  • detect overload and throttle harder
  • apply more scrutiny on smaller nodes (2/4 CPUs), which are more prone to handshake storms, and relax it on large nodes

So exposing the number of connections as a setting might not be the right move. There should be a component that can tell how many connections are OK right now, looking at its surroundings, and change that number dynamically. I'm afraid there is no simple solution (a single knob) that will work across our range of cases.

@DaveCTurner (Contributor, Author)

I still think you're trying to solve a different problem here. We're not trying to shed load in order to free up CPU for other work. We're only trying to avoid doing work that we can tell in advance will turn out to be pointless.

@mhl-b mhl-b left a comment

Waiting for ClientHello and the per-thread throttler smell like YAGNI. They add complexity to the code and make it harder to read, but the benefits are not obvious unless measured in a real case. My hunch is that if we simply admit 400 (or 200) handshakes * transport_workers and reject the others, that would be good enough; accepted ones will be queued anyway. That would remove the whole action-listener dance: a simple atomic counter decides whether or not to accept a connection. I understand the theory behind the current proposal, thanks for explaining on the call, but I think a simple counter would work about the same.

And it's easier to think about and tweak: just a count of how many handshakes are in flight. Too many? Clients see timeouts? CPU is high? Reduce the number. Too many rejections while CPU is running free? Increase the number.
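
The simpler alternative suggested here might look like the following (a sketch of the suggestion only, with hypothetical names; not the approach the PR merged): a single shared counter that admits or rejects at accept time, with no delayed queue.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the suggested alternative: a global admit/reject counter with
// no delayed queue. Hypothetical illustration, not code from this PR.
class HandshakeAdmission {
    private final int limit;
    private final AtomicInteger inFlight = new AtomicInteger();

    HandshakeAdmission(int limit) {
        this.limit = limit;
    }

    // Returns false when at the limit; the caller would then close the
    // connection instead of starting a TLS handshake.
    boolean tryAdmit() {
        while (true) {
            int current = inFlight.get();
            if (current >= limit) {
                return false;
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Called when a handshake completes (successfully or not).
    void release() {
        inFlight.decrementAndGet();
    }
}
```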

acceptChannelPredicate.setBoundAddress(boundAddress());
}

tlsHandshakeThrottleManager.start();
mhl-b (Contributor):

Should this start before binding the server? Otherwise it might trip some tests in future where a connection is accepted before the manager starts.

DaveCTurner (Contributor, Author):

Starting the manager is just about registering settings handlers and metrics, and the tests all wait for everything to start before opening any connections, so I don't think it matters in practice. But it makes sense conceptually, so I moved it in 6f1c4d3.

* {@link #handleHandshakeCompletion} to trigger either the processing of another handshake or a decrement of
* {@link #inProgressHandshakesCount}.
*/
private static class HandshakeCompletionWatcher extends SimpleUserEventChannelHandler<SslCompletionEvent> {
mhl-b (Contributor):

SslHandshakeCompletionEvent is more precise; SslCompletionEvent includes close_notify too. A follow-up question from chat: does SslHandshakeCompletionEvent wait for another round trip, or does it fire after ClientHello?

DaveCTurner (Contributor, Author):

Thanks, yes, that does seem slightly better. Presumably someone could otherwise trick us into considering the handshake complete early by fabricating a pre-handshake close_notify?

mhl-b (Contributor):

Rather, it's future-proofing. It works fine because the handler is removed, but it's a loose relationship.

protected void eventReceived(ChannelHandlerContext ctx, SslCompletionEvent evt) {
    ctx.pipeline().remove(HandshakeCompletionWatcher.this);
    if (evt.isSuccess()) {
        handshakeCompletePromise.onResponse(null);
mhl-b (Contributor):

Should we remove this handler from the pipeline too?

DaveCTurner (Contributor, Author):

Isn't that what ctx.pipeline().remove(HandshakeCompletionWatcher.this); does?

mhl-b (Contributor):

Apparently, haha.

* Promise which accumulates the messages received until we receive a full handshake. Completed when we receive a full
* handshake, at which point all the delayed messages are pushed down the pipeline for actual processing.
*/
private final SubscribableListener<Void> handshakeStartedPromise = new SubscribableListener<>();
mhl-b (Contributor):

I would prefer that the throttle manager return a handshakeStartFuture which in turn provides the handshakeCompletePromise, so it's clear that when the start fails there is nothing to complete. Something like:

Future<Promise<HandshakeComplete>> canStartFuture = manager.register();
canStartFuture.addListener(f -> {
    if (f.isFailure()) {
        closeChannel();
    } else {
        promise = f.get();
        channel.pipeline.add(new SslHandler()); // don't add handler until ready to proceed
        channel.pipeline.addAfter(ssl, new HandshakeCompleteHandler(promise));
    }
});
DaveCTurner (Contributor, Author):

Hmm, yeah, it's a little more Netty-esque to do it that way, but then should we be delaying the creation of the whole rest of the pipeline too? It would be a bit weird to come back and just insert the SslHandler in the middle of a mostly pre-existing pipeline. Likewise, should we wait for the AcceptChannelHandler to accept the channel before setting anything else up?

mhl-b (Contributor):

I don't see a problem with adding/removing handlers after the init phase. Ideally we should drop the connection before even constructing the channel, which doesn't work with the current solution.

@mhl-b mhl-b left a comment

LGTM. My previous comment about using a counter might not work, and LIFO is a good choice here. Left another comment about max_in_progress.

*/
public static final Setting<Integer> SETTING_HTTP_NETTY_TLS_HANDSHAKES_MAX_IN_PROGRESS = intSetting(
"http.netty.tls_handshakes.max_in_progress",
1000, // See [NOTE: TLS Handshake Throttling] above
mhl-b (Contributor):

Can max_in_progress be greater than the number of transport workers? If we assume that a ClientHello produces a successful handshake event without doing a round trip, then won't every in-progress handshake be completed within the same thread execution? Maybe use only max_delayed, without max_in_progress?

DaveCTurner (Contributor, Author):

This is the max per transport worker.

mhl-b (Contributor):

Well, same question: can it be larger than 1 at any time for a worker?

Comment on lines +360 to +365
// watch for the completion of this channel's initial handshake at which point we can release one for another
// channel
.addLast(
    "initial-tls-handshake-completion-watcher",
    tlsHandshakeThrottle.newHandshakeCompletionWatcher(handshakeCompletePromise)
);
(Member):

For my education:

  1. Is it possible to use SslHandler#handshakeFuture() for the completion watching, without the need for a separate handler?
  2. Instead of handling the specific ClientHello message, is it possible to accumulate generic bytes when there is no capacity for the TLS handling, and forward them when a handshake from a different channel finishes?

Thanks!

DaveCTurner (Contributor, Author):

  1. In principle yes, but we also have to watch for the channel closing before processing a handshake, and yet we want to remove all this machinery once the handshake completes. This is all surmountable (Netty promises do in fact support removeListener()), but IMO it's not really any simpler than just having a single handler to remove on completion.
  2. If we just forwarded generic bytes without first checking that they comprised a complete ClientHello message, we'd have to count that as an in-flight handshake, and then it would be possible for a malicious client to consume all the available in-flight handshake slots by sending partial handshakes and never completing them.
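
The "complete ClientHello" check discussed above can be illustrated by parsing the 5-byte TLS record header (a simplified sketch, not the PR's implementation; real ClientHellos can span multiple records): content type 0x16 means handshake, and a big-endian 2-byte length field says how many payload bytes must arrive before the record is complete. Until that many bytes have been buffered, the handshake should not count against the in-flight limit.

```java
// Simplified sketch of deciding whether buffered bytes contain a complete
// first TLS handshake record (e.g. a ClientHello). Handshakes spanning
// multiple records are not handled here; this is only an illustration.
class TlsRecordCheck {
    static final int RECORD_HEADER_LENGTH = 5;
    static final int CONTENT_TYPE_HANDSHAKE = 0x16;

    // Returns true if buf[0..len) holds at least one complete handshake record.
    // Throws if the bytes cannot be the start of a TLS handshake at all.
    static boolean isCompleteHandshakeRecord(byte[] buf, int len) {
        if (len >= 1 && (buf[0] & 0xFF) != CONTENT_TYPE_HANDSHAKE) {
            throw new IllegalArgumentException("not a TLS handshake record");
        }
        if (len < RECORD_HEADER_LENGTH) {
            return false; // the record header itself is still incomplete
        }
        // Bytes 3-4 of the header are the big-endian payload length.
        int payloadLength = ((buf[3] & 0xFF) << 8) | (buf[4] & 0xFF);
        return len >= RECORD_HEADER_LENGTH + payloadLength;
    }
}
```

A client that sends only part of a record keeps this returning false, which is exactly why forwarding unchecked bytes would let partial handshakes pin in-flight slots.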
@DaveCTurner DaveCTurner enabled auto-merge (squash) October 22, 2025 12:29
@DaveCTurner DaveCTurner merged commit 3542bce into elastic:main Oct 22, 2025
34 checks passed
fzowl pushed a commit to voyage-ai/elasticsearch that referenced this pull request Nov 3, 2025

Labels

:Distributed/Network (Http and internode communication implementations), >enhancement, Team:Distributed Coordination (obsolete), v9.3.0

4 participants