
Simplify zone awareness #9148

Open

naemono wants to merge 35 commits into elastic:main from naemono:simplify-zone-awareness

Conversation

@naemono (Contributor) commented Feb 18, 2026

Summary

Resolves #3933

This PR introduces spec.nodeSets[].zoneAwareness as first-class API support for Elasticsearch zone awareness in ECK. The goal is to make the common zone-aware deployment path concise and safe by having the operator automatically wire:

  • topology spread constraints (maxSkew=1, whenUnsatisfiable=DoNotSchedule),
  • required node affinity (when zones are specified),
  • downward node labels and ZONE env var injection,
  • node.attr.zone and cluster.routing.allocation.awareness.attributes Elasticsearch config defaults.

This replaces the error-prone manual wiring across config, podTemplate, and the eck.k8s.elastic.co/downward-node-labels annotation while preserving user override behavior where appropriate.
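For reference, the manual equivalent that this feature replaces looked roughly like the following (a sketch based on the existing manual zone-awareness recipe; the topology spread constraint would normally also carry a labelSelector scoped to the cluster and NodeSet, omitted here):

```yaml
metadata:
  annotations:
    eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone"
spec:
  nodeSets:
  - name: default
    count: 3
    config:
      node.attr.zone: ${ZONE}
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: ZONE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['topology.kubernetes.io/zone']
        topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
```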

Minimal example

spec:
  nodeSets:
  - name: default
    count: 3
    zoneAwareness: {}

With explicit zone restriction

spec:
  nodeSets:
  - name: hot
    count: 3
    zoneAwareness:
      zones: ["us-east-1a", "us-east-1b", "us-east-1c"]

To customize maxSkew or whenUnsatisfiable, provide a matching topology spread constraint via podTemplate.spec.topologySpreadConstraints with the same topologyKey. The operator will preserve the user-provided constraint and not inject its own default for that key.
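For example, a matching user constraint that relaxes the defaults might look like this (values illustrative):

```yaml
spec:
  nodeSets:
  - name: default
    count: 3
    zoneAwareness: {}
    podTemplate:
      spec:
        topologySpreadConstraints:
        # same topologyKey as the operator default, so the operator
        # keeps this constraint and does not inject its own
        - maxSkew: 2
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
```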

Risks

  • One-time rolling restart on upgrade for affected NodeSets. Enabling zoneAwareness on any NodeSet changes the StatefulSet Pod template (topology spread constraints, ZONE env var, node affinity rules, node.attr.zone/cluster.routing.allocation.awareness.attributes config defaults). StatefulSet template changes trigger a controlled rolling restart.
  • Cluster-wide config impact in mixed clusters. When any NodeSet enables zoneAwareness, all NodeSets in the cluster receive zone-related Elasticsearch config (node.attr.zone: ${ZONE}, cluster.routing.allocation.awareness.attributes: k8s_node_name,zone) and the ZONE env var. This ensures shard awareness is consistent across the cluster but means non-zoneAware NodeSets also gain these settings on first adoption.

What Changed

API / CRD

  • Added nodeSets[].zoneAwareness field (ZoneAwareness struct) with two optional sub-fields:
    • topologyKey (defaults to topology.kubernetes.io/zone)
    • zones (optional list; +listType=set, MinItems=1 when specified)
  • Added helper methods: TopologyKeyOrDefault(), HasZoneAwareness(), ZoneAwarenessTopologyKey().
  • Extended DownwardNodeLabels() and DownwardNodeLabelsHashInput() to include zone-awareness-derived topology keys alongside the existing annotation-based labels. The hash input preserves the raw annotation value for backward compatibility to avoid spurious rolling restarts on operator upgrade.

Scheduling

  • Zone-aware NodeSets receive:
    • A TopologySpreadConstraint with maxSkew=1, whenUnsatisfiable=DoNotSchedule for the configured topology key. User-provided constraints for the same key are preserved (the operator skips its default).
    • A required node affinity In expression when zones are specified, restricting placement to the listed zone values.
  • Non-zoneAware NodeSets in a cluster where other NodeSets enable zone awareness receive:
    • A required node affinity Exists guard on the cluster topology key, ensuring pods land only on nodes that carry the label (prevents init-container deadlock from missing annotations).
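For the "hot" example above, the resulting Pod template would contain scheduling rules along these lines (a sketch; the operator-managed label selector on the spread constraint is omitted):

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
```

A non-zoneAware NodeSet in the same cluster would instead carry a single match expression on the same key with `operator: Exists` and no spread constraint.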

Elasticsearch config

  • When clusterHasZoneAwareness is true, the operator merges zone awareness defaults (node.attr.zone: ${ZONE}, cluster.routing.allocation.awareness.attributes: k8s_node_name,zone) into the ES config for all NodeSets, before user config is applied. Explicit user config takes precedence.
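Concretely, the merged defaults amount to this Elasticsearch configuration, with user-supplied values for either key winning on conflict:

```yaml
node.attr.zone: ${ZONE}
cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
```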

Init container / scripts ConfigMap

  • DownwardNodeLabels() now includes the zone-awareness topology key. The shared init script waits for all expected annotations (including the zone topology key annotation) before the Elasticsearch container starts. The Downward API volume and operator-side annotation patching provide the values.

PodTemplateBuilder additions

  • WithTopologySpreadConstraints(...): appends constraints only when no existing constraint shares the same topologyKey. Fills in a missing LabelSelector on an existing constraint when the provided constraint has one.
  • WithRequiredNodeAffinityMatchExpressions(...): appends required node selector requirements by key without duplicating. For Exists operator, skips only when an existing expression on the same key already guarantees label existence (In, Exists, Gt, Lt).
  • WithPreferredNodeAffinityMatchExpressions(...): appends preferred scheduling terms by key.

Validation

  • All zone-aware NodeSets must use the same topology key. The operator generates a single shared init script that waits for all topology key annotations. Mixed keys would cause a pod deadlock (the init script waits for an annotation the operator never sets because the node lacks the other key's label). The validZoneAwarenessTopologyKeys validation enforces this.
  • Zone-awareness topology keys are exempt from --exposed-node-labels when the flag is empty (the default). When --exposed-node-labels is explicitly configured, zone-awareness topology keys must also be allowed by that policy. This allows zone awareness to work out of the box without requiring flag configuration.
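For instance, if the operator deployment pins the flag explicitly, the zone topology key has to appear in that list for zone awareness to function (a sketch of the operator container args; values illustrative):

```yaml
args:
- manager
- --exposed-node-labels=topology.kubernetes.io/zone
```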

Sample manifest

  • Updated config/samples/elasticsearch/elasticsearch.yaml with commented-out zoneAwareness example.

Assumptions and Edge-Case Policy

Non-zoneAware NodeSets when others are zoneAware

This is allowed. When at least one NodeSet enables zone awareness and another does not:

  • All NodeSets receive zone env/config (ZONE env var, node.attr.zone, allocation awareness attributes) for cluster-wide consistency.
  • Non-zoneAware NodeSets get a required node affinity Exists guard on the cluster topology key, ensuring pods only schedule on nodes that carry the label.
  • Non-zoneAware NodeSets do not get topology spread constraints automatically.

Mixed topology keys across NodeSets

Not allowed. All zone-aware NodeSets must share the same topology key. This is enforced by the validating webhook. The restriction exists because the operator uses a single shared init script and annotation set for the entire cluster.

User overrides

  • Existing user-provided topology spread constraints for the same key are preserved; the operator does not inject a duplicate.
  • Existing required node affinity expressions for the same key are not overwritten.
  • User-provided ZONE env var in podTemplate is preserved.
  • Explicit user Elasticsearch config (node.attr.zone, cluster.routing.allocation.awareness.attributes) takes precedence over operator defaults.
  • Legacy manual zone-awareness configurations (annotation + manual config/podTemplate) remain supported without modification.
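A sketch combining a couple of these overrides (all values illustrative):

```yaml
spec:
  nodeSets:
  - name: default
    count: 3
    zoneAwareness: {}
    config:
      # explicit user config takes precedence over the operator default
      cluster.routing.allocation.awareness.attributes: zone
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          # a user-provided ZONE env var is preserved as-is
          - name: ZONE
            value: my-static-zone
```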

Backward Compatibility

  • Existing manual zone-aware specs (annotation + manual config/podTemplate) are supported without modification.
  • Migration to zoneAwareness is incremental and optional.
  • Enabling zoneAwareness on any NodeSet triggers a one-time rolling restart of all NodeSets in the cluster due to the addition of zone-related config and env vars.
  • The DownwardNodeLabelsHashInput() method preserves the raw annotation value for hash computation, avoiding spurious rolling restarts for existing clusters that already use the eck.k8s.elastic.co/downward-node-labels annotation without zoneAwareness.

AI Usage

Cursor/gpt-5.3 codex used during:

  • Planning/Spec creation
  • Code writing
  • Test plan writing/execution
  • PR Description generation
  • PR review process (multiple review passes with different models to catch missed edge cases)

Significant manual cleanup was required after AI-generated code and tests.

@prodsecmachine (Collaborator) commented Feb 18, 2026

Snyk checks have passed. No issues have been found so far.

Scanner                 Critical  High  Medium  Low  Total
Open Source Security    0         0     0       0    0 issues
Licenses                0         0     0       0    0 issues


@botelastic botelastic bot added the triage label Feb 18, 2026
@github-actions bot commented Feb 18, 2026

🔍 Preview links for changed docs

@naemono naemono added the >feature Adds or discusses adding a feature to the product label Feb 20, 2026
@botelastic botelastic bot removed the triage label Feb 20, 2026
@naemono naemono changed the title [WIP] Simplify zone awareness Feb 20, 2026
@naemono naemono marked this pull request as ready for review February 20, 2026 15:37
@naemono naemono requested a review from a team as a code owner February 20, 2026 15:37
@pebrc (Collaborator) commented Feb 27, 2026

Potential pod deadlock when zone-aware NodeSets use different topology keys

DownwardNodeLabels() collects topology keys from ALL zone-aware NodeSets into a single union set. This set is used to generate one shared init script (via ReconcileScriptsConfigMap) that ALL pods use, regardless of which NodeSet they belong to. The init script busy-waits until every annotation in the set exists on the pod.

However, getPodAnnotations() in node_labels.go returns an error when the node is missing any expected label -- and a node hosting NodeSet A's pods will only have A's topology key, not B's. This means the operator never sets the annotations, and the init containers wait forever.

Example configuration that triggers this (passes validation):

nodeSets:
- name: hot
  zoneAwareness:
    topologyKey: "custom.io/rack"
- name: warm
  zoneAwareness:
    topologyKey: "custom.io/room"


Possible fixes: either tighten validation to require all zone-aware NodeSets share the same topology key, or make the init script / annotation system per-NodeSet-aware.

🤖 Generated with Claude Code


@naemono (Contributor, Author) commented Feb 27, 2026

Possible fixes: either tighten validation to require all zone-aware NodeSets share the same topology key, or make the init script / annotation system per-NodeSet-aware.

I required the same topology key here.


Labels

>feature Adds or discusses adding a feature to the product

5 participants