Skip to content

[server][da-vinci] Integrate OTel metrics into ServerHttpRequestStats #2541

Open
m-nagarajan wants to merge 12 commits intolinkedin:mainfrom
m-nagarajan:addOtelMetricsForAggServerReadOtelStats
Open

[server][da-vinci] Integrate OTel metrics into ServerHttpRequestStats #2541
m-nagarajan wants to merge 12 commits intolinkedin:mainfrom
m-nagarajan:addOtelMetricsForAggServerReadOtelStats

Conversation

@m-nagarajan
Copy link
Contributor

@m-nagarajan m-nagarajan commented Feb 27, 2026

Problem Statement

Server-side read request metrics in ServerHttpRequestStats only used Tehuti sensors, with no OpenTelemetry (OTel) instrumentation.

Solution

Refactor ServerHttpRequestStats to colocate OTel MetricEntityState fields alongside Tehuti Sensor fields, following the pattern established by RouterHttpRequestStats. A single MetricEntityState.record() call writes to both OTel and Tehuti systems, eliminating the need for a separate OTel stats class.

Key changes:

  • ServerHttpRequestStats: Replace raw Sensor fields with MetricEntityState fields for 17 OTel metric entities. Recording methods now accept HTTP status enum dimensions for OTel. Tehuti-only sensors (small/large value breakdown, read-compute efficiency, early termination, misrouted requests) are retained as plain Sensor fields.
  • ServerReadOtelMetricEntity: New enum defining 17 OTel metric entities with dimensional attributes (store, cluster, request method, HTTP status, chunking status, compute operation type).
  • VeniceChunkingStatus, VeniceComputeOperationType: New dimension enums for VENICE_CHUNKING_STATUS and VENICE_READ_COMPUTE_OPERATION_TYPE.
  • VeniceMetricsDimensions: Add VENICE_CHUNKING_STATUS and VENICE_READ_COMPUTE_OPERATION_TYPE entries.
  • ServerStatsContext: Compute HTTP status enums in successRequest()/errorRequest() and pass them to unified recording methods.
  • AggServerHttpRequestStats: Update delegating methods (recordErrorRequest, recordErrorRequestLatency) for new signatures.
  • Response stats classes: Add getResponseValueSize() to enable aggregate OTel value size recording. Unified recording flows through ServerHttpRequestStats methods that handle both systems.
  • ServerMetricEntity: Register ServerReadOtelMetricEntity in SERVER_METRIC_ENTITIES.

Per-category OTel config override:

  • ConfigKeys.SERVER_READ_OTEL_STATS_ENABLED (default: true): Controls whether read OTel metrics are emitted. When set to false, OpenTelemetryMetricsSetup.build() short-circuits before any OTel instrument creation, and MetricEntityState.record() skips OTel via the existing otelMetric == null fast path. Tehuti metrics remain fully operational.
  • OpenTelemetryMetricsSetup.Builder.setOtelEnabledOverride(): New per-category gate applied on top of the global OTel config. OTel metrics are emitted only when both the global config and the override are enabled. Setting the override to false disables OTel for the stats class being built; setting it to true cannot re-enable OTel when the global config is disabled.
  • The existing SERVER_INGESTION_OTEL_STATS_ENABLED config is also threaded through setOtelEnabledOverride() for consistency and defense-in-depth, ensuring the same global+override AND semantics for both read and ingestion paths.

Compute-only metric naming and dimension cleanup:

All metrics that fire exclusively for read-compute requests use a storage_engine.read.compute.* naming convention and omit VENICE_REQUEST_METHOD from their dimensions (since it is always COMPUTE). A computeBaseDimensionsMap (with VENICE_REQUEST_METHOD removed) and corresponding computeBaseAttributes are built once in the constructor and shared across these metrics:

  • STORAGE_ENGINE_READ_COMPUTE_DESERIALIZATION_TIME — dims: store, cluster, chunking status. Chunking status is retained because deserialization operates on potentially chunked values.
  • STORAGE_ENGINE_READ_COMPUTE_SERIALIZATION_TIME — dims: store, cluster. Chunking status omitted because serialization happens after chunk reassembly.
  • STORAGE_ENGINE_READ_COMPUTE_EXECUTION_COUNT — dims: store, cluster, operation type. Per-operation-type count (dot product, cosine similarity, etc.).
  • STORAGE_ENGINE_READ_COMPUTE_EXECUTION_TIME — dims: store, cluster. Records compute execution latency. Does not differentiate by VENICE_READ_COMPUTE_OPERATION_TYPE because ComputeUtils.computeResult() runs all operations in a single loop without per-operation timing. The dimension can be added here and per-operation timing instrumented in ComputeUtils if a per-type latency breakdown is needed in the future.

Dimension enum renames and cleanup:

  • VeniceIngestionFailureReason: Renamed BOOTSTRAP_TIMEOUTSERVING_VERSION_BOOTSTRAP_TIMEOUT (serving version partition bootstrap exceeded time limit; task stays alive) and PUSH_TIMEOUTFUTURE_VERSION_PUSH_TIMEOUT (future version ingestion task killed due to bootstrap timeout). Dimension values updated accordingly (bootstrap_timeoutserving_version_bootstrap_timeout, push_timeoutfuture_version_push_timeout).
  • VeniceWriteComputeOperationVenicePartialUpdateOperation: Renamed enum, dimension (VENICE_WRITE_COMPUTE_OPERATION_PHASEVENICE_PARTIAL_UPDATE_OPERATION_PHASE, value venice.partial_update.operation_phase), and associated OTel metric entities (WRITE_COMPUTE_TIMEPARTIAL_UPDATE_TIME, WRITE_COMPUTE_CACHE_HIT_COUNTPARTIAL_UPDATE_CACHE_HIT_COUNT) along with all recording methods and field names to use "partial update" consistently.
  • VeniceIngestionSourceComponent: Removed LEADER_CONSUMER and FOLLOWER_CONSUMER values — consumers are only meaningful as destinations, not sources. Updated Javadoc on both VeniceIngestionSourceComponent and VeniceIngestionDestinationComponent to explain the asymmetric value sets.

Design decisions:

  • OTel disabled for total stats: OpenTelemetryMetricsSetup returns otelRepository=null for total stats, so MetricEntityState records Tehuti only.
  • 429 handling: recordBasicMetrics() records Tehuti metrics (including responseSize and valueSize) for all requests including 429. OTel metrics requiring HTTP status dimensions are only recorded in successRequest()/errorRequest(), which are not called for 429 responses.
  • Unified Tehuti+OTel for compute execution latency: readComputeQueryTimeMetric is a single MetricEntityStateBase field mapped to STORAGE_ENGINE_READ_COMPUTE_EXECUTION_TIME that records to both Tehuti and OTel. This avoids the need for separate Tehuti-only and OTel-only fields for the same measurement.

Code changes

  • Added new code behind a config. Config names and defaults:
    • server.read.otel.stats.enabled — default true
    • server.ingestion.otel.stats.enabled — default true (existing, now also threaded through setOtelEnabledOverride)
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Tests added/updated:

  • ServerReadOtelMetricEntityTest: Validates all 17 OTel metric entity definitions (name, type, unit, dimensions).
  • ServerStatsContextTest: Comprehensive tests for success/error/429 recording paths, OTel dimension propagation, boundary conditions (default responseSize=-1, valueSize=0), composite recorder delegation, merge operations for compute stats with size profiling.
  • AggServerHttpRequestStatsTest: Updated for new method signatures, per-store-to-total propagation tests, and status triplet deduplication.
  • StorageReadRequestHandlerTest: Updated for method renames.
  • ServerMetricEntityTest: Updated entity count for new ServerReadOtelMetricEntity.
  • VeniceChunkingStatusTest, VeniceComputeOperationTypeTest, VenicePartialUpdateOperationTest, VeniceMetricsDimensionsTest: Tests for new and renamed dimension enums.
  • OpenTelemetryMetricsSetupTest: DataProvider-based test covering the full interaction matrix between global OTel config and otelEnabledOverride (6 combinations, parallel-safe).
  • VeniceServerConfigTest: Tests for readOtelStatsEnabled and ingestionOtelStatsEnabled defaults and explicit overrides.
  • IngestionOtelStatsTest: Tests for ingestionOtelStatsEnabled override disabling OTel when global config is enabled.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

The following OTel dimension names, dimension values, and metric names were renamed. These were introduced very recently and are not yet consumed by any dashboards or alerts, so they are directly modified without backwards-compatibility shims.

Dimension renames:

Before After
VENICE_WRITE_COMPUTE_OPERATION_PHASE (venice.write_compute.operation_phase) VENICE_PARTIAL_UPDATE_OPERATION_PHASE (venice.partial_update.operation_phase)

Dimension value renames (VeniceIngestionFailureReason):

Before After
bootstrap_timeout serving_version_bootstrap_timeout
push_timeout future_version_push_timeout

OTel metric name renames (IngestionOtelMetricEntity):

Before After
ingestion.write_compute.time ingestion.partial_update.time
ingestion.write_compute.cache.hit_count ingestion.partial_update.cache.hit_count
… using

Router Pattern

Refactor ServerHttpRequestStats to colocate OTel MetricEntityState fields
alongside
Tehuti Sensor fields, following the pattern established by
RouterHttpRequestStats.
A single MetricEntityState.record() call writes to both OTel and Tehuti
systems.

Key changes:
- ServerHttpRequestStats: Replace raw Sensor fields with MetricEntityState
fields for
  16 OTel metric entities (READ_CALL_TIME, READ_CALL_COUNT,
STORAGE_ENGINE_QUERY_TIME,
  etc.). Recording methods now accept HTTP status enum dimensions.
- ServerReadOtelMetricEntity: New enum defining 16 OTel metric entities with
dimensions
  (store, cluster, request method, HTTP status, chunking status, compute
operation type).
- VeniceChunkingStatus, VeniceComputeOperationType: New dimension enums for
  VENICE_CHUNKING_STATUS and VENICE_COMPUTE_OPERATION_TYPE.
- VeniceMetricsDimensions: Add VENICE_CHUNKING_STATUS and
VENICE_COMPUTE_OPERATION_TYPE.
- ServerStatsContext: Compute HTTP status enums in
successRequest/errorRequest and pass
  to unified recording methods. OTel responseSize/valueSize recorded with
dimensions;
  Tehuti responseSize recorded in recordBasicMetrics (runs for all requests
incl. 429).
- AggServerHttpRequestStats: Update delegating methods for new signatures.
- Response stats classes: Remove OTel-specific recording methods; unified
recording
  flows through ServerHttpRequestStats methods that handle both systems.
- ServerMetricEntity: Register ServerReadOtelMetricEntity in
SERVER_METRIC_ENTITIES.
- Comprehensive test coverage for all recording paths, merge operations, and
boundary
  conditions.
Copilot AI review requested due to automatic review settings February 27, 2026 10:24
@m-nagarajan m-nagarajan changed the title [server][da-vinci] Integrate OTel metrics into ServerHttpRequestStats using Router Pattern Feb 27, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds OpenTelemetry (OTel) instrumentation to server-side read request stats by refactoring ServerHttpRequestStats to follow the established “Router Pattern” where OTel MetricEntityState and Tehuti Sensor recording coexist in one stats class, enabling unified recording APIs and richer dimensional metrics.

Changes:

  • Refactor ServerHttpRequestStats to use OTel MetricEntityState* fields (alongside retained Tehuti-only sensors) and update server recording paths to pass HTTP status dimensions.
  • Introduce new OTel metric entity definitions (ServerReadOtelMetricEntity) and new dimension enums (VeniceChunkingStatus, VeniceComputeOperationType) plus VeniceMetricsDimensions updates.
  • Extend read-response stats recorders with getResponseValueSize() to support aggregate OTel value size recording and update ingestion failure metrics to use more specific VeniceIngestionFailureReason values.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
services/venice-server/src/test/java/com/linkedin/venice/stats/AggServerHttpRequestStatsTest.java Updates tests for new status-dimensioned recording APIs and compute metric method renames.
services/venice-server/src/test/java/com/linkedin/venice/listener/StorageReadRequestHandlerTest.java Updates mocks/verifications for renamed compute metric recording methods.
services/venice-server/src/test/java/com/linkedin/venice/grpc/ServerStatsContextTest.java Expands unit coverage for success/error/429 paths and OTel size recording behavior.
services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java Core refactor: integrate OTel MetricEntityState* with Tehuti sensors and add OTel-only size recording helpers.
services/venice-server/src/main/java/com/linkedin/venice/stats/AggServerHttpRequestStats.java Wires new constructor signature and updated error-recording method signatures.
services/venice-server/src/main/java/com/linkedin/venice/listener/response/stats/SingleGetResponseStats.java Adds getResponseValueSize() and clarifies Tehuti-only per-key value size recording.
services/venice-server/src/main/java/com/linkedin/venice/listener/response/stats/ReadResponseStatsRecorder.java Adds default getResponseValueSize() API for aggregate value-size reporting.
services/venice-server/src/main/java/com/linkedin/venice/listener/response/stats/MultiGetResponseStatsWithSizeProfiling.java Tracks/merges aggregate value size for OTel while keeping per-key Tehuti recording unmerged.
services/venice-server/src/main/java/com/linkedin/venice/listener/response/stats/ComputeResponseStats.java Renames compute metric recorders and exposes aggregate getResponseValueSize().
services/venice-server/src/main/java/com/linkedin/venice/listener/response/ParallelMultiKeyResponseWrapper.java Composite recorder now exposes merged getResponseValueSize() for aggregate OTel recording.
services/venice-server/src/main/java/com/linkedin/venice/listener/ServerStatsContext.java Computes HTTP status enums/categories and records OTel-only response/value size with status dimensions.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java Adds assertions for new dimension names across naming formats.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceIngestionSourceComponentTest.java Updates expected dimension values after source component enum value set change.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceIngestionFailureReasonTest.java Updates expected mappings for renamed/split ingestion failure reasons.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceComputeOperationTypeTest.java Adds tests for new compute operation type dimension enum.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceChunkingStatusTest.java Adds tests for new chunking-status dimension enum.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java Registers new dimensions for chunking status and compute operation type.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceIngestionSourceComponent.java Removes consumer-only values from “source” enum and clarifies source/destination semantics.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceIngestionFailureReason.java Replaces generic timeouts with more specific failure reasons for ingestion metrics.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceIngestionDestinationComponent.java Updates docs to match paired asymmetric enum design with the source component enum.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceComputeOperationType.java Adds new dimension enum for read-compute operation type.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceChunkingStatus.java Adds new dimension enum for chunked vs unchunked read behavior.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/IngestionOtelStatsTest.java Extends ingestion OTel tests (imports and additional metric recording validations).
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerReadOtelMetricEntityTest.java Adds tests validating the 16 new server-read OTel metric entity definitions.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java Updates expected server metric entity registry count after adding server-read entities.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerReadOtelMetricEntity.java Defines the 16 OTel metric entities for server-side read observability.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java Registers ServerReadOtelMetricEntity into the server metric entity registry.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTask.java Threads specific VeniceIngestionFailureReason into ingestion exception handling and metrics.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/StoreIngestionService.java Adds an overload to record ingestion failure with version + failure-reason dimension.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java Records serving-version bootstrap timeout failures with the new failure-reason dimension.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/KafkaStoreIngestionService.java Implements the new recordIngestionFailure(store, version, reason) overload.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/helix/StateModelIngestionProgressNotifier.java Records bootstrap-to-online timeout failures with version + specific failure-reason dimension.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…s line

count

LeaderFollowerStoreIngestionTask:
- Use null-safe field access for stats recording in checkLongRunningTaskState,
  since tests partially mock the task via doCallRealMethod() leaving fields
null
- Use getVersionIngestionStats() getter in produceToLocalKafka to match test
  mock stubbing pattern

ServerHttpRequestStats:
- Extract createComputeOpMetric helper for 4 identical compute operation
metrics
- Remove section-header organizational comments and trim verbose Javadocs
- Reduces per-file added lines from 546 to 478 (under 500 CI limit)
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…size

metrics

Eliminate split recording where Tehuti and OTel were recorded at different
call sites for responseSize and valueSize. Thread HTTP status enum dimensions
through the ReadResponseStatsRecorder hierarchy so both systems are recorded
in a single MetricEntityState.record() call.

Also make MetricEntityState.recordOtelMetric() package-private to prevent
external callers from bypassing the unified record() API, and add a null
guard for responseStatus in recordBasicMetrics() to fix a SpotBugs finding.

Key changes:
- ServerHttpRequestStats: Replace 4 split methods (recordResponseSize +
  recordResponseSizeOtelOnly, recordValueSizeInByte +
  recordValueSizeInByteOtelOnly) with 2 unified methods
- ReadResponseStatsRecorder: Add HTTP status enum params to recordMetrics()
  and recordUnmergedMetrics() signatures
- ServerStatsContext: Compute HTTP status enums once in recordBasicMetrics()
  and pass through all recording paths; add responseStatus null guard
- MetricEntityState: Change recordOtelMetric visibility from public to
  package-private
- Update all response stats subclasses, tests, and callers
Copilot AI review requested due to automatic review settings February 27, 2026 19:10
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments. Please take a look

independently

Add SERVER_READ_OTEL_STATS_ENABLED config that allows disabling read OTel
metrics without affecting other server/DaVinci OTel metrics. The per-category
override is applied on top of the global OTel config — both must be enabled
for OTel to emit.

Also thread the existing SERVER_INGESTION_OTEL_STATS_ENABLED through
OpenTelemetryMetricsSetup.setOtelEnabledOverride() for consistency and
defense-in-depth, ensuring the same global+override AND semantics.
Copilot AI review requested due to automatic review settings February 28, 2026 00:46
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

this.readComputeSerializationLatency,
isAssembledMultiChunkLargeValue,
0);
consumeDoubleIfAbove(stats::recordReadComputeSerializationLatency, this.readComputeSerializationLatency, 0);
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recordReadComputeSerializationLatency method signature was changed from (double latency, boolean assembledMultiChunkLargeValue) to (double latency). In ComputeResponseStats.recordMetrics, the call site now uses consumeDoubleIfAbove(stats::recordReadComputeSerializationLatency, ...) with the single-arg version, while the old code used consumeDoubleAndBooleanIfAbove. The assembledMultiChunkLargeValue parameter is no longer passed, meaning the serialization time metric is no longer tagged with chunking status in OTel. The STORAGE_ENGINE_QUERY_SERIALIZATION_TIME metric definition also uses VENICE_REQUEST_METHOD only (no chunking status), which is consistent with the new signature. However, the existing Tehuti sensor readComputeLatencyForSmallValueSensor/readComputeLatencyForLargeValueSensor equivalents for serialization are gone; this is a behavioral change worth verifying is intentional (there's no serialization latency breakdown by chunking status in Tehuti either in the new code).

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deserialization has chunking dim whereas serialization doesn't. Its intentional and updated the comments and PR desc.

LEADER_CONSUMER,
/** The follower replica consumer that replicates from leader */
FOLLOWER_CONSUMER;
SOURCE_BROKER;
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VeniceIngestionSourceComponent enum removes LEADER_CONSUMER and FOLLOWER_CONSUMER. These removals are potentially breaking for any existing OTel metric data that was previously emitted with these dimension values, and for any metrics-analysis dashboards or alerts that query these dimension values. Additionally, there is a discrepancy: the updated Javadoc says "LEADER_CONSUMER and FOLLOWER_CONSUMER are only meaningful as destinations", but VeniceIngestionDestinationComponent still contains these values. The test in VeniceIngestionSourceComponentTest.java removes those enum values from the expected mapping, which confirms the intent. However, this is a breaking change for existing OTel pipelines that may have data tagged with leader_consumer or follower_consumer source component values.

Suggested change
SOURCE_BROKER;
SOURCE_BROKER,
/**
* Legacy value kept for backward compatibility with existing OTel metrics that may have
* {@code leader_consumer} as a source component. New code should not use this as a source;
* LEADER_CONSUMER is only meaningful as a destination component.
*/
@Deprecated
LEADER_CONSUMER,
/**
* Legacy value kept for backward compatibility with existing OTel metrics that may have
* {@code follower_consumer} as a source component. New code should not use this as a source;
* FOLLOWER_CONSUMER is only meaningful as a destination component.
*/
@Deprecated
FOLLOWER_CONSUMER;
Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very recent change, so directly modified it whout Deprecated tags. Updated the PR desc.

Comment on lines 22 to +26
CHECKSUM_VERIFICATION_FAILURE,
/** The ingestion task timed out while waiting for bootstrap (initial data load) to complete */
BOOTSTRAP_TIMEOUT,
/** The push job timed out before all data could be ingested */
PUSH_TIMEOUT,
/** A serving (current/backup) version partition's bootstrap exceeded the time limit; the task stays alive */
SERVING_VERSION_BOOTSTRAP_TIMEOUT,
/** A future version's ingestion task was killed because bootstrap timed out (VeniceTimeoutException) */
FUTURE_VERSION_PUSH_TIMEOUT,
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VeniceIngestionFailureReason enum renames BOOTSTRAP_TIMEOUT to SERVING_VERSION_BOOTSTRAP_TIMEOUT and PUSH_TIMEOUT to FUTURE_VERSION_PUSH_TIMEOUT. This is a breaking change for any existing OTel metric data tagged with the old dimension values bootstrap_timeout or push_timeout. Existing dashboards and alerts querying these values will silently stop working. The description in the PR notes these are renamings, but there's no migration strategy documented. If these values have been emitted to production OTel storage, this creates a gap in metric continuity.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very recent change, so directly modified it whout Deprecated tags. Updated the PR desc.

/** {@link VeniceWriteComputeOperation} Write compute operation phase (query, update) */
VENICE_WRITE_COMPUTE_OPERATION("venice.write_compute.operation"),
/** {@link VeniceWriteComputeOperation} Phase of a write-compute (partial update) operation: query or update */
VENICE_WRITE_COMPUTE_OPERATION_PHASE("venice.write_compute.operation_phase"),
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VENICE_WRITE_COMPUTE_OPERATION enum member in VeniceMetricsDimensions is renamed to VENICE_WRITE_COMPUTE_OPERATION_PHASE. This changes the OTel dimension name from venice.write_compute.operation to venice.write_compute.operation_phase. Any existing dashboards, alerts, or metric queries using the old dimension name will break. This is a breaking API change.

Suggested change
VENICE_WRITE_COMPUTE_OPERATION_PHASE("venice.write_compute.operation_phase"),
VENICE_WRITE_COMPUTE_OPERATION_PHASE("venice.write_compute.operation"),
Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very recent change, so directly modified it whout Deprecated tags. Updated the PR desc.

Comment on lines +58 to +59
storeIngestionService
.recordIngestionFailure(storeName, version, VeniceIngestionFailureReason.SERVING_VERSION_BOOTSTRAP_TIMEOUT);
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StateModelIngestionProgressNotifier.waitConsumptionCompleted() and LeaderFollowerStoreIngestionTask.checkLongRunningTaskState() both record SERVING_VERSION_BOOTSTRAP_TIMEOUT for current/backup versions when a bootstrap timeout occurs, potentially causing double-counting of the version-level metric. Previously, the notifier only recorded host-level failure (recordIngestionFailure(storeName)), while checkLongRunningTaskState recorded both levels. Now both record version-level metrics via recordIngestionFailure(storeName, version, SERVING_VERSION_BOOTSTRAP_TIMEOUT). If both timeouts fire for the same partition (they have different thresholds: bootstrapToOnlineTimeoutInHours in the notifier vs getBootstrapTimeoutInMs() in the task), the version-level INGESTION_FAILURE_COUNT metric with SERVING_VERSION_BOOTSTRAP_TIMEOUT reason will be double-counted for current/backup versions.

Suggested change
storeIngestionService
.recordIngestionFailure(storeName, version, VeniceIngestionFailureReason.SERVING_VERSION_BOOTSTRAP_TIMEOUT);
storeIngestionService.recordIngestionFailure(storeName);
Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says "No double-counting in practice. Both paths use the same timeout value (store.getBootstrapToOnlineTimeoutInHours()). The SIT's checkLongRunningTaskState() polls every ~100ms and detects the timeout first. It calls reportError() which notifies StateModelIngestionProgressNotifier.error(), counting down the latch — so waitConsumptionCompleted()'s await() returns true (latch released) and its timeout branch does not execute. Additionally, the Helix path starts its await() after setupNewStorePartition() completes, while consumptionStartTimeInMs is set during setup, so the SIT timer is always ahead. A theoretical race exists if both fire in the same instant due to thread scheduling, but the window is vanishingly small. This pattern is pre-existing" Will check this further along with #2541 (comment)

Comment on lines 309 to 324
if (requestType != RequestType.SINGLE_GET) {
/**
* It is duplicate to have the key count tracking for single-get requests since the key count rate will be same
* as the request rate.
*/
requestKeyCountSensor = registerPerStoreAndTotal(
"request_key_count",
totalStats,
() -> totalStats.requestKeyCountSensor,
new Rate(),
requestKeyCount,
new Avg(),
new Max());
requestKeyCountMetric = MetricEntityStateBase.create(
READ_REQUEST_KEY_COUNT.getMetricEntity(),
otelRepository,
(name, stats) -> registerSensor(
name,
totalStats != null ? new Sensor[] { totalStats.requestKeyCountMetric.getTehutiSensor() } : null,
stats),
ServerTehutiMetricName.REQUEST_KEY_COUNT,
Arrays.asList(new Rate(), new OccurrenceRate(), new Avg(), new Max()),
baseDimensionsMap,
baseAttributes);
} else {
requestKeyCountSensor = null;
requestKeyCountMetric = MetricEntityStateBase
.create(READ_REQUEST_KEY_COUNT.getMetricEntity(), otelRepository, baseDimensionsMap, baseAttributes);
}
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recordRequestKeyCount method now calls requestKeyCountMetric.record(keyCount) unconditionally, but for RequestType.SINGLE_GET, the requestKeyCountMetric is created without a Tehuti sensor (registerTehutiSensorFn is not provided). While this is safe because recordTehutiMetric has a null-sensor guard, OTel will still record the key count for single-get requests. The old behavior with the null-sensor pattern preserved the semantic that single-get key counts are not tracked (since a single-get always has 1 key), but this OTel recording is now silently active. If this is intentional for OTel, a comment explaining the design decision would help.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single-get key count (always 1) is intentionally recorded in OTel, unlike Tehuti which skips it, so that KPS (keys per second) can be measured accurately from a single metric across all request types without special-casing single-get. Added a comment.

Copy link
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 2 review — the major issues from round 1 (null guard in recordBasicMetrics, silent defaults in successRequest/errorRequest, unused parameters, test duplication) have been addressed well. A few remaining items below.

HttpResponseStatusCodeCategory.getVeniceHttpResponseStatusCodeCategory(statusCode);
VeniceResponseStatusCategory veniceCategory = VeniceResponseStatusCategory.FAIL;

if (stats == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor — code duplication] The two branches of this if/else are identical except for stats vs currentStats. If one branch is updated and the other is forgotten, they'll silently diverge.

Suggestion:

ServerHttpRequestStats target = (stats != null) ? stats : currentStats;
target.recordErrorRequest(statusEnum, statusCategory, veniceCategory);
target.recordErrorRequestLatency(statusEnum, statusCategory, veniceCategory, elapsedTime);
if (isMisroutedStoreVersion) {
    target.recordMisroutedStoreVersionRequest();
}
HttpResponseStatusCodeCategory.getVeniceHttpResponseStatusCodeCategory(statusCode);
// In Venice, NOT_FOUND (key absent) is a valid/expected response, not an error.
VeniceResponseStatusCategory veniceCategory =
(responseStatus.equals(HttpResponseStatus.OK) || responseStatus.equals(HttpResponseStatus.NOT_FOUND))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Test gap] This Venice-specific classification (NOT_FOUND = SUCCESS) is non-obvious and has no dedicated test covering it. If someone modifies this condition (e.g., removes NOT_FOUND), the dimension values for all key-not-found requests would silently change.

Please add a test like testRecordBasicMetricsNotFoundIsSuccess that sets responseStatus = HttpResponseStatus.NOT_FOUND, calls recordBasicMetrics, and verifies the response size is tagged with VeniceResponseStatusCategory.SUCCESS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned this up by moving this method to VeniceResponseStatusCategory and reused it. Also, added tests.

serverHttpRequestStats.recordFlushLatency(flushLatency);
// Status-dependent metrics require responseStatus for OTel dimensions
if (responseStatus != null) {
int statusCode = responseStatus.code();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor — maintenance] This 4-line status-dimension computation (statusCode, statusEnum, statusCategory) is duplicated identically in successRequest (line 237) and errorRequest (line 252). For every request, it runs twice — once here and once in the success/error path.

Consider computing once and caching as fields when setResponseStatus() is called, or extracting a private helper. This removes the duplication and the redundant computation.

Copy link
Contributor Author

@m-nagarajan m-nagarajan Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned this up a bit. This could be further cleaned across the code base, but I think this is good enough for now. I will explore a short cleanup PR for this after this. WDYT?

setOf(VENICE_STORE_NAME, VENICE_CLUSTER_NAME, VENICE_REQUEST_METHOD, VENICE_CHUNKING_STATUS)
),

STORAGE_ENGINE_QUERY_SERIALIZATION_TIME(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor — confirm intentional] STORAGE_ENGINE_QUERY_SERIALIZATION_TIME does NOT include VENICE_CHUNKING_STATUS in its dimensions, while STORAGE_ENGINE_QUERY_DESERIALIZATION_TIME (line 109) does.

This asymmetry makes sense (serialization operates on computed results which are never chunked, while deserialization operates on raw stored values which may be chunked), but please confirm this is intentional and not an oversight. A brief comment on the serialization entry would help future readers understand the difference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments.

totalStats,
() -> totalStats.countOperatorSensor,
avgAndTotal());
dotProductCountMetric = createComputeOpMetric(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor — design] dotProductCountMetric, cosineCountMetric, hadamardCountMetric, and countOperatorCountMetric are four separate MetricEntityStateOneEnum<VeniceComputeOperationType> instances that all share the same OTel entity (STORAGE_ENGINE_COMPUTE_OPERATION_COUNT). Each record method hardcodes a single enum value (e.g., dotProductCountMetric.record(count, VeniceComputeOperationType.DOT_PRODUCT)).

This works (OTel SDK deduplicates by metric name), but it creates 4 instrument instances for what is semantically one metric with an operation-type dimension. A single field with a recordComputeOperationCount(int count, VeniceComputeOperationType opType) method would be cleaner and avoid the null checks scattered across 4 record methods (lines 629-651). The separate Tehuti sensors could still be maintained alongside.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates 1 Otel instrument with dimensions as the instrument creation is inside a computeIfAbsent of instruments based on the MetricEntity.getMetricName(). The duplication just reuses the otel metric, but creates new tehuti metrics, one for each operation.

Copy link
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up: test duplication in AggServerHttpRequestStatsTest.

singleGetServerStatsFoo.recordErrorRequest();
singleGetServerStatsBar.recordErrorRequest();
singleGetServerStatsFoo.recordSuccessRequest(
HttpResponseStatusEnum.OK,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Test duplication — Minor] The status triplet (HttpResponseStatusEnum.OK, HttpResponseStatusCodeCategory.SUCCESS, VeniceResponseStatusCategory.SUCCESS) is repeated 12 times across this file; the error triplet appears 4 times. This duplication was introduced by this PR (the old API didn't require enum params).

ServerStatsContextTest already demonstrates the right pattern — please apply the same here:

private static final HttpResponseStatusEnum OK_STATUS = HttpResponseStatusEnum.OK;
private static final HttpResponseStatusCodeCategory OK_CATEGORY = HttpResponseStatusCodeCategory.SUCCESS;
private static final VeniceResponseStatusCategory OK_VENICE = VeniceResponseStatusCategory.SUCCESS;

private static final HttpResponseStatusEnum ERROR_STATUS = HttpResponseStatusEnum.INTERNAL_SERVER_ERROR;
private static final HttpResponseStatusCodeCategory ERROR_CATEGORY = HttpResponseStatusCodeCategory.SERVER_ERROR;
private static final VeniceResponseStatusCategory ERROR_VENICE = VeniceResponseStatusCategory.FAIL;

This reduces each call site from 3 lines to 1 and makes the test much more readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleaned these up, thanks.

Copy link
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. LGTM but few minor comments

for (int partition: timeoutPartitions) {
reportError(errorMsg, partition, ex);
}
if (isMetricsEmissionEnabled()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we get rid of isMetricsEmissionEnabled flag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch. So the host level ingestion metrics is emitted only for the largest version number and that breaks the otel versioned data, it only probably records the current/future based on the largest available version number. I will handle this in a follow up PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename this to Partial Update to make it consistent with code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah makes sense. Modified the metric names and dimensions accordingly.

*/
private boolean statCallbackExecuted = false;

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is useful. Why delete it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the described firstPartLatency/secondPartLatency anywhere, hence removed. Am I missing something here?

@sushantmane sushantmane added the requested-changes Reviewer requested changes label Feb 28, 2026
Copilot AI review requested due to automatic review settings March 1, 2026 03:22
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 56 out of 56 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 204 to 216
// Shares STORAGE_ENGINE_QUERY_TIME OTel entity with storageEngineQueryTimeMetric.
// Differentiated in OTel by VENICE_REQUEST_METHOD base dimension (COMPUTE vs MULTI_GET/SINGLE_GET).
// Each has its own Tehuti sensor.
readComputeQueryTimeMetric = MetricEntityStateOneEnum.create(
STORAGE_ENGINE_QUERY_TIME.getMetricEntity(),
otelRepository,
registerPerStoreAndTotal(totalStats != null ? totalStats.readComputeQueryTimeMetric : null),
ServerTehutiMetricName.STORAGE_ENGINE_READ_COMPUTE_LATENCY,
Arrays.asList(
TehutiUtils
.getPercentileStatWithAvgAndMax(getName(), getFullMetricName("storage_engine_read_compute_latency"))),
baseDimensionsMap,
VeniceChunkingStatus.class);
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both storageEngineQueryTimeMetric and readComputeQueryTimeMetric share the same OTel metric entity STORAGE_ENGINE_QUERY_TIME and are expected to be differentiated by the VENICE_REQUEST_METHOD base dimension. However, in a ServerHttpRequestStats instance for RequestType.COMPUTE, both metrics would be created with baseDimensionsMap containing VENICE_REQUEST_METHOD=compute. When recordReadComputeLatency and recordDatabaseLookupLatency are both called on a compute request, both will write to the same OTel time series (same metric name + same store+cluster+request_method+chunking_status dimensions), causing double-counting in OTel. Tehuti is fine because the two have distinct sensor names (storage_engine_read_compute_latency vs storage_engine_query_latency). The comment at line 204-206 acknowledges the shared entity but doesn't address the double-counting risk. Consider whether recordDatabaseLookupLatency is actually called for compute requests (if not, this is benign), or whether STORAGE_ENGINE_QUERY_TIME should use different dimension values (e.g., a sub-type dimension) to distinguish the two.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Added storage_engine.read.compute.execution_time to differentiate between these.

Copilot AI review requested due to automatic review settings March 1, 2026 05:56
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 56 out of 56 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@m-nagarajan m-nagarajan added review-addressed Author has addressed review comments and removed requested-changes Reviewer requested changes labels Mar 1, 2026
@m-nagarajan m-nagarajan requested a review from sushantmane March 2, 2026 01:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-addressed Author has addressed review comments

3 participants