[server][da-vinci] Integrate OTel metrics into ServerHttpRequestStats by m-nagarajan · Pull Request #2541 · linkedin/venice

m-nagarajan · 2026-02-27T10:24:21Z

Problem Statement

Server-side read request metrics in ServerHttpRequestStats only used Tehuti sensors, with no OpenTelemetry (OTel) instrumentation.

Solution

Refactor ServerHttpRequestStats to colocate OTel MetricEntityState fields alongside Tehuti Sensor fields, following the pattern established by RouterHttpRequestStats. A single MetricEntityState.record() call writes to both OTel and Tehuti systems, eliminating the need for a separate OTel stats class.

Key changes:

ServerHttpRequestStats: Replace raw Sensor fields with MetricEntityState fields for 17 OTel metric entities. Recording methods now accept HTTP status enum dimensions for OTel. Tehuti-only sensors (small/large value breakdown, read-compute efficiency, early termination, misrouted requests) are retained as plain Sensor fields.
ServerReadOtelMetricEntity: New enum defining 17 OTel metric entities with dimensional attributes (store, cluster, request method, HTTP status, chunking status, compute operation type).
VeniceChunkingStatus, VeniceComputeOperationType: New dimension enums for VENICE_CHUNKING_STATUS and VENICE_READ_COMPUTE_OPERATION_TYPE.
VeniceMetricsDimensions: Add VENICE_CHUNKING_STATUS and VENICE_READ_COMPUTE_OPERATION_TYPE entries.
ServerStatsContext: Compute HTTP status enums in successRequest()/errorRequest() and pass them to unified recording methods.
AggServerHttpRequestStats: Update delegating methods (recordErrorRequest, recordErrorRequestLatency) for new signatures.
Response stats classes: Add getResponseValueSize() to enable aggregate OTel value size recording. Unified recording flows through ServerHttpRequestStats methods that handle both systems.
ServerMetricEntity: Register ServerReadOtelMetricEntity in SERVER_METRIC_ENTITIES.

Per-category OTel config override:

ConfigKeys.SERVER_READ_OTEL_STATS_ENABLED (default: true): Controls whether read OTel metrics are emitted. When set to false, OpenTelemetryMetricsSetup.build() short-circuits before any OTel instrument creation, and MetricEntityState.record() skips OTel via the existing otelMetric == null fast path. Tehuti metrics remain fully operational.
OpenTelemetryMetricsSetup.Builder.setOtelEnabledOverride(): New per-category gate applied on top of the global OTel config. OTel metrics are emitted only when both the global config and the override are enabled. Setting the override to false disables OTel for the stats class being built; setting it to true cannot re-enable OTel when the global config is disabled.
The existing SERVER_INGESTION_OTEL_STATS_ENABLED config is also threaded through setOtelEnabledOverride() for consistency and defense-in-depth, ensuring the same global+override AND semantics for both read and ingestion paths.

Compute-only metric naming and dimension cleanup:

All metrics that fire exclusively for read-compute requests use a storage_engine.read.compute.* naming convention and omit VENICE_REQUEST_METHOD from their dimensions (since it is always COMPUTE). A computeBaseDimensionsMap (with VENICE_REQUEST_METHOD removed) and corresponding computeBaseAttributes are built once in the constructor and shared across these metrics:

STORAGE_ENGINE_READ_COMPUTE_DESERIALIZATION_TIME — dims: store, cluster, chunking status. Chunking status is retained because deserialization operates on potentially chunked values.
STORAGE_ENGINE_READ_COMPUTE_SERIALIZATION_TIME — dims: store, cluster. Chunking status omitted because serialization happens after chunk reassembly.
STORAGE_ENGINE_READ_COMPUTE_EXECUTION_COUNT — dims: store, cluster, operation type. Per-operation-type count (dot product, cosine similarity, etc.).
STORAGE_ENGINE_READ_COMPUTE_EXECUTION_TIME — dims: store, cluster. Records compute execution latency. Does not differentiate by VENICE_READ_COMPUTE_OPERATION_TYPE because ComputeUtils.computeResult() runs all operations in a single loop without per-operation timing. The dimension can be added here and per-operation timing instrumented in ComputeUtils if a per-type latency breakdown is needed in the future.

Dimension enum renames and cleanup:

VeniceIngestionFailureReason: Renamed BOOTSTRAP_TIMEOUT → SERVING_VERSION_BOOTSTRAP_TIMEOUT (serving version partition bootstrap exceeded time limit; task stays alive) and PUSH_TIMEOUT → FUTURE_VERSION_PUSH_TIMEOUT (future version ingestion task killed due to bootstrap timeout). Dimension values updated accordingly (bootstrap_timeout → serving_version_bootstrap_timeout, push_timeout → future_version_push_timeout).
VeniceWriteComputeOperation → VenicePartialUpdateOperation: Renamed enum, dimension (VENICE_WRITE_COMPUTE_OPERATION_PHASE → VENICE_PARTIAL_UPDATE_OPERATION_PHASE, value venice.partial_update.operation_phase), and associated OTel metric entities (WRITE_COMPUTE_TIME → PARTIAL_UPDATE_TIME, WRITE_COMPUTE_CACHE_HIT_COUNT → PARTIAL_UPDATE_CACHE_HIT_COUNT) along with all recording methods and field names to use "partial update" consistently.
VeniceIngestionSourceComponent: Removed LEADER_CONSUMER and FOLLOWER_CONSUMER values — consumers are only meaningful as destinations, not sources. Updated Javadoc on both VeniceIngestionSourceComponent and VeniceIngestionDestinationComponent to explain the asymmetric value sets.

Design decisions:

OTel disabled for total stats: OpenTelemetryMetricsSetup returns otelRepository=null for total stats, so MetricEntityState records Tehuti only.
429 handling: recordBasicMetrics() records Tehuti metrics (including responseSize and valueSize) for all requests including 429. OTel metrics requiring HTTP status dimensions are only recorded in successRequest()/errorRequest(), which are not called for 429 responses.
Unified Tehuti+OTel for compute execution latency: readComputeQueryTimeMetric is a single MetricEntityStateBase field mapped to STORAGE_ENGINE_READ_COMPUTE_EXECUTION_TIME that records to both Tehuti and OTel. This avoids the need for separate Tehuti-only and OTel-only fields for the same measurement.

Code changes

Added new code behind a config. Config names and defaults:
- server.read.otel.stats.enabled — default true
- server.ingestion.otel.stats.enabled — default true (existing, now also threaded through setOtelEnabledOverride)
Introduced new log lines.
- Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues.
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

New unit tests added.
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility (if applicable).

Tests added/updated:

ServerReadOtelMetricEntityTest: Validates all 17 OTel metric entity definitions (name, type, unit, dimensions).
ServerStatsContextTest: Comprehensive tests for success/error/429 recording paths, OTel dimension propagation, boundary conditions (default responseSize=-1, valueSize=0), composite recorder delegation, merge operations for compute stats with size profiling.
AggServerHttpRequestStatsTest: Updated for new method signatures, per-store-to-total propagation tests, and status triplet deduplication.
StorageReadRequestHandlerTest: Updated for method renames.
ServerMetricEntityTest: Updated entity count for new ServerReadOtelMetricEntity.
VeniceChunkingStatusTest, VeniceComputeOperationTypeTest, VenicePartialUpdateOperationTest, VeniceMetricsDimensionsTest: Tests for new and renamed dimension enums.
OpenTelemetryMetricsSetupTest: DataProvider-based test covering the full interaction matrix between global OTel config and otelEnabledOverride (6 combinations, parallel-safe).
VeniceServerConfigTest: Tests for readOtelStatsEnabled and ingestionOtelStatsEnabled defaults and explicit overrides.
IngestionOtelStatsTest: Tests for ingestionOtelStatsEnabled override disabling OTel when global config is enabled.

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.
Yes. Clearly explain the behavior change and its impact.

The following OTel dimension names, dimension values, and metric names were renamed. These were introduced very recently and are not yet consumed by any dashboards or alerts, so they are directly modified without backwards-compatibility shims.

Dimension renames:

Before	After
`VENICE_WRITE_COMPUTE_OPERATION_PHASE` (`venice.write_compute.operation_phase`)	`VENICE_PARTIAL_UPDATE_OPERATION_PHASE` (`venice.partial_update.operation_phase`)

Dimension value renames (VeniceIngestionFailureReason):

Before	After
`bootstrap_timeout`	`serving_version_bootstrap_timeout`
`push_timeout`	`future_version_push_timeout`

OTel metric name renames (IngestionOtelMetricEntity):

Before	After
`ingestion.write_compute.time`	`ingestion.partial_update.time`
`ingestion.write_compute.cache.hit_count`	`ingestion.partial_update.cache.hit_count`

… using Router Pattern Refactor ServerHttpRequestStats to colocate OTel MetricEntityState fields alongside Tehuti Sensor fields, following the pattern established by RouterHttpRequestStats. A single MetricEntityState.record() call writes to both OTel and Tehuti systems. Key changes: - ServerHttpRequestStats: Replace raw Sensor fields with MetricEntityState fields for 16 OTel metric entities (READ_CALL_TIME, READ_CALL_COUNT, STORAGE_ENGINE_QUERY_TIME, etc.). Recording methods now accept HTTP status enum dimensions. - ServerReadOtelMetricEntity: New enum defining 16 OTel metric entities with dimensions (store, cluster, request method, HTTP status, chunking status, compute operation type). - VeniceChunkingStatus, VeniceComputeOperationType: New dimension enums for VENICE_CHUNKING_STATUS and VENICE_COMPUTE_OPERATION_TYPE. - VeniceMetricsDimensions: Add VENICE_CHUNKING_STATUS and VENICE_COMPUTE_OPERATION_TYPE. - ServerStatsContext: Compute HTTP status enums in successRequest/errorRequest and pass to unified recording methods. OTel responseSize/valueSize recorded with dimensions; Tehuti responseSize recorded in recordBasicMetrics (runs for all requests incl. 429). - AggServerHttpRequestStats: Update delegating methods for new signatures. - Response stats classes: Remove OTel-specific recording methods; unified recording flows through ServerHttpRequestStats methods that handle both systems. - ServerMetricEntity: Register ServerReadOtelMetricEntity in SERVER_METRIC_ENTITIES. - Comprehensive test coverage for all recording paths, merge operations, and boundary conditions.

Copilot

Pull request overview

This PR adds OpenTelemetry (OTel) instrumentation to server-side read request stats by refactoring ServerHttpRequestStats to follow the established “Router Pattern” where OTel MetricEntityState and Tehuti Sensor recording coexist in one stats class, enabling unified recording APIs and richer dimensional metrics.

Changes:

Refactor ServerHttpRequestStats to use OTel MetricEntityState* fields (alongside retained Tehuti-only sensors) and update server recording paths to pass HTTP status dimensions.
Introduce new OTel metric entity definitions (ServerReadOtelMetricEntity) and new dimension enums (VeniceChunkingStatus, VeniceComputeOperationType) plus VeniceMetricsDimensions updates.
Extend read-response stats recorders with getResponseValueSize() to support aggregate OTel value size recording and update ingestion failure metrics to use more specific VeniceIngestionFailureReason values.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
services/venice-server/src/test/java/com/linkedin/venice/stats/AggServerHttpRequestStatsTest.java	Updates tests for new status-dimensioned recording APIs and compute metric method renames.
services/venice-server/src/test/java/com/linkedin/venice/listener/StorageReadRequestHandlerTest.java	Updates mocks/verifications for renamed compute metric recording methods.
services/venice-server/src/test/java/com/linkedin/venice/grpc/ServerStatsContextTest.java	Expands unit coverage for success/error/429 paths and OTel size recording behavior.
services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java	Core refactor: integrate OTel `MetricEntityState*` with Tehuti sensors and add OTel-only size recording helpers.
services/venice-server/src/main/java/com/linkedin/venice/stats/AggServerHttpRequestStats.java	Wires new constructor signature and updated error-recording method signatures.
services/venice-server/src/main/java/com/linkedin/venice/listener/response/stats/SingleGetResponseStats.java	Adds `getResponseValueSize()` and clarifies Tehuti-only per-key value size recording.
services/venice-server/src/main/java/com/linkedin/venice/listener/response/stats/ReadResponseStatsRecorder.java	Adds default `getResponseValueSize()` API for aggregate value-size reporting.
services/venice-server/src/main/java/com/linkedin/venice/listener/response/stats/MultiGetResponseStatsWithSizeProfiling.java	Tracks/merges aggregate value size for OTel while keeping per-key Tehuti recording unmerged.
services/venice-server/src/main/java/com/linkedin/venice/listener/response/stats/ComputeResponseStats.java	Renames compute metric recorders and exposes aggregate `getResponseValueSize()`.
services/venice-server/src/main/java/com/linkedin/venice/listener/response/ParallelMultiKeyResponseWrapper.java	Composite recorder now exposes merged `getResponseValueSize()` for aggregate OTel recording.
services/venice-server/src/main/java/com/linkedin/venice/listener/ServerStatsContext.java	Computes HTTP status enums/categories and records OTel-only response/value size with status dimensions.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java	Adds assertions for new dimension names across naming formats.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceIngestionSourceComponentTest.java	Updates expected dimension values after source component enum value set change.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceIngestionFailureReasonTest.java	Updates expected mappings for renamed/split ingestion failure reasons.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceComputeOperationTypeTest.java	Adds tests for new compute operation type dimension enum.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceChunkingStatusTest.java	Adds tests for new chunking-status dimension enum.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java	Registers new dimensions for chunking status and compute operation type.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceIngestionSourceComponent.java	Removes consumer-only values from “source” enum and clarifies source/destination semantics.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceIngestionFailureReason.java	Replaces generic timeouts with more specific failure reasons for ingestion metrics.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceIngestionDestinationComponent.java	Updates docs to match paired asymmetric enum design with the source component enum.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceComputeOperationType.java	Adds new dimension enum for read-compute operation type.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceChunkingStatus.java	Adds new dimension enum for chunked vs unchunked read behavior.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/IngestionOtelStatsTest.java	Extends ingestion OTel tests (imports and additional metric recording validations).
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerReadOtelMetricEntityTest.java	Adds tests validating the 16 new server-read OTel metric entity definitions.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java	Updates expected server metric entity registry count after adding server-read entities.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerReadOtelMetricEntity.java	Defines the 16 OTel metric entities for server-side read observability.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java	Registers `ServerReadOtelMetricEntity` into the server metric entity registry.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTask.java	Threads specific `VeniceIngestionFailureReason` into ingestion exception handling and metrics.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/StoreIngestionService.java	Adds an overload to record ingestion failure with version + failure-reason dimension.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java	Records serving-version bootstrap timeout failures with the new failure-reason dimension.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/KafkaStoreIngestionService.java	Implements the new `recordIngestionFailure(store, version, reason)` overload.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/helix/StateModelIngestionProgressNotifier.java	Records bootstrap-to-online timeout failures with version + specific failure-reason dimension.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

services/venice-server/src/test/java/com/linkedin/venice/grpc/ServerStatsContextTest.java

services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java

…s line count LeaderFollowerStoreIngestionTask: - Use null-safe field access for stats recording in checkLongRunningTaskState, since tests partially mock the task via doCallRealMethod() leaving fields null - Use getVersionIngestionStats() getter in produceToLocalKafka to match test mock stubbing pattern ServerHttpRequestStats: - Extract createComputeOpMetric helper for 4 identical compute operation metrics - Remove section-header organizational comments and trim verbose Javadocs - Reduces per-file added lines from 546 to 478 (under 500 CI limit)

Copilot

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…size metrics Eliminate split recording where Tehuti and OTel were recorded at different call sites for responseSize and valueSize. Thread HTTP status enum dimensions through the ReadResponseStatsRecorder hierarchy so both systems are recorded in a single MetricEntityState.record() call. Also make MetricEntityState.recordOtelMetric() package-private to prevent external callers from bypassing the unified record() API, and add a null guard for responseStatus in recordBasicMetrics() to fix a SpotBugs finding. Key changes: - ServerHttpRequestStats: Replace 4 split methods (recordResponseSize + recordResponseSizeOtelOnly, recordValueSizeInByte + recordValueSizeInByteOtelOnly) with 2 unified methods - ReadResponseStatsRecorder: Add HTTP status enum params to recordMetrics() and recordUnmergedMetrics() signatures - ServerStatsContext: Compute HTTP status enums once in recordBasicMetrics() and pass through all recording paths; add responseStatus null guard - MetricEntityState: Change recordOtelMetric visibility from public to package-private - Update all response stats subclasses, tests, and callers

Copilot

Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

...ommon/src/test/java/com/linkedin/venice/stats/dimensions/VeniceComputeOperationTypeTest.java

Copilot

Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java

sushantmane

Left some comments. Please take a look

services/venice-server/src/main/java/com/linkedin/venice/listener/ServerStatsContext.java

...-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceIngestionFailureReason.java

...lient-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java

...ommon/src/test/java/com/linkedin/venice/stats/dimensions/VeniceComputeOperationTypeTest.java

...-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/IngestionOtelStatsTest.java

services/venice-server/src/test/java/com/linkedin/venice/grpc/ServerStatsContextTest.java

independently Add SERVER_READ_OTEL_STATS_ENABLED config that allows disabling read OTel metrics without affecting other server/DaVinci OTel metrics. The per-category override is applied on top of the global OTel config — both must be enabled for OTel to emit. Also thread the existing SERVER_INGESTION_OTEL_STATS_ENABLED through OpenTelemetryMetricsSetup.setOtelEnabledOverride() for consistency and defense-in-depth, ensuring the same global+override AND semantics.

Copilot

Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 6 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-28T01:02:29Z

...e-server/src/main/java/com/linkedin/venice/listener/response/stats/ComputeResponseStats.java

-        this.readComputeSerializationLatency,
-        isAssembledMultiChunkLargeValue,
-        0);
+    consumeDoubleIfAbove(stats::recordReadComputeSerializationLatency, this.readComputeSerializationLatency, 0);


The recordReadComputeSerializationLatency method signature was changed from (double latency, boolean assembledMultiChunkLargeValue) to (double latency). In ComputeResponseStats.recordMetrics, the call site now uses consumeDoubleIfAbove(stats::recordReadComputeSerializationLatency, ...) with the single-arg version, while the old code used consumeDoubleAndBooleanIfAbove. The assembledMultiChunkLargeValue parameter is no longer passed, meaning the serialization time metric is no longer tagged with chunking status in OTel. The STORAGE_ENGINE_QUERY_SERIALIZATION_TIME metric definition also uses VENICE_REQUEST_METHOD only (no chunking status), which is consistent with the new signature. However, the existing Tehuti sensor readComputeLatencyForSmallValueSensor/readComputeLatencyForLargeValueSensor equivalents for serialization are gone; this is a behavioral change worth verifying is intentional (there's no serialization latency breakdown by chunking status in Tehuti either in the new code).

Deserialization has chunking dim whereas serialization doesn't. Its intentional and updated the comments and PR desc.

Copilot · 2026-02-28T01:02:29Z

...ommon/src/main/java/com/linkedin/venice/stats/dimensions/VeniceIngestionSourceComponent.java

-  LEADER_CONSUMER,
-  /** The follower replica consumer that replicates from leader */
-  FOLLOWER_CONSUMER;
+  SOURCE_BROKER;


The VeniceIngestionSourceComponent enum removes LEADER_CONSUMER and FOLLOWER_CONSUMER. These removals are potentially breaking for any existing OTel metric data that was previously emitted with these dimension values, and for any metrics-analysis dashboards or alerts that query these dimension values. Additionally, there is a discrepancy: the updated Javadoc says "LEADER_CONSUMER and FOLLOWER_CONSUMER are only meaningful as destinations", but VeniceIngestionDestinationComponent still contains these values. The test in VeniceIngestionSourceComponentTest.java removes those enum values from the expected mapping, which confirms the intent. However, this is a breaking change for existing OTel pipelines that may have data tagged with leader_consumer or follower_consumer source component values.

Suggested change

SOURCE_BROKER;

SOURCE_BROKER,

/**

* Legacy value kept for backward compatibility with existing OTel metrics that may have

* {@code leader_consumer} as a source component. New code should not use this as a source;

* LEADER_CONSUMER is only meaningful as a destination component.

*/

@Deprecated

LEADER_CONSUMER,

/**

* Legacy value kept for backward compatibility with existing OTel metrics that may have

* {@code follower_consumer} as a source component. New code should not use this as a source;

* FOLLOWER_CONSUMER is only meaningful as a destination component.

*/

@Deprecated

FOLLOWER_CONSUMER;

This is a very recent change, so directly modified it whout Deprecated tags. Updated the PR desc.

Copilot · 2026-02-28T01:02:30Z

...-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceIngestionFailureReason.java

  CHECKSUM_VERIFICATION_FAILURE,
-  /** The ingestion task timed out while waiting for bootstrap (initial data load) to complete */
-  BOOTSTRAP_TIMEOUT,
-  /** The push job timed out before all data could be ingested */
-  PUSH_TIMEOUT,
+  /** A serving (current/backup) version partition's bootstrap exceeded the time limit; the task stays alive */
+  SERVING_VERSION_BOOTSTRAP_TIMEOUT,
+  /** A future version's ingestion task was killed because bootstrap timed out (VeniceTimeoutException) */
+  FUTURE_VERSION_PUSH_TIMEOUT,


The VeniceIngestionFailureReason enum renames BOOTSTRAP_TIMEOUT to SERVING_VERSION_BOOTSTRAP_TIMEOUT and PUSH_TIMEOUT to FUTURE_VERSION_PUSH_TIMEOUT. This is a breaking change for any existing OTel metric data tagged with the old dimension values bootstrap_timeout or push_timeout. Existing dashboards and alerts querying these values will silently stop working. The description in the PR notes these are renamings, but there's no migration strategy documented. If these values have been emitted to production OTel storage, this creates a gap in metric continuity.

This is a very recent change, so directly modified it whout Deprecated tags. Updated the PR desc.

Copilot · 2026-02-28T01:02:30Z

...lient-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java

-  /** {@link VeniceWriteComputeOperation} Write compute operation phase (query, update) */
-  VENICE_WRITE_COMPUTE_OPERATION("venice.write_compute.operation"),
+  /** {@link VeniceWriteComputeOperation} Phase of a write-compute (partial update) operation: query or update */
+  VENICE_WRITE_COMPUTE_OPERATION_PHASE("venice.write_compute.operation_phase"),


The VENICE_WRITE_COMPUTE_OPERATION enum member in VeniceMetricsDimensions is renamed to VENICE_WRITE_COMPUTE_OPERATION_PHASE. This changes the OTel dimension name from venice.write_compute.operation to venice.write_compute.operation_phase. Any existing dashboards, alerts, or metric queries using the old dimension name will break. This is a breaking API change.

Suggested change

VENICE_WRITE_COMPUTE_OPERATION_PHASE("venice.write_compute.operation_phase"),

VENICE_WRITE_COMPUTE_OPERATION_PHASE("venice.write_compute.operation"),

This is a very recent change, so directly modified it whout Deprecated tags. Updated the PR desc.

Copilot · 2026-02-28T01:02:30Z

...nci-client/src/main/java/com/linkedin/davinci/helix/StateModelIngestionProgressNotifier.java

+        storeIngestionService
+            .recordIngestionFailure(storeName, version, VeniceIngestionFailureReason.SERVING_VERSION_BOOTSTRAP_TIMEOUT);


The StateModelIngestionProgressNotifier.waitConsumptionCompleted() and LeaderFollowerStoreIngestionTask.checkLongRunningTaskState() both record SERVING_VERSION_BOOTSTRAP_TIMEOUT for current/backup versions when a bootstrap timeout occurs, potentially causing double-counting of the version-level metric. Previously, the notifier only recorded host-level failure (recordIngestionFailure(storeName)), while checkLongRunningTaskState recorded both levels. Now both record version-level metrics via recordIngestionFailure(storeName, version, SERVING_VERSION_BOOTSTRAP_TIMEOUT). If both timeouts fire for the same partition (they have different thresholds: bootstrapToOnlineTimeoutInHours in the notifier vs getBootstrapTimeoutInMs() in the task), the version-level INGESTION_FAILURE_COUNT metric with SERVING_VERSION_BOOTSTRAP_TIMEOUT reason will be double-counted for current/backup versions.

Suggested change

storeIngestionService

.recordIngestionFailure(storeName, version, VeniceIngestionFailureReason.SERVING_VERSION_BOOTSTRAP_TIMEOUT);

storeIngestionService.recordIngestionFailure(storeName);

Claude says "No double-counting in practice. Both paths use the same timeout value (store.getBootstrapToOnlineTimeoutInHours()). The SIT's checkLongRunningTaskState() polls every ~100ms and detects the timeout first. It calls reportError() which notifies StateModelIngestionProgressNotifier.error(), counting down the latch — so waitConsumptionCompleted()'s await() returns true (latch released) and its timeout branch does not execute. Additionally, the Helix path starts its await() after setupNewStorePartition() completes, while consumptionStartTimeInMs is set during setup, so the SIT timer is always ahead. A theoretical race exists if both fire in the same instant due to thread scheduling, but the window is vanishingly small. This pattern is pre-existing" Will check this further along with #2541 (comment)

Copilot · 2026-02-28T01:02:30Z

services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java

    if (requestType != RequestType.SINGLE_GET) {
-      /**
-       * It is duplicate to have the key count tracking for single-get requests since the key count rate will be same
-       * as the request rate.
-       */
-      requestKeyCountSensor = registerPerStoreAndTotal(
-          "request_key_count",
-          totalStats,
-          () -> totalStats.requestKeyCountSensor,
-          new Rate(),
-          requestKeyCount,
-          new Avg(),
-          new Max());
+      requestKeyCountMetric = MetricEntityStateBase.create(
+          READ_REQUEST_KEY_COUNT.getMetricEntity(),
+          otelRepository,
+          (name, stats) -> registerSensor(
+              name,
+              totalStats != null ? new Sensor[] { totalStats.requestKeyCountMetric.getTehutiSensor() } : null,
+              stats),
+          ServerTehutiMetricName.REQUEST_KEY_COUNT,
+          Arrays.asList(new Rate(), new OccurrenceRate(), new Avg(), new Max()),
+          baseDimensionsMap,
+          baseAttributes);
    } else {
-      requestKeyCountSensor = null;
+      requestKeyCountMetric = MetricEntityStateBase
+          .create(READ_REQUEST_KEY_COUNT.getMetricEntity(), otelRepository, baseDimensionsMap, baseAttributes);
    }


The recordRequestKeyCount method now calls requestKeyCountMetric.record(keyCount) unconditionally, but for RequestType.SINGLE_GET, the requestKeyCountMetric is created without a Tehuti sensor (registerTehutiSensorFn is not provided). While this is safe because recordTehutiMetric has a null-sensor guard, OTel will still record the key count for single-get requests. The old behavior with the null-sensor pattern preserved the semantic that single-get key counts are not tracked (since a single-get always has 1 key), but this OTel recording is now silently active. If this is intentional for OTel, a comment explaining the design decision would help.

Single-get key count (always 1) is intentionally recorded in OTel, unlike Tehuti which skips it, so that KPS (keys per second) can be measured accurately from a single metric across all request types without special-casing single-get. Added a comment.

sushantmane

Round 2 review — the major issues from round 1 (null guard in recordBasicMetrics, silent defaults in successRequest/errorRequest, unused parameters, test duplication) have been addressed well. A few remaining items below.

sushantmane · 2026-02-28T01:19:37Z

services/venice-server/src/main/java/com/linkedin/venice/listener/ServerStatsContext.java

+        HttpResponseStatusCodeCategory.getVeniceHttpResponseStatusCodeCategory(statusCode);
+    VeniceResponseStatusCategory veniceCategory = VeniceResponseStatusCategory.FAIL;
+
    if (stats == null) {


[Minor — code duplication] The two branches of this if/else are identical except for stats vs currentStats. If one branch is updated and the other is forgotten, they'll silently diverge.

Suggestion:

ServerHttpRequestStats target = (stats != null) ? stats : currentStats; target.recordErrorRequest(statusEnum, statusCategory, veniceCategory); target.recordErrorRequestLatency(statusEnum, statusCategory, veniceCategory, elapsedTime); if (isMisroutedStoreVersion) { target.recordMisroutedStoreVersionRequest(); }

sushantmane · 2026-02-28T01:19:37Z

services/venice-server/src/main/java/com/linkedin/venice/listener/ServerStatsContext.java

+          HttpResponseStatusCodeCategory.getVeniceHttpResponseStatusCodeCategory(statusCode);
+      // In Venice, NOT_FOUND (key absent) is a valid/expected response, not an error.
+      VeniceResponseStatusCategory veniceCategory =
+          (responseStatus.equals(HttpResponseStatus.OK) || responseStatus.equals(HttpResponseStatus.NOT_FOUND))


[Test gap] This Venice-specific classification (NOT_FOUND = SUCCESS) is non-obvious and has no dedicated test covering it. If someone modifies this condition (e.g., removes NOT_FOUND), the dimension values for all key-not-found requests would silently change.

Please add a test like testRecordBasicMetricsNotFoundIsSuccess that sets responseStatus = HttpResponseStatus.NOT_FOUND, calls recordBasicMetrics, and verifies the response size is tagged with VeniceResponseStatusCategory.SUCCESS.

Cleaned this up by moving this method to VeniceResponseStatusCategory and reused it. Also, added tests.

sushantmane · 2026-02-28T01:19:37Z

services/venice-server/src/main/java/com/linkedin/venice/listener/ServerStatsContext.java

-        serverHttpRequestStats.recordFlushLatency(flushLatency);
+    // Status-dependent metrics require responseStatus for OTel dimensions
+    if (responseStatus != null) {
+      int statusCode = responseStatus.code();


[Minor — maintenance] This 4-line status-dimension computation (statusCode, statusEnum, statusCategory) is duplicated identically in successRequest (line 237) and errorRequest (line 252). For every request, it runs twice — once here and once in the success/error path.

Consider computing once and caching as fields when setResponseStatus() is called, or extracting a private helper. This removes the duplication and the redundant computation.

Cleaned this up a bit. This could be further cleaned across the code base, but I think this is good enough for now. I will explore a short cleanup PR for this after this. WDYT?

sushantmane · 2026-02-28T01:19:37Z

...nts/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerReadOtelMetricEntity.java

+      setOf(VENICE_STORE_NAME, VENICE_CLUSTER_NAME, VENICE_REQUEST_METHOD, VENICE_CHUNKING_STATUS)
+  ),
+
+  STORAGE_ENGINE_QUERY_SERIALIZATION_TIME(


[Minor — confirm intentional] STORAGE_ENGINE_QUERY_SERIALIZATION_TIME does NOT include VENICE_CHUNKING_STATUS in its dimensions, while STORAGE_ENGINE_QUERY_DESERIALIZATION_TIME (line 109) does.

This asymmetry makes sense (serialization operates on computed results which are never chunked, while deserialization operates on raw stored values which may be chunked), but please confirm this is intentional and not an oversight. A brief comment on the serialization entry would help future readers understand the difference.

Added comments.

sushantmane · 2026-02-28T01:19:37Z

services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java

-          totalStats,
-          () -> totalStats.countOperatorSensor,
-          avgAndTotal());
+      dotProductCountMetric = createComputeOpMetric(


[Minor — design] dotProductCountMetric, cosineCountMetric, hadamardCountMetric, and countOperatorCountMetric are four separate MetricEntityStateOneEnum<VeniceComputeOperationType> instances that all share the same OTel entity (STORAGE_ENGINE_COMPUTE_OPERATION_COUNT). Each record method hardcodes a single enum value (e.g., dotProductCountMetric.record(count, VeniceComputeOperationType.DOT_PRODUCT)).

This works (OTel SDK deduplicates by metric name), but it creates 4 instrument instances for what is semantically one metric with an operation-type dimension. A single field with a recordComputeOperationCount(int count, VeniceComputeOperationType opType) method would be cleaner and avoid the null checks scattered across 4 record methods (lines 629-651). The separate Tehuti sensors could still be maintained alongside.

This creates 1 Otel instrument with dimensions as the instrument creation is inside a computeIfAbsent of instruments based on the MetricEntity.getMetricName(). The duplication just reuses the otel metric, but creates new tehuti metrics, one for each operation.

sushantmane

Follow-up: test duplication in AggServerHttpRequestStatsTest.

sushantmane · 2026-02-28T01:20:46Z

...ces/venice-server/src/test/java/com/linkedin/venice/stats/AggServerHttpRequestStatsTest.java

-    singleGetServerStatsFoo.recordErrorRequest();
-    singleGetServerStatsBar.recordErrorRequest();
+    singleGetServerStatsFoo.recordSuccessRequest(
+        HttpResponseStatusEnum.OK,


[Test duplication — Minor] The status triplet (HttpResponseStatusEnum.OK, HttpResponseStatusCodeCategory.SUCCESS, VeniceResponseStatusCategory.SUCCESS) is repeated 12 times across this file; the error triplet appears 4 times. This duplication was introduced by this PR (the old API didn't require enum params).

ServerStatsContextTest already demonstrates the right pattern — please apply the same here:

private static final HttpResponseStatusEnum OK_STATUS = HttpResponseStatusEnum.OK; private static final HttpResponseStatusCodeCategory OK_CATEGORY = HttpResponseStatusCodeCategory.SUCCESS; private static final VeniceResponseStatusCategory OK_VENICE = VeniceResponseStatusCategory.SUCCESS; private static final HttpResponseStatusEnum ERROR_STATUS = HttpResponseStatusEnum.INTERNAL_SERVER_ERROR; private static final HttpResponseStatusCodeCategory ERROR_CATEGORY = HttpResponseStatusCodeCategory.SERVER_ERROR; private static final VeniceResponseStatusCategory ERROR_VENICE = VeniceResponseStatusCategory.FAIL;

This reduces each call site from 3 lines to 1 and makes the test much more readable.

cleaned these up, thanks.

sushantmane

Thanks. LGTM but few minor comments

sushantmane · 2026-02-28T01:46:08Z

...ient/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java

        for (int partition: timeoutPartitions) {
          reportError(errorMsg, partition, ex);
        }
+        if (isMetricsEmissionEnabled()) {


can we get rid of isMetricsEmissionEnabled flag?

This is a good catch. So the host level ingestion metrics is emitted only for the largest version number and that breaks the otel versioned data, it only probably records the current/future based on the largest available version number. I will handle this in a follow up PR.

sushantmane · 2026-02-28T02:01:24Z

...t-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceWriteComputeOperation.java

Should we rename this to Partial Update to make it consistent with code?

yeah makes sense. Modified the metric names and dimensions accordingly.

sushantmane · 2026-02-28T02:07:24Z

services/venice-server/src/main/java/com/linkedin/venice/listener/ServerStatsContext.java

   */
  private boolean statCallbackExecuted = false;

-  /**


This comment is useful. Why delete it?

I don't see the described firstPartLatency/secondPartLatency anywhere, hence removed. Am I missing something here?

Copilot

Pull request overview

Copilot reviewed 56 out of 56 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-01T03:26:21Z

services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java

+    // Shares STORAGE_ENGINE_QUERY_TIME OTel entity with storageEngineQueryTimeMetric.
+    // Differentiated in OTel by VENICE_REQUEST_METHOD base dimension (COMPUTE vs MULTI_GET/SINGLE_GET).
+    // Each has its own Tehuti sensor.
+    readComputeQueryTimeMetric = MetricEntityStateOneEnum.create(
+        STORAGE_ENGINE_QUERY_TIME.getMetricEntity(),
+        otelRepository,
+        registerPerStoreAndTotal(totalStats != null ? totalStats.readComputeQueryTimeMetric : null),
+        ServerTehutiMetricName.STORAGE_ENGINE_READ_COMPUTE_LATENCY,
+        Arrays.asList(
+            TehutiUtils
+                .getPercentileStatWithAvgAndMax(getName(), getFullMetricName("storage_engine_read_compute_latency"))),
+        baseDimensionsMap,
+        VeniceChunkingStatus.class);


Both storageEngineQueryTimeMetric and readComputeQueryTimeMetric share the same OTel metric entity STORAGE_ENGINE_QUERY_TIME and are expected to be differentiated by the VENICE_REQUEST_METHOD base dimension. However, in a ServerHttpRequestStats instance for RequestType.COMPUTE, both metrics would be created with baseDimensionsMap containing VENICE_REQUEST_METHOD=compute. When recordReadComputeLatency and recordDatabaseLookupLatency are both called on a compute request, both will write to the same OTel time series (same metric name + same store+cluster+request_method+chunking_status dimensions), causing double-counting in OTel. Tehuti is fine because the two have distinct sensor names (storage_engine_read_compute_latency vs storage_engine_query_latency). The comment at line 204-206 acknowledges the shared entity but doesn't address the double-counting risk. Consider whether recordDatabaseLookupLatency is actually called for compute requests (if not, this is benign), or whether STORAGE_ENGINE_QUERY_TIME should use different dimension values (e.g., a sub-type dimension) to distinguish the two.

Good catch. Added storage_engine.read.compute.execution_time to differentiate between these.

Copilot

Pull request overview

Copilot reviewed 56 out of 56 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

...nts/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerReadOtelMetricEntity.java

services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java

services/venice-server/src/main/java/com/linkedin/venice/listener/ServerStatsContext.java

Copilot AI review requested due to automatic review settings February 27, 2026 10:24

Copilot started reviewing on behalf of m-nagarajan February 27, 2026 10:24 View session

m-nagarajan changed the title ~~[server][da-vinci] Integrate OTel metrics into ServerHttpRequestStats using Router Pattern~~ Feb 27, 2026

Copilot AI reviewed Feb 27, 2026

View reviewed changes

services/venice-server/src/test/java/com/linkedin/venice/grpc/ServerStatsContextTest.java Outdated Show resolved Hide resolved

services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java Outdated Show resolved Hide resolved

sixpluszero requested a review from Copilot February 27, 2026 17:22

Copilot started reviewing on behalf of sixpluszero February 27, 2026 17:22 View session

Copilot AI reviewed Feb 27, 2026

View reviewed changes

m-nagarajan added 2 commits February 27, 2026 09:57

cleanup null check

5298aa4

Copilot AI review requested due to automatic review settings February 27, 2026 19:10

Copilot started reviewing on behalf of m-nagarajan February 27, 2026 19:11 View session

Copilot AI reviewed Feb 27, 2026

View reviewed changes

...ommon/src/test/java/com/linkedin/venice/stats/dimensions/VeniceComputeOperationTypeTest.java Outdated Show resolved Hide resolved

address comments from copilot

9455c94

m-nagarajan requested a review from Copilot February 27, 2026 19:26

Copilot started reviewing on behalf of m-nagarajan February 27, 2026 19:27 View session

Copilot AI reviewed Feb 27, 2026

View reviewed changes

services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java Outdated Show resolved Hide resolved

sushantmane reviewed Feb 27, 2026

View reviewed changes

m-nagarajan added 2 commits February 27, 2026 16:07

fix review comments

c771b02

Copilot AI review requested due to automatic review settings February 28, 2026 00:46

Copilot started reviewing on behalf of m-nagarajan February 28, 2026 00:47 View session

Copilot AI reviewed Feb 28, 2026

View reviewed changes

sushantmane reviewed Feb 28, 2026

View reviewed changes

sushantmane added the requested-changes Reviewer requested changes label Feb 28, 2026

m-nagarajan added 2 commits February 28, 2026 18:16

address review comments

c4ff6bb

fix test failures

29252f4

Copilot AI review requested due to automatic review settings March 1, 2026 03:22

Copilot started reviewing on behalf of m-nagarajan March 1, 2026 03:22 View session

Copilot AI reviewed Mar 1, 2026

View reviewed changes

m-nagarajan added 2 commits February 28, 2026 21:38

address copilt comments

bffed04

fix test failures

0ac156f

Copilot AI review requested due to automatic review settings March 1, 2026 05:56

Copilot started reviewing on behalf of m-nagarajan March 1, 2026 05:56 View session

Copilot AI reviewed Mar 1, 2026

View reviewed changes

address comments

92e1b85

m-nagarajan added review-addressed Author has addressed review comments and removed requested-changes Reviewer requested changes labels Mar 1, 2026

m-nagarajan requested a review from sushantmane March 2, 2026 01:22

-  SOURCE_BROKER;
+  SOURCE_BROKER,
+  /**
+   * Legacy value kept for backward compatibility with existing OTel metrics that may have
+   * {@code leader_consumer} as a source component. New code should not use this as a source;
+   * LEADER_CONSUMER is only meaningful as a destination component.
+   */
+  @Deprecated
+  LEADER_CONSUMER,
+  /**
+   * Legacy value kept for backward compatibility with existing OTel metrics that may have
+   * {@code follower_consumer} as a source component. New code should not use this as a source;
+   * FOLLOWER_CONSUMER is only meaningful as a destination component.
+   */
+  @Deprecated
+  FOLLOWER_CONSUMER;

	VENICE_WRITE_COMPUTE_OPERATION_PHASE("venice.write_compute.operation_phase"),
	VENICE_WRITE_COMPUTE_OPERATION_PHASE("venice.write_compute.operation"),

		storeIngestionService
		.recordIngestionFailure(storeName, version, VeniceIngestionFailureReason.SERVING_VERSION_BOOTSTRAP_TIMEOUT);

	storeIngestionService
	.recordIngestionFailure(storeName, version, VeniceIngestionFailureReason.SERVING_VERSION_BOOTSTRAP_TIMEOUT);
	storeIngestionService.recordIngestionFailure(storeName);

Conversation

m-nagarajan commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem Statement

Solution

Key changes:

Per-category OTel config override:

Compute-only metric naming and dimension cleanup:

Dimension enum renames and cleanup:

Design decisions:

Code changes

Concurrency-Specific Checks

How was this PR tested?

Does this PR introduce any user-facing or breaking changes?

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

sushantmane left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sushantmane left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

m-nagarajan Mar 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sushantmane left a comment

Choose a reason for hiding this comment

m-nagarajan commented Feb 27, 2026 •

edited

Loading

sushantmane left a comment •

edited

Loading

m-nagarajan Mar 1, 2026 •

edited

Loading