
feat: structured telemetry module for observability #2107

Open
Ayush10 wants to merge 1 commit into microsoft:main from Ayush10:feat/issue-2098-observability


Ayush10 commented Jan 30, 2026

Summary

Closes #2098

This PR introduces a pluggable telemetry framework for Qlib that provides structured metrics collection and workflow tracing. It ships as a foundational module with proof-of-concept instrumentation, designed to be extended incrementally across the codebase.

Architecture

┌─────────────────────────────────────────────────┐
│  Application Code (data pipeline, models, etc.) │
│                                                 │
│  metrics.counter("cache.hits")                  │
│  with tracer.span("data_loading"):              │
│      ...                                        │
└──────────────┬──────────────────────────────────┘
               │
       ┌───────▼───────┐
       │  QlibMetrics   │  counter / gauge / histogram
       │  QlibTracer    │  span context manager + @traced
       └───────┬───────┘
               │  fan-out to all registered backends
       ┌───────┼───────────────┐
       ▼       ▼               ▼
  Logging   InMemory      Custom Backend
  Backend   Backend       (Prometheus, OTel, etc.)
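
The fan-out step in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchMetrics`, `ListBackend`, `BrokenBackend`, the `record` method name, the `backend_errors` counter, and the `MetricEvent` fields are all assumptions made for the example.

```python
import threading
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MetricEvent:
    # Hypothetical fields; the PR's dataclass may carry different ones.
    name: str
    value: float
    kind: str  # "counter", "gauge", or "histogram"
    tags: Dict[str, str] = field(default_factory=dict)


class MetricsBackend(ABC):
    """Assumed backend interface: one method that receives each event."""

    @abstractmethod
    def record(self, event: MetricEvent) -> None:
        ...


class ListBackend(MetricsBackend):
    """Keeps events in a lock-protected list (stand-in for an in-memory backend)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.events: List[MetricEvent] = []

    def record(self, event: MetricEvent) -> None:
        with self._lock:
            self.events.append(event)


class BrokenBackend(MetricsBackend):
    """Always fails, to show that one bad backend does not affect the others."""

    def record(self, event: MetricEvent) -> None:
        raise RuntimeError("export failed")


class SketchMetrics:
    def __init__(self) -> None:
        self._backends: List[MetricsBackend] = []
        self.backend_errors = 0  # hypothetical counter of swallowed failures

    def register(self, backend: MetricsBackend) -> None:
        self._backends.append(backend)

    def _emit(self, name: str, value: float, kind: str,
              tags: Optional[Dict[str, str]]) -> None:
        if not self._backends:
            return  # nothing registered: skip building the event entirely
        event = MetricEvent(name, value, kind, dict(tags or {}))
        for backend in list(self._backends):
            try:
                backend.record(event)
            except Exception:
                self.backend_errors += 1  # isolate the failure, keep fanning out

    def counter(self, name: str, value: float = 1.0,
                tags: Optional[Dict[str, str]] = None) -> None:
        self._emit(name, value, "counter", tags)

    def gauge(self, name: str, value: float,
              tags: Optional[Dict[str, str]] = None) -> None:
        self._emit(name, value, "gauge", tags)
```

Registering `BrokenBackend()` ahead of `ListBackend()` and then calling `counter(...)` still delivers the event to the list backend, which is the behavior the "fan-out to all registered backends" arrow implies.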

Core Components (qlib/utils/telemetry.py)

| Component | Purpose |
| --- | --- |
| MetricEvent / SpanEvent | Typed dataclasses for metric measurements and trace spans |
| MetricsBackend (ABC) | Interface for pluggable export backends |
| QlibMetrics | Singleton metrics collector (counter, gauge, histogram) |
| QlibTracer | Context-manager-based tracer with parent-child span tracking |
| LoggingBackend | Integrates with existing get_module_logger infrastructure |
| InMemoryBackend | For testing, assertions, and programmatic access with summary() |
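
To make the InMemoryBackend row more concrete, here is a small sketch of what filtered queries, clear, and summary() could look like. The class name `SketchInMemoryBackend`, the `get` method, and the shape of the summary dictionary (count, sum, min, max per metric name) are assumptions for illustration; the PR's actual return format is not shown in this description.

```python
import threading
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MetricEvent:
    # Hypothetical fields, chosen only for this example.
    name: str
    value: float
    kind: str = "counter"
    tags: Dict[str, str] = field(default_factory=dict)


class SketchInMemoryBackend:
    """Stores events in memory so tests can query and assert on them."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._events: List[MetricEvent] = []

    def record(self, event: MetricEvent) -> None:
        with self._lock:
            self._events.append(event)

    def get(self, name: Optional[str] = None) -> List[MetricEvent]:
        """Return a copy of stored events, optionally filtered by metric name."""
        with self._lock:
            return [e for e in self._events if name is None or e.name == name]

    def clear(self) -> None:
        with self._lock:
            self._events.clear()

    def summary(self) -> Dict[str, Dict[str, float]]:
        """Aggregate values per metric name (assumed summary shape)."""
        grouped: Dict[str, List[float]] = defaultdict(list)
        for event in self.get():
            grouped[event.name].append(event.value)
        return {
            name: {
                "count": len(values),
                "sum": sum(values),
                "min": min(values),
                "max": max(values),
            }
            for name, values in grouped.items()
        }
```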

Design Principles

  • No-op by default: when no backend is registered, all operations are no-ops, so the only cost is the call itself
  • Non-invasive: uses decorators and context managers; no changes to function signatures
  • Backward compatible: works alongside existing TimeInspector and get_module_logger
  • Error-isolated: backend failures never crash the application
  • Thread-safe: lock-protected backends plus thread-local span stacking
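
The non-invasive and thread-safe principles can be illustrated with a tracer built on `contextlib.contextmanager` and `threading.local`. This is a sketch under assumptions: `SketchTracer`, the `SpanEvent` fields, and the `finished` list are invented for the example, and unlike the PR's tracer it always records rather than checking for registered backends.

```python
import functools
import threading
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanEvent:
    # Hypothetical fields; the PR's SpanEvent may differ.
    name: str
    span_id: str
    parent_id: Optional[str]
    duration_s: float = 0.0
    error: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)


class SketchTracer:
    def __init__(self) -> None:
        self._local = threading.local()  # one span stack per thread
        self._lock = threading.Lock()    # guards the shared result list
        self.finished: List[SpanEvent] = []

    def _stack(self) -> List[SpanEvent]:
        if not hasattr(self._local, "stack"):
            self._local.stack = []
        return self._local.stack

    @contextmanager
    def span(self, name: str, tags: Optional[Dict[str, str]] = None):
        stack = self._stack()
        # The innermost open span on this thread becomes the parent.
        parent_id = stack[-1].span_id if stack else None
        event = SpanEvent(name, uuid.uuid4().hex, parent_id, tags=dict(tags or {}))
        stack.append(event)
        start = time.perf_counter()
        try:
            yield event
        except Exception as exc:
            event.error = repr(exc)  # record the failure, then re-raise unchanged
            raise
        finally:
            event.duration_s = time.perf_counter() - start
            stack.pop()
            with self._lock:
                self.finished.append(event)

    def traced(self, name: str):
        """Decorator form: wraps a function in a span without changing its signature."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                with self.span(name):
                    return func(*args, **kwargs)
            return wrapper
        return decorator
```

Because the stack lives in thread-local storage, spans opened on different threads never become each other's parents, while the lock only protects the shared list of finished spans.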

Proof-of-Concept Instrumentation

Three high-value instrumentation points demonstrate the pattern:

  1. DataHandlerLP.setup_data() — Span tracing + row/column gauge metrics
  2. DataHandlerLP._run_proc_l() — Per-processor span tracing with rows in/out
  3. MemCacheUnit.__getitem__() — Cache hit counter
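
As an illustration of the third instrumentation point, the sketch below counts hits inside a cache's `__getitem__`. It does not reproduce Qlib's MemCacheUnit: `TinyLRUCache`, the module-level `hit_counts`, and the `count` helper (a stand-in for a call such as `metrics.counter("cache.hits")`) are hypothetical.

```python
from collections import Counter, OrderedDict

# Stand-in for a metrics collector; a real integration would call the
# telemetry module instead of updating this Counter.
hit_counts: Counter = Counter()


def count(name: str, value: int = 1) -> None:
    hit_counts[name] += value


class TinyLRUCache:
    """Minimal LRU cache used only to show where the counter call goes."""

    def __init__(self, size_limit: int = 2) -> None:
        self.od: OrderedDict = OrderedDict()
        self.size_limit = size_limit

    def __setitem__(self, key, value) -> None:
        self.od[key] = value
        self.od.move_to_end(key)
        while len(self.od) > self.size_limit:
            self.od.popitem(last=False)  # evict the least recently used entry

    def __getitem__(self, key):
        value = self.od[key]      # raises KeyError on a miss, before counting
        self.od.move_to_end(key)  # mark as most recently used
        count("cache.hits")       # only reached on a successful lookup
        return value
```

Placing the counter call after the lookup means misses are never counted as hits; a separate miss counter would be needed to derive a hit ratio.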

Usage

from qlib.utils.telemetry import metrics, tracer, enable_inmemory_backend

# Enable a backend (opt-in)
backend = enable_inmemory_backend()

# Record metrics
metrics.counter("cache.hits", 1, tags={"cache": "expression"})
metrics.gauge("memory.rss_mb", 1024.5)

# Trace a workflow
with tracer.span("data_loading", tags={"freq": "day"}):
    data = load_data()

# Use as a decorator
@tracer.traced("model_training")
def train_model():
    ...

# Inspect collected data
print(backend.summary())

Suggested Follow-up Work

This PR is intentionally scoped as a foundation. Subsequent PRs could:

  • Instrument model training (fit()/predict()) and backtesting workflows
  • Add a FileBackend for JSON/CSV metric export
  • Add an OpenTelemetryBackend for production observability
  • Instrument ExpressionCache and DatasetCache for cache hit ratios
  • Add CLI flag or config option to auto-enable logging backend
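
For the FileBackend idea, one possible shape is a backend that appends one JSON object per line. Everything here is hypothetical, including the class name `JsonLinesBackend`, the `record` signature, the record keys, and the `read_events` helper; it is meant only to show how small such a follow-up could be.

```python
import json
import threading
import time
from pathlib import Path
from typing import Dict, List, Optional


class JsonLinesBackend:
    """Appends each metric as one JSON line (hypothetical follow-up backend)."""

    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._lock = threading.Lock()  # serialize writers within this process

    def record(self, name: str, value: float, kind: str,
               tags: Optional[Dict[str, str]] = None) -> None:
        line = json.dumps(
            {"ts": time.time(), "name": name, "value": value,
             "kind": kind, "tags": tags or {}},
            sort_keys=True,
        )
        with self._lock, self._path.open("a", encoding="utf-8") as fh:
            fh.write(line + "\n")


def read_events(path: str) -> List[dict]:
    """Load every recorded event back from the file, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Opening the file on every call keeps the sketch simple; a production backend would more likely buffer writes or hold the handle open.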

Test Plan

  • 29 new unit tests in tests/test_telemetry.py covering:
    • MetricEvent and SpanEvent defaults
    • QlibMetrics: no-op without backend, counter/gauge/histogram, multiple backends, error isolation
    • QlibTracer: span duration, tags, error recording, nested parent-child spans, histogram emission, @traced decorator, thread safety (10 concurrent threads)
    • InMemoryBackend: filtered queries, clear, summary statistics
    • LoggingBackend: non-raising behavior
    • Module-level singletons and convenience functions
  • All tests pass: python -m pytest tests/test_telemetry.py -v
  • Existing tests unaffected (instrumentation is no-op without backends)
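
A thread-safety check like the one listed above might look like the following pytest-style sketch. It uses a stand-in backend rather than the PR's classes, so `StandInBackend` and its `record` signature are assumptions; only the 10-thread figure comes from the test plan.

```python
import threading
from typing import List, Tuple


class StandInBackend:
    """Lock-protected event list, standing in for an in-memory backend."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.events: List[Tuple[str, int]] = []

    def record(self, name: str, value: int) -> None:
        with self._lock:
            self.events.append((name, value))


def test_concurrent_recording() -> None:
    n_threads = 10    # matches the thread count mentioned in the test plan
    per_thread = 100  # arbitrary number of events per thread
    backend = StandInBackend()

    def worker(idx: int) -> None:
        for _ in range(per_thread):
            backend.record("worker.events", idx)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # No event may be lost, and every thread must be represented.
    assert len(backend.events) == n_threads * per_thread
    assert {value for _, value in backend.events} == set(range(n_threads))
```
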
Introduce a pluggable telemetry framework (qlib/utils/telemetry.py) that
provides metrics collection and workflow tracing with zero overhead when
no backend is registered.

Core components:
- QlibMetrics: counter/gauge/histogram with pluggable backends
- QlibTracer: context-manager spans with parent-child tracking
- LoggingBackend: integrates with existing get_module_logger
- InMemoryBackend: for testing and programmatic access

Proof-of-concept instrumentation:
- DataHandlerLP.setup_data: span tracing + row/column gauges
- DataHandlerLP._run_proc_l: per-processor span tracing
- MemCacheUnit: cache hit counter

Includes 29 unit tests covering metrics, tracing, thread safety,
nested spans, error recording, and backend isolation.
Ayush10 force-pushed the feat/issue-2098-observability branch from 9fcbbd6 to b0f0e1e on January 30, 2026 13:22