feat(tools): Querytee Goldfish #17959

Open
wants to merge 9 commits into
base: main

Conversation

JordanRushing
Contributor

@JordanRushing JordanRushing commented Jun 3, 2025

What this PR does / why we need it:

This PR introduces Goldfish, an experimental feature for QueryTee that enables
privacy-compliant query sampling and comparison between multiple Loki cells.

Motivation

When running multiple Loki cells (e.g., during migrations or A/B testing), operators
need visibility into response differences and performance variations between cells.
Goldfish addresses this by sampling queries and comparing their responses without
storing sensitive log content.

Key Features

  • Tenant-based sampling with configurable rates (e.g., 10% default, 50% for specific
    tenants)
  • Privacy-compliant comparison using FNV32 content hashing - no raw log data is stored
  • Performance metrics extraction including execution time, queue time, bytes/lines
    processed
  • Pluggable storage backends starting with CloudSQL support
  • Zero impact when disabled - all functionality is behind feature flags
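The tenant-based sampling described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the `Sampler` type, its fields, and its methods are hypothetical names chosen for illustration. It assumes each tenant either has an explicit rate or falls back to the default.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Sampler decides whether a query from a given tenant should be sampled.
// The type and field names are hypothetical, not taken from the PR.
type Sampler struct {
	DefaultRate float64            // used when a tenant has no explicit rule
	TenantRates map[string]float64 // per-tenant overrides, each in [0, 1]
}

// RateFor returns the sampling rate that applies to tenant.
func (s *Sampler) RateFor(tenant string) float64 {
	if rate, ok := s.TenantRates[tenant]; ok {
		return rate
	}
	return s.DefaultRate
}

// ShouldSample draws a uniform random number in [0, 1) and samples the
// query when the draw falls below the tenant's rate. A rate of 1.0 always
// samples and a rate of 0 never does.
func (s *Sampler) ShouldSample(tenant string) bool {
	return rand.Float64() < s.RateFor(tenant)
}

func main() {
	s := &Sampler{
		DefaultRate: 0.1,
		TenantRates: map[string]float64{"important-tenant": 1.0},
	}
	fmt.Println("important-tenant rate:", s.RateFor("important-tenant"))
	fmt.Println("other-tenant rate:", s.RateFor("other-tenant"))
	fmt.Println("sample important-tenant:", s.ShouldSample("important-tenant"))
}
```

A real sampler that allows its configuration to be reloaded at runtime would also need to guard these fields with a mutex.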

Implementation

Goldfish hooks into QueryTee's existing proxy flow:

  1. Samples queries based on tenant configuration
  2. Captures responses from both backend cells
  3. Extracts performance statistics and computes content hashes
  4. Stores results for offline analysis
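Steps 2 to 4 can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: the description says only "FNV32", so the use of the FNV-1a variant (`fnv.New32a`) is an assumption, and `hashContent`, `compare`, and `Comparison` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashContent returns a 32-bit FNV-1a hash of a response body.
// Assumption: the PR only says "FNV32"; the 1a variant is chosen here.
func hashContent(body []byte) uint32 {
	h := fnv.New32a()
	h.Write(body) // Write on a hash.Hash never returns an error
	return h.Sum32()
}

// Comparison keeps only hashes and a match flag, never the raw bodies.
// The type is hypothetical and used for illustration only.
type Comparison struct {
	HashA, HashB uint32
	Match        bool
}

// compare hashes both cell responses and reports whether they agree.
func compare(bodyA, bodyB []byte) Comparison {
	a, b := hashContent(bodyA), hashContent(bodyB)
	return Comparison{HashA: a, HashB: b, Match: a == b}
}

func main() {
	cellA := []byte(`{"status":"success","data":{"result":[]}}`)
	cellB := []byte(`{"status":"success","data":{"result":[]}}`)
	c := compare(cellA, cellB)
	fmt.Printf("hashA=%08x hashB=%08x match=%t\n", c.HashA, c.HashB, c.Match)
}
```

A 32-bit hash is small enough that unrelated bodies can collide, so a match is strong evidence of equal content rather than proof.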

Configuration Example

  querytee \
    -goldfish.enabled=true \
    -goldfish.sampling.default-rate=0.1 \
    -goldfish.sampling.tenant-rules="important-tenant:1.0" \
    -goldfish.storage.type=cloudsql \
    -goldfish.storage.cloudsql.connection-name=project:region:instance
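For illustration, a value such as `important-tenant:1.0` could be parsed into per-tenant rates along these lines. This is a sketch only: the PR does not show the parser, `parseTenantRules` is a hypothetical name, and the comma-separated form for multiple rules is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTenantRules parses "tenant:rate" pairs into a map.
// Assumption: multiple rules are separated by commas; the PR shows only
// a single "tenant:rate" pair.
func parseTenantRules(s string) (map[string]float64, error) {
	rules := make(map[string]float64)
	if strings.TrimSpace(s) == "" {
		return rules, nil
	}
	for _, part := range strings.Split(s, ",") {
		tenant, rateStr, ok := strings.Cut(strings.TrimSpace(part), ":")
		if !ok || tenant == "" {
			return nil, fmt.Errorf("invalid rule %q: want tenant:rate", part)
		}
		rate, err := strconv.ParseFloat(rateStr, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid rate in rule %q: %w", part, err)
		}
		// Written as a negated range check so that NaN is rejected too.
		if !(rate >= 0 && rate <= 1) {
			return nil, fmt.Errorf("rate in rule %q must be between 0 and 1", part)
		}
		rules[tenant] = rate
	}
	return rules, nil
}

func main() {
	rules, err := parseTenantRules("important-tenant:1.0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rules["important-tenant"])
}
```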

Testing

  • Comprehensive unit tests
  • End-to-end test
  • Backward compatibility

This is marked as an experimental feature with the understanding that APIs may evolve
based on operational feedback.

go mod tidy and go mod vendor were run to pull in and vendor the new database dependencies.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

N/A

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR
Implements a new query sampling and analysis system called Goldfish that
integrates with the existing QueryTee proxy to capture performance metrics
and compare responses between backend cells.

Key features:
- Configurable tenant-based query sampling using random, rate-based selection
- Privacy-compliant performance statistics extraction (no raw log content)
- Response integrity verification using FNV32 content hashing
- Asynchronous processing to avoid blocking client responses
- CloudSQL storage backend for captured metrics and comparisons
- Comprehensive Prometheus metrics for monitoring sampling decisions
- Backward compatible - existing QueryTee functionality unchanged

The implementation extracts query performance statistics including execution
time, queue time, bytes/lines processed, and processing rates from Loki
responses. All sensitive data is excluded from storage, with only
performance metrics and content hashes retained for analysis.

Goldfish operates as an optional feature behind configuration flags and
maintains complete isolation from existing QueryTee comparison logic.
Add experimental warning to make it clear that the API and configuration
may change in future releases.
- Replace strings.SplitSeq with standard strings.Split for better readability
- Fix response body restoration to properly recreate readable body with bytes.NewReader
- Add missing bytes import to manager.go
- Update CloudSQL configuration to work with CloudSQL proxy sidecar pattern
- Add host, port, user, and password fields for proxy connection
- Replace pgx driver with lib/pq for PostgreSQL connectivity
- Default CloudSQL proxy host to 'cloudsql-proxy' on port 5432
- Remove CloudSQL connection name in favor of standard database connection parameters
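With the connection name replaced by host, port, user, and password, connecting through a proxy sidecar comes down to building an ordinary PostgreSQL connection string. The sketch below assembles a URL-style DSN of the kind lib/pq accepts. The `ProxyConfig` type, the `Database` field, and `sslmode=disable` are assumptions made for illustration; the last one reflects a setup where the proxy, not the client, handles TLS.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
)

// ProxyConfig holds the parameters for reaching the database through a
// proxy sidecar. The type and the Database field are hypothetical.
type ProxyConfig struct {
	Host     string
	Port     int
	User     string
	Password string
	Database string
}

// DSN builds a URL-style PostgreSQL connection string. Using url.URL
// ensures that special characters in the credentials are escaped.
func (c ProxyConfig) DSN() string {
	u := url.URL{
		Scheme: "postgres",
		User:   url.UserPassword(c.User, c.Password),
		Host:   net.JoinHostPort(c.Host, strconv.Itoa(c.Port)),
		Path:   "/" + c.Database,
		// Assumption: the sidecar terminates TLS, so the local hop is plain.
		RawQuery: url.Values{"sslmode": {"disable"}}.Encode(),
	}
	return u.String()
}

func main() {
	cfg := ProxyConfig{
		Host:     "cloudsql-proxy", // default host named in the commit message
		Port:     5432,             // default port named in the commit message
		User:     "goldfish",
		Password: "s3cret",
		Database: "goldfish",
	}
	// The resulting string would be passed to sql.Open("postgres", dsn)
	// once the lib/pq driver has been imported.
	fmt.Println(cfg.DSN())
}
```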

This implementation supports the typical CloudSQL deployment pattern where
a proxy sidecar handles authentication and SSL/TLS connections.
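The response body restoration fix mentioned in the commit notes above addresses a common Go pitfall: an `http.Response` body can be read only once, so code that inspects it must put a fresh reader back for later consumers. A minimal sketch of that pattern follows. `captureBody` and `newTestResponse` are hypothetical helpers, not functions from the PR.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// captureBody reads resp.Body to the end, closes it, and replaces it with
// a new reader over the same bytes so the response can be read again.
func captureBody(resp *http.Response) ([]byte, error) {
	body, err := io.ReadAll(resp.Body)
	closeErr := resp.Body.Close()
	if err != nil {
		return nil, err
	}
	if closeErr != nil {
		return nil, closeErr
	}
	resp.Body = io.NopCloser(bytes.NewReader(body))
	return body, nil
}

// newTestResponse builds an in-memory response for demonstration.
func newTestResponse(body string) *http.Response {
	return &http.Response{
		StatusCode: http.StatusOK,
		Body:       io.NopCloser(strings.NewReader(body)),
	}
}

func main() {
	resp := newTestResponse(`{"status":"success"}`)

	captured, err := captureBody(resp)
	if err != nil {
		fmt.Println("capture failed:", err)
		return
	}

	// A downstream reader still sees the full body.
	again, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("second read failed:", err)
		return
	}
	fmt.Println("bodies equal:", bytes.Equal(captured, again))
}
```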
@JordanRushing JordanRushing marked this pull request as ready for review June 4, 2025 15:24
@JordanRushing JordanRushing requested a review from a team as a code owner June 4, 2025 15:24
Signed-off-by: Jordan Rushing <rushing.jordan@gmail.com>
- Add POST method support for query endpoints alongside GET
- Fix critical goroutine variable capture bug causing race conditions
- Fix race condition in sampler by protecting config access with mutex
- Add context timeout for async Goldfish processing to prevent hangs
- Enhance logging to show query comparison stats without sensitive data
- Log execution times, bytes processed, and entries returned for both cells
- Add performance ratio logging and significant difference detection
- Improve backend selection logic to properly identify Cell A/B
- Add debug logging for sampling decisions and Goldfish attachment
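Two of the fixes listed above, the goroutine variable capture bug and the context timeout for async processing, follow well-known Go patterns. The sketch below shows both in a generic form. It does not reproduce the PR's code, and `processAll`, `echo`, `stall`, and the timeout constants are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

const (
	defaultTimeout = 2 * time.Second
	shortTimeout   = 20 * time.Millisecond
)

// processAll runs process for every item in its own goroutine and bounds
// all of them with a single timeout.
func processAll(items []string, timeout time.Duration,
	process func(context.Context, string) string) []string {

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	results := make([]string, len(items))
	var wg sync.WaitGroup
	for i, item := range items {
		wg.Add(1)
		// Pass i and item as arguments so each goroutine works on its own
		// copy. Before Go 1.22, a closure that captured the loop variables
		// directly shared them across iterations.
		go func(i int, item string) {
			defer wg.Done()
			results[i] = process(ctx, item) // each goroutine writes one slot
		}(i, item)
	}
	wg.Wait()
	return results
}

// echo finishes immediately.
func echo(_ context.Context, item string) string { return item + ":done" }

// stall blocks until the context expires, simulating a hung backend.
func stall(ctx context.Context, item string) string {
	<-ctx.Done()
	return item + ":timeout"
}

func main() {
	fmt.Println(processAll([]string{"a", "b", "c"}, defaultTimeout, echo))
	fmt.Println(processAll([]string{"x"}, shortTimeout, stall))
}
```

The timeout helps only if the work itself observes the context, as `stall` does here by waiting on `ctx.Done()`.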
@trevorwhitney trevorwhitney mentioned this pull request Jun 4, 2025
6 tasks