Bendset Dataset

This repository documents Bendset, an anonymized Databend query workload released to make it easier to study large-scale cloud data warehouse behavior. The dataset captures 39 million production queries executed between 2025‑12‑09 and 2025‑12‑11 and ships with scripts to reproduce the curated CSV files or generate new slices directly from your own Databend history tables.

Overview

  • Public snapshot: https://sharing.databend.cloud/bendset/bendset-2025-12-11.tar.gz (39 M queries, CSV chunks).
  • Source tables: system_history.query_history and system_history.profile_history, joined per query to keep execution profiles next to user-visible statistics (a conceptual sketch of this join follows this list).
  • Granularity: one row per query with timing, I/O, resource usage, queueing, result sizes, anonymized database/user identifiers, and profile-derived operator metrics.
  • License: see LICENSE.
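
The per-query join is worth sketching because it explains why profile-derived operator metrics sit next to user-visible statistics in every row. The snippet below is only a conceptual illustration, assuming both history tables expose a query_id column; the authoritative query, its projected columns, and the hourly window filter all live in bendset.py.

    # Conceptual shape of the per-query join performed by bendset.py; the real query,
    # the per-operator aggregation, and the hourly window filter are defined in the script.
    JOIN_SKETCH = """
    SELECT q.*, p.*
    FROM system_history.query_history AS q
    JOIN system_history.profile_history AS p
        ON q.query_id = p.query_id
    """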

Quick Start

  1. Download and extract the latest archive.
    curl -LO https://sharing.databend.cloud/bendset/bendset-2025-12-11.tar.gz
    tar -xzf bendset-2025-12-11.tar.gz
  2. (Optional) Regenerate a fresh slice from your Databend history service:
    export DATABEND_DSN=$YOUR_DATABEND_DSN
    uv run bendset.py --dsn "$DATABEND_DSN" --start 2025-12-09 --end 2025-12-11
    The script streams hourly windows, joins query and profile history, and writes CSV chunks such as result.data_0, result.data_1, … (controlled by --output and --step).
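
Once extracted (or regenerated), the chunks can be concatenated for analysis. The sketch below uses pandas, which is not a dependency of this repository, and assumes each chunk is a plain CSV sharing the header layout shown in example.csv; adjust the glob pattern if you changed --output.

    import glob

    import pandas as pd  # assumption: installed separately, not shipped with this repo

    # Collect the chunk files written by bendset.py (result.data_0, result.data_1, ...).
    chunk_paths = sorted(glob.glob("result.data_*"))

    # Concatenate into one frame; assumes each chunk carries a header row like example.csv.
    queries = pd.concat((pd.read_csv(p) for p in chunk_paths), ignore_index=True)

    # Restore the event_time ordering used by the main dataset.
    queries = queries.sort_values("event_time")

    print(f"loaded {len(queries)} queries from {len(chunk_paths)} chunks")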

Main Dataset

The main CSV files contain one record per query, ordered by event_time. Alongside the identifiers (query_id, plus the MD5-hashed current_database and sql_user), each row includes:

  • Timing: query_start_time, event_time, query_queued_duration_ms, query_duration_ms.
  • Classification: query_kind, log_type_name.
  • Scan/Write volume: scan_rows, scan_bytes, scan_io_bytes_cost_ms, written_rows, written_bytes, written_io_bytes, written_io_bytes_cost_ms.
  • Results: result_rows, result_bytes.
  • Resource tracking: peek_memory_usage (aggregated across fragments) and node_num (number of fragments/operators observed).
  • Profile aggregates pulled from profile_history: cpu_time_sum, scan_bytes_from_local_disk_sum, scan_bytes_from_data_cache_sum, scan_bytes_from_remote_sum, hash_join_nodes, aggregate_partial_nodes, sort_nodes, and produced_rows.

The query_id column uniquely identifies each query and serves as the join key when combining the dataset with other telemetry.
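
As a sketch of such a combination, the snippet below merges the Bendset rows with a hypothetical telemetry table keyed by query_id; the other_telemetry.csv file and its columns are purely illustrative and not part of this dataset, and pandas is assumed to be installed separately.

    import pandas as pd  # assumption: installed separately

    bendset = pd.read_csv("example.csv")

    # Hypothetical external telemetry keyed by query_id; file name and columns are illustrative.
    telemetry = pd.read_csv("other_telemetry.csv")

    # A left join keeps every Bendset query even when no telemetry row matches.
    joined = bendset.merge(telemetry, on="query_id", how="left")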

Sample Data

The example.csv file in this repository contains the first few rows of a processed slice and mirrors the layout of the public snapshot. A short excerpt is shown below:

query_id,cpu_time_sum,scan_bytes_from_local_disk_sum,scan_bytes_from_data_cache_sum,scan_bytes_from_remote_sum,hash_join_nodes,aggregate_partial_nodes,sort_nodes,produced_rows,query_start_time,event_time,query_queued_duration_ms,query_duration_ms,query_kind,log_type_name,scan_rows,scan_bytes,scan_io_bytes_cost_ms,written_rows,written_bytes,written_io_bytes,written_io_bytes_cost_ms,result_rows,result_bytes,peek_memory_usage,current_database,sql_user,node_num
019b03d74a747df3a4d250ebb7ef43c1,29116853.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2025-12-09 15:59:59.620594+00:00,2025-12-09 16:00:00.007417+00:00,0.0,386.0,CopyIntoTable,Finish,278.0,152064.0,0.0,278.0,786253.0,6389.0,0.0,0.0,0.0,263573309.0,c21f969b5f03d33d43e04f8f136e7682,269c24d5505ad4801e3238c586a1f52c,1
70010100-95eb-4a5a-8e75-4d4557780776,2572887.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2025-12-09 15:59:59.199122+00:00,2025-12-09 16:00:00.009906+00:00,518.0,810.0,Query,Finish,1123.0,125183.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21085030.0,c21f969b5f03d33d43e04f8f136e7682,1eefadf0ae4d5031dae553197fba763f,1

Use this file as a quick sanity check for downstream pipelines or to verify that parsers interpret timestamps, floats, and hashes correctly.
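
A minimal sanity-check sketch along those lines, assuming pandas is installed separately:

    import pandas as pd  # assumption: installed separately

    sample = pd.read_csv("example.csv")

    # Timestamps in the sample carry fractional seconds and a +00:00 offset; parse them as UTC.
    for col in ("query_start_time", "event_time"):
        sample[col] = pd.to_datetime(sample[col], utc=True)

    # Numeric metrics are exported as floating point numbers.
    assert sample["query_duration_ms"].dtype == "float64"

    # Hashed identifiers should look like 32-character MD5 hex strings.
    for col in ("current_database", "sql_user"):
        assert sample[col].str.fullmatch(r"[0-9a-f]{32}").all()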

Column Reference

Column Description
query_id Unique identifier for every query log entry (UUID or hashed) and join key across tables.
cpu_time_sum Total CPU time consumed by the query as reported by aggregated plan-node statistics.
scan_bytes_from_local_disk_sum Bytes read from local disk caches.
scan_bytes_from_data_cache_sum Bytes served from Databend’s data cache tier.
scan_bytes_from_remote_sum Bytes scanned from remote object storage.
hash_join_nodes Count of plan nodes whose name contains HashJoin, a proxy for join complexity.
aggregate_partial_nodes Count of partial aggregation nodes encountered in the profile.
sort_nodes Count of sort operators observed.
produced_rows Rows produced by the root node (matches OutputRows in the profile when available).
query_start_time UTC wall-clock timestamp when the query started running.
event_time UTC timestamp when the history event was recorded (typically completion time).
query_queued_duration_ms Milliseconds spent waiting in the scheduler before execution.
query_duration_ms Milliseconds spent executing (between start and finish).
query_kind Statement category such as Query, CopyIntoTable, etc.
log_type_name Lifecycle stage captured by the history row (e.g., Finish, Exception).
scan_rows Total rows scanned from storage.
scan_bytes Total bytes scanned from storage.
scan_io_bytes_cost_ms I/O cost estimate in milliseconds for scanning (as reported by history tables).
written_rows Rows persisted to storage by the query.
written_bytes Bytes written to storage.
written_io_bytes Bytes transferred during write I/O operations.
written_io_bytes_cost_ms I/O cost estimate (ms) for writes.
result_rows Rows returned to the client.
result_bytes Bytes returned to the client.
peek_memory_usage Sum of per-fragment peak memory usage, derived from the VARIANT profile data.
current_database MD5 hash of the originating database name for privacy.
sql_user MD5 hash of the authenticated SQL user.
node_num Number of fragments/operators recorded for the query’s execution.

Schema Notes

  • Every column in the exported CSV matches the names produced by bendset.py. Numeric fields are normalized to floating point numbers to avoid NaN drift.
  • peek_memory_usage is derived from the VARIANT column in query_history by summing the per-fragment peaks.
  • produced_rows reflects the output cardinality of the root plan node, when the profile reports OutputRows.
  • Timestamps are UTC; the exported rows carry fractional seconds and an explicit +00:00 offset, as in the sample rows above.

Profile Time-Series Expansion

While the main dataset has aggregated per-query metrics, you can regenerate a time-series expansion by re-running bendset.py with --debug and post-processing the profiles column before aggregation. Each profile entry consists of (timestamp, query_id) pairs describing which queries were active at a given instant, allowing analyses such as concurrent query counts or CPU usage envelopes.
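
How those pairs are extracted depends on the exact shape of the --debug output, so the sketch below assumes they have already been collected into a list of (timestamp, query_id) tuples and shows only the concurrency-count step.

    from collections import Counter

    # Assumption: `samples` holds (timestamp, query_id) pairs extracted from the
    # --debug profile output; the values below are placeholders.
    samples = [
        ("2025-12-09 16:00:00", "q1"),
        ("2025-12-09 16:00:00", "q2"),
        ("2025-12-09 16:00:01", "q1"),
    ]

    # Count distinct active queries per instant.
    active = Counter()
    seen = set()
    for ts, qid in samples:
        if (ts, qid) not in seen:
            seen.add((ts, qid))
            active[ts] += 1

    for ts in sorted(active):
        print(ts, active[ts])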

Scripts

  • bendset.py (default entry point): connects via databend_driver, reads hourly windows from history tables, aggregates operator statistics, and emits CSV chunks. Key options: --dsn, --start, --end, --db (defaults to system_history), --output, --step, and --debug.
  • main.py: lightweight placeholder used when distributing the package; real workloads should invoke bendset.py.

Limitations & Privacy

  • Sampling window: the public snapshot only covers three days. Use the script with your own DSN to fetch longer windows.
  • Privacy: user-facing identifiers (current_database, sql_user) are MD5-hashed to preserve anonymity. Query text is not included (a small hashing sketch follows this list).
  • Schema drift: Future Databend releases may add or rename history columns. Re-run the script against your target version to regenerate consistent data.
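
Because the anonymization is an MD5 hash of the original name, you can test whether a hash in the dataset corresponds to an identifier you already know. The sketch below uses illustrative candidate names; the exact normalization applied to identifiers before hashing is not documented here.

    import hashlib

    def md5_hex(name: str) -> str:
        """MD5 hex digest of a UTF-8 encoded name."""
        return hashlib.md5(name.encode("utf-8")).hexdigest()

    # Candidate names are illustrative; the observed hash is taken from the sample rows above.
    candidates = ["default", "analytics", "staging"]
    observed = "c21f969b5f03d33d43e04f8f136e7682"

    for name in candidates:
        if md5_hex(name) == observed:
            print(f"{observed} is the MD5 of {name!r}")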

Please open an issue or PR if you discover inconsistencies or have suggestions for additional derived metrics.
