Cloud service providers commonly use standard benchmarks such as TPC-H and TPC-DS to evaluate and optimize the performance of cloud data analytics systems. However, these benchmarks have fixed query patterns and cannot reproduce the statistics of production cloud workloads. For example, they cannot simulate a real workload with similar performance metrics, such as CPU Time and Scanned Bytes, which are important for cost-optimal query optimization in the cloud.
In this paper, we study the problem of synthesizing workloads that are close to real cloud workloads in terms of performance metrics and operator ratios. The problem is challenging in three respects. First, the original queries are not visible, so it is hard to generate each query with exactly the same operators. Second, the original workloads produce performance metric curves with various peaks and valleys, and it is challenging to fit every peak and valley simultaneously. Third, it is challenging to prepare a complete candidate query set, since the generation targets have numerous combinations and may be out of distribution. To tackle the problem, we propose a novel workload synthesizer, named Performance-aware cloud analytics Benchmarking (PBench), that can generate workloads from real statistics for cloud analytics benchmarking. To address the first challenge, we leverage standard benchmarks to synthesize workloads such that the aggregated performance metrics and operator ratios match those of real workloads. To handle the second challenge, we formulate the task as an optimization problem and propose a two-phase framework that synthesizes the workload by gradually minimizing the generation error. To address the third challenge, we incorporate LLM-enhanced query generation to diversify the candidate query set and improve the synthesis accuracy. We evaluate PBench on two real-world workloads, Snowset and Redset. The experimental results demonstrate that our method outperforms existing state-of-the-art methods, achieving up to 6x higher accuracy.
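To make the idea of matching aggregated performance metrics concrete, the following is a minimal sketch, not PBench's two-phase framework. It assumes a small pool of candidate benchmark queries with known per-query CPU time and scanned bytes, and uses a plain greedy search as a stand-in for the paper's optimization. All names (`Candidate`, `relative_error`, `greedy_synthesize`) and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate benchmark query with assumed, pre-measured metrics."""
    name: str
    cpu_time: float       # seconds (hypothetical measurement)
    scanned_bytes: float  # bytes (hypothetical measurement)


def relative_error(total, target):
    """Sum of per-metric relative errors; targets must be positive."""
    return sum(abs(t - g) / g for t, g in zip(total, target))


def greedy_synthesize(candidates, target, max_queries=1000):
    """Greedily add the candidate that most reduces the error for one
    time window, stopping when no candidate improves the fit.

    This is a simple illustrative heuristic, not the method in the paper.
    """
    chosen = []
    total = (0.0, 0.0)
    err = relative_error(total, target)
    for _ in range(max_queries):
        best = None
        for c in candidates:
            new_total = (total[0] + c.cpu_time, total[1] + c.scanned_bytes)
            new_err = relative_error(new_total, target)
            # Keep only strict improvements over the current error.
            if new_err < err - 1e-12 and (best is None or new_err < best[0]):
                best = (new_err, c, new_total)
        if best is None:
            break
        err, picked, total = best
        chosen.append(picked.name)
    return chosen, total, err


# Hypothetical pool and target for a single time window.
pool = [
    Candidate("q_small", cpu_time=1.0, scanned_bytes=100.0),
    Candidate("q_big", cpu_time=5.0, scanned_bytes=1000.0),
]
chosen, total, err = greedy_synthesize(pool, target=(7.0, 1200.0))
print(chosen, total, err)
```

A greedy search like this can get stuck when no single query improves the fit, which is one reason a richer candidate set and a more careful optimization, as the abstract describes, would matter in practice.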
PBench: Workload synthesizer with real statistics for cloud analytics benchmarking
2025