2 Fixes Vastly Cut TiKV Write Stalls From SST File Ingestion
TiKV is an open source, distributed, transactional key-value database. Growing applications demand consistent performance, but unexpected write latency spikes, especially during Sorted String Table (SST) file ingestion, were hurting TiKV’s predictability. We found the root cause and delivered two enhancements that virtually eliminate those stalls while keeping correctness intact. These improvements sharpen TiKV’s performance under high load, heavy data movement or bursty write patterns.
What Was the Problem?
When TiKV ingests external SST files (via IngestExternalFile), it sometimes has to block foreground writes. That’s because SST ingestion must preserve the global sequence order across data in MemTables and data being ingested. If the SST key ranges overlap data in the MemTable, TiKV triggers a MemTable flush, which causes a write stall. Since a single RocksDB instance – the underlying storage for TiKV – covers all regions on a TiKV node, a problem in one region can degrade write latency across the entire node.
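To make the mechanics concrete, here is a minimal sketch in Rust of the decision the pre-fix ingestion path has to make. Everything here (KeyRange, flush_memtable_blocking and so on) is an illustrative stand-in for the TiKV/RocksDB internals, not the actual code:

```rust
// Illustrative sketch of why overlapping ingestion stalls writes.
// All names here are hypothetical stand-ins, not TiKV/RocksDB code.

struct KeyRange {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
}

impl KeyRange {
    fn overlaps(&self, other: &KeyRange) -> bool {
        self.start < other.end && other.start < self.end
    }
}

fn ingest_external_sst(memtable_range: &KeyRange, sst_range: &KeyRange) {
    // Ingested entries must receive a sequence number newer than any
    // overlapping entry still sitting in the MemTable. If the ranges
    // overlap, the MemTable must be flushed to an SST level first.
    if memtable_range.overlaps(sst_range) {
        // Pre-fix behavior: this flush happens on the critical write
        // path, so foreground writes on the node stall until it
        // completes, even writes to unrelated regions, because one
        // RocksDB instance backs all regions on the node.
        flush_memtable_blocking();
    }
    // Safe now: the ingested SST can take the next global sequence number.
    ingest_with_next_sequence_number(sst_range);
}

fn flush_memtable_blocking() { /* stub: blocking MemTable flush */ }
fn ingest_with_next_sequence_number(_r: &KeyRange) { /* stub: ingest */ }

fn main() {
    let memtable = KeyRange { start: b"a".to_vec(), end: b"m".to_vec() };
    let sst = KeyRange { start: b"k".to_vec(), end: b"z".to_vec() };
    ingest_external_sst(&memtable, &sst); // ranges overlap: flush, stall
}
```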

Figure 1: Each TiKV node hosts multiple regions, all backed by a single RocksDB instance. During SST ingestion, a write stall in one region affects all others on the same node.
What We Did: Two Major Fixes
We made two improvements to TiKV’s ingestion path to reduce write latency.
1. Flush Less, Stall Less
First, we changed the way ingestion handles MemTable overlap:
- First, attempt ingestion with allow_blocking_flush = false.
- If that fails, perform the flush outside the critical write-stall path.
- Then retry ingestion with allow_blocking_flush = true.
Thanks to this optimization (see TiKV#3775), many writes that used to stall now proceed normally. In tests, stall times dropped by up to 100x in worst-case overlapping scenarios.
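The flow is easiest to see in code. Below is a hedged sketch of the two-phase ingestion under the assumptions noted in the comments; IngestError, try_ingest and flush_memtable are hypothetical wrappers, while allow_blocking_flush mirrors the RocksDB ingestion option the fix toggles:

```rust
// Sketch of the two-phase ingestion flow from TiKV#3775. The wrapper
// types below are hypothetical; the real code drives RocksDB's
// IngestExternalFile.

#[derive(Debug)]
enum IngestError {
    FlushRequired, // the SST's key range overlaps the MemTable
}

fn ingest_sst(sst_path: &str) -> Result<(), IngestError> {
    // Step 1: attempt ingestion without permitting a blocking flush.
    // When there is no MemTable overlap (the common case), this
    // succeeds and no foreground write ever stalls.
    if try_ingest(sst_path, /* allow_blocking_flush: */ false).is_ok() {
        return Ok(());
    }
    // Step 2: overlap detected. Flush the MemTable here, outside the
    // critical write-stall path, so foreground writes keep flowing.
    flush_memtable();
    // Step 3: retry with the flush allowed. The overlapping data has
    // already been flushed, so this attempt rarely blocks anything.
    try_ingest(sst_path, /* allow_blocking_flush: */ true)
}

fn try_ingest(_path: &str, allow_blocking_flush: bool) -> Result<(), IngestError> {
    // Stub standing in for the RocksDB ingestion call; it simulates
    // the worst case, where the first attempt hits MemTable overlap.
    if allow_blocking_flush { Ok(()) } else { Err(IngestError::FlushRequired) }
}

fn flush_memtable() { /* stub: manual flush off the write path */ }

fn main() {
    ingest_sst("/tmp/example.sst").expect("ingestion failed");
}
```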
2. Remove Stalls via Coordinated Ingestion Plus Safety
To go further, we allowed SST ingestion to proceed without stopping foreground writes, even when overlap might occur, paired with safety mechanisms:
- Allow ingestion with allow_write = true, so foreground writes no longer have to stop.
- To maintain safety (no conflict between concurrent writes and garbage collection (GC)/ingestion), we added range latches across the affected key ranges. This guarantees no overlapping writes are being processed that could violate sequence ordering. (Implemented via TiKV#18096.)
With these changes, writes don’t stall at all during ingestion in most cases.
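To illustrate the latching idea, here is a simplified sketch; RangeLatch is a hypothetical stand-in for TiKV’s actual latch implementation, but it shows how ingestion can hold a latch over the SST’s key range so that only writes to that range wait:

```rust
// Simplified sketch of the range-latch idea behind TiKV#18096.
// RangeLatch here is a hypothetical stand-in, not TiKV's actual type.

use std::collections::BTreeMap;
use std::sync::{Condvar, Mutex};

#[derive(Default)]
struct RangeLatch {
    // Currently held ranges, keyed by start key (end key as value).
    held: Mutex<BTreeMap<Vec<u8>, Vec<u8>>>,
    cv: Condvar,
}

impl RangeLatch {
    // Block until [start, end) conflicts with no held range, then hold it.
    fn acquire(&self, start: Vec<u8>, end: Vec<u8>) {
        let mut held = self.held.lock().unwrap();
        loop {
            let conflict = held.iter().any(|(s, e)| s < &end && &start < e);
            if !conflict {
                held.insert(start, end);
                return;
            }
            // Wait for some range to be released, then re-check.
            held = self.cv.wait(held).unwrap();
        }
    }

    fn release(&self, start: &[u8]) {
        self.held.lock().unwrap().remove(start);
        self.cv.notify_all();
    }
}

fn main() {
    let latch = RangeLatch::default();
    // Ingestion latches the SST's key range, then ingests with
    // allow_write = true: only writes into *this* range must wait.
    latch.acquire(b"k".to_vec(), b"z".to_vec());
    // ... ingest the SST here; writes to unrelated ranges keep flowing ...
    latch.release(b"k");
}
```

In this scheme, a write to an overlapping range waits only for the ingestion itself, while writes elsewhere on the node never block at all.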
Measurable Results
As the charts below show, we saw significant improvements in tail latency and write performance.
1. P9999 write thread wait time dropped from 25 ms to 2 ms.

Figure 2: Eliminating write stalls significantly reduced worst-case P9999 wait times by more than 90% for write threads.
2. P99 write latency dropped from 2-4 ms to 1 ms.

Figure 3: After the optimization, P99 write latency became consistently low and predictable.
What that means in practice: Write operations become far more predictable under load, even during operations like region splitting, rebalancing or GC sweeps. That stability matters a lot in production systems, where small latency spikes ripple into user-visible delays.
Why RocksDB Matters
RocksDB, as the underlying storage for TiKV, enforces the consistency guarantee that global sequence numbers are increasing, even across data in different storage components (MemTables, SST levels, external SSTs). Without careful handling, overlapping key ranges during SST ingestion force MemTable flushes, leading to stalls. Our optimizations honor the same guarantees while changing when and how flushes happen, or avoiding them altogether when possible.
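As a toy illustration of that invariant (plain Rust, not RocksDB code): every write, whether a MemTable put or an ingested SST entry, takes the next global sequence number, and reads resolve a key to the value with the highest one:

```rust
// Toy model of RocksDB's global sequence-number invariant; this is an
// illustration of the idea, not RocksDB code.

struct Entry {
    key: &'static str,
    value: &'static str,
    seq: u64, // globally increasing across MemTables and ingested SSTs
}

fn main() {
    let mut next_seq = 0u64;
    let mut write = |key, value| {
        next_seq += 1; // every write, from any component, gets a newer seq
        Entry { key, value, seq: next_seq }
    };

    let from_memtable = write("k1", "v-memtable"); // seq = 1
    // An ingested SST covering k1 must take a larger sequence number
    // than the MemTable entry; otherwise a read could resolve to stale
    // data. Preserving this ordering is what forces a flush (or a
    // range latch) when key ranges overlap.
    let from_ingest = write("k1", "v-ingested"); // seq = 2

    // Reads pick the entry with the highest sequence number.
    let newest = if from_ingest.seq > from_memtable.seq {
        &from_ingest
    } else {
        &from_memtable
    };
    println!("read {} -> {} (seq {})", newest.key, newest.value, newest.seq);
    assert_eq!(newest.value, "v-ingested");
}
```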
What This Means for You
If you have concerns about tail latency (P99/P999/P9999) due to:
- frequent data ingestion (rebalance, migration, batch loading)
- sudden bursts of writes
… then these changes in TiKV provide meaningful benefits: lower worst-case waits, more consistent write latency and fewer surprises in production.
What seemed like a niche issue, write stalls during SST ingestion, turned out to be a powerful lever for improvement: we reduced or eliminated those stalls in TiKV in almost all situations.