2 Fixes Vastly Cut TiKV Write Stalls From SST File Ingestion
TiKV is an open source, distributed, transactional key-value database. Growing applications demand consistent performance, but unexpected write latency spikes, especially during Sorted String Table (SST) file ingestion, were hurting TiKV’s predictability. We found the root cause and delivered two enhancements that virtually eliminate those stalls while keeping correctness intact. These improvements sharpen TiKV’s performance under high load, heavy data movement or bursty write patterns.
What Was the Problem?
When TiKV ingests external SST files (via IngestExternalFile), it sometimes has to block foreground writes. That’s because SST ingestion must preserve the global sequence order across data in MemTables and data being ingested. If the SST key ranges overlap data in the MemTable, TiKV triggers a MemTable flush, which causes a write stall. Since a single RocksDB instance – the underlying storage for TiKV – covers all regions on a TiKV node, a problem in one region can degrade write latency across the entire node.
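To make the mechanics concrete, here is a minimal sketch in Rust of the decision the pre-fix ingestion path has to make. Everything here (KeyRange, flush_memtable_blocking and so on) is an illustrative stand-in for the TiKV/RocksDB internals, not the actual code:

```rust
// Illustrative sketch of why overlapping ingestion stalls writes.
// All names here are hypothetical stand-ins, not TiKV/RocksDB code.

struct KeyRange {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
}

impl KeyRange {
    fn overlaps(&self, other: &KeyRange) -> bool {
        self.start < other.end && other.start < self.end
    }
}

fn ingest_external_sst(memtable_range: &KeyRange, sst_range: &KeyRange) {
    // Ingested entries must receive a sequence number newer than any
    // overlapping entry still sitting in the MemTable. If the ranges
    // overlap, the MemTable must be flushed to an SST level first.
    if memtable_range.overlaps(sst_range) {
        // Pre-fix behavior: this flush happens on the critical write
        // path, so foreground writes on the node stall until it
        // completes, even writes to unrelated regions, because one
        // RocksDB instance backs all regions on the node.
        flush_memtable_blocking();
    }
    // Safe now: the ingested SST can take the next global sequence number.
    ingest_with_next_sequence_number(sst_range);
}

fn flush_memtable_blocking() { /* stub: blocking MemTable flush */ }
fn ingest_with_next_sequence_number(_r: &KeyRange) { /* stub: ingest */ }

fn main() {
    let memtable = KeyRange { start: b"a".to_vec(), end: b"m".to_vec() };
    let sst = KeyRange { start: b"k".to_vec(), end: b"z".to_vec() };
    ingest_external_sst(&memtable, &sst); // ranges overlap: flush, stall
}
```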

Figure 1: Each TiKV node hosts multiple regions, all backed by a single RocksDB instance. During SST ingestion, a write stall in one region affects all others on the same node.
What We Did: Two Major Fixes
We made two improvements to TiKV’s ingestion path to reduce write latency.
1. Flush Less, Stall Less
First, we changed the way ingestion handles MemTable overlap:
- First, attempt ingestion with allow_blocking_flush = false.
- If that fails, perform the flush outside the critical write-stall path.
- Then retry ingestion with allow_blocking_flush = true.
Thanks to this optimization (see TiKV#3775), many writes that used to stall now proceed normally. In tests, stall times dropped by up to 100x in worst-case overlapping scenarios.
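The flow is easiest to see in code. Below is a hedged sketch of the two-phase ingestion under the assumptions noted in the comments; IngestError, try_ingest and flush_memtable are hypothetical wrappers, while allow_blocking_flush mirrors the RocksDB ingestion option the fix toggles:

```rust
// Sketch of the two-phase ingestion flow from TiKV#3775. The wrapper
// types below are hypothetical; the real code drives RocksDB's
// IngestExternalFile.

#[derive(Debug)]
enum IngestError {
    FlushRequired, // the SST's key range overlaps the MemTable
}

fn ingest_sst(sst_path: &str) -> Result<(), IngestError> {
    // Step 1: attempt ingestion without permitting a blocking flush.
    // When there is no MemTable overlap (the common case), this
    // succeeds and no foreground write ever stalls.
    if try_ingest(sst_path, /* allow_blocking_flush: */ false).is_ok() {
        return Ok(());
    }
    // Step 2: overlap detected. Flush the MemTable here, outside the
    // critical write-stall path, so foreground writes keep flowing.
    flush_memtable();
    // Step 3: retry with the flush allowed. The overlapping data has
    // already been flushed, so this attempt rarely blocks anything.
    try_ingest(sst_path, /* allow_blocking_flush: */ true)
}

fn try_ingest(_path: &str, allow_blocking_flush: bool) -> Result<(), IngestError> {
    // Stub standing in for the RocksDB ingestion call; it simulates
    // the worst case, where the first attempt hits MemTable overlap.
    if allow_blocking_flush { Ok(()) } else { Err(IngestError::FlushRequired) }
}

fn flush_memtable() { /* stub: manual flush off the write path */ }

fn main() {
    ingest_sst("/tmp/example.sst").expect("ingestion failed");
}
```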
2. Remove Stalls via Coordinated Ingestion Plus Safety
To go further, we allowed SST ingestion to proceed without stopping foreground writes, even when overlap might occur, paired with safety mechanisms:
- Allow ingestion with allow_write = true, so foreground writes no longer have to stop.
- To maintain safety (no conflict between concurrent writes and garbage collection (GC)/ingestion), we added range latches across the affected key ranges. This guarantees no overlapping writes are being processed that could violate sequence ordering. (Implemented via TiKV#18096.)
With these changes, writes don’t stall at all during ingestion in most cases.
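To illustrate the latching idea, here is a simplified sketch; RangeLatch is a hypothetical stand-in for TiKV’s actual latch implementation, but it shows how ingestion can hold a latch over the SST’s key range so that only writes to that range wait:

```rust
// Simplified sketch of the range-latch idea behind TiKV#18096.
// RangeLatch here is a hypothetical stand-in, not TiKV's actual type.

use std::collections::BTreeMap;
use std::sync::{Condvar, Mutex};

#[derive(Default)]
struct RangeLatch {
    // Currently held ranges, keyed by start key (end key as value).
    held: Mutex<BTreeMap<Vec<u8>, Vec<u8>>>,
    cv: Condvar,
}

impl RangeLatch {
    // Block until [start, end) conflicts with no held range, then hold it.
    fn acquire(&self, start: Vec<u8>, end: Vec<u8>) {
        let mut held = self.held.lock().unwrap();
        loop {
            let conflict = held.iter().any(|(s, e)| s < &end && &start < e);
            if !conflict {
                held.insert(start, end);
                return;
            }
            // Wait for some range to be released, then re-check.
            held = self.cv.wait(held).unwrap();
        }
    }

    fn release(&self, start: &[u8]) {
        self.held.lock().unwrap().remove(start);
        self.cv.notify_all();
    }
}

fn main() {
    let latch = RangeLatch::default();
    // Ingestion latches the SST's key range, then ingests with
    // allow_write = true: only writes into *this* range must wait.
    latch.acquire(b"k".to_vec(), b"z".to_vec());
    // ... ingest the SST here; writes to unrelated ranges keep flowing ...
    latch.release(b"k");
}
```

In this scheme, a write to an overlapping range waits only for the ingestion itself, while writes elsewhere on the node never block at all.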
Measurable Results
As the charts below show, we saw significant improvements in tail latency and write performance.
1. P9999 write thread wait time dropped from 25 ms to 2 ms.

Figure 2: Eliminating write stalls significantly reduced worst-case P9999 wait times by more than 90% for write threads.
2. P99 write latency dropped from 2-4 ms to 1 ms.

Figure 3: After the optimization, P99 write latency became consistently low and predictable.
What that means in practice: Write operations become far more predictable under load, even during operations like region splitting, rebalancing or GC sweeps. That stability matters a lot in production systems, where small latency spikes ripple into user-visible delays.
Why RocksDB Matters
RocksDB, as the underlying storage for TiKV, enforces the consistency guarantee that global sequence numbers are increasing, even across data in different storage components (MemTables, SST levels, external SSTs). Without careful handling, overlapping key ranges during SST ingestion force MemTable flushes, leading to stalls. Our optimizations honor the same guarantees while changing when and how flushes happen, or avoiding them altogether when possible.
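As a toy illustration of that invariant (plain Rust, not RocksDB code): every write, whether a MemTable put or an ingested SST entry, takes the next global sequence number, and reads resolve a key to the value with the highest one:

```rust
// Toy model of RocksDB's global sequence-number invariant; this is an
// illustration of the idea, not RocksDB code.

struct Entry {
    key: &'static str,
    value: &'static str,
    seq: u64, // globally increasing across MemTables and ingested SSTs
}

fn main() {
    let mut next_seq = 0u64;
    let mut write = |key, value| {
        next_seq += 1; // every write, from any component, gets a newer seq
        Entry { key, value, seq: next_seq }
    };

    let from_memtable = write("k1", "v-memtable"); // seq = 1
    // An ingested SST covering k1 must take a larger sequence number
    // than the MemTable entry; otherwise a read could resolve to stale
    // data. Preserving this ordering is what forces a flush (or a
    // range latch) when key ranges overlap.
    let from_ingest = write("k1", "v-ingested"); // seq = 2

    // Reads pick the entry with the highest sequence number.
    let newest = if from_ingest.seq > from_memtable.seq {
        &from_ingest
    } else {
        &from_memtable
    };
    println!("read {} -> {} (seq {})", newest.key, newest.value, newest.seq);
    assert_eq!(newest.value, "v-ingested");
}
```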
What This Means for You
If you have concerns about tail latency (P99/P999/P9999) due to:
- frequent data ingestion (rebalance, migration, batch loading)
- sudden bursts of writes
… then these changes in TiKV provide meaningful benefits: lower worst-case waits, more consistent write latency and fewer surprises in production.
What seemed like a niche issue, write stalls during SST ingestion, turned out to be a powerful lever for improvement: we reduced or eliminated those stalls in TiKV in almost all situations.