Mooncake is a KVCache-centric disaggregated architecture for large language model (LLM) serving. It is the production serving platform for Kimi, a leading LLM service operated by Moonshot AI (README.md28). The system separates prefill and decode workloads across different compute clusters and implements a distributed KVCache pool using underutilized CPU, DRAM, and SSD resources (README.md4).
The repository provides two primary open-source components: the Transfer Engine and Mooncake Store.
Mooncake integrates with major LLM inference frameworks, including vLLM, SGLang, TensorRT-LLM, and LMDeploy, to enable disaggregated prefill-decode (PD) architectures and hierarchical KV caching. Mooncake recently joined the PyTorch Ecosystem (README.md38) and received the Best Paper Award at FAST 2025 (docs/source/index.md50).
Sources: README.md1-31 docs/source/index.md1-27
Mooncake consists of three logical tiers that separate concerns between data movement, storage/control, and application integration.
The following diagram illustrates the relationship between user-facing frameworks, the Python API layer, and the underlying C++ core components.
Tier 1: Data Plane (Transfer Engine)
The TransferEngine class provides a unified API for batched data movement. It abstracts multiple transport protocols (RDMA, NVLink, TCP, Ascend Direct, EFA) behind a MultiTransport dispatcher (docs/source/getting_started/supported-protocols.md7-18). TENT is the next-generation runtime that adds dynamic path selection and in-runtime failure handling.
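The dispatcher idea can be illustrated with a small Python sketch. It is not Mooncake's C++ implementation: `ToyMultiTransport`, `TransferRequest`, the `protocol://` target format, and the handler signature are all hypothetical names chosen for this example. It only shows how one batched call can fan out to whichever transport is registered for each request.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TransferRequest:
    """One entry of a batch (hypothetical fields, not Mooncake's real struct)."""
    target: str   # assumed format: "<protocol>://<node>/<segment>"
    length: int   # number of bytes to move


class ToyMultiTransport:
    """Illustrative dispatcher: routes each request to the handler for its protocol."""

    def __init__(self) -> None:
        self._transports: Dict[str, Callable[[TransferRequest], str]] = {}

    def install(self, protocol: str, handler: Callable[[TransferRequest], str]) -> None:
        # Register one transport backend under a protocol name.
        self._transports[protocol] = handler

    def submit_batch(self, batch: List[TransferRequest]) -> List[str]:
        # Callers use one entry point regardless of the underlying protocol.
        results = []
        for req in batch:
            protocol = req.target.split("://", 1)[0]
            handler = self._transports.get(protocol)
            if handler is None:
                raise ValueError(f"no transport installed for protocol {protocol!r}")
            results.append(handler(req))
        return results


dispatcher = ToyMultiTransport()
dispatcher.install("rdma", lambda r: f"rdma:{r.length}")
dispatcher.install("tcp", lambda r: f"tcp:{r.length}")

batch = [
    TransferRequest("rdma://node-b/seg0", 4096),
    TransferRequest("tcp://node-c/seg1", 512),
]
results = dispatcher.submit_batch(batch)
```

The handlers here just return a string. In a real transport they would post the actual copy and report completion asynchronously.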
Tier 2: Storage and Control Plane (Mooncake Store)
The MasterService manages metadata sharding, segment allocation, and object lifecycle. RealClient coordinates data placement and retrieval. Metadata backends, including etcd, Redis, and HTTP, provide high-availability support (docs/source/getting_started/supported-protocols.md83).
Tier 3: Application Integration Layer
Python bindings (MooncakeStorePyWrapper, TransferEnginePy) expose C++ functionality to ML frameworks. MooncakeDistributedStore provides tensor-centric APIs for PyTorch integration.
Sources: README.md35-43 docs/source/getting_started/supported-protocols.md1-18 MAINTAINERS.md7-8
The Transfer Engine is Mooncake's data movement layer. It supports batched asynchronous transfers with automatic topology-aware path selection and multi-NIC bandwidth aggregation (docs/source/getting_started/supported-protocols.md111-116).
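To make multi-NIC bandwidth aggregation concrete, the sketch below splits one large transfer into fixed-size slices and spreads them across several NICs. It is a simplification under stated assumptions: the function name `slice_across_nics`, the NIC names, and the round-robin policy are invented for illustration, whereas the Transfer Engine is described as choosing paths in a topology-aware way.

```python
from typing import Dict, List, Tuple


def slice_across_nics(
    total_bytes: int, slice_bytes: int, nics: List[str]
) -> Dict[str, List[Tuple[int, int]]]:
    """Assign (offset, length) slices of one transfer to NICs in round-robin order."""
    if slice_bytes <= 0:
        raise ValueError("slice_bytes must be positive")
    if not nics:
        raise ValueError("at least one NIC is required")

    plan: Dict[str, List[Tuple[int, int]]] = {nic: [] for nic in nics}
    offset = 0
    index = 0
    while offset < total_bytes:
        # The final slice may be shorter than slice_bytes.
        length = min(slice_bytes, total_bytes - offset)
        plan[nics[index % len(nics)]].append((offset, length))
        offset += length
        index += 1
    return plan


# Hypothetical NIC names; 10 bytes in 4-byte slices over two NICs.
plan = slice_across_nics(total_bytes=10, slice_bytes=4, nics=["mlx5_0", "mlx5_1"])
```

Because the slices are independent, each NIC can move its share concurrently, which is where the aggregate bandwidth comes from.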
Individual transports are registered through installTransport().
Sources: README.md79-81 docs/source/index.md33 MAINTAINERS.md7
Mooncake Store is a distributed object storage system specialized for KVCache. It implements lease-based object lifecycle management, multi-replica support, and automatic eviction to SSD (docs/source/index.md46).
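The text does not spell out the exact lease semantics, so the following sketch assumes a common interpretation: a lease protects an object from eviction until it expires. `ToyLeaseTable` and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Dict


class ToyLeaseTable:
    """Tracks a lease expiry per key (assumed semantics, not Mooncake's code)."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}

    def grant(self, key: str, now: float, ttl: float) -> None:
        # Extending a lease never shortens an existing, longer one.
        self._expiry[key] = max(self._expiry.get(key, 0.0), now + ttl)

    def evictable(self, key: str, now: float) -> bool:
        # Objects without a lease, or with an expired one, may be evicted.
        return now >= self._expiry.get(key, 0.0)


leases = ToyLeaseTable()
leases.grant("kv-1", now=0.0, ttl=5.0)

during_lease = leases.evictable("kv-1", now=3.0)   # lease still active
after_lease = leases.evictable("kv-1", now=5.0)    # lease has expired
```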
RealClient handles put() and get() operations, interacting with the MasterService for metadata and the TransferEngine for data movement.
An SSD offload component of RealClient persists evicted objects to local disk via FileStorage. This is crucial for maintaining performance when DRAM/VRAM is exhausted.
Sources: README.md82-84 docs/source/index.md49 MAINTAINERS.md8 docs/source/image/ssd_offload_overall.png1-10
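The put/get path with offload to disk can be sketched as a toy, single-process model. Everything here is an assumption made for illustration: `ToyStoreClient`, the capacity counted in objects, the oldest-first eviction order, and the `.bin` file naming are not taken from Mooncake, and a plain dictionary stands in for the metadata that the MasterService would hold.

```python
import os
import tempfile
from collections import OrderedDict
from typing import Dict


class ToyStoreClient:
    """In-memory store that spills the oldest objects to local files when full."""

    def __init__(self, dram_capacity: int, offload_dir: str) -> None:
        if dram_capacity < 1:
            raise ValueError("dram_capacity must be at least 1")
        self._dram: "OrderedDict[str, bytes]" = OrderedDict()  # oldest entry first
        self._capacity = dram_capacity                          # max objects in memory
        self._offload_dir = offload_dir
        self._location: Dict[str, str] = {}                     # key -> "dram" or "ssd"

    def _path(self, key: str) -> str:
        # Assumes keys are simple names that are safe to use as file names.
        return os.path.join(self._offload_dir, key + ".bin")

    def put(self, key: str, value: bytes) -> None:
        self._dram.pop(key, None)  # overwriting a key must not trigger a needless eviction
        while len(self._dram) >= self._capacity:
            old_key, old_value = self._dram.popitem(last=False)
            with open(self._path(old_key), "wb") as f:
                f.write(old_value)
            self._location[old_key] = "ssd"
        self._dram[key] = value
        self._location[key] = "dram"

    def get(self, key: str) -> bytes:
        where = self._location.get(key)
        if where == "dram":
            return self._dram[key]
        if where == "ssd":
            with open(self._path(key), "rb") as f:
                return f.read()
        raise KeyError(key)

    def location(self, key: str) -> str:
        return self._location[key]


with tempfile.TemporaryDirectory() as tmp:
    client = ToyStoreClient(dram_capacity=1, offload_dir=tmp)
    client.put("kv-a", b"alpha")
    client.put("kv-b", b"beta")  # memory is full, so kv-a is written to disk
    locations = (client.location("kv-a"), client.location("kv-b"))
    roundtrip = client.get("kv-a")  # served from the offload file
```

The point of the sketch is that a get() for an evicted key still succeeds, only from a slower tier, instead of forcing the KVCache to be recomputed.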
Mooncake manages objects through a state machine: ALLOCATING → PROCESSING → COMPLETE → EVICTED. Memory is managed across a hierarchy including DRAM, GPU memory (VRAM), CXL, and SSD (docs/source/getting_started/supported-protocols.md17).
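The lifecycle can be written down as a small state machine. This sketch encodes only the linear path named above. Whether the real implementation allows other transitions (for example, aborting an allocation) is not stated here, so none are modeled. `ObjectState` and `advance` are illustrative names.

```python
from enum import Enum


class ObjectState(Enum):
    ALLOCATING = "ALLOCATING"
    PROCESSING = "PROCESSING"
    COMPLETE = "COMPLETE"
    EVICTED = "EVICTED"


# Only the forward transitions described in the text are allowed.
_NEXT = {
    ObjectState.ALLOCATING: ObjectState.PROCESSING,
    ObjectState.PROCESSING: ObjectState.COMPLETE,
    ObjectState.COMPLETE: ObjectState.EVICTED,
}


def advance(state: ObjectState) -> ObjectState:
    """Return the next lifecycle state, or raise if the state is terminal."""
    if state not in _NEXT:
        raise ValueError(f"{state.name} is a terminal state")
    return _NEXT[state]


# Walk an object through its whole lifecycle.
path = [ObjectState.ALLOCATING]
while path[-1] in _NEXT:
    path.append(advance(path[-1]))
```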
Sources: README.md82-84 docs/source/index.md38
Mooncake enables prefill-decode (PD) disaggregation by separating compute-intensive prefill operations from latency-sensitive decode operations.
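The control flow of PD disaggregation can be sketched with two functions that share nothing but a KVCache pool. This is a toy under loud assumptions: the "KV cache" is a list of (position, token) pairs, the "model" is a placeholder that adds one to the last token, and the pool is a plain dictionary standing in for a distributed store. None of these names come from Mooncake or the frameworks listed below.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a KV cache: one (position, token_id) entry per processed token.
KVCache = List[Tuple[int, int]]


def prefill(request_id: str, prompt_tokens: List[int], pool: Dict[str, KVCache]) -> None:
    """Prefill worker: process the whole prompt once and publish its KV cache."""
    if not prompt_tokens:
        raise ValueError("prompt must contain at least one token")
    pool[request_id] = [(pos, tok) for pos, tok in enumerate(prompt_tokens)]


def decode(request_id: str, pool: Dict[str, KVCache], max_new_tokens: int) -> List[int]:
    """Decode worker: fetch the published KV cache and generate token by token."""
    kv = list(pool[request_id])  # local copy, as if transferred from the pool
    generated: List[int] = []
    for _ in range(max_new_tokens):
        next_token = (kv[-1][1] + 1) % 1000  # placeholder for a model forward pass
        generated.append(next_token)
        kv.append((len(kv), next_token))
    return generated


shared_pool: Dict[str, KVCache] = {}
prefill("req-1", [10, 11, 12], shared_pool)       # would run on the prefill cluster
generated = decode("req-1", shared_pool, 3)       # would run on the decode cluster
```

Because the decode side starts from the published cache, it never reprocesses the prompt, which is what lets the two phases be scaled and scheduled independently.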
| Framework | Integration Point | Role |
|---|---|---|
| vLLM | mooncake_connector_v1.py | PD-disaggregated KV transfer backend README.md43 |
| SGLang | hicache / RadixAttention | Hierarchical KV caching storage backend with SSD offload support README.md46 docs/source/getting_started/examples/sglang-integration-v1.md149-155 |
| TensorRT-LLM | mooncake_utils | Cache transmission for PD-disaggregation README.md42 |
| LMDeploy | PD Backend | PD disaggregation plugin README.md51 |
| TorchSpec | Hidden State Management | Decoupling inference and training README.md34 |
| FlexKV | Mooncake Transfer Engine | Distributed KVCache reuse README.md39 |
| vLLM-Omni | MooncakeStoreConnector | Multi-node omni-modality pipelines README.md37 |
| LightX2V | Transfer Engine | Encoder/Transformer service decoupling README.md35 |
Sources: README.md34-60 docs/source/index.md30-53 docs/source/getting_started/examples/sglang-integration-v1.md1-172
Mooncake supports a wide range of high-performance transport protocols and hardware accelerators through a pluggable architecture. For a full list, see Supported Protocols and Hardware.
Sources: docs/source/getting_started/supported-protocols.md1-18 MAINTAINERS.md11-15 mooncake-transfer-engine/include/gpu_vendor/mlu.h1-37