Overview

Mooncake is a KVCache-centric disaggregated architecture for large language model (LLM) serving. It is the production serving platform for Kimi, a leading LLM service operated by Moonshot AI README.md28. The system separates prefill and decode workloads across different compute clusters and implements a distributed KVCache pool using underutilized CPU, DRAM, and SSD resources README.md4.

The repository provides two primary open-source components:

Transfer Engine / TENT: High-performance data movement across heterogeneous storage and network devices README.md29
Mooncake Store: Distributed KVCache storage with multi-replica support and automatic eviction README.md29

Mooncake integrates with major LLM inference frameworks, including vLLM, SGLang, TensorRT-LLM, and LMDeploy, to enable disaggregated prefill-decode (PD) architectures and hierarchical KV caching. Mooncake recently joined the PyTorch Ecosystem README.md38 and received the Best Paper Award at FAST 2025 docs/source/index.md50.

For detailed information on specific subsystems, see:

System architecture details: System Architecture
Core terminology and data model: Key Concepts and Terminology
Hardware and protocol support: Supported Protocols and Hardware
Performance metrics: Performance Characteristics

Sources: README.md1-31 docs/source/index.md1-27

Three-Tier Architecture

Mooncake consists of three logical tiers that separate concerns between data movement, storage/control, and application integration.

High-Level System Architecture

The following diagram illustrates the relationship between user-facing frameworks, the Python API layer, and the underlying C++ core components.

Tier 1: Data Plane (Transfer Engine). The TransferEngine class provides a unified API for batched data movement. It abstracts multiple transport protocols (RDMA, NVLink, TCP, Ascend Direct, EFA) through a MultiTransport dispatcher docs/source/getting_started/supported-protocols.md7-18. TENT is the next-generation runtime that adds dynamic path selection and in-runtime failure handling.

Tier 2: Storage and Control Plane (Mooncake Store). The MasterService manages metadata sharding, segment allocation, and object lifecycle. RealClient coordinates data placement and retrieval. Metadata backends, including etcd, Redis, and HTTP, provide high-availability support docs/source/getting_started/supported-protocols.md83.

Tier 3: Application Integration Layer. Python bindings (MooncakeStorePyWrapper, TransferEnginePy) expose C++ functionality to ML frameworks. MooncakeDistributedStore provides tensor-centric APIs for PyTorch integration.
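
To make the Tier 1 description more concrete, here is a minimal, self-contained Python sketch of protocol dispatch. It is not Mooncake's API: `ToyMultiTransport`, `TransferRequest`, `install`, and `submit_batch` are hypothetical names, and each "transport" is just an in-memory function. The sketch only illustrates the idea of a single batched entry point that routes every request to whichever transport is registered for its protocol.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TransferRequest:
    """One entry in a batch (hypothetical shape, not Mooncake's struct)."""
    protocol: str   # e.g. "rdma" or "tcp"
    payload: bytes  # data to move


class ToyMultiTransport:
    """Routes each request to the transport registered for its protocol."""

    def __init__(self) -> None:
        self._transports: Dict[str, Callable[[bytes], bytes]] = {}

    def install(self, protocol: str, send: Callable[[bytes], bytes]) -> None:
        # Transports are registered explicitly before any transfer is submitted.
        self._transports[protocol] = send

    def submit_batch(self, batch: List[TransferRequest]) -> List[bytes]:
        results = []
        for req in batch:
            send = self._transports.get(req.protocol)
            if send is None:
                raise KeyError(f"no transport installed for {req.protocol!r}")
            results.append(send(req.payload))
        return results


# Stand-in transports: each tags the payload so the chosen route is visible.
dispatcher = ToyMultiTransport()
dispatcher.install("rdma", lambda data: b"rdma:" + data)
dispatcher.install("tcp", lambda data: b"tcp:" + data)

batch = [TransferRequest("rdma", b"kv-block-0"), TransferRequest("tcp", b"kv-block-1")]
delivered = dispatcher.submit_batch(batch)
```

Callers see one batched interface regardless of protocol, so in this toy model adding a transport only requires another `install` call.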

Sources: README.md35-43 docs/source/getting_started/supported-protocols.md1-18 MAINTAINERS.md7-8

Core Components

Transfer Engine and TENT

The Transfer Engine is Mooncake's data movement layer. It supports batched asynchronous transfers with automatic topology-aware path selection and multi-NIC bandwidth aggregation docs/source/getting_started/supported-protocols.md111-116.

Standard Transfer Engine: Uses a two-phase initialization and requires manual transport installation via installTransport().
TENT (Transfer Engine NEXT): A single-phase initialization runtime that manages transport selection automatically. TENT adds telemetry-driven scheduling and declarative APIs.
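
Multi-NIC bandwidth aggregation can be pictured as splitting one transfer into slices and spreading them across the available NICs. The sketch below is an illustration under assumptions, not the Transfer Engine's scheduler: `stripe_transfer`, the slice size, the device names, and the round-robin policy are all hypothetical, and the topology-aware path selection mentioned above is not modeled.

```python
from typing import Dict, List, Tuple


def stripe_transfer(
    length: int, nics: List[str], slice_size: int
) -> Dict[str, List[Tuple[int, int]]]:
    """Split [0, length) into slices and assign them to NICs round-robin.

    Returns a mapping from NIC name to a list of (offset, size) slices.
    """
    if length < 0 or slice_size <= 0 or not nics:
        raise ValueError("need length >= 0, slice_size > 0, and at least one NIC")
    plan: Dict[str, List[Tuple[int, int]]] = {nic: [] for nic in nics}
    offset = 0
    index = 0
    while offset < length:
        size = min(slice_size, length - offset)  # the last slice may be short
        plan[nics[index % len(nics)]].append((offset, size))
        offset += size
        index += 1
    return plan


# Hypothetical device names: 10 bytes in 4-byte slices over two NICs.
plan = stripe_transfer(length=10, nics=["mlx5_0", "mlx5_1"], slice_size=4)
```

Round-robin is the simplest policy that keeps every NIC busy; a scheduler that accounts for link load or NUMA distance would replace the `index % len(nics)` choice.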

Sources: README.md79-81 docs/source/index.md33 MAINTAINERS.md7

Mooncake Store

Mooncake Store is a distributed object storage system specialized for KVCache. It implements lease-based object lifecycle management, multi-replica support, and automatic eviction to SSD docs/source/index.md46.

Master Service: Manages metadata across shards with per-shard locks to reduce contention.
Client: The RealClient handles put() and get() operations, interacting with the MasterService for metadata and the TransferEngine for data movement.
Conductor and KV Indexer: Specialized services for tiered KV cache pools (G1/G2/G3) and token hit count tracking.
SSD Offloading: Background subsystem within the RealClient that persists evicted objects to local disk via FileStorage. Persisting these objects keeps them available once DRAM and VRAM capacity is exhausted.
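
The per-shard locking and lease ideas in this list can be sketched in a few lines. This is a toy model under stated assumptions, not the MasterService implementation: `ToyShardedMetadata`, the shard count, the CRC-based shard choice, and the rule that an object with an active lease is not evicted are hypothetical choices made for illustration. Time is passed in explicitly so the example is deterministic.

```python
import threading
import zlib
from typing import Dict, List


class ToyShardedMetadata:
    """Key -> lease-expiry map split across shards, with one lock per shard."""

    def __init__(self, num_shards: int = 4) -> None:
        self._shards: List[Dict[str, float]] = [{} for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def _shard_of(self, key: str) -> int:
        # crc32 is stable across runs, unlike Python's salted hash().
        return zlib.crc32(key.encode("utf-8")) % len(self._shards)

    def put(self, key: str) -> None:
        i = self._shard_of(key)
        with self._locks[i]:            # only this shard is blocked
            self._shards[i][key] = 0.0  # 0.0 means "no active lease"

    def grant_lease(self, key: str, now: float, ttl: float) -> bool:
        i = self._shard_of(key)
        with self._locks[i]:
            if key not in self._shards[i]:
                return False
            self._shards[i][key] = now + ttl
            return True

    def try_evict(self, key: str, now: float) -> bool:
        """Remove the key unless a lease is still active (assumed rule)."""
        i = self._shard_of(key)
        with self._locks[i]:
            expiry = self._shards[i].get(key)
            if expiry is None or expiry > now:
                return False
            del self._shards[i][key]
            return True


meta = ToyShardedMetadata()
meta.put("kv/block-42")
meta.grant_lease("kv/block-42", now=100.0, ttl=5.0)
evicted_early = meta.try_evict("kv/block-42", now=102.0)  # lease still active
evicted_late = meta.try_evict("kv/block-42", now=106.0)   # lease has expired
```

Because each operation takes only the lock for its own shard, requests for keys in different shards do not wait on each other, which is the contention benefit the Master Service entry describes.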

Sources: README.md82-84 docs/source/index.md49 MAINTAINERS.md8 docs/source/image/ssd_offload_overall.png1-10

Object Lifecycle and Memory Hierarchy

Mooncake manages objects through a state machine: ALLOCATING → PROCESSING → COMPLETE → EVICTED. Memory is managed across a hierarchy including DRAM, GPU memory (VRAM), CXL, and SSD docs/source/getting_started/supported-protocols.md17.
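
The lifecycle can be expressed as a small transition table. This sketch assumes the four states form a strictly linear chain; the text lists only the sequence, so any additional edges (for example, failure paths) are not modeled. `ObjectState` and `advance` are hypothetical names.

```python
from enum import Enum


class ObjectState(Enum):
    ALLOCATING = "ALLOCATING"
    PROCESSING = "PROCESSING"
    COMPLETE = "COMPLETE"
    EVICTED = "EVICTED"


# Assumed linear chain, taken from the sequence given in the text.
_NEXT = {
    ObjectState.ALLOCATING: ObjectState.PROCESSING,
    ObjectState.PROCESSING: ObjectState.COMPLETE,
    ObjectState.COMPLETE: ObjectState.EVICTED,
}


def advance(state: ObjectState) -> ObjectState:
    """Return the successor state, or raise if the state is terminal."""
    if state not in _NEXT:
        raise ValueError(f"{state.name} has no successor")
    return _NEXT[state]


# Walk one object through its whole lifecycle.
state = ObjectState.ALLOCATING
history = [state]
while state in _NEXT:
    state = advance(state)
    history.append(state)
```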

Sources: README.md82-84 docs/source/index.md38

Framework Integration

Mooncake enables prefill-decode (PD) disaggregation by separating compute-intensive prefill operations from latency-sensitive decode operations.

| Framework | Integration Point | Role |
| --- | --- | --- |
| vLLM | `mooncake_connector_v1.py` | PD-disaggregated KV transfer backend README.md43 |
| SGLang | `hicache` / `RadixAttention` | Hierarchical KV caching storage backend with SSD offload support README.md46 docs/source/getting_started/examples/sglang-integration-v1.md149-155 |
| TensorRT-LLM | `mooncake_utils` | Cache transmission for PD-disaggregation README.md42 |
| LMDeploy | PD Backend | PD disaggregation plugin README.md51 |
| TorchSpec | Hidden State Management | Decoupling inference and training README.md34 |
| FlexKV | `Mooncake Transfer Engine` | Distributed KVCache reuse README.md39 |
| vLLM-Omni | `MooncakeStoreConnector` | Multi-node omni-modality pipelines README.md37 |
| LightX2V | `Transfer Engine` | Encoder/Transformer service decoupling README.md35 |
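
The prefill-decode split that these integrations build on can be illustrated with a toy pipeline. Everything here is a stand-in: `ToyKVPool`, `cache_key`, `prefill`, and `decode` are hypothetical names, the "KV cache" is a list of integers, and the arithmetic replaces a real model forward pass. The sketch shows only the handoff pattern, in which one node publishes the cache for a prompt and another node fetches it to continue generation.

```python
import hashlib
from typing import Dict, List


class ToyKVPool:
    """In-memory stand-in for a shared KVCache pool."""

    def __init__(self) -> None:
        self._objects: Dict[str, List[int]] = {}

    def put(self, key: str, value: List[int]) -> None:
        self._objects[key] = value

    def get(self, key: str) -> List[int]:
        return self._objects[key]


def cache_key(tokens: List[int]) -> str:
    # Hypothetical keying scheme: a hash of the prompt's token ids.
    return hashlib.sha256(repr(tokens).encode("utf-8")).hexdigest()


def prefill(tokens: List[int], pool: ToyKVPool) -> str:
    """Prefill node: process the whole prompt once and publish its 'KV cache'."""
    fake_kv = [t * 2 for t in tokens]  # placeholder for real attention state
    key = cache_key(tokens)
    pool.put(key, fake_kv)
    return key


def decode(key: str, pool: ToyKVPool, steps: int) -> List[int]:
    """Decode node: fetch the published cache and extend it token by token."""
    kv = list(pool.get(key))           # copy so the pooled object is untouched
    generated = []
    for _ in range(steps):
        next_token = sum(kv) % 100     # placeholder for a model forward pass
        generated.append(next_token)
        kv.append(next_token * 2)
    return generated


pool = ToyKVPool()
handle = prefill([1, 2, 3], pool)       # would run on the prefill cluster
output = decode(handle, pool, steps=2)  # would run on the decode cluster
```

Only the key crosses between the two roles in this sketch; the cache itself moves through the pool, which is the part a transfer layer and a distributed store would handle in a real deployment.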

Sources: README.md34-60 docs/source/index.md30-53 docs/source/getting_started/examples/sglang-integration-v1.md1-172

Supported Hardware and Protocols

Mooncake supports a wide range of high-performance transport protocols and hardware accelerators through a pluggable architecture. For a full list, see Supported Protocols and Hardware.

Protocols: RDMA (InfiniBand/RoCE/eRDMA), NVLink, TCP, EFA, CXL, NVMe-oF docs/source/getting_started/supported-protocols.md7-18
Hardware: NVIDIA (CUDA), AMD (HIP/ROCm), Huawei (Ascend/NPU), Moore Threads (MUSA), Cambricon (MLU) docs/source/getting_started/supported-protocols.md15-18 mooncake-transfer-engine/include/gpu_vendor/mlu.h8

Sources: docs/source/getting_started/supported-protocols.md1-18 MAINTAINERS.md11-15 mooncake-transfer-engine/include/gpu_vendor/mlu.h1-37