
Building a production-grade cloud-native large model inference platform based on SGLang RBG + Mooncake

This article describes how to build a production-grade, cloud-native large language model inference platform based on SGLang, RBG, and Mooncake. Large language model inference services have become core infrastructure for enterprise applications, and they face challenges in performance, stability, and cost. The platform addresses GPU memory pressure and achieves high-performance inference through a distributed architecture with an external KVCache: Mooncake provides a high-throughput, low-latency distributed KVCache service, while RBG, a Kubernetes-native API, coordinates orchestration to tackle production-environment challenges.

