---
title: "How Singapore's travel giant Grab is reshaping the Kafka streaming platform with AutoMQ"
type: "News"
locale: "zh-CN"
url: "https://longbridge.com/zh-CN/news/266690870.md"
description: "How Singapore's travel giant Grab is reshaping the Kafka streaming platform with AutoMQ"
datetime: "2025-11-20T00:32:49.000Z"
locales:
  - [zh-CN](https://longbridge.com/zh-CN/news/266690870.md)
  - [en](https://longbridge.com/en/news/266690870.md)
  - [zh-HK](https://longbridge.com/zh-HK/news/266690870.md)
---

# How Singapore's travel giant Grab is reshaping the Kafka streaming platform with AutoMQ

Coban is Grab's real-time data streaming platform team, dedicated to building an ecosystem around Kafka that serves Grab's various business areas. The platform acts as the entry point to Grab's data lake, collecting data from different services for storage and subsequent analysis. It supports real-time processing and analysis of events, which is crucial for many applications and services, and it handles several TB of data streams per hour with high throughput, low latency, and high availability.

![Figure 1: Grab's data streaming platform](https://static001.geekbang.org/infoq/8f/8f217d60d2c921b82733be57918a0c61.webp)

Beyond stability and performance, cost efficiency is a key focus for the team. This article describes how the Coban team improved the efficiency of Grab's data streaming platform and effectively reduced costs by introducing AutoMQ.

## Pain Points

In the past, the data streaming platform faced four main challenges:

- Difficulty scaling computing resources: Scaling compute was hard, especially during partition migration, which could easily cause a surge in resource usage and limit operational flexibility.
- Disks cannot be scaled independently, increasing operational complexity: Disk usage varies significantly across broker nodes, and adding storage space requires either expanding the cluster or attaching more disks to broker nodes; neither option is ideal.
- Over-provisioning for peak demand wastes resources: Resources were provisioned for peak load, so cloud utilization was poor during off-peak periods, increasing costs and reducing efficiency.
- High-risk partition rebalancing: During cluster maintenance, partition rebalancing often caused prolonged latency increases, degrading overall system performance and user experience.

Facing these challenges, the team needed a solution that could address all of the above. On that basis, it drew up the following requirements and chose AutoMQ:

- Good elasticity: the ability to dynamically adjust computing resources, meeting peak demand and adapting to troughs without causing system interruptions.
- Separation of storage and compute: the ability to scale storage independently, to respond efficiently to elastic business demands and continuous growth.
- High compatibility with Kafka: seamless integration with Grab's existing data streaming platform, avoiding large-scale system overhauls and interruptions.
- Fast and stable partition migration: the ability to quickly reassign large partitions during traffic surges, essential for system performance and reliability.
- Low latency: support for existing latency-sensitive Kafka use cases, to ensure a smooth user experience.

![Figure 2: Wishlist for the new data streaming platform](https://static001.geekbang.org/infoq/e3/e356763f4edceadd1b6b8c4a0ff23bce.webp)

## Solution

To address the challenges above and meet business needs, the team introduced **AutoMQ**, a cloud-native Kafka solution with high elasticity and outstanding performance.

![Figure 3: New data streaming architecture using AutoMQ](https://static001.geekbang.org/infoq/24/24dfdfcd63896c63569eeb33d9d0116f.webp)

Figure 3 shows the new architecture of the data streaming platform after the introduction of AutoMQ. Because AutoMQ is 100% compatible with Apache Kafka®, it allows a smooth transition from the existing architecture to the new AutoMQ-based one.

AutoMQ employs a shared storage architecture based on an EBS WAL and S3. By using a fixed-size EBS volume as the WAL, the system provides extremely high performance and very low-latency writes without incurring additional cost. Meanwhile, all written data is stored in an S3 bucket, fully leveraging the high reliability, elasticity, and cost advantages of S3.

## Why Choose AutoMQ?

**Clusters Can Scale Quickly and Efficiently**

In the previous architecture, Kafka relied on a replication mechanism, and computational elasticity under this approach was poor: when data migrates between nodes, it must be transferred between brokers, often causing performance fluctuations and operational challenges. In AutoMQ, data lives in a storage layer shared across brokers. When the cluster needs to scale up or down, AutoMQ does not migrate partition data between brokers; it completes partition reassignment in just a few seconds, achieving truly fast and smooth cluster scaling.

**AutoMQ Uses On-Demand Scalable S3 Shared Storage**

AutoMQ stores data in object storage (such as S3).
S3 is an on-demand, scalable storage service: when a longer data retention period is needed, there is no longer any need to manually scale brokers or local disks as in the past, significantly reducing operational cost and complexity.

**Rapid Partition Reallocation Capability**

Reassigning partitions at scale with AutoMQ is very quick, requiring only a small amount of metadata to be synchronized for the switch. This advantage stems from AutoMQ's cloud-native architecture. Unlike Apache Kafka®, which relies on the ISR multi-replica mechanism for data durability, AutoMQ delegates durability to cloud storage services. Because cloud storage itself uses multi-replica and erasure-coding mechanisms, it inherently provides high reliability and availability, allowing AutoMQ to avoid a multi-replica structure at the broker layer. AutoMQ follows a **"Cloud-First"** design philosophy, shifting from a traditional hardware-dependent architecture to one centered on cloud services, fully unleashing the elasticity and performance potential of the cloud.

**Low Latency**

Low latency is key to how Grab's real-time data streaming platform improves user experience. Although object storage services (like S3) are not designed for low-latency writes, AutoMQ cleverly uses a fixed-size (10 GB) EBS block device to achieve single-digit-millisecond write latency. It bypasses file-system write overhead with **Direct I/O** and, thanks to its cloud-native architecture, avoids the network overhead of internal partition replication, achieving extremely high write performance and stability.

**100% Kafka Compatibility**

AutoMQ reuses the compute-layer code of Apache Kafka® and has passed all official test cases, achieving true full compatibility (100% Compatibility).
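Because AutoMQ speaks the Kafka protocol, migrating a client boils down to repointing its bootstrap endpoint while every other setting carries over. The sketch below illustrates this with a plain configuration dictionary; the hostnames and settings are hypothetical placeholders, not Grab's actual configuration.

```python
# Minimal sketch (hypothetical values): migrating a Kafka client to AutoMQ
# means changing only bootstrap.servers; the rest of the config is untouched.

kafka_config = {
    "bootstrap.servers": "kafka-broker-1:9092",  # existing Apache Kafka cluster
    "acks": "all",                               # durability setting, unchanged
    "compression.type": "lz4",                   # unchanged
    "linger.ms": 5,                              # unchanged
}

def repoint(config: dict, new_bootstrap: str) -> dict:
    """Return a copy of the client config aimed at the new cluster."""
    migrated = dict(config)
    migrated["bootstrap.servers"] = new_bootstrap
    return migrated

automq_config = repoint(kafka_config, "automq-broker-1:9092")

# Everything except the endpoint is identical -- no client rewrite needed.
unchanged_old = {k: v for k, v in kafka_config.items() if k != "bootstrap.servers"}
unchanged_new = {k: v for k, v in automq_config.items() if k != "bootstrap.servers"}
print(unchanged_old == unchanged_new)  # True
```

This is of course a schematic: in a real migration the same property substitution happens in whatever client library or config file the service already uses.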
This means a smooth transition to AutoMQ is possible without adjusting the existing Kafka infrastructure or rewriting client code, significantly reducing the cost and risk of the architectural migration.

## Evaluation and Production Environment Deployment

To ensure AutoMQ met expectations, the team conducted a comprehensive evaluation along three dimensions: **performance, reliability, and cost-effectiveness**.

First, **performance testing**. The team ran multiple rounds of benchmarks under different configurations, such as varying the replication factor and the producer acknowledgment setting, to assess whether performance met requirements under various load scenarios. These tests also surfaced AutoMQ's performance characteristics and highlighted potential optimization areas and details to watch during use.

For **reliability**, the team designed test cases and benchmark scenarios to verify behavior under different failure types, for example simulating smooth failover during planned maintenance and emergency recovery after sudden infrastructure failures, ensuring the system stays stable under all circumstances.

Finally, the team evaluated **cost-effectiveness**. AutoMQ performed excellently across the board, passing every benchmark and use-case validation. Based on these results, the team gained confidence in its production feasibility and decided to formally deploy AutoMQ in Grab's actual business scenarios.

Previously, the team used Strimzi, the open-source community's Kafka operator, to manage and operate Kafka clusters on Kubernetes. To support AutoMQ, the operator's functionality was extended to cover the creation, mounting, and authorization of WAL volumes, achieving seamless integration between AutoMQ and Strimzi.
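The operator extension described above can be thought of as three idempotent reconcile steps per broker: create the WAL volume, attach it, and authorize writes. The sketch below is purely schematic (not actual Strimzi or AutoMQ code), with the cloud API stubbed out as a dictionary so the control flow is visible.

```python
# Schematic reconcile loop (hypothetical, not the real operator extension):
# ensure each broker has a WAL volume that is created, attached, and authorized.

def reconcile_wal_volume(broker_id: int, cloud: dict) -> str:
    """Idempotently ensure a broker's WAL volume exists and is usable."""
    vol_id = f"wal-vol-{broker_id}"
    if vol_id not in cloud:                                   # 1. create volume
        cloud[vol_id] = {"attached_to": None, "authorized": False}
    cloud[vol_id]["attached_to"] = f"broker-{broker_id}"      # 2. mount/attach
    cloud[vol_id]["authorized"] = True                        # 3. grant access
    return vol_id

cloud_state: dict = {}
for broker in range(3):          # reconcile a hypothetical 3-broker cluster
    reconcile_wal_volume(broker, cloud_state)

print(sorted(cloud_state))  # ['wal-vol-0', 'wal-vol-1', 'wal-vol-2']
```

Running the loop again leaves the state unchanged, which is the property an operator's reconcile cycle relies on.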
Additionally, the team systematically studied AutoMQ internals, such as the S3 storage mechanism and WAL-related metrics, to use and manage the AutoMQ cluster more effectively in production.

## Usage Effects

After introducing AutoMQ, the data streaming platform achieved significant improvements in several areas:

- Throughput increased significantly: With data replication moving from broker-to-broker copying to cloud-storage replication, single-core CPU throughput tripled. By throughput, this cluster is now among the largest in Grab's internal service matrix.
- Cost-effectiveness improved: Preliminary figures show a threefold improvement in overall cost-effectiveness.
- Partition reassignment became far faster: Reassigning partitions across the entire cluster used to take up to 6 hours; it now completes in under 1 minute.

![Figure 4: Performance when scaling brokers under the previous setup](https://static001.geekbang.org/infoq/17/17478d2ee0722a73c62000a753053d77.webp)

![Figure 5: Performance when scaling brokers under the new AutoMQ setup](https://static001.geekbang.org/infoq/77/7728c927ed013f10ff479568556ae134.webp)

Figures 4 and 5 show the differences in key performance indicators during broker elastic scaling between the old and new architectures. The new AutoMQ-based architecture not only scales rapidly but also shows smaller performance fluctuations, making the cluster more stable overall.

With AutoMQ's shared storage architecture, partition reassignment is dramatically faster than before: each reassignment now completes in a few seconds. Broker stability has improved as well, since data no longer needs to be copied between brokers during partition migration.
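The difference in reassignment cost can be modeled very simply: under replica-based Kafka the destination broker must re-replicate the entire partition log, while under a shared-storage design the brokers only flip a partition-ownership entry and the data stays in S3. The sketch below is a toy model of that contrast, not AutoMQ internals; the partition size and names are made up.

```python
# Toy model (assumed numbers): bytes that must move to reassign one partition.

PARTITION_SIZE_BYTES = 50 * 1024**3  # a hypothetical 50 GiB partition

def reassign_with_replication(partition_size: int) -> int:
    """Classic Kafka: the new broker replicates the full partition log."""
    bytes_copied = partition_size
    return bytes_copied

def reassign_with_shared_storage(ownership: dict, partition: str,
                                 new_broker: str) -> int:
    """Shared storage: update a metadata entry; no log data moves."""
    ownership[partition] = new_broker
    return 0  # bytes copied

ownership = {"orders-0": "broker-1"}
moved_old = reassign_with_replication(PARTITION_SIZE_BYTES)
moved_new = reassign_with_shared_storage(ownership, "orders-0", "broker-2")

print(moved_old, moved_new)  # 53687091200 0
```

The metadata flip is why reassignment time drops from hours to seconds: its cost is independent of partition size.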
This eliminates the spikes in I/O and network utilization: because data does not move between brokers, the cluster is more stable and the performance spikes that used to accompany operational maintenance are gone. And because partition reassignment is so fast, the impact on clients is greatly reduced, with almost no latency increase for producers or consumers during scaling or migration.

Additionally, thanks to the shared storage architecture, storage can now be scaled independently. In the old architecture, adding storage capacity meant adding broker nodes or expanding a single broker's disk. This wasted computing resources (the compute of newly added nodes often sat idle) and triggered cluster rebalancing during scaling, which increased latency for producer and consumer clients and reduced system stability.

## Future Outlook

Since introducing AutoMQ, the team has seen significant benefits in multiple areas and plans to further optimize overall efficiency.

First, they hope to further **improve the utilization of computing resources**. AutoMQ has a built-in but not-yet-enabled feature, **Self-Balancing**, similar to Cruise Control, a commonly used open-source tool for Apache Kafka®. Self-Balancing automatically triggers partition rebalancing periodically as needed, letting the cluster's compute flex across peak and low periods for more efficient resource scheduling.

Second, they will continue to **optimize cost-effectiveness**. Because the platform can now tolerate more frequent interruptions, and the cost of partition reassignment is nearly negligible, the team can focus on auto scaling and spot instances to achieve cost savings.
During business peaks the cluster can scale up to cope; during off-peak or low-load periods it scales down automatically, further improving resource utilization and cost-effectiveness.

The team is also exploring AutoMQ's **S3 WAL** streaming storage engine to reduce cross-availability-zone traffic between clients and brokers. In addition, AutoMQ offers a feature called **Table Topic**, which streams a topic's data directly into S3 in Iceberg table format, taking full advantage of the S3 Tables feature recently released by AWS. The team plans in-depth research here, hoping Table Topic can eliminate some redundant data pipelines.

Finally, given AutoMQ's outstanding performance within Grab, the team plans to promote it in more business scenarios and migrate more data-stream use cases to AutoMQ to further unleash its potential.