
The Moment of 'Division of Labor' for AI Chips! Why Does Google's Eighth-Generation TPU Come in Two Models?

Google answers a single question, efficiency, with two chips. The TPU 8t targets massive-scale training and throughput, leveraging SparseCore, FP4, and a new network architecture to significantly boost computational scalability; the TPU 8i focuses on low-latency inference, improving concurrency and decoding efficiency through ultra-large SRAM and CAE. Both share a unified software stack and are deeply integrated into cloud AI infrastructure, directly addressing the divergence of AI workloads and the trend toward optimizing compute costs.

