SemiAnalysis in-depth interpretation of TPU - Google's impact on the "NVIDIA Empire"

Wallstreetcn
2025.11.29 05:02

For NVIDIA, what was once its largest customer has now become the competitor that understands it best. When OpenAI can leverage the mere threat of buying TPUs into a 30% discount, when Anthropic can train models that surpass GPT-4 on TPUs, and when Google is willing to open up its software ecosystem and provide financial backing, NVIDIA's vaunted 75% gross margin is no longer untouchable.

The AI chip market in 2025 is at a delicate turning point.

On one hand, NVIDIA still maintains absolute leadership in technology and market share with Blackwell; but on the other hand, the full commercialization of Google's TPU is causing NVIDIA's seemingly unbreakable pricing power to loosen.

According to estimates from semiconductor industry research firm SemiAnalysis, OpenAI has forced substantial concessions from NVIDIA's ecosystem simply by using the "threat to purchase TPU" as leverage, resulting in a reduction of approximately 30% in the total cost of ownership (TCO) of its computing clusters.

With the details of Anthropic's procurement of up to 1GW of TPU being revealed, Google has officially torn off the mask of a "cloud service provider" and transformed into a "commercial chip supplier" that directly sells high-performance chips and systems to external customers.

When OpenAI can leverage the "threat to purchase TPU" for a 30% discount, when Anthropic can train models surpassing GPT-4 using TPU, and when Google is willing to open its software ecosystem and provide financial leverage, NVIDIA's myth of a 75% gross margin is no longer unbreakable.

For NVIDIA, its onetime largest customer has now become the competitor that understands it best.

(Chart: Cost per million input and output tokens)

Google's "Proactive Attack"

For a long time, Google's TPU has been like its search algorithm, a hidden internal weapon. However, supply chain intelligence obtained by SemiAnalysis indicates that this strategy has undergone a fundamental reversal.

The most direct case comes from Anthropic. As a large model company that can rival OpenAI on cutting-edge models, Anthropic has confirmed it will deploy over 1 million TPUs. The structure of this deal is highly disruptive, revealing Google's new model of "mixed sales":

Of the 1 million chips, the first batch of approximately 400,000 of the latest TPUv7 "Ironwood" will no longer be leased through the cloud but will be sold directly by Broadcom to Anthropic, valued at about $10 billion. As a long-term joint design partner of TPU, Broadcom has stepped into the spotlight in this deal, becoming the invisible winner of this computing power transfer.

The remaining 600,000 TPUv7 will be leased through Google Cloud. It is estimated that this portion of the deal involves up to $42 billion in remaining performance obligations (RPO), directly supporting the recent surge in backlogged orders for Google Cloud.

The signal of this move is very clear: Google is no longer stingy about selling its most advanced computing power externally. In addition to Anthropic, top AI labs such as Meta, SSI, and xAI have also appeared on the potential customer list.

In the face of this sudden offensive, NVIDIA has struck an uncharacteristically defensive posture: its finance team recently had to publish a lengthy rebuttal to questions about the "circular economy" (i.e., investing in startups that then purchase its chips). This sensitivity to market sentiment is precisely what shows that Google's offensive has touched a nerve at NVIDIA.

Cost is the Hard Truth

The reason for customers switching sides is very straightforward: in the AI arms race, performance is the entry ticket, but TCO (Total Cost of Ownership) determines life and death.

Models from SemiAnalysis show that Google's TPUv7 has a crushing advantage over NVIDIA in cost efficiency.

From Google's internal perspective, the TCO of TPUv7 servers is about 44% lower than that of NVIDIA's GB200 servers. Even when adding the profits of Google and Broadcom, the TCO for Anthropic using TPU through GCP is still about 30% lower than purchasing GB200.

This cost advantage is not achieved merely by cutting chip prices; it stems from a piece of financial engineering unique to Google: the "hyperscaler backstop."

In AI infrastructure buildouts there is a significant duration mismatch: the economic life of a GPU cluster is only 4-5 years, while data center leases typically run more than 15 years. This mismatch makes it hard for emerging compute providers such as Fluidstack and TeraWulf to obtain financing.

Google has solved this problem through a form of "off-balance-sheet" credit support (IOU): Google promises that if intermediaries cannot pay rent, Google will step in to back them up.

This financial tool directly connects cryptocurrency miners (who have power and space) with AI computing demand, creating a low-cost infrastructure ecosystem independent of NVIDIA's system.

Not Just Chips, But Systems

If the price war is a tactical confrontation, then system engineering is Google's strategic moat.

Previously, there was a prevailing view in the industry that "systems are more important than microarchitecture." Now, this assertion has been validated with TPUv7. Although a single TPUv7 slightly lags behind NVIDIA's Blackwell in theoretical peak computing power (FLOPs), Google has narrowed the gap through extreme system design.

Today, TPUv7 "Ironwood" has significantly narrowed the gap with NVIDIA's flagship chips in memory bandwidth and capacity. More importantly, it takes a more pragmatic design approach: rather than chasing unsustainable peak clock frequencies, it raises real-world output through higher Model FLOPs Utilization (MFU).

Google's real killer feature is its unmatched inter-chip interconnect (ICI). Unlike NVIDIA, which relies on expensive NVLink and InfiniBand/Ethernet switches, Google has built its ICI fabric around self-developed optical circuit switches (OCS) and a 3D torus topology.

This architecture lets a single TPUv7 pod scale up to an astonishing 9,216 chips, far beyond the 64- or 72-accelerator scale-up domains common elsewhere, and the OCS allows the topology to be dynamically reconfigured through software-defined networking.

This means that if some chips fail, the network can route around the fault within milliseconds and re-slice itself into a complete 3D torus, greatly improving cluster availability. In addition, optical signals passing through an OCS need no optical-to-electrical conversion; they are simply reflected from input port to output port, which significantly cuts power consumption and latency.

Gemini 3 and Claude 4.5 Opus, two of the strongest models in the world, were both pre-trained on TPUs (Gemini 3 entirely so), which is itself the ultimate endorsement of the TPU system's ability to handle the most demanding workload of all: frontier-model pre-training.

Removing the Last Barrier: Changes in Software Ecosystem

For a long time, the biggest obstacle keeping external customers from adopting TPUs was software: Google stubbornly stuck to its JAX framework while the world's AI developers were building on PyTorch and CUDA.

However, in the face of enormous commercial interests, Google has finally put aside its arrogance.

A report from SemiAnalysis pointed out that the KPIs of Google's software team have undergone significant adjustments, shifting from "serving internally" to "embracing open source."

Google's Robert Hundt has announced that they will fully support native PyTorch running on TPU.

Google no longer relies on the inefficient lazy-tensor conversion path, but instead supports PyTorch's eager execution mode directly on top of the XLA compiler stack. This means customers like Meta, who are used to PyTorch, can migrate their code to TPU almost seamlessly. At the same time, Google has begun contributing large amounts of code to open-source inference frameworks such as vLLM and SGLang, closing TPU's gap in the open-source inference ecosystem.

This shift means that NVIDIA's strongest "CUDA moat" is being filled by Google with "compatibility."

And this battle for the "Silicon Valley throne" has just begun.

Full Translation

The following is the full translation of SemiAnalysis's report (translated by AI):

TPUv7: Google Takes a Swing at the King

The end of the CUDA moat? Anthropic signs a 1GW+ TPU procurement deal; the more TPUs purchased by Meta/SSI/xAI/OAI/Anthro, the more GPU capital expenditures (Capex) can be saved; the next-generation TPUv8AX and TPUv8X will face off against Vera Rubin.

The two leading models in the world today—Anthropic's Claude 4.5 Opus and Google's Gemini 3—run most of their training and inference infrastructure on Google's TPU and Amazon's Trainium. Now, Google is breaking the norm by directly selling physical TPU hardware to multiple companies. Is this the prologue to the end of Nvidia's dominance?

The dawn of the AI era has arrived, and it is crucial to understand that the cost structure of AI-driven software is fundamentally different from that of traditional software. Chip microarchitecture and system architecture play a decisive role in the development and scaling of these innovative software. Compared to the early software era, where developer costs were relatively high, the hardware infrastructure for running AI software has a significantly greater impact on capital expenditures (Capex) and operating expenses (Opex)—and thus on gross margins. Therefore, it has become unprecedentedly critical to invest significant effort in optimizing AI infrastructure to deploy AI software. Companies with advantages in infrastructure will undoubtedly occupy a high ground in their ability to deploy and scale AI applications.

As early as 2006, Google toyed with the idea of building AI-specific infrastructure, but the issue came to a head in 2013: the company realized that deploying AI at any real scale would require doubling its existing data center footprint. So it began laying the groundwork for TPU chips, which went into production in 2016. Interestingly, Amazon recognized the need for custom silicon in the same year: in 2013 it launched the Nitro project, focused on chips to optimize general-purpose CPU compute and storage. Two very different companies thus optimized their infrastructure paths for different computing eras and software paradigms. We have long believed that TPU is one of the best systems in the world for AI training and inference, on par with the "king of the jungle," Nvidia. Two and a half years ago we wrote an article about "TPU hegemony," a call that has been proven very correct over time.

The achievements of TPU are self-evident: Gemini 3 is one of the best models in the world and is entirely trained on TPU. In this report, we will delve into Google's significant strategic shift—appropriately commercializing TPU for external customers, making it Nvidia's latest and most threatening Merchant Silicon challenger.

The plan for this report:

  • (Re)introduce to our clients and new readers the rapidly growing commercial success of TPUs with external customers, starting with Anthropic and extending to Meta, SSI, xAI, and even potentially OpenAI...
  • Demonstrate the core logic: The more TPU you purchase, the more you save on Nvidia GPU capital expenditures! OpenAI has already gained about a 30% discount on computing cluster costs through competitive threats, even before deploying TPU, thus improving performance per TCO (Total Cost of Ownership).
  • Explain the "circular economy" deal of AI infrastructure.
  • Revisit our original in-depth analysis of TPU, providing a comprehensive update on the TPU hardware stack from chip to software layer.
  • Cover positive progress in the open software ecosystem and the missing key elements that Google needs to make the TPU ecosystem a viable challenger to the CUDA moat: open-sourcing their XLA:TPU compiler, runtime, and multi-Pod "MegaScaler" code.
  • In paid content, we will discuss the implications for Nvidia's moat and compare Vera Rubin with the next-generation TPUv8AX/8X (also known as Sunfish/Zebrafish).
  • Also cover the long-term threat to Nvidia.

First, let's talk about the impact of this news on the ecosystem. The performance of TPU has clearly caught the attention of competitors. Sam Altman acknowledged that OpenAI is facing "rough vibes" due to Gemini stealing the spotlight from OpenAI. Nvidia even released a reassuring PR statement urging everyone to stay calm and carry on—claiming that it is still far ahead of its competitors.

We understand the reasons behind this. The past few months have been a series of victories for Google DeepMind, GCP (Google Cloud Platform), and the TPU ecosystem. The significant ramp-up in TPU production, the expansion of Anthropic's TPU capacity exceeding 1GW, the training of state-of-the-art (SOTA) models Gemini 3 and Opus 4.5 on TPUs, and the now-expanding list of target customers (Meta, SSI, xAI, OAI) queuing for TPUs have driven a massive revaluation of the value of Google and the TPU supply chain, at the expense of Nvidia's GPU supply chain. While the "sudden" rise of Google and the TPU supply chain has surprised many, institutional product subscribers of SemiAnalysis had anticipated this over the past year.

(Chart: Comparison of TPU, Trainium, and Nvidia risk exposure infrastructure baskets)

Another reason Nvidia is on the defensive is that an increasing number of skeptics believe the company is sustaining a "circular economy" by funding cash-burning AI startups, essentially transferring money from one pocket to another through additional steps. We believe this perspective is biased, but it has clearly struck a nerve within Nvidia. The finance team released a detailed response, which is reproduced below.

Circular financing is an unsustainable business practice

Allegation: NVIDIA is involved in a $61 billion circular financing scheme, where NVIDIA invests in AI startups, the startups commit to cloud spending, cloud service providers (CSPs) and startups purchase NVIDIA hardware, NVIDIA recognizes revenue, but cash never completes the cycle because the underlying economic activity—profitable AI applications—remains insufficient.

Response: First, NVIDIA's strategic investments account for only a small portion of NVIDIA's revenue, and an even smaller share of the approximately $1 trillion raised annually in the global private capital markets. In the third quarter and year-to-date, NVIDIA's investments in private companies were $3.7 billion and $4.7 billion, respectively, accounting for 7% and 3% of revenue. Companies in NVIDIA's strategic investment portfolio primarily raise funds from third-party financing providers rather than from NVIDIA.

Second, NVIDIA is fully transparent about its strategic investments, which are reported on the balance sheet as long-term assets and securities, on the income statement as other income and expenses (OI&E), and on the cash flow statement as cash flows from investing activities.

Third, the companies in NVIDIA's strategic investment portfolio are rapidly increasing their revenues, indicating a strong potential customer demand for AI applications and their path to profitability. The companies in NVIDIA's strategic investment portfolio primarily generate revenue from third-party customers rather than from NVIDIA.

We believe a more realistic explanation is that NVIDIA aims to protect its dominance among frontier labs by offering equity investments instead of price cuts, since price reductions would lower gross margins and trigger widespread investor panic. Below, we outline the OpenAI and Anthropic arrangements to show how frontier labs can reduce their GPU TCO by purchasing, or merely threatening to purchase, TPUs.

(Table: The more TPUs you buy, the more GPU costs you save) Source: SemiAnalysis TCO model, Anthropic and OpenAI

OpenAI has not even deployed TPUs yet, but it has already saved about 30% across the lab's entire NVIDIA fleet. This shows how strong TPU's performance-per-TCO advantage is: you can capture the benefits of adopting TPUs before ever turning one on.

Our accelerator industry model, data center industry model, and core research subscribers saw the industry impact long before this news was announced and became market consensus. In early August, we shared with our accelerator model clients that we observed a massive upward adjustment in Broadcom/Google TPU orders in the supply chain for 2026. We also revealed that the reason for this increase in orders is that Google will begin selling systems externally to multiple customers. In early September, we disclosed that one of the large external customers would be Anthropic, with a demand for at least 1 million TPUs. This was formally confirmed by Anthropic and Google in October. We also noted on November 7 that Meta is a large TPU customer, ahead of others by several weeks. Additionally, we discussed other customers.

As a result, our institutional clients had fully anticipated the largest performance dispersion in AI trading to date. SemiAnalysis was the first to disclose all of these insights, as no other research firm can connect the dots from the foundry to the supply chain, through the data center, and into the labs. Back to the point.

Google's Large-Scale TPU Externalization Push and Anthropic Deal

The TPU stack has long been comparable to Nvidia's AI hardware, but it primarily supported Google's internal workloads. In keeping with Google's usual style, even after offering TPUs to GCP customers in 2018, it never fully commercialized them. That is beginning to change. In recent months, Google has mobilized the entire stack to bring TPUs to external customers, whether through GCP or by selling complete TPU systems as a merchant vendor. The search giant is leveraging its strong in-house chip design capability to become a truly differentiated cloud provider. This also aligns with marquee customer Anthropic's ongoing strategy of reducing its dependence on Nvidia (NVDA).

(Chart: Anthropic FLOP Portfolio)

The deal with Anthropic marks an important milestone in this push. We understand that GCP CEO Thomas Kurian played a central role in the negotiations. Google committed early on to invest actively in Anthropic's funding rounds, even agreeing to forgo voting rights and cap its ownership at 15%, in order to extend TPU usage beyond Google itself. The presence of former DeepMind TPU talent inside frontier labs made this strategy easier to execute, and Anthropic has trained Sonnet and Opus 4.5 on a mix of hardware that includes TPUs. Google has established substantial facilities for Anthropic, as shown below, part of our "Building-by-Building Tracking of AI Labs" project.

(Image: Data Center Industry Model)

In addition to renting Google data center capacity through GCP, Anthropic will also deploy TPUs in its own facilities, allowing Google to compete directly with Nvidia as a true commercial hardware vendor.

Regarding the split of 1 million TPUs:

  1. The first phase of the deal covers 400,000 TPUv7 Ironwood, valued at approximately $10 billion in finished racks, which Broadcom will sell directly to Anthropic. Anthropic is the fourth customer referenced on Broadcom's most recent earnings call. Fluidstack, a gold-tier ClusterMax Neocloud provider, will handle on-site setup, cabling, burn-in, acceptance testing, and remote-hands work, because Anthropic has outsourced management of the physical servers. The data center infrastructure will be provided by TeraWulf (WULF) and Cipher Mining (CIFR).
  2. The remaining 600,000 TPUv7 units will be leased through GCP, and we estimate that the remaining performance obligation (RPO) for this transaction is $42 billion, which accounts for most of the $49 billion increase in backlog orders reported by GCP in the third quarter.
  3. We believe that additional transactions with Meta, OAI, SSI, and xAI in the coming quarters may provide GCP with additional RPO + direct hardware sales.

Despite the enormous internal and external demand, Google has failed to deploy TPUs at the pace it hoped. Although Google has more control over its hardware supply compared to other hyperscale vendors that still need to "please" Jensen Huang, Google's main bottleneck is power.

While other hyperscale vendors are expanding their sites and acquiring significant hosting capacity, Google's actions have been slower. We believe the core issue lies in contractual and administrative aspects. Each new data center vendor requires a Master Service Agreement (MSA), which involves billions of dollars and multi-year commitments, naturally involving some bureaucracy. However, Google's process is particularly slow, often taking up to three years from initial discussions to signing the MSA.

Google's workaround has significant implications for Neocloud providers and cryptocurrency miners seeking to transition to AI data center infrastructure. Google does not lease directly but provides a credit backstop, meaning that if Fluidstack cannot pay its data center rent, Google will step in to pay, which is an off-balance-sheet "IOU."

(Chart: Fluidstack Transaction Overview)

Neocloud providers like Fluidstack are flexible and agile, making it easier for them to deal with new data center vendors like "transformed cryptocurrency miners." This mechanism has been key to our optimism about the cryptocurrency mining industry—it's worth noting that we identified numerous companies, including IREN and Applied Digital, when their stock prices plummeted earlier this year.

The opportunity for miners lies in a simple dynamic: the data center industry faces severe power constraints, while cryptocurrency miners have already controlled capacity through their Power Purchase Agreements (PPAs) and existing power infrastructure. We expect more agreements to be reached in the coming weeks and quarters.

How Google is Reshaping the Neocloud Market

Before the transactions involving Google/Fluidstack/TeraWulf, we had never seen any deals in the Neocloud market achieved solely through "IOUs" off the balance sheet. After the transactions, we believe it has become the new de facto standard financing template. This addresses a key challenge for Neocloud in securing data center capacity and growing its business:

  • The useful and economic lifespan of GPU clusters is 4-5 years.
  • Large data center leases typically exceed 15 years, with a typical payback period of about 8 years.

This mismatch in timelines makes project financing very complex for Neocloud and data center providers. However, with the rise of "hyperscaler backstops," we believe the financing issue has been resolved. We expect a new wave of growth in the Neocloud industry. Check out our accelerator and data center models to understand the main beneficiaries. These are the ways and reasons behind the Anthropic deal, and now let's move on to the hardware section.

Additionally, Neoclouds that count Jensen as an investor, including CoreWeave, Nebius, Crusoe, Together, Lambda, Firmus, and Nscale, have a clear incentive not to adopt any competing technology in their data centers: TPUs, AMD GPUs, and even Arista switches are off-limits! This leaves a significant gap in the TPU hosting market, currently filled by crypto miners plus Fluidstack. In the coming months, we expect more Neoclouds to face tough decisions between chasing the growing TPU hosting opportunity and securing allocations of the latest and greatest Nvidia Rubin systems.

TPUv7 Ironwood – Why do Anthropic and other clients want TPU?

The answer is simple. It is a powerful chip in an excellent system, and this combination provides Anthropic with compelling performance and TCO. Two and a half years ago, we wrote about Google's advantages in computing infrastructure. Even if the chips lag behind Nvidia on paper, Google's system-level engineering allows the TPU stack to compete with Nvidia in performance and cost efficiency.

At that time, we believed that "systems are more important than microarchitecture," and the events of the past two years have reinforced this view. Anthropic's large-scale TPU orders are a direct validation of the platform's technological strength. The GPU ecosystem has also made significant strides. Nvidia's GB200 represents a massive leap, pushing Nvidia to become a true systems company, designing complete servers rather than just internal chip packages.

When we talk about the significant innovations of GB200 in rack-level interconnects, one underrated point is that Google has been scaling up TPUs within and across racks since TPU v2 in 2017! At the end of the report, we will conduct an in-depth analysis of Google's ICI extended network, which is the only true competitor to Nvidia's NVLink.

Google's recent Gemini 3 model is now regarded as the state-of-the-art frontier LLM. Like all earlier versions of Gemini, it is fully trained on TPUs. This result provides concrete evidence of TPU capabilities and Google's broader infrastructure advantages.

Today's attention is often focused on inference and post-training hardware, but pre-training frontier models remains the most challenging and resource-intensive task in AI hardware, and the TPU platform has decisively passed this test. This stands in stark contrast to competitors: OpenAI's leading researchers have not completed a successful full-scale pre-training run that is widely used in a new frontier model since GPT-4o in May 2024, underscoring the significant technical hurdles that Google's TPU fleet has cleared.

A key highlight of the new model is a significant improvement in tool calling and agentic capabilities, particularly on economically valuable long-horizon tasks. Vending-Bench is a benchmark that measures a model's ability to run a business over the long term by placing it in the role of owner of a simulated vending machine operation, and Gemini 3 outperformed its competitors on it.

(Chart: Vending-Bench funding changes over time)

This release brings not only stronger capabilities but also new products. Antigravity, a product born of Google's acqui-hire of Windsurf CEO Varun Mohan and his team, is Google's answer to OpenAI Codex, officially bringing Gemini into the token-burning battle of "vibe coding."

For Google, quietly intervening and establishing a performance lead on one of the most challenging hardware issues is indeed an impressive feat for a company whose core business is not—or should we say, once was not—hardware.

Microarchitecture Still Matters: Ironwood Approaches Blackwell

One implication of the view that "systems are more important than microarchitecture" was that, while Google pushed the boundaries of system and network design, the TPU chips themselves were not particularly groundbreaking. Since then, the chips have made significant progress over the latest generations.

From the beginning, Google's design philosophy has been more conservative compared to Nvidia's on chips. Historically, the peak theoretical FLOPs of TPUs have been significantly lower, and the memory specifications have also been below those of corresponding Nvidia GPUs.

There are three reasons for this. First, Google places a high internal premium on the "RAS" (Reliability, Availability, and Serviceability) of its infrastructure and is willing to sacrifice absolute performance for higher hardware uptime. Running devices at their limits means a higher hardware failure rate, which has a tangible impact on system downtime and the TCO of hot spares. After all, hardware you cannot use has an infinite TCO per unit of performance.

The second reason is that, until 2023, Google's primary AI workloads were the recommendation-system models powering its core search and advertising businesses. Compared to LLM workloads, RecSys workloads have much lower arithmetic intensity, meaning fewer FLOPs are required per unit of data moved.

(Chart: Reco vs. LLM)

The third point comes down to how useful the marketed "peak theoretical FLOPs" numbers actually are, and how they get massaged. Merchant GPU vendors like Nvidia and AMD want to market the best possible specifications for their chips, which gives them an incentive to stretch the advertised FLOPs as high as possible. In reality, those numbers are unsustainable. TPUs, by contrast, were aimed primarily at internal use, so the pressure to exaggerate specifications externally is much lower. This has important implications that we discuss further below. The charitable view is that Nvidia is simply better at DVFS (dynamic voltage and frequency scaling) and is therefore happy to report only peak specifications.

Once the industry entered the LLM era, Google's TPU design philosophy shifted noticeably. We can see the change in the two most recent generations designed with LLMs in mind: TPUv6 Trillium (Ghostlite) and TPUv7 Ironwood (Ghostfish). In the chart below, TPUv4 and v5 deliver computational throughput far below Nvidia's contemporaneous flagships. TPUv6 comes very close to H100/H200 in FLOPs, but arrived two years after the H100. With TPUv7 the gap narrows further, with servers available just a few quarters later while offering nearly the same peak theoretical FLOPs.

(Chart: Comparison of TPU and Nvidia's TFLOPs and system availability (BF16 Dense))

What drives these performance improvements? Part of it is that Google now announces TPUs when they go into production rather than after the next generation is already deployed. More substantively, TPUv6 Trillium is manufactured on the same N5 node as TPUv5p, with similar silicon area, yet delivers an astonishing 2x jump in peak theoretical FLOPs while cutting power consumption significantly. For Trillium, Google quadrupled each systolic array from 128 x 128 to 256 x 256 tiles, and that larger array is what delivers the extra compute.

(Table: Google TPU Chip Specifications)

Trillium is also the last "e" (lite) SKU, meaning it carries only 2 HBM3 sites. While Trillium narrowed the compute gap with Hopper, it falls well short of the H100/H200 in memory capacity and bandwidth, with only 2 stacks of HBM3 versus 5 and 6 stacks of HBM3 and HBM3E, respectively. That makes it painful for newcomers to use, but if you shard the model correctly and exploit all those cheap FLOPs, the performance per TCO Trillium delivers is unmatched.

(Chart: TPU v6 (Trillium) vs H100 (SXM) Comparison)

TPUv7 Ironwood is the next iteration, in which Google has almost completely closed the gap with the corresponding Nvidia flagship GPU in FLOPs, memory, and bandwidth, although its full market ramp comes about a year after Blackwell. Compared to GB200 there is only a slight shortfall in FLOPs and memory bandwidth, with capacity on par with 8-Hi HBM3E, though of course that is well below the GB300's 288GB of 12-Hi HBM3E.

(Chart: TPU v7 (Ironwood) vs GB200/GB300 Comparison)

Theoretical absolute performance is one thing, but what really matters is the real-world performance of total cost of ownership (TCO).

Although Google procures TPUs through Broadcom and pays it a healthy margin, that is far less than the margin Nvidia collects not only on the GPU but on the entire system, including CPUs, switches, NICs, system memory, cabling, and connectors. From Google's perspective, this puts the fully burdened TCO per Ironwood chip, for the full 3D torus configuration, at about 44% below the TCO of GB200 servers. That is more than enough to offset the roughly 10% shortfall in peak FLOPs and peak memory bandwidth. This is from Google's vantage point, at the prices Google pays for TPU servers.

(Table: Nvidia vs TPU SKU performance comparison per TCO)

So what about external customers, once Google layers its own margin on top? Assuming Google rents TPUv7 to external customers at a profit, we estimate the hourly TCO can still come in about 30% below GB200 and about 41% below GB300. We believe this reflects the pricing Anthropic is getting through GCP.

(Chart: Hourly total cost comparison (USD/hr/GPU))

Why Anthropic Bets on TPU

Comparing theoretical FLOPs only tells part of the story. What's important is effective FLOPs, as peak numbers are rarely achieved in actual workloads.

In practice, once communication overhead, memory stalls, power limits, and other system effects are taken into account, Nvidia GPUs typically only achieve a small fraction of their theoretical peak. A rule of thumb for training is 30%, but utilization also varies by workload. A significant portion of the gap comes down to software and compiler efficiency. Nvidia's advantage in this area stems from the CUDA moat and a wide array of out-of-the-box open-source libraries that help workloads run efficiently, achieving high FLOPs and memory bandwidth utilization.

The TPU software stack is not as easy to use, although this is beginning to change. Internally at Google, TPUs benefit from excellent internal tools that are not available to external customers, resulting in weaker out-of-the-box performance. However, this only applies to small and/or lazy users, which Anthropic is neither.

Anthropic has strong engineering resources and former Google compiler experts who deeply understand both the TPU stack and their own model architectures. They can invest in custom kernels to drive TPU efficiency high, and as a result they achieve significantly higher MFU and better performance per dollar per PFLOP.

We believe that, despite the lower marketed peak FLOPs, TPUs can achieve a higher realized model FLOPs utilization (MFU) than Blackwell, meaning Ironwood's effective FLOPs are higher. One major reason is that the GPU FLOPs marketed by Nvidia and AMD are significantly exaggerated. Even in tests designed to maximize throughput via GEMMs (with shapes far from real workloads), Hopper only reaches about 80% of peak, while Blackwell lands around 70% and AMD's MI300 series sits between 50% and 60%.

The limiting factor is power delivery. These chips cannot maintain the clock speeds used in peak mathematical operations. Nvidia and AMD implement Dynamic Voltage and Frequency Scaling (DVFS), which means the chip's clock frequency dynamically adjusts based on power consumption and heat, rather than a stable clock frequency that can actually be maintained. Nvidia and AMD then choose the highest clock frequency that can be delivered (even if very intermittent) to compute peak theoretical FLOPs (operations per cycle/ALU x number of ALUs x cycles per second, i.e., clock frequency).

Other tricks are used, such as running GEMM on zero-filled tensors, because 0x0=0, and transistors do not need to switch from 0 to 1, thus reducing power consumption per operation. Of course, in the real world, zero-filled tensors do not multiply.
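To make this concrete, here is a minimal sketch of the peak-FLOPs formula quoted above and of how a lower sustained clock under DVFS shrinks the number that can actually be delivered. The ALU count and clock speeds below are purely hypothetical placeholders, not any vendor's real specifications.

```python
# Hypothetical numbers purely for illustration; not any vendor's real specs.
ops_per_cycle_per_alu = 2        # one fused multiply-add counts as 2 FLOPs
num_alus = 500_000               # hypothetical ALU count across the chip
boost_clock_hz = 2.0e9           # peak clock used for the marketed figure
sustained_clock_hz = 1.5e9       # clock the chip can actually hold under DVFS

peak_tflops = ops_per_cycle_per_alu * num_alus * boost_clock_hz / 1e12
sustained_tflops = ops_per_cycle_per_alu * num_alus * sustained_clock_hz / 1e12

print(f"marketed peak:    {peak_tflops:.0f} TFLOPs")              # 2000 TFLOPs
print(f"sustainable:      {sustained_tflops:.0f} TFLOPs")         # 1500 TFLOPs
print(f"fraction of peak: {sustained_tflops / peak_tflops:.0%}")  # 75%
```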

When we combine a much lower TCO with higher effective FLOPs utilization, the dollar cost per effective FLOP becomes much cheaper from Google's perspective, with about 15% MFU being the breakeven point with the 30% MFU of the GB300. This means that if Google (or Anthropic) manages to achieve half the GB300 FLOPs utilization, they can still break even. Of course, with Google's elite compiler engineering team and deep understanding of their own models, the MFU they achieve on TPUs could reach 40%. That would be an astonishing reduction of about 62% in the cost per effective training FLOP!

(Chart: TCO / Effective Training Dense FP8 PFLOP ($/hr per Eff PFLOP) under different MFUs)
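The relationship behind these breakeven numbers can be written down in a few lines. The sketch below is our own illustration, normalizing the GB300 baseline to a TCO and peak of 1.0 and assuming, as the 15% breakeven against a 30%-MFU GB300 implies, that the TPU's TCO-to-peak-FLOPs ratio is about half the GB300's; it is not a published price model.

```python
def cost_per_effective_pflop(tco_per_hr, peak_pflops, mfu):
    """$ per hour per effective PFLOP = hourly TCO / (peak PFLOPs * MFU)."""
    return tco_per_hr / (peak_pflops * mfu)

def breakeven_mfu(tco_a, peak_a, tco_b, peak_b, mfu_b):
    """MFU system A needs for its $/effective-FLOP to match system B at mfu_b."""
    return mfu_b * (tco_a / peak_a) / (tco_b / peak_b)

# GB300 baseline normalized to TCO = 1.0 and peak = 1.0; TPU assumed at half
# the TCO for roughly the same peak (illustrative ratios, not list prices).
gb300 = cost_per_effective_pflop(1.0, 1.0, 0.30)
tpu_at_40 = cost_per_effective_pflop(0.5, 1.0, 0.40)

print(f"breakeven TPU MFU: {breakeven_mfu(0.5, 1.0, 1.0, 1.0, 0.30):.0%}")  # 15%
print(f"saving at 40% MFU: {1 - tpu_at_40 / gb300:.1%}")                    # ~62.5%
```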

However, for the 600,000 leased TPUs, once we fold the higher TCO Anthropic pays (i.e., including Google's markup) into this analysis, we estimate Anthropic's cost from GCP at about $1.60 per TPU-hour, which narrows the TCO advantage. We believe Anthropic can reach 40% MFU on TPUs, thanks to its focus on performance optimization and to TPU's marketed FLOPs being more realistic in the first place. That gives Anthropic a striking roughly 52% lower TCO per effective PFLOP than GB300 NVL72. The breakeven against the GB300 baseline comes at an extracted MFU of just 19%, meaning Anthropic could sustain a substantial performance shortfall relative to the GB300 and still end up with the same training-FLOP performance per TCO as the baseline Nvidia system.

(Chart: TCO / Effective Training Dense FP8 PFLOP under Different MFUs)
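Continuing the same back-of-the-envelope approach for the Anthropic-through-GCP case: the GB300 hourly figure below is back-solved from the 19% breakeven stated above (treating peak PFLOPs per chip as roughly equal), not taken from any published price list.

```python
tpu_hr = 1.60                    # estimated Anthropic cost per TPU-hour on GCP
gb300_hr = tpu_hr * 0.30 / 0.19  # ~$2.53/hr implied GB300-class baseline

breakeven = 0.30 * tpu_hr / gb300_hr                   # MFU where $/eff-FLOP matches
saving_at_40 = 1 - (tpu_hr / 0.40) / (gb300_hr / 0.30)

print(f"breakeven MFU:              {breakeven:.0%}")      # 19%
print(f"saving at 40% MFU vs GB300: {saving_at_40:.1%}")   # ~52.5%
```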

FLOPs are not everything; memory bandwidth is crucial for inference, especially in the bandwidth-bound decode phase. Unsurprisingly, TPU's cost per unit of memory bandwidth is much lower than GB300's. There is also significant evidence that TPUs achieve higher memory bandwidth utilization than GPUs at small message sizes (say 16MB to 64MB, i.e., loading a single expert layer).

(Chart: TCO / Memory Bandwidth ($/hr per TB/s))

All of this translates into efficient compute for training and serving models. Anthropic's Opus 4.5 continues its consistent coding focus, setting a new SWE-Bench record. The big surprise is that the API price dropped by about 67%. That price cut, combined with the model's lower verbosity and higher token efficiency relative to Sonnet (it needs 76% fewer tokens to reach Sonnet's best score, and 45% fewer than Sonnet 4), means Opus 4.5 is the best model for coding use cases and effectively improves Anthropic's blended token pricing, since Sonnet currently accounts for over 90% of the token mix.

(Chart: Anthropic API Pricing)

(Chart: SWE-Bench Scores vs Required Total Output Tokens)

Google Threads the Needle on Margins

When pricing for external customers, Google has to thread the needle between its own profitability and a competitive proposition for the customer. Our estimate of Anthropic's pricing sits at the low end of the external pricing range we have heard. For a flagship customer like Anthropic, which will provide valuable input into the software and hardware roadmaps while ordering in huge volume, we expect sweetheart pricing. While Nvidia's staggering 4x markup (approximately 75% gross margin) provides significant pricing flexibility, Broadcom has siphoned off a lot of the oxygen: as TPU's joint designer, Broadcom earns a high margin on the chips, which are the largest component of the system BOM (bill of materials). Nevertheless, this still leaves Google significant room to earn very substantial profits.

We can see this by comparing the GCP Anthropic deal with other large GPU-based cloud deals. Note that this is for the 600,000 TPUs being leased, while the remaining 400,000 TPU v7 chips were pre-purchased by Anthropic.

Under these assumptions, the TPUv7 economics show superior EBIT margins compared with the other large GPU-based cloud deals we have observed, with only OCI-OpenAI coming close. Even with Broadcom's margin stacked on the chip-level BOM, Google can still earn far better profits and returns than the more commoditized GPU deals. This is where the TPU stack lets GCP become a truly differentiated CSP (cloud service provider), while players like Microsoft Azure, whose ASIC plans are struggling, are left earning more mediocre returns simply renting out merchant hardware.

(Table: Comparison of Major AI Cloud Contracts)

TPU System and Network Architecture

So far we have compared TPU with Nvidia GPUs on single-chip specifications and their shortcomings. Now let's return to the system level, which is where TPU's capabilities really begin to differentiate. One of TPU's most striking features is the enormous scale-up world size achieved through the ICI protocol: a TPU pod reaches 9,216 Ironwood TPUs, and large pod sizes have been a feature since TPUv2 in 2017, which already scaled to full 256-chip and later 1,024-chip pods. Let's start at the rack level, the basic building block of each TPU super pod.

Ironwood Rack Architecture

(Image: Rack System)

The TPU rack has adopted a similar design over the past few generations. Each rack consists of 16 TPU trays, 16 or 8 host CPU trays (depending on the cooling configuration), a ToR switch, a power unit, and a BBU.

(Chart: TPU v7 Ironwood Rack)

Each TPU tray consists of 1 TPU board, which is equipped with 4 TPU chip packages. Each Ironwood TPU will have 4 OSFP cages for ICI connections, as well as 1 CDFP PCIe cage for connecting to the host CPU.

Google has been implementing liquid-cooled TPU racks since the TPU v3 in 2018, but some intermediate TPU generations are still designed for air cooling. The main difference between liquid-cooled and air-cooled racks is that the ratio of TPU trays to host CPU trays in air-cooled racks is 2 to 1, while in liquid-cooled racks, the ratio is 1 to 1.

One innovative design of TPU liquid cooling is that the flow rate of the coolant is actively controlled by valves. This makes cooling more efficient, as the flow can be adjusted based on the workload of each chip at any given time. Google's TPU has long adopted vertical power delivery, where the VRM module of the TPU is located on the opposite side of the PCB board. These VRM modules also require cold plates for cooling.

Overall, the TPU rack design is much simpler than Nvidia's Oberon NVL72 design, which has a higher density and utilizes backplane connections to expand the switch. The expansion connections between TPU trays are all done through external copper cables or optical devices, which will be explained in the ICI section below. The connections between TPU trays and CPU trays are also made through PCIe DAC cables.

Inter-Chip Interconnect (ICI) – The Key to the Huge Scale-Up World Size

The building block of Google TPUv7's ICI expansion network is a 4x4x4 3D Torus consisting of 64 TPUs. Each 4x4x4 cube of 64 TPUs maps to a physical rack of 64 TPUs. This is an ideal size because all 64 TPUs can be electrically connected to each other and still fit within a single physical rack.

(Chart: TPU v7 - 64 TPU 4x4x4 Cube Logical Configuration)

TPUs are interconnected in a 3D toroidal configuration, with each TPU connecting to a total of 6 neighbors—2 logically adjacent TPUs along the X, Y, and Z axes. Each TPU is always connected to 2 other TPUs through PCB traces within the compute tray, but depending on the TPU's position within the 4x4x4 cube, it will connect to 4 other neighbors via Direct Attach Copper (DAC) or optical transceivers.

Connections within the 4x4x4 cube are made using copper cables, while connections outside the 4x4x4 cube (including connections that wrap around to the opposite side of the cube and connections to adjacent 4x4x4 cubes) will use optical transceivers and OCS (Optical Circuit Switch). In the diagram below, we see that this is a 3D toroidal network: TPUs 2, 3, and 4 (on the Z+ face) use 800G optical transceivers and are routed through OCS, with wrap-around connections back to the opposite Z-axis face TPUs 2, 3, and 1 (on the Z- face).

(Chart: TPU Unit Connections)

As mentioned above, in addition to the 2 adjacent TPUs that are always connected via PCB traces, each TPU will connect to 4 other neighbors using DAC, transceivers, or a combination of both, depending on their position within the 4x4x4 cube.

TPUs inside the 4x4x4 cube will connect to 4 other neighbors using only DAC, TPUs on the faces of the cube will connect via 3 DACs and 1 optical transceiver, edge TPUs will connect via 2 optical transceivers and 2 DACs, while corner TPUs will connect via 1 DAC and 3 optical transceivers. You can remember how many transceivers it will use by looking at how many faces of the given TPU are facing the "outside" of the cube.

(Chart: TPU Positions within the 4x4x4 Cube)

The above image and the following table summarize the number of each type of TPU position, which can be used to derive the ratio of 1.5 optical transceivers per TPU in TPU v7. These transceivers connect to the Optical Circuit Switch (OCS), enabling connections between the 4x4x4 cubes, as detailed in the next section.

(Table: Google TPU v7 3D torus connection ratios)
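As a sanity check on the table above, the short sketch below enumerates a 4x4x4 cube and counts, for each TPU, how many of its six torus neighbors lie outside the cube (and therefore need an optical transceiver and an OCS hop) versus inside it (copper, whether PCB trace or DAC). It reproduces the interior/face/edge/corner counts, the 1.5 transceivers per TPU, and the 96 optical links per 64-TPU cube.

```python
from collections import Counter

N = 4  # 4x4x4 cube, i.e. one physical rack of 64 TPUs

optical_per_tpu = Counter()
for x in range(1, N + 1):
    for y in range(1, N + 1):
        for z in range(1, N + 1):
            optical = 0
            # 6 torus neighbors: +/-1 step along each of the X, Y, Z axes.
            for dx, dy, dz in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                               (0, -1, 0), (0, 0, 1), (0, 0, -1)]:
                nx, ny, nz = x + dx, y + dy, z + dz
                # A neighbor outside the cube is a wrap-around or inter-cube
                # link, carried by an optical transceiver through an OCS.
                if not (1 <= nx <= N and 1 <= ny <= N and 1 <= nz <= N):
                    optical += 1
            optical_per_tpu[optical] += 1

total_tpus = sum(optical_per_tpu.values())                      # 64 TPUs
total_optical = sum(k * v for k, v in optical_per_tpu.items())  # 96 optical links
print(dict(optical_per_tpu))       # counts: 8 interior (0), 24 face (1), 24 edge (2), 8 corner (3)
print(total_optical / total_tpus)  # 1.5 optical transceivers per TPU
```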

Optical Devices for ICI

Google employs a software-defined networking approach to manage network routing through Optical Circuit Switches (OCS). An NxN OCS is essentially a massive train station with N incoming tracks and N outgoing tracks. Any incoming train can be transferred to any outgoing train, but this must be reconfigured at the station. Trains cannot "loop back" or return to another N incoming track; they must route to one of the N outgoing tracks.

The benefit of this approach is that the network can assemble smaller logical TPU slices—tailored for different workloads—by partitioning from the theoretical maximum of 9,216 chips in the ICI network layer. By partitioning larger clusters and rerouting ICI paths around failures in the network, cluster availability is improved.

Unlike electronic packet switching (EPS) switches, such as Arista boxes built on the Tomahawk 5, where a fixed total bandwidth is divided into several smaller-bandwidth ports, an OCS will pass whatever bandwidth the fiber connected to its ports carries. OCS latency is also lower than EPS, because the optical signal entering the OCS simply bounces from the input port to the output port, whereas in an EPS the optical signal must be converted to an electrical signal on entry; this is also a key reason OCS is generally more energy efficient than EPS. An EPS can route packets from any port to any port, while an OCS only routes from an "input" port to an "output" port.

(Image: Internal Structure of OCS)

OCS ports only route a single fiber strand. This poses a challenge for standard duplex transceivers, where bandwidth is carried over multiple fibers, reducing the effective radix and bandwidth of the OCS. To address this, FR optical transceivers are used to consolidate all wavelengths onto a single fiber to connect to one OCS port. The Apollo project achieves this in two innovative steps. First, 8 wavelengths, one for each 100G channel, are multiplexed via Coarse Wavelength Division Multiplexing (CWDM8), carrying 800G over a single pair of fibers instead of 8 pairs. Second, optical circulators are integrated into the Wavelength Division Multiplexing (WDM) transceivers to enable full-duplex data flow, cutting the requirement from one fiber pair down to a single fiber strand.

(Image: How the optical circulator works)

The circulator forms a bidirectional link by combining the Tx and Rx fibers at the transceiver into a single fiber strand running to the OCS switch.

Connecting Multiple 64 TPU Cubes

What makes Google's ICI expansion network unique is that it allows multiple 64 TPU 4x4x4 cubes to be connected together in a 3D toroidal configuration to create a massive world scale. TPUv7 has a maximum world scale of 9,216 TPUs, but today, Google supports configuring TPUs into multiple different slice sizes, ranging from 4 TPUs up to 2,048 TPUs.

(Table: Supported Configurations)

While Google can impressively scale a cluster up to 9,216 TPUs, the incremental benefit of running training workloads on ever larger contiguous slices, up to roughly 8,000 TPUs, diminishes at any given time. This is because larger slices are more prone to failures and interruptions, reducing slice availability, defined as the proportion of time the ICI cluster can form contiguous 3D torus slices.

(Chart: Effective Throughput (Goodput) vs CPU Host Availability With/Without OCS)

For slices that fit entirely within a 4x4x4 cube, we can simply use the copper interconnects within the rack plus the optical transceivers on the cube's faces, edges, and corners to carve those slices out, wrapping around to complete the 3D torus where needed.

To understand how wrap-around and inter-cube connections are made, consider how a 64-TPU slice is created on a 4x4x4 topology. We can build it from one 64-TPU 4x4x4 cube, corresponding to one physical 64-TPU rack. The 8 interior TPUs of the cube connect to all 6 of their neighbors using copper cables alone. If a TPU has no in-cube neighbor along a given axis, it wraps around to the TPU on the opposite side of the cube. For example, TPU 4,1,4 has no in-cube neighbor in the Z+ direction, so it connects to the OCS assigned to the Z axis via an 800G optical transceiver, and the OCS is configured to route this connection to the Z- side of the cube, to TPU 4,1,1. In the Y direction, TPU 1,1,1 connects through a Y-axis OCS via an optical transceiver to TPU 1,4,1 on the Y+ side, and so on.

(Chart: TPU v7 - 64 TPU slice 4x4x4 topology)

Each face of the 4x4x4 cube will be connected through 16 different OCS, with one OCS for each TPU on each face.

For example, in the diagram below, on the X+ face, TPU 4,3,2 connects to the input side of OCS X,3,2. The input side of OCS X,3,2 will also connect to the same TPU index (4,3,2) on the X+ face of all 144 4x4x4 cubes in the 9,216 TPU cluster. The output side of OCS X,3,2 will then connect to the same TPU index on the X- face of each cube in the cluster—thus it will connect to TPU 1,3,2 on all 144 cubes in the cluster. The diagram below illustrates how all 16 TPUs on the X+ face of cube A are connected to the 16 TPUs on the X- face of cube B through 16 OCS.

These connections allow any "plus" face of any cube to connect to any "minus" face of any other cube, enabling complete interchangeability of cubes when forming slices.

There are two limitations to briefly note. First, a TPU at a given index on a face can never connect directly to a different index—therefore TPU 4,3,2 can never be configured to connect to TPU 1,2,3. Second, since OCS essentially acts as a patch panel—TPUs connected on the input side cannot "loop back" to connect to any other TPU also connected on the OCS input side—for example, TPU 4,3,2 can never connect to TPU 4,3,3. Thus—TPUs on any "plus" face can never connect to any other cube's "plus" face, and TPUs on any "minus" face can never connect to any other cube's "minus" face.

(Chart: TPU v7 connected to OCS)
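These two constraints can be captured in a few lines. The sketch below is our own simplified model, not Google code: it treats each per-axis OCS as a patch panel whose input side takes the "plus"-face TPU at one face index and whose output side takes the "minus"-face TPU at the same index, so a legal inter-cube link must match axis and face index and must join a plus face to a minus face.

```python
from typing import NamedTuple, Tuple

class FacePort(NamedTuple):
    cube: str               # which 4x4x4 cube, e.g. "A" or "B"
    axis: str               # "X", "Y" or "Z"
    sign: str               # "+" (OCS input side) or "-" (OCS output side)
    index: Tuple[int, int]  # (i, j) position of the TPU on that 4x4 face

def can_connect(a: FacePort, b: FacePort) -> bool:
    """True if the OCS layer can ever patch these two face ports together."""
    return (
        a.axis == b.axis                     # each OCS group serves one axis
        and a.index == b.index               # same face index -> same OCS
        and {a.sign, b.sign} == {"+", "-"}   # input side must meet output side
    )

# TPU (4,3,2) on cube A's X+ face can reach TPU (1,3,2) on cube B's X- face...
print(can_connect(FacePort("A", "X", "+", (3, 2)),
                  FacePort("B", "X", "-", (3, 2))))   # True
# ...but never a different face index, and never another "+" face.
print(can_connect(FacePort("A", "X", "+", (3, 2)),
                  FacePort("B", "X", "-", (2, 3))))   # False
print(can_connect(FacePort("A", "X", "+", (3, 2)),
                  FacePort("B", "X", "+", (3, 2))))   # False
```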

Let's scale up and see how a 4x4x8 topology is set up. In this configuration, we extend the slice by connecting two 64-TPU 4x4x4 cubes along the Z axis. Here, the OCS reconfigures the optical port attached to TPU 4,1,4 so that it now connects to TPU 4,1,5, instead of wrapping back to TPU 4,1,1 as in the standalone 4x4x4 topology. By extension, 16 optical connections extend from each of the Z- and Z+ faces of the two 4x4x4 TPU cubes, for a total of 64 fibers connected to the 16 Z-axis OCS.

It is important to remind readers that the cubes A and B depicted below are not necessarily physically located next to each other. Instead, they are connected via OCS and may be located in completely different positions within the data center.

(Chart: TPU v7 - 128 TPU slice 4x4x8 topology)

We will now move to a larger topology—a 16x16x16 topology, which brings us to 4,096 TPUs. In this topology, we use a total of 48 OCS to connect 64 cubes, each containing 64 TPUs. In the diagram below, each multicolored cube represents a 64 TPU 4x4x4 cube. Taking the 4x4x4 cube in the lower right corner as an example—this cube is connected via OCS to an adjacent cube along the Y-axis.

The maximum world scale of 9,216 TPUs is constructed using 144 4x4x4 cubes, with each cube requiring 96 optical connections, resulting in a total demand of 13,824 ports. Dividing this total port demand by 288 (144 input and 144 output ports per OCS) means we need 48 144x144 OCS to support this maximum world scale.

(Chart: TPU v7 4,096 TPU slice 16x16x16 topology)
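The port math behind those 48 OCS, using the 96 optical links per cube counted earlier, works out as follows (a simple restatement of the figures above, not new data):

```python
cubes = 144                   # 144 cubes x 64 TPUs = 9,216 TPU maximum world size
optical_links_per_cube = 96   # 1.5 transceivers per TPU x 64 TPUs
ports_per_ocs = 288           # 144 input + 144 output ports on a 144x144 OCS

total_ports = cubes * optical_links_per_cube      # 13,824 OCS ports needed
print(total_ports, total_ports // ports_per_ocs)  # 13824 48 -> 48 OCS
```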

Why Use Google's ICI 3D Torus Architecture?

Aside from spending countless hours drawing all the fancy cube diagrams, what are the benefits of Google's unique ICI extended network?

World Scale: The most obvious benefit is the very large 9,216 TPU maximum world scale supported by TPUv7 Ironwood. Even though the maximum slice size of 9,216 may rarely be used due to the downside of effective throughput (goodput) reduction, slices of thousands of TPUs can and are often used. This is far greater than the 64 or 72 GPU world scales commonly found in the commercial accelerator market and among other custom chip providers.

Reconfigurability and Replaceability: The use of OCS means the network can be reconfigured into a large number of different topologies, potentially thousands of them. Google's documentation site lists 10 combinations (shown earlier in this section), but those are just the most common 3D slice shapes; many more are possible. Even slices of the same size can be reconfigured differently. In the simple Twisted 2D Torus example illustrated below, wrapping connections across to different X coordinates rather than the same X coordinate reduces the worst-case hop count and improves the worst-case bisection bandwidth, which helps collective throughput for all-to-all communication. TPUv7 clusters are twisted at the 4x4x4 cube level.

(Chart: Regular vs Twisted 2D Torus)

Reconfigurability also opens the door to a wide range of diversified parallelism. In a world scale of 64 or 72 GPUs, different combinations of parallelism are often limited to factors of 64. When it comes to ICI extended networks, the possibilities of implementing topologies to precisely match the required combinations of data parallelism, tensor parallelism, and pipeline parallelism are abundant.

The fact that OCS allows any “+” face of any cube to connect to any “-” face of any other cube means that the cubes have complete interchangeability. Slices can consist of any set of cubes. Therefore, if there are any failures or changes in user demand or usage scenarios, this will not hinder the formation of new topological slices.

(Chart: TPUv4 Circuit Switching Reconfigurability)

Lower Cost: Google's ICI network costs less than most switched scale-up networks. Although the FR optics used may be slightly more expensive because of the circulators, the direct-connect torus reduces the total number of switches and ports required and eliminates the cost of switch-to-switch links.

(Table: Comparison of Extended Network Costs)

Low Latency and Better Locality: The use of direct links between TPUs means that for TPUs that are physically close to each other or reconfigured to connect directly, much lower latency can be achieved. TPUs that are close to each other also have better data locality.

Data Center Network (DCN) – Scaling Beyond 9,216 TPUs

The Data Center Network (DCN) is a network independent of the ICI that plays the role of a typical backend and frontend network. It connects an even larger domain: up to 147,456 TPUs in the case of TPUv7 clusters. As discussed in our earlier article on the Apollo project, Google proposed replacing the electronic packet switching (EPS) found in the spine layer of a traditional Clos architecture with the Palomar optical circuit switch (OCS). Google's DCN consists of an optically switched Data Center Network Interconnect (DCNI) layer that ties together several aggregation blocks, each of which connects several 9,216-TPU ICI clusters.

In 2022, Google's Apollo project described a DCN architecture using 136x136 OCS switches for TPUv4 pods of 4,096 TPUs. The OCS switches in the DCNI layer are organized into 4 Apollo regions, each containing up to 8 racks of 8 OCS switches, for a total of 256 OCS switches. For Ironwood, to support up to 147,456 TPUv7s on the same network, we assume the port count per OCS nearly doubles rather than the maximum number of OCS switches increasing.

The following diagram illustrates what an Ironwood DCN network might look like, accommodating 256 300x300 OCS switches using 32 racks. Assuming no oversubscription between the spine layers of each aggregation block, up to 16 ICI pods can be connected in the DCN, with 4 aggregation blocks each connecting 4 ICI pods—totaling 147,456 TPUs.
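The headline TPU count decomposes simply:

```python
tpus_per_ici_pod = 9_216          # one Ironwood ICI domain
pods_per_aggregation_block = 4
aggregation_blocks = 4

print(tpus_per_ici_pod * pods_per_aggregation_block * aggregation_blocks)  # 147456
```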

The DCNI layer connects 4 aggregation blocks—depicted as the top layer in the diagram below. Similar to ICI, FR optical devices are used to connect to the OCS to maximize the bandwidth of each OCS port.

(Chart: 147,456 DCN topology)

While existing Ironwood clusters may only have 1 or 2 aggregation blocks, the unique architecture of Google's DCN allows for the addition of new TPU aggregation blocks to the network without significant rewiring.

By using OCS for the DCNI layer, the DCN fabric can be scaled incrementally, and the network can be re-striped to support new aggregation blocks. The bandwidth of an aggregation block can also be upgraded without changing the composition of the DCN layer, which allows the link speeds of existing aggregation blocks to be refreshed without altering the fundamental architecture of the network. However, this incremental scaling cannot continue indefinitely: at massive scale, rewiring the network becomes unmanageable.

(Chart: AB extension using OCS link)

TPU Software Strategy – Another Huge Shift

Traditionally, the TPU software and hardware teams have been internally focused. This brought advantages, such as no marketing team pressure to exaggerate theoretical FLOPs.

Another advantage of being internally focused is that the TPU team greatly prioritized internal feature requests and optimized internal workloads. The downside is that they were less concerned about external customers or workloads. The number of external developers in the TPU ecosystem is far lower than that in the CUDA ecosystem. This is one of the main weaknesses of TPU, as it is for all non-Nvidia accelerators.

Google has since reoriented its software strategy toward external customers, making significant changes to the TPU team's KPIs and to how the team contributes to the AI/ML ecosystem. We will discuss two major changes:

  1. A massive engineering effort on PyTorch TPU "native" support
  2. A massive engineering effort on vLLM/SGLang TPU support

Looking at the number of contributions Google has made to various TPU software repositories, this externalization strategy is easy to see. vLLM contributions rise sharply starting in March. Then, in May, the "tpu-inference" repository, the official unified vLLM TPU backend, was created, and it has seen steady activity ever since.

(Chart: Google's monthly contributions by repository)

Traditionally, Google only provided first-class support for the JAX/XLA:TPU stack (as well as TensorFlow/TF-Mesh, rest in peace) and treated PyTorch on TPU as a second-class citizen. It relied on lazy-tensor graph capture through PyTorch/XLA rather than offering a first-class eager execution mode. It also did not support PyTorch's native distributed API (torch.distributed.*) or PyTorch's native parallelism APIs (DTensor, FSDP2, DDP, etc.), relying instead on the odd out-of-tree XLA SPMD APIs (torch_xla.experimental.spmd_fsd, torch_xla.distributed.spmd, etc.). This has made for a poor, non-native experience for external users who are accustomed to the native PyTorch CUDA backend on GPUs and are trying to switch to TPUs.

(Code example: XLA)
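
A minimal sketch (ours) of the lazy-tensor PyTorch/XLA style described above: operations are traced into an XLA graph and only compiled and executed when the step is marked.

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()             # lazy XLA device, not a native eager backend
x = torch.randn(4, 4, device=device)
y = (x @ x).sum()                    # traced into the graph, not yet executed
xm.mark_step()                       # cut the graph: compile and run it on the TPU
print(y.item())                      # materializes the result on the host
```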

In October, Google's "Captain Awesome" Robert Hundt quietly announced in the PyTorch/XLA repository that they would transition from the non-native lazy-tensor backend to a "native" TPU PyTorch backend, which will support eager execution by default and integrate with APIs such as torch.compile, DTensor, and torch.distributed. They will achieve this using the PrivateUse1 dispatch key (a sketch of that mechanism follows below). This is primarily for Meta, which has regained interest in purchasing TPUs and does not want to switch to JAX. It will also let those who prefer PyTorch and dislike JAX use TPUs.
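
For context, here is a minimal sketch (not Google's implementation; the "tpu" device name is our assumption) of the PrivateUse1 mechanism PyTorch exposes for out-of-tree backends:

```python
import torch

# Rename the reserved PrivateUse1 dispatch key to a friendlier device name.
# The real backend must still register its kernels and a device module
# (done on the C++/Pallas side, not shown here).
torch.utils.rename_privateuse1_backend("tpu")

# Once kernels are registered, user code addresses the device like any
# native PyTorch backend:
#   x = torch.randn(4, 4, device="tpu")
#   y = (x @ x).sum()
```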

From 2020 to 2023, several teams at Meta FAIR made extensive use of PyTorch/XLA on TPUs, but it was never widely adopted, and Meta's leadership ultimately cancelled the contract in 2023. PyTorch/XLA on TPUs was not a pleasant experience. At the time, Meta FAIR's GCP TPUs were even run with SLURM, rather than anything you would typically find in the TPU stack, such as GKE/XManager/Borg.

(Image: GitHub RFC)

This new PyTorch <> TPU backend will make it much smoother for ML scientists accustomed to PyTorch on GPUs to switch to PyTorch on TPUs and take advantage of TPUs' higher performance per TCO dollar.

Pallas is a kernel authoring language for writing custom TPU kernels (similar to cuTile, Triton, or CuTe-DSL); a small illustrative kernel follows below. Meta and Google have also begun work on supporting Pallas kernels as a code-generation target for the Torch Dynamo/Inductor compilation stack. This will allow native TPU integration with PyTorch's torch.compile API and let end users register custom Pallas operations into PyTorch. Beyond the core in-tree PyTorch native APIs, there is also work behind the scenes to integrate the Pallas kernel language as a code-generation target for Helion. You can think of Helion as a higher-level language for writing reasonably performant kernels: users write at a level closer to PyTorch's native ATen operators than to the lower-level Triton/Pallas.
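
A minimal Pallas sketch (ours, purely illustrative rather than one of the kernels being upstreamed): an elementwise add written in the Pallas DSL.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # Refs expose block-local views; read both input blocks and write the sum.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.ones((8, 128), dtype=jnp.float32)   # shape aligned to TPU tiling
print(add(x, x)[0, 0])                      # 2.0
```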

Another area where the CUDA ecosystem reigns supreme is open-ecosystem inference. Historically, vLLM and SGLang have supported CUDA as a first-class citizen (with ROCm as a second-class citizen). Now Google wants to enter the vLLM and SGLang open inference ecosystem and has announced beta vLLM and SGLang support for TPU v5p/v6e through a rather "unique" integration.

vLLM and SGLang currently achieve this by lowering PyTorch modeling code to JAX and leveraging the existing mature JAX TPU compilation process. In the future, once the PyTorch XLA RFC #9684 (i.e., native TPU PyTorch backend) is implemented, vLLM and SGLang plan to evaluate whether to switch to using that backend instead of translating modeling from PyTorch to JAX via TorchAX.

Google and vLLM claim that this lowering path to JAX does not require any changes to the PyTorch modeling code, but given the limited number of models currently supported by vLLM TPU, we are skeptical.

Additionally, Google has open-sourced and integrated some of their TPU kernels into vLLM, such as TPU-optimized paged attention kernels, compute-communication overlapping GEMM kernels, and several other quantized matmul kernels. They do not yet have MLA-friendly TPU kernels. Once the Inductor Pallas TPU code generation integration matures further, it will be interesting to see if kernel fusion and pattern matching can be integrated into the existing vLLM PassManager. SGLang is also considering implementing a torch.compile PassManager to make kernel fusion management for many models easier to maintain.

For Ragged Paged Attention v3, the TPU approach differs sharply from vLLM's GPU approach. vLLM manages the KV cache with techniques similar to virtual memory and paging. However, that approach requires computing dynamic addresses and performing scatter operations, which TPUs are not good at. The TPU kernels therefore rely on fine-grained operation pipelining: the TPU paged-attention kernel prefetches the queries and KV blocks for the next sequence, so that memory loads overlap with computation (see the sketch below).
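
A conceptual sketch (our reconstruction, not the actual kernel; load_kv_block and attend are hypothetical helpers) of the double-buffering idea behind that prefetching:

```python
def paged_attention_over_blocks(q, kv_block_ids, load_kv_block, attend):
    """Process one query against its list of KV-cache blocks.
    In the real kernel, the load for block i+1 is an asynchronous copy that
    runs while attend() is still computing on block i, hiding memory latency."""
    acc = None
    next_kv = load_kv_block(kv_block_ids[0])              # prefetch the first block
    for i in range(len(kv_block_ids)):
        cur_kv = next_kv
        if i + 1 < len(kv_block_ids):
            next_kv = load_kv_block(kv_block_ids[i + 1])  # overlaps with attend()
        acc = attend(q, cur_kv, acc)                      # compute on the current block
    return acc
```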

In the existing vLLM MoE kernel, tokens are sorted by expert ID, dispatched to the devices holding the corresponding experts, run through a grouped matrix multiplication, and then combined back to their original devices. This kernel performs poorly on TPU for two reasons: TPUs are slow at sorting, and the kernel cannot overlap communication with computation.

To address this, Google's developers designed an all-fused MoE kernel. It dispatches the tokens for one expert to each device at a time, overlapping the MoE dispatch and combine communication with compute and avoiding the sort by expert ID altogether. Google engineers report a 3-4x speedup over the existing kernel; a conceptual sketch follows the diagram below.

(Chart: Time step diagram)
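
A conceptual sketch (our reconstruction, not Google's kernel) contrasting the sort-based dispatch with the all-fused, expert-at-a-time approach; grouped_matmul and expert_matmul are hypothetical helpers, and the communication overlap is only noted in comments:

```python
import numpy as np

def sort_based_moe(tokens, expert_ids, grouped_matmul):
    # Baseline: sort tokens by expert ID (slow on TPU), run one grouped GEMM,
    # then unsort the results back to the original token order.
    order = np.argsort(expert_ids)
    outputs = grouped_matmul(tokens[order], expert_ids[order])
    return outputs[np.argsort(order)]

def all_fused_moe(tokens, expert_ids, num_experts, expert_matmul):
    # All-fused idea: iterate expert by expert, no sorting required. In the real
    # kernel, the dispatch/combine communication for expert e+1 overlaps with
    # the compute for expert e.
    out = np.zeros_like(tokens)
    for e in range(num_experts):
        mask = expert_ids == e
        out[mask] = expert_matmul(e, tokens[mask])
    return out
```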

Additionally, TPUs contain another hardware unit, the SparseCore (SC), which accelerates embedding lookups and updates. The SC pairs a scalar core, the SparseCore Sequencer (SCS), with multiple vector sub-cores, the SparseCore Tiles (SCTs). SCTs support local and remote direct memory access at a fine granularity of 4 or 32 bytes, whereas the TPU TensorCore loads 512 bytes at a time. This lets the SC perform gather/scatter operations and ICI communication while overlapping with TensorCore work; a small illustration follows below.
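
A small illustration (ours) of the kind of operation the SC accelerates: an embedding lookup is a gather of a few narrow rows from a large table, exactly the fine-grained random access that a 512-byte TensorCore load wastes. The row width is chosen to match the 32-byte granularity mentioned above.

```python
import jax.numpy as jnp

table = jnp.zeros((1_000_000, 8), dtype=jnp.float32)   # 8 x 4 B = 32 B per row
ids = jnp.array([3, 17, 999_999])                      # sparse, irregular indices
vectors = jnp.take(table, ids, axis=0)                 # gather of 32-byte rows
print(vectors.shape)                                   # (3, 8)
```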

At JAX DevLabs, we learned that SparseCore programmability is in progress. We can expect Mosaic (the TPU custom kernel compiler) to compile in an MPMD fashion, where the SCS and SCTs execute different kernels and different SparseCores can run different programs. We suspect that once programmability catches up, TPU MoE kernels will be able to perform dispatch and combine operations in a manner similar to GPUs, rather than dispatching one expert at a time.

(Chart: SparseCore Structure)

As for disaggregated prefill/decode, which we covered in depth in the AMD 2.0 article, Google has added experimental support for single-host disaggregated PD in vLLM, though it does not yet support multi-host wide-EP disaggregated prefill or MTP. These inference optimizations are crucial for reducing TCO per million tokens and improving performance per dollar and per watt. Google also has not yet integrated TPU vLLM inference support into popular RL frameworks (such as VERL). Still, Google is slowly moving in the right direction on the open AI/ML ecosystem, especially with its "native" TPU backend.

vLLM TPU Benchmarking is Not Relevant Yet

This week, a new inference benchmark run on TPUv6e was released, claiming that TPUv6e's performance per dollar is 5 times worse than NVIDIA GPUs. We disagree, for two main reasons. First, the benchmark was run on vLLM on TPU, which has only been out for a few months and is far from fully optimized. Google's internal Gemini workloads and Anthropic's workloads run on an internally customized inference stack that beats NVIDIA GPUs on TCO-adjusted performance.

Second, Artificial Analysis's cost per million tokens assumes a price of $2.7/hour/chip for TPUv6e. Given that a TPUv6e's BOM is only a fraction of an H100's, no major TPU customer pays anywhere near that much. It is well known that most clouds publish inflated list prices so their sales executives can use "car salesman" tactics (high sticker price, big discount) to make customers feel they are getting a good deal. The SemiAnalysis AI TCO model tracks actual market rental prices for TPUs across contract lengths (1 month, 1 year, 3 years, etc.); the sketch after the chart below shows why the assumed hourly price matters so much.

(Chart: Cost per million input and output tokens)
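
A minimal sketch of the arithmetic at stake: cost per million tokens scales linearly with the hourly chip price, so quoting list price rather than the actual negotiated rate inflates $/Mtok by the same factor. The discount and throughput below are illustrative placeholders, not measurements.

```python
def cost_per_million_tokens(price_per_chip_hour, tokens_per_chip_hour):
    # $/chip-hour divided by millions of tokens produced per chip-hour.
    return price_per_chip_hour / (tokens_per_chip_hour / 1e6)

list_price, negotiated_price = 2.70, 1.35   # $/chip-hour; the 50% discount is illustrative
throughput = 5_000_000                      # tokens/chip-hour, placeholder value

print(cost_per_million_tokens(list_price, throughput))        # 0.54 $/Mtok
print(cost_per_million_tokens(negotiated_price, throughput))  # 0.27 $/Mtok
```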

Key Missing Part of TPU Software Strategy

One part of Google's software strategy that is still mishandled: its XLA:TPU graph compiler, networking libraries, and TPU runtime remain closed source and poorly documented. This frustrates everyone from power users to casual users, who cannot debug what is going wrong in their code. The MegaScale codebase used for multi-pod training is also not open source. We firmly believe Google should open source these components to accelerate adoption; the gain in user adoption would outweigh the value of keeping the software IP proprietary. Just as open sourcing drove the rapid adoption of PyTorch and Linux, open sourcing XLA:TPU, the TPU runtime, and the networking libraries would quickly accelerate TPU adoption as well.