
Performance that goes head-to-head with Blackwell, energy efficiency that surpasses GPUs: a deep dive into the real strength of Google's TPU

For investors and cloud vendors, the greatest value of TPU is not just speed but profit margins. By controlling the full-stack design of TPU, Google has successfully bypassed the "NVIDIA tax." And because Broadcom's gross margin is far lower than NVIDIA's, Google can drive its computing costs down aggressively. From TPU v6 to the newly revealed TPU v7, Google is not only making chips; it is building an almost insurmountable moat for the coming "AI inference era."
In the field of AI computing power, NVIDIA looks like the invincible leader. Behind the scenes, however, Google is quietly but decisively rewriting the rules of the AI chip war.
This trump card is Google's self-developed TPU (Tensor Processing Unit).
If you think this is just a "backup" Google built to save costs, you are gravely mistaken. According to recently disclosed details, Google's latest TPU v7 (codenamed Ironwood) not only matches NVIDIA's B200 in memory capacity but decisively outclasses GPUs on energy efficiency. Even Jensen Huang has hinted that, in the ASIC field, Google's TPU is in a class of its own.
From TPU v6 (Trillium) to the newly revealed TPU v7 (Ironwood), Google is not just making chips; it is building an almost insurmountable moat for the impending "AI inference era."
Origin: A "Forced" Act of Self-Rescue
The story of TPU did not begin with breakthroughs in chip manufacturing but rather with a math problem that left Google's executives in a cold sweat.
In 2013, Jeff Dean and the Google Brain team conducted a simulation: if every Android user used voice search for just 3 minutes a day, Google would need to double the capacity of its global data centers to handle the computational load.
At that time, Google relied on general-purpose CPUs and GPUs, but these chips were too inefficient for the massive matrix multiplication operations in deep learning. Continuing to expand with old hardware would lead to a nightmare of financial and logistical costs.
Thus, Google decided to take a path it had never taken before: designing a custom ASIC tailored to its TensorFlow neural network workloads.
The project moved remarkably fast, taking only 15 months from design concept to data center deployment. By 2015, while the outside world was still unaware, TPUs were already quietly powering core services such as Google Maps, Photos, and Translate.
Architecture Battle: Shedding "burdens" and letting data flow like blood
Why can TPU's energy efficiency outperform GPUs? The answer starts with the underlying architecture.
GPUs are "general-purpose" parallel processors originally designed for graphics. To handle everything from game textures to scientific simulations, they carry a heavy architectural burden: complex caching, branch prediction, and thread management, all of which consume significant chip area and energy.
In contrast, TPU is extremely "minimalist." It strips away all irrelevant hardware like rasterization and texture mapping, adopting a unique "Systolic Array" architecture.
In a traditional GPU, each computation moves data back and forth between memory and compute units, creating the famous "von Neumann bottleneck." In the TPU's systolic array, data flows through the chip like blood through the heart: each processing element passes operands directly to its neighbors. This sharply reduces reads and writes to HBM (High Bandwidth Memory), letting the chip spend its time computing rather than waiting for data, and it is what gives TPU a crushing advantage in operations per joule.
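To make the idea concrete, here is a minimal toy simulation of a weight-stationary systolic array in Python. It is a sketch of the dataflow principle, not Google's actual MXU design: the weight matrix is pinned into a grid of processing elements, activations stream in from the left edge (skewed one cycle per row), and partial sums trickle down each column, so no operand revisits memory until a finished dot product drains out the bottom.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy cycle-by-cycle simulation of C = A @ B on a weight-stationary
    systolic array. B is pre-loaded into a K x N grid of processing
    elements (PEs); rows of A stream in from the left, skewed one cycle
    per row; partial sums flow down each column. Every PE does one
    multiply-accumulate per cycle, with no memory traffic in between."""
    R, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"

    a_reg = np.zeros((K, N))            # activation latched in each PE
    p_reg = np.zeros((K, N))            # partial sum each PE passed downward
    C = np.zeros((R, N))

    for t in range(R + K + N - 2):      # cycles until the array fully drains
        # Activations shift one PE to the right; new inputs enter column 0.
        a_in = np.zeros((K, N))
        for k in range(K):
            r = t - k                   # skew: A[r, k] enters row k at cycle r + k
            a_in[k, 0] = A[r, k] if 0 <= r < R else 0.0
            a_in[k, 1:] = a_reg[k, :-1]

        # Each PE adds its product to the partial sum arriving from above.
        p_new = np.zeros((K, N))
        for k in range(K):
            psum_from_above = p_reg[k - 1] if k > 0 else np.zeros(N)
            p_new[k] = psum_from_above + a_in[k] * B[k]   # one MAC per PE

        # Finished dot products exit the bottom row of the array.
        for j in range(N):
            r = t - (K - 1) - j
            if 0 <= r < R:
                C[r, j] = p_new[K - 1, j]

        a_reg, p_reg = a_in, p_new      # latch registers for the next cycle

    return C

# Sanity check against NumPy's matmul.
A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Note how the inner loop contains only neighbor-to-neighbor register transfers and multiply-accumulates; in hardware, that regularity is what lets a TPU dedicate nearly all of its silicon and energy budget to arithmetic.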
Going Toe-to-Toe with Blackwell: TPU v7's Staggering Numbers
Although Google has always been tight-lipped about performance data, according to SemiAnalysis and internal disclosures, the latest TPU v7 (Ironwood) represents an astonishing generational leap.
Compute skyrockets: TPU v7 delivers 4,614 TFLOPS of BF16 compute, versus just 459 TFLOPS for the widely deployed TPU v5p. That is a full order of magnitude improvement.
Memory on par with B200: single-chip HBM capacity reaches 192 GB, identical to NVIDIA's Blackwell B200 (Blackwell Ultra carries 288 GB).
Bandwidth surges: memory bandwidth reaches 7,370 GB/s, far exceeding the v5p's 2,765 GB/s.
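Taken at face value, those figures also reveal a deliberate design shift, sketched in the back-of-the-envelope Python below (using only the numbers quoted in this article): compute grew roughly 3.8x faster than memory bandwidth between v5p and v7, raising the FLOPs available per byte fetched from HBM, a profile that favors dense matrix workloads such as large-batch inference.

```python
# Back-of-the-envelope math using only the spec figures quoted above.
v5p_tflops, v5p_bw = 459, 2_765          # BF16 TFLOPS, HBM GB/s
v7_tflops,  v7_bw  = 4_614, 7_370

print(f"compute gain:   {v7_tflops / v5p_tflops:.1f}x")   # ~10.1x
print(f"bandwidth gain: {v7_bw / v5p_bw:.1f}x")           # ~2.7x

# FLOPs available per byte fetched from HBM. Compute grew ~3.8x faster
# than bandwidth, so v7 is tuned for workloads that do a lot of matrix
# math per byte of weights loaded.
print(f"FLOPs/byte: v5p {v5p_tflops * 1e12 / (v5p_bw * 1e9):.0f}, "
      f"v7 {v7_tflops * 1e12 / (v7_bw * 1e9):.0f}")        # ~166 -> ~626
```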
For interconnect, Google uses Optical Circuit Switches (OCS) and a 3D torus network topology.
Compared with NVIDIA's InfiniBand, OCS is extremely cost- and power-efficient because it eliminates optical-electrical conversion. It sacrifices some flexibility, but when paired with Google's compiler on specific AI tasks, its efficiency is unmatched.
Even more noteworthy is energy efficiency. At Hot Chips 2025, Google revealed that v7's performance per watt has doubled relative to v6e (Trillium). A former Google executive put it this way: "For specific applications, TPU can deliver 1.4 times the performance per dollar of a GPU." For dynamic training workloads such as search, TPU's speed can even reach five times that of a GPU.
Escaping the "NVIDIA Tax" and Returning to the High-Margin Era
For investors and cloud vendors, the greatest value of TPU is not just speed, but profit margins.
In the AI era, cloud giants face a slide from "oligopoly" toward "commoditization." Because they must buy NVIDIA GPUs, on which NVIDIA's gross margin runs as high as 75%, cloud vendors' AI business margins have plummeted from the traditional 50-70% to 20-35%, leaving them looking more like a toll-collecting "utility company."
How to return to the high-margin era? Self-developed ASICs are the only remedy.
Google bypassed the "NVIDIA tax" by controlling the full-stack design of TPU: it does the front-end RTL design itself, with Broadcom responsible only for back-end physical implementation. And because Broadcom's gross margin is far lower than NVIDIA's, Google can drive its computing costs down aggressively.
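A stylized example shows the shape of the economics. The 75% gross-margin figure comes from this article; the 50/50 work split and the 30% partner margin below are purely hypothetical placeholders:

```python
# Stylized cost comparison. The 75% gross margin is from this article;
# the work split and partner margin are hypothetical placeholders.
bom = 100.0                                 # normalized cost to build one accelerator

# Buying a merchant GPU: a 75% gross margin means the sale price is
# roughly 4x the underlying cost.
gpu_cost_to_cloud = bom / (1 - 0.75)        # 400

# Building your own ASIC: you pay a partner's (lower) margin only on the
# slice of work you outsource -- here, back-end physical implementation.
inhouse_share, partner_margin = 0.5, 0.30
tpu_cost_to_cloud = bom * inhouse_share + bom * (1 - inhouse_share) / (1 - partner_margin)

print(f"GPU: {gpu_cost_to_cloud:.0f}, TPU: {tpu_cost_to_cloud:.0f}, "
      f"ratio: {gpu_cost_to_cloud / tpu_cost_to_cloud:.1f}x")   # ~3.3x cheaper
```

Whatever the exact inputs, the structure of the math is the same: an in-house designer pays a markup only on the outsourced fraction of the work, so its effective cost per unit of compute falls well below the merchant-silicon price.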
One customer admitted after comparison:
Comparing eight H100s with a single v5e Pod, the latter already wins on performance per dollar. And as Google launches each next generation of TPU, the old models are not retired but become extremely cheap. Sometimes, if you are willing to wait a few extra days of training time, the cost can drop to one-fifth of the original.
TPU still faces challenges from the ecosystem (CUDA's dominance) and from multi-cloud deployment (data migration costs), but CUDA's importance is shrinking as AI workloads shift from "training" to "inference."
SemiAnalysis's assessment cuts to the point:
Google's chip dominance among hyperscale computing vendors is unmatched, and TPU v7 is performance-wise on par with Nvidia Blackwell.
In the trillion-dollar game of AI computing power, NVIDIA may hold the lead, but Google, wielding the TPU sword, may be the only player fully in control of its own destiny.

