In-depth discussion on Google TPU chips: How to compete with Broadcom? Can it compete with NVIDIA?

Wallstreetcn
2025.12.07 02:45

TPU competition with NVIDIA cards is still far from a real threat. Hardware barriers, software-ecosystem adaptation, and business logic alike dictate that buying TPUs outright for self-deployment is something only a very small number of high-end players can even lightly attempt, such as Meta, the subject of recent rumor pieces.

A few days ago, we briefly discussed the challenges faced by Google's TPU, which cannot be separated from Broadcom while also wanting to reduce its high dependence on Broadcom. This time, let's discuss in detail how Google is strategizing against Broadcom. Additionally, in the competitive market environment, can TPU ultimately be sold in large quantities to capture market share from NVIDIA?

I. Development Model of Google TPU

Responsibility for developing TPU versions v7, v7e, and v8 is split between Google and its design partners:

Google initially chose Broadcom because, across chip design and manufacturing services, Broadcom is arguably the best provider in the world, above all for its top-tier high-speed chip interconnect technology, which is core to large-scale parallel AI computing. On the other hand, Broadcom's gross margin on TPU orders runs as high as 70%. MediaTek, a consumer-grade chip maker that is technically weaker than Broadcom, is willing to take TPU orders at a gross margin of just over 30%, which cuts Google's costs substantially and makes MediaTek a useful counterweight for keeping Broadcom in check.
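The margin arithmetic above is worth making concrete. A minimal sketch, assuming a purely hypothetical per-chip implementation cost (the $1,000 figure and the function name are mine, not from the article), shows how much the partner's gross margin moves Google's bill for the same underlying work:

```python
# Toy cost model (illustrative numbers only): gross margin is defined as
# (price - cost) / price, so the price a partner must charge is
# cost / (1 - margin). A 70% vs 30% margin on identical work changes
# Google's per-chip cost dramatically.
def price_at_margin(partner_cost: float, gross_margin: float) -> float:
    """Price the partner charges so that (price - cost) / price = margin."""
    return partner_cost / (1.0 - gross_margin)

partner_cost = 1000.0  # hypothetical cost of the implementation work per chip

broadcom_price = price_at_margin(partner_cost, 0.70)  # ~70% gross margin
mediatek_price = price_at_margin(partner_cost, 0.30)  # ~30% gross margin

print(f"Broadcom-style price: ${broadcom_price:.0f}")
print(f"MediaTek-style price: ${mediatek_price:.0f}")
print(f"Savings from switching: {1 - mediatek_price / broadcom_price:.0%}")
```

Under these assumed numbers the lower-margin partner is roughly 57% cheaper for the same work, which is the whole leverage behind using MediaTek as a balancing piece.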

Many of the other Mag7 tech giants are adopting a similar model for their in-house AI chips: Meta has likewise partnered with Broadcom, Microsoft and Amazon have opted for Marvell and Alchip respectively, leaving only Tesla and Apple pursuing fully independent development.

II. Interface Issues Between Google and Broadcom

Why does Google want to design the top-level architecture of the chips instead of completely outsourcing it to Broadcom? Why doesn't Broadcom sell Google's chip designs as public versions to other manufacturers? Let's explore this interface issue.

Before diving into the main topic, a small story. Nearly 10 years ago, when cloud-service equity investment was hot in China, our due diligence covered server manufacturing, and I heard a rumor: when Alibaba first entered the cloud-services race, it approached Foxconn and privately asked for the server motherboards Foxconn was manufacturing for Google. Foxconn refused and offered its own public-version design instead. Setting aside commercial IP and reputation issues, Google's motherboard design at the time attached a 12V lead-acid battery directly to each board, so grid electricity was converted only once before entering the motherboard, unlike traditional centralized UPS designs that require three conversions, cutting energy consumption significantly. In the cloud-service business of that era, a meaningful energy saving let a vendor either fatten its gross margin or slash front-end prices, a powerful commercial weapon.
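The energy claim in the story is simple compounding. A back-of-envelope sketch, assuming roughly 90% efficiency per conversion stage (that per-stage figure is my assumption, not from the article), shows why one conversion beats three:

```python
# Illustrative arithmetic: chained power-conversion stages multiply their
# losses, so end-to-end efficiency falls geometrically with stage count.
# The 0.90 per-stage efficiency is an assumed round number.
def end_to_end_efficiency(stage_efficiency: float, stages: int) -> float:
    """Fraction of grid power actually delivered after `stages` conversions."""
    return stage_efficiency ** stages

single = end_to_end_efficiency(0.90, 1)  # on-board 12V battery design
triple = end_to_end_efficiency(0.90, 3)  # traditional centralized UPS path

print(f"1 conversion:  {single:.1%} of power delivered")
print(f"3 conversions: {triple:.1%} of power delivered")
```

Under this assumption the single-conversion design delivers 90% of grid power versus roughly 73% for the three-stage path, a gap that compounds into enormous money at data-center scale.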

**Looking at the TPU development interface through the same lens: Google builds the TPU primarily for its own internal workloads, such as the search engine, YouTube, ad recommendations, and the Gemini large models. Only Google's internal teams know how to design the TPU's operators to squeeze the most out of those applications, and that internal commercial information cannot be handed to Broadcom to complete the chip's top-level architecture. This is why Google must design the TPU's top-level architecture itself.**

But this raises a second question: if the top-level architecture design is entrusted to Broadcom, won't Broadcom know about it? Can they improve their own public version and sell it to other manufacturers?

Again setting aside commercial IP and reputation issues, delivering a chip's top-level architecture is nothing like handing over a circuit-board design a decade ago. Google's engineers write the design source code (RTL) in SystemVerilog, and what Broadcom receives after synthesis is a gate-level netlist: even knowing how the design's hundreds of millions of transistors are wired together, it is practically impossible to reverse-engineer the high-level design logic underneath. For the most critical logic modules, such as Google's signature matrix multiplication unit (MXU), Google does not even show Broadcom the netlist; it delivers a finished physical layout (hard IP) and hands Broadcom a black box. Broadcom only has to guarantee power delivery, heat dissipation, and data connectivity for that black box, with no need to know what it computes.
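The RTL-versus-netlist point can be illustrated with a toy example (in Python rather than SystemVerilog, and with four gates rather than millions): the same function expressed at the "intent" level and as a web of primitive gates is bit-identical, yet the gate-level form says nothing about what it is for.

```python
# Toy illustration of why a gate-level netlist hides design intent.
# At the "RTL" level the designer's purpose is explicit; after synthesis
# the same function is an anonymous wiring of primitive gates.
def xor_rtl(a: int, b: int) -> int:
    """High-level intent: this is an XOR."""
    return a ^ b

def nand(a: int, b: int) -> int:
    return 1 - (a & b)

def xor_netlist(a: int, b: int) -> int:
    """The same function as a netlist of 4 NAND gates: the word 'XOR'
    appears nowhere, only connections between primitives."""
    n1 = nand(a, b)
    n2 = nand(a, n1)
    n3 = nand(b, n1)
    return nand(n2, n3)

# Functionally identical on every input, but only the RTL reveals intent.
for a in (0, 1):
    for b in (0, 1):
        assert xor_rtl(a, b) == xor_netlist(a, b)
print("netlist matches RTL on all inputs")
```

Scale this from 4 gates to hundreds of millions of transistors and recovering "this block is an MXU systolic array" from the wiring becomes effectively impossible, which is exactly the protection Google relies on.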

So the working interface we now see between Google and Broadcom is actually the ideal commercial arrangement. Google does the TPU's top-level architecture design, locks down the sensitive pieces, and hands the package to Broadcom, which takes on all the remaining implementation work, contributes its best-in-class high-speed interconnect technology, and ultimately delivers the design to TSMC for manufacturing. Now Google says: TPU shipments are ramping and I want to control costs, so Broadcom, delegate some of your work to MediaTek, and I will pay them less than I pay you. Broadcom agrees; it also has major projects from Meta and OpenAI to handle, so some finishing work can go to MediaTek. And MediaTek says: I can do it cheaper, so please consider giving me more work in the future; apart from high-speed interconnects, which I don't understand, hand me as much as you can.

III. Can TPU truly capture market share from NVIDIA?

In simple terms, the conclusion is: TPU will see a noticeable increase in shipments, but its impact on NVIDIA's shipments will be minimal. The growth logic of the two is not the same, and the services provided to customers are also different.

As mentioned earlier, the growth in shipments of NVIDIA's cards is attributed to three major demand areas:

(1) Growth in the high-end training market. There have been many voices saying that AI models have consumed most of the world's information, and there will be no training demand in the future, which actually refers to pre-training. However, it quickly became apparent that models purely pre-trained on large data sets can easily produce hallucinations and nonsensical outputs, so post-training has gained immediate importance. Post-training involves a significant amount of expert judgment, and the data here can even be dynamic; as long as the world changes, expert judgments need to be continuously revised. Therefore, the more complex the large model, the greater the scale of post-training required.

(2) Complex reasoning demand. The thinking-style large models produced by post-training, such as OpenAI's o1, xAI's Grok 4.1 Thinking, and Google's Gemini 3 Pro, now run multiple rounds of reasoning and self-verification for each complex task, a workload equivalent to small-scale, lightweight training, which means most high-end complex reasoning still needs to run on NVIDIA's cards.

(3) Physical-AI demand. Even if training on all of the world's fixed knowledge were complete, what about the dynamic physical world? Autonomous driving, robots across industries, automated production, scientific research: these keep generating new knowledge and interaction data in the physical world, creating training and complex-reasoning demand far beyond today's stock of recorded knowledge.
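Why "thinking" inference starts to resemble lightweight training is just multiplication. A back-of-envelope sketch, with every number hypothetical (pass counts and token budgets are mine, not benchmarked figures):

```python
# Toy model: a thinking-style model drafts, self-verifies, and revises,
# multiplying tokens generated per query. All numbers are illustrative.
def tokens_per_query(answer_tokens: int, reasoning_passes: int,
                     thought_tokens_per_pass: int) -> int:
    """Total tokens generated: final answer plus all hidden reasoning."""
    return answer_tokens + reasoning_passes * thought_tokens_per_pass

one_shot = tokens_per_query(500, 0, 0)      # plain, direct answer
thinking = tokens_per_query(500, 8, 2000)   # 8 draft/verify/revise passes

print(f"one-shot:  {one_shot:>6} tokens")
print(f"thinking:  {thinking:>6} tokens")
print(f"compute multiplier: ~{thinking // one_shot}x")
```

Under these assumed numbers a single complex query costs roughly 33 times the compute of a direct answer, which is why reasoning demand scales up the high-end accelerator market rather than shrinking it.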

The rapid growth of TPU is largely attributed to:

(1) Growth in Google's own usage. AI is now embedded in nearly all of Google's flagship applications, from Search and YouTube to ad recommendations, cloud services, and the Gemini apps, and that breadth has driven an explosive increase in Google's internal demand for TPUs.

(2) The provision of TPU cloud services in Google Cloud. Although Google Cloud currently primarily uses NVIDIA cards for external customers, it is also vigorously promoting TPU-based cloud services. Large clients like Meta have a strong demand for AI infrastructure, but procuring NVIDIA cards to deploy data centers takes time. As a bargaining chip in commercial negotiations, Meta could consider using leased TPU cloud services for pre-training to alleviate the issues of NVIDIA card shortages and high prices, while using its self-developed chips for internal reasoning tasks. This hybrid chip solution may be the most advantageous choice for Meta.

Finally, let's discuss the hardware and software aspects of why TPU cannot replace or directly compete with NVIDIA cards.

(1) Hardware barriers: Infrastructure incompatibility

NVIDIA's GPUs are standard components that can be bought and plugged into Dell/HP servers for immediate use, making them installable in any data center. TPUs are "systems" that rely on Google's unique 48V power supply, liquid cooling pipelines, cabinet sizes, and closed ICI optical interconnect networks. Unless customers are willing to rebuild their data centers like Google, it is almost impossible to buy TPUs for self-deployment (On-Prem). This means TPUs can only be rented on Google Cloud, limiting their reach in the high-end market.

(2) Software barriers: Ecological incompatibility (PyTorch/CUDA vs. XLA)

90% of AI developers globally use PyTorch + CUDA (dynamic graph mode), while TPUs require a static graph mode (XLA). This presents a high migration cost for developers. Apart from giants like Apple and Anthropic that have the capability to rewrite underlying code, ordinary companies and developers cannot afford to use TPUs. This means TPUs can only serve a "very small number of clients with full-stack development capabilities," and cannot popularize AI training and reasoning to every university and startup, even through cloud services.
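The dynamic-versus-static split can be made concrete without any framework. A minimal sketch in plain Python (the `Tracer` class and function names are mine, a caricature of what XLA-style tracing does, not real framework API): eager code executes each op as it is reached, while the traced path first records a fixed graph and could then replay it.

```python
# Eager / "dynamic graph" style: ops and control flow run immediately,
# as ordinary Python, which is what PyTorch users are accustomed to.
def eager_square_sum(xs):
    total = 0
    for x in xs:
        total += x * x
    return total

class Tracer:
    """Records ops instead of executing them, building a static graph."""
    def __init__(self):
        self.graph = []  # list of (op, lhs, rhs) nodes
    def mul(self, a, b):
        self.graph.append(("mul", a, b))
        return len(self.graph) - 1  # node id stands in for the value
    def add(self, a, b):
        self.graph.append(("add", a, b))
        return len(self.graph) - 1

def trace_square_sum(n):
    """Static / XLA style: the graph is fixed at trace time for one input
    length n. Change the shape and you must retrace and recompile, which is
    the rigidity that makes migrating dynamic-graph code expensive."""
    t = Tracer()
    acc = None
    for i in range(n):
        sq = t.mul(("input", i), ("input", i))
        acc = sq if acc is None else t.add(acc, sq)
    return t.graph

print(eager_square_sum([1, 2, 3]))   # runs immediately
print(len(trace_square_sum(3)))      # 5 recorded nodes: 3 mul + 2 add
```

Rewriting a codebase full of data-dependent Python control flow into this trace-once, fixed-shape style is the migration cost the paragraph above describes, and it is why only full-stack teams take it on.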

(3) Commercial barriers: internal infighting (Gemini vs. Cloud)

As a cloud giant, Google Cloud naturally wants to sell TPU capacity for profit, but the Gemini team would rather monopolize TPU compute to preserve its lead and monetize through the applications built on top. The conflict of interest is obvious: whose revenue pays for the year-end bonuses? Suppose Google starts selling its most advanced TPUs at scale to Meta or Amazon, even helping them deploy and operate the chips, and Google's most profitable business, advertising, then starts being eroded by its two biggest rivals. How does that ledger balance? Such internal strategic conflicts are bound to make Google hesitant about external TPU sales, and it may well withhold the strongest versions altogether, which also means it cannot contest the high-end market with NVIDIA.

IV. Summary

The game between Google and Broadcom over the TPU will continue under this hybrid development model, though it will genuinely raise the development difficulty of the more powerful v8. We will watch the concrete progress, and we also look forward to whether Broadcom offers more detail when it releases its quarterly results on December 11.

The competition between TPUs and NVIDIA cards is still far from being a threat. Whether it’s hardware barriers, software ecosystem adaptation, or commercial logic, it is destined that only a very small number of high-end players can attempt to directly purchase TPUs for their own deployment, such as the recent rumors about Meta.

From what I understand of Meta, however, it is hard to see them committing massive capital expenditure to rebuild data centers around the TPU, not least because Meta's own AI build-out may end up encroaching on Google's advertising business. Moreover, the source of this rumor is The Information, an outlet long hostile to several tech giants such as NVIDIA and Microsoft, and most of its rumor reports have later been debunked. The most likely scenario is that Meta leases TPU cloud capacity for model pre-training or complex inference to reduce its dependence on NVIDIA, mirroring the hybrid strategy of TPU development itself. Tech giants ally and part ways, but in the end only one's own strength carries the day, and the arrangement that best serves each side's interests is the right answer.

Source: New Vision Alan Shore