---
title: "Cerebras IPO: Deeply Tied to OpenAI, Reshaping AI Chip Market Expectations with \"Fast Tokens\""
type: "News"
locale: "en"
url: "https://longbridge.com/en/news/286370662.md"
description: "Cerebras, a wafer-scale chip company betting on \"fast tokens,\" stands at the threshold of an IPO. Leveraging 21 PB/s on-chip bandwidth to achieve an extreme inference speed of 2,000 tok/sec/user, it has secured a massive 750 MW computing power deal with OpenAI, backed by $24.6 billion in orders. However, the other side of the coin is equally stark: a 44 GB SRAM capacity ceiling, only 150 GB/s off-chip I/O, heavy reliance on a single customer, and the ultimate question of whether the \"fast token premium\" can sustainably cover the costs of its complex system. These factors will determine the outcome of this high-stakes gamble"
datetime: "2026-05-14T09:00:58.000Z"
locales:
  - [zh-CN](https://longbridge.com/zh-CN/news/286370662.md)
  - [en](https://longbridge.com/en/news/286370662.md)
  - [zh-HK](https://longbridge.com/zh-HK/news/286370662.md)
---

# Cerebras IPO: Deeply Tied to OpenAI, Reshaping AI Chip Market Expectations with "Fast Tokens"

Cerebras’s story has suddenly become much easier to tell. A few years ago, it was a radical AI hardware company treating an “entire wafer as a chip”: the technology was bold, the commercialization uncertain. Now fast inference has become something large model vendors will pay a premium for, and with OpenAI signing a 750 MW inference computing power deal, Cerebras has arrived at the IPO window.

Myron Xie, an analyst at SemiAnalysis, summarized the core shift directly in a research report released on the 14th: **“After crossing a certain intelligence threshold, developers prefer faster tokens over smarter tokens.”** This one sentence captures the shift in Cerebras’s valuation logic: it does not need to beat GPUs in every AI computing scenario. As long as “high interaction speed” becomes a billable product, its wafer-scale architecture has a role to play.

This is also what makes Cerebras most fascinating. The WSE-3 packs 44 GB of SRAM, compute cores, and on-chip interconnects into an entire wafer, delivering memory bandwidth at the 21 PB/s level and achieving inference speeds in ranges difficult for traditional HBM accelerators to reach. However, the same architecture brings limitations: **SRAM capacity is not large enough, off-chip I/O is only 150 GB/s, and cooling, power supply, and packaging are highly customized, making it increasingly strained when serving ultra-large models and long contexts.**

**OpenAI represents Cerebras’s biggest opportunity, but also concentrates risk onto a single customer. The agreement between the two parties corresponds to 750 MW of inference computing power, with OpenAI holding an option for an additional 1.25 GW; Cerebras disclosed remaining performance obligations amounting to $24.6 billion.** Yet the same deal comes bundled with a $1 billion working capital loan, warrants with a near-zero exercise price, and intense data center delivery pressure. What IPO investors really need to ask is not “whether wafer chips are cool,” but rather: Can the premium for fast tokens cover Cerebras’s structural costs and single-customer risk?

## Cerebras Is Betting on “Interaction Speed,” Not “Total Throughput”

In the past, the main thread of AI inference hardware was how many tokens each GPU or each rack could output. For cloud providers and model vendors, total throughput means unit cost and the ability to serve more users.

But user behavior is pushing another curve to the forefront: tokens/sec/user, or the speed at which a single user receives output.

OpenAI and Anthropic are splitting the same model into different service tiers: **fast, priority, standard, and batch.** Whether users are willing to pay for faster responses is no longer just a product manager’s guess. Opus 4.6 fast once traded approximately 6x the price for 2.5x the interaction speed, later reducing the speed advantage to about 1.75x; even so, the high-speed mode remains a SKU that developers are willing to pay for. SemiAnalysis’s own AI spending in April annualized to $10 million, with 80% spent on Opus 4.6 fast.

This indicates a market shift: When model capabilities are sufficiently usable, wait time becomes a productivity bottleneck. For agentic workflows involving coding, tool invocation, and continuous iteration, a delay of a few seconds is not just an experience issue—it interrupts the workflow.

Cerebras’s advantage lies precisely here. It does not rely on stacking more HBM for capacity but uses the extremely high bandwidth of on-chip SRAM to make decode scenarios with low batch sizes, small concurrency, and high interaction speeds exceptionally fast. In other words, if a GPU is like a bus that can carry many people, Cerebras is more like a sports car designed for high-speed direct transport for a few passengers.

## WSE-3 Is Not a “Large GPU”; It Is an Entire Wafer

Cerebras’s core product, the WSE, treats an entire wafer as a single chip rather than cutting it into dozens or hundreds of independent dies.

The WSE-3 uses TSMC’s N5 process and consists of a 12×7 grid of 84 identical step-and-repeat regions. Each wafer contains approximately 970,000 cores, with 900,000 enabled. Half of the wafer area is allocated to SRAM and the other half to compute cores. The key to this design is keeping both computation and storage on the same piece of silicon, minimizing data movement off the chip and out of the package.

The specifications are staggering:

- SRAM capacity: 44 GB
- SRAM bandwidth: 21 PB/s
- External I/O: 150 GB/s
- Publicly marketed FP16 compute: 125 PFLOPs
- Dense FP16 compute with the 8:1 unstructured-sparsity assumption removed: approximately 15.6 PFLOPs

These numbers must be viewed separately. The 21 PB/s memory bandwidth is Cerebras’s strongest feature; the 15.6 PFLOPs dense FP16 compute power is also respectable, but if measured per unit of silicon area, it is not as astonishing as the marketing claims suggest. The 125 PFLOPs figure comes from sparse assumptions, referred to in materials as “Feldman’s Formula,” which involves multiplying dense compute power by 8.
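
The relationship between the two headline compute figures is a one-line calculation; this sketch simply undoes the 8:1 multiplier described above:

```python
# Back-of-envelope: recover dense FP16 compute from the marketed sparse figure
# by undoing the 8:1 unstructured-sparsity multiplier ("Feldman's Formula").
marketed_sparse_pflops = 125
sparsity_multiplier = 8
dense_pflops = marketed_sparse_pflops / sparsity_multiplier
print(f"Dense FP16: {dense_pflops:.3f} PFLOPs")  # ~15.625 PFLOPs
```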

The real dividing line lies in memory type. Mainstream AI accelerators like GPUs, TPUs, and Trainium place model weights and KV Cache in HBM; Cerebras tries to keep them in SRAM as much as possible. SRAM is fast and has low latency, but it has a high cost per bit and low capacity density.

44 GB of SRAM is large in the world of single chips. But compared to HBM, it is not. A single HBM3E 12-Hi stack has 36 GB; currently, a high-end GPU or TPU package commonly features 8 stacks, corresponding to 288 GB, which is 6.5 times the SRAM capacity of the WSE-3.
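
That multiple is simple arithmetic on the stack figures quoted above:

```python
# Capacity comparison: WSE-3 on-wafer SRAM vs. an 8-stack HBM3E package.
wse3_sram_gb = 44
hbm3e_stack_gb = 36          # one 12-Hi HBM3E stack
stacks_per_package = 8
hbm_package_gb = hbm3e_stack_gb * stacks_per_package
print(f"{hbm_package_gb} GB of HBM = "
      f"{hbm_package_gb / wse3_sram_gb:.1f}x the WSE-3's SRAM")  # 288 GB, 6.5x
```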

This is Cerebras’s fundamental trade-off: exchanging capacity for speed.

## Wafers Win in Low-Arithmetic-Intensity Decode, but Lose in Large Models and Long Contexts

The task Cerebras is best suited for is the decode phase, which has low arithmetic intensity and is limited by memory bandwidth.

In large model inference, many kernels are short not on compute but on memory bandwidth. A GPU’s Tensor Cores may be powerful, but if weights and KV Cache cannot be fed in fast enough, the compute sits starved. By spreading extensive SRAM across the wafer, Cerebras keeps data close to the compute units with ample bandwidth, allowing low-concurrency decode scenarios like batch=1 to reach interaction speeds that traditional HBM systems struggle to match.

The theoretical comparison in the materials is clear: For a decode kernel with batch=1 and an arithmetic intensity of about 2, NVIDIA GPUs and Groq LPUs can theoretically only achieve tens to hundreds of TFLOPs; under ideal conditions, the Cerebras WSE-3 can approach its full 15.625 PFLOPs dense FP16 compute power.
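
A minimal roofline sketch shows why this happens. The WSE-3 numbers come from the report; the HBM-system peak and bandwidth below are illustrative placeholders, not figures from the source:

```python
# Minimal roofline model: attainable throughput is capped both by peak compute
# and by (arithmetic intensity x memory bandwidth).
def attainable_pflops(peak_pflops: float, bw_pb_s: float, intensity: float) -> float:
    """FLOPs per byte moved cap how fast memory-bound kernels can run."""
    return min(peak_pflops, intensity * bw_pb_s)

INTENSITY = 2.0  # FLOPs/byte, typical of a batch=1 decode kernel
systems = [
    ("WSE-3 (SRAM, 21 PB/s)", 15.625, 21.0),
    ("HBM accelerator (illustrative, ~8 TB/s)", 2.0, 0.008),
]
for name, peak, bw in systems:
    pf = attainable_pflops(peak, bw, INTENSITY)
    print(f"{name}: {pf * 1000:.0f} TFLOPs attainable")
# WSE-3 hits its full dense peak; the HBM system is pinned to ~16 TFLOPs.
```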

This is the hardware foundation of “fast tokens.”

**However, as models grow larger and contexts lengthen, the 44 GB SRAM begins to feel tight. An inference system’s memory must hold three types of data:**

1. Model weights;
2. KV Cache required for concurrent requests;
3. The larger KV Cache resulting from long contexts.

Workloads like agentic coding are particularly troublesome. Sample calculations involving approximately 432,000 requests and about 80 billion tokens show that the typical P50 input sequence length is around 96.3k tokens, rather than the 64k assumed in Cerebras product specifications; nearly 50% of requests exceed 128k, which already reaches the maximum context window currently supported by Cerebras’s public endpoints.
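
A rough KV Cache sizing sketch shows why such sequence lengths bite. All model dimensions below are hypothetical, chosen only to illustrate the scaling; real models served this way may use far more aggressive KV compression or offload:

```python
# KV Cache per request: 2 tensors (K and V) x layers x KV heads x head dim
# x bytes per value x sequence length. All model dimensions are hypothetical.
def kv_cache_gb(seq_len: int, layers: int = 40, kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len / 1e9

for ctx in (64_000, 96_300, 128_000, 256_000):
    print(f"{ctx:>7}-token context -> {kv_cache_gb(ctx):5.1f} GB per request")
# Against 44 GB of SRAM that must also hold weights, long contexts dominate.
```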

This means that if future model services move toward 256k or 1M context windows, Cerebras will either have to compress KV Cache, deploy more wafers, or sacrifice interaction speed and economic efficiency.

## Cooling and BOM: This Is Not Cheap Compute

The CS-3 system is not simply a matter of plugging a chip into a server.

Each CS-3 includes a WSE-3 engine block, peripheral compute and I/O modules, two mechanical pumps, twelve 3.3 kW power modules, and a liquid cooling system. The single WSE-3 itself consumes about 25 kW. Spread over a 46,225 square millimeter wafer, the average heat flux density is about 50 W/cm², not even accounting for hotspots.

Air cooling is impractical. Scaled up to a 21.5 cm square, an ordinary 3D vapor chamber would hit its capillary limit, with working fluid return unable to keep up. Cerebras must use a custom liquid cooling structure: a four-layer “sandwich” of cold plates, the wafer, flexible connectors, and PCBs, with a cooling manifold attached behind the cold plate. Since silicon and PCBs have different coefficients of thermal expansion, traditional packaging would crack, so custom connections, pre-loading, and assembly tooling are required.

The data center side is also affected. The facility-side flow rate for the GB200 NVL72 reference design is about 1.5 LPM/kW, whereas the WSE-3 at 25 kW requires about 100 LPM, equivalent to 4 LPM/kW, nearly three times higher. This demands larger pumps, thicker pipes, larger CDUs, and higher-flow quick-connect fittings. Only if the CS-4 can bring rack-level flow back to 1.5–1.7 LPM/kW will it be closer to standardized infrastructure.
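
Both cooling figures in this section can be reproduced from the quoted numbers:

```python
# Reproduce the heat-flux and facility-flow figures from this section.
wafer_power_w = 25_000                       # WSE-3 power draw
wafer_area_cm2 = 46_225 / 100                # 46,225 mm^2 -> cm^2
print(f"Heat flux: {wafer_power_w / wafer_area_cm2:.0f} W/cm^2")  # ~54 W/cm^2

wse3_flow_lpm = 100                          # facility-side flow per WSE-3
gb200_lpm_per_kw = 1.5                       # GB200 NVL72 reference design
wse3_lpm_per_kw = wse3_flow_lpm / (wafer_power_w / 1_000)
print(f"WSE-3: {wse3_lpm_per_kw:.0f} LPM/kW, "
      f"{wse3_lpm_per_kw / gb200_lpm_per_kw:.1f}x the GB200 figure")  # 4, 2.7x
```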

The cost is also significant. The estimated BOM for a CS-3 plus KVSS CPU node was about $350,000 per rack before memory price increases in Q4 last year; including recent memory prices, it is about $450,000 per rack. The KVSS is a dual-socket AMD CPU node equipped with 6 TB of DDR5 RDIMM, used for KV Cache offload.

Interestingly, the TSMC N5 wafer is not the only big-ticket item. The nominal cost of a single N5 wafer is about $20,000, but Cerebras must also create additional upper-layer metal masks for each batch of wafers to route around defective tiles; Vicor’s custom power modules are also expensive, with the materials estimating their value close to that of the TSMC content. Cooling, packaging, and assembly involve extensive in-house development, and there are also twelve 100GbE Xilinx FPGAs acting as NIC-like converters between Cerebras’s proprietary I/O and Ethernet.

**Therefore, Cerebras is not a “cheap chip alternative to GPUs.” It exchanges complex systems for extreme interaction speed within a specific inference speed range.**

## Stagnant SRAM Scaling Is a Process-Node Problem Cerebras Cannot Sidestep

Cerebras relies heavily on SRAM, but SRAM scaling is stalling.

Changes in SRAM capacity across three generations of WSE illustrate this clearly:

- WSE-1, TSMC 16nm, 18 GB SRAM;
- WSE-2, 7nm, 40 GB SRAM, a 2.2x generational improvement;
- WSE-3, 5nm, 44 GB SRAM, an improvement of only about 10%.

Moving from 7nm to 5nm, the number of logic transistors increased by about 50%, but SRAM capacity barely moved. It will be even harder going forward. N3E offers basically no shrinkage in SRAM compared to N5, and N2 and beyond will continue to be constrained.
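
The generational ratios quoted above follow directly from the capacity figures:

```python
# SRAM capacity per WSE generation, from the list above.
gens = [("WSE-1", "16nm", 18), ("WSE-2", "7nm", 40), ("WSE-3", "5nm", 44)]
for (a, node_a, gb_a), (b, node_b, gb_b) in zip(gens, gens[1:]):
    print(f"{a} -> {b} ({node_a} -> {node_b}): {gb_b / gb_a:.2f}x SRAM")
# WSE-1 -> WSE-2: 2.22x; WSE-2 -> WSE-3: 1.10x. The scaling has stalled.
```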

**For Cerebras, this is more fatal than for GPU manufacturers. GPUs can continue to stack HBM, expand packages, and pool memory via interconnects; SRAM-based machines like Groq can use hybrid bonding to stack more SRAM tiles in the Z-direction. Cerebras uses an entire wafer, and the planar area is already fully utilized. Increasing SRAM area would mean sacrificing compute area.**

The CS-4 roadmap also exposes this point: It still uses the N5-based WSE-3, increasing power consumption, clock speed, and sustained compute capability, but leaving SRAM capacity unchanged.

One possible direction is wafer-to-wafer hybrid bonding: stacking DRAM wafers or additional storage onto the WSE. Cerebras is indeed exploring this path. However, the thermomechanical issues of a wafer-scale monolithic chip and the bond-wave problems are harder than in conventional hybrid bonding. Cerebras has solved many unusual problems in the past, but this next step remains a tough battle.

## The Biggest Weakness Is I/O: The Wafer Is Large, but the Exit Is Narrow

The WSE-3’s off-chip bandwidth is only 150 GB/s, or 1.2 Tb/s. Relative to its compute scale and on-chip bandwidth, this exit is too small.

This is not because the engineers ignored the importance of I/O; it follows from geometric constraints inherent in the wafer-scale architecture.

The WSE consists of 84 identical step-and-repeat regions. Each reticle exposure pattern must be consistent, with logic, SRAM, and wiring positions identical, to allow cross-scribe-line interconnects to extend continuously across the wafer. In other words, you cannot place SerDes PHYs only on edge reticles while making all middle reticles purely for computation. Every reticle must look the same.

To increase edge I/O, PHYs would need to be placed in every reticle. The problem is that PHYs in the middle of the wafer cannot connect to the outside world and would become wasted silicon. Worse, high-speed SerDes PHYs are large, and as analog circuits they dislike sitting next to digital logic and need guard regions; placing them inside the wafer would punch holes in the 2D mesh, increasing routing distance and latency, thereby undermining the very advantage wafer-scale interconnects are meant to deliver.

**The materials provide an intuitive figure: The WSE’s current off-chip bandwidth density is about 0.17 GB/s per millimeter of edge, whereas NVIDIA’s off-chip I/O density is about 130 times higher.**
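
The edge-density figure follows from the wafer’s perimeter, assuming all four 215 mm edges are usable:

```python
# Off-chip bandwidth per millimeter of wafer edge, from the quoted figures.
off_chip_gb_s = 150              # total off-wafer I/O bandwidth
edge_mm = 4 * 215                # perimeter of the ~21.5 cm square wafer
print(f"Edge I/O density: {off_chip_gb_s / edge_mm:.2f} GB/s/mm")  # ~0.17
```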

**Cerebras’s solution is optical interconnect wafers: stacking photonic interconnect wafers onto the WSE via hybrid bonding, allowing data to enter and exit along the Z-axis rather than squeezing out from the wafer edges. The partner is Ranovus.**

This path is elegant but difficult. Optical devices are sensitive to temperature—they cannot be too hot or too cold—and they must be attached to a high-power wafer. Fiber coupling has not yet been fully engineered for easy mass production even in ordinary CPO, let alone scaled up to an entire wafer.

## Large Models Will Force Cerebras to Use Pipelining, Contradicting the Original Intent of “Speed”

If a model cannot fit into a single WSE, it must be split across multiple wafers.

But low I/O bandwidth rules out many common parallelism methods. High-bandwidth collective communication is unrealistic, as is frequent movement of large tensors in and out of the wafer. The most feasible remaining option is pipeline parallelism: splitting the model by layers across multiple WSEs, with each wafer retaining the weights for its corresponding layers and transmitting only activation values between stages.

When serving Llama 3 70B, Cerebras splits the model across 4 WSE-3s, transmitting only activations between wafers, keeping communication volume within the 1.2 Tb/s I/O capacity.
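
A quick sanity check on the wafer count, assuming 16-bit weights (the article does not state the precision actually served):

```python
# Why ~4 wafers for Llama 3 70B: 16-bit weights alone nearly fill each
# wafer's SRAM. The precision is an assumption, not stated in the source.
params = 70e9
bytes_per_param = 2                  # assumed FP16/BF16 weights
wafers = 4
per_wafer_gb = params * bytes_per_param / wafers / 1e9
print(f"{per_wafer_gb:.0f} GB of weights per wafer vs. 44 GB of SRAM")  # 35 GB
```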

**However, pipelining brings three problems.**

First, pipeline bubbles. Four stages require at least about 4 in-flight microbatches to stay busy; 16 stages would require about 16. The more stages, the harder the scheduling.

Second, each in-flight microbatch has its own KV Cache, and that KV Cache must squeeze into the 44 GB of SRAM alongside the weights. Even if new models use stronger KV compression, moving KV on and off the chip will still add milliseconds of pressure to TTFT (time to first token) and TPOT (time per output token).

Third, as the number of wafers increases, the fixed latency of activation transmission between wafers also increases linearly. The larger the model, the further it deviates from Cerebras’s ideal form: small batch, low latency, high-speed decode on a single or few wafers.
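
Of the three, the bubble problem is the easiest to quantify. This is a sketch of the generic pipeline-parallel bubble formula, not Cerebras’s actual scheduler:

```python
# Textbook pipeline parallelism: with S stages and M in-flight microbatches,
# steady-state utilization is M / (M + S - 1); the remainder is bubble.
def utilization(stages: int, microbatches: int) -> float:
    return microbatches / (microbatches + stages - 1)

for stages in (4, 16):
    for m in (stages, 4 * stages):
        print(f"{stages:>2} stages, {m:>2} microbatches: "
              f"{utilization(stages, m):.0%} busy")
# More stages demand more in-flight microbatches, each with its own KV Cache.
```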

Public product lines also reveal boundaries. The largest production model on Cerebras Inference Cloud is currently GPT-OSS, with 120B total parameters; the larger preview model GLM 4.7 stops at 355B. Llama 70B and 405B were once popular but were later taken offline, possibly due to service economics. Two popular open-source frontier models of 2025, DeepSeek V3 and Kimi K2, also do not appear on Cerebras’s public cloud.

However, this is not an absolute dead end. If models like DeepSeek V4 Pro adopt stronger KV Cache compression, 1T+ parameter models might become serviceable again under sufficient concurrency. The question is whether they can do so while preserving what Cerebras values most: speed.

## OpenAI Brings Cerebras to the Main Table, but Also Concentrates the Risk

OpenAI is not just an ordinary customer in Cerebras’s future.

**In December 2025, the two parties signed a Master Relationship Agreement. OpenAI committed to purchasing 750 MW of AI inference computing power, to be deployed in batches from 2026 to 2028, with each batch having a term of 3–4 years, extendable to 5 years. OpenAI also has an option to purchase an additional 1.25 GW, raising the total to 2 GW.**

The S-1 discloses that as of December 31, 2025, Cerebras’s remaining performance obligations amounted to $24.6 billion. More importantly, pass-through costs such as data center rent, electricity, leasehold improvements, and security are reimbursed by OpenAI and recognized as revenue on a gross basis.

OpenAI also provides a $1 billion working capital loan at an annual interest rate of 6%. If Cerebras repays through the delivery of computing power or hardware, the corresponding interest can be waived. Repayment begins after the final delivery of the initial 250 MW, amortized equally over three years. If the MRA is terminated for reasons other than OpenAI’s material uncured default, Cerebras may be required to immediately repay all outstanding principal and accrued interest. OpenAI can also instruct the custodial bank to stop disbursing funds on Cerebras’s instructions and take direct control of how the money is deployed.

The equity ties run deep as well. Cerebras issued OpenAI warrants for 33,445,026 shares of Class N non-voting common stock at an exercise price of $0.00001, virtually free. Part of this vested immediately with the $1 billion loan, another part is linked to a $40 billion market cap or payment thresholds, and the remainder is tied to computing power delivery and the expansion option toward 2 GW. On a fully diluted basis, OpenAI could hold up to about 12% of Cerebras’s shares, excluding subsequent new issuances.

Under ASC 505-50, equity incentives granted to customers are recognized as contra-revenue over the term of the commercial agreement. At the rough S-1 valuation of $82.02 per share, the warrants theoretically correspond to about $2.74 billion in contra-revenue, roughly 10% of the revenue expected from OpenAI.
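
The contra-revenue figure is straightforward arithmetic on the disclosed numbers:

```python
# Contra-revenue implied by the customer warrants (ASC 505-50 treatment).
warrant_shares = 33_445_026
price_per_share = 82.02              # rough per-share valuation from the S-1
backlog = 24.6e9                     # remaining performance obligations
contra = warrant_shares * price_per_share
print(f"Contra-revenue: ${contra / 1e9:.2f}B "
      f"(~{contra / backlog:.0%} of the OpenAI backlog)")  # ~$2.74B
```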

This is an order that can change destiny, but also a structure that bets the company’s fate on a single counterparty.

## GPT-5.3-Codex-Spark Proves the Value of Speed, but Also Exposes Model Size Issues

After OpenAI released GPT-5.3-Codex-Spark, Cerebras’s narrative became more complete. This model uses the gpt-oss-120B architecture, distilled from the true GPT-5.3-Codex, and can run at up to 2,000 tok/sec/user on Cerebras.

The key lies in “120B.” It is not the full GPT-5.3-Codex, but a much smaller distilled model. The materials explicitly state that it is more than 10 times smaller than the full model.

This is both good news and a limitation for Cerebras.

The good news is that if 120B-class models are sufficiently capable, combined with extremely fast output speeds, they can indeed become high-value products. Developers have already proven they are willing to sacrifice some frontier intelligence for faster tokens.

The limitation is that if OpenAI wants to run models with over 1T parameters, 1M context windows, and real agentic workloads on Cerebras, it must accept significant cost trade-offs, and actual interaction speeds may fall below 1,000 tok/sec. Whether a sufficiently high token premium can be charged is the key to the business model’s viability.

The path assumption given in the materials is aggressive: Small model capabilities continue to improve, and within about a year, the 120B form factor may approach GPT-5.5-level intelligence. If this holds true, Cerebras would not need to host the most cutting-edge, largest-parameter models to still sell high-priced fast tokens. The 750 MW locked in by OpenAI is just the first step; the true upside potential comes from whether the additional 1.25 GW option is exercised, or even further procurement expansion.

But this upside condition is narrow: Cerebras must prove that it can continuously host sufficiently smart and profitable models within the model sizes suitable for its hardware.

## The Core Question of the IPO: Can the Fast-Token Premium Cover the Hardware Trade-offs Long Term?

Cerebras is not another GPU story. It is not comprehensively replacing NVIDIA in training, general large model inference, or long-context throughput, but is placing a heavy bet in a narrower but potentially very profitable segment: high interaction speed, low batch sizes, and inference where users are willing to pay a premium.

Wafer-scale architecture gives it extremely strong bandwidth and ultra-fast decode, but also burdens it with hard constraints such as SRAM capacity, off-chip I/O, cooling, BOM, and data center adaptation. The OpenAI order solves the demand problem but does not eliminate delivery risks and customer concentration.

Therefore, Cerebras’s IPO pricing should not rest only on the $24.6 billion backlog, nor on headline speeds like 2,000 tok/sec/user. Three questions matter more:

1. Whether the fast tokens OpenAI needs can, over the long term, be served by models in the 120B–355B range;
2. Whether the premium users will pay for speed can cover Cerebras’s more complex system costs;
3. Whether the 750 MW can be delivered on schedule by 2028 without being dragged down by cooling, power, supply chain, and data center execution.

**If the answers lean towards “yes,” Cerebras will become one of the most distinctive AI hardware companies in the era of fast inference. If the answers lean towards “no,” the speed advantage brought by the entire wafer may be gradually eaten away by the memory demands of large models and long contexts.**

### Related Stocks

- [159995.CN](https://longbridge.com/en/quote/159995.CN.md)
- [588780.CN](https://longbridge.com/en/quote/588780.CN.md)
- [159325.CN](https://longbridge.com/en/quote/159325.CN.md)
- [SOXL.US](https://longbridge.com/en/quote/SOXL.US.md)
- [512760.CN](https://longbridge.com/en/quote/512760.CN.md)
- [512720.CN](https://longbridge.com/en/quote/512720.CN.md)
- [SOXX.US](https://longbridge.com/en/quote/SOXX.US.md)
- [159998.CN](https://longbridge.com/en/quote/159998.CN.md)
- [CBRS.US](https://longbridge.com/en/quote/CBRS.US.md)
- [OpenAI.NA](https://longbridge.com/en/quote/OpenAI.NA.md)
- [NVDA.US](https://longbridge.com/en/quote/NVDA.US.md)
- [AMD.US](https://longbridge.com/en/quote/AMD.US.md)
- [TSM.US](https://longbridge.com/en/quote/TSM.US.md)
- [VICR.US](https://longbridge.com/en/quote/VICR.US.md)
- [RAN.US](https://longbridge.com/en/quote/RAN.US.md)
- [NVD.DE](https://longbridge.com/en/quote/NVD.DE.md)

## Related News & Research

- [ByteDance Hikes AI Budget by 25% After Past Spending Wiped Out 70% Net Profit](https://longbridge.com/en/news/285977418.md)
- [FotoNation and SEMIFIVE Announce Strategic Collaboration for Turnkey Development of TriSilica Perceptual AI Chip Family Using Samsung Foundry](https://longbridge.com/en/news/285912358.md)
- [Cerebras set for debut in stock market gripped by AI mania](https://longbridge.com/en/news/286399918.md)
- [EXCLUSIVE-SK Hynix flooded with unprecedented offers from big tech firms to secure chip supplies](https://longbridge.com/en/news/285627175.md)
- [China’s chip dream: Loongson challenges Intel, fuelled by Beijing’s tech drive](https://longbridge.com/en/news/286178095.md)