---
title: "IEIT Liu Jun: The AI industry cannot achieve profitability without reducing costs; a cost of 1 yuan per million tokens is still far from enough!"
type: "News"
locale: "en"
url: "https://longbridge.com/en/news/270816475.md"
description: "IEIT Liu Jun: The AI industry cannot achieve profitability without reducing costs; a cost of 1 yuan per million tokens is still far from enough!"
datetime: "2025-12-25T19:09:30.000Z"
locales:
  - [zh-CN](https://longbridge.com/zh-CN/news/270816475.md)
  - [en](https://longbridge.com/en/news/270816475.md)
  - [zh-HK](https://longbridge.com/zh-HK/news/270816475.md)
---

# IEIT Liu Jun: The AI industry cannot achieve profitability without reducing costs; a cost of 1 yuan per million tokens is still far from enough!

The current global AI industry has transitioned from a competition in model performance to a "life-and-death race" for the large-scale implementation of intelligent agents. "Cost reduction" is no longer an optional optimization but the core lifeline determining whether AI companies can be profitable and whether the industry can break through. Against this backdrop, Inspur Information has launched the Yuan Nao HC1000 ultra-scalable AI server, which has for the first time reduced inference costs to 1 yuan per million tokens. This breakthrough is expected to eliminate the cost barriers of the "last mile" in the industrialization of intelligent agents and will reshape the underlying logic of competition in the AI industry.

**Liu Jun, Chief AI Strategist of Inspur Information, emphasized that** the current breakthrough of 1 yuan per million tokens is only a temporary victory. Token consumption is growing exponentially, and complex tasks are driving a tenfold increase in token demand, so the existing cost level still cannot support the widespread implementation of AI. For AI to truly become a basic resource like "water, electricity, and coal," token costs must fall by a further order of magnitude from the current level. Cost capability will shift from "core competitiveness" to a "survival ticket," directly determining the life and death of AI companies in the era of intelligent agents.

 https://static001.geekbang.org/infoq/2e/2e4f082af574d2a71f429053bdacf33a.png
Liu Jun, Chief AI Strategist of Inspur Information

## In the era of intelligent agents, token cost is competitiveness

Looking back at the history of the internet, faster and cheaper infrastructure has been an important cornerstone of industry prosperity. From dial-up access billed by the Kb, to fiber access where 100 Mbps bandwidth became standard, and then to the 4G/5G era where data traffic costs approach zero, each significant reduction in communication costs has driven an explosion of new application ecosystems such as video streaming and mobile payments.

The AI era is at a similar critical point. As technological advances push token prices down, companies can apply AI at scale to more complex and energy-intensive scenarios, moving from early short Q&A to intelligent agents that support ultra-long contexts, multi-step planning, and reflection. This has in turn driven an exponential increase in the tokens required for a single task. If token costs do not fall as fast as consumption grows, companies will face higher total expenditure even as unit prices drop. The "Jevons Paradox" from economics, in which making a resource cheaper to use raises total consumption of it, is replaying in the token economy.
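The interaction between falling unit prices and rising volume can be illustrated with a few lines of arithmetic. This is a minimal sketch: the prices and token volumes below are hypothetical and chosen only to show the effect, not taken from the article.

```python
def total_spend(price_per_million: float, tokens: float) -> float:
    """Total spend = unit price x volume, with price quoted per million tokens."""
    return price_per_million * tokens / 1_000_000


# Hypothetical workloads (illustrative numbers only).
# Early short Q&A: higher unit price, low token volume.
before = total_spend(price_per_million=10.0, tokens=1e9)
# Agent workload: unit price falls 10x, but token volume grows 100x.
after = total_spend(price_per_million=1.0, tokens=1e11)

print(f"spend before: {before:,.0f}")
print(f"spend after:  {after:,.0f}")
print(f"spend ratio:  {after / before:.1f}x")
```

In this toy case the unit price falls tenfold while total spend still rises tenfold, because volume grows faster than the price declines.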

Data from multiple sources supports the trend of exponential growth in token consumption. The latest data disclosed by Volcano Engine shows that as of December this year, daily token usage of ByteDance's Doubao large model has exceeded 50 trillion, more than 10 times the level of the same period last year and 417 times the daily usage at its launch in May 2024. Google disclosed in October that its platforms process 1.3 quadrillion (1,300 trillion) tokens per month, equivalent to an average of 43.3 trillion per day, compared with just 9.7 trillion per month a year earlier.
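The quoted figures can be cross-checked against each other. The sketch below uses only the numbers in the paragraph above, plus one assumption: a 30-day month when converting Google's daily average to a monthly volume.

```python
TRILLION = 10**12

# Doubao: current daily usage and the growth multiple since launch.
doubao_daily_now = 50 * TRILLION
doubao_growth_since_launch = 417
implied_launch_daily = doubao_daily_now / doubao_growth_since_launch

# Google: monthly volume implied by the daily average (assumes a 30-day month).
google_daily = 43.3 * TRILLION
google_monthly = google_daily * 30
google_monthly_year_ago = 9.7 * TRILLION
google_yearly_growth = google_monthly / google_monthly_year_ago

print(f"Doubao implied daily usage at launch: {implied_launch_daily / 1e9:.0f} billion")
print(f"Google implied monthly volume: {google_monthly / TRILLION:,.0f} trillion")
print(f"Google year-over-year growth: {google_yearly_growth:.0f}x")
```

The Doubao numbers imply roughly 120 billion tokens per day at launch, and the Google daily average implies a monthly volume on the order of 1,300 trillion tokens.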

 https://static001.geekbang.org/infoq/63/630b39f9465489f30921f6716ba10c8d.png
Google announced changes in its token processing volume.

When usage reaches the scale of 100 trillion tokens per month, a difference of just $1 per million tokens adds up to a monthly cost difference of $100 million. Liu Jun believes that "token cost is competitiveness; it directly determines the profitability of intelligent agents. To truly bring AI into a stage of large-scale and inclusive development, token costs must continue to decrease by orders of magnitude from the current level."
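The $100 million figure follows from simple unit conversion, and it corresponds to a volume of about 100 trillion tokens per month. A minimal sketch (the function name is illustrative):

```python
def monthly_cost_delta(tokens_per_month: float, price_delta_per_million: float) -> float:
    """Monthly cost change for a given change in price per million tokens."""
    return tokens_per_month / 1_000_000 * price_delta_per_million


TRILLION = 1e12

# A $1 change per million tokens at two different monthly volumes.
delta_at_1_trillion = monthly_cost_delta(1 * TRILLION, 1.0)
delta_at_100_trillion = monthly_cost_delta(100 * TRILLION, 1.0)

print(f"1 trillion tokens/month:   ${delta_at_1_trillion:,.0f}")
print(f"100 trillion tokens/month: ${delta_at_100_trillion:,.0f}")
```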

## In-depth Analysis of Token Cost "Black Box": Mismatched Architecture is the Core Bottleneck

Currently, global large model competition has shifted from "blindly stacking computing power" to a new stage of "pursuing the value output per unit of computing power." That value is influenced by factors such as energy prices, hardware procurement costs, algorithm optimization, and operational costs. However, over 80% of current token costs still come from computing power expenditure, and the core obstacle to cost reduction is the stark difference between inference workloads and training workloads. Continuing to use the old architectures for inference makes it difficult to optimize computing power, memory, and network resources at the same time, resulting in severe "high configuration but low efficiency."

**First, there is a severe inversion of computing power utilization (MFU).** MFU during training can exceed 50%. During inference, especially for real-time interactive tasks that demand low latency, autoregressive decoding forces the hardware to load all model parameters in every step just to compute a single output token. Expensive GPUs therefore spend most of their time waiting for data transfer, and actual MFU is often only 5%-10%. This enormous idle computing power is the structural root of high costs.
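A simplified roofline estimate shows why decoding leaves compute idle. This is a rough sketch under stated assumptions: it counts only weight reads (ignoring KV Cache traffic and communication), assumes 2 bytes per parameter, and uses a hypothetical accelerator with 1,000 TFLOPS of peak compute and 3 TB/s of memory bandwidth. None of these hardware numbers come from the article.

```python
def decode_mfu_bound(batch_size: int, peak_flops: float, mem_bandwidth: float,
                     bytes_per_param: int = 2) -> float:
    """Upper bound on MFU for one decoding step under a weights-only roofline model."""
    # Each parameter is read once per step and used in one multiply-add
    # (2 FLOPs) for every sequence in the batch.
    flops_per_byte = 2 * batch_size / bytes_per_param
    # FLOPs per byte the hardware needs before compute, not memory, is the limit.
    ridge_point = peak_flops / mem_bandwidth
    return min(1.0, flops_per_byte / ridge_point)


# Hypothetical accelerator (assumed values, for illustration only).
PEAK_FLOPS = 1e15       # 1,000 TFLOPS
MEM_BANDWIDTH = 3e12    # 3 TB/s

for batch in (1, 8, 32, 128):
    bound = decode_mfu_bound(batch, PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"batch={batch:>3}  MFU upper bound = {bound:.1%}")
```

Under these assumptions, small batches keep the bound in the low single digits, which is in line with the 5%-10% range described above for latency-sensitive serving.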

**Second, the "storage wall" bottleneck is amplified in inference scenarios.** In large model inference, the KV Cache grows linearly with context length, so long-context workloads quickly consume very large amounts of memory. This not only occupies memory capacity but also drives up power consumption because of intensive memory access. The separation of storage and computation incurs data-movement power and latency costs and also requires expensive HBM, which has become an important bottleneck hindering the reduction of token costs.
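For a standard attention layout, KV Cache size can be estimated with a short formula: one key tensor and one value tensor per layer, each sized by the number of KV heads, the head dimension, and the number of cached tokens. The sketch below uses a hypothetical model configuration (64 layers, 8 KV heads, head dimension 128, 2 bytes per element); these values are assumptions for illustration and do not describe any model named in the article.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int, head_dim: int,
                   batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Estimated KV Cache size in bytes for a standard attention layout."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model configuration (assumed values).
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

for context in (4_096, 32_768, 131_072):
    size = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"context={context:>7} tokens  KV Cache = {size / 2**30:.1f} GiB per sequence")
```

The size scales in direct proportion to context length and batch size, so a single long-context sequence in this configuration already needs tens of GiB of fast memory.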

**Third, the costs of network communication and horizontal expansion are becoming increasingly high.** When the model scale exceeds the capacity of a single machine, inter-node communication becomes a new bottleneck. The latency of traditional RoCE or InfiniBand networks is much higher than on-chip bus latency, and communication overhead can account for more than 30% of total inference time, forcing companies to pile on more resources to maintain response speed and thereby increasing total cost of ownership (TCO).

In this regard, Liu Jun pointed out that the core of reducing token costs is not about "making a machine more comprehensive," but about reconstructing the system around the goal: breaking the inference process into finer-grained stages and supporting P/D (prefill/decode) separation, A/F (attention/FFN) separation, KV parallelism, fine-grained expert splitting, and other computing strategies, so that different computing modules can be deployed concurrently on different cards as needed and every card is fully loaded, thereby lowering the "cost per card-hour" and increasing the "output per card-hour."
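The relationship between card-hour cost, card-hour output, and token cost can be written down directly, along with an Amdahl-style estimate of what shrinking communication time buys. This is a minimal sketch: the card-hour cost, the baseline throughput, and the 4x reduction in communication time are hypothetical, and only the 30% communication share is taken from the text.

```python
def cost_per_million_tokens(card_hour_cost: float, tokens_per_second: float) -> float:
    """Cost of one million tokens on one card, in the currency of card_hour_cost."""
    tokens_per_card_hour = tokens_per_second * 3600
    return card_hour_cost / tokens_per_card_hour * 1_000_000


def speedup_from_comm_reduction(comm_fraction: float, reduction_factor: float) -> float:
    """Amdahl-style speedup when only the communication share of time shrinks."""
    remaining_time = (1 - comm_fraction) + comm_fraction / reduction_factor
    return 1 / remaining_time


# Hypothetical card: 5 yuan per card-hour, 500 tokens/s before optimization.
baseline = cost_per_million_tokens(card_hour_cost=5.0, tokens_per_second=500)
# Communication takes 30% of inference time; assume it is cut by a factor of 4.
speedup = speedup_from_comm_reduction(comm_fraction=0.30, reduction_factor=4)
improved = cost_per_million_tokens(card_hour_cost=5.0, tokens_per_second=500 * speedup)

print(f"baseline: {baseline:.2f} yuan per million tokens")
print(f"speedup:  {speedup:.2f}x")
print(f"improved: {improved:.2f} yuan per million tokens")
```

Even if communication time were eliminated entirely, the speedup in this model is capped at 1 / 0.7, or about 1.43x, which is why communication fixes need to be combined with higher per-card utilization to reach order-of-magnitude cost reductions.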

## Built on a brand new ultra-scalable architecture, the Yuan Nao HC1000 brings inference cost down to 1 yuan per million tokens for the first time

Currently, the token costs of mainstream large models remain high. For example, the price of one million output tokens for models like Claude and Grok generally ranges from $10 to $15, while domestic large models, although relatively cheaper, are mostly above 10 yuan. At astronomical usage levels, such high token costs pose a severe ROI challenge for large-scale commercial applications. Breaking the cost deadlock requires a fundamental reconstruction at the computing architecture level to significantly raise the output efficiency of each unit of computing power.

 https://static001.geekbang.org/infoq/19/198914798231d67ba6a74f066b7fb704.png
Price of one million tokens for mainstream LLMs

(Note: Data as of September 26, the day of the AICC2025 conference. On September 29, DeepSeek announced that the price for V3.2 Exp was reduced to 3 yuan per million tokens.)

To this end, Inspur Information launched the Yuan Nao HC1000 ultra-scalable AI server. The product is based on a newly designed, fully symmetric DirectCom ultra-fast architecture and uses a lossless scale-out design that can efficiently aggregate massive numbers of domestic AI chips. It supports extremely high inference throughput and brings inference cost down to 1 yuan per million tokens for the first time, providing a computing system that helps intelligent agents overcome the token cost bottleneck.

 https://static001.geekbang.org/infoq/55/555aec0d257a11bca539feb24d982749.png
Yuan Nao HC1000 ultra-scalable AI server

Liu Jun stated: "We see that the original AI computing was built with a focus on being large and comprehensive, with all sorts of things included. However, when we focus on the core goal of reducing token costs, we rethink the system architecture design, identify system bottlenecks, and reconstruct a system with a minimalist design."

The Yuan Nao HC1000 introduces the DirectCom ultra-fast architecture. Each computing module is configured with 16 AIPUs and uses a direct communication design that avoids the protocol conversion and bandwidth contention of traditional architectures, achieving ultra-low latency. Computing and communication are balanced at a 1:1 ratio, enabling global non-blocking communication. The fully symmetric system topology supports flexible P/D separation and A/F separation schemes, so computing instances can be configured on demand to maximize resource utilization.

 https://static001.geekbang.org/infoq/d8/d85af4f80fdabc71116f11fd0119e4be.png
Fully symmetric DirectCom ultra-fast architecture

At the same time, the Yuan Nao HC1000 supports ultra-large-scale lossless expansion, with the DirectCom architecture keeping computing and communication in balance. Through deep coordination between compute and network and global lossless technology, it achieves a 1.75 times improvement in inference performance. By subdividing the computing process of large models and decoupling model structures, it allocates computing loads flexibly on demand, improving single-card MFU by up to 5.7 times.

 https://static001.geekbang.org/infoq/a1/a16275931151639c59d7858e36766f3e.png
Ultra-large-scale lossless expansion

In addition, the Yuan Nao HC1000 provides packet-level dynamic load balancing through adaptive routing and intelligent congestion control algorithms. This enables intelligent scheduling of KV Cache transmission and All-to-All communication traffic, reducing the impact of KV Cache transmission on Prefill and Decode computing instances by a factor of 5 to 10.

Liu Jun emphasized that the current "1 yuan per million tokens" is still far from enough. Given the exponential growth of token consumption ahead, achieving a continuous, order-of-magnitude reduction in the cost per token requires fundamental innovation in computing architecture. It also requires product and technology innovation across the entire AI industry: shifting from the current scale-oriented approach to an efficiency-oriented one, fundamentally rethinking and redesigning AI computing systems, developing AI-specific computing architectures, exploring dedicated large model chips, and promoting dedicated architectures that implement algorithms in hardware. Deep co-optimization of software and hardware will be the direction of future development.
