---
title: "DeepSeek V4 Shockwave: Million-Token Context Becomes Standard, Battle for Agent Infrastructure Begins"
type: "News"
locale: "en"
url: "https://longbridge.com/en/news/283948041.md"
description: "Moving forward"
datetime: "2026-04-24T06:59:58.000Z"
locales:
  - [zh-CN](https://longbridge.com/zh-CN/news/283948041.md)
  - [en](https://longbridge.com/en/news/283948041.md)
  - [zh-HK](https://longbridge.com/zh-HK/news/283948041.md)
---

# DeepSeek V4 Shockwave: Million-Token Context Becomes Standard, Battle for Agent Infrastructure Begins

Author | Lin Ke

On April 24, the long-anticipated preview of DeepSeek's V4 model was finally released, with weights open-sourced simultaneously. The release comprises two versions: the flagship V4-Pro, with 1.6 trillion total parameters and 49 billion activated parameters, and the cost-effective V4-Flash, with 284 billion total parameters and 13 billion activated parameters. Both support a 1-million-token context window and are fully open-sourced under the MIT license.

Just one day prior, OpenAI had launched GPT-5.5, priced at $30 per million output tokens. Today, DeepSeek V4-Flash is priced at 2 RMB per million output tokens, equivalent to less than $0.30. **In just two days, the pricing logic of closed-source versus open-source models was laid out side by side before the market.**

![Image](https://imageproxy.pbkrs.com/https://wpimg-wscn.awtmt.com/2b277e59-51f4-4f71-a670-62d1d6c15646.jpeg?x-oss-process=image/auto-orient,1/interlace,1/resize,w_1440,h_1440/quality,q_95/format,jpg)

## **I. Timing: After Three Delays**

That this day would arrive for DeepSeek was not entirely unexpected, but it came later than anyone anticipated. From late last year through February, March, and early April, the release window for DeepSeek V4 was pushed back three times. During this period, updates from the industry's major models entered their most intensive phase.

It must be acknowledged that by late April 2026, a million-token context can no longer be considered an absolute lead; Gemini, Qwen, and others have already reached this scale. With the arrival of **DeepSeek V4, the question is no longer "can it be done," but rather "once done, can the costs be sustained?"**

V4 provides its answer through an entirely new hybrid attention architecture. It introduces a compression mechanism at the token level, combined with self-developed DSA sparse attention. This allows the model to avoid full computation over all tokens when processing ultra-long texts, instead distinguishing by importance: reading strongly associated tokens in detail while compressing or skipping weakly associated ones. This mechanism changed how the model processes long sequences starting from the pre-training stage.

According to the technical report, V4 also introduces manifold-constrained hyperconnections (mHC) in place of traditional residual connections, improving the stability of signal propagation in deep networks, and uses the Muon optimizer to accelerate training convergence. The entire model was pre-trained on over 32 trillion tokens.
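The report is summarized here only at a high level, so the following is a minimal sketch of the general idea behind importance-based sparse attention (attend fully to strongly associated tokens, pool the rest into compressed block summaries), not a reproduction of DSA itself. The function name, block size, and top-k value are illustrative assumptions.

```python
# Minimal sketch of importance-based sparse attention, in the spirit of
# "read strongly associated tokens in detail, compress or skip the rest".
# Illustration only: NOT DeepSeek's actual DSA kernel.
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, block_size=64, top_k=8):
    """q: [heads, q_len, d]; k, v: [heads, kv_len, d], kv_len % block_size == 0,
    and top_k <= kv_len // block_size."""
    h, q_len, d = q.shape
    kv_len = k.shape[1]
    n_blocks = kv_len // block_size

    # 1. Compress each key block into one summary vector (mean pooling).
    k_blocks = k.view(h, n_blocks, block_size, d).mean(dim=2)   # [h, n_blocks, d]

    # 2. Score blocks per query and keep only the top-k most relevant ones.
    block_scores = q @ k_blocks.transpose(-1, -2) / d**0.5      # [h, q_len, n_blocks]
    top_idx = block_scores.topk(top_k, dim=-1).indices          # [h, q_len, top_k]

    # 3. Expand the block choice into a token-level visibility mask.
    mask = torch.zeros(h, q_len, n_blocks, dtype=torch.bool, device=q.device)
    mask.scatter_(-1, top_idx, True)
    token_mask = mask.repeat_interleave(block_size, dim=-1)     # [h, q_len, kv_len]

    # 4. Attend only within the selected tokens. (A real kernel would gather
    # just those blocks to actually save compute; this dense masked version
    # only keeps the sketch short.)
    scores = q @ k.transpose(-1, -2) / d**0.5
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

In a production kernel, selecting blocks before computing scores is what makes per-token cost scale with `top_k * block_size` rather than with the full sequence length, which is the lever behind the efficiency figures below.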
The actual results can be summarized in two numbers: under a million-token context setting, V4-Pro consumes only 27% of the compute per token compared to V3.2, and KV cache usage drops to just 10%. The official announcement put it even more plainly: "From now on, 1M context will become the standard configuration for all official DeepSeek services."

This means long-context capability has officially shifted from a "premium add-on" to a "default feature," representing a recalibration of cost expectations across the entire industry.

## **II. Matrix: Two Models + Three Modes**

In this release, both the flagship V4-Pro and the economy V4-Flash support three inference modes: non-thinking mode (fast response), thinking mode-high (explicit reasoning chain), and thinking mode-extreme (pushing the model to its capability limits). The official recommendation for complex Agent scenarios is the extreme mode.

DeepSeek's positioning of V4-Pro offers a direct benchmark: internal employees have already adopted it as their daily Agentic Coding tool. They describe the experience as superior to Claude Sonnet 4.5, with delivery quality approaching Opus 4.6 in non-thinking mode, though still lagging behind Opus 4.6 in thinking mode. On reasoning performance, it surpasses all currently publicly evaluated open-source models on mathematics, STEM, and competitive-coding benchmarks, matching world-class closed-source models. Its world knowledge significantly leads other open-source models, trailing only slightly behind Gemini-Pro-3.1.

V4-Flash's reasoning capabilities are close to the Pro version, though its store of world knowledge is somewhat thinner; it performs comparably on simple Agent tasks but shows gaps on high-difficulty ones.

**One notable point in this self-assessment is that DeepSeek explicitly highlighted the gap with Opus 4.6 in thinking mode.** By the conventions of domestic large-model announcements, such restraint is itself an expression of technical confidence.

## **III. Trigger: Token Price Disparity**

With the public release of the preview version, V4 API pricing went live simultaneously. Per million tokens, V4-Flash costs 1 RMB for input (0.2 RMB on cache hit) and 2 RMB for output; V4-Pro costs 12 RMB for input (1 RMB on cache hit) and 24 RMB for output. The official note states that these are preview prices, and that the Pro version's prices will drop significantly after compute capacity expands in the second half of the year.

These figures only make sense within a coordinate system. At 1 RMB per million input tokens, the Flash version lets almost any developer call a flagship-class open-source MoE model without hesitation. By comparison, GPT-5.5 launched the previous day with an output price of $30 per million tokens, equivalent to over 200 RMB, more than 100 times V4-Flash's 2 RMB output price. Even against V4-Pro's 24 RMB output price, the gap is still nearly an order of magnitude.

While the Pro version's current price is high, DeepSeek has given clear guidance on future price cuts. The constraint behind this is not pricing strategy but compute supply: the Pro version's high-performance inference requires more chip resources, and current service throughput is very limited. This indirectly confirms DeepSeek's deep investment in adapting to homegrown compute.

The discount for cache hits is also worth noting. For Flash, the cache-hit price is one-fifth of the miss price; for Pro, one-twelfth. DeepSeek is thus using pricing leverage to encourage a specific usage pattern: placing fixed content such as system prompts, tool definitions, and document templates at the beginning of requests so that the caching mechanism activates automatically. For Agent-type applications, this is exactly the most typical calling pattern.
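DeepSeek's production API has long been OpenAI-compatible, so the cache-friendly pattern can be sketched as follows. The model identifier `deepseek-v4-flash` is a placeholder assumption; the prices in the comments are the preview prices quoted above.

```python
# Sketch of a cache-friendly Agent call pattern against an assumed
# OpenAI-compatible endpoint; "deepseek-v4-flash" is a hypothetical model id.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

# Fixed prefix: byte-identical on every request, so it can be billed at the
# cache-hit rate (0.2 RMB per million input tokens instead of 1 RMB).
SYSTEM_PROMPT = "You are a coding agent. Tools: ..."  # stable across calls

def agent_turn(history, user_msg):
    messages = (
        [{"role": "system", "content": SYSTEM_PROMPT}]  # cacheable prefix first
        + history                                        # grows only at the tail
        + [{"role": "user", "content": user_msg}]        # only this part is new
    )
    resp = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=messages,
    )
    return resp.choices[0].message.content
```

Because the conversation history only ever grows at the tail, every turn re-sends a byte-identical prefix, which is precisely the behavior the cache-hit discount rewards.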
**Using Flash's bargain prices to scale volume, leveraging Pro's advanced capabilities for top-tier scenarios, and employing the caching mechanism to cut Agent developers' marginal costs: every move cuts directly into the pain points of the application layer.**

## **IV. Direction: Agent Infrastructure**

If only one key label could be extracted from the V4 release, "Agent" might matter even more than the million-token context. The official statement explicitly says: "V4 has undergone specialized adaptation and optimization for mainstream Agent products such as Claude Code, OpenClaw, OpenCode, and CodeBuddy, achieving the best level among open-source models in Agentic Coding evaluations." The adaptation list spans both Anthropic products and domestic developer tools.

**The signal is clear: DeepSeek does not intend to build its own application ecosystem; it aims to become the infrastructure supplier of the Agent era.**

Placed within the current industry landscape, this choice is a conscious trade-off. Over the past four months, Anthropic's annual revenue jumped from $9 billion to $30 billion, with nearly all of the growth coming from Claude Code; Cursor, a code editor, has reached a valuation of $60 billion. The money is at the application layer, and DeepSeek is choosing not to touch it. **Its positioning is not to become the next Anthropic, but the infrastructure of the Agent era.** The combination of long context, low-priced APIs, and Agent adaptation essentially turns DeepSeek into a power station that lets every device run more cheaply.

For Agent developers who struggle daily with token consumption, V4 opens up a concrete scenario: feeding an entire code repository, complete requirement documents, and hundreds of rounds of dialogue history into a single call, eliminating the splitting, retrieval, and summarization workarounds. Managing context has always been the biggest headache in building Agents: each additional round resends the full history, so cumulative token consumption grows roughly quadratically with conversation length, hurting both cost and stability. If V4 delivers on its promises under real-world loads, the cost structure of this pain point will be rewritten.
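To make the "no splitting, retrieval, or summarization" claim concrete, here is a minimal sketch of packing a repository into a single call under a 1M-token budget. The 4-characters-per-token ratio and the reserved output headroom are rough assumptions, not DeepSeek's tokenizer or limits.

```python
# Sketch: pack an entire repository into one 1M-token context window
# instead of building a retrieval pipeline. The 4-chars-per-token ratio
# is a crude heuristic, not DeepSeek's tokenizer.
from pathlib import Path

CONTEXT_BUDGET = 1_000_000      # V4's advertised window (tokens)
RESERVED_FOR_OUTPUT = 64_000    # headroom for the reply (assumed)

def pack_repo(root: str, exts=(".py", ".md", ".toml")) -> str:
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in exts:
            continue
        text = path.read_text(errors="ignore")
        tokens = len(text) // 4              # rough estimate of token count
        if used + tokens > CONTEXT_BUDGET - RESERVED_FOR_OUTPUT:
            break                            # stop before overflowing the window
        parts.append(f"### FILE: {path}\n{text}")
        used += tokens
    return "\n\n".join(parts)

# The packed string goes into a single request: no chunking, retrieval,
# or summarization layer in between.
```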
## **V. Ecosystem: Model and Compute Race**

During V4's delays, the battlefield for domestic open-source large models never quieted. Around Chinese New Year this year came a dense burst of releases: Alibaba's Qwen3.5 has 397 billion total parameters with only 17 billion activated, offering a million-token API price as low as 0.8 RMB, one-eighteenth of Gemini-3-Pro; Zhipu's GLM-5 reached 96.2% on HumanEval code generation, the strongest open-source level. The acceleration continued into April: Kimi K2.6 scored 80.2% on SWE-Bench Verified, nearly catching Claude Opus 4.6; Zhipu's GLM-5.1 surpassed GPT-5.4 and Claude Opus 4.6 with 58.4% on SWE-Bench Pro; Qwen 3.6 Plus also joined the million-token context club. Qwen, Kimi, GLM, MiniMax, MiMo: these domestic models are appearing with visibly increasing frequency in international developer communities.

Beyond models, matching compute is being deployed in parallel. On the day of the V4 release, Huawei confirmed full compatibility of its Ascend series (A2, A3, and the latest Ascend 950) with both V4-Flash and V4-Pro. The wording, "tight synergy between chip and model technology," indicates that DeepSeek and Ascend's adaptation work had been progressing in step since the model's R&D phase.

Huawei provided specific performance data: on the Ascend 950 super-node, V4-Pro achieved roughly 20 ms single-token decoding latency on 8K-input workloads, with single-card throughput of 4,700 TPS; V4-Flash reached roughly 10 ms latency with single-card throughput of 1,600 TPS. On the Ascend A3 super-node, V4-Flash exceeded 2,000 TPS per card in a large-scale 64-card deployment.

Behind these numbers lie three generational upgrades in the Ascend 950's underlying architecture: native support for low-precision formats such as FP8/MXFP4 (cutting memory usage by over 50% and doubling compute), hardware-level optimization for MoE sparse access patterns, and a new design in which the Vector and Cube units share on-chip memory.

More noteworthy are the moves at the engineering-ecosystem level. Huawei simultaneously open-sourced the PyPTO programming paradigm, shortening the development cycle for V4's complex new-architecture operators, attention compression and mHC among them, from weeks to days, and sparing developers from manually handling hardware-level synchronization and data movement. Cambricon likewise announced on release day that it had completed Day-0 adaptation of both V4-Flash and V4-Pro based on the vLLM framework, with the code open-sourced to GitHub.
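The adapted code itself is not shown in this article, but since the Cambricon adaptation is based on vLLM, the user-facing call path would presumably look like standard vLLM offline inference, sketched below. The checkpoint id and parallelism degree are placeholder assumptions, and any Cambricon-specific build or runtime flags are omitted.

```python
# Sketch: offline inference through vLLM's standard API. The model id
# "deepseek-ai/DeepSeek-V4-Flash" is a placeholder assumption; any
# backend-specific (e.g. Cambricon) build flags are omitted.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical checkpoint id
    tensor_parallel_size=8,                 # shard the MoE across 8 cards
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Summarize the repository below: ..."], params)
print(outputs[0].outputs[0].text)
```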
Two domestic chip manufacturers delivering complete inference deployment solutions on the first day of a model's release shows that the adaptation work was not rushed; it had been deeply integrated with model R&D for a considerable period.

DeepSeek paid a considerable engineering cost for this migration of the underlying compute. According to earlier reports, the team rewrote a large amount of core code, completing a full tech-stack migration from the CUDA ecosystem to the Ascend CANN framework, which was one reason for V4's repeated delays. But when a trillion-parameter open-source flagship can run on the full line of domestic compute on its release day, with adaptation code open-sourced and inference performance stated in concrete throughput and latency figures rather than "support coming soon," the significance of the event transcends the evaluation of any single model.

Models and compute may compete among themselves, but viewed from a broader perspective they are proving the same thing: **China's systemic AI R&D capability is not one or two isolated cases, but an ecosystem capable of continuous innovation.**

In January 2025, the release of DeepSeek R1 triggered a single-day evaporation of over $1 trillion in US stock market value, dubbed AI's "Sputnik moment." Today's V4 release lacks that dramatic shock, but China's AI R&D has moved from "occasionally shocking" to "consistently present."

At the end of its announcement, DeepSeek quoted a line from Xunzi:

> **Do not be tempted by praise, do not fear slander; follow the right path and stand upright in self-correction.**

For a company that has delayed its release three times, lost core talent, and recently been the subject of funding rumors, the line reads with a touch of stubbornness. But in 2026, with the whole cohort of domestic open-source models standing tall, the sentence belongs not only to DeepSeek but to every Chinese AI innovator taking firm steps forward.

### Related Stocks

- [SOXL.US](https://longbridge.com/en/quote/SOXL.US.md)
- [PSI.US](https://longbridge.com/en/quote/PSI.US.md)
- [XSD.US](https://longbridge.com/en/quote/XSD.US.md)
- [SMH.US](https://longbridge.com/en/quote/SMH.US.md)
- [SOXX.US](https://longbridge.com/en/quote/SOXX.US.md)
- [DPSK.NA](https://longbridge.com/en/quote/DPSK.NA.md)
- [OpenAI.NA](https://longbridge.com/en/quote/OpenAI.NA.md)
- [BABA.US](https://longbridge.com/en/quote/BABA.US.md)
- [09988.HK](https://longbridge.com/en/quote/09988.HK.md)
- [00100.HK](https://longbridge.com/en/quote/00100.HK.md)
- [HUAWEI.NA](https://longbridge.com/en/quote/HUAWEI.NA.md)
- [688256.CN](https://longbridge.com/en/quote/688256.CN.md)
- [89988.HK](https://longbridge.com/en/quote/89988.HK.md)
- [HBBD.SG](https://longbridge.com/en/quote/HBBD.SG.md)