---
title: "Fish Audio releases the S2-Pro model, setting a new standard for high-fidelity real-time voice synthesis"
type: "News"
locale: "en"
url: "https://longbridge.com/en/news/278669226.md"
description: "Fish Audio launched its flagship text-to-speech model S2-Pro, which adopts a dual autoregressive architecture to achieve 44.1kHz high-fidelity audio output. The model supports zero-shot voice cloning, requiring only 10 to 30 seconds of reference audio to replicate a speaker's identity and emotional state, and allows emotion control through natural-language tags. S2-Pro achieves roughly 100 milliseconds of initial audio latency on NVIDIA H200 hardware and is available in the open-source ecosystem, with training data covering over 300,000 hours of multilingual speech, setting a new benchmark for real-time interactive AI applications."
datetime: "2026-03-11T06:46:02.000Z"
locales:
  - [zh-CN](https://longbridge.com/zh-CN/news/278669226.md)
  - [en](https://longbridge.com/en/news/278669226.md)
  - [zh-HK](https://longbridge.com/zh-HK/news/278669226.md)
---

# Fish Audio releases the S2-Pro model, setting a new standard for high-fidelity real-time voice synthesis

PingWest reported on March 11 that Fish Audio officially launched its flagship text-to-speech (TTS) model S2-Pro, marking the evolution of speech synthesis toward integrated large audio models (LAM). The model employs a dual autoregressive (Dual-AR) architecture that splits generation between a 4-billion-parameter "slow AR" module, responsible for language structure and prosody, and a 400-million-parameter "fast AR" module, which handles timbre, breath, and other high-frequency detail, producing 44.1kHz high-fidelity audio output.
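The two-stage split can be pictured as a small pipeline: a large slow stage plans coarse structure, and a small fast stage expands each coarse token into fine acoustic codes. Everything below is an illustrative sketch; the function names and token schemes are assumptions, not Fish Audio's implementation.

```python
# Toy sketch of a dual-autoregressive (Dual-AR) TTS pipeline: a large "slow"
# stage plans structure and prosody, a small "fast" stage fills in fine
# acoustic detail. All token schemes here are made up for illustration.

def slow_ar(text: str) -> list[int]:
    """Stand-in for the 4B-parameter slow stage: text -> coarse tokens."""
    return [ord(c) % 256 for c in text]  # one coarse token per character

def fast_ar(coarse: list[int], codes_per_token: int = 4) -> list[int]:
    """Stand-in for the 0.4B-parameter fast stage: coarse -> acoustic codes."""
    codes: list[int] = []
    for t in coarse:
        codes.extend((t * 7 + k) % 1024 for k in range(codes_per_token))
    return codes

def synthesize(text: str) -> list[int]:
    # Each pass through the slow stage fans out into several fast-stage codes.
    return fast_ar(slow_ar(text))

codes = synthesize("hi")  # 2 coarse tokens expand into 8 acoustic codes
```

The point of the split is that the expensive slow model runs once per coarse token, while the cheap fast model handles the much higher-rate acoustic stream.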

S2-Pro supports zero-shot voice cloning, requiring only 10 to 30 seconds of reference audio to replicate a speaker's identity and emotional state, and offers fine-grained emotional control through inline natural-language tags such as \[whisper\] and \[laugh\]. The model is built on residual vector quantization (RVQ), which compresses audio information efficiently across multi-layer codebooks while retaining non-verbal detail such as sighs and pauses.
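The multi-layer codebook idea behind RVQ can be shown with a toy implementation: each layer quantizes whatever residual the previous layer left behind, so later layers capture progressively finer detail. The codebook sizes, dimensions, and contents below are arbitrary assumptions, not the model's actual configuration.

```python
import numpy as np

# Toy residual vector quantization (RVQ): each codebook layer quantizes the
# residual left by the previous layer. Codebooks are random stand-ins;
# entry 0 of each is a zero vector so a layer can also "pass through".
rng = np.random.default_rng(0)
codebooks = [np.vstack([np.zeros(8), rng.normal(size=(15, 8))])
             for _ in range(3)]  # 3 layers, 16 entries each, dim 8

def rvq_encode(x, codebooks):
    residual = x.copy()
    indices = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]  # next layer sees only what is left
    return indices

def rvq_decode(indices, codebooks):
    # Reconstruction is the sum of the chosen entry from every layer.
    return sum(cb[i] for cb, i in zip(codebooks, indices))

x = rng.normal(size=8)
idxs = rvq_encode(x, codebooks)      # one small integer index per layer
x_hat = rvq_decode(idxs, codebooks)  # approximate reconstruction of x
```

Transmitting one small index per layer instead of the raw vector is what makes the compression efficient, while the stacked residual layers preserve fine detail.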

In terms of performance, S2-Pro achieves roughly 100 milliseconds of time-to-first-audio (TTFA) latency on NVIDIA H200 hardware. It integrates the SGLang framework and its RadixAttention mechanism, caching key-value states to significantly reduce the prefill overhead of repeated voice generation, and supports multi-character dialogue in a single inference pass.

The model has been made available in the open-source ecosystem, with training data covering over 300,000 hours of multilingual speech, setting a new benchmark for real-time interactive AI applications.

### Related Stocks

- [AUDC.US](https://longbridge.com/en/quote/AUDC.US.md)
- [NVDA.US](https://longbridge.com/en/quote/NVDA.US.md)

## Related News & Research

- [Insider Selling: AudioCodes (NASDAQ:AUDC) CFO Sells 1,875 Shares of Stock](https://longbridge.com/en/news/285541810.md)
- [AudioCodes Q1 2026 Results Highlight Shift to AI-Driven, Recurring Revenue Model](https://longbridge.com/en/news/285224981.md)
- [Cizzle Brands Corporation Announces the Launch of CWENCH Hydration™ at Save-On-Foods Across Western Canada | CZZLF Stock News](https://longbridge.com/en/news/286559823.md)
- [ZAWYA: CNTXT AI introduces Munsit Edge](https://longbridge.com/en/news/286226922.md)
- [VQS.V: Growth strategy remains in place, off a smaller, higher margin base](https://longbridge.com/en/news/286451574.md)