Fish Audio releases S2-Pro model, pushing a new standard for high-fidelity real-time voice synthesis

PingWest

2026.03.11 06:46

Fish Audio has launched S2-Pro, its flagship text-to-speech model, which uses a dual autoregressive architecture to produce high-fidelity audio at 44.1 kHz. The model supports zero-shot voice cloning: 10 to 30 seconds of reference audio is enough to replicate a speaker's identity and emotional state, and emotion can be controlled through natural-language labels. On NVIDIA H200 hardware, S2-Pro reaches an initial audio latency (time to first audio) of approximately 100 milliseconds. The model is available in the open-source ecosystem and was trained on more than 300,000 hours of multilingual speech, setting a new benchmark for real-time interactive AI applications.
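
To put the reported figures in scale, the short sketch below converts them into sample counts at the stated 44.1 kHz output rate. It is only illustrative arithmetic: the helper name `samples_for` is hypothetical, a single (mono) channel is assumed, and nothing here reflects S2-Pro's actual code or API.

```python
SAMPLE_RATE_HZ = 44_100  # output sample rate reported in the article


def samples_for(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples in one channel of audio lasting duration_s seconds."""
    return round(duration_s * sample_rate_hz)


# Audio that fits inside the ~100 ms initial latency window
latency_samples = samples_for(0.100)

# Reference clip length range cited for zero-shot voice cloning (10 to 30 s)
ref_min_samples = samples_for(10)
ref_max_samples = samples_for(30)

print(f"100 ms of audio at 44.1 kHz: {latency_samples} samples")
print(f"10 s reference clip: {ref_min_samples} samples")
print(f"30 s reference clip: {ref_max_samples} samples")
```

In other words, the reported latency corresponds to the time it takes to play 4,410 samples, and a reference clip for cloning spans between 441,000 and 1,323,000 samples per channel.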