--- title: "Qwen3.5-Omni In-Depth Experience: This is What \"AI Productivity\" Should Look Like!" type: "News" locale: "zh-HK" url: "https://longbridge.com/zh-HK/news/281122026.md" description: "It transforms audio and video from content that is \"watched and then forgotten\" into \"digital assets\" that can be searched, reused, and directly put to work" datetime: "2026-03-31T04:25:57.000Z" locales: - [zh-CN](https://longbridge.com/zh-CN/news/281122026.md) - [en](https://longbridge.com/en/news/281122026.md) - [zh-HK](https://longbridge.com/zh-HK/news/281122026.md) --- > 支持的語言: [简体中文](https://longbridge.com/zh-CN/news/281122026.md) | [English](https://longbridge.com/en/news/281122026.md) # Qwen3.5-Omni In-Depth Experience: This is What "AI Productivity" Should Look Like! You must have had this experience: after a two-hour meeting, the recording file lies quietly in your cloud drive, but no one wants to rewatch it—because the cost of rewatching is almost equivalent to holding the meeting all over again. You come across a viral sales video and vaguely feel that its conversion logic is worth learning, but you have neither the time to deconstruct it frame by frame, nor do you know how to turn it into your own script even if you did. Then there are English podcasts, live press conferences, and customer service recordings mixed with dialects that need reviewing—these types of audio and video content are produced in massive quantities every day, but for the vast majority of people, once they have been "watched" or "heard," there is no follow-up. In our daily lives, a vast amount of extremely valuable audio and video content cannot be disassembled, searched, or summarized for reuse. **And Alibaba's Qwen has just released Qwen3.5-Omni, which makes us feel that this problem is starting to have a solution.** It is the latest generation of Qwen's omni-modal large model, adopting a Mixture-of-Experts (MoE) architecture. It has undergone native multimodal pre-training on massive text, visual, and over 100 million hours of audio data. It has achieved SOTA performance in 215 third-party performance tests, with several core indicators surpassing Gemini-3.1 Pro. ![Image](https://imageproxy.pbkrs.com/https://wpimg-wscn.awtmt.com/116485c5-d120-4a59-af2d-d4e16c3a3978.png?x-oss-process=image/auto-orient,1/interlace,1/resize,w_1440,h_1440/quality,q_95/format,jpg) More noteworthy than the benchmark scores is what we actually experienced in our testing—after several rounds of extremely demanding stress tests, this omni-modal model completely shocked me: - We asked it to deconstruct a "Dune" trailer—it not only provided a structured analysis with timestamps but also inferred implicit relationships between characters and generated a replication storyboard script with pacing design and color grading suggestions; - We gave it a viral TikTok sales video—it broke down the complete conversion attribution and output a 5-step script template that could be directly migrated to other industries; - We described our requirements to a poorly drawn hand-sketched diagram—it directly generated runnable React code, and as we continued to dictate modifications, it iterated through rounds while consistently maintaining context. 
This means you can throw a two-hour meeting recording at it and get back a structured summary with timestamps and a to-do list; drop in a competitor's viral video and directly obtain a transferable script template; or use it to quality-check customer service recordings and get back emotion trajectories and script evaluations.

**Its significance goes far beyond another parameter upgrade in multimodal capabilities. It let us see firsthand how audio and video content that previously could only be "watched once and forgotten" is being forcibly deconstructed into "digital assets" that can be put directly to work.**

And if you connect your "lobster" agent to Qwen3.5-Omni, giving it "eyes" and "ears," you get a digital employee that can truly understand voice commands, comprehend video content, interpret audio information, and even operate a computer.

**This, perhaps, is the true productivity revolution belonging to omni-modal large models that we have long awaited.**

Next, let's first look at the testing details, then discuss what this model is changing and what strategy Alibaba is pursuing with it.

## **Deconstructing Movies, Reviewing Sales, and Dictating Code: Comprehensive Evolution of Omni-modal Capabilities**

**(1) Dune: Beyond "Understanding the Story"**

We chose a subtitle-free version of the "Dune" trailer as our first test material, a deliberate "stress test" of Qwen3.5-Omni's multimodal capabilities. Trailers are inherently the most unfriendly material in video understanding: dense shot transitions, multi-threaded narratives, abundant metaphors and visual cues, and extremely high audiovisual density.

For Qwen3.5-Omni, the first round of structured information extraction was almost effortless: plot timeline, key shots, on-screen text, speakers and dialogue, character faction relationships, and emotional arcs were all extracted precisely, with timestamps.

![Image](https://imageproxy.pbkrs.com/https://wpimg-wscn.awtmt.com/83d7635b-090f-468c-94ae-84e0ae3e5ade.png?x-oss-process=image/auto-orient,1/interlace,1/resize,w_1440,h_1440/quality,q_95/format,jpg)

In the second round, we pointed to a line at the 24-second mark and asked it to identify the corresponding frame, speaker, and emotion. It accurately located "She would need to be strong, like her mother," correctly identified it as Paul's voice-over rather than on-site dialogue, matched it to a backlit close-up of Chani's profile in the desert, and its emotional reading (tenderness, respect, expectation) fit the visuals perfectly.

![Image](https://imageproxy.pbkrs.com/https://wpimg-wscn.awtmt.com/343cc783-090f-468c-94ae-84e0ae3e5ade.png?x-oss-process=image/auto-orient,1/interlace,1/resize,w_1440,h_1440/quality,q_95/format,jpg)
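This kind of timestamp-pinned questioning is also straightforward to automate through an API rather than the web UI. Below is a minimal sketch in Python; the endpoint, the `qwen3.5-omni` model id, and the `video_url` content part are assumptions borrowed from how Qwen's existing omni models are called in OpenAI-compatible mode, not confirmed details of this release.

```python
# A minimal sketch of timestamp-pinned video Q&A, assuming Qwen's
# OpenAI-compatible endpoint. The "qwen3.5-omni" model id and the
# "video_url" content part are assumptions, not confirmed details.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni",  # hypothetical model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url",
             "video_url": {"url": "https://example.com/dune_trailer.mp4"}},
            {"type": "text", "text": (
                "A line is spoken at the 24-second mark. Quote the exact line, "
                "say whether it is voice-over or on-site dialogue, identify the "
                "speaker, describe the corresponding frame, and name the emotion."
            )},
        ],
    }],
    modalities=["text"],   # request a text-only reply
    stream=True,           # Qwen's current omni models stream their output
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```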
**The real test lay in the third round of "deep inferential questioning."**

We asked it to analyze the "implicit relationships" between characters with evidence from shots and lines, to identify "foreshadowing" shots in the trailer and what they indicate for the future plot, and to generate a 45-second short-video replication storyboard script. It accurately identified the "mirror-image nemesis" relationship between Paul and Feyd-Rautha, the "broken inheritance" tension between Paul and Jessica, and Chani's role as the "humanity anchor," each backed by compositional evidence and dialogue references.

The replication storyboard script it produced was not a vague narrative summary: it featured a three-act pacing design of "slow lyrical → fast editing → epic explosion," and even included color grading directions, sound effect cues, and subtitle treatment suggestions.

**To be honest, by this stage it is no longer just "understanding the video"; it is closer to a director deconstructing a film. It has pushed LLM "video understanding" from the summary level to the level of shot-language interpretation and relationship inference.**

![Image](https://imageproxy.pbkrs.com/https://wpimg-wscn.awtmt.com/188676bf-5e0b-41d9-a781-e921e25f01e7.png?x-oss-process=image/auto-orient,1/interlace,1/resize,w_1440,h_1440/quality,q_95/format,jpg)

**(2) Sales: Deconstructing the Underlying Logic of Conversion from a Viral TikTok Video**

For most people, the more practical question is: is it actually "useful" in the real world and in daily work?

We fed it a viral TikTok sales video for Yiwu merchant recruitment and asked Qwen3.5-Omni to help us deconstruct and replicate it.

![Image](https://imageproxy.pbkrs.com/https://wpimg-wscn.awtmt.com/53a1bbb2-90e6-4e3e-9bd3-0d2617337d13.png?x-oss-process=image/auto-orient,1/interlace,1/resize,w_1440,h_1440/quality,q_95/format,jpg)

The model not only completed a structured breakdown across seven dimensions (hook, selling point sequence, visual proof points, subtitle strategy, emotional rhythm, CTA timing, and target audience), but its attribution analysis was also highly insightful: a three-level physical evidence chain building trust through "seeing is believing," a numerical anchor of "20,000 SKUs + 20-cent average price," and a nanny-style full-solution commitment achieving risk reversal. In other words, it saw through the video: it is not selling products, but certainty.

To check whether it was just mechanically applying marketing jargon, we told it, "My factory sells T-shirts, help me design a script following this pattern," asking it to migrate the logic to a "custom T-shirt factory" scenario. It not only successfully transferred the 5-step conversion template it had just extracted to the T-shirt scenario, but naturally changed the hook to "stretching a T-shirt to show elasticity," replaced the proof of strength with "close-ups of the inkjet printer + rubbing without fading," and even added practical suggestions for steering comment-section viewers into private messages.

**This means the large model is no longer just a content understanding tool; it can already act as a tireless e-commerce analyst and social media operations expert.**

![Image](https://imageproxy.pbkrs.com/https://wpimg-wscn.awtmt.com/c14a3b2b-9108-4282-918d-52599264b6ce.png?x-oss-process=image/auto-orient,1/interlace,1/resize,w_1440,h_1440/quality,q_95/format,jpg)

**(3) Dictating an App: Watching, Speaking, and Modifying Simultaneously**

**The third test can be called an upgraded version of "Vibe Coding": "Audio-Video Vibe Coding."**

We hand-drew a deliberately crude app wireframe, turned on the camera, and held the sketch up to the lens while dictating: "Look at this interface sketch I drew... please use React to help me generate complete code that can be run directly." It recognized the hand-drawn layout and generated the React code. Then we kept dictating modifications ("Change the navigation bar to a sidebar, double the size of the main button and give it rounded corners") while uploading replacement images. Later we tested further iterations such as a dark theme, a progress-bar animation, and press feedback, and it consistently maintained context without losing earlier modifications. After several rounds, the webpage went live successfully.

In terms of overall experience, it handled the most natural form of human interaction: watching, speaking, and modifying simultaneously. It was not the old experience of "AI generates code and you debug it yourself," but more like having an experienced developer sitting right next to you.

![Image](https://imageproxy.pbkrs.com/https://wpimg-wscn.awtmt.com/bc15179d-73d9-46dd-acbe-1b894485b161.png?x-oss-process=image/auto-orient,1/interlace,1/resize,w_1440,h_1440/quality,q_95/format,jpg)
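For readers who want to reproduce this loop programmatically, here is a minimal sketch of how such a multi-turn session stays coherent: the full message history is simply re-sent every round, which is all "maintaining context" requires on the client side. The same endpoint and model-id assumptions as the earlier sketch apply, and the dictated audio is shown as transcribed text for brevity.

```python
# A minimal sketch of the multi-turn "watch, speak, modify" loop, under the
# same hypothetical endpoint/model-id assumptions as above.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

def ask(messages: list) -> str:
    """One model round; output is streamed and concatenated into a string."""
    stream = client.chat.completions.create(
        model="qwen3.5-omni",  # hypothetical model id
        messages=messages,
        modalities=["text"],
        stream=True,
    )
    return "".join(c.choices[0].delta.content or "" for c in stream if c.choices)

# Round 1: the hand-drawn wireframe plus the dictated request
# (the spoken instruction is transcribed as text here for brevity).
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<sketch>"}},
        {"type": "text", "text": "Look at this interface sketch I drew; "
                                 "use React to generate complete, directly runnable code."},
    ],
}]
messages.append({"role": "assistant", "content": ask(messages)})

# Follow-up rounds: each new instruction is appended to the history, so
# every round sees all earlier rounds and no prior modification is lost.
for instruction in [
    "Change the navigation bar to a sidebar; double the main button and round its corners.",
    "Add a dark theme, a progress-bar animation, and press feedback on buttons.",
]:
    messages.append({"role": "user", "content": instruction})
    messages.append({"role": "assistant", "content": ask(messages)})

print(messages[-1]["content"])  # the latest version of the generated code
```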
**(4) Connecting the Dots**

**From the complex narrative of "Dune," to the business analysis of a sales video, to the casual interaction of dictating an app, connecting these test cases reveals one thing:** Qwen3.5-Omni can turn complex, chaotic, continuous inputs into directly usable results.

Two more use cases we tested but won't elaborate on here: generating commentary for game videos (text scripts on the web side, TTS voice on the API side) and a "24-hour AI newsroom" (a 50-minute international press conference audio went through information extraction, bilingual manuscript generation, and voice broadcasting, all with good results). Interested readers can give these a try as well.

![Image](https://imageproxy.pbkrs.com/https://wpimg-wscn.awtmt.com/a4f9b85b-8d23-4444-98dc-5449d54dc4de.png?x-oss-process=image/auto-orient,1/interlace,1/resize,w_1440,h_1440/quality,q_95/format,jpg)

## Underlying Change: From "Understanding Content" to "Deconstructing into Assets"

The reason the first three scenarios work is not just that "capabilities have become stronger," but that the underlying product design has changed qualitatively: it forcibly deconstructs continuous, mixed, hard-to-search audio and video streams into a highly structured intermediate layer.

**(1) How Fine the Deconstruction Is: Not Summaries, but Field-Level Structured Assets**

If you look at the official API documentation, you will find that Qwen3.5-Omni's recommended output format for audio and video is not a vague summary, but a three-layer hard structure (a sketch of what this might look like follows after the list):

- Storyline (a storyline merging audio and visual details by timestamp);
- Visible Text (a list of on-screen text with start/end times and appearance characteristics);
- Speakers and Transcript (a verbatim transcript including speaker identity, accent, tone, and emotion).

In other words, what you get back is no longer "a mass of video," but a structured asset that code can directly call, search, and act on. This is the underlying reason the Dune test could do precise backtracking and the TikTok test could output transferable templates. Supporting this granularity are solid foundational capabilities: the Mixture-of-Experts (MoE) architecture, native multimodal pre-training on over 100 million hours of audio data, model intelligence on par with Qwen3.5-Plus, and SOTA results in 215 third-party tests.
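To make the three layers concrete, here is a sketch of what a parsed result might look like once loaded into code. The field names and values are illustrative, inferred from the article's description and its Dune findings, not copied from the official API schema.

```python
# An illustrative three-layer structure; keys and values are assumptions
# based on the article's description, not the official API schema.
analysis = {
    "storyline": [
        # Audio and visual details merged on one timeline.
        {"start": "00:00:21", "end": "00:00:26",
         "summary": "Backlit close-up of Chani's profile in the desert; "
                    "Paul's voice-over about her mother."},
    ],
    "visible_text": [
        # On-screen text with start/end times and appearance characteristics.
        {"text": "DUNE", "start": "00:01:40", "end": "00:01:44",
         "style": "white serif title card, slow fade-in"},
    ],
    "speakers_and_transcript": [
        # Verbatim transcript with speaker identity, delivery, and emotion.
        {"time": "00:00:24", "speaker": "Paul", "mode": "voice-over",
         "line": "She would need to be strong, like her mother.",
         "emotion": "tenderness, respect, expectation"},
    ],
}

# Once the output has this shape, "searching a video" reduces to filtering
# plain data, e.g. find every segment where Paul speaks in voice-over:
hits = [seg for seg in analysis["speakers_and_transcript"]
        if seg["speaker"] == "Paul" and seg["mode"] == "voice-over"]
print(hits)
```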
**(2) How Long the Deconstruction Is: An Ultra-Large Context Window**

With a 256K context window, it supports over 10 hours of audio, or over 400 seconds of 720P video. The real difficulty with long content has never been "getting through it," but cross-segment correlation and evidence backtracking: throw in 10 hours of meeting recordings and ask, "What did the person mentioned at the 5th minute say at the 30th minute?"; input a sales livestream recording and ask it to pinpoint the timestamps of exaggerated claims, with visual and dialogue evidence attached; run quality inspection on customer service recordings and output emotion trajectories and script scores. These information organization tasks, which used to depend heavily on human labor and were extremely error-prone, are exactly what Qwen3.5-Omni is attempting to take over.

**(3) Interaction: A Dynamic Interface**

On the real-time interaction side, it supports intelligent semantic interruption: it won't stop speaking just because you cough or casually say "uh-huh," since it filters out meaningless background noise. It also natively supports function calling for web search, autonomously judging whether a question requires a real-time search, and developers can see the precise usage information in the callback. At the engineering level, this alleviates the "timeliness and hallucination" issues that give enterprises the biggest headaches when using large models.
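The function-call behavior is easiest to picture as a standard tools round trip. The sketch below uses the generic OpenAI-compatible `tools` format; the `web_search` name and argument schema are illustrative, since the article does not document the official built-in search mechanism, and non-streaming output is shown purely for clarity.

```python
# A sketch of a web-search function-call round trip in the generic
# OpenAI-compatible "tools" format; tool name and schema are illustrative.
import json
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # illustrative name, not the official tool
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-omni",  # hypothetical model id, as above
    messages=[{"role": "user",
               "content": "What did Alibaba announce at today's launch event?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    # The model decided fresh information is needed; the callback exposes
    # exactly which function it called and with what arguments.
    call = msg.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    # The model judged its own knowledge sufficient and answered directly.
    print(msg.content)
```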
Capability improvements at the voice expression layer are equally valuable: it now supports speech recognition for 113 languages and dialects and speech synthesis for 36, with 47 built-in multilingual voices and 8 dialect voices. In our testing, whether it was the customer service character Tina, whose voice is described as "warm milk tea," or the Sichuanese speaker "Qing'er," the sense of character and production quality was very strong. This is not just about "understanding more"; it provides ample ammunition for high-frequency scenarios such as overseas customer service, audit and quality inspection, audiobooks, and podcast dubbing.

**To summarize in one sentence:** Qwen3.5-Omni makes audio and video "deconstructible." It doesn't just "understand" them; it breaks them down into ready-made material that can be searched, reused, and put directly to work.

## What Alibaba Truly Wants to Sell Isn't Just a Model

Having covered product and technology, it's worth looking past the model itself at Alibaba's recent series of organizational and product moves, where a clear commercial through-line emerges.

Not long ago, Alibaba established the Alibaba Token Hub (ATH) business group, directly managed by CEO Eddie Wu, with an explicit focus on "creating Tokens, delivering Tokens, and applying Tokens." Within it, the newly debuted "Wukong Business Unit" has a very clear positioning: a B-end AI-native work platform that deeply integrates model capabilities into enterprise workflows. In the latest "Wukong" product released by DingTalk, the core logic has already evolved from "communication as generation" to "communication as execution" (CLI-fication, where AI directly calls underlying interfaces). AI is no longer just chatting with you; it is expected to go online to watch competitor videos, analyze viral Xiaohongshu content, pull data across systems, and even generate data animations.

**Note the keywords here: watching videos, listening to audio, cross-platform execution. As AI Agents begin to grow "hands and feet" and autonomously process large amounts of audio and video content, their demand for omni-modal understanding and their Token consumption will far exceed those of the pure-text dialogue era.**

Looking back at Qwen3.5-Omni in this context, its extremely low pricing (under 0.8 RMB per million input Tokens, less than one-tenth of Gemini-3.1 Pro's price) and its powerful structured audio-video capabilities look more like cost-effective, stable omni-modal infrastructure being built for the large-scale rollout of Alibaba's B-end enterprise Agents, with Wukong as the representative.

Bear in mind that deconstructing hours of audio and video into fine-grained structured data used to require enterprises to assemble an entire chain (ASR transcription, a text large model, a visual understanding model, TTS synthesis): high-cost, long, and full of breakpoints. Now a single end-to-end omni-modal model has flattened that threshold.

In our view, what Qwen3.5-Omni truly deserves to be remembered for is not how complex a movie trailer it can understand today, but that from this moment on, **it begins to turn audio and video content into "digital assets" that can be tangibly processed and reused within enterprise workflows.**

The productivity revolution driven by omni-modal large models is arriving.