---
title: "Google releases its first native multimodal embedding model, Gemini Embedding 2"
type: "News"
locale: "zh-CN"
url: "https://longbridge.com/zh-CN/news/278627037.md"
description: "Google DeepMind launched its first native multimodal embedding model, Gemini Embedding 2, on March 10. The model unifies text, images, videos, audio, and documents into a single embedding space, supports over 100 languages, and introduces native speech embedding for the first time, eliminating the intermediate step of converting speech to text. It uses MRL technology to flexibly compress vector dimensions, balancing performance and storage costs."
datetime: "2026-03-10T23:36:34.000Z"
locales:
  - [zh-CN](https://longbridge.com/zh-CN/news/278627037.md)
  - [en](https://longbridge.com/en/news/278627037.md)
  - [zh-HK](https://longbridge.com/zh-HK/news/278627037.md)
---

> Available languages: [English](https://longbridge.com/en/news/278627037.md) | [繁體中文](https://longbridge.com/zh-HK/news/278627037.md)

# Google releases its first native multimodal embedding model, Gemini Embedding 2

On March 10, Google DeepMind launched Gemini Embedding 2, the company's first native multimodal embedding model. It unifies text, images, videos, audio, and documents into a single embedding space, marking a step toward full-modal integration in AI embedding technology. Gemini Embedding 2 supports semantic understanding in over 100 languages and, according to Google, surpasses existing mainstream models on benchmarks for text, image, and video tasks, while also introducing the speech-processing capability that embedding models previously lacked. The model is now in public preview through the Gemini API and Vertex AI, so developers can access it immediately.
For enterprise users, this release directly lowers the technical barrier to building multimodal retrieval-augmented generation (RAG), semantic search, and data-classification systems, promising to simplify the complex data pipelines that previously handled each modality separately.

## Full-modal Unification: Expanding from Text to Five Media Forms

Gemini Embedding 2 is built on the Gemini architecture and extends embedding from pure text to five input forms:

> - **Text**: up to 8192 input tokens;
> - **Images**: up to 6 images per request, in PNG or JPEG format;
> - **Videos**: MP4 and MOV files up to 120 seconds long;
> - **Audio**: ingested directly into embedding vectors, with no intermediate text-transcription step;
> - **Documents**: PDF files of up to 6 pages embedded directly.

**Unlike the traditional approach of processing one modality at a time, the model supports interleaved input: combinations such as images plus text can be passed in a single request, letting the model capture complex, subtle semantic relationships between media types.**

Gemini Embedding 2 continues the Matryoshka Representation Learning (MRL) technique used in Google's previous embedding models.
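The per-modality limits listed above can be made concrete with a small client-side validation helper. This is a minimal sketch using only the figures reported in the article; the names `EmbedPart` and `validate_request` are hypothetical illustrations, not part of the actual Gemini API.

```python
from dataclasses import dataclass

# Per-modality limits as reported in the article. The class and
# function below are hypothetical helpers, not the Gemini API.
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6           # PNG / JPEG
MAX_VIDEO_SECONDS = 120  # MP4 / MOV
MAX_PDF_PAGES = 6

@dataclass
class EmbedPart:
    kind: str  # "text" | "image" | "video" | "audio" | "pdf"
    size: int  # tokens for text, 1 per image, seconds for video, pages for pdf

def validate_request(parts: list[EmbedPart]) -> list[str]:
    """Return a list of limit violations for one interleaved request."""
    errors = []
    if sum(p.size for p in parts if p.kind == "text") > MAX_TEXT_TOKENS:
        errors.append("text exceeds 8192 tokens")
    if sum(1 for p in parts if p.kind == "image") > MAX_IMAGES:
        errors.append("more than 6 images")
    for p in parts:
        if p.kind == "video" and p.size > MAX_VIDEO_SECONDS:
            errors.append("video longer than 120 seconds")
        if p.kind == "pdf" and p.size > MAX_PDF_PAGES:
            errors.append("PDF longer than 6 pages")
    return errors

# Interleaved input: images and text in a single request passes;
# a 300-second video violates the 120-second limit.
ok = validate_request([EmbedPart("text", 512), EmbedPart("image", 1),
                       EmbedPart("image", 1)])
bad = validate_request([EmbedPart("video", 300)])
```

Batching a mixed request this way mirrors the interleaved-input design: one request, several modalities, one set of shared limits to check before submission.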
**This technique "nests" representations so that the output dimension can be flexibly reduced from the default of 3072, helping developers strike a balance between model performance and storage costs.** Google recommends that developers choose among three dimensions, 3072, 1536, or 768, according to the application scenario, to get the best embedding quality for the cost. This design matters most for enterprises deploying embedding vectors at scale, since it keeps infrastructure costs in check without significantly sacrificing accuracy.

## Leading Benchmark Tests, Speech Capability as a New Highlight

Google stated that Gemini Embedding 2 outperforms current mainstream competitors on benchmarks for text, image, and video tasks, positioning it as a new performance benchmark for multimodal embedding.

In terms of capability coverage, the model introduces the native speech embedding that previous models of this kind generally lacked, processing audio data directly without an intermediate speech-to-text step. Google points out that embedding technology is already widely used across several of its products, covering context engineering in RAG scenarios, large-scale data management, and traditional search and analytics.
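The MRL dimension trade-off described earlier can be sketched in plain Python: a Matryoshka-trained embedding lets you keep just the first k components of the full 3072-dimensional vector and renormalize. The random vector below is stand-in illustration data, not real model output, and `mrl_truncate` is a hypothetical helper, not an API call.

```python
import math
import random

def mrl_truncate(embedding: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components of a Matryoshka-style embedding
    and re-normalize to unit length (e.g. 3072 -> 1536 or 768)."""
    truncated = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]

random.seed(0)
full = [random.gauss(0.0, 1.0) for _ in range(3072)]  # stand-in vector
norm = math.sqrt(sum(x * x for x in full))
full = [x / norm for x in full]   # unit-length "embedding"

small = mrl_truncate(full, 768)   # 4x cheaper to store per vector
```

Because MRL training front-loads the most informative components, a truncated-and-renormalized vector remains usable for cosine-similarity search while cutting storage fourfold at 768 dimensions.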
Some early-access partners have already begun building multimodal applications on Gemini Embedding 2, and Google says these use cases are realizing the model's practical potential in high-value scenarios.

### Related stocks

- [Alphabet (GOOGL.US)](https://longbridge.com/zh-CN/quote/GOOGL.US.md)
- [Alphabet - C (GOOG.US)](https://longbridge.com/zh-CN/quote/GOOG.US.md)
- [Direxion Daily GOOGL Bull 2X Shares (GGLL.US)](https://longbridge.com/zh-CN/quote/GGLL.US.md)
- [Roundhill GOOGL WeeklyPay ETF (GOOW.US)](https://longbridge.com/zh-CN/quote/GOOW.US.md)

## Related news and research

- [Google rolls out new Gemini capabilities to Docs, Sheets, Slides, and Drive](https://longbridge.com/zh-CN/news/278560180.md)
- [Alphabet's Google Introduces New Gemini Features in Workspace](https://longbridge.com/zh-CN/news/278594065.md)
- [Alphabet's Google Sued for 'Wrongful Death' Over Suicide Allegedly Caused by Gemini's Advice](https://longbridge.com/zh-CN/news/277803485.md)
- [Alphabet Inc. $GOOG Shares Purchased by HUB Investment Partners LLC](https://longbridge.com/zh-CN/news/278263989.md)
- [Google's Gemini for Government Debuts Feature Letting Civilian, Military Personnel Build Their Own Agents](https://longbridge.com/zh-CN/news/278592164.md)