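Google has released Gemini Embedding 2 in public preview through the Gemini API and Vertex AI. The model maps text, images, video, audio, and PDFs into a unified vector space, supports interleaved multimodal input and semantic representations across more than 100 languages, and can be used for cross-media retrieval, classification, RAG, and clustering. Scalable output dimensions let developers trade off performance against storage cost, and its overall multimodal and speech performance exceeds that of earlier and leading models.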

# Gemini Embedding 2: Our first natively multimodal embedding model

Gemini Embedding 2 is our first natively multimodal embedding model that maps text, images, video, audio and documents into a single embedding space, enabling multimodal retrieval and classification across different types of media, and it’s available now in public preview.

Tom Duerig

Distinguished Engineer, Google DeepMind

## General summary

Google is releasing Gemini Embedding 2, a multimodal embedding model built on the Gemini architecture. You can now map text, images, videos, audio, and documents into a single embedding space. To get started, use the Gemini API or Vertex AI, and check out the interactive notebooks.

![Gemini Embedding 2](https://storage.googleapis.com/gweb-uniblog-publish-prod/images/gemini-2_embedding_keyword_blog_h.width-200.format-webp.webp)

Today we’re releasing Gemini Embedding 2, our first fully multimodal embedding model built on the Gemini architecture, in Public Preview via the [Gemini API](https://ai.google.dev/gemini-api/docs/embeddings) and [Vertex AI](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/embedding-2).

Expanding on our previous text-only foundation, Gemini Embedding 2 maps text, images, videos, audio and documents into a single, unified embedding space, and captures semantic intent across over 100 languages. This simplifies complex pipelines and enhances a wide variety of multimodal downstream tasks—from Retrieval-Augmented Generation (RAG) and semantic search to sentiment analysis and data clustering.
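To make the idea of a single embedding space concrete, here is a minimal sketch of cross-media semantic search. It assumes the embeddings have already been computed; the 4-dimensional vectors, item names, and the `search` helper are all hypothetical stand-ins chosen for illustration, not output from the model or part of any SDK.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumption: real ones
# would come from the embedding model and have far more dimensions).
corpus = {
    "image:dog_photo": [0.9, 0.1, 0.0, 0.1],
    "audio:tax_podcast": [0.0, 0.8, 0.5, 0.1],
    "video:cat_clip": [0.7, 0.0, 0.1, 0.6],
}


def search(query_vector, items, top_k=2):
    """Rank items of any media type by similarity to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), name)
              for name, vec in items.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical embedding of a text query such as "a dog playing".
query = [0.85, 0.05, 0.0, 0.2]
results = search(query, corpus)
print(results)
```

Because every item lives in the same space, one similarity function and one index can rank images, audio, and video against a text query without per-modality pipelines.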

## New modalities and flexible output dimensions

The model is based on Gemini and leverages its best-in-class multimodal understanding capabilities to create high-quality embeddings across:

- Text: supports an expansive context of up to 8192 input tokens
- Images: capable of processing up to 6 images per request, supporting PNG and JPEG formats
- Videos: supports up to 120 seconds of video input in MP4 and MOV formats
- Audio: natively ingests and embeds audio data without needing intermediate text transcriptions
- Documents: directly embed PDFs up to 6 pages long
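The limits above can be checked on the client before a request is sent. The sketch below is one possible way to do that; `EmbedRequest` and `validate` are hypothetical names invented for this example and are not part of the Gemini API or any SDK. Only the numeric limits and formats come from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Limits and formats quoted in the list above.
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_PDF_PAGES = 6
IMAGE_FORMATS = {"png", "jpeg"}
VIDEO_FORMATS = {"mp4", "mov"}


@dataclass
class EmbedRequest:
    """Hypothetical description of one request (not an SDK type)."""
    text_tokens: int = 0
    image_formats: List[str] = field(default_factory=list)
    video_seconds: float = 0.0
    video_format: Optional[str] = None
    pdf_pages: int = 0


def validate(request):
    """Return a list of limit violations; an empty list means none found."""
    problems = []
    if request.text_tokens > MAX_TEXT_TOKENS:
        problems.append(f"text exceeds {MAX_TEXT_TOKENS} tokens")
    if len(request.image_formats) > MAX_IMAGES:
        problems.append(f"more than {MAX_IMAGES} images")
    for fmt in request.image_formats:
        if fmt.lower() not in IMAGE_FORMATS:
            problems.append(f"unsupported image format: {fmt}")
    if request.video_seconds > MAX_VIDEO_SECONDS:
        problems.append(f"video longer than {MAX_VIDEO_SECONDS} seconds")
    if (request.video_format is not None
            and request.video_format.lower() not in VIDEO_FORMATS):
        problems.append(f"unsupported video format: {request.video_format}")
    if request.pdf_pages > MAX_PDF_PAGES:
        problems.append(f"PDF longer than {MAX_PDF_PAGES} pages")
    return problems


ok = EmbedRequest(text_tokens=512, image_formats=["png", "jpeg"],
                  video_seconds=30, video_format="mp4", pdf_pages=2)
bad = EmbedRequest(text_tokens=9000, image_formats=["gif"], pdf_pages=7)
print(validate(ok))
print(validate(bad))
```

A check like this only catches the limits stated in this post; the service remains the authority on what it accepts.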

Beyond processing one modality at a time, this model natively understands interleaved input, so you can pass multiple modalities (e.g., image + text) in a single request. This allows the model to capture the nuanced relationships between different media types, unlocking a more accurate understanding of complex, real-world data.

Like our previous embedding models, Gemini Embedding 2 incorporates Matryoshka Representation Learning (MRL), a technique that “nests” information so that the leading dimensions of a vector carry the most important signal. Output dimensions can therefore be scaled down from the default of 3072, letting developers balance performance against storage costs. We recommend using 3072, 1536, or 768 dimensions for the highest quality.
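The trade-off between dimension count and storage can be sketched in a few lines. This example assumes the common MRL convention that a shorter embedding is the prefix of the full one, re-normalized to unit length, and that each value is stored as a 4-byte float. Both are assumptions for illustration, and `truncate_embedding` and `storage_bytes` are hypothetical helpers rather than API calls.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


def storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the vectors alone (assumes 4-byte floats)."""
    return num_vectors * dim * bytes_per_value


# Deterministic stand-in for a 3072-dimensional embedding.
full = [math.sin(i + 1) for i in range(3072)]
small = truncate_embedding(full, 768)

print(len(small))
for dim in (3072, 1536, 768):
    gigabytes = storage_bytes(1_000_000, dim) / 1e9
    print(dim, "dimensions:", gigabytes, "GB per million vectors")
```

Under these assumptions, dropping from 3072 to 768 dimensions cuts raw vector storage by a factor of four; whether the quality loss is acceptable is something to measure on your own retrieval task.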

To see these embeddings in action, try out our lightweight multimodal semantic search [demo](https://findmemedia.lmm.ai/).

## State-of-the-art performance

Gemini Embedding 2 doesn't just improve on legacy models. It establishes a new performance standard for multimodal depth, introducing strong speech capabilities and outperforming leading models in text, image, and video tasks. These measurable gains, combined with its unique multimodal coverage, let developers address diverse embedding needs with a single model.

## Unlocking deeper meaning for data

Embeddings are the technology that powers experiences in many Google products. From RAG, where embeddings can play a crucial role in context engineering, to large-scale data management and classic search and analysis, some of our early access partners are already using Gemini Embedding 2 to unlock high-value multimodal applications.

## Start building today

Get started with the Gemini Embedding 2 model through [Gemini API](https://ai.google.dev/gemini-api/docs/embeddings) or [Vertex AI](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings).

Learn how to use the model in our interactive [Gemini API](https://github.com/google-gemini/cookbook/blob/main/quickstarts/Embeddings.ipynb) and [Vertex AI](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/embedding/intro_gemini_embedding.ipynb) Colab notebooks. You can also use it through [LangChain](https://docs.langchain.com/oss/python/integrations/text_embedding/google_generative_ai), [LlamaIndex](https://developers.llamaindex.ai/python/framework/integrations/embeddings/google_genai/), [Haystack](https://haystack.deepset.ai/integrations/google-genai), [Weaviate](https://docs.weaviate.io/weaviate/model-providers/google), [Qdrant](https://qdrant.tech/documentation/embeddings/gemini/), [ChromaDB](https://docs.trychroma.com/integrations/embedding-models/google-gemini), and [Vector Search](https://docs.cloud.google.com/vertex-ai/docs/vector-search-2/overview).

By bringing semantic meaning to the diverse data around us, Gemini Embedding 2 provides the essential multimodal foundation for the next era of advanced AI experiences. We can’t wait to see what you build.
