Multimodality API

"All things that are naturally connected ought to be taught in combination" - John Amos Comenius, "Orbis Sensualium Pictus", 1658

Humans process knowledge simultaneously across multiple modes of data input. The way we learn and our experiences are all multimodal. We don’t have just vision, just audio, or just text.

Contrary to those principles, machine learning has often focused on specialized models tailored to processing a single modality. For instance, we developed audio models for tasks like text-to-speech or speech-to-text, and computer vision models for tasks such as object detection and classification.

However, a new wave of multimodal large language models has started to emerge. Examples include OpenAI’s GPT-4o, Google’s Vertex AI Gemini 1.5, Anthropic’s Claude 3, and open-source offerings such as Llama 3.2, LLaVA, and BakLLaVA, all of which can accept multiple inputs, including text, images, audio, and video, and generate text responses by integrating these inputs.

These multimodal large language model (LLM) capabilities enable the models to process and generate text in conjunction with other modalities such as images, audio, or video.

Spring AI Multimodality

Multimodality refers to a model’s ability to simultaneously understand and process information from various sources, including text, images, audio, and other data formats.

The Spring AI Message API provides all necessary abstractions to support multimodal LLMs.

[Figure: the Spring AI Message API]

The UserMessage’s content field is used primarily for text inputs, while the optional media field allows adding one or more pieces of additional content of different modalities, such as images, audio, and video. The MimeType specifies the modality type. Depending on the LLM used, the Media data field can be either the raw media content, as a Resource object, or a URI pointing to the content.
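To make that structure concrete, here is a minimal standalone sketch of the shape described above. The record names deliberately mirror the Spring AI classes, but this is only a simplified illustration for this document, not the real API from the org.springframework.ai packages:

```java
import java.net.URI;
import java.util.List;

public class MessageShapeDemo {

    // A media attachment: a MIME type plus the content itself, which
    // (depending on the model) may be raw bytes, a Resource, or a URI.
    record Media(String mimeType, Object data) {}

    // A user message: primary text plus zero or more media attachments.
    record UserMessage(String text, List<Media> media) {}

    public static void main(String[] args) {
        var multimodal = new UserMessage(
                "Explain what you see in this picture.",
                List.of(new Media("image/png",
                        URI.create("classpath:/multimodal.test.png"))));

        // The text and the attachment travel together in one message.
        System.out.println(multimodal.text());
        System.out.println(multimodal.media().get(0).mimeType());
    }
}
```

The point of the shape is that text and media are carried by the same message object, so a multimodal model receives them as one combined input rather than as separate requests.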

The media field is currently applicable only to user input messages (e.g., UserMessage). It does not hold significance for system messages. The AssistantMessage, which contains the LLM response, provides text content only. To generate non-text media outputs, you should use one of the dedicated single-modality models.

For example, we can take the following picture (multimodal.test.png) as an input and ask the LLM to explain what it sees.

[Image: multimodal.test.png, a photo of a fruit bowl holding two bananas and an apple]

For most of the multimodal LLMs, the Spring AI code would look something like this:

var imageResource = new ClassPathResource("/multimodal.test.png");

var userMessage = new UserMessage(
	"Explain what you see in this picture.", // content
	new Media(MimeTypeUtils.IMAGE_PNG, imageResource)); // media

ChatResponse response = chatModel.call(new Prompt(userMessage));

or with the fluent ChatClient API:

String response = ChatClient.create(chatModel).prompt()
		.user(u -> u.text("Explain what you see in this picture.")
				    .media(MimeTypeUtils.IMAGE_PNG, new ClassPathResource("/multimodal.test.png")))
		.call()
		.content();

and produce a response like:

This is an image of a fruit bowl with a simple design. The bowl is made of metal with curved wire edges that create an open structure, allowing the fruit to be visible from all angles. Inside the bowl, there are two yellow bananas resting on top of what appears to be a red apple. The bananas are slightly overripe, as indicated by the brown spots on their peels. The bowl has a metal ring at the top, likely to serve as a handle for carrying. The bowl is placed on a flat surface with a neutral-colored background that provides a clear view of the fruit inside.

Spring AI provides multimodal support for the following chat models: