Ollama Embeddings

With Ollama you can run various AI models locally and generate embeddings from them. An embedding is a vector (list) of floating-point numbers. The distance between two vectors measures their relatedness: small distances suggest high relatedness and large distances suggest low relatedness.
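
As a rough illustration of measuring the distance between two embeddings, the sketch below computes cosine distance in plain Java. The text above does not say which metric to use, so cosine distance is an assumption here (it is one common choice), and the class name EmbeddingDistance is hypothetical and not part of Spring AI.

```java
// Illustrative helper, not part of Spring AI: cosine distance between two embedding vectors.
public final class EmbeddingDistance {

    private EmbeddingDistance() {
    }

    /** Returns 1 - cosine similarity: 0 means same direction, larger values mean less related. */
    public static double cosineDistance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            // Accumulate in double precision to limit rounding error.
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Vectors must be non-zero");
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Two vectors pointing in the same direction give a distance of 0, and orthogonal vectors give a distance of 1.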

The OllamaEmbeddingModel implementation leverages the Ollama Embeddings API endpoint.
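
To give a feel for what a call to the embeddings endpoint might look like at the HTTP level, here is a minimal sketch that builds (but does not send) such a request with the JDK HTTP client. It assumes the endpoint is Ollama's POST /api/embed with a JSON body containing model and input fields, and that Ollama runs at http://localhost:11434. The class name OllamaEmbedRequestSketch is hypothetical, and this is not how OllamaEmbeddingModel is necessarily implemented.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative only: builds (but does not send) a request to an Ollama embeddings endpoint.
public final class OllamaEmbedRequestSketch {

    // Assumption: Ollama is reachable at its usual local address.
    static final String BASE_URL = "http://localhost:11434";

    private OllamaEmbedRequestSketch() {
    }

    /** Builds the JSON body for a single input text. */
    public static String body(String model, String text) {
        return "{\"model\":\"" + escape(model) + "\",\"input\":[\"" + escape(text) + "\"]}";
    }

    /** Builds the POST request; the /api/embed path is an assumption about the Ollama REST API. */
    public static HttpRequest build(String model, String text) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/api/embed"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(model, text)))
                .build();
    }

    // Minimal JSON string escaping: handles backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

In an application you would normally let OllamaEmbeddingModel make this call for you; the sketch only shows the general shape of the request.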

Prerequisites

You first need access to a running Ollama instance.

You can pull the models you want to use in your application from the Ollama model library:

ollama pull <model-name>

You can also pull any of the thousands of free GGUF models hosted on Hugging Face:

ollama pull hf.co/<username>/<model-repository>

Alternatively, you can enable the option to automatically download any needed model; see the Auto-pulling Models section.

Auto-configuration

There has been a significant change in the artifact names of the Spring AI auto-configuration and starter modules. Please refer to the upgrade notes for more information.

Spring AI provides Spring Boot auto-configuration for the Ollama Embedding Model. To enable it, add the following dependency to your Maven pom.xml or Gradle build.gradle build file:

Maven:

<dependency>
   <groupId>org.springframework.ai</groupId>
   <artifactId>spring-ai-starter-model-ollama</artifactId>
</dependency>

Gradle:

dependencies {
    implementation 'org.springframework.ai:spring-ai-starter-model-ollama'
}

Refer to the Dependency Management section to add the Spring AI BOM to your build file. Spring AI artifacts are published in Maven Central and Spring Snapshot repositories. Refer to the Repositories section to add these repositories to your build system.

Base Properties

The prefix spring.ai.ollama is the property prefix for configuring the connection to Ollama.

| Property | Description | Default |
|---|---|---|
| spring.ai.ollama.base-url | Base URL where the Ollama API server is running. | http://localhost:11434 |

Here are the properties for initializing the Ollama integration and auto-pulling models.

| Property | Description | Default |
|---|---|---|
| spring.ai.ollama.init.pull-model-strategy | Whether to pull models at startup time and how. | never |
| spring.ai.ollama.init.timeout | How long to wait for a model to be pulled. | 5m |
| spring.ai.ollama.init.max-retries | Maximum number of retries for the model pull operation. | 0 |
| spring.ai.ollama.init.embedding.include | Include this type of models in the initialization task. | true |
| spring.ai.ollama.init.embedding.additional-models | Additional models to initialize besides the ones configured via default properties. | [] |

Embedding Properties

Enabling and disabling of the embedding auto-configurations is now configured via top-level properties with the prefix spring.ai.model.embedding.

To enable, set spring.ai.model.embedding=ollama (it is enabled by default).

To disable, set spring.ai.model.embedding=none (or any value that doesn’t match ollama).

This change allows multiple models to be configured.

The prefix spring.ai.ollama.embedding.options is the property prefix that configures the Ollama embedding model. It includes the Ollama request (advanced) parameters such as model, keep-alive, and truncate, as well as the Ollama model options properties.

Here are the advanced request parameters for the Ollama embedding model:

| Property | Description | Default |
|---|---|---|
| spring.ai.ollama.embedding.enabled (removed and no longer valid) | Enables the Ollama embedding model auto-configuration. | true |
| spring.ai.model.embedding | Enables the Ollama embedding model auto-configuration. | ollama |
| spring.ai.ollama.embedding.options.model | The name of the supported model to use. You can use dedicated Embedding Model types. | mistral |
| spring.ai.ollama.embedding.options.keep_alive | Controls how long the model will stay loaded into memory following the request. | 5m |
| spring.ai.ollama.embedding.options.truncate | Truncates the end of each input to fit within context length. Returns error if false and context length is exceeded. | true |

The remaining options properties are based on the Ollama Valid Parameters and Values and Ollama Types. The default values are based on: Ollama type defaults.

| Property | Description | Default |
|---|---|---|
| spring.ai.ollama.embedding.options.numa | Whether to use NUMA. | false |
| spring.ai.ollama.embedding.options.num-ctx | Sets the size of the context window used to generate the next token. | 2048 |
| spring.ai.ollama.embedding.options.num-batch | Prompt processing maximum batch size. | 512 |
| spring.ai.ollama.embedding.options.num-gpu | The number of layers to send to the GPU(s). On macOS it defaults to 1 to enable Metal support, 0 to disable. -1 here indicates that NumGPU should be set dynamically. | -1 |
| spring.ai.ollama.embedding.options.main-gpu | When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. | 0 |
| spring.ai.ollama.embedding.options.low-vram | - | false |
| spring.ai.ollama.embedding.options.f16-kv | - | true |
| spring.ai.ollama.embedding.options.logits-all | Return logits for all the tokens, not just the last one. To enable completions to return logprobs, this must be true. | - |
| spring.ai.ollama.embedding.options.vocab-only | Load only the vocabulary, not the weights. | - |
| spring.ai.ollama.embedding.options.use-mmap | By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance. Disabling mmap results in slower load times but may reduce pageouts if you’re not using mlock. Note that if the model is larger than the total amount of RAM, turning off mmap would prevent the model from loading at all. | null |
| spring.ai.ollama.embedding.options.use-mlock | Lock the model in memory, preventing it from being swapped out when memory-mapped. This can improve performance but trades away some of the advantages of memory-mapping by requiring more RAM to run and potentially slowing down load times as the model loads into RAM. | false |
| spring.ai.ollama.embedding.options.num-thread | Sets the number of threads to use during computation. By default, Ollama will detect this for optimal performance. It is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). 0 = let the runtime decide. | 0 |
| spring.ai.ollama.embedding.options.num-keep | - | 4 |
| spring.ai.ollama.embedding.options.seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. | -1 |
| spring.ai.ollama.embedding.options.num-predict | Maximum number of tokens to predict when generating text. (-1 = infinite generation, -2 = fill context) | -1 |
| spring.ai.ollama.embedding.options.top-k | Reduces the probability of generating nonsense. A higher value (e.g., 100) will give more diverse answers, while a lower value (e.g., 10) will be more conservative. | 40 |
| spring.ai.ollama.embedding.options.top-p | Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. | 0.9 |
| spring.ai.ollama.embedding.options.min-p | Alternative to top_p that aims to ensure a balance of quality and variety. The parameter p represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with p=0.05 and the most likely token having a probability of 0.9, logits with a value less than 0.045 are filtered out. | 0.0 |
| spring.ai.ollama.embedding.options.tfs-z | Tail-free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. | 1.0 |
| spring.ai.ollama.embedding.options.typical-p | - | 1.0 |
| spring.ai.ollama.embedding.options.repeat-last-n | Sets how far back the model looks to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) | 64 |
| spring.ai.ollama.embedding.options.temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. | 0.8 |
| spring.ai.ollama.embedding.options.repeat-penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. | 1.1 |
| spring.ai.ollama.embedding.options.presence-penalty | - | 0.0 |
| spring.ai.ollama.embedding.options.frequency-penalty | - | 0.0 |
| spring.ai.ollama.embedding.options.mirostat | Enable Mirostat sampling for controlling perplexity. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) | 0 |
| spring.ai.ollama.embedding.options.mirostat-tau | Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text. | 5.0 |
| spring.ai.ollama.embedding.options.mirostat-eta | Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. | 0.1 |
| spring.ai.ollama.embedding.options.penalize-newline | - | true |
| spring.ai.ollama.embedding.options.stop | Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate stop parameters in a modelfile. | - |
| spring.ai.ollama.embedding.options.functions | List of functions, identified by their names, to enable for function calling in a single prompt request. Functions with those names must exist in the functionCallbacks registry. | - |

All properties prefixed with spring.ai.ollama.embedding.options can be overridden at runtime by adding request-specific Runtime Options to the EmbeddingRequest call.

Runtime Options

OllamaOptions.java provides the Ollama configurations, such as the model to use and the low-level GPU and CPU tuning.

The default options can also be configured using the spring.ai.ollama.embedding.options properties.

At start time, use OllamaEmbeddingModel(OllamaApi ollamaApi, OllamaOptions defaultOptions) to configure the default options used for all embedding requests. At run time, you can override the default options by passing an OllamaOptions instance as part of your EmbeddingRequest.

For example, to override the default model name for a specific request:

EmbeddingResponse embeddingResponse = embeddingModel.call(
    new EmbeddingRequest(List.of("Hello World", "World is big and salvation is near"),
        // Request-specific options take precedence over the configured defaults.
        OllamaOptions.builder()
            .model("Different-Embedding-Model-Deployment-Name")
            .truncate(false)
            .build()));
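
The idea that request-specific options win over the configured defaults can be pictured with plain maps. The sketch below is only a conceptual illustration: the class OptionsMergeSketch and its merge method are hypothetical, and it does not reproduce how Spring AI actually merges OllamaOptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: shows "request options win over defaults" with plain maps.
public final class OptionsMergeSketch {

    private OptionsMergeSketch() {
    }

    /** Returns a new map holding the defaults, overridden by any non-null request option. */
    public static Map<String, Object> merge(Map<String, Object> defaults, Map<String, Object> requestOptions) {
        Map<String, Object> merged = new LinkedHashMap<>(defaults);
        requestOptions.forEach((key, value) -> {
            // A missing or null request value leaves the default untouched.
            if (value != null) {
                merged.put(key, value);
            }
        });
        return merged;
    }
}
```

With defaults of model=mistral and truncate=true, a request that sets only the model would keep truncate=true and replace the model.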

Auto-pulling Models

Spring AI Ollama can automatically pull models when they are not available in your Ollama instance. This feature is particularly useful for development and testing as well as for deploying your applications to new environments.

You can also pull, by name, any of the thousands of free GGUF models hosted on Hugging Face.

There are three strategies for pulling models:

  • always (defined in PullModelStrategy.ALWAYS): Always pull the model, even if it’s already available. Useful to ensure you’re using the latest version of the model.

  • when_missing (defined in PullModelStrategy.WHEN_MISSING): Only pull the model if it’s not already available. This may result in using an older version of the model.

  • never (defined in PullModelStrategy.NEVER): Never pull the model automatically.
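
The three strategies above boil down to a simple decision on whether a pull should happen. The following sketch expresses that decision in plain Java; the enum Strategy and the method shouldPull are hypothetical names chosen for illustration and are not the PullModelStrategy type from Spring AI.

```java
// Illustrative only: a hypothetical enum mirroring the three documented strategies.
public final class PullDecisionSketch {

    public enum Strategy { ALWAYS, WHEN_MISSING, NEVER }

    private PullDecisionSketch() {
    }

    /** Decides whether a model should be pulled, given the strategy and local availability. */
    public static boolean shouldPull(Strategy strategy, boolean modelAvailable) {
        switch (strategy) {
            case ALWAYS:
                // Pull even if the model is already present, to pick up the latest version.
                return true;
            case WHEN_MISSING:
                // Pull only when the model is not available yet.
                return !modelAvailable;
            default:
                // NEVER: leave model management to the operator.
                return false;
        }
    }
}
```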

Due to potential delays while downloading models, automatic pulling is not recommended for production environments. Instead, consider assessing and pre-downloading the necessary models in advance.

All models defined via configuration properties and default options can be automatically pulled at startup time. You can configure the pull strategy, timeout, and maximum number of retries using configuration properties:

spring:
  ai:
    ollama:
      init:
        pull-model-strategy: always
        timeout: 60s
        max-retries: 1

The application will not complete its initialization until all specified models are available in Ollama. Depending on the model size and internet connection speed, this may significantly slow down your application’s startup time.
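
To make the max-retries setting more concrete, here is a minimal sketch of a retry loop in plain Java: the operation runs once and is retried up to maxRetries more times if it fails. The class RetrySketch and its method are hypothetical, the timeout setting is not modelled, and this is not a description of how Spring AI implements the pull operation internally.

```java
import java.util.function.Supplier;

// Illustrative only: a generic "one attempt plus up to maxRetries retries" loop.
public final class RetrySketch {

    private RetrySketch() {
    }

    /** Runs the task once, then retries up to maxRetries more times if it throws. */
    public static <T> T callWithRetries(Supplier<T> task, int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must not be negative");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                // Remember the failure and try again if attempts remain.
                last = e;
            }
        }
        // All attempts failed: surface the most recent error.
        throw last;
    }
}
```

With maxRetries set to 0 the task is attempted exactly once, which matches the intuition behind a default of 0 retries.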

You can initialize additional models at startup, which is useful for models used dynamically at runtime:

spring:
  ai:
    ollama:
      init:
        pull-model-strategy: always
        embedding:
          additional-models:
            - mxbai-embed-large
            - nomic-embed-text

If you want to apply the pulling strategy only to specific types of models, you can exclude embedding models from the initialization task:

spring:
  ai:
    ollama:
      init:
        pull-model-strategy: always
        embedding:
          include: false

This configuration will apply the pulling strategy to all models except embedding models.

HuggingFace Models

Ollama can access, out of the box, all GGUF embedding models on Hugging Face. You can pull any of these models by name with ollama pull hf.co/<username>/<model-repository>, or configure the auto-pulling strategy (see the Auto-pulling Models section):

spring.ai.ollama.embedding.options.model=hf.co/mixedbread-ai/mxbai-embed-large-v1
spring.ai.ollama.init.pull-model-strategy=always

  • spring.ai.ollama.embedding.options.model: Specifies the Hugging Face GGUF model to use.

  • spring.ai.ollama.init.pull-model-strategy=always: (optional) Enables automatic model pulling at startup time. For production, you should pre-download the models to avoid delays: ollama pull hf.co/mixedbread-ai/mxbai-embed-large-v1.

Sample Controller

This will create an EmbeddingModel implementation that you can inject into your class. Here is an example of a simple @RestController class that uses the EmbeddingModel implementation.

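import java.util.List;
import java.util.Map;

import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.embedding.EmbeddingResponse;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class EmbeddingController {

    private final EmbeddingModel embeddingModel;

    // The auto-configured EmbeddingModel is injected by Spring.
    @Autowired
    public EmbeddingController(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    // Returns the embedding response for the given message.
    @GetMapping("/ai/embedding")
    public Map<String, EmbeddingResponse> embed(@RequestParam(value = "message", defaultValue = "Tell me a joke") String message) {
        EmbeddingResponse embeddingResponse = this.embeddingModel.embedForResponse(List.of(message));
        return Map.of("embedding", embeddingResponse);
    }
}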
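
To call the endpoint above from a client, the message has to be URL-encoded into the query string. The sketch below builds such a URI with the JDK only. It assumes the application listens on http://localhost:8080, which is an assumption and not something stated in this section, and the class name EmbeddingEndpointUriSketch is hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative only: builds the URI for the /ai/embedding endpoint of the sample controller.
public final class EmbeddingEndpointUriSketch {

    // Assumption: the application runs locally on port 8080.
    static final String APP_BASE_URL = "http://localhost:8080";

    private EmbeddingEndpointUriSketch() {
    }

    /** Encodes the message as a query parameter and returns the full request URI. */
    public static URI embeddingUri(String message) {
        String encoded = URLEncoder.encode(message, StandardCharsets.UTF_8);
        return URI.create(APP_BASE_URL + "/ai/embedding?message=" + encoded);
    }
}
```

The resulting URI can be passed to any HTTP client to issue a GET request against the running application.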

Manual Configuration

If you are not using Spring Boot, you can manually configure the OllamaEmbeddingModel. To do so, add the spring-ai-ollama dependency to your project’s Maven pom.xml or Gradle build.gradle build file:

Maven:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama</artifactId>
</dependency>

Gradle:

dependencies {
    implementation 'org.springframework.ai:spring-ai-ollama'
}

Refer to the Dependency Management section to add the Spring AI BOM to your build file.

The spring-ai-ollama dependency also provides access to the OllamaChatModel. For more information about the OllamaChatModel, refer to the Ollama Chat Client section.

Next, create an OllamaEmbeddingModel instance and use it to compute the embeddings for two input texts using the dedicated chroma/all-minilm-l6-v2-f32 embedding model:

// Build the API client with the builder's defaults.
var ollamaApi = OllamaApi.builder().build();

// Default options applied to every embedding request.
var embeddingModel = new OllamaEmbeddingModel(ollamaApi,
        OllamaOptions.builder()
            .model(OllamaModel.MISTRAL.id())
            .build());

// Override the default model for this request only.
EmbeddingResponse embeddingResponse = embeddingModel.call(
    new EmbeddingRequest(List.of("Hello World", "World is big and salvation is near"),
        OllamaOptions.builder()
            .model("chroma/all-minilm-l6-v2-f32")
            .truncate(false)
            .build()));

OllamaOptions provides the configuration information for all embedding requests.