Apache Cassandra Vector Store

This section walks you through setting up CassandraVectorStore to store document embeddings and perform similarity searches.

What is Apache Cassandra?

Apache Cassandra® is a true open source distributed database renowned for linear scalability, proven fault-tolerance and low latency, making it the perfect platform for mission-critical transactional data.

Its Vector Similarity Search (VSS) is based on the JVector library that ensures best-in-class performance and relevancy.

A vector search in Apache Cassandra is done as simply as:

SELECT content FROM table ORDER BY content_vector ANN OF query_embedding;

More documentation on this can be read here.
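
To make the ordering in the query above concrete, the following is a minimal sketch of what an exact nearest-neighbour ranking computes, assuming cosine similarity as the similarity function. It is a brute-force illustration only: the class and method names are hypothetical, and it is not how Cassandra or JVector execute the query, since ANN uses a graph index to approximate this ranking without scanning every row.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; not Cassandra or JVector code.
class BruteForceNearest {

    // Cosine similarity between two vectors of equal length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the keys of the k rows most similar to the query, best match first.
    static List<String> nearest(Map<String, float[]> rows, float[] query, int k) {
        return rows.entrySet().stream()
            .sorted(Comparator.comparingDouble(
                (Map.Entry<String, float[]> e) -> -cosine(e.getValue(), query)))
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, float[]> rows = Map.of(
            "cats", new float[] {1.0f, 0.0f},
            "kittens", new float[] {0.9f, 0.1f},
            "databases", new float[] {0.0f, 1.0f});
        // Rank all rows against the query embedding and keep the top two.
        System.out.println(nearest(rows, new float[] {1.0f, 0.0f}, 2));
    }
}
```

An exact scan like this costs time proportional to the number of rows, which is the cost an ANN index is meant to avoid.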

What is JVector?

JVector is a pure Java embedded vector search engine.

It stands out from other HNSW Vector Similarity Search implementations by being:

- Algorithmic-fast. JVector uses state-of-the-art graph algorithms inspired by DiskANN and related research that offer high recall and low latency.
- Implementation-fast. JVector uses the Panama SIMD API to accelerate index builds and queries.
- Memory efficient. JVector compresses vectors using product quantization so they can stay in memory during searches.
- Disk-aware. JVector's disk layout is designed to perform the minimum necessary I/O operations at query time.
- Concurrent. Index builds scale linearly to at least 32 threads. Double the threads, halve the build time.
- Incremental. Query your index as you build it. There is no delay between adding a vector and being able to find it in search results.
- Easy to embed. The API is designed for easy embedding, by people using it in production.

Prerequisites

An EmbeddingModel instance to compute the document embeddings. This is usually configured as a Spring Bean. Several options are available:
- Transformers Embedding - computes the embedding in your local environment. The default is via ONNX and the all-MiniLM-L6-v2 Sentence Transformers model. This works out of the box.
- OpenAI Embeddings - uses the OpenAI embedding endpoint. You need to create an account at OpenAI Signup and generate an API key at API Keys.
- There are many more choices; see the Embeddings API docs.

An Apache Cassandra instance, version 5.0-beta1 or later:
- DIY Quick Start
- For a managed offering, Astra DB offers a generous free tier.

Dependencies

The artifact names of the Spring AI auto-configuration and starter modules have changed significantly. Please refer to the upgrade notes for more information.

For dependency management, we recommend using the Spring AI BOM as explained in the Dependency Management section.
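
As a rough sketch of what importing the BOM looks like in a Maven build: the `spring-ai.version` property below is an assumption, to be defined in your own POM with the release you use. With the BOM imported, the Spring AI dependencies can omit their own version elements.

```xml
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <!-- Assumption: spring-ai.version is a property you define yourself -->
            <version>${spring-ai.version}</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
```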

Add these dependencies to your project:

For just the Cassandra Vector Store:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-cassandra-store</artifactId>
</dependency>

Or, for everything you need in a RAG application (using the default ONNX Embedding Model):

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-vector-store-cassandra</artifactId>
</dependency>

Configuration Properties

You can use the following properties in your Spring Boot configuration to customize the Apache Cassandra vector store.

Property	Default Value
`spring.ai.vectorstore.cassandra.keyspace`	springframework
`spring.ai.vectorstore.cassandra.table`	ai_vector_store
`spring.ai.vectorstore.cassandra.initialize-schema`	false
`spring.ai.vectorstore.cassandra.index-name`
`spring.ai.vectorstore.cassandra.content-column-name`	content
`spring.ai.vectorstore.cassandra.embedding-column-name`	embedding
`spring.ai.vectorstore.cassandra.fixed-thread-pool-executor-size`	16
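
For illustration, an application.properties fragment overriding some of these defaults might look like the following. The property names come from the table above; the values are placeholders chosen for this sketch, not recommendations.

```properties
# Placeholder values for illustration only
spring.ai.vectorstore.cassandra.keyspace=my_keyspace
spring.ai.vectorstore.cassandra.table=my_vectors
spring.ai.vectorstore.cassandra.initialize-schema=true
spring.ai.vectorstore.cassandra.content-column-name=text
spring.ai.vectorstore.cassandra.embedding-column-name=vector
spring.ai.vectorstore.cassandra.fixed-thread-pool-executor-size=32
```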

Usage

Basic Usage

Create a CassandraVectorStore instance as a Spring Bean:

@Bean
public VectorStore vectorStore(CqlSession session, EmbeddingModel embeddingModel) {
    return CassandraVectorStore.builder(embeddingModel)
        .session(session)
        .keyspace("my_keyspace")
        .table("my_vectors")
        .build();
}

Once you have the vector store instance, you can add documents and perform searches:

// Add documents
vectorStore.add(List.of(
    new Document("1", "content1", Map.of("key1", "value1")),
    new Document("2", "content2", Map.of("key2", "value2"))
));

// Search with filters
List<Document> results = vectorStore.similaritySearch(
    SearchRequest.builder().query("search text")
        .topK(5)
        .similarityThreshold(0.7f)
        .filterExpression("metadata.key1 == 'value1'")
        .build()
);

Advanced Configuration

For more complex use cases, you can configure additional settings in your Spring Bean:

@Bean
public VectorStore vectorStore(CqlSession session, EmbeddingModel embeddingModel) {
    return CassandraVectorStore.builder(embeddingModel)
        .session(session)
        .keyspace("my_keyspace")
        .table("my_vectors")
        // Configure primary keys
        .partitionKeys(List.of(
            new SchemaColumn("id", DataTypes.TEXT),
            new SchemaColumn("category", DataTypes.TEXT)
        ))
        .clusteringKeys(List.of(
            new SchemaColumn("timestamp", DataTypes.TIMESTAMP)
        ))
        // Add metadata columns with optional indexing
        .addMetadataColumns(
            new SchemaColumn("category", DataTypes.TEXT, SchemaColumnTags.INDEXED),
            new SchemaColumn("score", DataTypes.DOUBLE)
        )
        // Customize column names
        .contentColumnName("text")
        .embeddingColumnName("vector")
        // Performance tuning
        .fixedThreadPoolExecutorSize(32)
        // Schema management
        .initializeSchema(true)
        // Custom batching strategy
        .batchingStrategy(new TokenCountBatchingStrategy())
        .build();
}

Connection Configuration

There are two ways to configure the connection to Cassandra:

Using an injected CqlSession (recommended):

@Bean
public VectorStore vectorStore(CqlSession session, EmbeddingModel embeddingModel) {
    return CassandraVectorStore.builder(embeddingModel)
        .session(session)
        .keyspace("my_keyspace")
        .table("my_vectors")
        .build();
}

Using connection details directly in the builder:

@Bean
public VectorStore vectorStore(EmbeddingModel embeddingModel) {
    return CassandraVectorStore.builder(embeddingModel)
        .contactPoint(new InetSocketAddress("localhost", 9042))
        .localDatacenter("datacenter1")
        .keyspace("my_keyspace")
        .build();
}
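
When the CqlSession is injected, as in the first option, it usually has to come from somewhere. A common source is Spring Boot's own Cassandra auto-configuration. The fragment below is a sketch under that assumption: the property names are those of Spring Boot 3.x and are not defined by this vector store, so check them against the Spring Boot version you use.

```properties
# Assumption: Spring Boot 3.x Cassandra auto-configuration creates the CqlSession bean
spring.cassandra.contact-points=localhost:9042
spring.cassandra.local-datacenter=datacenter1
```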

Metadata Filtering

You can leverage the generic, portable metadata filters with the CassandraVectorStore. For metadata columns to be searchable, they must be either primary keys or SAI indexed. To index a non-primary-key column, configure the metadata column with SchemaColumnTags.INDEXED.

For example, you can use either the text expression language:

vectorStore.similaritySearch(
    SearchRequest.builder().query("The World")
        .topK(5)
        .filterExpression("country in ['UK', 'NL'] && year >= 2020").build());

or programmatically using the expression DSL:

// Build the expression with a separate builder instance
FilterExpressionBuilder b = new FilterExpressionBuilder();
Filter.Expression f = b.and(
        b.in("country", "UK", "NL"),
        b.gte("year", 2020)
    ).build();

vectorStore.similaritySearch(
    SearchRequest.builder().query("The World")
        .topK(5)
        .filterExpression(f).build());

The portable filter expressions get automatically converted into CQL queries.

Advanced Example: Vector Store on top of Wikipedia Dataset

The following example demonstrates how to use the store on an existing schema. Here we use the schema from the https://github.com/datastax-labs/colbert-wikipedia-data project, which comes with the full Wikipedia dataset already vectorized for you.

First, create the schema in the Cassandra database:

wget https://s.apache.org/colbert-wikipedia-schema-cql -O colbert-wikipedia-schema.cql
cqlsh -f colbert-wikipedia-schema.cql

Then configure the store using the builder pattern:

@Bean
public VectorStore vectorStore(CqlSession session, EmbeddingModel embeddingModel) {
    List<SchemaColumn> partitionColumns = List.of(
        new SchemaColumn("wiki", DataTypes.TEXT),
        new SchemaColumn("language", DataTypes.TEXT),
        new SchemaColumn("title", DataTypes.TEXT)
    );

    List<SchemaColumn> clusteringColumns = List.of(
        new SchemaColumn("chunk_no", DataTypes.INT),
        new SchemaColumn("bert_embedding_no", DataTypes.INT)
    );

    List<SchemaColumn> extraColumns = List.of(
        new SchemaColumn("revision", DataTypes.INT),
        new SchemaColumn("id", DataTypes.INT)
    );

    return CassandraVectorStore.builder(embeddingModel)
        .session(session)
        .keyspace("wikidata")
        .table("articles")
        .partitionKeys(partitionColumns)
        .clusteringKeys(clusteringColumns)
        .contentColumnName("body")
        .embeddingColumnName("all_minilm_l6_v2_embedding")
        .indexName("all_minilm_l6_v2_ann")
        .initializeSchema(false)
        .addMetadataColumns(extraColumns)
        .primaryKeyTranslator((List<Object> primaryKeys) -> {
            if (primaryKeys.isEmpty()) {
                return "test§¶0";
            }
            return String.format("%s§¶%s", primaryKeys.get(2), primaryKeys.get(3));
        })
        .documentIdTranslator((id) -> {
            String[] parts = id.split("§¶");
            String title = parts[0];
            int chunk_no = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
            return List.of("simplewiki", "en", title, chunk_no, 0);
        })
        .build();
}

@Bean
public EmbeddingModel embeddingModel() {
    // default is ONNX all-MiniLM-L6-v2 which is what we want
    return new TransformersEmbeddingModel();
}
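
The two translator lambdas in the configuration above have to be inverses of each other for the rows this store addresses, so it can help to exercise them in isolation. The following is a minimal sketch using plain Java only: the class and method names are hypothetical and not part of Spring AI, while the logic is copied from the lambdas above.

```java
import java.util.List;

// Hypothetical helper class; the names here are illustrative and not part of Spring AI.
class WikiIdTranslators {

    // Same two-character delimiter as in the bean definition above, written with Unicode escapes.
    static final String DELIMITER = "\u00a7\u00b6";

    // Mirrors the primaryKeyTranslator lambda: primary-key values -> document id.
    static String toDocumentId(List<Object> primaryKeys) {
        if (primaryKeys.isEmpty()) {
            return "test" + DELIMITER + "0";
        }
        // Key order is (wiki, language, title, chunk_no, bert_embedding_no),
        // so index 2 is the title and index 3 is the chunk number.
        return String.format("%s%s%s", primaryKeys.get(2), DELIMITER, primaryKeys.get(3));
    }

    // Mirrors the documentIdTranslator lambda: document id -> primary-key values.
    static List<Object> toPrimaryKeys(String id) {
        String[] parts = id.split(DELIMITER);
        String title = parts[0];
        int chunkNo = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return List.of("simplewiki", "en", title, chunkNo, 0);
    }

    public static void main(String[] args) {
        List<Object> keys = List.of("simplewiki", "en", "Cassandra", 3, 0);
        String id = toDocumentId(keys);
        // Round trip: the id maps back to the same primary-key values.
        System.out.println(toPrimaryKeys(id).equals(keys));
    }
}
```

Note that the round trip only holds for rows in the "simplewiki"/"en" partition with bert_embedding_no 0, because the document id carries only the title and the chunk number.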

Loading the Complete Wikipedia Dataset

To load the full wikipedia dataset:

Download simplewiki-sstable.tar from https://s.apache.org/simplewiki-sstable-tar (this will take a while; the file is tens of GB).
Load the data:

tar -xf simplewiki-sstable.tar -C ${CASSANDRA_DATA}/data/wikidata/articles-*/
nodetool import wikidata articles ${CASSANDRA_DATA}/data/wikidata/articles-*/

If you have existing data in this table, check that the tarball's files don't clobber existing sstables when extracting.
An alternative to nodetool import is to just restart Cassandra.
If there are any failures in the indexes, they will be rebuilt automatically.

Accessing the Native Client

The Cassandra Vector Store implementation provides access to the underlying native Cassandra client (CqlSession) through the getNativeClient() method:

CassandraVectorStore vectorStore = context.getBean(CassandraVectorStore.class);
Optional<CqlSession> nativeClient = vectorStore.getNativeClient();

if (nativeClient.isPresent()) {
    CqlSession session = nativeClient.get();
    // Use the native client for Cassandra-specific operations
}

The native client gives you access to Cassandra-specific features and operations that might not be exposed through the VectorStore interface.