Apache Cassandra Vector Store
This section walks you through setting up CassandraVectorStore
to store document embeddings and perform similarity searches.
What is Apache Cassandra?
Apache Cassandra® is a true open source distributed database renowned for linear scalability, proven fault-tolerance and low latency, making it the perfect platform for mission-critical transactional data.
Its Vector Similarity Search (VSS) is based on the JVector library that ensures best-in-class performance and relevancy.
A vector search in Apache Cassandra is as simple as:
SELECT content FROM table ORDER BY content_vector ANN OF query_embedding;
More docs on this can be read here.
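To make the ordering in the query above concrete, here is a minimal sketch of what an exact (brute-force) nearest-neighbour ranking looks like in plain Java. It is an illustration only: Cassandra's ANN index returns an approximation of this ordering without scanning every row, and the class name, method names, and the choice of cosine similarity are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: exact nearest-neighbour ranking over a few rows.
// An ANN index approximates this ordering without comparing against every row.
final class ExactNearestNeighbours {

    // Cosine similarity of two equal-length, non-zero vectors.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the keys of the k rows most similar to the query, best match first.
    static List<String> topK(Map<String, float[]> rows, float[] query, int k) {
        List<String> keys = new ArrayList<>(rows.keySet());
        // Sort by descending similarity (negate so the highest score comes first).
        keys.sort(Comparator.comparingDouble((String key) -> -cosine(rows.get(key), query)));
        return keys.subList(0, Math.min(k, keys.size()));
    }
}
```

Calling `ExactNearestNeighbours.topK(rows, queryEmbedding, 5)` plays the role of the `ORDER BY ... ANN OF` clause combined with a result limit, at the cost of comparing the query against every stored vector.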
This Spring AI Vector Store is designed to work both for brand-new RAG applications and for retrofitting on top of existing data and tables.
The store can also be used for non-RAG use-cases in an existing database, e.g. semantic searches, geo-proximity searches, etc.
The store will automatically create, or enhance, the schema as needed according to its configuration. If you don't want the schema modifications, configure the store with initializeSchema set to false.
When using spring-boot-autoconfigure, initializeSchema defaults to false, per Spring Boot standards, and you must opt in to schema creation/modifications by setting …initialize-schema=true in the application.properties file.
What is JVector?
JVector is a pure Java embedded vector search engine.
It stands out from other HNSW Vector Similarity Search implementations by being:
- Algorithmic-fast. JVector uses state-of-the-art graph algorithms inspired by DiskANN and related research that offer high recall and low latency.
- Implementation-fast. JVector uses the Panama SIMD API to accelerate index build and queries.
- Memory efficient. JVector compresses vectors using product quantization so they can stay in memory during searches.
- Disk-aware. JVector's disk layout is designed to do the minimum necessary IOPS at query time.
- Concurrent. Index builds scale linearly to at least 32 threads. Double the threads, halve the build time.
- Incremental. Query your index as you build it. No delay between adding a vector and being able to find it in search results.
- Easy to embed. API designed for easy embedding, by people using it in production.
Prerequisites
- An EmbeddingModel instance to compute the document embeddings. This is usually configured as a Spring Bean. Several options are available:
  - Transformers Embedding - computes the embedding in your local environment. The default is via ONNX and the all-MiniLM-L6-v2 Sentence Transformers. This just works.
  - OpenAI Embeddings - uses the OpenAI embedding endpoint. You need to create an account at OpenAI Signup and generate the api-key token at API Keys.
  - There are many more choices, see the Embeddings API docs.
- An Apache Cassandra instance, from version 5.0-beta1
Dependencies
There has been a significant change in the artifact names of the Spring AI auto-configuration and starter modules. Please refer to the upgrade notes for more information.
For dependency management, we recommend using the Spring AI BOM as explained in the Dependency Management section.
Add these dependencies to your project:
-
For just the Cassandra Vector Store:
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-cassandra-store</artifactId>
</dependency>
-
Or, for everything you need in a RAG application (using the default ONNX Embedding Model):
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-vector-store-cassandra</artifactId>
</dependency>
Configuration Properties
You can use the following properties in your Spring Boot configuration to customize the Apache Cassandra vector store.
Property | Default Value |
---|---|
 | springframework |
 | ai_vector_store |
 | false |
 | |
 | content |
 | embedding |
 | 16 |
Usage
Basic Usage
Create a CassandraVectorStore instance as a Spring Bean:
@Bean
public VectorStore vectorStore(CqlSession session, EmbeddingModel embeddingModel) {
    return CassandraVectorStore.builder(embeddingModel)
        .session(session)
        .keyspace("my_keyspace")
        .table("my_vectors")
        .build();
}
Once you have the vector store instance, you can add documents and perform searches:
// Add documents
vectorStore.add(List.of(
    new Document("1", "content1", Map.of("key1", "value1")),
    new Document("2", "content2", Map.of("key2", "value2"))
));

// Search with filters (builder style, consistent with the Metadata Filtering examples)
List<Document> results = vectorStore.similaritySearch(
    SearchRequest.builder()
        .query("search text")
        .topK(5)
        .similarityThreshold(0.7)
        .filterExpression("metadata.key1 == 'value1'")
        .build()
);
Advanced Configuration
For more complex use cases, you can configure additional settings in your Spring Bean:
@Bean
public VectorStore vectorStore(CqlSession session, EmbeddingModel embeddingModel) {
    return CassandraVectorStore.builder(embeddingModel)
        .session(session)
        .keyspace("my_keyspace")
        .table("my_vectors")
        // Configure primary keys
        .partitionKeys(List.of(
            new SchemaColumn("id", DataTypes.TEXT),
            new SchemaColumn("category", DataTypes.TEXT)
        ))
        .clusteringKeys(List.of(
            new SchemaColumn("timestamp", DataTypes.TIMESTAMP)
        ))
        // Add metadata columns with optional indexing
        .addMetadataColumns(
            new SchemaColumn("category", DataTypes.TEXT, SchemaColumnTags.INDEXED),
            new SchemaColumn("score", DataTypes.DOUBLE)
        )
        // Customize column names
        .contentColumnName("text")
        .embeddingColumnName("vector")
        // Performance tuning
        .fixedThreadPoolExecutorSize(32)
        // Schema management
        .initializeSchema(true)
        // Custom batching strategy
        .batchingStrategy(new TokenCountBatchingStrategy())
        .build();
}
Connection Configuration
There are two ways to configure the connection to Cassandra:
-
Using an injected CqlSession (recommended):
@Bean
public VectorStore vectorStore(CqlSession session, EmbeddingModel embeddingModel) {
    return CassandraVectorStore.builder(embeddingModel)
        .session(session)
        .keyspace("my_keyspace")
        .table("my_vectors")
        .build();
}
-
Using connection details directly in the builder:
@Bean
public VectorStore vectorStore(EmbeddingModel embeddingModel) {
    return CassandraVectorStore.builder(embeddingModel)
        .contactPoint(new InetSocketAddress("localhost", 9042))
        .localDatacenter("datacenter1")
        .keyspace("my_keyspace")
        .build();
}
Metadata Filtering
You can leverage the generic, portable metadata filters with the CassandraVectorStore. For metadata columns to be searchable, they must be either primary keys or SAI indexed. To make a non-primary-key column indexed, configure the metadata column with SchemaColumnTags.INDEXED.
For example, you can use either the text expression language:
vectorStore.similaritySearch(
    SearchRequest.builder().query("The World")
        .topK(5)
        .filterExpression("country in ['UK', 'NL'] && year >= 2020").build());
or programmatically using the expression DSL:
// Create the builder first so its helper methods can be used for the operands
FilterExpressionBuilder b = new FilterExpressionBuilder();
Filter.Expression f = b.and(
        b.in("country", "UK", "NL"),
        b.gte("year", 2020)
    ).build();

vectorStore.similaritySearch(
    SearchRequest.builder().query("The World")
        .topK(5)
        .filterExpression(f).build());
The portable filter expressions get automatically converted into CQL queries.
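As a rough picture of what such a conversion involves, the sketch below walks a tiny expression tree and renders a CQL-style predicate. It is a simplified assumption for illustration, not Spring AI's actual converter: the `Expr`, `Cmp`, `In`, and `And` types are hypothetical stand-ins rather than the `Filter.Expression` classes, and only three node kinds are handled.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical mini expression tree; these are NOT Spring AI's Filter.Expression classes.
final class CqlFilterSketch {

    interface Expr {}
    record Cmp(String column, String op, Object value) implements Expr {}
    record In(String column, List<?> values) implements Expr {}
    record And(Expr left, Expr right) implements Expr {}

    // Quote string literals (doubling embedded single quotes); render other values as-is.
    static String literal(Object value) {
        if (value instanceof String s) {
            return "'" + s.replace("'", "''") + "'";
        }
        return String.valueOf(value);
    }

    // Recursively render the tree as a CQL-style predicate string.
    static String toCql(Expr e) {
        if (e instanceof Cmp c) {
            return c.column() + " " + c.op() + " " + literal(c.value());
        }
        if (e instanceof In in) {
            String joined = in.values().stream()
                .map(CqlFilterSketch::literal)
                .collect(Collectors.joining(", "));
            return in.column() + " IN (" + joined + ")";
        }
        And and = (And) e;
        return toCql(and.left()) + " AND " + toCql(and.right());
    }
}
```

Under these assumptions, the filter from the examples above would render as `country IN ('UK', 'NL') AND year >= 2020`. A real application should bind values as query parameters rather than concatenating literals.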
Advanced Example: Vector Store on top of Wikipedia Dataset
The following example demonstrates how to use the store on an existing schema. Here we use the schema from the https://github.com/datastax-labs/colbert-wikipedia-data project, which comes with the full Wikipedia dataset already vectorized for you.
First, create the schema in the Cassandra database:
wget https://s.apache.org/colbert-wikipedia-schema-cql -O colbert-wikipedia-schema.cql
cqlsh -f colbert-wikipedia-schema.cql
Then configure the store using the builder pattern:
@Bean
public VectorStore vectorStore(CqlSession session, EmbeddingModel embeddingModel) {
    List<SchemaColumn> partitionColumns = List.of(
        new SchemaColumn("wiki", DataTypes.TEXT),
        new SchemaColumn("language", DataTypes.TEXT),
        new SchemaColumn("title", DataTypes.TEXT)
    );
    List<SchemaColumn> clusteringColumns = List.of(
        new SchemaColumn("chunk_no", DataTypes.INT),
        new SchemaColumn("bert_embedding_no", DataTypes.INT)
    );
    List<SchemaColumn> extraColumns = List.of(
        new SchemaColumn("revision", DataTypes.INT),
        new SchemaColumn("id", DataTypes.INT)
    );

    // Builder call matches the other examples, which pass the embedding model to builder(...)
    return CassandraVectorStore.builder(embeddingModel)
        .session(session)
        .keyspace("wikidata")
        .table("articles")
        .partitionKeys(partitionColumns)
        .clusteringKeys(clusteringColumns)
        .contentColumnName("body")
        .embeddingColumnName("all_minilm_l6_v2_embedding")
        .indexName("all_minilm_l6_v2_ann")
        .initializeSchema(false)
        .addMetadataColumns(extraColumns)
        // Primary key values -> document id ("title§¶chunk_no")
        .primaryKeyTranslator((List<Object> primaryKeys) -> {
            if (primaryKeys.isEmpty()) {
                return "test§¶0";
            }
            return String.format("%s§¶%s", primaryKeys.get(2), primaryKeys.get(3));
        })
        // Document id -> primary key values in column order
        .documentIdTranslator((id) -> {
            String[] parts = id.split("§¶");
            String title = parts[0];
            int chunk_no = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
            return List.of("simplewiki", "en", title, chunk_no, 0);
        })
        .build();
}

@Bean
public EmbeddingModel embeddingModel() {
    // default is ONNX all-MiniLM-L6-v2 which is what we want
    return new TransformersEmbeddingModel();
}
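The two translator lambdas in the configuration above form a round trip between a document id of the form `title§¶chunk_no` and the table's five primary key columns. The standalone sketch below reproduces that logic so it can be tried outside Spring. The class and method names are hypothetical, and the separator is written with Unicode escapes only to keep the source file encoding-independent.

```java
import java.util.List;

// Hypothetical standalone copy of the id translation logic from the configuration above.
final class WikiIdTranslation {

    // The "§¶" separator, written as Unicode escapes (U+00A7, U+00B6).
    static final String SEPARATOR = "\u00a7\u00b6";

    // Primary key column order: wiki, language, title, chunk_no, bert_embedding_no.
    static String toDocumentId(List<Object> primaryKeys) {
        if (primaryKeys.isEmpty()) {
            return "test" + SEPARATOR + "0";
        }
        // Only title (index 2) and chunk_no (index 3) are encoded in the id.
        return primaryKeys.get(2) + SEPARATOR + primaryKeys.get(3);
    }

    // Rebuilds the full primary key; wiki, language and bert_embedding_no are fixed values.
    static List<Object> toPrimaryKeys(String documentId) {
        String[] parts = documentId.split(SEPARATOR);
        String title = parts[0];
        int chunkNo = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return List.of("simplewiki", "en", title, chunkNo, 0);
    }
}
```

Because only the title and chunk number are encoded, the mapping is lossless only for rows in the `simplewiki`/`en` partition with `bert_embedding_no` equal to 0, which is the assumption the original lambdas make as well.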
Loading the Complete Wikipedia Dataset
To load the full wikipedia dataset:
- Download simplewiki-sstable.tar from https://s.apache.org/simplewiki-sstable-tar (this will take a while, the file is tens of GBs)
- Load the data:
tar -xf simplewiki-sstable.tar -C ${CASSANDRA_DATA}/data/wikidata/articles-*/
nodetool import wikidata articles ${CASSANDRA_DATA}/data/wikidata/articles-*/
Accessing the Native Client
The Cassandra Vector Store implementation provides access to the underlying native Cassandra client (CqlSession) through the getNativeClient() method:
CassandraVectorStore vectorStore = context.getBean(CassandraVectorStore.class);
Optional<CqlSession> nativeClient = vectorStore.getNativeClient();
if (nativeClient.isPresent()) {
    CqlSession session = nativeClient.get();
    // Use the native client for Cassandra-specific operations
}
The native client gives you access to Cassandra-specific features and operations that might not be exposed through the VectorStore interface.