ETL Pipeline
The Extract, Transform, and Load (ETL) framework serves as the backbone of data processing within the Retrieval Augmented Generation (RAG) use case.

The ETL pipeline orchestrates the flow from raw data sources to a structured vector store, ensuring data is in the optimal format for retrieval by the AI model.

The RAG use case augments the capabilities of generative models by retrieving relevant information from a body of data, enhancing the quality and relevance of the generated output.
API Overview
The ETL pipeline creates, transforms, and stores Document instances.

The Document class contains text, metadata, and optionally additional media types such as images, audio, and video.
There are three main components of the ETL pipeline:

- DocumentReader that implements Supplier<List<Document>>
- DocumentTransformer that implements Function<List<Document>, List<Document>>
- DocumentWriter that implements Consumer<List<Document>>
The Document class content is created from PDFs, text files, and other document types with the help of DocumentReader.

To construct a simple ETL pipeline, you can chain together an instance of each type.

Let’s say we have the following instances of those three ETL types:

- PagePdfDocumentReader, an implementation of DocumentReader
- TokenTextSplitter, an implementation of DocumentTransformer
- VectorStore, an implementation of DocumentWriter
To perform the basic loading of data into a Vector Database for use with the Retrieval Augmented Generation pattern, use the following code in Java function-style syntax:
vectorStore.accept(tokenTextSplitter.apply(pdfReader.get()));
Alternatively, you can use method names that are more naturally expressive for the domain:
vectorStore.write(tokenTextSplitter.split(pdfReader.read()));
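Putting these together, a complete ingestion component might look like the following. This is a minimal sketch, assuming a sample.pdf on the classpath, an auto-configured VectorStore bean, and the single-argument PagePdfDocumentReader constructor with default settings; the class and resource names are illustrative.

@Component
class IngestionPipeline {

    private final VectorStore vectorStore;

    IngestionPipeline(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    void ingest() {
        // Extract: read each PDF page into a Document
        PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample.pdf");
        // Transform: split the pages into token-bounded chunks
        TokenTextSplitter splitter = new TokenTextSplitter();
        // Load: write the chunks into the vector store
        this.vectorStore.write(splitter.split(pdfReader.read()));
    }
}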
ETL Interfaces
The ETL pipeline is composed of the following interfaces and implementations. A detailed ETL class diagram is shown in the ETL Class Diagram section.
DocumentReader
Provides a source of documents from diverse origins.
public interface DocumentReader extends Supplier<List<Document>> {

    default List<Document> read() {
        return get();
    }
}
DocumentTransformer
Transforms a batch of documents as part of the processing workflow.
public interface DocumentTransformer extends Function<List<Document>, List<Document>> {

    default List<Document> transform(List<Document> transform) {
        return apply(transform);
    }
}
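DocumentWriter

Manages the final stage of the ETL process, persisting documents. For completeness, the third interface follows the same pattern as the two above, extending the Consumer<List<Document>> contract listed earlier with a domain-friendly default method; a sketch consistent with the reader and transformer:

public interface DocumentWriter extends Consumer<List<Document>> {

    default void write(List<Document> documents) {
        accept(documents);
    }
}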
DocumentReaders
JSON
The JsonReader processes JSON documents, converting them into a list of Document objects.
Example
@Component
class MyJsonReader {

    private final Resource resource;

    MyJsonReader(@Value("classpath:bikes.json") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadJsonAsDocuments() {
        JsonReader jsonReader = new JsonReader(this.resource, "description", "content");
        return jsonReader.get();
    }
}
Constructor Options
The JsonReader provides several constructor options:
- JsonReader(Resource resource)
- JsonReader(Resource resource, String... jsonKeysToUse)
- JsonReader(Resource resource, JsonMetadataGenerator jsonMetadataGenerator, String... jsonKeysToUse)
Parameters
- resource: A Spring Resource object pointing to the JSON file.
- jsonKeysToUse: An array of keys from the JSON that should be used as the text content in the resulting Document objects.
- jsonMetadataGenerator: An optional JsonMetadataGenerator to create metadata for each Document.
Behavior
The JsonReader processes JSON content as follows:
- It can handle both JSON arrays and single JSON objects.
- For each JSON object (either in an array or a single object):
  - It extracts the content based on the specified jsonKeysToUse.
  - If no keys are specified, it uses the entire JSON object as content.
  - It generates metadata using the provided JsonMetadataGenerator (or an empty one if not provided).
  - It creates a Document object with the extracted content and metadata.
Using JSON Pointers
The JsonReader now supports retrieving specific parts of a JSON document using JSON Pointers. This feature allows you to easily extract nested data from complex JSON structures.
The get(String pointer) method
public List<Document> get(String pointer)
This method allows you to use a JSON Pointer to retrieve a specific part of the JSON document.
Parameters
- pointer: A JSON Pointer string (as defined in RFC 6901) to locate the desired element within the JSON structure.
Return Value
- Returns a List<Document> containing the documents parsed from the JSON element located by the pointer.
Behavior
- The method uses the provided JSON Pointer to navigate to a specific location in the JSON structure.
- If the pointer is valid and points to an existing element:
  - For a JSON object: it returns a list with a single Document.
  - For a JSON array: it returns a list of Documents, one for each element in the array.
- If the pointer is invalid or points to a non-existent element, it throws an IllegalArgumentException.
Example JSON Structure
[
    {
        "id": 1,
        "brand": "Trek",
        "description": "A high-performance mountain bike for trail riding."
    },
    {
        "id": 2,
        "brand": "Cannondale",
        "description": "An aerodynamic road bike for racing enthusiasts."
    }
]
In this example, if the JsonReader is configured with "description" as the jsonKeysToUse, it will create Document objects where the content is the value of the "description" field for each bike in the array.
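To retrieve just one entry, the reader can be combined with a JSON Pointer. A minimal sketch against the structure above, reusing the bikes.json resource from the earlier example:

JsonReader jsonReader = new JsonReader(this.resource, "description");
List<Document> firstBike = jsonReader.get("/0"); // a single Document for the Trek entry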
Notes
- The JsonReader uses Jackson for JSON parsing.
- It can handle large JSON files efficiently by using streaming for arrays.
- If multiple keys are specified in jsonKeysToUse, the content will be a concatenation of the values for those keys.
- The reader is flexible and can be adapted to various JSON structures by customizing the jsonKeysToUse and JsonMetadataGenerator.
Text
The TextReader processes plain text documents, converting them into a list of Document objects.
Example
@Component
class MyTextReader {

    private final Resource resource;

    MyTextReader(@Value("classpath:text-source.txt") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadText() {
        TextReader textReader = new TextReader(this.resource);
        textReader.getCustomMetadata().put("filename", "text-source.txt");
        return textReader.read();
    }
}
Constructor Options
The TextReader provides two constructor options:
- TextReader(String resourceUrl)
- TextReader(Resource resource)
Parameters
- resourceUrl: A string representing the URL of the resource to be read.
- resource: A Spring Resource object pointing to the text file.
Configuration
- setCharset(Charset charset): Sets the character set used for reading the text file. Default is UTF-8.
- getCustomMetadata(): Returns a mutable map where you can add custom metadata for the documents.
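For example, to read a file that is not UTF-8 encoded, a small sketch using the standard StandardCharsets constants:

TextReader textReader = new TextReader(this.resource);
textReader.setCharset(StandardCharsets.ISO_8859_1); // override the UTF-8 default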
Behavior
The TextReader processes text content as follows:
- It reads the entire content of the text file into a single Document object.
- The content of the file becomes the content of the Document.
- Metadata is automatically added to the Document:
  - charset: The character set used to read the file (default: "UTF-8").
  - source: The filename of the source text file.
- Any custom metadata added via getCustomMetadata() is included in the Document.
Notes
- The TextReader reads the entire file content into memory, so it may not be suitable for very large files.
- If you need to split the text into smaller chunks, you can use a text splitter like TokenTextSplitter after reading the document:
List<Document> documents = textReader.get();
List<Document> splitDocuments = new TokenTextSplitter().apply(documents);
- The reader uses Spring’s Resource abstraction, allowing it to read from various sources (classpath, file system, URL, etc.).
- Custom metadata can be added to all documents created by the reader using the getCustomMetadata() method.
HTML (JSoup)
The JsoupDocumentReader processes HTML documents, converting them into a list of Document objects using the JSoup library.
Example
@Component
class MyHtmlReader {

    private final Resource resource;

    MyHtmlReader(@Value("classpath:/my-page.html") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadHtml() {
        JsoupDocumentReaderConfig config = JsoupDocumentReaderConfig.builder()
            .selector("article p")                        // Extract paragraphs within <article> tags
            .charset("ISO-8859-1")                        // Use ISO-8859-1 encoding
            .includeLinkUrls(true)                        // Include link URLs in metadata
            .metadataTags(List.of("author", "date"))      // Extract author and date meta tags
            .additionalMetadata("source", "my-page.html") // Add custom metadata
            .build();
        JsoupDocumentReader reader = new JsoupDocumentReader(this.resource, config);
        return reader.get();
    }
}
The JsoupDocumentReaderConfig allows you to customize the behavior of the JsoupDocumentReader:
- charset: Specifies the character encoding of the HTML document (defaults to "UTF-8").
- selector: A JSoup CSS selector to specify which elements to extract text from (defaults to "body").
- separator: The string used to join text from multiple selected elements (defaults to "\n").
- allElements: If true, extracts all text from the <body> element, ignoring the selector (defaults to false).
- groupByElement: If true, creates a separate Document for each element matched by the selector (defaults to false).
- includeLinkUrls: If true, extracts absolute link URLs and adds them to the metadata (defaults to false).
- metadataTags: A list of <meta> tag names to extract content from (defaults to ["description", "keywords"]).
- additionalMetadata: Allows you to add custom metadata to all created Document objects.
Sample Document: my-page.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>My Web Page</title>
<meta name="description" content="A sample web page for Spring AI">
<meta name="keywords" content="spring, ai, html, example">
<meta name="author" content="John Doe">
<meta name="date" content="2024-01-15">
<link rel="stylesheet" href="style.css">
</head>
<body>
<header>
<h1>Welcome to My Page</h1>
</header>
<nav>
<ul>
<li><a href="/">Home</a></li>
<li><a href="/about">About</a></li>
</ul>
</nav>
<article>
<h2>Main Content</h2>
<p>This is the main content of my web page.</p>
<p>It contains multiple paragraphs.</p>
<a href="https://www.example.com">External Link</a>
</article>
<footer>
<p>© 2024 John Doe</p>
</footer>
</body>
</html>
Behavior

The JsoupDocumentReader processes the HTML content and creates Document objects based on the configuration:
- The selector determines which elements are used for text extraction.
- If allElements is true, all text within the <body> is extracted into a single Document.
- If groupByElement is true, each element matching the selector creates a separate Document.
- If neither allElements nor groupByElement is true, text from all elements matching the selector is joined using the separator.
- The document title, content from specified <meta> tags, and (optionally) link URLs are added to the Document metadata.
- The base URI, for resolving relative links, will be extracted from URL resources.
The reader preserves the text content of the selected elements, but removes any HTML tags within them.
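As an illustration, applying the example configuration above to my-page.html selects the two paragraphs inside <article>; a hedged sketch of the expected result:

List<Document> docs = reader.get();
// With selector "article p" and the default "\n" separator, the single
// resulting Document should contain roughly:
// "This is the main content of my web page.\nIt contains multiple paragraphs."
System.out.println(docs.get(0).getContent());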
Markdown
The MarkdownDocumentReader processes Markdown documents, converting them into a list of Document objects.
Example
@Component
class MyMarkdownReader {

    private final Resource resource;

    MyMarkdownReader(@Value("classpath:code.md") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadMarkdown() {
        MarkdownDocumentReaderConfig config = MarkdownDocumentReaderConfig.builder()
            .withHorizontalRuleCreateDocument(true)
            .withIncludeCodeBlock(false)
            .withIncludeBlockquote(false)
            .withAdditionalMetadata("filename", "code.md")
            .build();
        MarkdownDocumentReader reader = new MarkdownDocumentReader(this.resource, config);
        return reader.get();
    }
}
The MarkdownDocumentReaderConfig allows you to customize the behavior of the MarkdownDocumentReader:
- horizontalRuleCreateDocument: When set to true, horizontal rules in the Markdown will create new Document objects.
- includeCodeBlock: When set to true, code blocks will be included in the same Document as the surrounding text. When false, code blocks create separate Document objects.
- includeBlockquote: When set to true, blockquotes will be included in the same Document as the surrounding text. When false, blockquotes create separate Document objects.
- additionalMetadata: Allows you to add custom metadata to all created Document objects.
Sample Document: code.md
This is a Java sample application:
```java
package com.example.demo;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
@SpringBootApplication
public class DemoApplication {
public static void main(String[] args) {
SpringApplication.run(DemoApplication.class, args);
}
}
```
Markdown also provides the possibility to `use inline code formatting throughout` the entire sentence.
---
Another possibility is to set block code without specific highlighting:
```
./mvnw spring-javaformat:apply
```
Behavior

The MarkdownDocumentReader processes the Markdown content and creates Document objects based on the configuration:
- Headers become metadata in the Document objects.
- Paragraphs become the content of Document objects.
- Code blocks can be separated into their own Document objects or included with surrounding text.
- Blockquotes can be separated into their own Document objects or included with surrounding text.
- Horizontal rules can be used to split the content into separate Document objects.
The reader preserves formatting like inline code, lists, and text styling within the content of the Document objects.
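For instance, with the configuration from the example above (horizontal rules create new documents, code blocks are split out), the sample code.md would be expected to produce several Document objects; a hedged sketch:

List<Document> docs = reader.get();
// Expected grouping (exact counts depend on the parser version):
// - the introductory paragraph, with the Java code block as its own Document
// - the inline-code sentence before the horizontal rule
// - the text after the horizontal rule, with the plain code block split out
docs.forEach(doc -> System.out.println(doc.getContent()));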
PDF Page
The PagePdfDocumentReader uses the Apache PdfBox library to parse PDF documents.
Add the dependency to your project using Maven or Gradle.
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>
or to your Gradle build.gradle build file.
dependencies {
    implementation 'org.springframework.ai:spring-ai-pdf-document-reader'
}
Example
@Component
public class MyPagePdfDocumentReader {

    List<Document> getDocsFromPdf() {
        PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample1.pdf",
            PdfDocumentReaderConfig.builder()
                .withPageTopMargin(0)
                .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                    .withNumberOfTopTextLinesToDelete(0)
                    .build())
                .withPagesPerDocument(1)
                .build());
        return pdfReader.read();
    }
}
PDF Paragraph
The ParagraphPdfDocumentReader uses the PDF catalog (e.g., TOC) information to split the input PDF into text paragraphs and output a single Document per paragraph.

NOTE: Not all PDF documents contain the PDF catalog.
Dependencies
Add the dependency to your project using Maven or Gradle.
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>
or to your Gradle build.gradle build file.
dependencies {
    implementation 'org.springframework.ai:spring-ai-pdf-document-reader'
}
Example
@Component
public class MyPagePdfDocumentReader {

    List<Document> getDocsFromPdfWithCatalog() {
        ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader("classpath:/sample1.pdf",
            PdfDocumentReaderConfig.builder()
                .withPageTopMargin(0)
                .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                    .withNumberOfTopTextLinesToDelete(0)
                    .build())
                .withPagesPerDocument(1)
                .build());
        return pdfReader.read();
    }
}
Tika (DOCX, PPTX, HTML…)
The TikaDocumentReader uses Apache Tika to extract text from a variety of document formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of supported formats, refer to the Tika documentation.
Dependencies
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
</dependency>
or to your Gradle build.gradle build file.
dependencies {
    implementation 'org.springframework.ai:spring-ai-tika-document-reader'
}
Example
@Component
class MyTikaDocumentReader {

    private final Resource resource;

    MyTikaDocumentReader(@Value("classpath:/word-sample.docx") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadText() {
        TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(this.resource);
        return tikaDocumentReader.read();
    }
}
Transformers
TextSplitter
The TextSplitter is an abstract base class that helps divide documents to fit the AI model’s context window.
TokenTextSplitter
The TokenTextSplitter is an implementation of TextSplitter that splits text into chunks based on token count, using the CL100K_BASE encoding.
Usage
@Component
class MyTokenTextSplitter {

    public List<Document> splitDocuments(List<Document> documents) {
        TokenTextSplitter splitter = new TokenTextSplitter();
        return splitter.apply(documents);
    }

    public List<Document> splitCustomized(List<Document> documents) {
        TokenTextSplitter splitter = new TokenTextSplitter(1000, 400, 10, 5000, true);
        return splitter.apply(documents);
    }
}
Constructor Options
The TokenTextSplitter provides two constructor options:
- TokenTextSplitter(): Creates a splitter with default settings.
- TokenTextSplitter(int defaultChunkSize, int minChunkSizeChars, int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator)
Parameters
- defaultChunkSize: The target size of each text chunk in tokens (default: 800).
- minChunkSizeChars: The minimum size of each text chunk in characters (default: 350).
- minChunkLengthToEmbed: The minimum length of a chunk to be included (default: 5).
- maxNumChunks: The maximum number of chunks to generate from a text (default: 10000).
- keepSeparator: Whether to keep separators (like newlines) in the chunks (default: true).
Behavior
The TokenTextSplitter processes text content as follows:
- It encodes the input text into tokens using the CL100K_BASE encoding.
- It splits the encoded text into chunks based on the defaultChunkSize.
- For each chunk:
  - It decodes the chunk back into text.
  - It attempts to find a suitable break point (period, question mark, exclamation mark, or newline) after the minChunkSizeChars.
  - If a break point is found, it truncates the chunk at that point.
  - It trims the chunk and optionally removes newline characters based on the keepSeparator setting.
  - If the resulting chunk is longer than minChunkLengthToEmbed, it’s added to the output.
- This process continues until all tokens are processed or maxNumChunks is reached.
- Any remaining text is added as a final chunk if it’s longer than minChunkLengthToEmbed.
Example
Document doc1 = new Document("This is a long piece of text that needs to be split into smaller chunks for processing.",
        Map.of("source", "example.txt"));
Document doc2 = new Document("Another document with content that will be split based on token count.",
        Map.of("source", "example2.txt"));

TokenTextSplitter splitter = new TokenTextSplitter();
List<Document> splitDocuments = splitter.apply(List.of(doc1, doc2));

for (Document doc : splitDocuments) {
    System.out.println("Chunk: " + doc.getContent());
    System.out.println("Metadata: " + doc.getMetadata());
}
Notes
- The TokenTextSplitter uses the CL100K_BASE encoding from the jtokkit library, which is compatible with newer OpenAI models.
- The splitter attempts to create semantically meaningful chunks by breaking at sentence boundaries where possible.
- Metadata from the original documents is preserved and copied to all chunks derived from that document.
- The content formatter (if set) from the original document is also copied to the derived chunks if copyContentFormatter is set to true (default behavior).
- This splitter is particularly useful for preparing text for large language models that have token limits, ensuring that each chunk is within the model’s processing capacity.
KeywordMetadataEnricher
The KeywordMetadataEnricher is a DocumentTransformer that uses a generative AI model to extract keywords from document content and add them as metadata.
Usage
@Component
class MyKeywordEnricher {

    private final ChatModel chatModel;

    MyKeywordEnricher(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    List<Document> enrichDocuments(List<Document> documents) {
        KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(this.chatModel, 5);
        return enricher.apply(documents);
    }
}
Constructor
The KeywordMetadataEnricher constructor takes two parameters:
- ChatModel chatModel: The AI model used for generating keywords.
- int keywordCount: The number of keywords to extract for each document.
Behavior
The KeywordMetadataEnricher processes documents as follows:
- For each input document, it creates a prompt using the document’s content.
- It sends this prompt to the provided ChatModel to generate keywords.
- The generated keywords are added to the document’s metadata under the key "excerpt_keywords".
- The enriched documents are returned.
Customization
The keyword extraction prompt can be customized by modifying the KEYWORDS_TEMPLATE constant in the class. The default template is:
{context_str}. Give %s unique keywords for this document. Format as comma separated. Keywords:
Where {context_str} is replaced with the document content, and %s is replaced with the specified keyword count.
Example
ChatModel chatModel = // initialize your chat model
KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(chatModel, 5);

Document doc = new Document("This is a document about artificial intelligence and its applications in modern technology.");

List<Document> enrichedDocs = enricher.apply(List.of(doc));

Document enrichedDoc = enrichedDocs.get(0);
String keywords = (String) enrichedDoc.getMetadata().get("excerpt_keywords");
System.out.println("Extracted keywords: " + keywords);
Notes
- The KeywordMetadataEnricher requires a functioning ChatModel to generate keywords.
- The keyword count must be 1 or greater.
- The enricher adds the "excerpt_keywords" metadata field to each processed document.
- The generated keywords are returned as a comma-separated string.
- This enricher is particularly useful for improving document searchability and for generating tags or categories for documents.
SummaryMetadataEnricher
The SummaryMetadataEnricher is a DocumentTransformer that uses a generative AI model to create summaries for documents and add them as metadata. It can generate summaries for the current document, as well as adjacent documents (previous and next).
Usage
@Configuration
class EnricherConfig {

    @Bean
    public SummaryMetadataEnricher summaryMetadata(OpenAiChatModel aiClient) {
        return new SummaryMetadataEnricher(aiClient,
            List.of(SummaryType.PREVIOUS, SummaryType.CURRENT, SummaryType.NEXT));
    }
}

@Component
class MySummaryEnricher {

    private final SummaryMetadataEnricher enricher;

    MySummaryEnricher(SummaryMetadataEnricher enricher) {
        this.enricher = enricher;
    }

    List<Document> enrichDocuments(List<Document> documents) {
        return this.enricher.apply(documents);
    }
}
Constructor
The SummaryMetadataEnricher provides two constructors:
- SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes)
- SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes, String summaryTemplate, MetadataMode metadataMode)
Parameters
- chatModel: The AI model used for generating summaries.
- summaryTypes: A list of SummaryType enum values indicating which summaries to generate (PREVIOUS, CURRENT, NEXT).
- summaryTemplate: A custom template for summary generation (optional).
- metadataMode: Specifies how to handle document metadata when generating summaries (optional).
Behavior
The SummaryMetadataEnricher processes documents as follows:
- For each input document, it creates a prompt using the document’s content and the specified summary template.
- It sends this prompt to the provided ChatModel to generate a summary.
- Depending on the specified summaryTypes, it adds the following metadata to each document:
  - section_summary: Summary of the current document.
  - prev_section_summary: Summary of the previous document (if available and requested).
  - next_section_summary: Summary of the next document (if available and requested).
- The enriched documents are returned.
Customization
The summary generation prompt can be customized by providing a custom summaryTemplate. The default template is:
"""
Here is the content of the section:
{context_str}
Summarize the key topics and entities of the section.
Summary:
"""
Example
ChatModel chatModel = // initialize your chat model
SummaryMetadataEnricher enricher = new SummaryMetadataEnricher(chatModel,
    List.of(SummaryType.PREVIOUS, SummaryType.CURRENT, SummaryType.NEXT));

Document doc1 = new Document("Content of document 1");
Document doc2 = new Document("Content of document 2");

List<Document> enrichedDocs = enricher.apply(List.of(doc1, doc2));

// Check the metadata of the enriched documents
for (Document doc : enrichedDocs) {
    System.out.println("Current summary: " + doc.getMetadata().get("section_summary"));
    System.out.println("Previous summary: " + doc.getMetadata().get("prev_section_summary"));
    System.out.println("Next summary: " + doc.getMetadata().get("next_section_summary"));
}
The provided example demonstrates the expected behavior:
- For a list of two documents, both documents receive a section_summary.
- The first document receives a next_section_summary but no prev_section_summary.
- The second document receives a prev_section_summary but no next_section_summary.
- The section_summary of the first document matches the prev_section_summary of the second document.
- The next_section_summary of the first document matches the section_summary of the second document.
Notes
- The SummaryMetadataEnricher requires a functioning ChatModel to generate summaries.
- The enricher can handle document lists of any size, properly handling edge cases for the first and last documents.
- This enricher is particularly useful for creating context-aware summaries, allowing for better understanding of document relationships in a sequence.
- The MetadataMode parameter allows control over how existing metadata is incorporated into the summary generation process.
Writers
File
The FileDocumentWriter is a DocumentWriter implementation that writes the content of a list of Document objects into a file.
Usage
@Component
class MyDocumentWriter {

    public void writeDocuments(List<Document> documents) {
        FileDocumentWriter writer = new FileDocumentWriter("output.txt", true, MetadataMode.ALL, false);
        writer.accept(documents);
    }
}
Constructors
The FileDocumentWriter provides three constructors:
- FileDocumentWriter(String fileName)
- FileDocumentWriter(String fileName, boolean withDocumentMarkers)
- FileDocumentWriter(String fileName, boolean withDocumentMarkers, MetadataMode metadataMode, boolean append)
Parameters
- fileName: The name of the file to write the documents to.
- withDocumentMarkers: Whether to include document markers in the output (default: false).
- metadataMode: Specifies what document content is written to the file (default: MetadataMode.NONE).
- append: If true, data will be written to the end of the file rather than the beginning (default: false).
Behavior
The FileDocumentWriter processes documents as follows:
- It opens a FileWriter for the specified file name.
- For each document in the input list:
  - If withDocumentMarkers is true, it writes a document marker including the document index and page numbers.
  - It writes the formatted content of the document based on the specified metadataMode.
- The file is closed after all documents have been written.
Document Markers
When withDocumentMarkers is set to true, the writer includes markers for each document in the following format:
### Doc: [index], pages:[start_page_number,end_page_number]
Metadata Handling
The writer uses two specific metadata keys:
- page_number: Represents the starting page number of the document.
- end_page_number: Represents the ending page number of the document.
These are used when writing document markers.
Example
List<Document> documents = // initialize your documents
FileDocumentWriter writer = new FileDocumentWriter("output.txt", true, MetadataMode.ALL, true);
writer.accept(documents);
This will write all documents to "output.txt", including document markers, using all available metadata, and appending to the file if it already exists.
Notes
- The writer uses FileWriter, so it writes text files with the default character encoding of the operating system.
- If an error occurs during writing, a RuntimeException is thrown with the original exception as its cause.
- The metadataMode parameter allows control over how existing metadata is incorporated into the written content.
- This writer is particularly useful for debugging or creating human-readable outputs of document collections.
VectorStore
Provides integration with various vector stores. See the Vector DB Documentation for a full listing.
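Because VectorStore implements DocumentWriter, any configured vector store can terminate the pipeline directly. A minimal sketch, assuming an auto-configured VectorStore bean and the textReader from the Text section above:

List<Document> documents = new TokenTextSplitter().split(textReader.read());
vectorStore.accept(documents); // Consumer-style, as in the API overview
// or, equivalently, the domain-style method:
vectorStore.write(documents);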