ETL Pipeline

The Extract, Transform, and Load (ETL) framework serves as the backbone of data processing within the Retrieval Augmented Generation (RAG) use case.

The ETL pipeline orchestrates the flow of data from raw data sources to a structured vector store, ensuring data is in the optimal format for retrieval by the AI model.

The RAG use case augments the capabilities of generative models by retrieving relevant information from a body of data to enhance the quality and relevance of the generated output.

API Overview

The ETL pipeline creates, transforms, and stores Document instances.

[Image: Spring AI Document API]

The Document class contains text, metadata and optionally additional media types like images, audio and video.

There are three main components of the ETL pipeline:

  • DocumentReader that implements Supplier<List<Document>>

  • DocumentTransformer that implements Function<List<Document>, List<Document>>

  • DocumentWriter that implements Consumer<List<Document>>

The Document class content is created from PDFs, text files and other document types with the help of DocumentReader.

To construct a simple ETL pipeline, you can chain together an instance of each type.

[Image: ETL pipeline]

Let’s say we have the following instances of those three ETL types:

  • PagePdfDocumentReader an implementation of DocumentReader

  • TokenTextSplitter an implementation of DocumentTransformer

  • VectorStore an implementation of DocumentWriter

To perform the basic loading of data into a Vector Database for use with the Retrieval Augmented Generation pattern, use the following code in Java function-style syntax.

vectorStore.accept(tokenTextSplitter.apply(pdfReader.get()));

Alternatively, you can use method names that are more naturally expressive for the domain:

vectorStore.write(tokenTextSplitter.split(pdfReader.read()));
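
For example, a complete pipeline wired as a Spring component might look like the following sketch. It assumes a configured VectorStore bean and an illustrative sample.pdf on the classpath:

@Component
class MyEtlPipeline {

    private final VectorStore vectorStore;

    MyEtlPipeline(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    void ingest() {
        // Extract: read each page of the PDF into a Document
        PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample.pdf");

        // Transform: split the pages into token-sized chunks
        TokenTextSplitter splitter = new TokenTextSplitter();

        // Load: write the chunks into the vector store
        this.vectorStore.write(splitter.split(pdfReader.read()));
    }
}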

ETL Interfaces

The ETL pipeline is composed of the following interfaces and implementations. A detailed class diagram is shown in the ETL Class Diagram section.

DocumentReader

Provides a source of documents from diverse origins.

public interface DocumentReader extends Supplier<List<Document>> {

    default List<Document> read() {
        return get();
    }
}
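
A custom reader only needs to implement get(). As a purely hypothetical sketch, a reader that wraps in-memory strings could look like:

class InMemoryDocumentReader implements DocumentReader {

    private final List<String> texts;

    InMemoryDocumentReader(List<String> texts) {
        this.texts = texts;
    }

    @Override
    public List<Document> get() {
        // Wrap each string in a Document so it can enter the ETL pipeline
        return this.texts.stream().map(Document::new).toList();
    }
}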

DocumentTransformer

Transforms a batch of documents as part of the processing workflow.

public interface DocumentTransformer extends Function<List<Document>, List<Document>> {

    default List<Document> transform(List<Document> transform) {
        return apply(transform);
    }
}
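
For instance, a hypothetical transformer that stamps each document with extra metadata could be sketched as:

class IngestionTimestampTransformer implements DocumentTransformer {

    @Override
    public List<Document> apply(List<Document> documents) {
        // Record when each document passed through the pipeline
        documents.forEach(doc -> doc.getMetadata().put("ingested_at", java.time.Instant.now().toString()));
        return documents;
    }
}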

DocumentWriter

Manages the final stage of the ETL process, preparing documents for storage.

public interface DocumentWriter extends Consumer<List<Document>> {

    default void write(List<Document> documents) {
        accept(documents);
    }
}
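
A minimal custom writer, sketched here as a hypothetical logger for debugging a pipeline, could look like:

class LoggingDocumentWriter implements DocumentWriter {

    @Override
    public void accept(List<Document> documents) {
        // Print instead of persisting; handy while developing a pipeline
        documents.forEach(doc -> System.out.println("Writing document with metadata: " + doc.getMetadata()));
    }
}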

ETL Class Diagram

The following class diagram illustrates the ETL interfaces and implementations.

[Image: ETL class diagram]

DocumentReaders

JSON

The JsonReader processes JSON documents, converting them into a list of Document objects.

Example

@Component
class MyJsonReader {

    private final Resource resource;

    MyJsonReader(@Value("classpath:bikes.json") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadJsonAsDocuments() {
        JsonReader jsonReader = new JsonReader(this.resource, "description", "content");
        return jsonReader.get();
    }
}

Constructor Options

The JsonReader provides several constructor options:

  1. JsonReader(Resource resource)

  2. JsonReader(Resource resource, String... jsonKeysToUse)

  3. JsonReader(Resource resource, JsonMetadataGenerator jsonMetadataGenerator, String... jsonKeysToUse)

Parameters

  • resource: A Spring Resource object pointing to the JSON file.

  • jsonKeysToUse: An array of keys from the JSON that should be used as the text content in the resulting Document objects.

  • jsonMetadataGenerator: An optional JsonMetadataGenerator to create metadata for each Document.

Behavior

The JsonReader processes JSON content as follows:

  • It can handle both JSON arrays and single JSON objects.

  • For each JSON object (either in an array or a single object):

    • It extracts the content based on the specified jsonKeysToUse.

    • If no keys are specified, it uses the entire JSON object as content.

    • It generates metadata using the provided JsonMetadataGenerator (or an empty one if not provided).

    • It creates a Document object with the extracted content and metadata.

Using JSON Pointers

The JsonReader now supports retrieving specific parts of a JSON document using JSON Pointers. This feature allows you to easily extract nested data from complex JSON structures.

The get(String pointer) method
public List<Document> get(String pointer)

This method allows you to use a JSON Pointer to retrieve a specific part of the JSON document.

Parameters
  • pointer: A JSON Pointer string (as defined in RFC 6901) to locate the desired element within the JSON structure.

Return Value
  • Returns a List<Document> containing the documents parsed from the JSON element located by the pointer.

Behavior
  • The method uses the provided JSON Pointer to navigate to a specific location in the JSON structure.

  • If the pointer is valid and points to an existing element:

    • For a JSON object: it returns a list with a single Document.

    • For a JSON array: it returns a list of Documents, one for each element in the array.

  • If the pointer is invalid or points to a non-existent element, it throws an IllegalArgumentException.

Example
JsonReader jsonReader = new JsonReader(resource, "description");
List<Document> documents = jsonReader.get("/store/books/0");

Example JSON Structure

[
  {
    "id": 1,
    "brand": "Trek",
    "description": "A high-performance mountain bike for trail riding."
  },
  {
    "id": 2,
    "brand": "Cannondale",
    "description": "An aerodynamic road bike for racing enthusiasts."
  }
]

In this example, if the JsonReader is configured with "description" as the jsonKeysToUse, it will create Document objects where the content is the value of the "description" field for each bike in the array.

Notes

  • The JsonReader uses Jackson for JSON parsing.

  • It can handle large JSON files efficiently by using streaming for arrays.

  • If multiple keys are specified in jsonKeysToUse, the content will be a concatenation of the values for those keys.

  • The reader is flexible and can be adapted to various JSON structures by customizing the jsonKeysToUse and JsonMetadataGenerator.

Text

The TextReader processes plain text documents, converting them into a list of Document objects.

Example

@Component
class MyTextReader {

    private final Resource resource;

    MyTextReader(@Value("classpath:text-source.txt") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadText() {
        TextReader textReader = new TextReader(this.resource);
        textReader.getCustomMetadata().put("filename", "text-source.txt");

        return textReader.read();
    }
}

Constructor Options

The TextReader provides two constructor options:

  1. TextReader(String resourceUrl)

  2. TextReader(Resource resource)

Parameters

  • resourceUrl: A string representing the URL of the resource to be read.

  • resource: A Spring Resource object pointing to the text file.

Configuration

  • setCharset(Charset charset): Sets the character set used for reading the text file. Default is UTF-8.

  • getCustomMetadata(): Returns a mutable map where you can add custom metadata for the documents.

Behavior

The TextReader processes text content as follows:

  • It reads the entire content of the text file into a single Document object.

  • The content of the file becomes the content of the Document.

  • Metadata is automatically added to the Document:

    • charset: The character set used to read the file (default: "UTF-8").

    • source: The filename of the source text file.

  • Any custom metadata added via getCustomMetadata() is included in the Document.

Notes

  • The TextReader reads the entire file content into memory, so it may not be suitable for very large files.

  • If you need to split the text into smaller chunks, you can use a text splitter like TokenTextSplitter after reading the document:

List<Document> documents = textReader.get();
List<Document> splitDocuments = new TokenTextSplitter().apply(documents);

  • The reader uses Spring’s Resource abstraction, allowing it to read from various sources (classpath, file system, URL, etc.).

  • Custom metadata can be added to all documents created by the reader using the getCustomMetadata() method.

HTML (JSoup)

The JsoupDocumentReader processes HTML documents, converting them into a list of Document objects using the JSoup library.

Example

@Component
class MyHtmlReader {

    private final Resource resource;

    MyHtmlReader(@Value("classpath:/my-page.html") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadHtml() {
        JsoupDocumentReaderConfig config = JsoupDocumentReaderConfig.builder()
            .selector("article p") // Extract paragraphs within <article> tags
            .charset("ISO-8859-1")  // Use ISO-8859-1 encoding
            .includeLinkUrls(true) // Include link URLs in metadata
            .metadataTags(List.of("author", "date")) // Extract author and date meta tags
            .additionalMetadata("source", "my-page.html") // Add custom metadata
            .build();

        JsoupDocumentReader reader = new JsoupDocumentReader(this.resource, config);
        return reader.get();
    }
}

The JsoupDocumentReaderConfig allows you to customize the behavior of the JsoupDocumentReader:

  • charset: Specifies the character encoding of the HTML document (defaults to "UTF-8").

  • selector: A JSoup CSS selector to specify which elements to extract text from (defaults to "body").

  • separator: The string used to join text from multiple selected elements (defaults to "\n").

  • allElements: If true, extracts all text from the <body> element, ignoring the selector (defaults to false).

  • groupByElement: If true, creates a separate Document for each element matched by the selector (defaults to false).

  • includeLinkUrls: If true, extracts absolute link URLs and adds them to the metadata (defaults to false).

  • metadataTags: A list of <meta> tag names to extract content from (defaults to ["description", "keywords"]).

  • additionalMetadata: Allows you to add custom metadata to all created Document objects.

Sample Document: my-page.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>My Web Page</title>
    <meta name="description" content="A sample web page for Spring AI">
    <meta name="keywords" content="spring, ai, html, example">
    <meta name="author" content="John Doe">
    <meta name="date" content="2024-01-15">
    <link rel="stylesheet" href="style.css">
</head>
<body>
    <header>
        <h1>Welcome to My Page</h1>
    </header>
    <nav>
        <ul>
            <li><a href="/">Home</a></li>
            <li><a href="/about">About</a></li>
        </ul>
    </nav>
    <article>
        <h2>Main Content</h2>
        <p>This is the main content of my web page.</p>
        <p>It contains multiple paragraphs.</p>
        <a href="https://www.example.com">External Link</a>
    </article>
    <footer>
        <p>© 2024 John Doe</p>
    </footer>
</body>
</html>

Behavior:

The JsoupDocumentReader processes the HTML content and creates Document objects based on the configuration:

  • The selector determines which elements are used for text extraction.

  • 如果 allElementstrue ,则 &lt;body&gt; 中的所有文本都将提取到一个 Document 中。

  • If allElements is true, all text within the <body> is extracted into a single Document.

  • 如果 groupByElementtrue ,则每个与 selector 匹配的元素都会创建一个单独的 Document

  • If groupByElement is true, each element matching the selector creates a separate Document.

  • 如果 allElementsgroupByElement 都不是 true ,则将所有与 selector 匹配的元素的文本使用 separator 连接起来。

  • If neither allElements nor groupByElement is true, text from all elements matching the selector is joined using the separator.

  • The document title, content from specified <meta> tags, and (optionally) link URLs are added to the Document metadata.

  • The base URI, for resolving relative links, will be extracted from URL resources.

The reader preserves the text content of the selected elements, but removes any HTML tags within them.

Markdown

The MarkdownDocumentReader processes Markdown documents, converting them into a list of Document objects.

Example

@Component
class MyMarkdownReader {

    private final Resource resource;

    MyMarkdownReader(@Value("classpath:code.md") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadMarkdown() {
        MarkdownDocumentReaderConfig config = MarkdownDocumentReaderConfig.builder()
            .withHorizontalRuleCreateDocument(true)
            .withIncludeCodeBlock(false)
            .withIncludeBlockquote(false)
            .withAdditionalMetadata("filename", "code.md")
            .build();

        MarkdownDocumentReader reader = new MarkdownDocumentReader(this.resource, config);
        return reader.get();
    }
}

The MarkdownDocumentReaderConfig allows you to customize the behavior of the MarkdownDocumentReader:

  • horizontalRuleCreateDocument: When set to true, horizontal rules in the Markdown will create new Document objects.

  • includeCodeBlock: When set to true, code blocks will be included in the same Document as the surrounding text. When false, code blocks create separate Document objects.

  • includeBlockquote: When set to true, blockquotes will be included in the same Document as the surrounding text. When false, blockquotes create separate Document objects.

  • additionalMetadata: Allows you to add custom metadata to all created Document objects.

Sample Document: code.md

This is a Java sample application:

```java
package com.example.demo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}
```

Markdown also provides the possibility to `use inline code formatting throughout` the entire sentence.

---

Another possibility is to set block code without specific highlighting:

```
./mvnw spring-javaformat:apply
```

Behavior: The MarkdownDocumentReader processes the Markdown content and creates Document objects based on the configuration:

  • Headers become metadata in the Document objects.

  • Paragraphs become the content of Document objects.

  • Code blocks can be separated into their own Document objects or included with surrounding text.

  • Blockquotes can be separated into their own Document objects or included with surrounding text.

  • Horizontal rules can be used to split the content into separate Document objects.

The reader preserves formatting like inline code, lists, and text styling within the content of the Document objects.

PDF Page

The PagePdfDocumentReader uses the Apache PdfBox library to parse PDF documents.

Add the dependency to your project using Maven or Gradle.

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>

or to your Gradle build.gradle build file.

dependencies {
    implementation 'org.springframework.ai:spring-ai-pdf-document-reader'
}

Example

@Component
public class MyPagePdfDocumentReader {

    List<Document> getDocsFromPdf() {

        PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample1.pdf",
                PdfDocumentReaderConfig.builder()
                    .withPageTopMargin(0)
                    .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                        .withNumberOfTopTextLinesToDelete(0)
                        .build())
                    .withPagesPerDocument(1)
                    .build());

        return pdfReader.read();
    }

}

PDF Paragraph

The ParagraphPdfDocumentReader uses the PDF catalog (e.g. TOC) information to split the input PDF into text paragraphs and output a single Document per paragraph. NOTE: Not all PDF documents contain the PDF catalog.

Dependencies

Add the dependency to your project using Maven or Gradle.

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>

or to your Gradle build.gradle build file.

dependencies {
    implementation 'org.springframework.ai:spring-ai-pdf-document-reader'
}

Example

@Component
public class MyPagePdfDocumentReader {

    List<Document> getDocsFromPdfWithCatalog() {

        ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader("classpath:/sample1.pdf",
                PdfDocumentReaderConfig.builder()
                    .withPageTopMargin(0)
                    .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                        .withNumberOfTopTextLinesToDelete(0)
                        .build())
                    .withPagesPerDocument(1)
                    .build());

        return pdfReader.read();
    }
}

Tika (DOCX, PPTX, HTML...)

The TikaDocumentReader uses Apache Tika to extract text from a variety of document formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of supported formats, refer to the Tika documentation.

Dependencies

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
</dependency>

or to your Gradle build.gradle build file.

dependencies {
    implementation 'org.springframework.ai:spring-ai-tika-document-reader'
}

Example

@Component
class MyTikaDocumentReader {

    private final Resource resource;

    MyTikaDocumentReader(@Value("classpath:/word-sample.docx")
                            Resource resource) {
        this.resource = resource;
    }

    List<Document> loadText() {
        TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(this.resource);
        return tikaDocumentReader.read();
    }
}

Transformers

TextSplitter

The TextSplitter is an abstract base class that helps divide documents to fit the AI model’s context window.
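
Concrete splitters subclass it and define how one text becomes many. A minimal hypothetical sketch, assuming the single abstract splitText(String) method that implementations such as TokenTextSplitter provide:

class ParagraphTextSplitter extends TextSplitter {

    @Override
    protected List<String> splitText(String text) {
        // Treat blank lines as chunk boundaries so each paragraph becomes a chunk
        return List.of(text.split("\n\n"));
    }
}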

TokenTextSplitter

TokenTextSplitterTextSplitter 的实现,它使用 CL100K_BASE 编码根据令牌计数将文本拆分为块。

The TokenTextSplitter is an implementation of TextSplitter that splits text into chunks based on token count, using the CL100K_BASE encoding.

Usage

@Component
class MyTokenTextSplitter {

    public List<Document> splitDocuments(List<Document> documents) {
        TokenTextSplitter splitter = new TokenTextSplitter();
        return splitter.apply(documents);
    }

    public List<Document> splitCustomized(List<Document> documents) {
        TokenTextSplitter splitter = new TokenTextSplitter(1000, 400, 10, 5000, true);
        return splitter.apply(documents);
    }
}

Constructor Options

The TokenTextSplitter provides two constructor options:

  1. TokenTextSplitter(): Creates a splitter with default settings.

  2. TokenTextSplitter(int defaultChunkSize, int minChunkSizeChars, int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator)

Parameters

  • defaultChunkSize: The target size of each text chunk in tokens (default: 800).

  • minChunkSizeChars: The minimum size of each text chunk in characters (default: 350).

  • minChunkLengthToEmbed: The minimum length of a chunk to be included (default: 5).

  • maxNumChunks: The maximum number of chunks to generate from a text (default: 10000).

  • keepSeparator: Whether to keep separators (like newlines) in the chunks (default: true).

Behavior

The TokenTextSplitter processes text content as follows:

  1. It encodes the input text into tokens using the CL100K_BASE encoding.

  2. It splits the encoded text into chunks based on the defaultChunkSize.

  3. For each chunk:

    a. It decodes the chunk back into text.

    b. It attempts to find a suitable break point (period, question mark, exclamation mark, or newline) after the minChunkSizeChars.

    c. If a break point is found, it truncates the chunk at that point.

    d. It trims the chunk and optionally removes newline characters based on the keepSeparator setting.

    e. If the resulting chunk is longer than minChunkLengthToEmbed, it’s added to the output.

  4. This process continues until all tokens are processed or maxNumChunks is reached.

  5. Any remaining text is added as a final chunk if it’s longer than minChunkLengthToEmbed.

Example

Document doc1 = new Document("This is a long piece of text that needs to be split into smaller chunks for processing.",
        Map.of("source", "example.txt"));
Document doc2 = new Document("Another document with content that will be split based on token count.",
        Map.of("source", "example2.txt"));

TokenTextSplitter splitter = new TokenTextSplitter();
List<Document> splitDocuments = splitter.apply(List.of(doc1, doc2));

for (Document doc : splitDocuments) {
    System.out.println("Chunk: " + doc.getContent());
    System.out.println("Metadata: " + doc.getMetadata());
}

Notes

  • The TokenTextSplitter uses the CL100K_BASE encoding from the jtokkit library, which is compatible with newer OpenAI models.

  • The splitter attempts to create semantically meaningful chunks by breaking at sentence boundaries where possible.

  • Metadata from the original documents is preserved and copied to all chunks derived from that document.

  • The content formatter (if set) from the original document is also copied to the derived chunks if copyContentFormatter is set to true (default behavior).

  • This splitter is particularly useful for preparing text for large language models that have token limits, ensuring that each chunk is within the model’s processing capacity.

ContentFormatTransformer

Ensures uniform content formats across all documents.
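
For example, applying one shared DefaultContentFormatter to every document might be sketched as follows (the text template shown is illustrative):

@Component
class MyContentFormatter {

    List<Document> formatDocuments(List<Document> documents) {
        // One shared formatter so every document renders text and metadata the same way
        DefaultContentFormatter formatter = DefaultContentFormatter.builder()
            .withTextTemplate("{metadata_string}\n\n{content}")
            .build();
        return new ContentFormatTransformer(formatter).apply(documents);
    }
}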

KeywordMetadataEnricher

The KeywordMetadataEnricher is a DocumentTransformer that uses a generative AI model to extract keywords from document content and add them as metadata.

Usage

@Component
class MyKeywordEnricher {

    private final ChatModel chatModel;

    MyKeywordEnricher(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    List<Document> enrichDocuments(List<Document> documents) {
        KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(this.chatModel, 5);
        return enricher.apply(documents);
    }
}

Constructor

The KeywordMetadataEnricher constructor takes two parameters:

  1. ChatModel chatModel: The AI model used for generating keywords.

  2. int keywordCount: The number of keywords to extract for each document.

Behavior

The KeywordMetadataEnricher processes documents as follows:

  1. For each input document, it creates a prompt using the document’s content.

  2. It sends this prompt to the provided ChatModel to generate keywords.

  3. The generated keywords are added to the document’s metadata under the key "excerpt_keywords".

  4. The enriched documents are returned.

Customization

The keyword extraction prompt can be customized by modifying the KEYWORDS_TEMPLATE constant in the class. The default template is:

{context_str}. Give %s unique keywords for this document. Format as comma separated. Keywords:

Where {context_str} is replaced with the document content, and %s is replaced with the specified keyword count.

Example

ChatModel chatModel = // initialize your chat model
KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(chatModel, 5);

Document doc = new Document("This is a document about artificial intelligence and its applications in modern technology.");

List<Document> enrichedDocs = enricher.apply(List.of(doc));

Document enrichedDoc = enrichedDocs.get(0);
String keywords = (String) enrichedDoc.getMetadata().get("excerpt_keywords");
System.out.println("Extracted keywords: " + keywords);

Notes

  • The KeywordMetadataEnricher requires a functioning ChatModel to generate keywords.

  • The keyword count must be 1 or greater.

  • The enricher adds the "excerpt_keywords" metadata field to each processed document.

  • The generated keywords are returned as a comma-separated string.

  • This enricher is particularly useful for improving document searchability and for generating tags or categories for documents.

SummaryMetadataEnricher

The SummaryMetadataEnricher is a DocumentTransformer that uses a generative AI model to create summaries for documents and add them as metadata. It can generate summaries for the current document, as well as adjacent documents (previous and next).

Usage

@Configuration
class EnricherConfig {

    @Bean
    public SummaryMetadataEnricher summaryMetadata(OpenAiChatModel aiClient) {
        return new SummaryMetadataEnricher(aiClient,
            List.of(SummaryType.PREVIOUS, SummaryType.CURRENT, SummaryType.NEXT));
    }
}

@Component
class MySummaryEnricher {

    private final SummaryMetadataEnricher enricher;

    MySummaryEnricher(SummaryMetadataEnricher enricher) {
        this.enricher = enricher;
    }

    List<Document> enrichDocuments(List<Document> documents) {
        return this.enricher.apply(documents);
    }
}

Constructor

The SummaryMetadataEnricher provides two constructors:

  1. SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes)

  2. SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes, String summaryTemplate, MetadataMode metadataMode)

Parameters

  • chatModel: The AI model used for generating summaries.

  • summaryTypes: A list of SummaryType enum values indicating which summaries to generate (PREVIOUS, CURRENT, NEXT).

  • summaryTemplate: A custom template for summary generation (optional).

  • metadataMode: Specifies how to handle document metadata when generating summaries (optional).

Behavior

The SummaryMetadataEnricher processes documents as follows:

  1. For each input document, it creates a prompt using the document’s content and the specified summary template.

  2. It sends this prompt to the provided ChatModel to generate a summary.

  3. Depending on the specified summaryTypes, it adds the following metadata to each document:

    • section_summary: Summary of the current document.

    • prev_section_summary: Summary of the previous document (if available and requested).

    • next_section_summary: Summary of the next document (if available and requested).

  4. The enriched documents are returned.

Customization

The summary generation prompt can be customized by providing a custom summaryTemplate. The default template is:

"""
Here is the content of the section:
{context_str}

Summarize the key topics and entities of the section.

Summary:
"""

Example

ChatModel chatModel = // initialize your chat model
SummaryMetadataEnricher enricher = new SummaryMetadataEnricher(chatModel,
    List.of(SummaryType.PREVIOUS, SummaryType.CURRENT, SummaryType.NEXT));

Document doc1 = new Document("Content of document 1");
Document doc2 = new Document("Content of document 2");

List<Document> enrichedDocs = enricher.apply(List.of(doc1, doc2));

// Check the metadata of the enriched documents
for (Document doc : enrichedDocs) {
    System.out.println("Current summary: " + doc.getMetadata().get("section_summary"));
    System.out.println("Previous summary: " + doc.getMetadata().get("prev_section_summary"));
    System.out.println("Next summary: " + doc.getMetadata().get("next_section_summary"));
}

The provided example demonstrates the expected behavior:

  • For a list of two documents, both documents receive a section_summary.

  • The first document receives a next_section_summary but no prev_section_summary.

  • The second document receives a prev_section_summary but no next_section_summary.

  • The section_summary of the first document matches the prev_section_summary of the second document.

  • The next_section_summary of the first document matches the section_summary of the second document.

Notes

  • The SummaryMetadataEnricher requires a functioning ChatModel to generate summaries.

  • The enricher can handle document lists of any size, properly handling edge cases for the first and last documents.

  • This enricher is particularly useful for creating context-aware summaries, allowing for better understanding of document relationships in a sequence.

  • The MetadataMode parameter allows control over how existing metadata is incorporated into the summary generation process.

Writers

File

The FileDocumentWriter is a DocumentWriter implementation that writes the content of a list of Document objects into a file.

Usage

@Component
class MyDocumentWriter {

    public void writeDocuments(List<Document> documents) {
        FileDocumentWriter writer = new FileDocumentWriter("output.txt", true, MetadataMode.ALL, false);
        writer.accept(documents);
    }
}

Constructors

The FileDocumentWriter provides three constructors:

  1. FileDocumentWriter(String fileName)

  2. FileDocumentWriter(String fileName, boolean withDocumentMarkers)

  3. FileDocumentWriter(String fileName, boolean withDocumentMarkers, MetadataMode metadataMode, boolean append)

Parameters

  • fileName: The name of the file to write the documents to.

  • withDocumentMarkers: Whether to include document markers in the output (default: false).

  • metadataMode: Specifies what document content is written to the file (default: MetadataMode.NONE).

  • append: If true, data will be written to the end of the file rather than the beginning (default: false).

Behavior

The FileDocumentWriter processes documents as follows:

  1. It opens a FileWriter for the specified file name.

  2. For each document in the input list:

    a. If withDocumentMarkers is true, it writes a document marker including the document index and page numbers.

    b. It writes the formatted content of the document based on the specified metadataMode.

  3. The file is closed after all documents have been written.

Document Markers

When withDocumentMarkers is set to true, the writer includes markers for each document in the following format:

### Doc: [index], pages:[start_page_number,end_page_number]
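
For instance, a document at index 0 whose metadata reports pages 1 through 3 would produce a marker like:

### Doc: 0, pages:[1,3]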

Metadata Handling

The writer uses two specific metadata keys:

  • page_number: Represents the starting page number of the document.

  • end_page_number: Represents the ending page number of the document.

These are used when writing document markers.

Example

List<Document> documents = // initialize your documents
FileDocumentWriter writer = new FileDocumentWriter("output.txt", true, MetadataMode.ALL, true);
writer.accept(documents);

This will write all documents to "output.txt", including document markers, using all available metadata, and appending to the file if it already exists.

Notes

  • The writer uses FileWriter, so it writes text files with the default character encoding of the operating system.

  • If an error occurs during writing, a RuntimeException is thrown with the original exception as its cause.

  • The metadataMode parameter allows control over how existing metadata is incorporated into the written content.

  • This writer is particularly useful for debugging or creating human-readable outputs of document collections.

VectorStore

Provides integration with various vector stores. See Vector DB Documentation for a full listing.