Evaluation Testing

Testing AI applications requires evaluating the generated content to ensure the AI model has not produced a hallucinated response.

One method to evaluate the response is to use the AI model itself for evaluation. Select the best AI model for the evaluation, which may not be the same model used to generate the response.

The Spring AI interface for evaluating responses is Evaluator, defined as:

@FunctionalInterface
public interface Evaluator {
    EvaluationResponse evaluate(EvaluationRequest evaluationRequest);
}
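
Because Evaluator is a functional interface, you can supply your own implementation. The sketch below is purely illustrative — the non-empty check is not a Spring AI feature, and it assumes EvaluationResponse offers a (pass, feedback, metadata) constructor; a real evaluator would typically delegate the judgment to an AI model:

```java
import java.util.Map;
import org.springframework.ai.evaluation.EvaluationRequest;
import org.springframework.ai.evaluation.EvaluationResponse;
import org.springframework.ai.evaluation.Evaluator;

// Illustrative only: passes when the response content is non-empty.
public class NonEmptyResponseEvaluator implements Evaluator {

    @Override
    public EvaluationResponse evaluate(EvaluationRequest evaluationRequest) {
        String response = evaluationRequest.getResponseContent();
        boolean pass = response != null && !response.isBlank();
        return new EvaluationResponse(pass, "non-empty check", Map.of());
    }
}
```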

The input to the evaluation is the EvaluationRequest, defined as:

public class EvaluationRequest {

	private final String userText;

	private final List<Content> dataList;

	private final String responseContent;

	public EvaluationRequest(String userText, List<Content> dataList, String responseContent) {
		this.userText = userText;
		this.dataList = dataList;
		this.responseContent = responseContent;
	}

  ...
}
  • userText: The raw input from the user as a String

  • dataList: Contextual data, such as from Retrieval Augmented Generation, appended to the raw input.

  • responseContent: The AI model’s response content as a String
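
Assembling these three pieces outside of a RAG flow might look like the following sketch (the question, context, and answer values here are hypothetical; Document is a Spring AI Content implementation):

```java
import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.evaluation.EvaluationRequest;

// Hypothetical inputs for illustration.
EvaluationRequest request = new EvaluationRequest(
    "What is the capital of France?",                          // userText
    List.of(new Document("Paris is the capital of France.")),  // dataList
    "The capital of France is Paris.");                        // responseContent
```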

Relevancy Evaluator

The RelevancyEvaluator is an implementation of the Evaluator interface, designed to assess the relevance of AI-generated responses against provided context. This evaluator helps assess the quality of a RAG flow by determining if the AI model’s response is relevant to the user’s input with respect to the retrieved context.

The evaluation is based on the user input, the AI model’s response, and the context information. It uses a prompt template to ask the AI model if the response is relevant to the user input and context.

This is the default prompt template used by the RelevancyEvaluator:

Your task is to evaluate if the response for the query
is in line with the context information provided.

You have two options to answer. Either YES or NO.

Answer YES, if the response for the query
is in line with context information otherwise NO.

Query:
{query}

Response:
{response}

Context:
{context}

Answer:

You can customize the prompt template by providing your own PromptTemplate object via the .promptTemplate() builder method. See the Custom Template section for details.

Usage in Integration Tests

Here is an example of using the RelevancyEvaluator in an integration test to validate the result of a RAG flow built with the RetrievalAugmentationAdvisor:

@Test
void evaluateRelevancy() {
    String question = "Where does the adventure of Anacletus and Birba take place?";

    RetrievalAugmentationAdvisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
        .documentRetriever(VectorStoreDocumentRetriever.builder()
            .vectorStore(pgVectorStore)
            .build())
        .build();

    ChatResponse chatResponse = ChatClient.builder(chatModel).build()
        .prompt(question)
        .advisors(ragAdvisor)
        .call()
        .chatResponse();

    EvaluationRequest evaluationRequest = new EvaluationRequest(
        // The original user question
        question,
        // The retrieved context from the RAG flow
        chatResponse.getMetadata().get(RetrievalAugmentationAdvisor.DOCUMENT_CONTEXT),
        // The AI model's response
        chatResponse.getResult().getOutput().getText()
    );

    RelevancyEvaluator evaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

    EvaluationResponse evaluationResponse = evaluator.evaluate(evaluationRequest);

    assertThat(evaluationResponse.isPass()).isTrue();
}

You can find several integration tests in the Spring AI project that use the RelevancyEvaluator to test the functionality of the QuestionAnswerAdvisor (see tests) and RetrievalAugmentationAdvisor (see tests).

Custom Template

The RelevancyEvaluator uses a default template to prompt the AI model for evaluation. You can customize this behavior by providing your own PromptTemplate object via the .promptTemplate() builder method.

The custom PromptTemplate can use any TemplateRenderer implementation (by default, it uses StPromptTemplate based on the StringTemplate engine). The important requirement is that the template must contain the following placeholders:

  • a query placeholder to receive the user question.

  • a response placeholder to receive the AI model’s response.

  • a context placeholder to receive the context information.
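
As a sketch of how a custom template might be wired in, the method below builds a RelevancyEvaluator from a ChatModel. The template wording is illustrative (only the three placeholders are required), and the chatClientBuilder builder method name is an assumption:

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.evaluation.RelevancyEvaluator;

RelevancyEvaluator buildEvaluator(ChatModel chatModel) {
    // Illustrative template; it must keep the {query}, {response},
    // and {context} placeholders for the evaluator to fill in.
    PromptTemplate customTemplate = new PromptTemplate("""
            Given the context below, answer YES if the response
            addresses the query using only that context, otherwise NO.

            Query: {query}
            Response: {response}
            Context: {context}

            Answer:
            """);

    return RelevancyEvaluator.builder()
            .chatClientBuilder(ChatClient.builder(chatModel))
            .promptTemplate(customTemplate)
            .build();
}
```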

FactCheckingEvaluator

The FactCheckingEvaluator is another implementation of the Evaluator interface, designed to assess the factual accuracy of AI-generated responses against provided context. This evaluator helps detect and reduce hallucinations in AI outputs by verifying if a given statement (claim) is logically supported by the provided context (document).

The 'claim' and 'document' are presented to the AI model for evaluation. Smaller and more efficient AI models dedicated to this purpose are available, such as Bespoke’s Minicheck, which helps reduce the cost of performing these checks compared to flagship models like GPT-4. Minicheck is also available for use through Ollama.

Usage

The FactCheckingEvaluator constructor takes a ChatClient.Builder as a parameter:

public FactCheckingEvaluator(ChatClient.Builder chatClientBuilder) {
  this.chatClientBuilder = chatClientBuilder;
}

The evaluator uses the following prompt template for fact-checking:

Document: {document}
Claim: {claim}

Where {document} is the context information, and {claim} is the AI model’s response to be evaluated.

Example

Here’s an example of how to use the FactCheckingEvaluator with an Ollama-based ChatModel, specifically the Bespoke-Minicheck model:

@Test
void testFactChecking() {
  // Set up the Ollama API
  OllamaApi ollamaApi = new OllamaApi("http://localhost:11434");

  ChatModel chatModel = new OllamaChatModel(ollamaApi,
      OllamaOptions.builder().model(BESPOKE_MINICHECK).numPredict(2).temperature(0.0d).build());

  // Create the FactCheckingEvaluator
  var factCheckingEvaluator = new FactCheckingEvaluator(ChatClient.builder(chatModel));

  // Example context and claim
  String context = "The Earth is the third planet from the Sun and the only astronomical object known to harbor life.";
  String claim = "The Earth is the fourth planet from the Sun.";

  // Create an EvaluationRequest
  EvaluationRequest evaluationRequest = new EvaluationRequest(context, Collections.emptyList(), claim);

  // Perform the evaluation
  EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);

  assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");
}