Evaluation Testing
Testing AI applications requires evaluating the generated content to ensure the AI model has not produced a hallucinated response.
One method to evaluate the response is to use the AI model itself for evaluation. Select the best AI model for the evaluation, which may not be the same model used to generate the response.
The Spring AI interface for evaluating responses is Evaluator, defined as:
@FunctionalInterface
public interface Evaluator {
    EvaluationResponse evaluate(EvaluationRequest evaluationRequest);
}
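Because Evaluator is a functional interface, a custom evaluator can be supplied as a lambda. The following is a minimal, hypothetical sketch, assuming the getResponseContent() accessor on the request and an EvaluationResponse constructor of the form (boolean pass, String feedback, Map<String, Object> metadata):

// Hypothetical evaluator: passes whenever the model produced a non-empty response
Evaluator nonEmptyEvaluator = request -> {
    String content = request.getResponseContent();
    boolean pass = content != null && !content.isBlank();
    // Assumed constructor signature: (pass, feedback, metadata)
    return new EvaluationResponse(pass, pass ? "non-empty response" : "empty response", Map.of());
};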
The input to the evaluation is the EvaluationRequest, defined as:
public class EvaluationRequest {

    private final String userText;

    private final List<Content> dataList;

    private final String responseContent;

    public EvaluationRequest(String userText, List<Content> dataList, String responseContent) {
        this.userText = userText;
        this.dataList = dataList;
        this.responseContent = responseContent;
    }

    ...
}
- userText: The raw input from the user, as a String.
- dataList: Contextual data, such as from Retrieval Augmented Generation, appended to the raw input.
- responseContent: The AI model’s response content, as a String.
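For illustration, here is a hypothetical sketch of assembling an EvaluationRequest by hand, assuming Document (which implements Content) carries the retrieved context; the question, document text, and answer are made-up values:

// Hypothetical inputs for a hand-built evaluation request
String userText = "What is the capital of Denmark?";
List<Content> dataList = List.of(new Document("Copenhagen is the capital of Denmark."));
String responseContent = "The capital of Denmark is Copenhagen.";

EvaluationRequest request = new EvaluationRequest(userText, dataList, responseContent);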
RelevancyEvaluator
The RelevancyEvaluator is an implementation of the Evaluator interface, designed to assess the relevance of AI-generated responses against the provided context. It helps evaluate the quality of a RAG flow by determining whether the AI model’s response is relevant to the user’s input with respect to the retrieved context.
The evaluation is based on the user input, the AI model’s response, and the context information. It uses a prompt template to ask the AI model if the response is relevant to the user input and context.
This is the default prompt template used by the RelevancyEvaluator:
Your task is to evaluate if the response for the query
is in line with the context information provided.
You have two options to answer. Either YES or NO.
Answer YES, if the response for the query
is in line with context information otherwise NO.
Query:
{query}
Response:
{response}
Context:
{context}
Answer:
You can customize the prompt template by providing your own PromptTemplate object, as described in the Custom Template section below.
Usage in Integration Tests
Here is an example of using the RelevancyEvaluator in an integration test to validate the result of a RAG flow that uses the RetrievalAugmentationAdvisor:
@Test
void evaluateRelevancy() {
    String question = "Where does the adventure of Anacletus and Birba take place?";

    RetrievalAugmentationAdvisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
        .documentRetriever(VectorStoreDocumentRetriever.builder()
            .vectorStore(pgVectorStore)
            .build())
        .build();

    ChatResponse chatResponse = ChatClient.builder(chatModel).build()
        .prompt(question)
        .advisors(ragAdvisor)
        .call()
        .chatResponse();

    EvaluationRequest evaluationRequest = new EvaluationRequest(
        // The original user question
        question,
        // The retrieved context from the RAG flow
        chatResponse.getMetadata().get(RetrievalAugmentationAdvisor.DOCUMENT_CONTEXT),
        // The AI model's response
        chatResponse.getResult().getOutput().getText()
    );

    RelevancyEvaluator evaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

    EvaluationResponse evaluationResponse = evaluator.evaluate(evaluationRequest);

    assertThat(evaluationResponse.isPass()).isTrue();
}
You can find several integration tests in the Spring AI project that use the RelevancyEvaluator to test the functionality of the QuestionAnswerAdvisor (see tests) and the RetrievalAugmentationAdvisor (see tests).
Custom Template
The RelevancyEvaluator uses a default template to prompt the AI model for evaluation. You can customize this behavior by providing your own PromptTemplate object via the .promptTemplate() builder method.
The custom PromptTemplate can use any TemplateRenderer implementation (by default, it uses StTemplateRenderer, based on the StringTemplate engine). The important requirement is that the template contains the following placeholders, as demonstrated in the sketch after this list:
- a query placeholder to receive the user question.
- a response placeholder to receive the AI model’s response.
- a context placeholder to receive the context information.
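The following is a minimal sketch of wiring a custom template into the evaluator. The template wording is invented for illustration, and the chatClientBuilder(..) builder method is assumed to accompany the promptTemplate(..) method mentioned above:

PromptTemplate customTemplate = new PromptTemplate("""
        Determine whether the response is relevant to the query,
        given the context. Answer strictly YES or NO.

        Query: {query}
        Response: {response}
        Context: {context}
        Answer:
        """);

RelevancyEvaluator evaluator = RelevancyEvaluator.builder()
        .chatClientBuilder(ChatClient.builder(chatModel))
        .promptTemplate(customTemplate)
        .build();

At evaluation time, the {query}, {response}, and {context} placeholders are resolved with the corresponding values from the EvaluationRequest.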
FactCheckingEvaluator
The FactCheckingEvaluator is another implementation of the Evaluator interface, designed to assess the factual accuracy of AI-generated responses against provided context. This evaluator helps detect and reduce hallucinations in AI outputs by verifying if a given statement (claim) is logically supported by the provided context (document).
The 'claim' and 'document' are presented to the AI model for evaluation. Smaller and more efficient AI models dedicated to this purpose are available, such as Bespoke’s Minicheck, which helps reduce the cost of performing these checks compared to flagship models like GPT-4. Minicheck is also available for use through Ollama.
Usage
The FactCheckingEvaluator constructor takes a ChatClient.Builder as a parameter:
public FactCheckingEvaluator(ChatClient.Builder chatClientBuilder) {
    this.chatClientBuilder = chatClientBuilder;
}
The evaluator uses the following prompt template for fact-checking:
Document: {document}
Claim: {claim}
Where {document} is the context information, and {claim} is the AI model’s response to be evaluated.
Example
Here’s an example of how to use the FactCheckingEvaluator with an Ollama-based ChatModel, specifically the Bespoke-Minicheck model:
@Test
void testFactChecking() {
    // Set up the Ollama API
    OllamaApi ollamaApi = new OllamaApi("http://localhost:11434");

    // BESPOKE_MINICHECK holds the Ollama model name for Bespoke's Minicheck
    ChatModel chatModel = new OllamaChatModel(ollamaApi,
            OllamaOptions.builder().model(BESPOKE_MINICHECK).numPredict(2).temperature(0.0d).build());

    // Create the FactCheckingEvaluator
    var factCheckingEvaluator = new FactCheckingEvaluator(ChatClient.builder(chatModel));

    // Example context and claim
    String context = "The Earth is the third planet from the Sun and the only astronomical object known to harbor life.";
    String claim = "The Earth is the fourth planet from the Sun.";

    // Create an EvaluationRequest
    EvaluationRequest evaluationRequest = new EvaluationRequest(context, Collections.emptyList(), claim);

    // Perform the evaluation
    EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);

    assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");
}