Batch Processing and Transactions
Simple Batching with No Retry
Consider the following simple example of a nested batch with no retries. It shows a common scenario for batch processing: An input source is processed until exhausted, and it commits periodically at the end of a “chunk” of processing.
1   |  REPEAT(until=exhausted) {
    |
2   |    TX {
3   |      REPEAT(size=5) {
3.1 |        input;
3.2 |        output;
    |      }
    |    }
    |
    |  }
The input operation (3.1) could be a message-based receive (such as from JMS) or a file-based read, but to recover and continue processing with a chance of completing the whole job, it must be transactional. The same applies to the operation at 3.2. It must be either transactional or idempotent.
If the chunk at REPEAT (3) fails because of a database exception at 3.2, then TX (2) must roll back the whole chunk.
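To make the chunk semantics concrete, here is a minimal Python simulation of the plan above (Spring Batch itself is Java, and this is not its API). `TransactionalQueue`, `run_batch`, and the injected failure are hypothetical: the queue stands in for a transactional input such as a JMS destination, and a list stands in for the transactional output. Because input and output commit together, a rolled-back chunk is re-presented and no item is written twice.

```python
class TransactionalQueue:
    """Hypothetical in-memory input whose receives are undone on rollback."""

    def __init__(self, items):
        self._items = list(items)
        self._committed = 0  # position up to which receives are committed
        self._pending = 0    # items received in the current transaction

    def receive(self):
        index = self._committed + self._pending
        if index >= len(self._items):
            return None      # input exhausted
        self._pending += 1
        return self._items[index]

    def commit(self):
        self._committed += self._pending
        self._pending = 0

    def rollback(self):
        self._pending = 0    # received items will be re-presented


def run_batch(source, chunk_size=5, fail_once_on=None):
    """REPEAT(until=exhausted) { TX { REPEAT(size=chunk_size) { input; output; } } }"""
    output = []              # committed output (stands in for a database table)
    already_failed = set()
    while True:                              # REPEAT(until=exhausted)  (1)
        pending_output = []                  # TX (2): nothing is visible until commit
        exhausted = False
        try:
            for _ in range(chunk_size):      # REPEAT(size=5)  (3)
                item = source.receive()      # input  (3.1)
                if item is None:
                    exhausted = True
                    break
                if item == fail_once_on and item not in already_failed:
                    already_failed.add(item)
                    raise RuntimeError("simulated database exception at 3.2")
                pending_output.append(item)  # output  (3.2)
        except RuntimeError:
            source.rollback()                # TX (2) rolls back the whole chunk
            continue
        output.extend(pending_output)        # TX (2) commits output and input together
        source.commit()
        if exhausted:
            return output
```

With 12 items and a single simulated failure on item 7, the second chunk is rolled back once and processed again, and the committed output is still 0 to 11 in order.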
Simple Stateless Retry
It is also useful to use a retry for an operation that is not transactional, such as a call to a web service or other remote resource, as the following example shows:
0   |  TX {
1   |    input;
1.1 |    output;
2   |    RETRY {
2.1 |      remote access;
    |    }
    |  }
This is actually one of the most useful applications of a retry, since a remote call is much more likely to fail and be retryable than a database update. As long as the remote access (2.1) eventually succeeds, the transaction, TX (0), commits. If the remote access (2.1) eventually fails, the transaction, TX (0), is guaranteed to roll back.
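A minimal sketch of a stateless retry around a remote call, under the same caveats: `FlakyRemote`, `retry_stateless`, and `process_in_tx` are hypothetical names, and `ConnectionError` stands in for whatever retryable error a real client raises. The retry loops in the same call stack, so the enclosing transaction only sees the final outcome.

```python
class FlakyRemote:
    """Hypothetical remote service that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def call(self, item):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("remote resource unavailable")
        return f"ok:{item}"


def retry_stateless(callback, max_attempts=3):
    """Re-invoke callback in the same call stack; re-raise the last error when attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return callback()
        except ConnectionError as err:
            last_error = err
    raise last_error


def process_in_tx(item, remote, max_attempts=3):
    pending = [item]                      # TX (0): input (1) / output (1.1), not yet committed
    try:
        # RETRY (2) around remote access (2.1)
        pending.append(retry_stateless(lambda: remote.call(item), max_attempts))
    except ConnectionError:
        return "rolled back", []          # remote access finally failed: TX (0) rolls back
    return "committed", pending           # remote access eventually succeeded: TX (0) commits
```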
Typical Repeat-Retry Pattern
The most typical batch processing pattern is to add a retry to the inner block of the chunk, as the following example shows:
1   |  REPEAT(until=exhausted, exception=not critical) {
    |
2   |    TX {
3   |      REPEAT(size=5) {
    |
4   |        RETRY(stateful, exception=deadlock loser) {
4.1 |          input;
5   |        } PROCESS {
5.1 |          output;
6   |        } SKIP and RECOVER {
    |          notify;
    |        }
    |
    |      }
    |    }
    |
    |  }
The inner RETRY (4) block is marked as “stateful”. See the typical use case for a description of a stateful retry. This means that, if the retry PROCESS (5) block fails, the behavior of the RETRY (4) is as follows:
- Throw an exception, rolling back the transaction, TX (2), at the chunk level, and allowing the item to be re-presented to the input queue.
- When the item re-appears, it might be retried, depending on the retry policy in place, in which case PROCESS (5) is executed again. The second and subsequent attempts might fail again and re-throw the exception.
- Eventually, the item reappears for the final time. The retry policy disallows another attempt, so PROCESS (5) is not executed this time. In this case, we follow the RECOVER (6) path, effectively “skipping” the item that was received and is being processed.
Note that the notation used for the RETRY (4) in the plan explicitly shows that the input step (4.1) is part of the retry. It also makes clear that there are two alternate paths for processing: the normal case, as denoted by PROCESS (5), and the recovery path, as denoted in a separate block by RECOVER (6). The two alternate paths are completely distinct. Only one is ever taken in normal circumstances.
In special cases (such as a special TransactionValidException type), the retry policy might be able to determine that the RECOVER (6) path can be taken on the last attempt after PROCESS (5) has just failed, instead of waiting for the item to be re-presented. This is not the default behavior, because it requires detailed knowledge of what has happened inside the PROCESS (5) block, which is not usually available. For example, if the output included write access before the failure, the exception should be re-thrown to ensure transactional integrity.
The completion policy in the outer REPEAT (1) is crucial to the success of the plan. If the output (5.1) fails, it may throw an exception (it usually does, as described), in which case the transaction, TX (2), fails, and the exception could propagate up through the outer batch REPEAT (1). We do not want the whole batch to stop, because the RETRY (4) might still be successful if we try again, so we add exception=not critical to the outer REPEAT (1).
Note, however, that if the TX (2) fails and we do try again, by virtue of the outer completion policy, the item that is next processed in the inner REPEAT (3) is not guaranteed to be the one that just failed. It might be, but it depends on the implementation of the input (4.1). Thus, the output (5.1) might fail again on either a new item or the old one. The client of the batch should not assume that each RETRY (4) attempt is going to process the same item as the last one that failed. For example, if the termination policy for REPEAT (1) is to fail after 10 attempts, it fails after 10 consecutive attempts but not necessarily at the same item. This is consistent with the overall retry strategy. The inner RETRY (4) is aware of the history of each item and can decide whether or not to have another attempt at it.
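The outer loop's exception handling might be sketched as follows. `DeadlockLoserError`, `repeat_until_exhausted`, and `scripted` are hypothetical, and counting consecutive failures is only one possible termination policy (the 10-attempt limit from the text is used as the default). The counter tracks chunk attempts, not items, which is why the consecutive failures need not involve the same item.

```python
class DeadlockLoserError(Exception):
    """Hypothetical non-critical exception: worth another chunk attempt."""


def repeat_until_exhausted(run_chunk, max_consecutive_failures=10):
    """Outer REPEAT (1) with exception=not critical.

    run_chunk() processes one chunk in its own transaction and returns True once the
    input is exhausted. Non-critical failures are tolerated until too many occur in a
    row; any other exception propagates immediately."""
    consecutive_failures = 0
    committed_chunks = 0
    while True:
        try:
            exhausted = run_chunk()
        except DeadlockLoserError:
            consecutive_failures += 1     # counts chunk attempts, not items
            if consecutive_failures >= max_consecutive_failures:
                raise
            continue
        consecutive_failures = 0
        committed_chunks += 1
        if exhausted:
            return committed_chunks


def scripted(outcomes):
    """Hypothetical chunk runner that replays a fixed list of outcomes."""
    steps = iter(outcomes)

    def run_chunk():
        outcome = next(steps)
        if outcome == "fail":
            raise DeadlockLoserError()
        return outcome == "done"
    return run_chunk
```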
Asynchronous Chunk Processing
The inner batches or chunks in the typical example can be executed concurrently by configuring the outer batch to use an AsyncTaskExecutor. The outer batch waits for all the chunks to complete before completing. The following example shows asynchronous chunk processing:
1   |  REPEAT(until=exhausted, concurrent, exception=not critical) {
    |
2   |    TX {
3   |      REPEAT(size=5) {
    |
4   |        RETRY(stateful, exception=deadlock loser) {
4.1 |          input;
5   |        } PROCESS {
    |          output;
6   |        } RECOVER {
    |          recover;
    |        }
    |
    |      }
    |    }
    |
    |  }
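As an illustration only, Python's `concurrent.futures.ThreadPoolExecutor` can play the role of the `AsyncTaskExecutor`: each chunk, and therefore each chunk transaction, runs entirely on one worker thread, and the outer batch waits for every chunk. The functions are hypothetical, and splitting the input up front is a simplification; in the plan, each chunk reads its own items (4.1), so a real input would have to be thread-safe.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """TX (2): every item of one chunk is handled, and would be committed, on one thread."""
    worker = threading.current_thread().name
    return [(item * 10, worker) for item in chunk]


def run_concurrent_chunks(items, chunk_size=5, workers=4):
    # Simplification: the input is split up front instead of being read concurrently.
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:   # stand-in for AsyncTaskExecutor
        futures = [pool.submit(process_chunk, chunk) for chunk in chunks]
        # The outer batch completes only after every chunk has completed.
        return [future.result() for future in futures]
```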
Asynchronous Item Processing
The individual items in chunks in the typical example can also, in principle, be processed concurrently. In this case, the transaction boundary has to move to the level of the individual item, so that each transaction is on a single thread, as the following example shows:
1   |  REPEAT(until=exhausted, exception=not critical) {
    |
2   |    REPEAT(size=5, concurrent) {
    |
3   |      TX {
4   |        RETRY(stateful, exception=deadlock loser) {
4.1 |          input;
5   |        } PROCESS {
    |          output;
6   |        } RECOVER {
    |          recover;
    |        }
    |      }
    |
    |    }
    |
    |  }
This plan sacrifices the optimization benefit, which the simple plan had, of having all the transactional resources chunked together. It is useful only if the cost of the processing (5) is much higher than the cost of transaction management (3).
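The cost trade-off becomes visible when transactions are counted. In this hypothetical sketch, `TransactionCounter` merely stands in for transaction-management overhead, and `pool.map` processes each group of five items concurrently, with one transaction per item on whichever worker thread handles it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TransactionCounter:
    """Hypothetical stand-in for transaction-management overhead: counts begun transactions."""

    def __init__(self):
        self._lock = threading.Lock()
        self.begun = 0

    def begin(self):
        with self._lock:
            self.begun += 1


def run_items_concurrently(items, chunk_size=5, workers=4):
    tx = TransactionCounter()

    def process_item(item):
        tx.begin()            # TX (3): one transaction per item, on the worker's thread
        return item * 10      # PROCESS (5)

    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(items), chunk_size):      # REPEAT(until=exhausted) (1)
            group = items[start:start + chunk_size]
            results.extend(pool.map(process_item, group))   # REPEAT(size=5, concurrent) (2)
    return results, tx.begun
```

For 12 items this begins 12 transactions, where the chunk-level plan would begin 3.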
Interactions Between Batching and Transaction Propagation
There is a tighter coupling between batch-retry and transaction management than we would ideally like. In particular, a stateless retry cannot be used to retry database operations with a transaction manager that does not support NESTED propagation.
The following example uses retry without repeat:
1   |  TX {
    |
1.1 |    input;
1.2 |    database access;
2   |    RETRY {
3   |      TX {
3.1 |        database access;
    |      }
    |    }
    |
    |  }
Again, and for the same reason, the inner transaction, TX (3), can cause the outer transaction, TX (1), to fail, even if the RETRY (2) is eventually successful.
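The effect can be reproduced with a toy model of default propagation, in which TX (3) simply joins the physical transaction of TX (1). `SharedTransaction`, `UnexpectedRollback`, and `retry_without_repeat` are hypothetical; `UnexpectedRollback` plays the role of the error a transaction manager reports when asked to commit a rollback-only transaction (in Spring, `UnexpectedRollbackException`).

```python
class UnexpectedRollback(Exception):
    """Stand-in for the error raised when committing a rollback-only transaction."""


class SharedTransaction:
    """Hypothetical physical transaction shared by TX (1) and, under default propagation, TX (3)."""

    def __init__(self):
        self.writes = []
        self.rollback_only = False
        self.committed = []

    def commit(self):
        if self.rollback_only:
            self.writes = []
            raise UnexpectedRollback("an inner rollback marked the outer transaction rollback-only")
        self.committed = list(self.writes)


def retry_without_repeat(failures_before_success, max_attempts=3):
    tx = SharedTransaction()                    # TX (1)
    tx.writes.append("input")                   # input (1.1) / database access (1.2)
    for attempt in range(1, max_attempts + 1):  # RETRY (2), stateless
        # TX (3) with default propagation joins TX (1) instead of starting a new one.
        if attempt <= failures_before_success:
            tx.rollback_only = True             # rolling back TX (3) marks the shared transaction
            continue
        tx.writes.append(f"attempt {attempt}")  # database access (3.1) succeeds
        break
    tx.commit()                                 # fails if any attempt rolled back
    return tx.committed
```

With one failed attempt, the retry succeeds on attempt 2, yet committing TX (1) still fails.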
Unfortunately, the same effect percolates from the retry block up to the surrounding repeat batch if there is one, as the following example shows:
1   |  TX {
    |
2   |    REPEAT(size=5) {
2.1 |      input;
2.2 |      database access;
3   |      RETRY {
4   |        TX {
4.1 |          database access;
    |        }
    |      }
    |    }
    |
    |  }
Now, if TX (4) rolls back, it can pollute the whole batch at TX (1) and force it to roll back at the end.
What about non-default propagation?
- In the retry-without-repeat example shown earlier, PROPAGATION_REQUIRES_NEW at TX (3) prevents the outer TX (1) from being polluted if both transactions are eventually successful. But if TX (3) commits and TX (1) rolls back, TX (3) stays committed, so we violate the transaction contract for TX (1). If TX (3) rolls back, TX (1) does not necessarily roll back (but it probably does in practice, because the retry throws a rollback exception).
- PROPAGATION_NESTED at TX (3) works as we require in the retry case (and for a batch with skips): TX (3) can commit but subsequently be rolled back by the outer transaction, TX (1). If TX (3) rolls back, TX (1) rolls back in practice. This option is only available on some platforms, not including Hibernate or JTA, but it is the only one that consistently works.
Consequently, the NESTED pattern is best if the retry block contains any database access.
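The three propagation behaviors can be compared in a toy model. The propagation names echo Spring's `PROPAGATION_*` constants, but `Database`, `Transaction`, and `scenario` are hypothetical and ignore isolation and locking: a nested transaction publishes its writes into its parent, like a released savepoint, while a `REQUIRES_NEW` transaction commits straight to the store. The model leaves the decision to roll back the outer transaction to the caller; in the plans above, an exception thrown from the retry usually forces that decision.

```python
class Database:
    """Toy store: only `committed` is durable."""

    def __init__(self):
        self.committed = []


class Transaction:
    """Toy transaction: a nested one publishes into its parent; any other commits to the store."""

    def __init__(self, db, parent=None, nested=False):
        self.db, self.parent, self.nested = db, parent, nested
        self.writes = []
        self.rollback_only = False

    def commit(self):
        if self.rollback_only:
            self.rollback()                         # a rollback-only transaction cannot commit
            return
        if self.nested:
            self.parent.writes.extend(self.writes)  # still undone if the parent rolls back
        else:
            self.db.committed.extend(self.writes)   # durable from now on
        self.writes = []

    def rollback(self):
        self.writes = []    # for a nested transaction, only its own work is discarded


def scenario(propagation, inner_succeeds, outer_succeeds):
    db = Database()
    outer = Transaction(db)                       # TX (1)
    outer.writes.append("outer")
    if propagation == "REQUIRED":                 # default: join the outer transaction
        inner = outer
    else:                                         # "REQUIRES_NEW" or "NESTED"
        inner = Transaction(db, parent=outer, nested=(propagation == "NESTED"))
    inner.writes.append("inner")                  # TX (3): database access (3.1)
    if inner is outer:
        if not inner_succeeds:
            outer.rollback_only = True            # inner rollback pollutes TX (1)
    elif inner_succeeds:
        inner.commit()
    else:
        inner.rollback()
    if outer_succeeds:
        outer.commit()
    else:
        outer.rollback()
    return db.committed
```

For example, `scenario("REQUIRES_NEW", True, False)` leaves `["inner"]` committed although TX (1) rolled back, while `scenario("NESTED", True, False)` leaves nothing committed.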
Special Case: Transactions with Orthogonal Resources
Default propagation is always OK for simple cases where there are no nested database transactions. Consider the following example, where the SESSION and TX are not global XA resources, so their resources are orthogonal:
0   |  SESSION {
1   |    input;
2   |    RETRY {
3   |      TX {
3.1 |        database access;
    |      }
    |    }
    |  }
Here there is a transactional message, SESSION (0), but it does not participate in other transactions with PlatformTransactionManager, so it does not propagate when TX (3) starts. There is no database access outside the RETRY (2) block. If TX (3) fails and then eventually succeeds on a retry, SESSION (0) can commit (independently of a TX block). This is similar to the vanilla “best-efforts-one-phase-commit” scenario. The worst that can happen is a duplicate message when the RETRY (2) succeeds and the SESSION (0) cannot commit (for example, because the message system is unavailable).
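A small simulation of the best-efforts scenario, assuming a hypothetical `MessageSession` that redelivers any message it has not acknowledged. Failed database attempts are modeled by simply not writing, and `broker_available=False` models the message system becoming unavailable at commit time. None of these names come from a real messaging API.

```python
class MessageSession:
    """Hypothetical transacted session: a message stays queued until it is acknowledged."""

    def __init__(self, messages):
        self.queue = list(messages)

    def receive(self):
        return self.queue[0] if self.queue else None

    def commit(self, broker_available=True):
        if not broker_available:
            return False              # acknowledgement lost: the message will be redelivered
        self.queue.pop(0)             # acknowledged: the message is gone
        return True


def handle_next(session, database, db_failures=0, broker_available=True, max_attempts=3):
    message = session.receive()                  # SESSION (0): input (1)
    for attempt in range(1, max_attempts + 1):   # RETRY (2)
        if attempt <= db_failures:
            continue                             # TX (3) failed and rolled back: nothing written
        database.append(message)                 # TX (3) committed, independently of SESSION (0)
        break
    else:
        return "retries exhausted; SESSION rolls back and the message is redelivered"
    if session.commit(broker_available):
        return "acknowledged"
    return "database committed, acknowledgement lost"
```

In the first call below, the database write succeeds on the second attempt but the acknowledgement is lost, so the redelivered message is written again: the duplicate described above.

```python
session = MessageSession(["m1"])
database = []
handle_next(session, database, db_failures=1, broker_available=False)
handle_next(session, database)
print(database)  # ['m1', 'm1']
```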
Stateless Retry Cannot Recover
The distinction between a stateless and a stateful retry in the typical example shown earlier is important. Ultimately, it is a transactional constraint that forces the distinction, and this constraint also makes it obvious why the distinction exists.
We start with the observation that there is no way to skip an item that failed and successfully commit the rest of the chunk unless we wrap the item processing in a transaction. Consequently, we simplify the typical batch execution plan to be as follows:
0   |  REPEAT(until=exhausted) {
    |
1   |    TX {
2   |      REPEAT(size=5) {
    |
3   |        RETRY(stateless) {
4   |          TX {
4.1 |            input;
4.2 |            database access;
    |          }
5   |        } RECOVER {
5.1 |          skip;
    |        }
    |
    |      }
    |    }
    |
    |  }
The preceding example shows a stateless RETRY (3) with a RECOVER (5) path that kicks in after the final attempt fails. The stateless label means that the block is repeated, up to some limit, without re-throwing any exception. This works only if the transaction, TX (4), has NESTED propagation.
If the inner TX (4) has default propagation properties and rolls back, it pollutes the outer TX (1). The inner transaction is assumed by the transaction manager to have corrupted the transactional resource, so it cannot be used again.
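A toy version of the simplified plan shows why NESTED propagation is required. In this hypothetical `run_chunk`, each attempt of TX (4) either succeeds or rolls back; with `nested=False`, the rollback marks the chunk transaction TX (1) rollback-only, so the skip in RECOVER (5) cannot rescue the rest of the chunk.

```python
def run_chunk(items, bad_items, nested, max_attempts=2):
    """TX (1) { REPEAT(size=5) { RETRY(stateless) { TX (4) {...} } RECOVER { skip } } }"""
    chunk_writes = []                          # TX (1)
    rollback_only = False
    skipped = []
    for item in items:                         # REPEAT(size=5) (2)
        for _ in range(max_attempts):          # RETRY(stateless) (3): no exception escapes
            if item in bad_items:              # TX (4) fails and rolls back
                if not nested:
                    rollback_only = True       # default propagation pollutes TX (1)
                continue                       # NESTED: only the savepoint is rolled back
            chunk_writes.append(item)          # TX (4) succeeds: input (4.1), database access (4.2)
            break
        else:
            skipped.append(item)               # RECOVER (5): skip (5.1)
    if rollback_only:
        return [], skipped, "rolled back"
    return chunk_writes, skipped, "committed"
```

With NESTED propagation the chunk commits items 1 and 3 and skips item 2; with default propagation the same skip still happens, but the whole chunk is lost.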
Support for nested propagation is sufficiently rare that we choose not to support recovery with stateless retries in the current versions of Spring Batch. The same effect can always be achieved (at the expense of repeating more processing) by using the typical pattern shown earlier.