Issue Details (XML | Word | Printable)

Key: BATCH-654
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Dave Syer
Reporter: Telematica
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Spring Batch

Ensure best efforts are made to commit StepExecution when commit fails

Created: 05/Jun/08 10:31 AM   Updated: 07/Aug/08 10:07 AM
Component/s: Core
Affects Version/s: 1.0.1
Fix Version/s: 1.1.0

Time Tracking:
Original Estimate: 0.25d
Original Estimate - 0.25d
Remaining Estimate: 0d
Time Spent - 0.25d
Time Spent: 0.25d
Time Spent - 0.25d

Environment: Noticed the failure when using JTA, with JOTM.


 Description  « Hide
When the commit operation fails in an ItemOrientedStep, the writing of the Spring Batch metadata fails (trying to write the rollback information).

When trying to commit the transaction, SB has already set the information it's about to persist in the StepExecutionContext. Then, in another transaction, SB wants to store that a rollback ocurred. Nevertheless, SB hasn't read the current persisted state from the database, so it still has the Version it read when trying to commit. That's why we think it fails.

Furthermore, even if it succeeded, the persisted information wouldn't be accurate, because, as I've pointed out, the StepExecutionContext hasn't been reset.

 All   Comments   Work Log   Change History   FishEye   Builds      Sort Order: Ascending order - Click to sort in descending order
Dave Syer added a comment - 05/Jun/08 10:50 AM
I'm not entirely sure what the problem is here. If a commit fails then the system is almost by definition in an inconsistent state. You can't really expect anything sensible to happen after a failed commit - even a rollback would not be trusted. I must be misunderstanding something, so can you show an example of what might happen, and what you think should happen in the scenario you describe?

Robert Kasanicky added a comment - 05/Jun/08 11:30 AM - edited
When the metadata database is separate from business database we should be storing 'FAILED' (or maybe 'UNKNOWN') status for the StepExecution if the business database failed to commit.

Robert Kasanicky added a comment - 06/Jun/08 02:49 AM
I'm wondering about the priority of the issue - I suppose the JobExecution metadata gets written and only the StepExecution is left in inconsistent state. The job therefore can't be restarted which is fine because the actual state is undefined as Dave pointed out.

So I think the high level behavior is correct (job's status is 'FAILED' and can't be restarted), but the error message is misleading and StepExecution's status should be set to 'UNKNOWN'. It is also very unlikely to occur (successful flush but failed commit immediately after). Not really a blocker issue I think - or am I missing something?

Dave Syer added a comment - 06/Jun/08 06:18 AM
I agree - changed the priority. We should be able to set the StepExecution status and maybe the exit message (I thought the message was something clear like "commit failed", but maybe it could be clearer if we say that the result is an undefined state and restart is impossible as well).

Robert Kasanicky added a comment - 06/Jun/08 09:40 AM
I think I was wrong about the 'job can't be restarted'. Job only checks for 'UNKNOWN' status to refuse restart, but in this case we'll end up with 'STARTED' (we'll fail to write UNKNOWN due to OptimisticLockingFailure). The same scenario applies for failed update of the execution context which we definitely need to handle well.

Lucas Ward added a comment - 06/Jun/08 09:44 AM
Regarding flush. I think it's assumed if you're a restartable writer that after a failure you have to check to ensure that the state you had stored in the execution context is consistent with where you're going to start writing. In the example of a file, if there's extra lines on top of what the execution context says you should have, the file should be truncated to the EC indicated level

Wayne Lund added a comment - 06/Jun/08 10:13 AM
I'd like to see the original reporter provide the example of where it was failing with JTA and JOTM. We used to do a lot of extensive use of JTA/JOTM on our legacy batch system and found there were scenarios where JOTM behaved very differently from our production system (WebSphere's JTA). Could we get a sample failure attached?

Telematica added a comment - 10/Jun/08 03:21 AM
I'll explain the failure:

Here's the context configuration:
<!-- DATASOURCES -->
<bean id="modelDataSource"
class="org.enhydra.jdbc.pool.StandardXAPoolDataSource">
<property name="dataSource">
<bean class="org.enhydra.jdbc.standard.StandardXADataSource">
<property name="transactionManager" ref="jotm" />
<property name="driverName" value="...ModelDriver" />
</bean>
</property>
</bean>

<bean id="dataSource"
class="org.enhydra.jdbc.pool.StandardXAPoolDataSource">
<property name="dataSource">
<bean class="org.enhydra.jdbc.standard.StandardXADataSource">
<property name="transactionManager" ref="jotm" />
<property name="driverName" value="...MetaDataDriver" />
</bean>
</property>
</bean>

<!-- TRANSACTION MANAGER (small timeOut to force rollback)-->
<bean id="jotm" class="org.springframework.transaction.jta.JotmFactoryBean" />
<bean id="transactionManager" class="org.springframework.transaction.jta.JtaTransactionManager">
<property name="userTransaction">
<ref local="jotm" />
</property>
<property name="allowCustomIsolationLevels" value="true"/>
<property name="defaultTimeout" value="10"/>
</bean>

And now the log output. Up to this point, we've written 10 items to the database. We are bout to commit the transaction but we won't let that happen, wating for more time than the max timeout.

09:52:31,764 DEBUG main DataSourceUtils:112 - Fetching JDBC Connection from DataSource
09:52:31,764 DEBUG main DataSourceUtils:116 - Registering transaction synchronization for JDBC Connection
09:52:31,764 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 1, parameter value [Tue Jun 10 09:52:22 CEST 2008], value class [java.util.Date], SQL type 93
09:52:31,764 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 2, parameter value [null], value class [null], SQL type 93
09:52:31,764 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 3, parameter value [STARTED], value class [java.lang.String], SQL type 12
09:52:31,764 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 4, parameter value [2], value class [java.lang.Integer], SQL type 4
09:52:31,764 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 5, parameter value [20], value class [java.lang.Integer], SQL type 4
09:52:31,764 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 6, parameter value [Y], value class [java.lang.String], SQL type 1
09:52:31,764 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 7, parameter value [UNKNOWN], value class [java.lang.String], SQL type 12
09:52:31,764 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 8, parameter value [], value class [java.lang.String], SQL type 12
09:52:31,764 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 9, parameter value [2], value class [java.lang.Integer], SQL type 4
09:52:31,764 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 10, parameter value [71], value class [java.lang.Long], SQL type 4
09:52:31,764 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 11, parameter value [1], value class [java.lang.Integer], SQL type 4
09:52:31,764 DEBUG main JdbcTemplate:797 - SQL update affected 1 rows
09:52:31,764 INFO main ChunkLogAdvice:45 - [JOB monitorizacionJobJTA; STEP step1] CHUNK: 20 elemento(s) procesado(s).

//Ten seconds and the maximum timeout expires. A rollback is imperative

09:52:41,515 INFO JonasBatch jotm:691 - set rollback only (tx=bb14111130343639633564386631613030615f305f
09:52:42,688 DEBUG main DataSourceUtils:312 - Returning JDBC Connection to DataSource
09:52:42,688 DEBUG main DataSourceUtils:312 - Returning JDBC Connection to DataSource
09:52:42,703 ERROR main ItemOrientedStep:277 - Fatal error detected during commit.
09:52:42,719 ERROR main AbstractStep:183 - Encountered an error executing the step
09:52:42,719 DEBUG main JdbcTemplate:787 - Executing prepared SQL update
09:52:42,719 DEBUG main JdbcTemplate:571 - Executing prepared SQL statement [UPDATE BATCH_STEP_EXECUTION set START_TIME = ?, END_TIME = ?, STATUS = ?, COMMIT_COUNT = ?, ITEM_COUNT = ?, CONTINUABLE = ? , EXIT_CODE = ?, EXIT_MESSAGE = ?, VERSION = ? where STEP_EXECUTION_ID = ? and VERSION = ?]
09:52:42,719 DEBUG main DataSourceUtils:112 - Fetching JDBC Connection from DataSource
09:52:42,719 DEBUG main DataSourceUtils:116 - Registering transaction synchronization for JDBC Connection
09:52:42,719 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 1, parameter value [Tue Jun 10 09:52:22 CEST 2008], value class [java.util.Date], SQL type 93
09:52:42,719 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 2, parameter value [Tue Jun 10 09:52:42 CEST 2008], value class [java.util.Date], SQL type 93
09:52:42,719 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 3, parameter value [UNKNOWN], value class [java.lang.String], SQL type 12
09:52:42,719 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 4, parameter value [2], value class [java.lang.Integer], SQL type 4
09:52:42,719 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 5, parameter value [20], value class [java.lang.Integer], SQL type 4
09:52:42,719 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 6, parameter value [N], value class [java.lang.String], SQL type 1
09:52:42,719 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 7, parameter value [FAILED], value class [java.lang.String], SQL type 12
09:52:42,719 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 8, parameter value [org.springframework.batch.core.step.AbstractStep$FatalException: Fatal error detected during commit
at org.springframework.batch.core.step.item.ItemOrientedStep$1.doInIteration(ItemOrientedStep.java:278)
at org.springframework.batch.repeat.suppor], value class [java.lang.String], SQL type 12
09:52:42,719 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 9, parameter value [3], value class [java.lang.Integer], SQL type 4
09:52:42,719 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 10, parameter value [71], value class [java.lang.Long], SQL type 4
09:52:42,719 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 11, parameter value [2], value class [java.lang.Integer], SQL type 4
09:52:42,750 DEBUG main JdbcTemplate:797 - SQL update affected 0 rows
09:52:42,750 DEBUG main JdbcTemplate:636 - Executing prepared SQL query
09:52:42,750 DEBUG main JdbcTemplate:571 - Executing prepared SQL statement [SELECT VERSION FROM BATCH_STEP_EXECUTION WHERE STEP_EXECUTION_ID=?]
09:52:42,750 DEBUG main StatementCreatorUtils:203 - Setting SQL statement parameter value: column index 1, parameter value [71], value class [java.lang.Long], SQL type unknown
09:52:42,828 DEBUG main DataSourceUtils:312 - Returning JDBC Connection to DataSource
09:52:42,828 ERROR main AbstractStep:216 - Encountered an error saving batch meta data.This job is now in an unknown state and should not be restarted.
org.springframework.dao.OptimisticLockingFailureException: Attempt to update step execution id=71 with wrong version (2), where current version is 1
at org.springframework.batch.core.repository.dao.JdbcStepExecutionDao.updateStepExecution(JdbcStepExecutionDao.java:333)
at org.springframework.batch.core.repository.support.SimpleJobRepository.saveOrUpdate(SimpleJobRepository.java:239)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:615)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:310)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:106)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:90)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at $Proxy0.saveOrUpdate(Unknown Source)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:201)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:615)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:310)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
at org.springframework.aop.aspectj.AspectJAfterAdvice.invoke(AspectJAfterAdvice.java:42)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:160)
at org.springframework.aop.framework.adapter.MethodBeforeAdviceInterceptor.invoke(MethodBeforeAdviceInterceptor.java:50)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:160)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:90)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at $Proxy1.execute(Unknown Source)
at org.springframework.batch.core.job.SimpleJob.execute(SimpleJob.java:125)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:86)
at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:49)
at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:81)

Robert Kasanicky added a comment - 10/Jun/08 03:32 AM
Raised the priority to major due to the restart implications described in my previous comment.

I don't yet see a fully satisfactory way to fix the problem - decrementing version number in step looks awful. Maybe the repository could do it in case execution context's update fails, but it doesn't help with failed commit.

Dave Syer added a comment - 13/Jun/08 06:49 AM
I didn't catch where the timeout came from in that example? Isn't it artificial in some way? I suppose in principle if all the resources in the job are fully XA capable (i.e. no file I/O) then we might expect to be able to recover from a commit failure, but only if we can distinguish a true commit failure from a failure in some other component (e.g. a listener). Is that going to be the target here?

Also, surely we can create a test case that doesn't need JTA?

Telematica added a comment - 16/Jun/08 09:15 AM
Yep, this is artificial in this example because we force a rollback explicitly waiting for a timeout to expire. We could have, as well, unplugged our network connection before the commit took place. But this is not the point here. The problem is that the ExecutionContext is not synchronized with the current database status after the commit failure is detected, because it hasn't been re-queried.

Surely, every test sample that comes with the distribution has no need for JTA. The problem is, as stated, when you need 2 datasources. Then you need XA resources and two-phase commits.

Dave Syer added a comment - 18/Jun/08 04:47 AM - edited
One of the problems here is that the execution context needs to be saved independently of the step execution (so if there is a rollback, the step execution is not affected and in particular so that the version is not incremented). Thus SimpleJobRepository.saveOrUpdateExecutionContext *should not* call saveOrUpdate(StepExecution) - it is up to the caller to decide when it is necessary. I have made this change and the tests pass, but really I think there is a weakness here that needs to be addressed through the interface JobRepository (see BATCH-668). It could also be addressed by providing an explicit load() method in the JobRepository, so that the latest value from the database can be loaded after a rollback.

So the changes are:

* make sure saveOrUpdate(StepExecution) is called explicitly *before* any calls to save the execution context - this is fragile but necessary unless we change the JobRepository
* if a fatal execption was detected before the rollback was attempted, try to save the step execution so that the UNKNOWN status is persisteed properly
* a test for the above (ItemOrientedStepIntegrationTests)

The original description in of this issue seems to be slightly wrong: the execution context should always rollback in such situations, as long as the database is still available.

Dave Syer added a comment - 07/Aug/08 10:07 AM
Assume closed as resolved and released