|
Well I have some ideas (expressed in the forum), none are really great solutions:
- 2 columns in the BATCH_STEP_EXECUTION telling running yes/no, with the oracle (or other) sessionid used. When checking if running or not, it checks in Oracle if the sessionid is still alive, because Oracle closes sessions when the client is killed. => I'm aware this is a kind of ugly DB-specific solution, but we've used it in another application and it's working well. - The batch when running creates a temporary file (the ones deleted when the jvm exits) with a carefully designed name. When checking if running or not, it can check if the file exists or not. => This solution prevents 2 same jobs running on the same machine thought, not on distinct ones. I guess in most of the case you don't start the same batch on two different machines. Gérard COLLIN Seems like maybe we need a strategy for this so people can plug in their own varieties. How likely is it really that a job is killed with "kill -9" on the process, or by killing the session on the database? To my mind this is by far the least likely way that a job could fail, so I would argue that we can postpone this beyond 1.0. Is that reasonable?
Well, kill -9 is quite improbable, and is just an example.
But the server could crash, or the network could fail, or no space left on disk, or OutOfMemory, impossible to commit to DB, all that prevents the execution of the exception handling that will commit the FAILED state to the DB. These events can occurs, but of course they are very rare. So I understand it's not for 1.0, just letting customization without having to override - JobDao.createNewJobInstance (), - JobInstance.createJobExecution() would be nice. But by overriding these methods we can plug our own isRunning() method right now. Gérard COLLIN Maybe you can lower the bug's priority.
I don't seem to have enough right for it. Gérard COLLIN I don't think this is something we could really tackle in 1.1, moving out to 2.0
When a new batch is started, the job executor should write heartbeats to the db. When you restart the application, the application can check if the flag for jobexecution is set. If the last heartbeat entry is too old, the batch should be recovered. It should be possible to add recovery listeners the the job steps which would be called if the batch is recovered and then you could implement your own stuff for recovery. This solution can also been used for a clustered environment. One server is executing the batch job and the other servers could check for the heartbeats. If the server with the running batch job crashs, the other implementations on the running servers could start a recovery and maybe restarts the batch job again.
Reto A heartbeat is precisely the right solution, thank you. Some care will be needed to design it for pluggability, because there are a number of different levels of robustness that could be provided, in case someone wants to be more or less robust than the default.
In particular the reliability of an apparently *dead* heartbeat is always questionable from the outside - is it dead or just slow? So there are some special considerations for outside agents (like a replacement job execution) trying to figure out if an apparently dead job is still there. In general the procedure for the outside agent should be: check for heartbeat, if detected do nothing, if not detected *make sure the other guy is really dead* before proceeding with a recovery. That has a pleasingly ruthless sound to it, doesn't it? From the job's point of view the means the the contract is: do not commit *anything* unless you are sure that your heartbeat hasn't been interrupted. This allows an operator or replacement job to shoot the questionable maybe-dead execution and make sure that it never lives again. The StepExecution.version is a pretty good mechanism for that - the maybe-dead step will already die pretty quickly if you update its StepExecution in the database. It's a short journey from there (where we already are) to dying before a commit. I was reviewing the potential implementation of this issue and had a couple of thoughts on the implementation. It's obvious that a lastUpdated column should be in the StepExecution table. it makes perfect sense that we update this value on every update of the StepExecution (which is at chunk boundaries) However, we currently only look at the JobExecution when determining if we can create a new execution. We could also update the JobExecution when we update the StepExecution, but it seems a bit odd to do so, since the JobExecution data hasn't really changed, so putting another column there seems weird. Furthermore, we would need to make two separate calls to the database to update both tables, unless we broke our current encapsulation and let the StepExecutionDao do it. It would be preferred to not make anymore database calls then we have to. It makes the most sense to me to just pull back the list of StepExecutions too ( in the case of an existing job) and inspect them, if all are complete or failed, we're fine, however, if running, we would need to inspect the lastUpdated time. It still seems a bit complicated to me though. Unless we created some kind of other table to hold the overall heartbeat for a JobInstance? It still seems like any solution besides the StepExecution one above would break the encapsulation we've maintained so well. It's the same problem as aggregating status on the JobInstance table.
Actually I thought about this too, and came to the opposite conclusion. I believe it is actually mandatory to track stop signals at the job execution level, even if the heartbeat is in the step execution (which seems natural and as Thomas pointed out may not require any additional queries). The reason is concurrency, basically: the signal to stop might reach a step execution too late - it has already finished and the job moved on to the next step before anyone noticed. The best place to check for a stop signal is still in the step (that way any paralell executions can all find the signal), but it may require a query to something at a higher level (probably the job execution). I don't think it necessarily breaks encapsulation if at the step level we use the JobRepository (but maybe at the dao level - which by the way has already been eroded by the requirements of the JobOperator).
I have updated the StepExecutionDao to store a new field in the StepExecution: LAST_UPDATED. The SimpleJobRepository, however, is what is updating the field. The createJobExecution method in the repository now needs to be updated to check for a stale update (along with adding a setter for the stale timeframe, which will need to be in the factory as well) The final update will be for the stop functionality that Dave has mentioned (although, we could theoretically put that in another issue)
After going back and forth on a lot of different possibilities on this issue, (and discussing with Thomas and Dave), I kept it simple and added a check to the SimpleJobRepository to determine if the status of the JobExecution of the currently executing StepExecution has been set to STOPPING or not. If so, setTerminateOnly is set to true (throwing an exception will cause the job to fail, not stop) and the step will stop. It's by no means complete, but it took a lot of back and forth to even decide that this was the right direction.
One other issue to note is that we really need to pick some common names and stick to them. In the step we called it 'interruption' (and even have a policy named that), the status is called 'STOPPED' and 'STOPPING', and you set the stepExecution to 'terminateOnly'. I don't care which one of the three we go with (stop, interrupt, terminate), but we should be consistent across the board. (although I heavily favor stop, since it will go well with pause) I added the lastUpdated property to JobExecution and a column in it's respective database. There is also a new functional test for ensuring that modifying the status of a JobExecution will cause it to stop. I just need to add some functionality to the daos and the JobOperator to support calling stop that way. The column changes I've made were also only done in the immeadiate hsql schema in src/main/resources. The base ones used for the template need to be updated as well.
I added the support for stop into the SimpleJobOperator, which was committed. I'll need a little more time to get the sample working using it. There's a bit of a mess to untangle with some of the dao's, etc.
I'm not sure if the stop interface in JobOperator really needs to return a boolean. We don't have anyway of knowing whether or not the stop is successful, short of sleeping and checking to see if it's stopped, which is probably not the operator implementation's responsibility. Jobs can now successfully be stopped via the JobOperator interface. There's still some polishing that needs to be done between all the collaborators, but I think that should be a separate issue, so I'm resolving this one.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
If you have any suggestions how to do it better, we are definitely listening. Maybe at least a log message that makes it clear what has happened and suggesting that the state needs to be manually updated to allow a restart? The point is that there is no way for the framework to tell the difference between a running job and one that was terminated without prejudice, so there has to be an external intervention to get back to a restartable state.