Issue Details (XML | Word | Printable)

Key: BATCH-7
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Dave Syer
Reporter: Lucas Ward
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Spring Batch

Remove transaction synchronization and state management from input/output sources (formerly buffering)

Created: 29/May/07 10:33 PM   Updated: 07/Aug/08 10:06 AM
Component/s: Infrastructure
Affects Version/s: 1.0-m1
Fix Version/s: 1.0.0.m4

Time Tracking:
Original Estimate: 0.5d
Original Estimate - 0.5d
Remaining Estimate: 0d
Time Spent - 0.75d
Time Spent: 0.75d
Time Spent - 0.75d

Issue Links:
Depends
 


 Description  « Hide
Currently, all input templates (simpleFileInput is an exception) store their state in a threadlocal. However, this impacts readability and is overkill for most cases. It may be a better approach to either remove the thread local with the expectation that either the input or the entire module itself will be wrapped in a threadlocal scope(since many cases may result in a non-thread safe module being created and then needing partitioning for performance reasons) Or the InputState could be created as a dependency to the inputs (something like SqlInputState) which would default to a simple, non thread safe implementation, but could be replaced with an implementation that uses either a wrapper or composite pattern to wrap the state in a thread local.

 All   Comments   Work Log   Change History   FishEye   Builds      Sort Order: Ascending order - Click to sort in descending order
Dave Syer added a comment - 12/Jun/07 01:27 PM
Changed title to reflect current usage / terminology.

Discussed with Rob: we would prefer the input / output sources to be created as needed, rather than sitting there as Spring beans. This forces a factory/context pattern up the interface layers into the ItemProvider or its implementations.

Dave Syer added a comment - 24/Jun/07 06:18 AM
Custom scopes might be the best approach. There is now a StepScope available, but the ThreadLocals still need to be replaced. Keeping the state encapsulated in an object that expresses the class invariants still makes sense though.

Dave Syer added a comment - 05/Jul/07 08:46 AM
Replaced ThreadLocal everywhere with equivalent RepeatContext mechanism.

Actually it might be better to do the TX synchronization at the ItemProvider level anyway. There is no easy way to synchronize access to a file that is open for XML reading, so it's probably better to buffer the items themselves rather than try to mark an input stream. If we did that it would handle Skippable but not Restartable in the ItemProvider. Would also solve BATCH-26 (sort of) - at least it wouldn't be the same problem any more.

Lucas Ward added a comment - 06/Jul/07 10:46 AM
I agree that buffering may be another approach to transaction synchronization that may be more appropriate in certain scenarios. (especially XML) However, I'm not sure I understand the advantage of moving it to an ItemProvider. The InputSource already has state about the input, so it still seems natural that it would buffer the input if needed. Moving it to the ItemProvider would then require that the provider have state about what has been read so far, in addition to what the input source is already holding.

I think this also highlights another issue we've discussed before, at times, the line between ItemProvider and InputSource isn't very clear. It gets especially blurry when reading from XML or any other input source that would retrun a mapped object. Ideally, it would be the best if ItemProviders had nothing but developer code handling the business scenarios of loading the data, with no architectural concerns. However, given the interfaces such as Restartable that must be called, it's impossible to do.

Dave Syer added a comment - 06/Jul/07 01:09 PM
The advantage of doing the buffering at the ItemProvider is that it is generic for all InputSources. I guess it could just as easily be done by the InputSources using a helper object to do the buffering. The developer touch point is normally adding a mapper or something to an ItemProvider anyway, so it won't much matter how we do it.

Lucas Ward added a comment - 12/Sep/07 02:12 PM
As I was working through reviewing some of the new xml input sources, I had a new thought about buffering the objects returned by Input Source or Item Provider. One issue with doing so would be either the Item Provider (in the case of buffering an InputSource) or ItemProcessor modifying the object that had been buffered after it had been returned, if the transaction is rolled-back and that same object given back to the itemprocessor or provider, there could be issues if it depended upon the object being in it's initial state.

We could try and tell developers not to modify objects returned from a transactional buffer, or encourage them to be immutable, but there's no garantees, and these types of errors would be extremely hard to debug.

Wayne Lund added a comment - 12/Sep/07 03:05 PM
I don't think its possible to ask for these to be immutable. Therefore it sounds like you're spot on with saying that we can't buffer. Too bad we always lose something between language generations. Forte Tool actually rolled back instance variables as well as transactions but I've never seen another language that supports that feature.

Lucas Ward added a comment - 12/Sep/07 03:30 PM
The only way that I can see to still buffer and not have to worry about objects within the buffer being invalidated would be to either A) Make a copy of them, which would almost certainly cause a performance issue, not to mention there would need to be some kind of contract on the domain objects to ensure they are 'copyable' B) Buffer the objects getting written out. Afterall, they're the only things you really care about, since any given output will always be the same given it's corresponding input.

Dave Syer added a comment - 03/Oct/07 09:52 AM
The original impetus for this was the XML input sources, and they have solved the problem in a different way. So buffering would be nice for maintenance / reliability but arguably adds very little to the user's experience. Leaving this issue open but re-prioritised.

Dave Syer added a comment - 29/Nov/07 11:04 AM
Buffering is a non-issue if there is no TX synchronization - it won't be needed when all the retry happens at the chunk level.

Dave Syer added a comment - 08/Jan/08 01:56 AM
The best approach seems to be to put the mark(), reset() (and markSupported()) methods into ItemReader/Writer. Then let the clients of those classes decied what to do with them.

Dave Syer added a comment - 04/Feb/08 09:11 AM
A lot of work was done on this that is more relevant to BATCH-222 and BATCH-229. This one is finished really, with some tidying up to do (which will be covered there).

Dave Syer added a comment - 04/Feb/08 09:12 AM
ItemStream and mark(), reset() introduced on all the relevant Item*. The buffering is now an option, if someone wants to wrap an existing Item* (e.g. see StagingItemReader in the samples), but not a necessity.

Dave Syer added a comment - 07/Aug/08 10:06 AM
Assume closed as resolved and released