Issue Details (XML | Word | Printable)

Key: BATCH-365
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Critical Critical
Assignee: Lucas Ward
Reporter: Gregory Kick
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Spring Batch

ItemStream rework

Created: 14/Feb/08 12:08 PM   Updated: 07/Aug/08 10:06 AM
Component/s: Infrastructure
Affects Version/s: 1.0.0.m4
Fix Version/s: 1.0.0.m5

Time Tracking:
Original Estimate: 0.75d
Original Estimate - 0.75d
Remaining Estimate: 0.75d
Remaining Estimate - 0.75d
Time Spent: Not Specified
Remaining Estimate - 0.75d


 Description  « Hide
The ItemStream interface is a little odd. From the way the StreamManager and ItemStreams interact, my understanding is that the ItemStream is designed to allow:
-restartability based on ExecutionAttributes
-enabling transactional behavior via mark() and reset()

While the restartability portion works for both reading and writing, mark and reset only really make sense for reading. Because there's no modification for reading, it makes sense to just mark a position and reset back to it whenever you rollback a transaction. However, for writing, this is not the case. Constantly writing and then truncating a file when you "reset" is frail and will always leave corrupt data when the power goes out.

My suggestion is that mark and reset get moved to the ItemReader interface (just like InputStream). ItemWriter could then add put, clear and write (proably need better names) operations that allow the user to add items, clear unwritten items and write added items respectively. Essentially, this creates a contract that expects some sort of buffering and pushes writing towards an approximately atomic operation.

 All   Comments   Work Log   Change History   FishEye   Builds      Sort Order: Ascending order - Click to sort in descending order
Dave Syer added a comment - 14/Feb/08 04:05 PM
I agree that the truncation might be problematic, but that is just implementation. The interface has to be the same for reader and writer - it's the same concern. I don't personally find mark/reset inappropriate for writing, but if anyone has any suggestions that they feel more strongly about, now is the time to say...

Lucas Ward added a comment - 14/Feb/08 04:13 PM
Greg,

First, I agree with the point about changing the contract of ItemWriter to expect some kind of buffering. For a little bit of history, the reason it is the way it is now, is because of the old ItemOrientated approach. Essentially, everytime an item was written, a new line was written to a file. After say, ten item's were written, the position of the file was stored (i.e. mark). If a rollback occurred, the file was truncated back to it's position. This was because of the approach of writing and reading one item at a time. The issue of 'invalid state' was solved by saving the position of the mark (or more accurately, the current position after the last mark) and truncating the file to that point upon restart. The shift into a 'chunk orientated world' I think demands the ItemWriter have more of a 'contract that expects some sort of buffering' as you said. I also agree that this contract is a bit different from the read. It would actually solve some other problems I was having about how someone would be notified about the end of the chunk. Rather than some kind of interceptor(which I really don't want to do), the 'Dechunker' can easily call itemWriter.flush() before returning successfully, or itemWriter.clear() if rethrowing the exception (i.e. not skipping). The same could easily be done on the chunker, removing the stream manager's necessity for managing anything but the 'ExecutionAttributes'. Although, I'm still hopeful some kind of buffering approach can be used that removes that need to call mark or reset on the ItemReader.

Gregory Kick added a comment - 14/Feb/08 04:19 PM
Reading: mark followed by reads. reset for a return to the mark.
Writing: write followed by commit/flush. reset to clear buffered data.

If you were to say that it's the same concern and invoke this "mark" method at the same time for reading and writing, you need to either:
 - mark when a transaction begins
    - perform a noop write on an empty bufffer at the first mark
    - flush the writing buffer on stream close? somehow that last buffer has to get written...
 - mark when a trasaction ends
   - allow a reset without a mark while reading to accommodate the first reset before a mark is set
   - call mark on the writer when finishing... seems counterintuitive to 'mark' a stream that you're about to close

Neither seems to make all that much sense.

Gregory Kick added a comment - 14/Feb/08 04:30 PM
@Lucas
I'm not sure that it makes sense to have the Dechunker responsible for flushing. The StreamManager trying to tie the stream behavior as closely as possible to the tx seems to be the most reasonable approach. You just need to make sure that streams get registered.

As for buffering, if the new semantics can ensure you will always be moving forward through the data then you'll never need mark and reset. IIRC, the transaction boundary is around processing and writing anyway, so if reads never get rolled back then that makes sense anyway.

Lucas Ward added a comment - 14/Feb/08 04:43 PM
Good point about the StreamManager and the Dechunker.......

It leaves sort of an excellent question. If the StreamManager calls flush before calling commit(), then there would be an issue if something messed up and the transaction was committed. However, if flushed after the commit, an error inbetween committing and flushing would potentially cause the file to have a gap, if a subsequent flush call isn't made . It isn't extremely likely, but possible. It seems to me that the best course of action would be to flush before committing, and truncate the file after restarting. If the commit failed after flushing, the job will error out and when restarted, the erroneous data would be wiped out via the truncate, before writing began again. I still think buffering is the better way to go for writing, and don't think truncating would be good for 'clear()', but either truncating or positioning correctly at overwrite would solve any potential issues with timing.

Given this, I guess calling flush or clear in the streamManager vs. Dechunker is fairly debatable.

Lucas Ward added a comment - 14/Feb/08 05:02 PM
I didn't see Dave's first comment, or realize that's who Greg was first responding to.

Dave,

I'm not sure I agree with you about ItemStream, something hasn't quite sate right with me from the beginning about it. Besides Greg's points about open and close (which I'm fairly indifferent on), there's the issue of the fact that everyone thinks about 'flushing' in their writers. Projects writing a Hibernate version think about 'flushing the session', samething with jdbc batch mode, and even an iBatis batch update. They're thinking about how to flush before commit. In m3, they have to use a repeatInterceptor(or Listener, I forget what we're calling it now) and hook in that way, which is very counterintuitive and confusing. Mark gives them another point now, but I think that's also very confusing. Flush is what they're looking to do, flush is what we should be doing in general for writing, therefore flush is the best thing to call it. I'm really wondering if it doesn't make sense to add flush and clear directly to the ItemWriter interface, I know it would mean everyone has to handle it, but I'm not sure I mind the no-op or pass through, since there doesn't have to be an instance-of check. I think I remember Ben having similar sentiments?

I was able to make the shift very quickly on my machine ~20 minutes, but haven't committed, I really think making this small name change and shift would go a long way towards people understand how exactly we're handling transactions.

Dave Syer added a comment - 15/Feb/08 04:19 AM
It's a small change to ItemWriter, but then StreamManager has to distinguish between ItemWriter and ItemStream (which now doesn't make sense any more). This is quite a big change if you ask me, and puts the release under threat again while we decide what to do.

Gregory Kick added a comment - 15/Feb/08 09:32 AM
@Dave
you mean btw ItemWriter and ItemReader right?

Gregory Kick added a comment - 15/Feb/08 09:44 AM
@Lucas
I'm not entirely sure i follow. Let me rephase and hope i covered it.

if: flush then commit
  - error on flush -> stream is responsible for it's own mess, throws an exception, tx gets rolled back, everybody's happy
  - error on commit (or between flush and commit) -> stream wrote data that it shouldn't have. uh oh. exception handling for commit would have to do cleanup

if: commit then flush
  - error on commit -> tx rolls back, flush never happens, everybody's happy
  - error on flush -> tx committed, but data isn't written. this is the worst case.

So, it looks like flush then commit is the way to go, but there may have to be some mechanism on the writer to ease the pain of the cleanup. Basically something that makes a best effort at cleaning out the data from the write.

Lucas Ward added a comment - 26/Feb/08 03:28 PM
This issue should be completely resolved.

ItemReader now has mark and reset directly on it's interface.

ItemWriter has flush and clear on it's interface.

All of these methods are called correctly from the ItemOrientedStep (and the chunk stuff as well)

ItemStream now has the following methods:

open(EC)
update(EC)
close(EC)

restoreFrom and getEC have been deleted. Further, the ExecutionContextProvider interface has been removed as well.

As part of this change the StreamManager and implementations has been greatly simplified. This is because it not longer has to worry about aggregating the ExecutionContext values and then restoring back to the correct readers. I also removed the key that was used on most methods as well, because the StepExecution was the only key being used, and it will be the same even in a parrallel case, so I didn't see any value in keeping it.

Due to the nature of having one ExecutionContext, which creates a shared keyspace, all readers and writers should have a name value that is overridable. By default it is the fully qualified classname, but setName can be called to override this in the case where two writers of the same class are used in a single step (which should be fairly rare)

Further, each reader and writer now has a saveState value. If saveState is false, nothing will be put in the context. If it's true, then the current state will be put in the context, with the expectation that it will be persisted. For this reason, it is probably best to remove the saveExecutionContext flag on the step. I'll create a separate issue for this.

Dave Syer added a comment - 07/Aug/08 10:06 AM
Assume closed as resolved and released