The meta-data needed to track the scheduling concern is so trivial, I can't help thinking this is going to be over-engineered if we aren't careful. All the user needs is a JobInstance ID and a status (plus the job name to make it easier to understand for humans). In fact that data is already available in existing batch meta-data tables, just probably not in the form that users would ideally like for all purposes. We even had part of this debate a while ago when we removed the status column from the JobInstance table for normalization reasons (I was opposed, but not strongly at the time). All I think we need is a new repository method (maybe in JobRepository, maybe in a new interface).
Dave, I completely understand your feelings about potential over-engineering, and I think Doug's example illustrates this potential. However, every time I write a batch job at a client, I end up having to continually deal with this issue. It would be fine if it didn't require persistence, but it almost always does. The way I see it, there are three issues:
* Initial State: If the job has never been run before, what should the starting value be? The most common way I've seen this handled has been a seed record, but there are probably other ways as well.
* Determining the status of the last execution run. I think this is what you were referring to above, Dave, and it's implicitly required in my description above. It's actually the first thing I thought of as a way to solve this issue: simply add a method to the repository that will return the last execution (or instance) for a given Job. At least then, they wouldn't have to store the status in a separate table.
* 'Incrementing' the parameters. (I still think the name is correct.) If the job is complete, you need to be able to move the parameters forward to the next logical value. This could be a new date, or a new number, or even a new file, depending upon the implementation.

These issues have to be solved for almost every batch job, every time. I know that developers could do the incrementing themselves after calling the repository directly, but as I sat thinking about it, I wondered if that was really necessary. If we had a mechanism as described above, it would solve the only problem I could think of: we don't know how the parameters should be incremented.

Initial state can be handled by specific implementations if you want to reduce complexity of the solution.
e.g.

    String x = parameters.getString("x");
    if (x == null) { x = initialState; } else { x = increment(x); }

Don't confuse a stream-of-thought response with an attempt to over-engineer. As you know me fairly well, my ideas can generally use a bit of refining -- that was just an initial set of thoughts.
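For what it's worth, the seed-or-increment idea above can be written as a small self-contained sketch. Everything here is made up for illustration: `SeedOrIncrement`, `next`, and the plain `Map<String, String>` standing in for the last run's job parameters are assumptions, not Spring Batch API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical helper (not Spring Batch API): seeds or advances one parameter. */
final class SeedOrIncrement {

    /**
     * Returns a copy of the last run's parameters in which `key` is either
     * seeded with `initialState` (no previous value, i.e. first run) or
     * moved forward by applying `increment` to the previous value.
     */
    static Map<String, String> next(Map<String, String> lastParameters,
                                    String key,
                                    String initialState,
                                    UnaryOperator<String> increment) {
        Map<String, String> result = new HashMap<>(lastParameters);
        String previous = lastParameters.get(key);
        result.put(key, previous == null ? initialState : increment.apply(previous));
        return result;
    }

    public static void main(String[] args) {
        // Assumed increment rule for the demo: treat the value as a counter.
        UnaryOperator<String> plusOne = s -> String.valueOf(Long.parseLong(s) + 1);
        Map<String, String> first = next(new HashMap<>(), "x", "1", plusOne);
        Map<String, String> second = next(first, "x", "1", plusOne);
        System.out.println(first + " then " + second);
    }
}
```

The only job-specific piece is the increment function, which is the part the framework can't know; the seed value covers the "never been run before" case without a separate seed record.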
1 and 2 were just suggestions, take 'em or leave 'em.

3, I believe, was implicit to the solution - I was just clarifying.

4 is important if you want solutions to be able to re-use incrementers. This also mirrors the functionality of Spring Core's property configurer / overrider, which allows you to specify whether you want to allow system properties to override the values. In this case, instead of system properties vs. property file we're talking about command-line arguments vs. generated arguments.

Admittedly, 5 and 6 above are perhaps a case of over-engineering. I firmly believe that the rest of the points are not. 5 and 6 arise from my thoughts about a few standard use cases for incrementing parameters:

UC1) Daily jobs - the scheduleDate property needs to be incremented from day to day - this is the most trivial use case and not a very convincing one in favor of Lucas' solution, since this is the easiest one to relegate to the scheduler.

UC2) Weekly, Monthly, Quarterly jobs - the scheduleDate property needs to be set to the first, last, or a given day each week, month or quarter --- however, think of the following, a REAL LIFE scenario I encountered in a previous position. When you run job X, it goes to a vendor's FTP server and downloads the file for the previous month. This is run every day since the file can be updated on a daily basis to reflect corrections to numbers from the previous month. Now, suppose you're running the job for June, but the 31st of May fell on a Saturday and your vendor names his files based on the last BUSINESS day of each month. An incrementer that just picks the last calendar day of each month would not suffice. You would need an incrementer that figures out the last BUSINESS day of each month. That's why I suggested incorporating some sort of rudimentary mechanism for identifying business days -- e.g. a property file -- this may be something that you want the user to deal with, for now, but it's a VERY COMMON CONCERN.

UC3) Parameter-less jobs - if a job doesn't need parameters and can be run an arbitrary number of times each day, the incrementer can generate a random number each time.

7 arises from the following thought: Suppose you have a job that runs once per week but can be run from start to finish an arbitrary number of times. You couldn't just use the incrementer from UC2 since it would collide with the previous instance after the first run. However, if you could use BOTH the incrementer from UC2 and the incrementer from UC3, you could accomplish this goal. I suggested up front offering this with deference to the Ordered interface since you could potentially have a use case for using 2 incrementers that affect the same parameter and would need deterministic ordering of the two.

You could relegate some of these to the scheduler, but you're counting on a level of functionality that may or may not be present (cron, of course, does not support any of these scenarios - I maintain Spring Batch should not be prejudicial against those who don't use a commercial scheduler or Quartz).

Speaking of property placeholder configurers, do we provide a JobParametersBuilder that creates JobParameters based on a properties file? That would be a useful complement to this functionality.
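To make the UC2 vendor-file scenario concrete, here is a rough sketch of the date calculation such an incrementer would need. `LastBusinessDay` and its `holidays` argument are hypothetical names for illustration; the holiday set is assumed to be loaded from wherever the user keeps it (a property file, say), and weekends are assumed to be Saturday and Sunday.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.Set;

/** Hypothetical helper: finds the last business day of a month. */
final class LastBusinessDay {

    /**
     * Walks backwards from the last calendar day of `month` until it reaches
     * a day that is neither a Saturday/Sunday nor in `holidays`.
     * If every day of the month is excluded, the result falls in an earlier month.
     */
    static LocalDate of(YearMonth month, Set<LocalDate> holidays) {
        LocalDate day = month.atEndOfMonth();
        while (isWeekend(day) || holidays.contains(day)) {
            day = day.minusDays(1);
        }
        return day;
    }

    private static boolean isWeekend(LocalDate day) {
        DayOfWeek d = day.getDayOfWeek();
        return d == DayOfWeek.SATURDAY || d == DayOfWeek.SUNDAY;
    }

    public static void main(String[] args) {
        // 31 May 2008 was a Saturday, so the vendor file would be dated the Friday before.
        System.out.println(of(YearMonth.of(2008, 5), Set.of()));
    }
}
```

An incrementer for the daily download job would call this with the month before the run date and put the result into the parameter used to build the file name.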
On a side note: Please understand that I try to think in terms of the end-user (read: end-developer) experience. The basic problem is this: if there is a feature that N out of every 100 end users will want to use, it should be implemented and provided as part of the standard package. How do we determine N? Well, I leave that to you, but as an end-user-sympathizer, I have a pretty low value for N. My only request is for you guys to be consistent. There is some pretty esoteric (albeit wonderful and useful) stuff already in the framework (see: BATCH-333, et al). This being one of the most requested features and a long-standing point of contention indicates to me that there is a real demand for this type of functionality.

1. The term "override" is what some of our batch architects call this feature (e.g. running a batch job at a time different from the originally scheduled date). However, the known original scheduled date was in the enterprise scheduler and it was an ops override to run off schedule. What I can't remember is how the schedule was defined (we were using Tivoli) to not run batch jobs that were dependent on sequential ordering of job executions before previous jobs were caught up.

2. Not sure what's different about logging.

3. I think Dave described how the meta-data from a previous run is already available but requires a little explanation on how best to retrieve it (or a new interface or something that exposes it in a more accessible fashion). I like Lucas' suggestion above. I don't know what is meant by a "seed record".

4. Not sure what to think of the discussion on incrementers. And part of the reason I'm puzzling is that I believe we did think of it more as override than incrementing anything. I'm going to forward to Tsay and get his take on the topic.

[5..6]. It comes to the issue again of who knows about scheduling. I think it's not over-engineering in the big picture but it is for Spring Batch, where we don't want to get into the business of scheduling. That's enterprise schedulers' or Quartz's job. It's been some time since I looked at the possibilities of Quartz integration (and Lucas has looked closer than I have) but I think it already solves UC2 with their framework. Also, we get questions about "ftp job types" on a fairly regular basis and for right now have handed off some of the ideas to our integration team (internally). We have a few people looking at how we'd do this with Spring-Integration + Spring Batch. I don't want to re-introduce ftp job types like we had on my previous batch project.

7. I disagree with moving into the scheduling space but I could be swayed. Hopefully Tsay will have an opinion on this as well.

It's all a matter of balance - I never bought the argument that feature X shouldn't be part of Spring Batch because X is a scheduler concern, etc. If it's a useful feature and people want it, it shouldn't matter whether or not it makes SB a theoretically impure solution. Frankly, it's unrealistic to think that any framework of any sort can stay entirely independent of the concerns of its surrounding environment and remain relevant.
Spring Batch WILL need hooks into schedulers and other commonly-used frameworks, it WILL need a clean way to manage File Transfer, and it WILL need to handle at least a reasonable subset of scheduling concerns in a clean, easy-to-use way.

I had this argument with everyone a while back - we were discussing some feature that I argued was too convoluted to configure. Basically, everyone said "well, the new version of Spring IDE will support imports, so that makes this REALLY easy to do," to which I replied "Well, not everyone uses Spring IDE and we have no right to mandate it. Some users will write their jobs in notepad and we have just as much responsibility to them." No one believed me, but several months later I was vindicated on the forums --- sure enough, people were using notepad as their editor. (As an aside: I don't advocate this, I just don't discriminate.)

In my opinion, the framework's usefulness should ABSOLUTELY NOT be strongly coupled to the IDE or scheduler that you use in conjunction with it, even if it is a recommended solution. If you want to add some niceness for people using a particular scheduler, that's fine. If you don't want to reinvent the wheel, that's understandable. However, if I as an end user need to hack together a set of custom database tables to achieve a simple piece of functionality that is a fairly common need, then I don't consider "that's a scheduler concern" to be a valid answer.

To be blunt: I argue that there's no validity in saying that you won't implement it purely because it violates the "purity" of your product. Explain to the users why you don't want to maintain it, why you can't come up with an elegant way to do it, or why you've chosen scheduler X as the official scheduler for Spring Batch as opposed to supporting the end user's judgment -- these are valid arguments that I think any user would respect more than "it's a scheduling concern. QED".
- never try to restart
- only try to restart if the job is not completed
- always try to restart (regardless of completion status, might find no work to do)
- force a fresh run from the beginning if the job is already complete
So here we're talking about forcing the fresh run of the job. I like this as a feature, since it allows jobs with no explicit parameters to be run more than once.
A few things about this solution:
1) I don't know if incrementer is really what this is - it can be used to increment, but it's more of an "Overrider"
2) The logging for this mechanism needs to be very clear
3) This mechanism has access to the currently provided job parameters. However, it's a whole lot more useful if it has both the currently provided parameters AND the last parameters used for the particular Job (i.e. the parameters for the last job instance created for this job). That way you could create an "incrementer" that doesn't require the user to specify the schedule date each time: if it's not included in the current params, the incrementer goes to the last run of the job and increments from there.
4) Should a job configuration be allowed to place limitations on what can and can't be overridden? That is, should the job be able to allow an injected "incrementer" only to add new properties, not to change the value of those provided on the command line? E.g. as in (3), if a schedule date is provided on the command line, the incrementer shouldn't try to generate a new one.
5) We should provide a "DailyScheduleDateJobParametersIncrementer" (or Overrider) example as part of the distribution, and since it's simple enough, how about a couple of other simple utility ones, such as "MonthEndScheduleDateJobParametersIncrementer"
6) We should provide a framework mechanism for identifying business / non-business days to make the stuff I mention in (5) more usable
7) We should either allow several of these to be injected (and potentially Ordered) OR include a composite version that allows several of these to be injected (and again, potentially Ordered - order will be very important if several can be used).
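Points (3), (4) and (7) together could look roughly like the sketch below. None of these types exist in Spring Batch: `ParametersOverrider`, `CompositeOverrider` and the `allowOverride` flag are hypothetical names, plain maps stand in for job parameters, and `order()` only imitates the idea of Spring's Ordered interface rather than using it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical contract: sees current and last parameters, proposes values. */
interface ParametersOverrider {
    /** Lower values run first. */
    int order();

    /** Returns the parameters this overrider wants to set. */
    Map<String, String> propose(Map<String, String> current, Map<String, String> last);
}

/** Hypothetical composite: applies several overriders in a deterministic order. */
final class CompositeOverrider {
    private final List<ParametersOverrider> delegates;
    private final boolean allowOverride;

    CompositeOverrider(List<ParametersOverrider> delegates, boolean allowOverride) {
        this.delegates = new ArrayList<>(delegates);
        this.delegates.sort(Comparator.comparingInt(ParametersOverrider::order));
        this.allowOverride = allowOverride;
    }

    /**
     * Starts from the provided (command-line) parameters and applies each
     * delegate in order. Later delegates see the values set by earlier ones.
     * Unless allowOverride is true, keys that were provided are never changed.
     */
    Map<String, String> apply(Map<String, String> provided, Map<String, String> last) {
        Map<String, String> result = new LinkedHashMap<>(provided);
        for (ParametersOverrider delegate : delegates) {
            for (Map.Entry<String, String> e : delegate.propose(result, last).entrySet()) {
                if (allowOverride || !provided.containsKey(e.getKey())) {
                    result.put(e.getKey(), e.getValue());
                }
            }
        }
        return result;
    }
}
```

With `allowOverride` set to false, a schedule date given on the command line survives even if a schedule-date overrider is injected, while a second overrider can still add, say, a generated run number.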
This still doesn't solve the random number for filenames problem, btw, unless you store "overridden" parameters differently than provided parameters and create some special logic to aid with restart -- otherwise the user would have to remember the random number generated the first time in between invocations. I still vote for the %RANDOM% marker in the ResourceProxy...
If I think of more, it will be forthcoming...