
Key: BATCH-657
Type: Sub-task
Status: Open
Priority: Critical
Assignee: Lucas Ward
Reporter: Lucas Ward
Votes: 0
Watchers: 1

Spring Batch
BATCH-675

JobParametersIncrementer

Created: 07/Jun/08 06:44 PM   Updated: 29/Aug/08 06:43 AM
Component/s: Core
Affects Version/s: 1.0.1
Fix Version/s: None

Time Tracking:
Original Estimate: 0.12d
Remaining Estimate: 0.12d
Time Spent: Not Specified


Description
Since 1.0 came out we have been pushing users to understand that JobInstance = Job + JobParameters. For the most part, I think people understand the relationship. The one issue that has been brought up a few times is how they manage the changing job parameters in order to create a new instance. We've used the 'get current time' approach to always give a new instance, but this isn't really something that works well in production (although it's a great testing approach). I've been working with a few clients, and the problem of managing parameters always seems to be pushed off at some level to be handled by the client of JobLauncher. At one client, they created a separate table, similar to our JobParameters table, that kept track of their parameters and dates. We have always encouraged this because the framework has absolutely no idea what makes one instance different from another. However, we're pushing an awful lot of work onto users. The use case below is an example of something I have to build for a client that illustrates the problem:

This particular application needs to check a schedule set up by users of an online application to determine when certain files are to be uploaded into the system. For this reason, it's imperative that the daily batch job knows exactly which day it's running for. For example, if run on Monday, the job may find and schedule 10 files to be uploaded for that day. On Tuesday, it will likely find an entirely different set of files. The days on which these files are loaded are important for the system.

It's not a particularly difficult problem to solve. Generally speaking, it should be solved using the 'Schedule Date' pattern. In this pattern, a job is started with, and identified by, a schedule date that determines the data on which the job will operate. For example, using the job above, if there is a catastrophic error on Monday and the job cannot be run until Tuesday, the job will be run with a schedule date of Monday, even though the day it's being run on is Tuesday, thus ensuring the files that should have been loaded on Monday eventually get loaded. It's a fairly common batch pattern that's been around for a very long time. To implement it with Spring Batch, I need to create a separate set of tables outside of the architecture, or require the scheduler itself to keep track of it (some do). If I were to create tables, I would need to check the tables to see the last schedule date that was run successfully, increment it, and then store the fact that it was incremented. Once the job returns, I need to flip that record back to indicate that it finished successfully. There are also a few potential error scenarios that must be thought of, for example, some type of error that happens between the launcher returning and this wrapper writing out the status change. It's not a particularly hard problem to solve. I could easily catch a JobInstanceAlreadyCompleteException to help handle this scenario. However, it's a lot to ask of developers, and I know many might have issues handling the error scenarios correctly.

What I would like to propose is that the framework simply tell the developer when they need to 'increment' the parameters. Part of the reason why we have always pushed this responsibility to outside the framework is because we have no way of knowing how they want them to be incremented and under what circumstances. For example, we can't assume a date should be incremented by one day. There could be many scenarios when it should be incremented by an entire week. In other cases, certain parameters may need to stay constant for the most part. In my opinion, the solution to this problem is to simply provide a hook for users to tell us how parameters should be incremented. The interface would look something like the following:

public interface JobParametersIncrementer {

  JobParameters incrementParameters(JobParameters jobParameters);
}

The solution for my use case above would be the following:

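JobParameters incrementParameters(JobParameters jobParameters){
  // scheduleDate holds the parameter key; incrementDate is a user-supplied helper
  return new JobParametersBuilder()
      .addString(scheduleDate, incrementDate(jobParameters.getString(scheduleDate)))
      .toJobParameters();
}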

We could probably even provide a new constructor for the builder that takes existing parameters, so that only the one parameter that needs to change could be incremented.
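
As a rough illustration of the schedule-date incrementer and the "copy the existing parameters, change only one" idea, here is a minimal sketch. To keep it runnable without Spring Batch on the classpath, a plain Map<String, String> stands in for JobParameters. The ParametersIncrementer and DailyScheduleDateIncrementer names, the "scheduleDate" key, and the ISO date format are all assumptions made for the example, not framework API.

```java
import java.time.LocalDate;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for the proposed hook; a Map plays the role of JobParameters here.
interface ParametersIncrementer {
    Map<String, String> incrementParameters(Map<String, String> parameters);
}

// Moves the (assumed) "scheduleDate" parameter forward by one day and
// copies every other parameter unchanged.
class DailyScheduleDateIncrementer implements ParametersIncrementer {
    static final String KEY = "scheduleDate"; // hypothetical parameter name

    @Override
    public Map<String, String> incrementParameters(Map<String, String> parameters) {
        String current = parameters.get(KEY);
        if (current == null) {
            throw new IllegalArgumentException("Missing parameter: " + KEY);
        }
        // Copy first, so that only the one parameter that changes is touched.
        Map<String, String> next = new LinkedHashMap<>(parameters);
        next.put(KEY, LocalDate.parse(current).plusDays(1).toString());
        return Collections.unmodifiableMap(next);
    }
}
```

A weekly job would differ only in the plusDays(1) call, which is the point of leaving the increment rule to the user.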

I'm not sure where exactly this would belong in the API; it seems like the JobLauncher could take it as a parameter. If the JobRepository reports that the job is already complete, the launcher could call the incrementer, then try to create an execution again (possibly using a configuration option to determine whether it should try). If there's an issue the second time, it could fail; if not, it should run as normal. This would allow the framework to manage the task of checking for completion, but allow the user to determine how the parameters should be changed as a result.
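
The launcher flow described here could look roughly like the sketch below. Everything in it is a toy stand-in written for illustration: InMemoryRepository, AlreadyCompleteException, and RetryingLauncher are hypothetical names, a Map<String, String> replaces JobParameters, and a UnaryOperator replaces the incrementer. It only shows the control flow (try, increment once on "already complete", try again), not how the real JobLauncher or JobRepository behave.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Stand-in for "the repository says this instance is already complete".
class AlreadyCompleteException extends RuntimeException {
    AlreadyCompleteException(String message) { super(message); }
}

// Toy repository: remembers which parameter sets have completed.
class InMemoryRepository {
    private final Set<Map<String, String>> completed = new HashSet<>();

    void createExecution(Map<String, String> parameters) {
        if (completed.contains(parameters)) {
            throw new AlreadyCompleteException("Already complete: " + parameters);
        }
    }

    void markComplete(Map<String, String> parameters) {
        completed.add(new TreeMap<>(parameters)); // defensive copy
    }
}

// Toy launcher: on "already complete" it asks the incrementer for new
// parameters and tries exactly once more, if configured to do so.
class RetryingLauncher {
    private final InMemoryRepository repository;
    private final UnaryOperator<Map<String, String>> incrementer;
    private final boolean incrementOnComplete; // the "configuration option"

    RetryingLauncher(InMemoryRepository repository,
                     UnaryOperator<Map<String, String>> incrementer,
                     boolean incrementOnComplete) {
        this.repository = repository;
        this.incrementer = incrementer;
        this.incrementOnComplete = incrementOnComplete;
    }

    // Returns the parameters the job actually ran with.
    Map<String, String> run(Map<String, String> parameters) {
        Map<String, String> effective = parameters;
        try {
            repository.createExecution(effective);
        } catch (AlreadyCompleteException e) {
            if (!incrementOnComplete) {
                throw e;
            }
            effective = incrementer.apply(parameters);
            repository.createExecution(effective); // a second failure propagates
        }
        // ... the job itself would execute here ...
        repository.markComplete(effective);
        return effective;
    }
}
```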

Douglas C. Kaminsky added a comment - 08/Jun/08 08:39 PM
This draws the conversation back to the idea that "restartable" is not an easily understood parameter. Remember, we were discussing the idea that there are several different types of restart behavior that aren't necessarily encapsulated by "true" and "false", e.g.

- never try to restart

- only try to restart if the job is not completed

- always try to restart (regardless of completion status, might find no work to do)

- force a fresh run from the beginning if the job is already complete

So here we're talking about forcing the fresh run of the job. I like this as a feature, since it allows jobs with no explicit parameters to be run more than once.

A few things about this solution:
1) I don't know if incrementer is really what this is - it can be used to increment, but it's more of an "Overrider"
2) The logging for this mechanism needs to be very clear
3) This mechanism has access to the currently provided job parameters. However, it's a whole lot more useful if it has both the currently provided parameters AND the last parameters used for the particular Job (i.e. the parameters for the last job instance created for this job). That way you could potentially create an "incrementer" that allows the user to not have to specify schedule date each time, but rather if it's not included in current params to go to the last run of the job and increment from there.
4) Should a job configuration be allowed to place limitations on what can and can't be overridden? that is, if the job only will allow an injected "incrementer" to add new properties vs. change the value of those provided on the command line -- e.g. as in (3), if a schedule date is provided on the command line, the incrementer shouldn't try to generate a new one
5) We should provide a "DailyScheduleDateJobParametersIncrementer" (or Overrider) example as part of the distribution, and since it's simple enough, how about a couple of other simple utility ones, such as "MonthEndScheduleDateJobParametersIncrementer"
6) We should provide a framework mechanism for identifying business / non-business days to make the stuff I mention in (5) more usable
7) We should either allow several of these to be injected (and potentially Ordered) OR include a composite version that allows several of these to be injected (and again, potentially Ordered - order will be very important if several can be used).
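
To make points (3) and (4) concrete, here is a minimal sketch of an incrementer that sees both the parameters supplied for this launch and the parameters of the last instance, and never overrides a value that was supplied explicitly. The class name, the "scheduleDate" key, the use of null for "no previous run", and the Map-based stand-in for JobParameters are assumptions made for the example.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

class LastRunAwareIncrementer {
    static final String KEY = "scheduleDate"; // hypothetical parameter name

    private final LocalDate initialDate; // used when the job has never run

    LastRunAwareIncrementer(LocalDate initialDate) {
        this.initialDate = initialDate;
    }

    // supplied: parameters given for this launch (e.g. on the command line)
    // lastUsed: parameters of the most recent instance, or null if there is none
    Map<String, String> resolve(Map<String, String> supplied, Map<String, String> lastUsed) {
        Map<String, String> result = new HashMap<>(supplied);
        if (result.containsKey(KEY)) {
            return result; // an explicitly supplied value is never overridden
        }
        String last = (lastUsed == null) ? null : lastUsed.get(KEY);
        LocalDate next = (last == null)
                ? initialDate                        // first ever run
                : LocalDate.parse(last).plusDays(1); // continue from the last run
        result.put(KEY, next.toString());
        return result;
    }
}
```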

This still doesn't solve the random number for filenames problem, btw, unless you store "overridden" parameters differently than provided parameters and create some special logic to aid with restart -- otherwise the user would have to remember the random number generated the first time in between invocations. I still vote for the %RANDOM% marker in the ResourceProxy...

If I think of more, it will be forthcoming...

Dave Syer added a comment - 09/Jun/08 02:20 AM
The meta-data needed to track the scheduling concern is so trivial, I can't help thinking this is going to be over-engineered if we aren't careful. All the user needs is a JobInstance ID and a status (plus the job name to make it easier to understand for humans). In fact that data is already available in existing batch meta-data tables, just probably not in the form that users would ideally like for all purposes. We even had part of this debate a while ago when we removed the status column from the JobInstance table for normalization reasons (I was opposed, but not strongly at the time). All I think we need is a new repository method (maybe in JobRepository, maybe in a new interface).

Lucas Ward added a comment - 09/Jun/08 10:02 AM
Dave, I completely understand your feelings about potential over-engineering, and I think Doug's example illustrates this potential. However, every time I write a batch job at a client, I end up having to deal with this issue. It would be fine if it didn't require persistence, but it almost always does. The way I see it, there are three issues:

* Initial State: If the job has never been run before, what should the starting value be? The most common way I've seen this handled has been a seed record, but there are probably other ways as well.

* Determining the status of the last execution run. I think this is what you were referring to above Dave, and it's implicitly required in my description above. It's actually the first thing I was thinking of about how to solve this issue. Simply add a method to the repository that will return the last execution (or instance) for a given Job. At least then, they wouldn't have to store the status in a separate table.

* 'Incrementing' the parameters. (I still think the name is correct) If the job is complete, you need to be able to move the parameters forward to the next logical value. This could be a new date, or a new number, or even a new file, depending upon the implementation.

These issues have to be solved for almost every batch job, every time. I know that developers could do the incrementing themselves after calling the repository directly, but as I sat thinking about it, I wondered if that was really necessary. If we had a mechanism as described above, it would solve the only problem I could think of: we don't know how the parameters should be incremented.
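
For the second issue, the "return the last execution (or instance) for a given Job" lookup might look something like the sketch below. RunHistory, RunRecord, and findLastRun are hypothetical names, and the in-memory list stands in for the existing meta-data tables; the real repository method, if added, could well have a different shape.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// One recorded run: the parameters used and whether it completed.
class RunRecord {
    final Map<String, String> parameters;
    final boolean completed;

    RunRecord(Map<String, String> parameters, boolean completed) {
        this.parameters = parameters;
        this.completed = completed;
    }
}

// Toy stand-in for the batch meta-data, keyed by job name.
class RunHistory {
    private final Map<String, List<RunRecord>> runsByJob = new HashMap<>();

    void record(String jobName, Map<String, String> parameters, boolean completed) {
        runsByJob.computeIfAbsent(jobName, name -> new ArrayList<>())
                 .add(new RunRecord(new HashMap<>(parameters), completed));
    }

    // The hypothetical new repository method: most recent run of a job, if any.
    Optional<RunRecord> findLastRun(String jobName) {
        List<RunRecord> runs = runsByJob.get(jobName);
        if (runs == null || runs.isEmpty()) {
            return Optional.empty(); // never run: the caller falls back to its initial state
        }
        return Optional.of(runs.get(runs.size() - 1));
    }
}
```

An empty result covers the initial-state case, and the completed flag tells the caller whether to increment or to rerun with the same parameters.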

Douglas C. Kaminsky added a comment - 09/Jun/08 12:44 PM
Initial state can be handled by specific implementations if you want to reduce complexity of the solution.

e.g.

String x = parameters.getString("x");
if (x == null) {
  x = initialState;
} else {
  x = increment(x);
}

Douglas C. Kaminsky added a comment - 09/Jun/08 01:01 PM
Don't confuse a stream-of-thought response with an attempt to over-engineer. As you know me fairly well, my ideas can generally use a bit of refining -- that was just an initial set of thoughts.

1 and 2 were just suggestions, take 'em or leave 'em

3 I believe was implicit to the solution - I was just clarifying.

4 is important if you want solutions to be able to re-use incrementers. This also mirrors the functionality of Spring Core's property configurer / overrider, which allows you to specify whether you want to allow system properties to override the values. In this case, instead of system properties vs. property file we're talking about command-line arguments vs. generated arguments.

Admittedly, 5 and 6 above are perhaps a case of over-engineering. I firmly believe that the rest of the points are not.

5 and 6 arise from my thoughts about a few standard use cases for incrementing parameters:

UC1) Daily jobs - the scheduleDate property needs to be incremented from day to day - this is the most trivial use case and not a very convincing one in favor of Lucas' solution, since this is the easiest one to relegate to the scheduler

UC2) Weekly, Monthly, Quarterly jobs - the scheduleDate property needs to be set to the first, last, or a given day each week, month or quarter --- however, think of the following, a REAL LIFE scenario I encountered in a previous position. When you run job X, it goes to a vendor's FTP server and downloads the file for the previous month. This is run every day since the file can be updated on a daily basis to reflect corrections to numbers from the previous month. Now, suppose you're running the job for June, but the 31st of May fell on a Saturday and your vendor names his files based on the last BUSINESS day of each month. An incrementer that just picks the last calendar day of each month would not suffice. You would need an incrementer that figures out the last BUSINESS day of each month. That's why I suggested incorporating some sort of rudimentary mechanism for identifying business days -- e.g. a property file -- this may be something that you want the user to deal with, for now, but it's a VERY COMMON CONCERN.
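
A rudimentary version of the business-day mechanism could be as small as the sketch below. It assumes a Saturday/Sunday weekend and a caller-supplied set of holidays (which could be loaded from a property file); the BusinessDays class and its method names are invented for the example.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.Set;

class BusinessDays {
    private final Set<LocalDate> holidays; // caller-supplied non-business days

    BusinessDays(Set<LocalDate> holidays) {
        this.holidays = holidays;
    }

    // Assumes a Saturday/Sunday weekend; other calendars would need a different rule.
    boolean isBusinessDay(LocalDate date) {
        DayOfWeek day = date.getDayOfWeek();
        return day != DayOfWeek.SATURDAY
                && day != DayOfWeek.SUNDAY
                && !holidays.contains(date);
    }

    // Walks backwards from the last calendar day until a business day is found.
    LocalDate lastBusinessDayOf(YearMonth month) {
        LocalDate candidate = month.atEndOfMonth();
        while (!isBusinessDay(candidate)) {
            candidate = candidate.minusDays(1);
        }
        return candidate;
    }
}
```

For the scenario above, with no holidays configured, lastBusinessDayOf(YearMonth.of(2008, 5)) gives Friday 30 May 2008 rather than Saturday the 31st.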

UC3) Parameter-less jobs - if a job doesn't need parameters and can be run an arbitrary number of times each day, the incrementer can generate a random number each time

7 arises from the following thought:

Suppose you have a job that runs once per week but can be run from start to finish an arbitrary number of times. You couldn't just use the incrementer from UC2 since it would collide with the previous instance after the first run. However, if you could use BOTH the incrementer from UC2 and the incrementer from UC3, you could accomplish this goal. I suggested up front offering this with deference to the Ordered interface since you could potentially have a use case for using 2 incrementers that affect the same parameter and would need deterministic ordering of the two.
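
A composite along these lines might be sketched as follows. List order takes the place of the Ordered interface here, a Map<String, String> stands in for JobParameters, and the CompositeIncrementer and Incrementers names and the "run.id" key are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.UnaryOperator;

// Applies each delegate in list order; later delegates see earlier results.
class CompositeIncrementer implements UnaryOperator<Map<String, String>> {
    private final List<UnaryOperator<Map<String, String>>> delegates;

    CompositeIncrementer(List<UnaryOperator<Map<String, String>>> delegates) {
        this.delegates = new ArrayList<>(delegates);
    }

    @Override
    public Map<String, String> apply(Map<String, String> parameters) {
        Map<String, String> current = parameters;
        for (UnaryOperator<Map<String, String>> delegate : delegates) {
            current = delegate.apply(current);
        }
        return current;
    }
}

class Incrementers {
    // UC3-style: adds a fresh unique value under the (hypothetical) "run.id" key.
    static UnaryOperator<Map<String, String>> uniqueRunId() {
        return parameters -> {
            Map<String, String> next = new HashMap<>(parameters);
            next.put("run.id", UUID.randomUUID().toString());
            return next;
        };
    }

    // Sets one parameter to a fixed value; handy for showing that order matters.
    static UnaryOperator<Map<String, String>> set(String key, String value) {
        return parameters -> {
            Map<String, String> next = new HashMap<>(parameters);
            next.put(key, value);
            return next;
        };
    }
}
```

When two delegates write the same key, the one later in the list wins, which is why a deterministic order is needed.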

You could relegate some of these to the scheduler, but you're counting on a level of functionality that may or may not be present (cron, of course, does not support any of these scenarios - I maintain Spring Batch should not be prejudicial against those who don't use a commercial scheduler or quartz)

Douglas C. Kaminsky added a comment - 09/Jun/08 01:09 PM
Speaking of property placeholder configurers, do we provide a JobParametersBuilder that creates JobParameters based on properties file? That would be a useful complement to this functionality.
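
Purely as an illustration of the idea, a properties-based builder could be little more than the sketch below. It uses only java.util.Properties and returns a Map<String, String> as a stand-in for JobParameters; PropertiesParameters and fromProperties are hypothetical names, not an existing Spring Batch class.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class PropertiesParameters {
    // Reads key=value pairs in java.util.Properties format into a sorted map.
    static Map<String, String> fromProperties(Reader source) {
        Properties properties = new Properties();
        try {
            properties.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> parameters = new TreeMap<>();
        for (String name : properties.stringPropertyNames()) {
            parameters.put(name, properties.getProperty(name));
        }
        return parameters;
    }
}
```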

On a side note:

Please understand that I try to think in terms of the end-user (read: end-developer) experience. The basic problem is this: if there is a feature that N out of every 100 end users will want to use, it should be implemented and provided as part of the standard package. How do we determine N? Well, I leave that to you, but as an end-user-sympathizer, I have a pretty low value for N.

My only request is for you guys to be consistent. There is some pretty esoteric (albeit wonderful and useful) stuff already in the framework (see: BATCH-333, et al). This being one of the most requested features and a long-standing point of contention indicates to me that there is a real demand for this type of functionality.

Wayne Lund added a comment - 09/Jun/08 01:29 PM
1. Override is the term some of our batch architects use for this feature (e.g. running a batch job at a time different from the originally scheduled date). However, the known original scheduled date was in the enterprise scheduler and it was an ops override to run off schedule. What I can't remember is how the schedule was defined (we were using Tivoli) to not run batch jobs that were dependent on sequential ordering of job executions before previous jobs were caught up.

2. Not sure what's different about logging.

3. I think Dave described how the meta-data from a previous run is already available but requires a little explanation on how best to retrieve it (or a new interface or something that exposes it in a more accessible fashion). I like Lucas' suggestion above. I don't know what is meant by a "seed record".

4. Not sure what to think of the discussion on incrementers. And part of the reason I'm puzzling is that I believe we did think of it more as override than incrementing anything. I'm going to forward to Tsay and get his take on the topic.

[5..6]. It comes to the issue again of who knows about scheduling. I think it's not over-engineering in the big picture, but it is for Spring Batch, where we don't want to get into the business of scheduling. That's the job of enterprise schedulers or Quartz. It's been some time since I looked at the possibilities of Quartz integration (and Lucas has looked closer than I have), but I think it already solves UC2 with their framework.

Also, we get questions about "ftp job types" on a fairly regular basis and for right now have handed off some of the ideas to our integration team (internally). We have a few people looking at how we'd do this with Spring-Integration + Spring Batch. I don't want to re-introduce ftp job types like we had on my previous batch project.

7. I disagree with moving into the scheduling space but I could be swayed. Hopefully Tsay will have an opinion on this as well.

Douglas C. Kaminsky added a comment - 09/Jun/08 02:01 PM
It's all a matter of balance - I never bought the argument that feature X shouldn't be part of Spring Batch because X is a scheduler concern, etc. If it's a useful feature and people want it, it shouldn't matter whether or not it makes SB a theoretically impure solution. Frankly, it's unrealistic to think that any framework of any sort can stay entirely independent of the concerns of its surrounding environment and remain relevant.

Spring Batch WILL need hooks into schedulers and other commonly-used frameworks, it WILL need a clean way to manage File Transfer, and it WILL need to handle at least a reasonable subset of scheduling concerns in a clean, easy-to-use way.

I had this argument with everyone awhile back - we were discussing some feature that I argued was too convoluted to configure. Basically, everyone said "well, the new version of Spring IDE will support imports, so that makes this REALLY easy to do," to which I replied "Well, not everyone uses Spring IDE and we have no right to mandate it. Some users will write their jobs in notepad and we have just as much responsibility to them." No one believed me, but several months later I was vindicated on the forums --- sure enough, people were using notepad as their editor. (As an aside: I don't advocate this, I just don't discriminate).

In my opinion, the framework's usefulness should ABSOLUTELY NOT be strongly coupled to the IDE or scheduler that you use in conjunction with it, even if it is a recommended solution. If you want to add some niceness for people using a particular scheduler, that's fine. If you don't want to reinvent the wheel, that's understandable. However, if I as an end user need to hack together a set of custom database tables to achieve the simple piece of functionality that is a fairly common need, then I don't consider "that's a scheduler concern" to be a valid answer.

To be blunt: I argue that there's no validity in saying that you won't implement it purely because it violates the "purity" of your product. Explain to the users why you don't want to maintain it, why you can't come up with an elegant way to do it, or why you've chosen scheduler X as the official scheduler for Spring Batch as opposed to supporting the end user's judgment -- these are valid arguments that I think any user would respect more than "it's a scheduling concern. QED".