Job in Error State moved before qdel was called

A failed job was moved back from the cluster before qdel was called or a resubmit was attempted.


Id: 05a8f2bb1c825244dcb42cd02ab8221a48938530
Type: bugfix
Creation time: 2010-11-16 14:17
Creator: Mathew Topper <mathew.topper@...>
Release: 0.2 (released 2011-02-09)
Status: closed: fixed
In progress: 2 weeks

Issue log

2011-02-09 11:44 Mathew Topper <mathew.topper@...> closed with disposition fixed
Have completed a successful test, so I assume it is working OK with the new system now.
2011-02-08 16:17 Mathew Topper <mathew.topper@...> commented
Had an issue with hitting the recursion limit when waiting for the job to start so have turned check_status into an infinite while loop with exits. There is still some recursion with jobs restarting but not infinite any more. Unfortunately the grid engine is balls, so I can't test how the code is handling completed jobs.
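For illustration, the change from recursion to a loop with exits could look something like the sketch below. This is not the actual check_status code: `wait_for_start`, `get_state`, the state strings and the `max_polls` cap are all hypothetical names chosen for the example.

```python
import time


def wait_for_start(get_state, poll_interval=30.0, max_polls=10000):
    """Poll until the job leaves the (hypothetical) "pending" state.

    get_state is any callable returning the current job state as a string.
    A plain loop is used instead of the function calling itself, so a long
    wait cannot hit Python's recursion limit.
    """
    polls = 0
    while True:
        state = get_state()
        if state != "pending":
            return state          # exit: the job has started, failed, etc.
        polls += 1
        if polls >= max_polls:
            return "timeout"      # exit: stop waiting after max_polls checks
        time.sleep(poll_interval)
```

A recursive version of the same wait would fail after roughly 1000 polls with Python's default recursion limit, whereas the loop only ever uses one stack frame.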
2011-02-07 15:40 Mathew Topper <mathew.topper@...> commented
Have changed the control system now to look for actual output from qsub and qacct rather than just a non-response from qsub. As long as the job was not deleted, qacct should record statistics for jobs that have left the queue. If neither qstat nor qacct is giving an answer then something has gone wrong and the job should be resubmitted. I've put some controls on the number of times a job should resubmit in this case, so it doesn't just keep flogging a dead horse.
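The decision described in this comment can be sketched roughly as follows. This is a simplified illustration, not the real control system: `next_action`, `Action` and `max_resubmits` are hypothetical names, and the function takes the text already captured from qstat and qacct rather than running the commands itself.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"          # job is still known to the queue
    COLLECT = "collect"    # job has left the queue and has accounting data
    RESUBMIT = "resubmit"  # job has vanished, try again
    GIVE_UP = "give_up"    # job has vanished too many times


def next_action(qstat_output, qacct_output, resubmits, max_resubmits=3):
    """Choose the next step from the captured qstat and qacct output.

    Assumption: empty (or whitespace-only) output means the command gave
    no answer for this job.
    """
    if qstat_output.strip():
        return Action.WAIT
    if qacct_output.strip():
        return Action.COLLECT
    # Neither command knows the job: something has gone wrong.
    if resubmits < max_resubmits:
        return Action.RESUBMIT
    return Action.GIVE_UP
```

Capping the resubmit count is what stops a permanently broken job from being resubmitted forever.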
2011-01-31 16:19 Mathew Topper <mathew.topper@...> changed status from paused to in_progress
OK, so the issue here is uncontrolled restarting of the queue, which will cause all sorts of problems (e.g. completely losing the job). I think the best way around this is to keep trying to start a job until files are generated. I don't know how this affects the restarting process.
2010-11-24 09:38 Mathew Topper <mathew.topper@...> changed status from in_progress to paused
Just waiting for the issue to reoccur and hope that the logging picks it up!
2010-11-24 09:37 Mathew Topper <mathew.topper@...> changed status from closed to in_progress
Problem reoccurred of a job ending up in the error state but not being deleted and possibly resubmitted before being returned. Bringing this one back to life.
2010-11-23 12:08 Mathew Topper <mathew.topper@...> closed with disposition fixed
In the last run a job went into error state and was resubmitted successfully so I'm going to close this one.
2010-11-23 10:43 Mathew Topper <mathew.topper@...> changed status from in_progress to paused
I think the adjustments have sorted it out, but as I can't force a job into the error state, I am going to pause this issue until another example can be found. One concern is that it may not report what has happened unless a better logging system is implemented.
2010-11-16 15:38 Mathew Topper <mathew.topper@...> commented
Ok, so the problem is in SSHSGE.wait. There was an interval between the check for the Eqw state and the /dev/null test. I'm not sure what that test looks like for an Eqw type job. I can't seem to make a job go into this state, which is annoying, so I've issued a command to check the reason for the issue if it does happen.
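As a rough illustration of the Eqw check (not the code in SSHSGE.wait), the state of a job can be read from captured qstat text. The sketch assumes the default qstat listing, where the job ID is the first whitespace-separated column and the state is the fifth; `job_state` and `in_error_state` are hypothetical helper names.

```python
def job_state(qstat_output, job_id):
    """Return the state column for job_id, or None if the job is not listed.

    Assumes one job per line with the ID in column 1 and the state in
    column 5, and a job name without spaces.
    """
    for line in qstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == str(job_id):
            return fields[4]
    return None


def in_error_state(state):
    """True for states such as Eqw, where the E flag marks an error."""
    return state is not None and "E" in state
```

Doing the lookup and the error test on one captured qstat output, rather than with two separate calls, avoids an interval in which the job's state could change between the checks.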
2010-11-16 14:19 Mathew Topper <mathew.topper@...> commented
I was wondering how to test for this, but I guess a script with an error state would be one way, although I don't know if it would create the behaviour seen by the job failing.
2010-11-16 14:18 Mathew Topper <mathew.topper@...> changed status from unstarted to in_progress
2010-11-16 14:17 Mathew Topper <mathew.topper@...> created