Job in Error State moved before qdel was called
A failed job was moved back from the cluster before qdel was called or a resubmit was attempted.
Details
Id: | 05a8f2bb1c825244dcb42cd02ab8221a48938530 |
Type: | bugfix |
Creation time: | 2010-11-16 14:17 |
Creator: | Mathew Topper <mathew.topper@...> |
Release: | 0.2 (released 2011-02-09) |
Component: | sshsge.py |
Status: | closed: fixed |
In progress: | 2 weeks |
Issue log
2011-02-09 11:44 | Mathew Topper <mathew.topper@...> | |
Have completed a successful test so I assume it is working ok with the new system now. | ||
2011-02-08 16:17 | Mathew Topper <mathew.topper@...> | |
Had an issue with hitting the recursion limit when waiting for the job to start so have turned check_status into an infinite while loop with exits. There is still some recursion with jobs restarting but not infinite any more. Unfortunately the grid engine is balls, so I can't test how the code is handling completed jobs. | ||
2011-02-07 15:40 | Mathew Topper <mathew.topper@...> | |
Have changed the control system now to look for actual output from qsub and qacct rather than just a non-response from qsub. As long as the job was not deleted, qacct should record statistics for jobs that have left the queue. If qstat is not giving an answer and qacct is also not giving an answer then something has gone wrong and the job should be resubmitted. I've put some controls on the number of times a job should resubmit in this case, so it doesn't just keep flogging a dead horse. | ||
2011-01-31 16:19 | Mathew Topper <mathew.topper@...> | |
OK, so the issue here is uncontrolled restarting of the queue, which will cause all sorts of problems (i.e. completely losing the job). I think the best way around this is to keep trying to start a job until files are generated. I don't know how this affects the restarting process. | ||
2010-11-24 09:38 | Mathew Topper <mathew.topper@...> | |
Just waiting for the issue to reoccur and hope that the logging picks it up! | ||
2010-11-24 09:37 | Mathew Topper <mathew.topper@...> | |
Problem reoccurred of a job ending up in the error state but not being deleted and possibly resubmitted before being returned. Bringing this one back to life. | ||
2010-11-23 12:08 | Mathew Topper <mathew.topper@...> | |
In the last run a job went into error state and was resubmitted successfully so I'm going to close this one. | ||
2010-11-23 10:43 | Mathew Topper <mathew.topper@...> | |
I think the adjustments to sshsge.py have sorted it out, but as I can't force a job into the error state, I am going to stop this issue until another example can be found. One concern is that it may not report what has happened unless a better logging system is implemented. | ||
2010-11-16 15:38 | Mathew Topper <mathew.topper@...> | |
Ok, so the problem is in SSHSGE.wait. There was an interval between the check for the Eqw state and the /dev/null test. I'm not sure what that test looks like for an Eqw type job. I can't seem to make a job go into this state, which is annoying, so I've issued a command to check the reason for the issue if it does happen. | ||
2010-11-16 14:19 | Mathew Topper <mathew.topper@...> | |
I was wondering how to test for this, but I guess a script with an error state would be one way, although I don't know if it would create the behaviour seen by the job failing. | ||
2010-11-16 14:18 | Mathew Topper <mathew.topper@...> | |
2010-11-16 14:17 | Mathew Topper <mathew.topper@...> | |
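For reference, the control flow described in the 2011-02-07 and 2011-02-08 log entries can be sketched roughly as below. This is only an illustration, not the code in sshsge.py: the callables `query_qstat`, `query_qacct` and `resubmit`, the return values, and the default cap of three resubmits are all assumptions made for the example. It assumes each query returns the command's text output as a string, with an empty string meaning "no answer". The point is that the polling is a plain while loop with explicit exits, so a long wait cannot hit the recursion limit, and resubmission is bounded.

```python
import time


def check_status(query_qstat, query_qacct, resubmit,
                 max_resubmits=3, poll_interval=0.0):
    """Watch one job with a loop instead of recursion.

    All three callables are hypothetical stand-ins:
      query_qstat()  -> text qstat printed for the job ("" if nothing)
      query_qacct()  -> text qacct printed for the job ("" if nothing)
      resubmit()     -> submits the job again
    Returns ("finished", n) or ("abandoned", n), where n is the
    number of resubmits that were made.
    """
    resubmits = 0
    while True:
        if query_qstat().strip():
            # qstat still answers: the job is queued or running.
            pass
        elif query_qacct().strip():
            # The job has left the queue and accounting recorded it.
            return "finished", resubmits
        elif resubmits >= max_resubmits:
            # Neither command answers and the cap is reached: stop.
            return "abandoned", resubmits
        else:
            # Neither command answers: assume the job was lost.
            resubmit()
            resubmits += 1
        time.sleep(poll_interval)
```

In a real run `poll_interval` would be some seconds rather than zero, and the callables would wrap whatever mechanism is used to run the commands over SSH.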