Date & time of incident:
Saturday, June 22, 2013 - 14:00
Post date:
Monday, June 24, 2013 - 10:59
Incident Description:
A significant fraction of the SLC6 batch capacity has crashed recently with kernel panics. We are investigating the cause and are restarting the machines.
The SLC5 batch capacity is unaffected.
(Jobs on nodes which have crashed will appear in the UNKWN state until the node recovers, at which point the jobs will be marked as failed.)
Service Element Affected:
Batch Service
Impact:
Service is degraded
Status:
Resolved
Resolution date:
Monday, August 12, 2013 - 10:00
Posted by:
IT-PES
Unit responsible for resolution:
IT Department
Updates
The service has been noticeably and consistently more reliable for a week now. The protection we put in place is proving effective.
We now have a workaround in place which appears to be effective in preventing the kernel crashes. We'll keep an eye on the service for a few more days before confirming that it is back to normal.
The number of crashes should be decreasing, and appears to be. We have improved the worker nodes' protection against crashes. We will check how the situation develops over the next few days to make sure the improvement is sustainable.
Crashes are still occurring. We have identified a pattern which may lead us to the root cause of the problem. We are running new tests to further understand it.
Crashes are still occurring. We are in the process of performing various changes to the worker nodes which should improve their general stability. We are also collecting debugging information to help us understand the root cause of the problem.