Date & time of incident:
Monday, June 10, 2013 - 01:00
Post date:
Monday, June 10, 2013 - 11:08
Incident Description:
A significant fraction of the SLC6 batch capacity has recently crashed with kernel panics. We are investigating the cause and restarting the affected machines.
The SLC5 batch capacity is unaffected.
(Jobs on crashed nodes will appear in the UNKWN state until the node recovers, at which point they will be marked as failed.)
-----
The bulk of the SLC6 capacity has now returned. We are monitoring for further crashes.
Service Element Affected:
Batch Service
Impact:
Service is degraded
Status:
Resolved
Resolution date:
Friday, June 21, 2013 - 06:30
Posted by:
IT-PES
Unit responsible for resolution:
IT Department