Batch nodes failing to kill processes and repeatedly resending notification e-mails to users

 
Date & time of incident: 
Friday, April 27, 2012 - 16:21
Incident Description: 

When a batch worker node runs out of memory, a protective mechanism is triggered that kills the largest or most rapidly growing process. This is done to protect the worker node.

In such a case, the owner of the process is notified by e-mail. On some batch nodes the offending process did not exit, so the user was notified over and over again.
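
For illustration only, the behaviour described above can be sketched in Python. This is not the code running on the batch nodes. The names `ProcSample`, `pick_victim`, `OnceNotifier` and the `growth_threshold_kb` parameter are hypothetical, and the rule for choosing between the largest and the fastest-growing process is an assumption. The sketch also shows one possible way to avoid repeated e-mails when a process does not exit: remember which processes have already been reported.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set, Tuple


@dataclass(frozen=True)
class ProcSample:
    """Two consecutive memory readings for one process (hypothetical structure)."""
    pid: int
    owner: str
    rss_kb: int       # resident memory at the latest sample
    prev_rss_kb: int  # resident memory at the previous sample

    @property
    def growth_kb(self) -> int:
        return self.rss_kb - self.prev_rss_kb


def pick_victim(samples: Iterable[ProcSample], growth_threshold_kb: int) -> ProcSample:
    """Pick the process to kill.

    Assumed policy: if the fastest-growing process grew by more than
    growth_threshold_kb since the previous sample, pick it; otherwise
    pick the process using the most memory.
    """
    procs = list(samples)
    if not procs:
        raise ValueError("no processes to choose from")
    fastest = max(procs, key=lambda p: p.growth_kb)
    if fastest.growth_kb > growth_threshold_kb:
        return fastest
    return max(procs, key=lambda p: p.rss_kb)


class OnceNotifier:
    """Send at most one notification per (pid, owner) pair."""

    def __init__(self, send: Callable[[str, str], None]) -> None:
        self._send = send  # e.g. a function that sends an e-mail
        self._notified: Set[Tuple[int, str]] = set()

    def notify_once(self, proc: ProcSample) -> bool:
        key = (proc.pid, proc.owner)
        if key in self._notified:
            # Already reported: the process may still be alive, but the
            # owner is not notified a second time.
            return False
        self._notified.add(key)
        self._send(proc.owner, f"Process {proc.pid} was selected for termination "
                               f"because the worker node ran out of memory.")
        return True
```

With such a guard, a process that survives the kill attempt would be reported to its owner only once, even if the protection mechanism selects it again on every later check.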

The affected hosts have been identified and paused; the issue is under investigation.

They will be made available again as soon as the root cause is fixed.

Service Element Affected: 
Batch Service
Impact: 
Service is degraded
Status: 
Resolved
Resolution date: 
Friday, April 27, 2012 - 19:00
Posted by: 
IT-PES
Unit responsible for resolution: 
IT Department