Puppet runs failing

 
Date & time of incident: 
Friday, June 14, 2013 - 09:00
Incident Description: 

The Configuration Management infrastructure, in particular the Puppet masters, failed to cope with a mass installation on Friday 14th June, which left the main Puppet masters swamped. To address this, extra capacity is being added to the Puppet masters (going from 4 nodes to 14), and that capacity is being split between batch and interactive. In the meantime, the mass installations have been halted.

UPDATE 18th June 2013: We are currently experiencing problems installing some of the new nodes that will be used as Puppet masters. We expect these problems to be solved before the end of the day.

UPDATE 20th June 2013: The new Puppet masters are now installed, but experts are still working on the problem because Puppet 3 compatibility issues have been found.

UPDATE 21st June 2013: We have reverted to Puppet 2, as there are some Puppet 3 compatibility issues which will require further investigation. The service is now up.

Service Element Affected: 
IT Operations Support Service
Impact: 
Service is degraded
Status: 
Resolved
Resolution date: 
Friday, June 21, 2013 - 17:00
Posted by: 
IT-PES
Unit responsible for resolution: 
IT Department