There are problems affecting several Drupal and AIS applications. They are linked to the planned intervention https://test-static-03.web.cern.ch/planned-intervention/firmware-upgrade-some-components-safe-host-controllers/24-04-2012
On the first cluster, the upgrade of one controller was transparent; the second node was upgraded three weeks ago, and that upgrade was also transparent. The same procedure was applied to the second cluster, dbnasg403/dbnasg404, but there the upgrade resulted in one controller freezing, which affected the services running on it (drupal, nova, aisrac, hammercloud-atlas, AIS). Although we tried to minimize the freeze time, some database services were affected. The downtime was less than 15 minutes. A postmortem and further analysis with NetApp are ongoing (a call with NetApp has been opened).
We apologise for the inconvenience caused by this incident.
Updates
Second instance of AIS RAC
The second instance of AIS RAC prod was evicted by the clusterware, causing PPT services that are bound to instance 2 rather than using service names to fail.
Instance 2 is back as of 17:44.
PPT services required an application server restart and have been back to normal operation since 18:45.
IT-DB
Third instance of production
The third instance of production AIS RAC was overloaded and we have restarted it.
We are investigating the root cause of the problem.
Services are back to normal (except PPT).
IT-DB