Date & time of incident:
Tuesday, June 18, 2013 - 10:00
Post date:
Tuesday, June 18, 2013 - 15:05
Incident Description:
We are again experiencing problems with the cluster CVICLR02FC, which hosts virtual machines for BE-CO.
Due to this instability, the VMs will be rebooted.
We are trying to solve the incident as soon as possible.
Service Element Affected:
Multiple Services
Specific Service detail:
CERN Virtualization Infrastructure
Impact:
Service is degraded
Status:
Resolved
Resolution date:
Mon, Jun 24, 16:00
Posted by:
IT-OIS
Unit responsible for resolution:
IT Department
Updates
The machines are now running on new hardware and have been confirmed working by BE-CO.
We continue to see problems on the cluster, so we will carry out a major intervention to evacuate the machines onto a different set of hypervisors.
We will restart the entire cluster to reconnect it to the updated storage members.
The cluster is still having problems. We are updating the firmware on the storage and evaluating possible alternatives for the hosted VMs.
Update of the cluster finished
We have just updated all the nodes in the cluster and cleaned up all the duplicate entries in the cluster database and the VMM database.
This should restore the stability of the machines.