Date & time of incident:
Friday, October 5, 2012 - 10:20
Post date:
Friday, October 5, 2012 - 11:38
Incident Description:
Following the reboot of 20 hypervisors, some 60 VMs hosted at IT-DB rebooted. The machines were correctly restarted or relocated when the hypervisors came back. This affected multiple AIS and DBoD services; see the separate announcements on the GS and IT SSB.
The service suffered from the same problem on October 5.
Experts have worked with Oracle Support to identify the root cause of this problem and have put a fix in place.
The situation is nevertheless still being closely monitored.
Service Element Affected:
Multiple Services
Impact:
Service is degraded
Status:
Resolved
Resolution date:
Monday, November 12, 2012 - 19:00
Posted by:
IT-DB
Unit responsible for resolution:
IT Department
Updates
Unfortunately, until the root cause of the problem is identified, this incident may (and most likely will) re-appear over the next few days: on average we have experienced one or two reboots per week.
We are therefore keeping this article in the SSB so that users are aware that the service is degraded and may fail. We will update or close the article as soon as the investigation progresses.
Thanks for your understanding.
All the services are back.
We are still working with Oracle Support in order to identify the root cause of this problem.
The last services are being moved out of the problematic machines.
Today at 17:00 some hypervisors rebooted. Two DBoD instances are still down: dod_boinc and dod_lbcertif. The development instance of the Oracle HR application is down too. We are working on it.
We have just experienced another spontaneous mass reboot. The list of services affected is:
- Kitry PROD
- ERTT and ERTD databases
- Oracle HR DEV and TEST application servers (databases stayed up)
All production services have been or are being moved to physical hardware, and the problem now seems to have disappeared. We are still working with Oracle Support to put more diagnostics in place; we will then put load on the systems, since load appears to be one of the triggers of the reboots.
We just experienced another spontaneous reboot of multiple hypervisors. All the VMs are coming back correctly.
No news so far; we are still working on a resolution.
We experienced just one spontaneous reboot yesterday afternoon.
Today we experienced a problem with many hypervisors rebooting at 11:36; the situation then stabilised.
We are still working with Oracle Support and with the Network team in order to understand the root cause of the problem, but so far we cannot draw any conclusion.
We have put in place some extra monitoring, and will be deploying some additional debugging.
We will post a status update tomorrow morning.
The situation is still unstable. We are working on it.
Marking this issue as resolved, as the situation has been stable since October 5th at 18:00. We are keeping the systems monitored.
The situation seems to have stabilised now, following the shutdown of a problematic hypervisor. All the services have been running fine since 18:00. We will perform more health checks over the weekend.
The problem happened again at 16:50. All the systems are coming back. We are still investigating the root cause of the problem.
Update: the same problem happened again at 14:30. All the systems are back now. We are investigating the root cause of the problem.