Date & time of incident:
Friday, October 5, 2012 - 10:20
Post date:
Friday, October 5, 2012 - 11:38
Incident Description:
Following the reboot of 20 hypervisors, some 60 VMs hosted at IT-DB rebooted. The machines were correctly restarted or relocated when the hypervisors came back. This affected multiple AIS and DBoD services; see the separate announcements on the GS and IT SSB.
The service suffered from the same problem on October 5.
Experts have worked with Oracle Support to identify the root cause of this problem and have put a fix in place.
The situation is nevertheless still being closely monitored.
Service Element Affected:
Multiple Services
Impact:
Service is degraded
Status:
Resolved
Resolution date:
Monday, November 12, 2012 - 19:00
Posted by:
IT-DB
Unit responsible for resolution:
IT Department
Updates
Unfortunately, until the root cause of the problem is identified, this incident may (and most likely will) re-appear over the next few days: on average we have experienced one or two reboots per week.
We are therefore keeping this article in the SSB so that users are aware that the service is degraded and may fail. We will update or close the article as soon as the investigation progresses.
Thanks for your understanding.
All the services are back.
We are still working with Oracle Support in order to identify the root cause of this problem.
The last services are being moved out of the problematic machines.
Today at 17:00 some hypervisors rebooted. Two DBoD instances are still down: dod_boinc and dod_lbcertif. The development instance of the Oracle HR application is down too. We are working on it.
We have just experienced another spontaneous mass reboot. The list of services affected is:
- Kitry PROD
- ERTT and ERTD databases
- Oracle HR DEV and TEST application servers (databases stayed up)
All production services have been or are being moved to physical hardware, and the problem now seems to have disappeared. We are still working with Oracle Support to put more diagnostics in place; we will then put load on the systems, since load appears to be one of the triggers of the reboots.
We just experienced another spontaneous reboot of multiple hypervisors. All the VMs are coming back correctly.
No news so far; we are still working on a resolution.
We experienced just one spontaneous reboot yesterday afternoon.
Today we experienced a problem with many hypervisors rebooting at 11:36; the situation then stabilised.
We are still working with Oracle Support and with the Network team in order to understand the root cause of the problem, but so far we cannot draw any conclusion.
We have put in place some extra monitoring, and will be deploying some additional debugging.
We will post a status update tomorrow morning.
The situation is still unstable. We are working on it.
Marking this issue as resolved, as the situation has been stable since October 5th at 18:00. We are keeping the systems monitored.
The situation seems to have stabilised now, following the shutdown of a problematic hypervisor. All the services have been running fine since 18:00. We will perform more health checks over the weekend.
The problem happened again at 16:50. All the systems are coming back. We are still investigating the root cause of the problem.
Update: the same problem happened again at 14:30. All the systems are back now. We are investigating the root cause of the problem.