
Virtual Machines affected by hypervisor storage incident

Date & time of incident: 
Tuesday, July 2, 2013 - 15:00
Incident Description: 

A storage cell for virtual machines stopped responding for a few minutes around 15:00, causing a number of virtual machines to experience I/O problems.

Update Thursday, July 4, 10:30: the last non-production VMs are back online.

Update Wednesday, July 3, 11:30: the problem was caused by a misconfigured network switch that had been installed to replace a switch whose power supply had failed; the misconfiguration eventually caused the storage network for a cluster of hypervisors to collapse. Once the network was restored, we experienced unprecedented difficulty reconnecting hypervisors to their SAN volumes. As of 11:30, almost all VMs are back online. Only one SAN volume is still offline, affecting 4 non-production VMs. This volume is undergoing a repair operation and is expected to be restored (along with the 4 remaining VMs) on Thursday morning.

Update 20:00: most VMs are back online, but the following VMs still have no storage access and remain offline:

bdii215
ignavm01
janvm01
janvm02
janvm03
lbdai01
lbdai02
lc-w7
lxargus01
lxvcs02
papra42a
papra42b
pcudsdev1
testalexslc6
voatlas175
voatlas222
wms300

Update 17:00: a faulty network switch connecting the hypervisors to the storage subsystem has been replaced. The virtual machines are progressively coming back online.

Update 16:30: the storage subsystem is again experiencing problems. Investigations are ongoing. See the list of potentially affected VMs below.

Update 16:00: the situation is back to normal. 14 virtual machines crashed due to the incident with the storage cell and have been rebooted:

castorsrv103
ignavm01
janvm04
lbdai02
lxlic05
pcuds99
voatlas208
voatlas211
voatlas220
voatlas222
vocms149
vocms155
vocms157
wms300

The following 93 VMs also use this storage and may have been affected by performance problems during the incident:

aicache01
batchmon03
bdii214
bdii215
burotel
castorsrv101
castorsrv102
castorsrv103
dashboard19
dashboard20
dashboard21
dashboard22
dashboard23
dashboard32
dashboard36
dashboard37
dashboard38
dashboard39
dashboard40
dashboard41
dashboard42
dashboard43
dashboard44
fts303
ignavm01
janvm01
janvm02
janvm03
janvm04
jnyczakvm
lbdai01
lbdai02
lcgapp09
lc-w7
lfcatlas03
lfclhcbro03
lxargus01
lxcvmfs02
lxfont02
lxfont03
lxlic01
lxlic05
lxvcs01
lxvcs02
md-office
papra42a
papra42b
pcuds100
pcuds101
pcuds102
pcuds99
pcudsdev1
procdev02
procdev03
procdev04
px306
saotools
smtarch01
testalexslc6
vmitdi01
voatlas174
voatlas175
voatlas176
voatlas205
voatlas206
voatlas207
voatlas208
voatlas211
voatlas212
voatlas213
voatlas214
voatlas215
voatlas220
voatlas221
voatlas222
voatlas223
vocms143
vocms144
vocms145
vocms146
vocms149
vocms155
vocms157
vocms158
vocms17
volhcb31
volhcb32
volhcb33
voms305
voms307
voms309
voms310
wms300

Service Element Affected: 
Multiple Services
Impact: 
Some applications linked to services are unavailable
Status: 
Resolved
Resolution date: 
Thursday, July 4, 2013 - 10:30
Expected resolution or Next Update Time: 
Tuesday, July 2, 2013 - 16:30
Posted by: 
IT-PES
Unit responsible for resolution: 
IT Department