Backup Service degraded - problem with IBMLIB2

 
Date & time of incident: 
Friday, July 19, 2013 - 14:53
Incident Description: 

Dear Backup Users,

We are currently experiencing problems with IBMLIB2. Both accessors are dysfunctional and tapes cannot be accessed. An IBM engineer has been called and is on site. We will provide an update in 1-2 hours.

Best regards,
German on behalf of TSM Support

19/07/2013 - 22.30 UPDATE:

The library has been reconfigured and works fine now, so you can proceed with your restores. As a precaution for the weekend, we decided to divert the backups to the second library again, but if everything goes well during the weekend, on Monday we will restore the original library as the backup destination. We sincerely apologize for the prolonged inconvenience and thank you once again for your patience.

 

20/07/2013 - 23.41 UPDATE:

The library has again suffered many hardware failures and is currently unavailable. On Monday the IBM expert will come to try to fix the issue. The next update will be on Monday morning.

 

22/07/2013 - 11.07 UPDATE:

The IBM engineer has not arrived yet. We expect him as soon as possible. More updates this afternoon.

 

22/07/2013 - 17.05 UPDATE:

The IBM engineer fixed some of the issues. We are going to reinitialize the library later this evening and tomorrow morning. More updates tomorrow morning.

 

23/07/2013 - 10.28 UPDATE:

Problems still persist on the library. Bad alignment or bad calibration prevents the accessors from working correctly. IBM has been contacted again to send an engineer as soon as possible.

 

23/07/2013 - 16.04 UPDATE:

The IBM engineer is working on the library. When he is done, we will perform additional application (TSM) tests to verify that the library is fine.

 

24/07/2013 - 15.20 UPDATE:

Several parts of the library still need to be replaced; however, the library is working with a slightly lower failure rate. Users may now try to perform their restores. Some jobs may still fail, but retrying will probably lead to successful restores eventually. A lot of work remains to be done on the library, and the current estimated date of full recovery is this Friday (26/07/2013).

 

25/07/2013 - 14.00 UPDATE:

The library is currently working: we have diverted some traffic onto it to test it properly. We will watch it this afternoon and overnight before putting it completely back into production (hopefully tomorrow).

 

26/07/2013 - 09.35 UPDATE:

The library is broken again. The IBM engineer is back on site.

 

26/07/2013 - 15.08 UPDATE:

The library is working for now but will need further checks (and possibly interventions) before being put back into production. Feel free to retry restores.

 

30/07/2013 - 10.00 UPDATE:

The library is undergoing deeper investigation by the IBM engineer and is unavailable for the day. When he is done, we will perform additional application (TSM) tests to verify that the library is fine.

 

05/08/2013 - 14.00 UPDATE:

There are still some mechanical issues with the library. It is under investigation and some parts are going to be replaced this afternoon.

 

19/08/2013 - 16:00 UPDATE (RESOLUTION):

The library has been tested successfully over a period of 10 days. It has been working fine in production throughout that time, and we therefore consider this case closed.

Service Element Affected: 
Multiple Services
AFS Service
Any other affected service(s): 
All services using the TSM backup functionality
Impact: 
Service is degraded
Status: 
Resolved
Resolution date: 
Mon, Aug 19, 16:06
Posted by: 
IT-DSS
Unit responsible for resolution: 
IT Department