IT Service Status Board

CASTORPUBLIC service unavailability

Date & time of incident:

Monday, July 29, 2013 - 22:00

Post date:

Tuesday, July 30, 2013 - 14:45

Incident Description:

CASTORPUBLIC suffered serious DB contention that severely degraded the service. First diagnostics point to a massive load from several public users staging files in a tight loop, at rates close to 200 Hz per user (some of the files were still being recalled from tape). This put heavy pressure on the DB, which became slower and slower as requests piled up, to the point that internal CASTOR processes also got stuck on the DB side. That made the situation worse and resulted in extraordinarily high stager response times.

Follow-up: the offending users have been banned and the experiments' computing representatives have been informed. The internal CASTOR processes (mainly tape-related workflows) were regenerated and freed on the DB side. The situation is returning to normal, but we are monitoring it closely until everything is fully restored and understood.
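
For users who need to wait for a file to be staged, the tight loop described above can be replaced by polling with an increasing delay. The following is only a minimal sketch under stated assumptions: `is_staged` is a hypothetical callable supplied by the user's own script (it is not a CASTOR API), and the delay values are illustrative, not recommended service limits.

```python
import time
from typing import Callable, List


def backoff_delays(initial: float, factor: float,
                   max_delay: float, attempts: int) -> List[float]:
    """Return the waiting times (in seconds) between successive checks.

    Each delay is `factor` times the previous one, capped at `max_delay`.
    """
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= factor
    return delays


def wait_until_staged(is_staged: Callable[[], bool],
                      initial: float = 5.0,
                      factor: float = 2.0,
                      max_delay: float = 300.0,
                      max_attempts: int = 20,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `is_staged` with exponential backoff instead of a tight loop.

    `is_staged` is a hypothetical, user-provided check (an assumption of
    this sketch); it should return True once the file is available.
    Returns True if the file became available, False if we gave up.
    """
    for delay in backoff_delays(initial, factor, max_delay, max_attempts):
        if is_staged():
            return True
        sleep(delay)  # wait longer after every unsuccessful check
    return is_staged()  # one final check after the last wait
```

With the illustrative defaults above, a client issues at most 21 checks over more than an hour, rather than roughly 200 requests per second.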

Service Element Affected:

Storage Service for Projects & Experiments

Specific Service detail:

CASTORPUBLIC, SRM-CASTORPUBLIC

Impact:

Service is degraded

Status:

Resolved

Resolution date:

Tuesday, July 30, 2013 - 12:00

Posted by:

IT-DSS

Unit responsible for resolution:

IT Department

www.cern.ch

CERN Accelerating science
