During the night of Monday 11 December to Tuesday 12 December, our storage hosting provider (Softlayer) performed what should have been non-disruptive maintenance on the storage solution used by our Europe Region. This storage solution is used to store and serve, among other things, the shipping labels.
During the maintenance window there were some small disruptions of the service. These were not investigated further because the service returned to a stable state after the maintenance window. However, the next day new issues with the storage solution started to appear. A small percentage of the write attempts to this storage solution failed, and as a result a number of shipping labels could not be created.
11-12-2017 – 19:00 UTC Softlayer starts maintenance on the storage solution. Within minutes, storage errors are detected.
11-12-2017 – 22:00 UTC The Softlayer maintenance window ends and no further errors are detected. No further investigation is deemed necessary.
12-12-2017 – 08:35 UTC The first label creation errors start to appear, and we start to monitor and investigate the issue. Due to the long intervals between errors, it takes some time to pinpoint the root cause of the issues.
12-12-2017 – 11:00 UTC The cause and scope have been determined, and we are investigating further with the storage provider.
12-12-2017 – 12:30 UTC Official acknowledgement from the storage provider. Due to the increase in failures, the portal status is updated to ‘Degraded performance’. Still only a small percentage of all RMA creations and processing actions are affected.
12-12-2017 – 13:50 UTC The hosting provider implements a fix and reports that the situation has stabilized. After testing it ourselves and monitoring creations and processing actions on the system for 20 minutes, we detect no more issues and update the portal status to “Operational”.
12-12-2017 – 14:30 UTC The issues reappear, and we escalate the issue with the storage provider. The portal status is updated to ‘Degraded performance’.
12-12-2017 – 17:50 UTC Softlayer plans an extra maintenance window starting at 19:00 UTC to implement a fix.
12-12-2017 – 19:00 UTC Softlayer starts to implement the fix.
12-12-2017 – 21:19 UTC Softlayer reports that the fix has been implemented, and the issues stop appearing. We decide to keep the official status at ‘Degraded performance’ and continue monitoring the systems.
13-12-2017 – 07:00 UTC No errors have been detected since the fix was implemented, and all services are considered stable again. We update the portal status to “Operational”.