Summary

From Tuesday 2017-03-14 through Wednesday 2017-03-15 we had an event impacting Episerver FIND. This report provides additional details about that event.

Details

Episerver FIND uses a continuous deployment model in which updates are rolled out weekly without interruption to services. On Monday 2017-03-13 a new version of the Episerver FIND backend was rolled out to production clusters. The update was rolled out to all clusters without any issues at that time.

On Tuesday 2017-03-14 the tech team received alerts that request latency was higher than normal, and an investigation was started to find the root cause. As the investigation proceeded, request latency increased further, and the working theory was that the likely cause was Monday's rollout of the new backend version.

The decision was made to roll back all clusters to the older version. The rollback was completed on Wednesday afternoon, and request latency returned to normal.

Root Cause

The root cause of this issue was a performance-related bug in the new version of the FIND backend. This bug caused queues to build up and response times to increase over time.

Impact on other services

During the incident, customers using the service would have experienced slow responses or "HTTP 503 Service Unavailable" responses to their requests.

Corrective and Preventative Measures

To improve our processes and reduce the risk of this happening again, we have changed our deployment process to have longer staging windows during rollouts. This will limit the impact of performance-related bugs to one or a few clusters before a rollback.
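As an illustration only, the effect of a longer staging window can be sketched as below. This is not the actual FIND deployment tooling: the function names, the latency probe, the threshold, and the simulated latency growth are all hypothetical. The sketch upgrades one cluster at a time and checks latency a fixed number of times before moving on, so a bug that degrades slowly is caught only if the window is long enough.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RolloutResult:
    upgraded: List[str]     # clusters left on the new version
    rolled_back: List[str]  # clusters reverted to the old version


def staged_rollout(
    clusters: List[str],
    latency_ms: Callable[[str], float],  # hypothetical metrics probe
    max_latency_ms: float,               # hypothetical alert threshold
    checks_per_stage: int,               # length of the staging window
) -> RolloutResult:
    """Upgrade one cluster at a time, observing each before moving on."""
    upgraded: List[str] = []
    for cluster in clusters:
        # A real system would invoke its deploy tooling for `cluster` here.
        for _ in range(checks_per_stage):
            if latency_ms(cluster) > max_latency_ms:
                # Regression seen inside the staging window: revert this
                # cluster and stop, leaving the remaining clusters untouched.
                return RolloutResult(upgraded=upgraded, rolled_back=[cluster])
        upgraded.append(cluster)
    return RolloutResult(upgraded=upgraded, rolled_back=[])


def make_probe(growth_ms_per_check: float) -> Callable[[str], float]:
    """Simulated probe: latency on a cluster grows with every check."""
    counts: Dict[str, int] = {}

    def probe(cluster: str) -> float:
        counts[cluster] = counts.get(cluster, 0) + 1
        return 100.0 + growth_ms_per_check * counts[cluster]

    return probe


# Short window: the slow regression is missed and every cluster is upgraded.
short = staged_rollout(["a", "b", "c"], make_probe(50.0), 300.0, 2)
# Longer window: the regression shows up on the first cluster only.
longer = staged_rollout(["a", "b", "c"], make_probe(50.0), 300.0, 6)
```

In this simulation the short window upgrades all three clusters before the regression becomes visible, while the longer window stops at the first cluster and leaves the others on the old version.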

Final Words

We apologize for the impact to affected customers. While we are proud of the availability we have on the service, we know how critical this service is to customers. For us, availability is the most important feature, and we will do everything we can to learn from this event and to avoid a recurrence.

Posted May 03, 2017 - 07:51 UTC

Resolved

This incident has been resolved

Posted Mar 15, 2017 - 14:58 UTC

Monitoring

A fix has been implemented and we are monitoring the results

Posted Mar 15, 2017 - 08:55 UTC

Investigating

We are currently investigating this issue

Posted Mar 14, 2017 - 14:50 UTC