Issues with Episerver Find
Incident Report for Optimizely Service
Postmortem

Summary

On Tuesday 2017-03-14 and Wednesday 2017-03-15, an incident affected Episerver FIND. This report describes the event in more detail.

Details

Episerver FIND uses a continuous deployment model in which updates are rolled out weekly without interruption to services. On Monday 2017-03-13, a new version of the Episerver FIND backend was rolled out to the production clusters. The rollout to all clusters completed without any issues at that time.

On Tuesday 2017-03-14, the tech team received alerts that request latency was higher than normal, and an investigation was started to find the root cause. As the investigation proceeded, request latency increased further, and the working theory was that Monday's rollout of the new backend version was the likely cause.

The decision was made to roll back all clusters to the older version. The rollback was completed on Wednesday afternoon, and request latency returned to normal.

Root Cause

The root cause of this issue was a performance-related bug in the new version of the FIND backend. This bug caused queues to build up and response times to increase over time.
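
As a rough illustration of why a performance regression shows up as latency that keeps growing rather than as an immediate failure, the sketch below models a single work queue. The function name and all numbers are hypothetical and chosen for illustration only; they are not measurements from this incident.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, ticks):
    """Return the queue length after each tick for a single work queue."""
    backlog = 0
    history = []
    for _ in range(ticks):
        # New requests join the queue, then the backend drains what it can.
        backlog = max(0, backlog + arrivals_per_tick - capacity_per_tick)
        history.append(backlog)
    return history


# Hypothetical figures: the same load against a healthy and a slower backend.
healthy = simulate_backlog(arrivals_per_tick=100, capacity_per_tick=120, ticks=5)
degraded = simulate_backlog(arrivals_per_tick=100, capacity_per_tick=90, ticks=5)
```

In this model the healthy backend never accumulates a backlog, while the degraded one falls further behind on every tick, so each new request waits longer than the one before it.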

Impact on other services

During the incident, customers using the service would have experienced slow responses or "HTTP 503 Service Unavailable" responses to their requests.

Corrective and Preventative Measures

To improve our processes and reduce the risk of this happening again, we have changed our deployment process to use longer staging windows during rollouts. This will limit the impact of performance-related bugs to one or a few clusters before a rollback.
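
The general idea of a staged rollout with a staging window can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of the actual Episerver FIND deployment tooling: the `deploy`, `rollback`, and `is_healthy` callables and the `soak_checks` parameter are hypothetical, and a real staging window would space the health checks out over time.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    clusters: Sequence[str],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    soak_checks: int,
) -> List[str]:
    """Deploy one cluster at a time; return the clusters left on the new version."""
    updated: List[str] = []
    for cluster in clusters:
        deploy(cluster)
        updated.append(cluster)
        # Staging window: re-check every updated cluster several times
        # before the next cluster is touched.
        for _ in range(soak_checks):
            if not all(is_healthy(c) for c in updated):
                # Roll back in reverse order and stop the rollout.
                for done in reversed(updated):
                    rollback(done)
                return []
    return updated
```

With this shape, a slow-building regression that is caught during the staging window affects only the clusters updated so far, and the remaining clusters stay on the older version.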

Final Words

We apologize for the impact on affected customers. While we are proud of the availability of the service, we know how critical it is to our customers. For us, availability is the most important feature, and we will do everything we can to learn from this event and avoid a recurrence.

Posted May 03, 2017 - 07:51 UTC

Resolved
This incident has been resolved
Posted Mar 15, 2017 - 14:58 UTC
Monitoring
A fix has been implemented and we are monitoring the results
Posted Mar 15, 2017 - 08:55 UTC
Investigating
We are currently investigating this issue
Posted Mar 14, 2017 - 14:50 UTC