FIND Incident - EU Region (EMEA01)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Find is a Cloud-based Enterprise Search solution that delivers enhanced relevance and powerful search functionality to websites. On Wednesday the 6th of November 2019, we experienced an event that impacted the functionality of the service in the EU Digital Experience Cloud region. Details of the incident are described below.

DETAILS

On Wednesday 2019-11-06 at 11:40 UTC, the Engineering team received a warning regarding FIND cluster EMEA01. Troubleshooting started immediately upon receiving the alert and identified high Java heap memory consumption and failing garbage collections. When this happens, the cluster master evicts the affected nodes from the cluster and marks them as failed. The default behavior is then to redistribute the data over the remaining nodes. This caused the memory issue to move to another node, and the same scenario repeated on the other nodes until no nodes were left in the cluster and a full cluster restart was initiated. Service was restored at 13:55 UTC on November 6th, 2019.
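
The cascading failure described above can be illustrated with a minimal sketch. This is a toy model only, not the actual allocation logic of the FIND cluster: the function name simulate_cascade, the even redistribution of data, and all load figures are assumptions chosen for illustration. It shows how a runaway job that follows the redistributed data can evict every node in turn.

```python
def simulate_cascade(node_count, heap_capacity, base_load, runaway_load):
    """Toy model of a cascading eviction; returns node ids in eviction order.

    Assumptions (hypothetical, for illustration only):
    - every node starts with the same base_load of heap usage,
    - a single runaway job adds runaway_load to whichever node hosts it,
    - an evicted node's data is spread evenly over the surviving nodes,
    - the runaway job then continues on the lowest-numbered survivor.
    """
    loads = {node: float(base_load) for node in range(node_count)}
    runaway_host = 0
    evicted = []

    while loads:
        # The host stays healthy only if its data plus the runaway job fit in heap.
        if loads[runaway_host] + runaway_load <= heap_capacity:
            break

        # The master evicts the failing node and marks it as failed.
        orphaned = loads.pop(runaway_host)
        evicted.append(runaway_host)
        if not loads:
            break  # no nodes left: a full cluster restart is needed

        # Default behavior: redistribute the failed node's data over the survivors.
        share = orphaned / len(loads)
        for node in loads:
            loads[node] += share

        # The memory problem moves with the data to another node.
        runaway_host = min(loads)

    return evicted


# Three nodes with a heap capacity of 100 units each: every node is evicted in turn.
print(simulate_cascade(3, 100, 40, 80))  # prints [0, 1, 2]
```

In this model each eviction raises the load on the survivors, so once the first node fails the remaining nodes are even less able to absorb the runaway job.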

TIMELINE

November 6th, 2019

11:40 UTC – Warning alert triggered and acknowledged within 1 minute. Investigation initiated immediately on the specific alerting Find Cluster, EMEA01.

11:45 UTC – Series of node restarts performed as initial mitigation action in response to high memory utilization and garbage collection failures.

13:29 UTC – Critical alert triggered.

13:30 UTC – Full cluster restart was initiated.

13:31 UTC – STATUSPAGE updated.

13:55 UTC – Service operational and critical alert resolved.

November 7th, 2019

16:49 UTC – Root cause identified and action plan initiated.

ANALYSIS

This incident was caused by a synonym handling job that entered a runaway state, which resulted in resource starvation.

IMPACT

During the event, requests to this FIND cluster would have seen network timeouts (5xx-errors) or slow response times when trying to connect.

CORRECTIVE MEASURES

Short-term mitigation

The Engineering team performed node restarts to recover service as soon as possible.

Long-term mitigation

This issue has been registered as a bug.

FINAL WORDS

We sincerely apologize for the impact to affected customers. Customer experience is of high priority for us, and we have a strong commitment to delivering high availability for our services. We will do everything we can to learn from the event to avoid a recurrence in the future.

Posted Nov 26, 2019 - 14:12 UTC

Resolved
This incident has been resolved. An incident report will be provided as soon as the investigation into the root cause is completed.
Posted Nov 06, 2019 - 15:09 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 06, 2019 - 13:58 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 06, 2019 - 13:37 UTC
Investigating
We are currently investigating an event that is impacting the functionality on the FIND service in the EU region (EMEA01). A subset of clients will be experiencing high latency or 5xx-errors.
Posted Nov 06, 2019 - 13:31 UTC