FIND Incident - EU Region (EMEA01)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Find is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Tuesday, 29 October 2019, we experienced an event that impacted the functionality of the service in the EU Digital Experience Cloud region. Details of the incident are described below.

DETAILS

Between 2019-10-29 14:29 UTC and 2019-10-29 15:17 UTC FIND cluster EMEA01 experienced an outage.

The outage was triggered by a sudden peak in Java heap memory consumption. Under this sustained memory demand, garbage collection repeatedly failed to free enough memory and nodes in the cluster became unresponsive. When this happens, the cluster master evicts those nodes from the cluster and marks them as failed. The default behavior is then to redistribute the data across the remaining nodes; this moved the memory pressure to another node, and the same scenario repeated on node after node until no nodes were left in the cluster.
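The eviction-and-redistribution cycle described above can be sketched as a simple simulation. This is an illustration only; the node counts and heap sizes below are invented for the example and are not the EMEA01 configuration:

```python
# Illustrative sketch (not Optimizely's code): each node holds an equal
# share of the data. When a node's share exceeds its heap capacity it is
# evicted, and its data is redistributed over the surviving nodes,
# raising their load in turn -- the cascade described above.

def simulate_cascade(node_count: int, heap_per_node: float, total_demand: float) -> int:
    """Return how many nodes survive once redistribution stops cascading."""
    alive = node_count
    while alive > 0:
        load_per_node = total_demand / alive
        if load_per_node <= heap_per_node:
            return alive   # remaining nodes can absorb the redistributed load
        alive -= 1         # node evicted as failed; data redistributed
    return 0               # no nodes left in the cluster

# A demand spike just above total cluster capacity takes down every node:
# 6 nodes x 30 GB heap = 180 GB capacity, but the spike demands 190 GB.
print(simulate_cascade(6, 30.0, 190.0))  # -> 0
```

The key property the sketch shows is that redistribution only helps when the surviving nodes have spare headroom; once aggregate demand exceeds aggregate capacity, every redistribution step makes the per-node load worse.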

It is worth noting that there is a technical limit on how much heap memory can be configured in Java/Elasticsearch, so these types of incidents cannot be mitigated simply by adding more memory.
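For context on that limit: common Elasticsearch guidance is to give the JVM heap no more than half of available RAM, and to keep it below the roughly 32 GB threshold at which the JVM loses compressed ordinary object pointers (compressed oops). A small sketch of that sizing rule (the 31 GB cutoff is a conservative illustrative value, not a setting taken from this cluster):

```python
# Hedged illustration of the usual Elasticsearch heap-sizing rule of thumb:
# heap <= 50% of physical RAM, and below the ~32 GB compressed-oops
# threshold. Beyond that point, adding RAM no longer translates into
# usable heap -- which is why more memory could not fix this incident.

def recommended_heap_gb(ram_gb: int, compressed_oops_cutoff_gb: int = 31) -> int:
    """Return a rule-of-thumb heap size in GB for an Elasticsearch node."""
    return min(ram_gb // 2, compressed_oops_cutoff_gb)

print(recommended_heap_gb(64))   # -> 31 (capped by the compressed-oops cutoff)
print(recommended_heap_gb(128))  # -> 31 (extra RAM gains no extra heap)
```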

TIMELINE

October 29th, 2019

14:26 UTC – Warning alert triggered and acknowledged within one minute. Investigation initiated immediately on the specific alerting Find cluster, EMEA01.

14:30 UTC – Series of node restarts due to high memory utilization and garbage collection failures.

14:45 UTC – Critical alert triggered.

14:51 UTC – Full cluster restart initiated.

14:59 UTC – Statuspage updated.

15:17 UTC – Service operational and critical alert resolved.

16:03 UTC – Incident closed.

November 7th, 2019

16:49 UTC - Root cause identified and action plan initiated.

ANALYSIS

The incident was caused by a synonym-handling job that suffered a runaway scenario, resulting in resource starvation.

IMPACT

During the event, requests to this FIND cluster may have experienced timeouts, 5xx errors, or slow response times when trying to connect.

CORRECTIVE MEASURES

Short-term mitigation

The engineering team performed node restarts to restore service as quickly as possible.

Long-term mitigation

This issue has been registered as a bug.

FINAL WORDS

We sincerely apologise for the impact to affected customers. Customer experience is a high priority for us, and we have a strong commitment to delivering high availability for our services. We will do everything we can to learn from this event and prevent a recurrence.

Posted Nov 01, 2019 - 12:41 UTC

Resolved
This incident has been resolved. An RCA will be provided as soon as the investigation into the root cause is completed.
Posted Oct 29, 2019 - 16:03 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 29, 2019 - 15:33 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 29, 2019 - 15:24 UTC
Investigating
We are currently working on an event that impacted the functionality of the FIND service in the EU Region (EMEA01), and service is now being restored. A subset of clients may experience increased latency or 5xx errors.
Posted Oct 29, 2019 - 14:59 UTC