Episerver Find is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Tuesday, 29 October 2019, we experienced an event that impacted the functionality of the service in the EU Digital Experience Cloud region. Details of the incident are described below.
Between 2019-10-29 14:29 UTC and 2019-10-29 15:17 UTC, the Find cluster EMEA01 experienced an outage.
The outage was triggered by a sudden peak in Java heap memory consumption. Under this high memory demand, garbage collection runs failed to free memory, and nodes in the cluster became unresponsive. When this happens, the cluster master evicts those nodes from the cluster and marks them as failed. The default behavior is then to re-distribute the data across the remaining nodes; in this case that moved the memory pressure to another node, and the same scenario repeated itself until no nodes were left in the cluster.
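For illustration only (this is not a description of our production configuration): Elasticsearch exposes a per-index setting, index.unassigned.node_left.delayed_timeout, that delays re-distribution of shards when a node leaves the cluster, which is one way such a cascade can be dampened. A minimal sketch applying an example five-minute delay to all indices:

    PUT _all/_settings
    {
      "settings": {
        "index.unassigned.node_left.delayed_timeout": "5m"
      }
    }

Delaying allocation gives an overloaded node a chance to rejoin before the cluster starts copying its shards onto the remaining nodes.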
It is worth mentioning that there is a technical limit on how much heap memory can be configured in Java/Elasticsearch, so these types of incidents cannot be mitigated simply by adding more memory.
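As an illustration of that limit (example values, not our production settings): Elasticsearch heap size is configured through the JVM options file, and the heap is generally kept below roughly 32 GB so the JVM can continue to use compressed object pointers; beyond that point, adding heap yields diminishing or even negative returns.

    # config/jvm.options (example values, not production settings)
    # keep minimum and maximum heap equal, below the ~32 GB compressed-oops threshold
    -Xms31g
    -Xmx31g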
October 29th, 2019
14:26 UTC – Warning alert triggered and acknowledged within 1 minute. Investigation initiated immediately on the specific alerting Find cluster, EMEA01.
14:30 UTC – Series of node restarts due to high memory utilization and garbage collection failures.
14:45 UTC – Critical alert triggered.
14:51 UTC – Full cluster restart was initiated.
14:59 UTC – Status page updated.
15:17 UTC – Service operational and critical alert resolved.
16:03 UTC – Incident closed.
November 7th, 2019
16:49 UTC – Root cause identified and action plan initiated.
The cause of this incident was a synonym-handling job that suffered from a runaway scenario, which resulted in resource starvation.
During the event, requests to this Find cluster would have seen network timeouts (5xx errors) or slow response times when trying to connect.
Short-term mitigation
The engineering team performed node restarts to recover the service as soon as possible.
Long-term mitigation
This issue has been registered as a bug.
We sincerely apologise for the impact on affected customers. Customer experience is a high priority for us, and we are strongly committed to delivering high availability for our services. We will do everything we can to learn from this event and avoid a recurrence.