Episerver Find is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Wednesday, November 6th, 2019, we experienced an event that impacted the functionality of the service in the EU Digital Experience Cloud region. Details of the incident are described below.
On Wednesday 2019-11-06 at 11:40 UTC, the Engineering team received a warning regarding FIND cluster EMEA01. Troubleshooting started immediately upon receiving the alert and identified high Java heap memory consumption and failing garbage collections. When this happens, the cluster master evicts the affected nodes from the cluster and marks them as failed. The default behavior is then to redistribute the data over the remaining nodes. This caused the memory issue to move to another node, and the same scenario replayed itself on the other nodes until no nodes were left in the cluster and a full cluster restart was initiated. Service was restored at 13:55 UTC on November 6th, 2019.
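The cascading failure described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not a representation of the actual cluster internals: the function name, the numeric loads, and the rule that the run-away job lands on the least-loaded surviving node are all hypothetical and chosen only to show why evicting one node moves the problem rather than removing it.

```python
def simulate_cascade(node_count, heap_limit, base_load, runaway_load):
    """Return the order in which nodes are evicted in a toy cluster model.

    All numbers are abstract "heap units" (hypothetical, for illustration).
    """
    # Every node starts with the same baseline load; node 0 also hosts
    # the run-away job.
    loads = {node: base_load for node in range(node_count)}
    runaway_host = 0
    evicted = []

    while loads:
        if loads[runaway_host] + runaway_load <= heap_limit:
            break  # the host can absorb the job, so the cascade stops

        # The master evicts the overloaded node and marks it as failed.
        orphaned_load = loads.pop(runaway_host)
        evicted.append(runaway_host)
        if not loads:
            break  # no nodes left: only a full cluster restart remains

        # Default behavior: redistribute the evicted node's data evenly
        # over the remaining nodes.
        share = orphaned_load / len(loads)
        for node in loads:
            loads[node] += share

        # Assumption: the run-away job follows the data to the
        # least-loaded surviving node.
        runaway_host = min(loads, key=loads.get)

    return evicted


# Three nodes, each with a limit of 100 units and a baseline of 40 units.
print(simulate_cascade(3, 100, 40, 80))  # [0, 1, 2]
print(simulate_cascade(3, 100, 40, 10))  # []
```

In the first call the run-away job never fits on any node, and each eviction also raises the baseline load on the survivors, so every node fails in turn. In the second call the job fits within the heap limit and no node is evicted.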
November 6th, 2019
11:40 UTC – Warning alert triggered and acknowledged within 1 minute. Investigation initiated immediately on the specific alerting Find Cluster, EMEA01.
11:45 UTC – Series of node restarts performed as the initial mitigation action, in response to high memory utilization and garbage collection failures.
13:29 UTC – Critical alert triggered.
13:30 UTC – Full cluster restart was initiated.
13:31 UTC – STATUSPAGE updated.
13:55 UTC – Service operational and critical alert resolved.
November 7th, 2019
16:49 UTC – Root cause identified and action plan initiated.
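The intervals implied by the timeline above can be worked out as follows. This is a small sketch using only the timestamps listed in this report; the variable names are illustrative, and measuring from the first warning alert is an assumption, since the report does not state exactly when customer impact began.

```python
from datetime import datetime, timezone


def utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps taken from the timeline in this report.
warning = utc("2019-11-06 11:40")
critical = utc("2019-11-06 13:29")
restored = utc("2019-11-06 13:55")
root_cause = utc("2019-11-07 16:49")

time_to_restore = restored - warning       # first warning to service operational
critical_duration = restored - critical    # critical alert to resolution
time_to_root_cause = root_cause - restored # restoration to root cause identified

print(time_to_restore)     # 2:15:00
print(critical_duration)   # 0:26:00
print(time_to_root_cause)  # 1 day, 2:54:00
```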
This incident was caused by a synonym-handling job that entered a run-away state, which resulted in resource starvation.
During the event, requests to this FIND cluster would have seen network timeouts (5xx errors) or slow response times when trying to connect.
Short-term mitigation
The Engineering team performed node restarts to recover service as soon as possible.
Long-term mitigation
This issue has been registered as a bug.
We sincerely apologize for the impact to affected customers. Customer experience is of high priority for us, and we have a strong commitment to delivering high availability for our services. We will do everything we can to learn from the event to avoid a recurrence in the future.