Episerver Find is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Thursday, 20 February 2020, an event impacted the functionality of the service in the EU Digital Experience Cloud region; the following report describes the details of that event.
Between 2020-02-20 09:55 UTC and 2020-02-20 11:47 UTC, FIND cluster EMEA04 experienced an outage.
The issue was triggered by a sudden increase in incoming requests, which caused high memory demand on several data nodes. The master node evicted those nodes from the cluster and marked them as failed, so the nodes were forced to restart and re-connect to the cluster. The master node was also fully restarted to clear its queue of pending tasks. The service was restored the same day at 11:47 UTC.
TIMELINE
February 20th, 2020
09:55 UTC – First alert is triggered by the monitoring system and troubleshooting is initiated.
09:55 UTC – High JVM heap usage is detected and node restarts are performed.
10:07 UTC – STATUSPAGE is updated.
10:18 UTC – Several nodes are evicted from the cluster.
10:22 UTC – Evicted nodes are restarted and re-connect to the cluster.
11:11 UTC – Master node is restarted.
11:13 UTC – Cluster is recovering.
11:47 UTC – Service is operational and the critical alert is resolved.
The cause of this incident was a sudden peak in Java heap memory consumption. Analysis identified two events that took place at the same time and together caused the load spike: a tenant on shared infrastructure was sending a set of large files to the cluster, while the Episerver Find index management portal was issuing an extensive number of calls. This over-allocated heap on the nodes and caused failed garbage collections. Queries then failed on the node with high heap usage, and when that node was restarted, the Elasticsearch primary shard being indexed to moved to another node and the scenario repeated itself until the cluster's internal shards had been re-balanced and the service was fully operational.
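The heap condition described above is observable through Elasticsearch's standard `_nodes/stats/jvm` API. The sketch below is a hypothetical monitoring helper (not part of the Find service itself) that parses such a response and flags data nodes whose heap usage exceeds a threshold, which is the state that preceded the node restarts:

```python
# Hypothetical helper: flag Elasticsearch nodes with dangerously high JVM heap usage.
# Input is a parsed JSON response from GET _nodes/stats/jvm.

def nodes_over_heap_threshold(node_stats: dict, threshold_pct: int = 85) -> list:
    """Return names of nodes whose heap_used_percent exceeds threshold_pct."""
    hot_nodes = []
    for node in node_stats.get("nodes", {}).values():
        heap_pct = node.get("jvm", {}).get("mem", {}).get("heap_used_percent", 0)
        if heap_pct > threshold_pct:
            hot_nodes.append(node.get("name", "unknown"))
    return hot_nodes

# Example payload shaped like a _nodes/stats/jvm response:
sample = {
    "nodes": {
        "abc123": {"name": "data-node-1", "jvm": {"mem": {"heap_used_percent": 92}}},
        "def456": {"name": "data-node-2", "jvm": {"mem": {"heap_used_percent": 61}}},
    }
}

print(nodes_over_heap_threshold(sample))  # → ['data-node-1']
```

In a real deployment the payload would come from the cluster's monitoring endpoint rather than a literal dict; the node names and threshold here are illustrative only.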
During the event, requests to this FIND cluster may have experienced network timeouts, 5xx errors, or slow response times when trying to connect.
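Client applications can often ride out transient failures of this kind by retrying with backoff. A minimal sketch, assuming the caller wraps its own request function (none of the names below are part of the Find client API):

```python
import time

def retry_with_backoff(call, max_attempts: int = 4, base_delay: float = 0.5):
    """Invoke `call`; on exception, retry with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))

# Example: a fake request that fails twice (as during the outage) then succeeds.
attempts = {"n": 0}

def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("503 Service Unavailable")
    return "200 OK"

print(retry_with_backoff(flaky_request, base_delay=0.01))  # → 200 OK
```

Capping the number of attempts and growing the delay keeps retries from adding load to a cluster that is already struggling with memory pressure.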
Short-term mitigation
The engineering team performed node restarts to recover the service as soon as possible.
Long-term mitigation
We are working on an improvement that prevents unnecessary implicit calls to Elasticsearch, reducing the load on the backend.
Engineers are also developing and testing improvements to memory management on the individual data nodes.
We apologize for the impact on affected customers. Availability is a high priority for us, and we are strongly committed to delivering high availability for our Find service. We will continue to prioritize our efforts to overcome these recent difficulties and will do everything we can to learn from this event and avoid a recurrence in the future.