Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. Between 2020-06-26 03:41 UTC and 2020-07-04 22:07 UTC, the FIND cluster AZNEUPROD01 experienced intermittent incidents, culminating on Saturday 2020-07-04. Details of the incident are described below.
Starting at 03:41 UTC on 26th June 2020, the Engineering team received an alert for FIND cluster AZNEUPROD01. Troubleshooting started immediately, and the investigation found that performance across the entire FIND cluster was degraded by an extreme variation in multi-request queries, which increased traffic demand on the cluster. A final fix was implemented on 4th July to limit and optimize the load from the identified source towards the cluster. The service was fully restored the same day at 22:07 UTC.
June 26th, 2020
03:41 UTC – First alert is triggered by global monitoring system.
03:42 UTC – Alert acknowledged and troubleshooting initiated.
03:42 UTC – Rolling restarts are performed to mitigate high memory utilization.
04:33 UTC – STATUSPAGE updated.
04:33 UTC – Service started to recover.
08:19 UTC – Issue started reoccurring.
09:05 UTC – Issue identified; mitigation steps started rolling out.
11:36 UTC – Critical alert resolved and service fully operational.
Intermittent service degradation continued until July 4th at 22:07 UTC, when final mitigation efforts were in place to limit the load towards the cluster.
The cause of this incident was a sudden peak in Java heap memory consumption. Analysis identified that the cluster was overwhelmed by a constant stream of large multi-request queries (bulk or multi-search), which allow multiple operations to be performed in a single request. This caused over-allocation of heap on the nodes and failed garbage collections, which in turn caused queries to fail on the node with high heap usage. When that node is restarted, the Elasticsearch primary shard being indexed to moves to another node. The scenario then repeats until the cluster's internal shards have been rebalanced and the service is fully operational.
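One standard way to keep such multi-request traffic from over-allocating heap on the cluster is to cap the size of each bulk request on the client side. The sketch below is illustrative only (the function name, limits, and document shapes are assumptions, not Episerver's actual implementation); it splits a stream of Elasticsearch bulk actions into batches bounded by both action count and NDJSON body size:

```python
import json

def chunk_bulk_actions(actions, max_actions=500, max_bytes=5_000_000):
    """Split a stream of Elasticsearch bulk actions into bounded batches.

    Each action is an (action_metadata, document) pair of dicts.
    Capping batch size keeps the per-request heap footprint on the
    cluster predictable instead of sending one huge bulk request.
    """
    batch, batch_bytes = [], 0
    for meta, doc in actions:
        # NDJSON body size for this action: metadata line + document line + newlines
        size = len(json.dumps(meta)) + len(json.dumps(doc)) + 2
        if batch and (len(batch) >= max_actions or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append((meta, doc))
        batch_bytes += size
    if batch:
        yield batch

# Example: 1,200 small index actions are split into batches of at most 500
actions = [({"index": {"_index": "pages", "_id": i}}, {"title": f"page {i}"})
           for i in range(1200)]
batches = list(chunk_bulk_actions(actions))  # three batches: 500, 500, 200
```

Each batch would then be sent as its own `_bulk` request, giving the cluster's garbage collector smaller, more regular allocations to work with.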
During the events, requests to this FIND cluster may have experienced network timeouts, server errors (5xx), or slow response times when trying to connect.
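Clients facing this kind of intermittent 5xx/timeout behavior typically retry with exponential backoff rather than immediately re-sending, so that retries do not add load to an already-degraded cluster. A minimal, hypothetical schedule generator (not part of the FIND client itself):

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0, jitter=False):
    """Exponential backoff schedule for retrying timeouts or 5xx responses.

    Returns the wait in seconds before each retry attempt. Enabling
    `jitter` randomizes each delay, which helps avoid synchronized
    retry storms when many clients fail at the same moment.
    """
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay = random.uniform(0, delay)
        delays.append(delay)
    return delays

# Deterministic schedule: 0.5s, 1s, 2s, 4s, 8s before attempts 1-5
schedule = backoff_delays()
```

In a real client, each delay would be passed to a sleep call between attempts, and retries would stop as soon as a non-5xx response arrives.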
Short-term mitigation:
Long-term mitigation:
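The mitigation items themselves are not enumerated above, but "limiting the load from the identified source towards the cluster" is commonly implemented with rate limiting. A minimal token-bucket sketch, purely illustrative of the general technique rather than Episerver's actual mechanism:

```python
class TokenBucket:
    """Simple token bucket: allows `rate` requests per second on average,
    with short bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = now

    def allow(self, now):
        # Refill tokens for the elapsed time, then spend one if available
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Example: 2 requests/second with a burst of 2.
bucket = TokenBucket(rate=2, capacity=2)
bucket.allow(0.0)   # True (burst)
bucket.allow(0.0)   # True (burst)
bucket.allow(0.0)   # False (bucket empty)
bucket.allow(0.5)   # True (one token refilled after 0.5 s)
```

Requests that are denied would be queued or rejected at the edge, keeping the traffic reaching the cluster within a budget it can garbage-collect comfortably.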
We sincerely apologize for the impact on affected customers. Customer experience is a high priority for us, and we are strongly committed to delivering high service availability. We will do everything we can to learn from this event and avoid future recurrence.