Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday September 21 and from Sunday September 27 through Wednesday September 30, 2020, an intermittent event impacted service functionality in the EU Digital Experience Cloud region. The following report provides additional details about the event.
Between 2020-09-21 16:05 UTC and 2020-09-30 23:27 UTC the Search & Navigation cluster EMEA07 experienced intermittent service degradation.
The Engineering team worked to identify the root cause of the degradation and to apply mitigation steps that reduced the impact. Because the Search & Navigation service consists of many components, including API calls towards the cluster, the investigation was complex and exhaustive, and additional time was needed to stabilize the service and identify a resolution.
Details of the incident are described below.
September 21, 2020
16:05 UTC – First alert triggered, acknowledged and investigation initiated.
17:45 UTC – STATUSPAGE updated.
18:54 UTC – Proxies were reset.
18:57 UTC – Critical alert resolved and service fully operational.
September 22, 2020
07:00 UTC – Retrospective meeting.
September 27, 2020
17:08 UTC – First alert triggered and acknowledged, investigation initiated.
17:36 UTC – Restarted an unhealthy node and allocated shards.
18:12 UTC – Service recovered.
September 28, 2020
00:47 UTC – First alert triggered and investigation initiated.
01:02 UTC – Restarted Elasticsearch service.
01:20 UTC – Service recovered.
07:00 UTC – Retrospective meeting to analyze and discuss next actions.
11:00 UTC – Second alert triggered and acknowledged, investigation initiated.
11:20 UTC – Restarted client nodes.
11:35 UTC – Service started recovering.
12:08 UTC – Third alert triggered and acknowledged.
12:17 UTC – Critical alert resolved and service fully operational.
16:23 UTC – Fourth alert triggered and acknowledged.
16:49 UTC – STATUSPAGE updated.
17:05 UTC – Service recovered.
September 29, 2020
02:51 UTC – First alert triggered and acknowledged, investigation initiated.
03:05 UTC – Client node restarted.
03:15 UTC – Service started recovering.
05:26 UTC – Second alert triggered and troubleshooting commenced.
05:27 UTC – Restart of client nodes due to overload.
06:12 UTC – STATUSPAGE updated.
07:44 UTC – Fix implemented and service recovering.
13:28 UTC – STATUSPAGE incident kept open to continue monitoring service health closely.
16:19 UTC – Third alert triggered and acknowledged, investigation initiated.
16:29 UTC – Client node restarted and service recovered.
September 30, 2020
07:00 UTC – Retrospective meeting to analyze and discuss next actions.
18:51 UTC – First alert triggered and acknowledged, troubleshooting commenced.
18:57 UTC – Client nodes restarted.
20:31 UTC – Service started recovering.
23:11 UTC – Second alert triggered and client node restarted.
23:19 UTC – Critical alert resolved and service fully operational.
The issues experienced with the EMEA07 FIND cluster were caused by a tenant of the shared cluster sending queries with many nested items, which repeatedly caused memory congestion on a few nodes. Queries then failed on the node with high heap usage. When that node was restarted, the Elasticsearch primary shard being indexed to moved to another node, and the scenario replayed itself until the cluster's internal shards had been re-balanced and the service was fully operational again.
During the events, requests to this FIND cluster may have experienced network timeouts (5xx errors) or slow response times when trying to connect.
Short-term mitigation
Long-term mitigation
We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties, and will do everything we can to learn from the event and avoid a recurrence.