Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Wednesday, June 9, 2021 and Thursday, June 10, 2021, we experienced events that impacted the functionality of the service in the EU Digital Experience Cloud region. Details of the incident are described below.
Between June 9, 2021 12:34 UTC and June 9, 2021 20:36 UTC, the Search & Navigation cluster EUFINDPROD25 experienced intermittent service degradation.
The issue was triggered by a peak in requests that caused the cluster to stop responding, and it could not recover on its own. An individual node restart was performed, but this did not bring the cluster back to the required state, so the decision was made to perform a full cluster restart. Restarting an entire cluster can be time consuming due to the number of nodes. During this incident, stabilizing the service took additional time because shard allocation was hindered by remnants of a failed deletion process. The service was fully operational on June 10, 2021 at 07:35 UTC.
June 9, 2021
12:34 UTC – First alert triggered and automated restarts initiated.
13:57 UTC – Critical alert triggered, acknowledged and investigation initiated.
14:50 UTC – Restarted an unhealthy node and allocated shards.
14:55 UTC – Full cluster restart performed.
16:14 UTC – Cluster started allocating shards.
16:36 UTC – Status page updated.
20:36 UTC – All healthy shards allocated and service functionality was recovered.
June 10, 2021
04:27 UTC – Root cause identified and long-term mitigation action started.
07:35 UTC – The fix was implemented and the incident was closed.
The retrospective identified that during the full cluster restart, an unhealthy node rejoined the pool, which delayed the shard allocation process. The unhealthy node held leftover shards from a deletion operation for a specific index that had failed and was never completed. Once the deletion process had been completed and all of these shards were removed, service was fully restored.
During the events, a subset of requests to the Search & Navigation cluster may have experienced network timeouts, 5xx errors, or slow response times when trying to connect.
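For client integrations, transient timeouts and 5xx responses like those described above are often handled by retrying with exponential backoff. The following is a minimal sketch in Python and is not part of the Search & Navigation client API: `backoff_delays`, `call_with_retry`, the `send` callable, and the delay values are all hypothetical names and assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, List


def backoff_delays(retries: int, base: float = 0.5, cap: float = 8.0) -> List[float]:
    """Upper bound in seconds for the wait before each retry: base * 2^i, capped."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]


def call_with_retry(
    send: Callable[[], int],
    retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> int:
    """Call send() until it returns a non-5xx HTTP status code.

    send is a hypothetical callable that performs one search request and
    returns the HTTP status code, or raises TimeoutError on a network timeout.
    """
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        try:
            status = send()
            if status < 500:
                # Success or a client error: retrying would not help.
                return status
        except TimeoutError:
            # Treat a timeout like a 5xx response and retry.
            pass
        if attempt == retries:
            break
        # Randomize the wait so many clients do not retry at the same instant.
        sleep(delays[attempt] * jitter())
    raise RuntimeError(f"search request still failing after {retries} retries")
```

Randomizing each wait helps keep many clients from retrying simultaneously, which matters when a cluster is already struggling under a peak of requests. A caller would typically catch the final `RuntimeError` and fall back to a degraded experience, such as hiding search results.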
Short-term mitigation
Long-term mitigation
We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties, and we will do everything we can to learn from this event and avoid a recurrence.