Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday, November 23, 2020, we experienced an event that impacted the functionality of the service in the East US Digital Experience Cloud region. Details of the incident are described below.
At 07:30 UTC on November 23, 2020, the Engineering team received an alert for the Search & Navigation cluster USEA02. Troubleshooting began immediately, and the investigation found that cluster performance was degraded by a sudden increase in bulk requests. Rolling restarts of the overloaded data nodes were performed immediately to recover the service. Once the cause had been identified and the bulk operation stopped, the service was fully restored at 08:18 UTC on November 23, 2020.
November 23, 2020
07:30 UTC – First alert is triggered by global monitoring system.
07:31 UTC – Alert acknowledged and troubleshooting started.
07:40 UTC – Restarted data nodes.
07:43 UTC – Status page updated.
08:07 UTC – Full cluster restart was performed and service started recovering.
08:18 UTC – Critical alert resolved and service fully operational.
The investigation identified that the cluster was overloaded by a sudden increase in bulk requests over a short period of time. This caused high heap usage on some nodes. When multiple data nodes frequently exceed the heap threshold, the automation process stops for security reasons.
The overloaded data nodes, including the main node, had to be manually restarted, releasing a number of pending tasks stuck in the queue.
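The rule described above, where automation stops once several data nodes repeatedly exceed the heap threshold, can be sketched roughly as follows. This is only an illustration: the threshold values, the `HeapSample` type, and the function names are hypothetical and are not taken from the actual Search & Navigation automation.

```python
from dataclasses import dataclass

# Hypothetical values; the report does not state the real thresholds.
HEAP_THRESHOLD_PERCENT = 85.0   # heap usage considered too high
MIN_BREACHES_PER_NODE = 3       # breaches needed to count as "frequent"
MAX_BREACHING_NODES = 1         # more nodes than this halts automation


@dataclass
class HeapSample:
    """One heap-usage reading for a single node (hypothetical shape)."""
    node: str
    heap_used_percent: float


def nodes_over_threshold(samples):
    """Return the sorted names of nodes that frequently exceed the threshold."""
    breaches = {}
    for sample in samples:
        if sample.heap_used_percent > HEAP_THRESHOLD_PERCENT:
            breaches[sample.node] = breaches.get(sample.node, 0) + 1
    return sorted(
        node for node, count in breaches.items()
        if count >= MIN_BREACHES_PER_NODE
    )


def automation_should_halt(samples):
    """Halt automated recovery when multiple nodes are frequently over the limit."""
    return len(nodes_over_threshold(samples)) > MAX_BREACHING_NODES
```

With these assumed values, a single node under pressure would still be handled automatically, while two or more would require manual intervention, as happened in this incident.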
During the event, requests to this Search & Navigation cluster may have experienced network timeouts (5xx errors) or slow response times when trying to connect.
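Client applications can soften the effect of short-lived timeouts and 5xx errors like these by retrying with exponential backoff. The sketch below is a generic pattern, not part of the Search & Navigation client library: `TransientSearchError`, `call_with_backoff`, and the default delays are hypothetical names and values chosen for illustration.

```python
import random
import time


class TransientSearchError(Exception):
    """Hypothetical error standing in for a timeout or 5xx response."""


def call_with_backoff(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff.

    The delay doubles after each failed attempt and includes random jitter
    so that many clients do not retry at the same moment.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientSearchError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Keeping the number of attempts small matters here: aggressive retries against an already overloaded cluster would add to the load rather than relieve it.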
Short-term mitigation
Long-term mitigation
We apologize for the impact on affected customers. We are strongly committed to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties and will do everything we can to learn from this event and avoid a recurrence.