Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday, 19th October and Tuesday, 20th October 2020, we experienced events that impacted the functionality of the service in the East US Digital Experience Cloud region. Details of the incident are described below.
At 16:57 UTC on 19th October 2020, the Engineering team received an alert for Search & Navigation cluster USEA02. Troubleshooting began immediately, and the investigation found that cluster performance was degraded by a number of malicious requests from a tenant of the shared resources. Once the cause was identified, a scale-out operation was performed to resolve the incident. The service was fully restored at 05:24 UTC on 20th October 2020.
October 19th, 2020
16:57 UTC – First alert triggered by the global monitoring system.
17:00 UTC – Alert acknowledged and troubleshooting started.
18:10 UTC – Mitigation steps were initiated.
21:30 UTC – Full cluster restart was performed.
22:02 UTC – Second alert triggered; investigation to identify the cause continued.
22:53 UTC – Mitigation steps continued to be applied.
23:26 UTC – Full cluster restart was performed.
October 20th, 2020
02:39 UTC – Status page updated.
05:15 UTC – Scale-out performed.
05:24 UTC – Critical alert resolved and service fully operational.
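The intervals implied by the timeline above can be worked out with a short calculation. The following is only an illustrative sketch: the timestamps are copied from the timeline, while the helper names (utc, fmt) and the choice of which intervals to report are assumptions made for this example, not part of any Episerver tooling.

```python
from datetime import datetime, timedelta, timezone


def utc(date_str: str, time_str: str) -> datetime:
    """Build a timezone-aware UTC datetime from a timeline entry (hypothetical helper)."""
    parsed = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Format a timedelta as hours and minutes (hypothetical helper)."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Timestamps taken directly from the timeline above.
first_alert = utc("2020-10-19", "16:57")
acknowledged = utc("2020-10-19", "17:00")
scale_out = utc("2020-10-20", "05:15")
resolved = utc("2020-10-20", "05:24")

time_to_acknowledge = fmt(acknowledged - first_alert)
scale_out_to_resolution = fmt(resolved - scale_out)
total_duration = fmt(resolved - first_alert)

print("Time to acknowledge:     ", time_to_acknowledge)
print("Scale-out to resolution: ", scale_out_to_resolution)
print("First alert to resolved: ", total_duration)
```

Under these assumptions, the alert was acknowledged 3 minutes after it fired, the critical alert cleared 9 minutes after the scale-out, and 12 hours 27 minutes passed between the first alert and full resolution.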
The investigation found that CPU usage rose from 10% to 100% on all nodes from one minute to the next. The cause of the high CPU usage was a tenant of the shared resources sending malicious requests to the Elasticsearch queue. The cluster did not have enough capacity to handle this sudden, unexpected growth in load, but once the environment was scaled out, the cluster was able to complete all of the unexpected tasks.
During the events, requests to this Search & Navigation cluster may have resulted in network timeouts (5xx errors) or slow response times when trying to connect.
We apologize for the impact on affected customers. We are strongly committed to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties and will do everything we can to learn from this event and avoid a recurrence.