Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday, February 22, 2021, we experienced an event that impacted the functionality of the service in the US Digital Experience Cloud region. Details of the incident are described below.
Between 15:06 UTC and 16:44 UTC on February 22, the Search & Navigation cluster USPROD08 experienced an outage.
The issue was triggered by a significant increase in expensive queries, which caused spikes in Java heap memory consumption. Graceful restarts were performed to clear memory, and the service was fully operational at 16:44 UTC.
February 22, 2021
15:06 UTC – First alert and automated restarts were triggered.
15:17 UTC – Second alert acknowledged and troubleshooting initiated.
15:58 UTC – Elasticsearch and runner nodes were manually restarted.
16:15 UTC – Status page updated.
16:26 UTC – Service started recovering.
16:44 UTC – Critical alert resolved and service fully operational.
The cause of this incident was a peak in Java heap memory consumption. Analysis identified that the cluster was overwhelmed by a sudden increase in expensive requests. This caused over-allocation of heap on the nodes and failed garbage collections, which in turn caused queries to fail on the node with high heap usage. When that node was restarted, the Elasticsearch primary shard that was being indexed to moved to another node. The scenario repeated itself until the cluster's internal shards had been rebalanced and the service was fully operational.
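As an illustration of how heap pressure of this kind might be detected before garbage collection starts failing, the following minimal sketch flags nodes whose heap usage is at or above a threshold. It assumes input shaped like the response of the Elasticsearch node stats API (`jvm.mem.heap_used_percent` per node). The function name and the 85% threshold are hypothetical choices for this example and do not describe the monitoring actually used for the service.

```python
from typing import Any


def nodes_over_heap_threshold(
    stats: dict[str, Any], threshold_percent: int = 85
) -> list[tuple[str, int]]:
    """Return (node name, heap used %) for nodes at or above the threshold.

    `stats` is assumed to follow the shape of an Elasticsearch node stats
    response: {"nodes": {<id>: {"name": ..., "jvm": {"mem": {...}}}}}.
    The default threshold is an arbitrary value chosen for illustration.
    """
    flagged = []
    for node_id, node in stats.get("nodes", {}).items():
        heap_pct = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap_pct is not None and heap_pct >= threshold_percent:
            # Fall back to the node id when no name is reported.
            flagged.append((node.get("name", node_id), heap_pct))
    # Highest heap usage first, so the most at-risk node is listed first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A check like this could run on a schedule and raise an alert when the returned list is non-empty, giving operators time to act before queries on the affected node begin to fail.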
During the incident, requests to this Search & Navigation cluster may have experienced network timeouts (5xx errors) or slow response times when trying to connect.
Short-term mitigation
Long-term mitigation
We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties, and we will do everything we can to learn from this event and avoid a recurrence.