Episerver Search & Navigation (A.K.A. Find) is a Cloud-based Enterprise Search solution that delivers enhanced relevance and powerful search functionality to websites. On Friday the 5th of June 2020 we experienced an event which impacted the functionality of the service in the EU Digital Experience Cloud region. Details of the incident are described below.
At 08:15 UTC on 5th June 2020, the Engineering team received an alert for Find cluster AZNEUPROD01. Troubleshooting began immediately, and the investigation found that the cluster's performance was degraded by a significant increase in delete and index requests. Once the source of the surge was identified, a resolution was implemented that mitigated the impact on the cluster. The service was restored at 18:00 UTC on 5th June 2020.
June 5th, 2020
08:15 UTC – First alert is triggered by global monitoring system.
08:22 UTC – Alert acknowledged and troubleshooting started.
11:54 UTC – Root cause identified and mitigation actions initiated.
12:01 UTC – Statuspage updated.
13:32 UTC – Fix was implemented and service started recovering.
17:11 UTC – The issue reoccurred.
17:24 UTC – Enabled throttling.
18:00 UTC – Global monitoring system reported full functionality restored.
The issue was caused by a sudden and significant increase in incoming delete and index requests from a tenant of the shared resources. This resulted in high Java heap memory consumption and failed garbage collections on the affected nodes. When this happens, the cluster master evicts those nodes from the cluster and marks them as failed. The default behavior is then to redistribute their data over the remaining nodes, which moved the memory pressure to another node, and the same scenario repeated itself on the other nodes.
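To illustrate the kind of throttling mentioned in the timeline, the following is a minimal sketch of a per-tenant token bucket, one common way to stop a single tenant's burst of index or delete requests from exhausting shared resources. It is not the mechanism actually used by the Find service; the class names, the `rate` and `capacity` parameters, and the tenant identifiers are all hypothetical.

```python
import time


class TokenBucket:
    """Permits `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens refilled per second (assumed value)
        self.capacity = capacity    # maximum burst size (assumed value)
        self.clock = clock          # injectable clock, which simplifies testing
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost=1.0):
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantThrottle:
    """Keeps one bucket per tenant so a surge from one tenant does not starve others."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, tenant_id):
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[tenant_id] = bucket
        return bucket.allow()


if __name__ == "__main__":
    throttle = TenantThrottle(rate=2.0, capacity=3.0)
    for i in range(5):
        # A rejected request would typically be answered with HTTP 429.
        print(i, throttle.allow("tenant-a"))
```

With a scheme like this, requests from the tenant causing the surge are rejected once its bucket is empty, while other tenants on the same cluster keep their own allowance.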
During the event, requests to this Find cluster may have experienced network timeouts (5xx errors) or slow response times when trying to connect.
Short-term mitigation
Long-term mitigation
We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our Find service. We will continue to prioritize our efforts to overcome these recent difficulties and will do everything we can to learn from this event and avoid a recurrence in the future.