Episerver Search & Navigation (also known as Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Friday, August 7, 2020, an event impacted the functionality of the service in the EU Digital Experience Cloud region. The following report describes the event in more detail.
Between 2020-08-07 12:28 UTC and 2020-08-07 15:20 UTC FIND Cluster EMEA04 experienced a service degradation.
The issue was triggered by a sudden increase in incoming bulk requests, which caused high memory demand on the cluster nodes. Graceful node restarts were performed to mitigate the incident as quickly as possible, and the service was fully restored the same day at 15:20 UTC.
2020-08-07 12:28 UTC – First alert is triggered by the monitoring system.
2020-08-07 12:29 UTC – Alert acknowledged and troubleshooting initiated.
2020-08-07 12:30 UTC – Rolling restarts are performed to mitigate high memory utilization.
2020-08-07 12:43 UTC – Status page is updated.
2020-08-07 12:53 UTC – Cluster is in yellow state: all primary shards are allocated, but not all replicas.
2020-08-07 13:32 UTC – Cluster is green: all shards are allocated, and shard re-balancing is still in progress.
2020-08-07 15:20 UTC – Global monitoring system is reporting full functionality restored.
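The yellow and green states in the timeline follow the usual Elasticsearch health convention: red when at least one primary shard is unassigned, yellow when all primaries are assigned but some replicas are not, and green when every shard is assigned. A minimal sketch of that rule is shown below. The `ShardCounts` type and `cluster_status` function are hypothetical names used only for illustration; they are not part of the Find service or of any Elasticsearch client.

```python
from dataclasses import dataclass


@dataclass
class ShardCounts:
    """Hypothetical summary of shard allocation in a cluster."""
    unassigned_primaries: int
    unassigned_replicas: int


def cluster_status(counts: ShardCounts) -> str:
    """Map shard allocation to a health colour (illustrative only)."""
    if counts.unassigned_primaries > 0:
        # At least one primary is missing: some data is unavailable.
        return "red"
    if counts.unassigned_replicas > 0:
        # All primaries are allocated, but redundancy is reduced.
        return "yellow"
    # Every primary and replica shard is allocated.
    return "green"
```

Note that a green status says nothing about shard placement, which is why re-balancing could continue after 13:32 UTC.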
The cause of this incident was a sudden peak in Java heap memory consumption. Analysis identified that this was due to a high volume of bulk requests to the cluster, which caused over-allocation of heap on the nodes and failed garbage collections. This results in failed queries on the node with high heap usage. When that node is restarted, the Elasticsearch primary shard being indexed to moves to another node, and the scenario repeats itself until the cluster's internal shards have been re-balanced and the service is fully operational.
During the event, requests to this FIND cluster may have experienced network timeouts (5xx errors) or slow response times when trying to connect.
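As an illustration of how a calling application might tolerate transient 5xx responses like those described above, the following sketch retries a request with exponential backoff and jitter. It is a generic pattern, not a mitigation taken from this report or a documented Find client feature. The `send` callable, the function names, and the default delays are all assumptions made for the example.

```python
import random
import time
from typing import Callable


def is_retryable(status: int) -> bool:
    """Treat any 5xx status as a transient server-side failure."""
    return 500 <= status <= 599


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> int:
    """Call `send` until it returns a non-5xx status or attempts run out.

    `send` is a hypothetical callable that performs one request and
    returns its HTTP status code. Returns the last status observed.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if not is_retryable(status):
            return status
        # The delay ceiling doubles with each retry, capped at max_delay.
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        # Random jitter spreads retries out so that clients do not
        # all hit a recovering cluster at the same moment.
        sleep(rng() * ceiling)
        status = send()
    return status
```

Capping the number of attempts matters here: unbounded retries against a cluster under memory pressure would add load rather than relieve it.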
Short-term mitigation
Long-term mitigation
We apologize for the impact on affected customers. Availability is a high priority for us, and we are strongly committed to delivering high availability for our Find service. We will continue to prioritize our efforts to overcome these recent difficulties and will do everything we can to learn from this event and avoid a recurrence.