FIND Incident - EU Region (EMEA04)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Find is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Thursday, 20 February 2020, an event impacted the functionality of the service in the EU Digital Experience Cloud region. The following report describes the event in detail.

DETAILS

Between 2020-02-20 09:55 UTC and 2020-02-20 11:47 UTC FIND Cluster EMEA04 experienced an outage.

The issue was triggered by a sudden increase in incoming requests, which caused high memory demand on several data nodes. The master node evicted those nodes from the cluster and marked them as failed, forcing them to restart and re-connect to the cluster. The master node itself was also fully restarted to clear the queue of pending tasks. The service was restored the same day at 11:47 UTC.
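
For reference, both the master's pending-tasks backlog and the number of nodes currently joined to a cluster can be read from standard Elasticsearch REST endpoints. The sketch below is a generic illustration only; the URL is a placeholder, not the EMEA04 endpoint.

    # Illustrative sketch: inspect node count and the master's pending-tasks queue.
    # The URL below is a placeholder, not the actual EMEA04 cluster endpoint.
    import requests

    ES_URL = "http://localhost:9200"  # assumption: replace with the real cluster URL

    health = requests.get(f"{ES_URL}/_cluster/health").json()
    print("status:", health["status"])
    print("nodes in cluster:", health["number_of_nodes"])
    print("pending tasks:", health["number_of_pending_tasks"])

    # A large or stuck pending-tasks queue on the elected master is the kind of
    # backlog that the master restart during this incident was meant to clear.
    for task in requests.get(f"{ES_URL}/_cluster/pending_tasks").json()["tasks"][:10]:
        print(task["priority"], task["source"], task["time_in_queue_millis"], "ms")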

TIMELINE

February 20th, 2020

09:55 UTC – First alert is triggered by monitoring system and troubleshooting initiated.

09:55 UTC – High JVM Heap usage detected and node restarts are performed.

10:07 UTC – STATUSPAGE updated.

10:18 UTC – A few nodes have been evicted from the cluster.

10:22 UTC – Evicted nodes are restarted and re-connected to the cluster.

11:11 UTC – Master node is restarted.

11:13 UTC – Cluster is recovering.

11:47 UTC – Service operational and critical alert resolved.

ANALYSIS

The cause of this incident was a sudden peak in JVM heap memory consumption. Analysis identified two things that took place at the same time and together caused the spike in load: a tenant on shared resources was sending a set of large files to the cluster, while extensive calls were being made from the Episerver FIND index management portal. This caused over-allocation of heap on the nodes and failed garbage collections. Queries then fail on the node with high heap usage, and when that node is restarted, the Elasticsearch primary shard being indexed to moves to another node, where the scenario repeats itself until the cluster's internal shards have been re-balanced and the service is fully operational.
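
To illustrate the failure mode described above, per-node JVM heap usage and old-generation garbage-collection activity can be read from the Elasticsearch nodes stats API. This is a generic sketch under assumed values (the endpoint URL and the 85% threshold), not the monitoring actually used for this cluster.

    # Illustrative sketch: flag nodes whose JVM heap usage is high enough that
    # old-generation garbage collection is likely to struggle. The URL and the
    # 85% threshold are assumptions for the example, not production values.
    import requests

    ES_URL = "http://localhost:9200"  # assumption: placeholder cluster URL
    HEAP_ALERT_PERCENT = 85           # assumption: example alert threshold

    stats = requests.get(f"{ES_URL}/_nodes/stats/jvm").json()
    for node_id, node in stats["nodes"].items():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        old_gc = node["jvm"]["gc"]["collectors"]["old"]
        if heap_pct >= HEAP_ALERT_PERCENT:
            print(f'{node["name"]}: heap {heap_pct}%, '
                  f'old GC count {old_gc["collection_count"]}, '
                  f'old GC time {old_gc["collection_time_in_millis"]} ms')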

IMPACT

During the event, requests to this FIND cluster may have experienced network timeouts (5xx errors) or slow response times when connecting.

CORRECTIVE MEASURES

Short-term mitigation

The engineering team performed node restarts to recover the service as soon as possible.
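
A common pattern for this kind of recovery is to restart one node at a time and wait for the cluster to report a healthy status before moving on. The sketch below is a generic illustration of that pattern, not the exact procedure used during this incident; restart_node() and the node names are hypothetical placeholders for the real orchestration.

    # Generic rolling-restart sketch; restart_node() is a hypothetical placeholder.
    import time
    import requests

    ES_URL = "http://localhost:9200"  # assumption: placeholder cluster URL

    def restart_node(node_name):
        # Hypothetical hook: trigger the restart of one data node via your own tooling.
        print(f"restarting {node_name} (placeholder; wire up real orchestration here)")

    def wait_for_status(wanted="green", timeout="10m"):
        # _cluster/health blocks until the cluster reaches the requested status
        # (or the timeout expires) when wait_for_status is supplied.
        resp = requests.get(
            f"{ES_URL}/_cluster/health",
            params={"wait_for_status": wanted, "timeout": timeout},
        ).json()
        if resp.get("timed_out"):
            raise RuntimeError(f"cluster did not reach {wanted} within {timeout}")

    for node in ["data-node-1", "data-node-2"]:  # assumption: example node names
        restart_node(node)
        wait_for_status("yellow")  # node is back and primaries are assigned
        wait_for_status("green")   # replicas fully recovered before the next node
        time.sleep(30)             # brief settle time between restarts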

Long-term mitigation

We are working on an improvement that prevents implicit calls to Elasticsearch when they are not necessary, to reduce the load on the backend.

Engineers are also developing and testing improvements to memory management on the individual data nodes.
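
To illustrate the first of these improvements, one way to trim implicit backend calls is to cache results that rarely change instead of querying Elasticsearch on every request. The sketch below is a hypothetical example of that idea (the helper name, URL, and 60-second TTL are assumptions), not the actual change being implemented.

    # Hypothetical illustration of trimming implicit backend calls: cache an
    # index-exists check for a short TTL instead of asking Elasticsearch every time.
    import time
    import requests

    ES_URL = "http://localhost:9200"  # assumption: placeholder cluster URL
    TTL_SECONDS = 60.0                # assumption: example cache lifetime
    _cache = {}                       # index name -> (exists, timestamp)

    def index_exists(name):
        cached = _cache.get(name)
        if cached and time.monotonic() - cached[1] < TTL_SECONDS:
            return cached[0]  # served from cache, no call to the backend
        # HEAD /<index> returns 200 if the index exists and 404 otherwise.
        exists = requests.head(f"{ES_URL}/{name}").status_code == 200
        _cache[name] = (exists, time.monotonic())
        return exists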

FINAL WORDS

We apologize for the impact on affected customers. Availability is a high priority for us, and we are strongly committed to delivering high availability for our Find service. We will continue to prioritize our efforts to overcome these recent difficulties and will do everything we can to learn from this event and avoid a recurrence.

Posted Feb 26, 2020 - 15:37 UTC

Resolved
This incident has been resolved.
Posted Feb 20, 2020 - 12:43 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 20, 2020 - 11:56 UTC
Identified
Engineers have identified what may be causing the issue and are continuing to work on determining mitigation options.
Posted Feb 20, 2020 - 11:11 UTC
Investigating
We are currently investigating an event that is impacting the functionality of the FIND service in the EU region.
A subset of clients may experience high latency or 5xx errors.
Posted Feb 20, 2020 - 10:07 UTC