FIND Incident - EU Region (EMEA04)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Search & Navigation (also known as Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Friday, 7 August 2020, an event impacted the functionality of the service in the EU Digital Experience Cloud region, and the following report describes the details of that event.

DETAILS

Between 2020-08-07 12:28 UTC and 2020-08-07 15:20 UTC, FIND cluster EMEA04 experienced a service degradation.

The issue was triggered by a sudden increase in incoming bulk requests, which caused high memory demand on the cluster nodes. Graceful node restarts were performed to mitigate the incident as quickly as possible, and the service was fully restored the same day at 15:20 UTC.

TIMELINE

2020-08-07 12:28 UTC – First alert is triggered by the monitoring system.

2020-08-07 12:29 UTC – Alert acknowledged and troubleshooting initiated.

2020-08-07 12:30 UTC – Rolling restarts are performed to mitigate high memory utilization. 

2020-08-07 12:43 UTC – Status page updated.

2020-08-07 12:53 UTC – Cluster in yellow state: all primary shards allocated, but not all replicas.

2020-08-07 13:32 UTC – Cluster is green with all shards allocated; shard re-balancing is in progress.

2020-08-07 15:20 UTC – Global monitoring system reports that full functionality has been restored.

ANALYSIS

The cause of this incident was a sudden peak in Java heap memory consumption. Analysis identified that this was due to a high volume of bulk requests towards the cluster, which caused over-allocation of heap on the nodes and failed garbage collections. This causes queries to fail on the node with high heap usage, and when that node is restarted, the Elasticsearch primary shard being indexed to moves to another node and the scenario repeats itself until the cluster's internal shards have been re-balanced and the service is fully operational.
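For context, cluster state and per-node heap pressure of this kind are visible through Elasticsearch's standard REST endpoints (_cluster/health and _nodes/stats/jvm). The sketch below is a minimal illustration of such a check, not part of the Find service itself; the base URL is a hypothetical placeholder for an internal monitoring endpoint.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Minimal sketch: poll Elasticsearch cluster health and JVM heap statistics.
    // The base URL is a hypothetical placeholder, not the actual Find cluster address.
    public class ClusterHealthCheck {
        private static final String BASE_URL = "http://elasticsearch.internal:9200";

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // Cluster health reports green/yellow/red and unassigned shard counts.
            String health = get(client, BASE_URL + "/_cluster/health");
            System.out.println("Cluster health: " + health);

            // Per-node JVM stats include heap usage, which spiked during this incident.
            String jvmStats = get(client, BASE_URL + "/_nodes/stats/jvm");
            System.out.println("JVM stats: " + jvmStats);
        }

        private static String get(HttpClient client, String url) throws Exception {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        }
    }

In the timeline above, the yellow state corresponds to all primary shards being assigned while some replicas are not, and the green state to all shards being assigned.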

IMPACT

During the event, requests to this FIND cluster may have experienced 5xx errors, network timeouts, or slow response times when trying to connect.
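Errors of this kind are transient and cleared once the affected node was restarted. A common client-side pattern for riding out transient 5xx responses and timeouts is a bounded retry with exponential backoff; the sketch below illustrates that pattern only, and the timeout, attempt count, and backoff values are arbitrary examples rather than part of the Find client API.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    // Minimal sketch: bounded retry with exponential backoff for transient 5xx responses.
    // The retry limits and timeout are hypothetical example values.
    public class RetryingSearchClient {
        private static final int MAX_ATTEMPTS = 4;

        public static String searchWithRetry(HttpClient client, String url) throws Exception {
            long backoffMillis = 200;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                        .timeout(Duration.ofSeconds(5))
                        .GET()
                        .build();
                try {
                    HttpResponse<String> response =
                            client.send(request, HttpResponse.BodyHandlers.ofString());
                    if (response.statusCode() < 500) {
                        return response.body();      // success or client error: do not retry
                    }
                } catch (java.net.http.HttpTimeoutException timeout) {
                    // fall through and retry on timeouts as well
                }
                Thread.sleep(backoffMillis);         // wait before the next attempt
                backoffMillis *= 2;                  // exponential backoff
            }
            throw new RuntimeException("Search failed after " + MAX_ATTEMPTS + " attempts");
        }
    }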

CORRECTIVE MEASURES

Short-term mitigation

  • Restart of unhealthy nodes to recover service as soon as possible.

Long-term mitigation

  • Enable rate limiting on bulk requests (a minimal sketch of one possible approach follows below).
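Such rate limiting could, for example, be implemented as a token-bucket limiter in front of the bulk indexing endpoint. The sketch below illustrates that idea only; the capacity and refill rate are arbitrary example values and do not reflect the actual Find configuration.

    // Minimal sketch of a token-bucket rate limiter for bulk requests.
    // Capacity and refill rate are arbitrary example values, not the Find configuration.
    public class BulkRequestRateLimiter {
        private final long capacity;          // maximum number of tokens (burst size)
        private final double refillPerNano;   // tokens added per nanosecond
        private double tokens;
        private long lastRefillNanos;

        public BulkRequestRateLimiter(long capacity, double tokensPerSecond) {
            this.capacity = capacity;
            this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
            this.tokens = capacity;
            this.lastRefillNanos = System.nanoTime();
        }

        // Returns true if the bulk request may proceed, false if it should be rejected.
        public synchronized boolean tryAcquire() {
            long now = System.nanoTime();
            tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
            lastRefillNanos = now;
            if (tokens >= 1.0) {
                tokens -= 1.0;
                return true;
            }
            return false;
        }
    }

A limiter of this kind would sit in front of the bulk indexing endpoint so that bursts above the configured rate are rejected early (for example with HTTP 429) instead of accumulating on the nodes' heap.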

FINAL WORDS

We apologize for the impact to affected customers. Availability is a high priority for us, and we have a strong commitment to delivering high availability for our Find service. We will continue to prioritize our efforts to overcome these recent difficulties and will do everything we can to learn from this event and avoid a recurrence in the future.

Posted Aug 12, 2020 - 08:21 UTC

Resolved
This incident has been resolved.
Posted Aug 07, 2020 - 15:18 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 07, 2020 - 13:33 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Aug 07, 2020 - 12:54 UTC
Investigating
We are currently investigating an event that is impacting the functionality of the FIND service in the EU region, EMEA04.

A subset of clients will be experiencing high latency or 5xx-errors.
Posted Aug 07, 2020 - 12:43 UTC