Find Incident - US Region (USPROD08)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday, February 22, 2021, we experienced an event that impacted the functionality of the service in the US Digital Experience Cloud region. Details of the incident are described below.

DETAILS

Between 15:06 UTC and 16:44 UTC on February 22, the Search & Navigation cluster USPROD08 experienced an outage.

The issue was triggered by a significant increase in expensive queries, which caused spikes in Java heap memory consumption. Graceful restarts were performed to clear memory, and the service was fully operational at 16:44 UTC.

TIMELINE

February 22, 2021

15:06 UTC – First alert and automated restarts were triggered.

15:17 UTC – Second alert acknowledged and troubleshooting initiated.

15:58 UTC – Elasticsearch and runner nodes were manually restarted.

16:15 UTC – Status page updated. 

16:26 UTC – Service started recovering. 

16:44 UTC – Critical alert resolved and service fully operational.

ANALYSIS

The cause of this incident was a peak in Java heap memory consumption. Analysis identified that the cluster was overwhelmed by a sudden increase in expensive requests. This caused over-allocation of heap on the nodes and failed garbage collections, which in turn caused queries to fail on the node with high heap usage. When that node was restarted, the Elasticsearch primary shard that was being indexed to moved to another node, which then came under the same heap pressure. This scenario repeated until the cluster's internal shards had been rebalanced and the service was fully operational.
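As a rough illustration of how heap pressure like this can be spotted per node, the sketch below scans a node statistics document for nodes with high JVM heap usage. It assumes input shaped like the response of Elasticsearch's `GET _nodes/stats/jvm` endpoint, which reports `heap_used_percent` for each node. The 85 percent threshold, the function name and the sample data are hypothetical and are not taken from the monitoring used during this incident.

```python
from typing import Dict, List

# Hypothetical alert threshold; the report does not state the real one.
HEAP_ALERT_PERCENT = 85


def nodes_over_heap_threshold(stats: Dict, threshold: int = HEAP_ALERT_PERCENT) -> List[str]:
    """Return the sorted names of nodes whose JVM heap usage is at or above threshold."""
    flagged = []
    for node_id, node in stats.get("nodes", {}).items():
        heap_pct = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap_pct is not None and heap_pct >= threshold:
            # Fall back to the node id when the node has no name field.
            flagged.append(node.get("name", node_id))
    return sorted(flagged)


# Made-up sample in the shape of a `_nodes/stats/jvm` response.
sample_stats = {
    "nodes": {
        "a1": {"name": "node-1", "jvm": {"mem": {"heap_used_percent": 97}}},
        "b2": {"name": "node-2", "jvm": {"mem": {"heap_used_percent": 62}}},
        "c3": {"name": "node-3", "jvm": {"mem": {"heap_used_percent": 88}}},
    }
}

print(nodes_over_heap_threshold(sample_stats))  # ['node-1', 'node-3']
```

A check of this kind only reports which nodes are under pressure. It does not address the expensive queries that caused the pressure.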

IMPACT

During the incident, requests to this Search & Navigation cluster may have resulted in network timeouts (5xx errors) or slow response times when trying to connect.
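For client applications, one common way to tolerate short periods of timeouts or 5xx responses is to retry with a bounded exponential backoff. The following is a minimal sketch of that general technique, not a recommendation from this report. The `send` callable is a hypothetical stand-in for whatever issues the search request: it is assumed to return an HTTP status code and to raise `TimeoutError` on a timeout. The attempt count and delays are illustrative only.

```python
import time
from typing import Callable, Optional


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call send() until it returns a non-5xx status or attempts run out.

    Returns the last status code seen, or None if the last attempt timed out.
    """
    status: Optional[int] = None
    for attempt in range(max_attempts):
        try:
            status = send()
        except TimeoutError:
            status = None
        if status is not None and status < 500:
            return status
        # Wait base_delay, then 2x, 4x, ... but not after the final attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return status
```

Keeping the number of attempts small matters here: aggressive retries against a cluster that is already short on heap would add to the load rather than relieve it.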

CORRECTIVE MEASURES

Short-term mitigation

  • Restart of unhealthy nodes to recover service as soon as possible.

Long-term mitigation

  • This issue is linked to periods of high load on the cluster. Our recommendation is to migrate to our new Search & Navigation Service platform, which runs the latest cloud version with an improved architecture. Migration can be requested by contacting support.

FINAL WORDS

We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties, and will do everything we can to learn from this event and avoid a recurrence.

Posted Mar 08, 2021 - 20:21 UTC

Resolved
This incident has been resolved.
We will continue our investigation to establish the root cause. An RCA will be published as soon as it becomes available.
Posted Feb 22, 2021 - 20:01 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 22, 2021 - 16:50 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Feb 22, 2021 - 16:42 UTC
Investigating
We are currently investigating an event that is impacting the functionality on the FIND service in the US region, USPROD08.

A subset of clients will be experiencing high latency or 5xx-errors.
Posted Feb 22, 2021 - 16:15 UTC