Search & Navigation Incident - US Region (USEA02)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday, April 19 and Tuesday, April 20, 2021, we experienced an incident that impacted the functionality of the service in the US Digital Experience Cloud region. Details of the incident are described below.

DETAILS

Between April 19, 2021 17:47 UTC and April 20, 2021 13:22 UTC the Search & Navigation cluster USEA02 experienced intermittent outages.

The issue was triggered by consistently high Java heap memory consumption, which led to failed shard allocation and failing garbage collection. A setting was reconfigured to trigger shard reallocation and reduce the high heap usage, and the service was fully operational as of April 20, 2021, 13:22 UTC.
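
As context for the mitigation, the following is a minimal sketch in Python of how heap pressure of this kind can be observed. It assumes the cluster is Elasticsearch-based and reachable at a placeholder endpoint; the node-stats API shown is standard Elasticsearch, but the endpoint is illustrative only, not the real USEA02 address.

    import requests

    ES = "http://localhost:9200"  # placeholder endpoint, not the real USEA02 address

    # Poll per-node JVM heap usage. Sustained readings near the heap limit
    # are an early sign of the failing garbage collection described above.
    stats = requests.get(f"{ES}/_nodes/stats/jvm").json()
    for node in stats["nodes"].values():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        print(f"{node['name']}: heap {heap_pct}% used")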

 

TIMELINE

April 19, 2021

17:47 UTC – First alert and automation restarts were triggered. 

23:16 UTC – Second alert acknowledged and troubleshooting started. 

23:31 UTC – Issue identified and mitigation actions were performed. 

00:47 UTC (April 20) – Service operation was recovered and placed under monitoring. 

 

April 20, 2021

09:31 UTC – First alert triggered and the issue was quickly identified. 

09:55 UTC – Status page updated. 

09:59 UTC – Mitigation action was performed. 

10:04 UTC – Root cause identified and the Engineering team started working on long-term mitigation actions. 

13:22 UTC – Critical alert resolved and service fully operational.

ANALYSIS

The issue was caused by a sustained volume of high-latency queries from a single tenant on shared resources, which had begun implementing a new strategy to improve multilingual search. This resulted in high Java heap memory consumption and failing garbage collection. When this happened, nodes were evicted from the cluster. The shard-balancing process was not triggered automatically due to an incorrect setting, which subsequently prevented the cluster from returning to a normal state. 

Once the setting was corrected and the tenant in question on the shared resources was identified, a mitigation plan was activated and the service began to recover. 
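
The report does not name the exact setting involved. As an illustration only, a correction of this kind on an Elasticsearch cluster might look like the following Python sketch, using the standard cluster-settings API with cluster.routing.allocation.enable as a stand-in for the actual setting.

    import requests

    ES = "http://localhost:9200"  # placeholder endpoint

    # Inspect the current persistent and transient allocation settings.
    print(requests.get(f"{ES}/_cluster/settings").json())

    # Re-enable automatic shard allocation. The concrete setting corrected
    # during this incident is not named in the report; this one is a
    # commonly adjusted stand-in.
    requests.put(
        f"{ES}/_cluster/settings",
        json={"persistent": {"cluster.routing.allocation.enable": "all"}},
    )

    # Confirm that unassigned shards are being picked up again.
    health = requests.get(f"{ES}/_cluster/health").json()
    print(health["status"], health["unassigned_shards"])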

IMPACT

During the events, a subset of requests to this Search & Navigation cluster may have experienced timeouts or 5xx errors, or slow response times when trying to connect. 

CORRECTIVE MEASURES

Short-term mitigation

  • Restarting of unhealthy nodes and reallocation of shards (a sketch of a manual reallocation retry follows this list).
  • Correction of the cluster allocation setting. 
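
As referenced in the first item above, a manual retry of failed shard allocations on an Elasticsearch cluster can be triggered as follows. This is a generic Python sketch against a placeholder endpoint, not the exact procedure used during this incident.

    import requests

    ES = "http://localhost:9200"  # placeholder endpoint

    # Ask the cluster to retry shards whose allocation previously failed
    # (for example, after unhealthy nodes have been restarted).
    resp = requests.post(f"{ES}/_cluster/reroute", params={"retry_failed": "true"})
    resp.raise_for_status()

    # Watch recovery progress: the count of unassigned shards should fall.
    health = requests.get(f"{ES}/_cluster/health").json()
    print(health["status"], health["unassigned_shards"])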

Long-term mitigation

  • Application experts were engaged to understand the indexing implementation of the shared-resource tenant and to provide a solution that reduces the impact on the service. 

FINAL WORDS

We apologize for the impact to affected customers. We have a strong commitment to deliver high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties, and will do everything we can to learn from this event to avoid a recurrence in the future.

Posted May 28, 2021 - 12:02 UTC

Resolved
This incident has been resolved.
We will continue our investigation to establish the root cause. An RCA will be published as soon as it becomes available.
Posted Apr 20, 2021 - 20:52 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 20, 2021 - 10:04 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 20, 2021 - 09:59 UTC
Investigating
We are currently investigating an event that is impacting the availability of the Search & Navigation (FIND) service in the US region (USEA02).

A subset of clients may be experiencing high latency or 5xx errors.
Posted Apr 20, 2021 - 09:55 UTC