Search & Navigation Incident - US Region (USEA01)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Wednesday, April 14, 2021 and Thursday, April 15, 2021 we experienced events that impacted the functionality of the service in the US Digital Experience Cloud region. Details of the incident are described below.

DETAILS

Between April 14, 2021 10:18 UTC and April 15, 2021 07:35 UTC, the Search & Navigation cluster USEA01 experienced intermittent outages.

The issue was triggered by consistently high Java heap memory consumption, which led to failed shard allocation and frequent garbage collection. A setting was reconfigured to trigger shard reallocation and reduce the high heap usage, and the service was fully operational by April 15, 2021 07:35 UTC.
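Search & Navigation runs on Elasticsearch, and this report does not name the exact setting that was reconfigured. Purely as an illustration, the sketch below shows how a standard allocation setting (cluster.routing.allocation.enable) can be re-enabled through the cluster settings API; the endpoint and credentials are placeholders, not the actual USEA01 configuration.

    # reenable_allocation.py - illustrative sketch only; the setting changed during
    # this incident is not disclosed in the report.
    import requests

    CLUSTER_URL = "https://usea01.example.internal:9200"  # placeholder endpoint
    AUTH = ("admin", "password")                          # placeholder credentials

    def enable_shard_allocation():
        """Re-enable automatic shard allocation via the cluster settings API."""
        resp = requests.put(
            f"{CLUSTER_URL}/_cluster/settings",
            json={"persistent": {"cluster.routing.allocation.enable": "all"}},
            auth=AUTH,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        print(enable_shard_allocation())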

TIMELINE

April 14, 2021

10:18 UTC – First alert and automation restarts were triggered. 

10:20 UTC – Critical alert triggered, acknowledged and investigation initiated. 

11:34 UTC – STATUSPAGE updated. 

11:35 UTC – Restarted an unhealthy node and allocated shards.

12:35 UTC – Service operation recovered. 

16:17 UTC – Second alert and automation restarts were triggered. 

17:33 UTC – Critical alert triggered and investigation immediately started. 

18:38 UTC – STATUSPAGE updated. 

19:54 UTC – Issue identified and mitigation actions were performed. 

20:11 UTC – Service operation was recovered and monitored. 

April 15, 2021

00:36 UTC – First alert and automation restarts were triggered. 

06:55 UTC – Second alert triggered and the issue was quickly identified. 

07:09 UTC – STATUSPAGE updated. 

07:30 UTC – Mitigation actions were immediately performed. 

07:32 UTC – Root cause identified and the Engineering team started working on long-term mitigation actions. 

07:35 UTC – Critical alert resolved and service fully operational.

ANALYSIS

The investigation found that the shard balancing process was not being triggered automatically due to an incorrect setting, which prevented the cluster from returning to a normal state. 
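For context, Elasticsearch provides a cluster allocation explain API that reports why a shard remains unassigned and which allocation settings are blocking it. The snippet below is a minimal diagnostic sketch of that kind of check, not the procedure used during this incident; the endpoint and credentials are placeholders.

    # allocation_check.py - illustrative diagnostic sketch, not the incident runbook.
    import requests

    CLUSTER_URL = "https://usea01.example.internal:9200"  # placeholder endpoint
    AUTH = ("admin", "password")                          # placeholder credentials

    def explain_unassigned_shard():
        """Ask the cluster why an unassigned shard is not being allocated."""
        resp = requests.get(
            f"{CLUSTER_URL}/_cluster/allocation/explain",
            auth=AUTH,
            timeout=30,
        )
        resp.raise_for_status()
        explanation = resp.json()
        # "allocate_explanation" summarizes why the shard cannot be assigned,
        # for example allocation being disabled by a cluster-level setting.
        return explanation.get("allocate_explanation")

    if __name__ == "__main__":
        print(explain_unassigned_shard())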

IMPACT

During the events, a subset of requests to the Search & Navigation cluster may have experienced network timeouts, 5xx errors, or slow response times when connecting. 

CORRECTIVE MEASURES

Short-term mitigation

  • Restarting of unhealthy nodes and reallocation of shards.
  • Correction of the cluster allocation setting. 

Long-term mitigation

  • Application experts were engaged to review the indexing implementation of a shared-resource tenant, and they provided a solution to reduce the impact on the service.  

FINAL WORDS

We apologize for the impact to affected customers. We are strongly committed to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties, and we will do everything we can to learn from this event to avoid a recurrence in the future.

Posted May 25, 2021 - 09:34 UTC

Resolved
This incident has been resolved. We will continue our investigation to establish the cause.
The Incident Postmortem will be published as soon as it becomes available.
Posted Apr 15, 2021 - 02:16 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 14, 2021 - 19:54 UTC
Investigating
We are currently investigating an event that is impacting the availability of the Search & Navigation (Find) service in the US region, USEA01.

A subset of clients may be experiencing high latency or 5xx errors.

We will post additional updates hourly or when we have information available.

Thank you for your patience.
Posted Apr 14, 2021 - 18:38 UTC