Find Incident - US Region (USEA02)
Incident Report for Optimizely Service
Postmortem

SUMMARY 

Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday, November 23, 2020, we experienced an event that impacted the functionality of the service in the East US Digital Experience Cloud region. Details of the incident are described below.

DETAILS 

Starting at 07:30 UTC on November 23, 2020, the Engineering team received an alert for the Search & Navigation cluster USEA02. Troubleshooting began immediately, and the investigation found that cluster performance was degraded by a sudden increase in bulk requests. Rolling restarts of overloaded data nodes were performed immediately to recover the service. Once the cause had been identified and the bulk operation stopped, the service was fully restored at 08:18 UTC on November 23, 2020.

TIMELINE 

November 23, 2020

07:30 UTC – First alert is triggered by global monitoring system.

07:31 UTC – Alert acknowledged and troubleshooting started. 

07:40 UTC – Restarted data nodes. 

07:43 UTC – Status page updated.

08:07 UTC – Full cluster restart was performed and service started recovering. 

08:18 UTC – Critical alert resolved and service fully operational.

ANALYSIS

The investigation identified that the cluster was overloaded by a sudden increase in bulk requests over a short period of time, which caused high heap usage on some nodes. When multiple data nodes repeatedly exceed the heap threshold, the automation process stops as a safety measure.
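The safety behavior described above can be sketched as follows. This is a minimal illustration, not the actual automation: the threshold values, node names, and function are hypothetical assumptions.

```python
# Hypothetical sketch of the safety check described above: automated
# actions pause when multiple data nodes exceed a heap-usage threshold.

HEAP_THRESHOLD = 0.85   # assumed threshold, as a fraction of max heap
MAX_HOT_NODES = 2       # assumed limit before automation pauses

def should_pause_automation(heap_usage_by_node):
    """Return True when enough data nodes exceed the heap threshold
    that automated actions should stop as a safety measure."""
    hot_nodes = [node for node, usage in heap_usage_by_node.items()
                 if usage > HEAP_THRESHOLD]
    return len(hot_nodes) >= MAX_HOT_NODES

# Example: two data nodes over the threshold, so automation pauses
usage = {"data-node-1": 0.91, "data-node-2": 0.88, "data-node-3": 0.60}
print(should_pause_automation(usage))  # prints True
```

With only a single node over the threshold, the same check would return False and automated remediation would continue.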

The overloaded data nodes, including the main node, had to be manually restarted, which released the pending tasks stuck in the queue.

IMPACT 

During the event, requests to this Search & Navigation cluster may have failed with timeouts or 5xx errors, or experienced slow response times when connecting.
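Transient 5xx errors of this kind are commonly handled on the client side with retries and exponential backoff. The sketch below illustrates that general pattern only; the `fetch` callable and `TransientServerError` type are stand-ins, not part of the Search & Navigation API.

```python
import random
import time

class TransientServerError(Exception):
    """Stand-in for a transient 5xx response from a search endpoint."""

def retry_with_backoff(fetch, attempts=4, base_delay=0.5):
    """Retry a call that may fail transiently, using exponential
    backoff with jitter; re-raises the error if every attempt fails."""
    for attempt in range(attempts):
        try:
            return fetch()
        except TransientServerError:
            if attempt == attempts - 1:
                raise
            # wait ~0.5s, ~1s, ~2s, ... plus random jitter between attempts
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Capping the number of attempts keeps clients from piling additional load onto an already overloaded cluster.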

CORRECTIVE MEASURES 

Short-term mitigation

  • Performed rolling restarts of unhealthy data nodes, followed by a full cluster restart.
  • Engaged application experts to review the indexing implementation of a tenant on the shared resource, to help improve service performance.

 

Long-term mitigation

  • Implement queuing support for bulk/index/delete requests, to gain better control over bulk operations.
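One way such queuing could work is a bounded queue that admits bulk/index/delete requests up to a fixed capacity and hands them to the cluster in small batches. This is a minimal sketch under our own assumptions; the class, capacity, and batch size are illustrative, not the planned implementation.

```python
from collections import deque

class BulkRequestQueue:
    """Illustrative bounded queue for bulk/index/delete requests:
    requests are admitted up to a fixed capacity, and the cluster
    drains them in limited batches instead of all at once."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()

    def submit(self, request):
        """Accept a request, or reject it when the queue is full so
        the caller backs off instead of overloading the cluster."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(request)
        return True

    def drain(self, batch_size):
        """Hand at most batch_size pending requests to the cluster."""
        batch = []
        while self.pending and len(batch) < batch_size:
            batch.append(self.pending.popleft())
        return batch

queue = BulkRequestQueue(capacity=3)
accepted = [queue.submit({"op": "index", "id": i}) for i in range(5)]
print(accepted)              # prints [True, True, True, False, False]
print(len(queue.drain(2)))   # prints 2
```

The back-pressure from `submit` is what prevents the sudden burst of bulk requests described in the analysis from reaching the data nodes directly.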

 

FINAL WORDS

We apologize for the impact to affected customers. We are strongly committed to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties and will do everything we can to learn from this event and avoid a recurrence.

Posted Dec 04, 2020 - 18:45 UTC

Resolved
This incident has been resolved.

We will continue our investigation to establish the root cause. A postmortem will be published as soon as it becomes available.
Posted Nov 23, 2020 - 09:10 UTC
Monitoring
The issue has been identified and mitigation steps have been applied.
The service is restored and we are monitoring the health.
Posted Nov 23, 2020 - 08:09 UTC
Investigating
We are currently investigating an event that is impacting the functionality of the Find service in the US region, USEA02.

A subset of clients may experience high latency or 5xx errors.
Posted Nov 23, 2020 - 07:43 UTC