Find Incident - US Region (USEA02)
Postmortem

SUMMARY 

Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday the 19th of October and Tuesday the 20th of October 2020, we experienced events which impacted the functionality of the service in the East US Digital Experience Cloud region. Details of the incident are described below.

DETAILS 

Starting at 16:57 UTC on 19th October 2020, the Engineering team received an alert for Search & Navigation cluster USEA02. Troubleshooting began immediately, and the investigation found that cluster performance was degraded by a large number of malicious requests from a tenant of the shared resources. Once the cause was identified, a scale-out operation was performed to resolve the incident. The service was fully restored at 05:24 UTC on 20th October 2020.

TIMELINE 

October 19th, 2020

16:57 UTC – First alert triggered by the global monitoring system.

17:00 UTC – Alert acknowledged, and troubleshooting started. 

18:10 UTC – Mitigation steps were initiated. 

21:30 UTC – Full cluster restart was performed. 

22:02 UTC – Second alert triggered; investigation to identify the cause continued. 

22:53 UTC – Mitigation steps continued to be applied.

23:26 UTC – Full cluster restart was performed. 

 

October 20th, 2020

02:39 UTC – Status page updated.

05:15 UTC – Scale-out performed.

05:24 UTC – Critical alert resolved and service fully operational. 

ANALYSIS

Investigation identified that CPU usage on all nodes increased from 10% to 100% within the span of a minute. The cause of the high CPU usage was a tenant of the shared resources sending malicious requests that flooded the Elasticsearch search queue. The cluster did not have enough capacity to handle this unexpected rapid growth in load, but once the environment was expanded the cluster was able to work through the backlog of queued tasks.
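As background, the sudden queue and CPU growth described above is the kind of condition that can be observed directly through Elasticsearch's node-stats and cat APIs. The sketch below is illustrative only; the endpoint URL and alert thresholds are assumptions, not the values used by the monitoring system that raised the actual alert.

```python
# Minimal sketch (assumption): polling Elasticsearch for per-node CPU usage and
# search thread-pool queue depth to catch a sudden spike like the one above.
import requests

ES = "http://localhost:9200"   # hypothetical cluster endpoint
CPU_ALERT_PCT = 90             # assumed alerting threshold
QUEUE_ALERT = 500              # assumed search-queue depth threshold

def check_cluster():
    # Per-node OS-level CPU usage.
    stats = requests.get(f"{ES}/_nodes/stats/os", timeout=10).json()
    for node in stats["nodes"].values():
        cpu = node["os"]["cpu"]["percent"]
        if cpu >= CPU_ALERT_PCT:
            print(f"HIGH CPU {cpu}% on node {node['name']}")

    # Per-node search thread-pool queue depth and rejection count.
    rows = requests.get(
        f"{ES}/_cat/thread_pool/search",
        params={"format": "json", "h": "node_name,active,queue,rejected"},
        timeout=10,
    ).json()
    for row in rows:
        if int(row["queue"]) >= QUEUE_ALERT or int(row["rejected"]) > 0:
            print(f"SEARCH QUEUE PRESSURE on {row['node_name']}: "
                  f"queue={row['queue']} rejected={row['rejected']}")

if __name__ == "__main__":
    check_cluster()
```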

IMPACT 

During the events, requests to this Search & Navigation cluster may have experienced network timeouts (5xx errors) or slow response times when connecting. 
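For clients integrating directly with the service, transient 5xx errors and timeouts of this kind are usually best absorbed with a bounded retry and exponential backoff. The following is a minimal, hypothetical client-side sketch; the URL, attempt count, and delays are assumptions and not part of any official client library.

```python
# Minimal sketch (assumption): retry a search request on 5xx or timeout,
# backing off exponentially between attempts.
import time
import requests

def query_with_retry(url, params=None, attempts=4, base_delay=0.5):
    for attempt in range(attempts):
        try:
            resp = requests.get(url, params=params, timeout=5)
            if resp.status_code < 500:
                return resp          # success, or a client error worth surfacing
        except requests.RequestException:
            pass                     # timeout / connection error: retry
        time.sleep(base_delay * (2 ** attempt))   # exponential backoff
    raise RuntimeError(f"Search request failed after {attempts} attempts")
```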

CORRECTIVE MEASURES 

  • Performed rolling restarts of unhealthy nodes and a full cluster restart.
  • Enabled rate limiting to recover the service as quickly as possible (see the sketch after this list).
  • Expanded the data node count to distribute the unexpected spike in connections more evenly.
  • We will review our capacity planning model and adjust it to better handle unexpected rapid growth in load and prevent this from happening again.
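As referenced in the rate-limiting item above, a common way to protect a shared cluster from a single tenant's request burst is a per-tenant token bucket applied before requests reach the search queue. The sketch below is a simplified illustration under that assumption; the class names, limits, and status codes are hypothetical and do not describe the exact mechanism applied during this incident.

```python
# Minimal sketch (assumption): per-tenant token-bucket rate limiting so one
# tenant's burst cannot saturate the shared search queue for everyone.
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per tenant: e.g. 50 requests/second with a burst of 100 (illustrative).
buckets = defaultdict(lambda: TokenBucket(rate_per_sec=50, burst=100))

def handle_request(tenant_id: str) -> int:
    if not buckets[tenant_id].allow():
        return 429   # reject early instead of letting the request pile up in the queue
    return 200       # forward to the search cluster
```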

FINAL WORDS

We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties and will do everything we can to learn from this event to avoid a recurrence in the future.

Posted Nov 10, 2020 - 09:57 CET

Resolved
This incident has been resolved.

We will continue our investigation to establish the root cause. A postmortem will be published as soon as it becomes available.
Posted Oct 20, 2020 - 11:54 CEST
Update
We are continuing to monitor for any further issues.
Posted Oct 20, 2020 - 09:46 CEST
Monitoring
We are currently working on short- and long-term corrective actions to stabilize the service. At this moment, the service is reporting that functionality has been restored, and we are continuing to closely monitor its health.

If the situation changes we will provide a new update.

We sincerely apologize for the inconvenience!
Posted Oct 20, 2020 - 08:07 CEST
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 20, 2020 - 07:24 CEST
Investigating
We are currently investigating an event that is impacting the functionality of the FIND service in the US region, USEA02.

A subset of clients will be experiencing high latency or 5xx-errors.
Posted Oct 20, 2020 - 04:39 CEST
This incident affected: Digital Experience Cloud FIND (East US Region).