FIND Incident - EU Region (EMEA04)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Find is a Cloud-based Enterprise Search solution that delivers enhanced relevance and powerful search functionality to websites.

Between 2020-01-24 14:51 UTC and 2020-02-07 21:22 UTC, FIND cluster EMEA04 experienced intermittent incidents, culminating on Friday, February 7th, 2020. Details of the incident are described below.

TIMELINE

January 24th, 2020

14:51 UTC – First alert is triggered by monitoring system.

15:03 UTC – Alert acknowledged and troubleshooting initiated.

15:03 UTC – Rolling restarts are performed to mitigate high memory utilization.

17:41 UTC – Service is restored and monitored.

20:18 UTC – Second alert is triggered and mitigation steps are immediately performed.

21:00 UTC – Service is restored.

January 25th, 2020

01:41 UTC – First alert is triggered.

01:41 UTC – Alert acknowledged and troubleshooting started.

05:05 UTC – Service is restored and monitored.

16:03 UTC – Second alert is triggered. Engineering immediately starts working on a fix.

16:24 UTC – STATUSPAGE updated.

17:08 UTC – Fix was successfully implemented.

18:40 UTC – Service operation is fully recovered.

January 27th, 2020

08:00 UTC – Retrospective held for the postmortem.

February 6th, 2020

13:27 UTC – First alert is triggered by monitoring system.

13:27 UTC – Alert acknowledged and troubleshooting started.

18:35 UTC – STATUSPAGE updated.

18:39 UTC – Cluster restart initiated.

20:48 UTC – Issue is identified and Engineering immediately starts working on a fix.

21:30 UTC – The fix is fully implemented and the service is recovered.

February 7th, 2020

07:20 UTC – First alert is triggered by monitoring system.

07:21 UTC – Alert acknowledged and troubleshooting started.

08:17 UTC – STATUSPAGE updated.

11:03 UTC – Engineering team starts working on mitigation steps.

13:29 UTC – The steps are successfully performed and the service is recovered.

ANALYSIS

The issues experienced on the EMEA04 Find cluster were found to be partially caused by long and failed garbage collection cycles, which caused nodes to stall. When nodes fail frequently enough, the cluster cannot recover quickly enough to stabilize overall behavior. Without proper recovery, customers were impacted to the point that all replicas and primaries were down, resulting in failed queries.
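
As an illustration of how such stalls surface, long old-generation garbage collection pauses on an Elasticsearch-style data node (the terminology of primaries, replicas, and master nodes used here suggests such a cluster) can be read from the standard nodes stats API. The following Python sketch is purely illustrative and is not part of the actual remediation; the cluster endpoint, sampling interval, and threshold are assumptions.

# Minimal sketch: flag nodes whose old-generation GC time grew sharply
# between two samples of the nodes stats API. The endpoint and threshold
# are illustrative assumptions, not values from the incident.
import time
import requests

CLUSTER = "http://localhost:9200"   # hypothetical cluster endpoint
THRESHOLD_MS = 5_000                # flag more than 5 s of old-gen GC per interval

def old_gc_millis():
    """Return {node_name: cumulative old-generation GC time in ms}."""
    stats = requests.get(f"{CLUSTER}/_nodes/stats/jvm", timeout=10).json()
    return {
        node["name"]: node["jvm"]["gc"]["collectors"]["old"]["collection_time_in_millis"]
        for node in stats["nodes"].values()
    }

before = old_gc_millis()
time.sleep(60)                      # sampling interval
after = old_gc_millis()

for name, total in after.items():
    delta = total - before.get(name, total)
    if delta > THRESHOLD_MS:
        print(f"{name}: spent {delta} ms in old-generation GC over the last minute")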

The failure of a node also caused the replicas hosted on that node to be reallocated and migrated to other data nodes. This saturated the master nodes, which led to long repair times. When a repair takes long enough, one or more additional nodes can fail under the increased traffic during the repair period. This can lead to scenarios requiring a full cluster restart, which leaves the Find service unavailable to all clients located on this cluster for the duration of the restart.
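
The state of such a repair backlog can be observed through the standard cluster health API of an Elasticsearch-style cluster. The short sketch below only illustrates which fields describe outstanding shard recovery work; the endpoint is an assumption.

# Sketch: summarise how much shard recovery work is outstanding.
# The cluster URL is a placeholder; the fields are part of the standard
# _cluster/health response of an Elasticsearch-style cluster.
import requests

CLUSTER = "http://localhost:9200"   # hypothetical cluster endpoint

health = requests.get(f"{CLUSTER}/_cluster/health", timeout=10).json()

print(f"status:              {health['status']}")              # green / yellow / red
print(f"initializing shards: {health['initializing_shards']}")
print(f"relocating shards:   {health['relocating_shards']}")
print(f"unassigned shards:   {health['unassigned_shards']}")
print(f"pending tasks:       {health['number_of_pending_tasks']}")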

IMPACT

During these events, requests to this FIND cluster would have seen network timeouts (5xx errors) or slow response times when trying to connect.

CORRECTIVE MEASURES

  • To decrease the load on the cluster, several indices that were contributing a significant amount of traffic were disabled.
  • An additional cluster has been provisioned in order to reduce the customer load on EMEA04.
  • We are also investigating and changing several settings to allow for faster recovery times for the nodes and cluster (an illustrative sketch of this kind of tuning follows this list).
  • We are actively collecting memory dumps and evaluating them to identify other measures we can take to improve garbage collection performance, in order to maximize operational efficiency and stability.
  • A misconfiguration was discovered in the application gateway that based the health of each proxy node on the availability of a single canary index. In a partially healthy cluster, if that canary index became inaccessible or non-functional, the entire proxy layer for the cluster would be ejected, resulting in failures for all tenants on that cluster. This issue was repaired as soon as it was identified and should not recur (a sketch of an index-independent health probe follows this list).
  • Ongoing steps are being discussed and tested in order to establish the full root cause and resolution.
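
As referenced in the settings bullet above, recovery and allocation behaviour in an Elasticsearch-style cluster is typically tuned through the cluster and index settings APIs. The sketch below only illustrates the kind of change involved; the setting values and the cluster endpoint are placeholders, not the tuning actually applied to EMEA04.

# Sketch only: example of adjusting recovery-related settings.
# The setting names are standard Elasticsearch settings; the values and
# endpoint are illustrative placeholders.
import requests

CLUSTER = "http://localhost:9200"   # hypothetical cluster endpoint

# Dynamic cluster-level settings: how many shard recoveries a node runs in
# parallel, and how much bandwidth recovery traffic may consume, so that
# repairs do not starve live query traffic.
cluster_settings = {
    "transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 4,
        "indices.recovery.max_bytes_per_sec": "80mb",
    }
}
requests.put(f"{CLUSTER}/_cluster/settings", json=cluster_settings, timeout=10)

# Index-level setting: wait before reallocating shards of a node that drops
# out briefly, so a short stall does not trigger a full replica migration.
index_settings = {"index": {"unassigned.node_left.delayed_timeout": "5m"}}
requests.put(f"{CLUSTER}/_all/_settings", json=index_settings, timeout=10)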
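
The gateway issue described above can be contrasted with a probe that reports on the proxy node itself rather than on a single canary index. The handler below is a hypothetical sketch of that idea, not the actual gateway configuration; the endpoint, port, and path are assumptions.

# Hypothetical sketch of a proxy health probe that does not depend on a
# single canary index: the probe only checks that this proxy can reach the
# cluster, so a partially healthy cluster does not eject the whole proxy
# layer. Endpoint, port, and path are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

CLUSTER = "http://localhost:9200"   # hypothetical cluster endpoint

class HealthProbe(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        try:
            # Reachability of the cluster, not availability of one index.
            requests.get(f"{CLUSTER}/_cluster/health", timeout=2)
            self.send_response(200)
        except requests.RequestException:
            self.send_response(503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthProbe).serve_forever()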

FINAL WORDS

We place the utmost importance and pride on achieving and sustaining the highest level of availability for our customers, and we regret any disruption you have experienced in the service. We continue to work tirelessly to ensure any and all service disruptions are prevented and/or mitigated, and we will use this incident to further these efforts to help ensure you receive a reliable and positive experience.

Posted Feb 12, 2020 - 18:15 UTC

Resolved
This incident has been resolved.
Posted Feb 12, 2020 - 13:34 UTC
Monitoring
A fix has been implemented and we are continuing to monitor the results.
Posted Feb 07, 2020 - 21:24 UTC
Investigating
We are aware of an issue that is impacting a limited group of customers on the FIND service in the EU region.
We are actively investigating and we sincerely apologize for the inconvenience.
Posted Feb 07, 2020 - 18:29 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 07, 2020 - 13:29 UTC
Update
A subset of clients could still experience network timeouts. We continue to closely monitor the service and work on mitigation efforts to restore full functionality.
Posted Feb 07, 2020 - 11:57 UTC
Update
We are continuing to investigate and test several mitigation efforts in order to recover service as quickly as possible.
Posted Feb 07, 2020 - 11:03 UTC
Update
We are continuing to investigate this issue.
Posted Feb 07, 2020 - 10:34 UTC
Update
The investigation remains ongoing at this time and we are treating this with the highest priority. We are working diligently to resolve the current issues as soon as we can.

During the time of the event, customers will experience 502 errors when connecting to the service.

We will provide next update within 60 minutes.
Posted Feb 07, 2020 - 10:02 UTC
Investigating
We are experiencing an event that is impacting a limited group of customers on the FIND service in the EU region.
Engineers are aware of the issue and are actively investigating.
We sincerely apologize for the inconvenience.
Posted Feb 07, 2020 - 08:17 UTC
Monitoring
A fix has been fully implemented and the service has recovered. We are continuing to monitor the results.
Posted Feb 06, 2020 - 21:48 UTC
Identified
The issue has been identified, a fix has been implemented, and the service is recovering. We sincerely apologize for the inconvenience.
Posted Feb 06, 2020 - 20:48 UTC
Investigating
We are currently investigating an event that is impacting functionality of the FIND service in the EU region.

A subset of clients will be experiencing high latency or 5xx-errors.
Posted Feb 06, 2020 - 18:35 UTC