Episerver Find is a Cloud-based Enterprise Search solution that delivers enhanced relevance and powerful search functionality to websites.
Between 2020-01-24 14:51 UTC and 2020-02-07 21:22 UTC, the Find cluster EMEA04 experienced intermittent incidents, culminating on Friday, February 7th, 2020. Details of the incidents are described below.
January 24th, 2020
14:51 UTC – First alert is triggered by the monitoring system.
15:03 UTC – Alert acknowledged and troubleshooting initiated.
15:03 UTC – Rolling restarts are performed to mitigate high memory utilization.
17:41 UTC – Service is restored and monitored.
20:18 UTC – Second alert is triggered and mitigation steps are immediately performed.
21:00 UTC – Service is restored.
January 25th, 2020
01:41 UTC – First alert is triggered.
01:41 UTC – Alert acknowledged and troubleshooting started.
05:05 UTC – Service is restored and monitored.
16:03 UTC – Second alert is triggered. Engineering immediately starts working on a fix.
16:24 UTC – STATUSPAGE updated.
17:08 UTC – Fix was successfully implemented.
18:40 UTC – Service operation is fully recovered.
January 27th, 2020
08:00 UTC – Retrospective performed for postmortem.
February 6th, 2020
13:27 UTC – First alert is triggered by the monitoring system.
13:27 UTC – Alert acknowledged and troubleshooting started.
18:35 UTC – STATUSPAGE updated.
18:39 UTC – Cluster restart initiated.
20:48 UTC – Issue is identified and Engineering immediately starts working on a fix.
21:30 UTC – The fix is fully implemented and the service is recovered.
February 7th, 2020
07:20 UTC – First alert is triggered by the monitoring system.
07:21 UTC – Alert acknowledged and troubleshooting started.
08:17 UTC – STATUSPAGE updated.
11:03 UTC – Engineering team starts working on mitigation steps.
13:29 UTC – The steps are successfully performed and the service is recovered.
The issues experienced with the EMEA04 Find cluster were found to be caused in part by long garbage collection times and failed garbage collection runs, which stalled the affected nodes. When nodes fail frequently enough, the cluster cannot recover quickly enough to stabilize overall behavior. Without proper recovery, customers were impacted to the point that all replicas and primaries were down, resulting in failed queries.
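Episerver Find is built on Elasticsearch, whose nodes report JVM garbage collection statistics that make this failure mode observable. The following is a minimal monitoring sketch, assuming the standard Elasticsearch _nodes/stats/jvm API; the endpoint URL and alert threshold are illustrative placeholders, not values from this cluster. It polls old-generation GC time per node and flags the kind of long pauses that stalled these nodes.

    # Minimal sketch: poll Elasticsearch JVM stats and flag long GC pauses.
    # ES_URL and the threshold are illustrative placeholders.
    import time
    import requests

    ES_URL = "http://localhost:9200"   # placeholder cluster endpoint
    GC_PAUSE_ALERT_MS = 10_000         # illustrative threshold per poll interval

    def check_gc(previous):
        stats = requests.get(ES_URL + "/_nodes/stats/jvm", timeout=5).json()
        current = {}
        for node_id, node in stats["nodes"].items():
            old_gc = node["jvm"]["gc"]["collectors"]["old"]
            current[node_id] = old_gc["collection_time_in_millis"]
            # Old-generation GC time accumulated since the previous poll.
            delta = current[node_id] - previous.get(node_id, current[node_id])
            if delta > GC_PAUSE_ALERT_MS:
                print("ALERT: node %s spent %d ms in old-gen GC" % (node["name"], delta))
        return current

    seen = {}
    while True:
        seen = check_gc(seen)
        time.sleep(60)   # poll once per minute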
The failure of a node also caused the replicas hosted on that node to be reallocated and migrated to other data nodes. This saturated the master nodes, which led to long repair times. When repairing a node takes long enough, one or more other nodes can fail under the increased traffic they carry during the repair period. This can cascade into scenarios requiring a full cluster restart, during which the Find service is unavailable to all clients located on the affected cluster.
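The repair behavior described above is visible through standard Elasticsearch cluster health fields. A minimal sketch, again with a placeholder endpoint: a sustained red status means one or more primaries are down and queries fail, while persistently high relocating or unassigned shard counts signal the replica migration that saturated the master nodes.

    # Minimal sketch: read cluster health via the standard _cluster/health API.
    import requests

    ES_URL = "http://localhost:9200"   # placeholder cluster endpoint

    health = requests.get(ES_URL + "/_cluster/health", timeout=5).json()
    print("status:", health["status"])                        # green / yellow / red
    print("relocating shards:", health["relocating_shards"])  # replicas being migrated
    print("unassigned shards:", health["unassigned_shards"])  # shards awaiting a home
    # A sustained "red" status means primaries are down and queries will fail;
    # high relocating/unassigned counts indicate a prolonged repair period.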
During these events, requests to this Find cluster would have seen network timeouts (5xx errors) or slow response times when trying to connect.
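Client applications can soften this failure mode by treating 5xx responses and timeouts as retryable. The sketch below is hypothetical (the query URL is a placeholder, and real integrations normally go through the Find client libraries); it retries failed requests with exponential backoff.

    # Hypothetical sketch: retry Find queries on 5xx errors and timeouts.
    import time
    import requests

    def query_with_retry(url, payload, attempts=4):
        for attempt in range(attempts):
            try:
                resp = requests.post(url, json=payload, timeout=10)
                if resp.status_code < 500:
                    resp.raise_for_status()   # surface 4xx errors immediately
                    return resp.json()
            except requests.Timeout:
                pass                          # treat a timeout as retryable
            time.sleep(2 ** attempt)          # back off: 1 s, 2 s, 4 s, ...
        raise RuntimeError("query failed after %d attempts" % attempts)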
CORRECTIVE MEASURES
We place the utmost importance on achieving and sustaining the highest level of availability for our customers, and we regret any disruption you have experienced in the service. We continue to work to prevent and mitigate service disruptions, and we will use the lessons from this incident to further those efforts and help ensure you receive a reliable and positive experience.