Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday April 19 and Tuesday April 20, 2021, we experienced an event which impacted the functionality of the service in the US Digital Experience Cloud region. Details of the incident are described below.
Between April 19, 2021 17:47 UTC and April 20, 2021 13:22 UTC, the Search & Navigation cluster USEA02 experienced intermittent outages.
The issue was triggered by consistently high Java heap memory consumption, which led to failed shard allocation and failing garbage collection. A cluster setting was reconfigured to trigger shard reallocation and reduce the high heap usage, and the service was fully operational as of April 20, 2021 13:22 UTC.
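For context, Search & Navigation runs on an Elasticsearch cluster. The report does not name the exact setting that was reconfigured, so the sketch below is only illustrative: it shows how automatic shard allocation can be re-enabled through the standard Elasticsearch cluster settings API, using a placeholder endpoint and credentials.

```python
import requests

# Placeholder endpoint and credentials; the real USEA02 address is not public.
CLUSTER_URL = "https://usea02.example.com:9200"
AUTH = ("elastic", "changeme")

# "cluster.routing.allocation.enable" is a standard Elasticsearch setting that
# controls automatic shard allocation. Setting it back to "all" lets the
# cluster reassign unallocated shards across healthy nodes. The actual
# misconfigured setting in this incident is not named in the report.
resp = requests.put(
    f"{CLUSTER_URL}/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.enable": "all"}},
    auth=AUTH,
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```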
TIMELINE
April 19, 2021
17:47 UTC – First alert and automation restarts were triggered.
23:16 UTC – Second alert acknowledged and troubleshooting started.
23:31 UTC – Issue identified and mitigation actions were performed.
April 20, 2021
00:47 UTC – Service operation was recovered and monitoring continued.
09:31 UTC – First alert triggered and the issue was quickly identified.
09:55 UTC – Status page updated.
09:59 UTC – Mitigation action was performed.
10:04 UTC – Root cause identified and the Engineering team started working on long-term mitigation actions.
13:22 UTC – Critical alert resolved and service fully operational.
The issue was caused by a sustained volume of high-latency queries from a single tenant on shared resources, which had begun implementing a new strategy to improve multilingual search. This resulted in high Java heap memory consumption and failing garbage collection, and nodes were consequently evicted from the cluster. The shard rebalancing process was not triggered automatically due to an incorrect setting, which prevented the cluster from returning to a normal state.
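To illustrate how this kind of heap pressure typically surfaces in monitoring, the hypothetical sketch below polls per-node heap usage with the standard Elasticsearch `_cat/nodes` API; this is an assumed tooling example, not the team's actual monitoring setup.

```python
import requests

CLUSTER_URL = "https://usea02.example.com:9200"  # placeholder endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials

# Per-node heap usage: values that stay near 100% usually mean the JVM is
# spending most of its time in garbage collection, which is when nodes start
# dropping out of the cluster.
resp = requests.get(
    f"{CLUSTER_URL}/_cat/nodes",
    params={"h": "name,heap.percent,heap.max", "format": "json"},
    auth=AUTH,
    timeout=10,
)
resp.raise_for_status()

for node in resp.json():
    heap = int(node["heap.percent"])
    flag = "  <-- high heap pressure" if heap >= 85 else ""
    print(f"{node['name']}: {heap}% of {node['heap.max']}{flag}")
```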
Once the setting was corrected and the tenant on the shared resources was identified, a mitigation plan was activated and the service began recovering.
During the events, a subset of requests to this Search & Navigation cluster may have experienced network timeouts (5xx errors) or slow response times when trying to connect.
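Client applications affected by such transient failures can usually shield themselves with retries and exponential backoff. The sketch below is a generic client-side pattern with a placeholder endpoint, not part of the Search & Navigation client library:

```python
import time
import requests

def query_with_retry(url: str, params: dict, attempts: int = 4) -> requests.Response:
    """Retry transient 5xx responses and timeouts with exponential backoff."""
    for attempt in range(attempts):
        try:
            resp = requests.get(url, params=params, timeout=5)
            if resp.status_code < 500:
                return resp  # success, or a client error we should not retry
        except requests.exceptions.RequestException:
            pass  # connection error or timeout: treat as transient
        if attempt < attempts - 1:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s
    raise RuntimeError(f"Search request failed after {attempts} attempts")

# Example usage against a placeholder endpoint:
# response = query_with_retry("https://search.example.com/_search", {"q": "shoes"})
```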
Short-term mitigation
The cluster setting that blocked automatic shard reallocation was corrected, and the tenant generating the high-latency queries was identified so that heap usage could be brought back to normal levels.
Long-term mitigation
We apologize for the impact to affected customers. We are strongly committed to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties, and will do everything we can to learn from this event and avoid a recurrence in the future.