Find Incident - EU Region (EMEA07)

Incident Report for Optimizely Service

Postmortem

SUMMARY

Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday, September 21, and from Sunday, September 27 to Wednesday, September 30, 2020, an intermittent event impacted service functionality in the EU Digital Experience Cloud region. This report describes the event in more detail.

DETAILS

Between 2020-09-21 16:05 UTC and 2020-09-30 23:27 UTC the Search & Navigation cluster EMEA07 experienced intermittent service degradation.

The Engineering team worked to identify the root cause of the degradation and to apply mitigation steps to reduce the impact. Because the Search & Navigation service comprises many components, including API calls towards the cluster, the investigation was complex and exhaustive, and additional time was needed to stabilize the service and identify the resolution.

Details of the incident are described below.

TIMELINE

September 21, 2020

16:05 UTC – First alert triggered, acknowledged and investigation initiated.

17:45 UTC – STATUSPAGE updated.

18:54 UTC – Proxies were reset.

18:57 UTC – Critical alert resolved and service fully operational.

September 22, 2020

07:00 UTC – Retrospective meeting.

September 27, 2020

17:08 UTC – First alert triggered and acknowledged, investigation initiated.

17:36 UTC – Restarted an unhealthy node and allocated shards.

18:12 UTC – Service recovered.

September 28, 2020

00:47 UTC – First alert triggered and investigation initiated.

01:02 UTC – Restarted Elasticsearch service.

01:20 UTC – Service recovered.

07:00 UTC – Retrospective meeting to analyze and discuss next actions.

11:00 UTC – Second alert triggered and acknowledged, investigation initiated.

11:20 UTC – Restarted client nodes.

11:35 UTC – Service started recovering.

12:08 UTC – Third alert triggered and acknowledged.

12:17 UTC – Critical alert resolved and service fully operational.

16:23 UTC – Fourth alert triggered and acknowledged.

16:49 UTC – STATUSPAGE updated.

17:05 UTC – Service recovered.

September 29, 2020

02:51 UTC – First alert triggered and acknowledged, investigation initiated.

03:05 UTC – Client node restarted.

03:15 UTC – Service started recovering.

05:26 UTC – Second alert triggered and troubleshooting commenced.

05:27 UTC – Restart of client nodes due to overload.

06:12 UTC – STATUSPAGE updated.

07:44 UTC – Fix implemented and service recovering.

13:28 UTC – Statuspage incident kept open to continue monitoring service health closely.

16:19 UTC – Third alert triggered and acknowledged, investigation initiated.

16:29 UTC – Client node restarted and service recovered.

September 30, 2020

07:00 UTC – Retrospective meeting to analyze and discuss next actions.

18:51 UTC – First alert triggered and acknowledged, troubleshooting commenced.

18:57 UTC – Client nodes restarted.

20:31 UTC – Service started recovering.

23:11 UTC – Second alert triggered and client node restarted.

23:19 UTC – Critical alert resolved and service fully operational.

ANALYSIS

The issues experienced with the EMEA07 FIND cluster were found to be caused by a tenant of the shared resource sending queries with many nested items, which repeatedly caused memory congestion on a few nodes. Queries then failed on the node with high heap usage. Each time that node was restarted, the Elasticsearch primary shard being indexed to moved to another node, and the scenario repeated until the cluster's internal shards had been re-balanced and the service was fully operational.
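To illustrate what "many nested items" can mean in practice, the following is a minimal sketch of how the nesting depth of a JSON query body could be measured before the query is sent. It is not the check used by the service. The function names and the limit `MAX_NESTING_DEPTH` are hypothetical and chosen only for this example.

```python
from typing import Any

# Hypothetical limit for illustration only; not a value from this report.
MAX_NESTING_DEPTH = 10


def nesting_depth(node: Any) -> int:
    """Return how many levels of dicts and lists are nested inside `node`."""
    if isinstance(node, dict):
        return 1 + max((nesting_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((nesting_depth(v) for v in node), default=0)
    return 0  # scalars (strings, numbers, None) add no depth


def exceeds_limit(query: dict, limit: int = MAX_NESTING_DEPTH) -> bool:
    """Return True when the query body is nested more deeply than `limit`."""
    return nesting_depth(query) > limit
```

A client could call `exceeds_limit(query_body)` and log or reject the query before it reaches the cluster. The recursion is bounded by Python's recursion limit, so extremely deep bodies would need an iterative variant.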

IMPACT

During the events, requests to this FIND cluster may have resulted in network timeouts (5xx errors) or slow response times when trying to connect.

CORRECTIVE MEASURES

Short-term mitigation

The Engineering team restarted unhealthy nodes to recover service as quickly as possible.
The client node count was expanded to distribute the unexpected spike of connections more evenly.
The refresh interval was changed to reduce the frequency of calls to Elasticsearch.
Caching was applied to mapping requests to reduce unnecessary calls to the service.
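As an illustration of the mapping-request caching mentioned above, the following is a minimal sketch of a time-based cache placed in front of a lookup function. It is not the service's implementation. The class `TtlCache`, the `fetch` callback and the time-to-live value are assumptions made for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TtlCache:
    """Cache results of `fetch(key)` and reuse them for `ttl_seconds`."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical function that calls the service
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: no call to the service
        value = self._fetch(key)     # missing or expired: fetch and store
        self._entries[key] = (now, value)
        return value
```

With a cache like this, repeated mapping lookups for the same index within the time-to-live reach the service only once. The trade-off is that a changed mapping is not seen until the entry expires.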

Long-term mitigation

Implement queuing support for bulk/index/delete requests to improve control over bulk operations.
Two additional clusters have been provisioned to reduce the load on EMEA07.
To decrease the load on the cluster, the owners of several indices that contribute a significant amount of traffic have been offered alternative solutions.
Document best practice and fair use policy around nested queries.
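To illustrate the idea behind queuing support for bulk/index/delete requests, the following is a minimal sketch of a bounded queue that accepts requests up to a fixed capacity and hands them out in limited batches. It is not the planned design. The class `BoundedBulkQueue`, its capacity and its batch size are hypothetical.

```python
from collections import deque
from typing import Any, Deque, List


class BoundedBulkQueue:
    """Hold pending write requests and release them in bounded batches."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: Deque[Any] = deque()

    def submit(self, request: Any) -> bool:
        """Queue a request; return False when full so the caller can back off."""
        if len(self._items) >= self._capacity:
            return False
        self._items.append(request)
        return True

    def next_batch(self, max_size: int) -> List[Any]:
        """Remove and return up to `max_size` requests in arrival order."""
        batch: List[Any] = []
        while self._items and len(batch) < max_size:
            batch.append(self._items.popleft())
        return batch
```

A worker could call `next_batch` at a controlled rate and forward each batch to the cluster, so a burst from one tenant fills the queue rather than the nodes' memory. Rejected submissions give the caller a clear signal to retry later.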

FINAL WORDS

We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties, and will do everything we can to learn from the event to avoid a recurrence.

Posted Oct 05, 2020 - 07:49 UTC

Resolved

This incident has been resolved.
We will continue our investigation to establish the root cause. An RCA will be published as soon as it becomes available.

Posted Sep 21, 2020 - 20:59 UTC

Monitoring

A fix has been implemented and we are continuing to monitor the results.

Posted Sep 21, 2020 - 19:04 UTC

Investigating

We have received new alerts and are continuing to investigate this issue.

We apologize for the inconvenience and will share an update once we have more information.

Posted Sep 21, 2020 - 18:33 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Sep 21, 2020 - 18:01 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Sep 21, 2020 - 17:53 UTC

Investigating

We are currently investigating an event that is impacting the functionality on the FIND service in the EU region, EMEA07.

A subset of clients may experience high latency or 5xx errors.

Posted Sep 21, 2020 - 17:44 UTC