FIND Incident - EU Region (AZNEUPROD01)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Search & Navigation (formerly Find) is a cloud-based Enterprise Search solution that delivers enhanced relevance and powerful search functionality to websites. Between 2020-06-26 03:41 UTC and 2020-07-04 22:07 UTC, FIND cluster AZNEUPROD01 experienced intermittent incidents, culminating on Saturday 2020-07-04. Details of the incident are described below.

DETAILS

Starting at 03:41 UTC on June 26th, 2020, the Engineering team received an alert for FIND cluster AZNEUPROD01. Troubleshooting started immediately, and the investigation found that performance across the entire FIND cluster was degraded by an extreme variation in multi-request queries that increased traffic demand on the cluster. A final fix was implemented on July 4th to limit and optimize the load from the identified source towards the cluster. The service was fully restored the same day at 22:07 UTC.

TIMELINE

June 26th, 2020

03:41 UTC – First alert is triggered by global monitoring system.

03:42 UTC – Alert acknowledged and troubleshooting initiated.

03:42 UTC – Rolling restarts are performed to mitigate high memory utilization. 

04:33 UTC – STATUSPAGE updated.

04:33 UTC – Service started to recover. 

08:19 UTC – Issue started reoccurring. 

09:05 UTC – Identified the issue. Started to roll out mitigation steps.

11:36 UTC – Critical alert resolved and service fully operational.

Intermittent service degradation continued until July 4th at 22:07 UTC, when final mitigation efforts were in place to limit the load on the cluster.

ANALYSIS

The cause of this incident was a sudden peak in Java heap memory consumption. Analysis identified that the cluster was overwhelmed by a constant stream of large multi-request operations (bulk or multi-search), which allow multiple tasks to be performed in a single request. This caused over-allocation of heap on the nodes and failed garbage collections, which in turn caused queries to fail on the node with high heap usage. When that node is restarted, the Elasticsearch primary shard being indexed to moves to another node. The scenario repeats itself until the cluster's internal shards have been re-balanced and the service is fully operational.
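
To make the failure mode concrete, the sketch below shows the general shape of an Elasticsearch multi-search (_msearch) request, where a single HTTP call bundles many individual searches that the cluster must execute. This is an illustrative, assumption-level example only; the endpoint, index name, and query shapes are hypothetical placeholders and do not describe actual FIND customer traffic.

```python
# Illustrative sketch: one _msearch request fanning out into many searches.
# The host and index names below are hypothetical placeholders.
import json
import requests

ES_URL = "https://example-find-cluster.local:9200/_msearch"  # hypothetical endpoint


def build_msearch_body(queries):
    """Build the NDJSON body: one header line plus one query line per search."""
    lines = []
    for index_name, query in queries:
        lines.append(json.dumps({"index": index_name}))          # header line
        lines.append(json.dumps({"query": query, "size": 100}))  # search body
    return "\n".join(lines) + "\n"


# A single HTTP request that expands into 50 searches on the cluster;
# sustained batches like this drive up heap usage on the receiving nodes.
queries = [("customer_index", {"match": {"title": f"term {i}"}}) for i in range(50)]
response = requests.post(
    ES_URL,
    data=build_msearch_body(queries),
    headers={"Content-Type": "application/x-ndjson"},
    timeout=30,
)
print(response.status_code)
```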

IMPACT

During the events, requests to this FIND cluster may have experienced network timeouts (5xx errors) or slow response times when trying to connect.

CORRECTIVE MEASURES

Short-term mitigation:

  • Disabling several indices that were contributing a significant amount of traffic, to decrease the load on the cluster.
  • Optimizing aggregated queries to reduce execution time and pressure on the cluster.
  • Enabling rate limiting for search requests (see the sketch after this list).
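
As referenced in the last item above, a common way to implement such rate limiting is a token bucket per client. The sketch below is a minimal, assumption-level example of the technique; the actual limits and implementation in the FIND service are not described in this report.

```python
# Minimal token-bucket rate limiter sketch; limits and names are illustrative only.
import threading
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec       # tokens added per second
        self.capacity = capacity       # maximum burst size
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        """Return True if a search request may proceed, False if it should be rejected."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


# Example: allow roughly 20 search requests per second per client, with bursts up to 40.
limiter = TokenBucket(rate_per_sec=20, capacity=40)
if not limiter.allow():
    print("429 Too Many Requests")  # reject or defer the search request
```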

Long-term mitigation:

  • This issue is linked to periods of high load on the cluster, and our recommendation is to move to our new FIND Service platform, which runs the latest cloud version with an improved architecture. Contact support to request migration.
  • A queuing system for indexing requests is currently being built, which will decrease indexing speed when cluster issues start to occur. This will reduce the impact on customers during cluster issues (see the sketch below).
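
The sketch below illustrates the queuing idea from the last item: indexing requests are buffered and drained at a rate that shrinks when cluster health degrades. The health endpoint, thresholds, and batch sizes here are assumptions for illustration, not the design of the system being built.

```python
# Hedged sketch of health-aware indexing backpressure; names and thresholds are illustrative.
import queue
import time

import requests

INDEX_QUEUE: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)  # bounded buffer for index requests
HEALTH_URL = "https://example-find-cluster.local:9200/_cluster/health"  # hypothetical endpoint


def drain_interval() -> float:
    """Pause between indexing batches; longer when the cluster reports degraded health."""
    try:
        status = requests.get(HEALTH_URL, timeout=5).json().get("status", "red")
    except requests.RequestException:
        status = "red"
    return {"green": 0.05, "yellow": 0.5}.get(status, 2.0)  # back off on yellow/red


def indexing_worker(send_batch):
    """Continuously drain queued index requests at a health-dependent rate."""
    while True:
        batch = [INDEX_QUEUE.get()]                     # block until work arrives
        while not INDEX_QUEUE.empty() and len(batch) < 100:
            batch.append(INDEX_QUEUE.get())
        send_batch(batch)                               # forward the batch to the cluster
        time.sleep(drain_interval())                    # slow down indexing under cluster pressure
```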

FINAL WORDS

We sincerely apologize for the impact to affected customers. Customer experience is a high priority for us, and we have a strong commitment to delivering high service availability. We will do everything we can to learn from this event and avoid future recurrence.

Posted Oct 09, 2020 - 09:46 UTC

Resolved
This incident has been resolved.

We will continue our investigation to establish the root cause. An RCA will be published as soon as it becomes available.
Posted Jun 26, 2020 - 14:50 UTC
Monitoring
All alerts have been cleared and we are closely monitoring the service.
Posted Jun 26, 2020 - 11:36 UTC
Update
We are continuing to closely monitor the recovery of the service and will provide more updates as soon as they become available.
Posted Jun 26, 2020 - 10:29 UTC
Identified
We have identified the issue and applied a corrective action. We are monitoring the results.
Posted Jun 26, 2020 - 09:05 UTC
Update
We have received new incoming alerts and we are working to identify and resolve this issue as soon as possible.

A subset of clients will continue to experience high latency or 5xx-errors.

We apologize for the inconvenience and will share an update once we have more information.
Posted Jun 26, 2020 - 08:19 UTC
Investigating
The issue has started to reoccur. We are continuing to investigate.
Posted Jun 26, 2020 - 08:17 UTC
Monitoring
We have implemented a fix and the service is up. We are currently monitoring the health of the FIND Service in the EU region.
Posted Jun 26, 2020 - 07:34 UTC
Update
The issue started to reoccur while the service was recovering. We are working diligently to identify the root cause and implement a fix to bring the service up.
Posted Jun 26, 2020 - 06:21 UTC
Identified
We are currently investigating an event that is impacting the availability of the FIND service in the EU region.
An initial mitigation step has been performed and the service is recovering.

A subset of clients will be experiencing high latency or 5xx-errors.
Posted Jun 26, 2020 - 04:33 UTC