Find Incident - EU Region (EMEA07)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Wednesday, 1 July 2020, an event impacted the functionality of the service in the EU Digital Experience Cloud region. The following report describes additional details about the event.

DETAILS

Between 2020-07-01 09:07 UTC and 2020-07-01 10:12 UTC, FIND cluster EMEA07 experienced an outage.

Investigation found that performance on the FIND cluster was impacted by an extreme increase in multi-request queries, which raised traffic demand on the cluster. The same day, a final fix was implemented to limit and optimize the load from the identified source toward the cluster. The service was fully restored at 10:12 UTC.

TIMELINE

July 1st, 2020

09:07 UTC – First alert is triggered by global monitoring system.

09:07 UTC – Alert acknowledged and troubleshooting initiated.

09:30 UTC – Mitigation steps initiated and the service started recovering. 

10:03 UTC – Root cause identified. Engineering started working on a fix. 

10:11 UTC – STATUSPAGE updated.

10:12 UTC – Fix was successfully implemented and service was fully recovered. 

ANALYSIS

The cause of this incident was a sudden peak in Java heap memory consumption. Analysis identified that the cluster was overwhelmed by a constant stream of large multi-request operations (bulk and multi-search requests), which allow multiple tasks to be performed in a single request. The event caused over-allocation of heap on the nodes and failed garbage collection runs, which in turn caused queries to fail on the nodes with high heap usage. When such a node is restarted, the Elasticsearch primary shard being indexed to moves to another node. The scenario repeats itself until the cluster's internal shards have been rebalanced and the service is fully operational.
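For illustration, the sketch below shows what a multi-search request looks like against an Elasticsearch-style endpoint: several searches are packed into a single newline-delimited JSON body and executed in one call, which is why a steady stream of large requests can put disproportionate pressure on heap memory. The endpoint URL and index names are hypothetical and not taken from the affected cluster.

    import json
    import requests

    # Hypothetical endpoint and index names, for illustration only.
    ES_URL = "http://localhost:9200"

    # An _msearch body is newline-delimited JSON: a header line followed by a
    # query line for each sub-search, all executed by the cluster in one request.
    searches = [
        ({"index": "products"}, {"query": {"match": {"name": "laptop"}}}),
        ({"index": "articles"}, {"query": {"match": {"title": "review"}}}),
    ]

    body = "".join(
        json.dumps(header) + "\n" + json.dumps(query) + "\n"
        for header, query in searches
    )

    response = requests.post(
        f"{ES_URL}/_msearch",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
    )

    # Each sub-search returns its own entry in the "responses" array.
    for result in response.json()["responses"]:
        print(result.get("hits", {}).get("total"))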

IMPACT

During the event, requests to this FIND cluster may have experienced network timeouts (5xx errors) or slow response times when trying to connect.

CORRECTIVE MEASURES

Short-term mitigation

  • Enabling rate limiting on bulk requests to recover service as soon as possible (a sketch of this approach follows this list).
  • Disabling several indices contributing a significant amount of traffic, to decrease the load on the cluster.
  • Optimizing aggregated queries to reduce execution time and pressure on the cluster.
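As a rough illustration of the rate-limiting step above, the sketch below gates incoming bulk requests with a token bucket before they reach the cluster. The limit values and the handle_bulk_request helper are hypothetical, not the production configuration.

    import time


    class TokenBucket:
        """Minimal token-bucket rate limiter for gating bulk requests."""

        def __init__(self, rate_per_sec: float, capacity: int):
            self.rate = rate_per_sec      # tokens added per second
            self.capacity = capacity      # maximum burst size
            self.tokens = float(capacity)
            self.last_refill = time.monotonic()

        def allow(self) -> bool:
            """Return True if a bulk request may proceed, False if it should be rejected."""
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at the bucket capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


    # Hypothetical limit: at most 50 bulk requests per second, bursts of up to 100.
    bulk_limiter = TokenBucket(rate_per_sec=50, capacity=100)

    def handle_bulk_request(payload: bytes) -> int:
        """Return an HTTP status code; 429 tells the client to back off."""
        if not bulk_limiter.allow():
            return 429  # Too Many Requests
        # forward_to_cluster(payload)  # hypothetical downstream call
        return 200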

Long-term mitigation

  • Implementing queuing support for bulk/index/delete requests to improve control over bulk operations (a sketch follows below).
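A minimal sketch of what such queuing could look like, assuming a single in-process worker draining a bounded queue; the queue size and the commented-out send_to_cluster call are hypothetical placeholders.

    import queue
    import threading

    # Hypothetical bounded queue: enqueueing fails when the backlog is full,
    # instead of letting write traffic pile up on the cluster nodes.
    write_queue: "queue.Queue[dict]" = queue.Queue(maxsize=1000)


    def enqueue_write(operation: dict) -> bool:
        """Accept a bulk/index/delete operation if there is room in the queue."""
        try:
            write_queue.put(operation, block=False)
            return True
        except queue.Full:
            return False  # caller should retry later or return 429 to the client


    def worker() -> None:
        """Drain the queue at a controlled pace and forward operations to the cluster."""
        while True:
            operation = write_queue.get()
            # send_to_cluster(operation)  # hypothetical call to the search cluster
            write_queue.task_done()


    threading.Thread(target=worker, daemon=True).start()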

FINAL WORDS

We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties, and will do everything we can to learn from the event and avoid future recurrence.

Posted Oct 19, 2020 - 12:32 UTC

Resolved
This incident has been resolved.
We will continue our investigation to establish the root cause. An RCA will be published as soon as it becomes available.
Posted Jul 02, 2020 - 06:21 UTC
Monitoring
We have implemented a fix and the service is currently up. We are actively monitoring the health of the FIND Service in the EU Region.
Posted Jul 01, 2020 - 13:53 UTC
Identified
We are currently investigating an event that is impacting the availability of the FIND service in the EU region (EMEA07).
An initial mitigation step has been performed and the service is recovering.

A subset of clients may be experiencing high latency or 5xx errors.
Posted Jul 01, 2020 - 10:11 UTC