FIND Incident - EU Region (AZNEUPROD01)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Search & Navigation (a.k.a. Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Friday, 5 June 2020, we experienced an event that impacted the functionality of the service in the EU Digital Experience Cloud region. Details of the incident are described below.

DETAILS

Starting at 08:15 UTC on 5 June 2020, the Engineering team received an alert for FIND cluster AZNEUPROD01. Troubleshooting began immediately, and the investigation found that the performance of the Find cluster was degraded by a significant increase in delete and index requests. Once the source of the surge was identified, a resolution was implemented that mitigated the impact on the cluster. The service was restored at 18:00 UTC on 5 June 2020.

TIMELINE

June 5th, 2020

08:15 UTC – First alert is triggered by global monitoring system.

08:22 UTC – Alert acknowledged and troubleshooting started. 

11:54 UTC – Root cause identified and mitigation actions were initiated. 

12:01 UTC – Status page updated.

13:32 UTC – Fix was implemented and service started recovering. 

17:11 UTC – The issue reoccurred.

17:24 UTC – Enabled throttling.

18:00 UTC – Global monitoring system is reporting full functionality restored. 

ANALYSIS

The cause of this issue was a sudden and significant increase in incoming delete and index requests from a single tenant of the shared resources. This resulted in high Java heap memory consumption and failing garbage collection on the affected nodes. When this happens, the cluster master evicts those nodes from the cluster and marks them as failed. The default behavior is then to redistribute the data over the remaining nodes, which moved the memory pressure to another node, and the same scenario replayed itself across the rest of the cluster.
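The cascading failure described above can be illustrated with a simplified sketch. All numbers and node names below are hypothetical and chosen only to show the mechanism, not the actual cluster configuration:

```python
# Simplified illustration of the cascade: when a node's heap load exceeds
# its limit it is evicted, its data is redistributed evenly over the
# survivors, and the extra load can push the next node over its limit.

def simulate_cascade(node_loads, heap_limit):
    """Evict any node whose load exceeds heap_limit, redistribute its
    data over the remaining nodes, and return the eviction order."""
    evicted = []
    nodes = dict(node_loads)
    while True:
        over = [n for n, load in nodes.items() if load > heap_limit]
        if not over:
            break
        victim = over[0]
        share = nodes.pop(victim)   # node marked as failed by the master
        evicted.append(victim)
        if not nodes:
            break
        extra = share / len(nodes)  # default: spread data over survivors
        for n in nodes:
            nodes[n] += extra
    return evicted

# Three nodes already near a (hypothetical) 30 GB heap limit; a surge
# pushes node-1 over, and redistribution then topples the others in turn.
order = simulate_cascade({"node-1": 32, "node-2": 28, "node-3": 27},
                         heap_limit=30)
print(order)  # → ['node-1', 'node-2', 'node-3']
```

The sketch shows why restarting a single unhealthy node is not enough on its own: as long as the surge continues, redistribution keeps moving the overload to whichever nodes remain.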

IMPACT

During the event, requests to this FIND cluster may have experienced network timeouts (5xx errors) or slow response times when connecting.

CORRECTIVE MEASURES

Short-term mitigation

  • Restart of unhealthy nodes to recover service as soon as possible.
  • A fix was implemented to limit and optimize the load from the identified source toward the cluster.
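The load-limiting fix and the throttling enabled at 17:24 UTC can be illustrated with a minimal token-bucket rate limiter. The rates below are hypothetical assumptions for illustration; the service's actual limits and implementation are not public:

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter of the kind used to throttle a noisy
    tenant's index/delete traffic. rate_per_sec and capacity are
    hypothetical values, not the service's real configuration."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller rejects or defers the request

# A burst of 50 requests against a bucket that holds 10: only roughly
# the first 10 pass immediately; the rest are throttled.
bucket = TokenBucket(rate_per_sec=100, capacity=10)
accepted = sum(bucket.allow() for _ in range(50))
```

Throttling like this caps how much load a single tenant can place on the shared cluster, which is what stopped the heap-exhaustion cascade from restarting.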

Long-term mitigation

  • Implement queuing support for bulk/index/delete requests for better control over bulk operations.
  • This issue is linked to periods of high load on the cluster, and our recommendation is to migrate to our new FIND service platform, which runs the latest cloud version with an improved architecture. Migration can be requested by contacting support.
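The queuing support mentioned above can be sketched as a bounded queue drained by a worker at a controlled pace, so a surge from one tenant is buffered (or rejected with backpressure) instead of hitting the cluster directly. The names and queue size are illustrative assumptions, not the service's actual design:

```python
import queue
import threading

MAX_PENDING = 1000                      # backpressure threshold (hypothetical)
bulk_queue = queue.Queue(maxsize=MAX_PENDING)
processed = []

def enqueue(op):
    """Accept a bulk/index/delete operation if there is room;
    otherwise signal backpressure so the caller retries later."""
    try:
        bulk_queue.put_nowait(op)
        return True
    except queue.Full:
        return False

def worker():
    # Drain the queue at the cluster's pace; None is a shutdown signal.
    while True:
        op = bulk_queue.get()
        if op is None:
            break
        processed.append(op)            # here: apply the operation to the cluster
        bulk_queue.task_done()

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    enqueue({"action": "index", "doc_id": i})
bulk_queue.put(None)
t.join()
# processed now holds the 5 operations, applied in arrival order.
```

The design choice is that overload becomes an explicit, recoverable condition (a full queue) rather than heap exhaustion inside the cluster nodes.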

FINAL WORDS

We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our Find service. We will continue to prioritize our efforts to overcome these recent difficulties and will do everything we can to learn from this event and avoid a recurrence in the future.

Posted Aug 19, 2020 - 11:09 UTC

Resolved
This incident has been resolved.

We will continue our investigation to establish the root cause. An RCA will be published as soon as it becomes available.
Posted Jun 08, 2020 - 03:26 UTC
Monitoring
We have implemented a fix and are currently monitoring the recovery and health of the FIND service in the EU region.
Posted Jun 05, 2020 - 18:10 UTC
Investigating
After identifying the issue and applying a fix, we noticed while actively monitoring the situation that the issue was starting to recur. This is currently impacting the availability of the FIND service in the EU region, and we are working diligently to identify the root cause and apply a fix.
Posted Jun 05, 2020 - 17:19 UTC
Monitoring
A fix has been implemented and we are closely monitoring the results.
Posted Jun 05, 2020 - 14:38 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 05, 2020 - 13:11 UTC
Update
We are continuing to investigate this issue.

We apologize for the inconvenience and will share an update once we have more information.
Posted Jun 05, 2020 - 12:51 UTC
Investigating
We are currently investigating an event that is impacting the availability of the FIND service in the EU region.

A subset of clients may experience high latency or 5xx errors.
Posted Jun 05, 2020 - 12:01 UTC