FIND Incident - EU Region (EU-PROD25 - V1)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers enhanced relevance and powerful search functionality to websites. On Wednesday, June 9, and Thursday, June 10, 2021, we experienced events that impacted the functionality of the service in the EU Digital Experience Cloud region. Details of the incident are described below.

DETAILS

Between 12:34 UTC and 20:36 UTC on June 9, 2021, the Search & Navigation cluster EUFINDPROD25 experienced intermittent service degradation.

The issue was triggered by a peak in requests that caused the cluster to stop responding, and the cluster could not recover on its own. An individual node restart was performed, but this did not bring the cluster back to the required state, so the decision was made to perform a full cluster restart. Restarting an entire cluster can be time consuming due to the number of nodes, and during this incident the service took additional time to stabilize because shard allocation was slowed by remnants of a failed index deletion. The service was fully operational as of June 10, 2021, 07:35 UTC.
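
Some context on the recovery steps above: Search & Navigation runs on an Elasticsearch-based cluster, and after a full cluster restart the standard practice is to poll the cluster health API until all shards are allocated and the status returns to green. Below is a minimal sketch in Python of such a health poll, assuming an Elasticsearch-compatible HTTP API; the endpoint URL is hypothetical, as the real cluster address is internal.

    import time
    import requests

    # Hypothetical endpoint; the real EUFINDPROD25 address is internal.
    CLUSTER = "https://eufindprod25.example.internal:9200"

    def wait_for_green(timeout_s=14400, poll_s=30):
        """Poll cluster health until status is 'green' or the timeout expires.

        After a full restart the status typically moves red -> yellow -> green
        as shards are allocated; 'unassigned_shards' should trend toward zero.
        """
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            health = requests.get(f"{CLUSTER}/_cluster/health", timeout=10).json()
            print(f"status={health['status']} "
                  f"unassigned={health['unassigned_shards']} "
                  f"initializing={health['initializing_shards']}")
            if health["status"] == "green":
                return True
            time.sleep(poll_s)
        return False

In this incident, the allocation phase ran from 16:14 UTC to 20:36 UTC, which is why such a poll needs a generous timeout.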

TIMELINE

June 9, 2021

12:34 UTC – First alert and automation restarts were triggered. 

13:57 UTC – Critical alert triggered, acknowledged and investigation initiated. 

14:50 UTC – Restarted an unhealthy node and allocated shards.

14:55 UTC – Full cluster restart performed. 

16:14 UTC – Cluster started allocating shards. 

16:36 UTC – Status page updated.

20:36 UTC – All healthy shards were allocated and service functionality was recovered.

June 10, 2021

04:27 UTC – Root cause identified and long-term mitigation action started.

07:35 UTC – The fix was implemented and the incident was closed.

ANALYSIS

The retrospective identified that, during the full cluster restart, an unhealthy node rejoined the pool and delayed the shard allocation process. The node still held leftover shards from a deletion operation on a specific index that had never completed. Once the deletion process was completed and all leftover shards were removed, the service was fully restored.
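
To illustrate the fix described above: on an Elasticsearch-based cluster, shards stuck in this state can be diagnosed with the allocation-explain API, and the leftover shards are removed by re-issuing the index deletion. A sketch under the same assumptions as the earlier example (the endpoint and index name are hypothetical):

    import requests

    CLUSTER = "https://eufindprod25.example.internal:9200"  # hypothetical

    # Ask the cluster why the first unassigned shard is not being allocated;
    # leftover shards from an incomplete deletion surface here after a restart.
    explain = requests.get(f"{CLUSTER}/_cluster/allocation/explain",
                           timeout=10).json()
    print(explain.get("index"), explain.get("unassigned_info", {}).get("reason"))

    # Re-issue the deletion for the faulty index (name is hypothetical) so the
    # remaining shards are removed and allocation can complete.
    resp = requests.delete(f"{CLUSTER}/faulty-index-name", timeout=30)
    resp.raise_for_status()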

IMPACT

During the events, a subset of requests to the Search & Navigation cluster may have timed out, returned 5xx errors, or experienced slow response times when connecting.
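
Client applications can reduce the effect of this kind of transient degradation by retrying 5xx responses and timeouts with exponential backoff. A minimal sketch follows; the query URL and parameters are illustrative, not the actual Find client API:

    import time
    import requests

    def search_with_retry(url, params, retries=4):
        """Retry transient 5xx responses and timeouts with exponential backoff."""
        for attempt in range(retries):
            try:
                resp = requests.get(url, params=params, timeout=5)
                if resp.status_code < 500:
                    return resp  # success, or a client error not worth retrying
            except requests.Timeout:
                pass  # treat a timeout like any other transient failure
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s
        raise RuntimeError("service unavailable after retries")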

CORRECTIVE MEASURES

Short-term mitigation

  • Rolling restarts of unhealthy nodes and full cluster restart.
  • A fix was applied to complete the interrupted deletion and remove the faulty index.

Long-term mitigation

  • This issue is linked to periods of high load on the cluster, and our recommendation is to migrate to the new Search & Navigation service platform, which runs the latest cloud-based version with an improved architecture. Migration can be requested by contacting support.

FINAL WORDS

We apologize for the impact to affected customers. We are strongly committed to delivering high availability for our Search & Navigation service. We will continue to prioritize our efforts to overcome these recent difficulties, and we will do everything we can to learn from this event and avoid a recurrence.

Posted Jul 09, 2021 - 08:01 UTC

Resolved
This incident has been resolved. We will continue our investigation to establish the cause.
The Incident Postmortem will be published as soon as it becomes available.
Posted Jun 10, 2021 - 08:36 UTC
Update
The service is fully operational.
We are continuing to monitor the service's availability and establish the root cause.
Posted Jun 10, 2021 - 07:55 UTC
Monitoring
The service is fully operational.
We are continuing to monitor the service's availability and establish the root cause.
Posted Jun 10, 2021 - 07:52 UTC
Identified
The cause has been identified and we are diligently working on the fix. We will provide more updates as soon as they become available.
Posted Jun 10, 2021 - 06:46 UTC
Update
We are continuing to investigate this issue.
We apologize for the inconvenience and will share an update as soon as we have more information.
Posted Jun 10, 2021 - 03:15 UTC
Update
We are continuing to investigate this issue. We will continue to provide updates as information becomes available. Thank you for your patience.
Posted Jun 09, 2021 - 20:22 UTC
Investigating
We are currently investigating an event that is impacting the availability of the FIND service in the EU region (V1).

A subset of clients may be experiencing high latency or 5xx errors.
Posted Jun 09, 2021 - 16:36 UTC