Search & Navigation Incident - APAC01
Incident Report for Optimizely Service
Postmortem

SUMMARY

Optimizely Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers both enhanced relevance and powerful search functionality to websites. On Friday, November 25, 2022, we experienced an event that impacted service functionality in the AP Digital Experience Cloud region. The following report provides additional details about the event.

DETAILS

On November 25, 2022, between 08:14 and 08:45 UTC, the Search & Navigation cluster APAC01 experienced intermittent service degradation.

The issue was triggered by a sudden increase in bulk requests from a tenant on shared resources, which caused sustained CPU spikes on multiple data nodes. The affected nodes were gracefully restarted to clear the memory hooks, and service was fully operational by 08:45 UTC.

TIMELINE

November 25, 2022

08:14 UTC – First alert triggered and troubleshooting initiated. 

08:19 UTC – Unhealthy nodes were identified and restarted by automation. 

08:26 UTC – Status page updated.

08:32 UTC – Service started recovering and was able to handle the requests. 

08:45 UTC – Service was fully operational. 

ANALYSIS

This incident was caused by a large number of bulk requests from a tenant on shared resources, which resulted in a sudden CPU spike on multiple data nodes.

IMPACT

During this series of events, a subset of requests to the Search & Navigation cluster may have experienced network timeouts (5xx-errors) or slower-than-normal response times while attempting to connect.

CORRECTIVE MEASURES

Short-term mitigation

  • Restart of unhealthy node(s).

Long-term mitigation

  • To avoid future customer impact, customers whose indices exceed, or nearly exceed, the recommended sizes will be migrated to an alternate cluster.
  • Additional clusters have been provisioned to minimize customer load.

FINAL WORDS

We are sorry! We recognize the negative impact on affected customers and regret the disruption that you sustained. We have a strong commitment to our customers and to delivering high-availability services, including Search & Navigation. We will continue to prioritize our efforts to overcome these recent difficulties and diligently apply lessons learned to avoid future events of a similar nature, ensuring that we continue to develop Raving Fans!

Posted Dec 23, 2022 - 03:03 UTC

Resolved
This incident has been resolved.
Posted Nov 25, 2022 - 09:20 UTC
Monitoring
A fix has been implemented and the service is recovering. We're closely monitoring the results.
Posted Nov 25, 2022 - 08:36 UTC
Investigating
We are currently investigating an event that is impacting the availability of the Search & Navigation (FIND) service in the AP region, APAC01.

A subset of clients will be experiencing high latency or 5xx-errors.

We will post additional updates hourly or when we have information available.

Thank you for your patience.
Posted Nov 25, 2022 - 08:26 UTC