Search & Navigation Incident - EMEA Region (EMEA17)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Optimizely Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers both enhanced relevance and powerful search functionality to websites. On Thursday, May 19, 2022, we experienced an event that impacted service functionality in the EU Digital Experience Cloud region. The following report provides additional details about the event.

DETAILS

Between 12:26 UTC and 12:59 UTC on May 19, 2022, the Search & Navigation cluster EMEA17 experienced intermittent service degradation.

Immediately upon receiving the alert, the proxy node was restarted as an initial mitigation effort. Further investigation identified an extremely high number of client connections, which left the client nodes unable to handle the incoming requests. The affected nodes were then gracefully restarted, and service was fully operational by 12:59 UTC.

TIMELINE

May 19, 2022

12:26 UTC – First alert triggered, and troubleshooting initiated. 

12:32 UTC – Proxy node restarted.

12:47 UTC – STATUSPAGE updated.

12:56 UTC – A high number of client connections was identified, and the client nodes were restarted.

12:58 UTC – Service started recovering and was able to handle the requests. 

12:59 UTC – Service was fully operational. 

ANALYSIS

The issue was caused by a large volume of requests reaching the client nodes within a short period, each requiring the nodes to return very large result sets. This caused memory usage to grow rapidly and put extreme pressure on the garbage collection process, which in turn paused all requests to these nodes. As a result, a very large number of connections were created and then failed once the request queue became full.
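The queueing behavior described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model of the actual cluster: the function name `simulate`, the tick-based timing, and all numbers (arrival rate, service rate, queue capacity, pause window) are hypothetical and chosen only to show how a pause in request handling fills a bounded queue and causes new requests to fail.

```python
from collections import deque


def simulate(arrivals_per_tick, service_per_tick, queue_capacity,
             pause_ticks, total_ticks):
    """Simulate a bounded request queue whose consumer stalls for a while.

    pause_ticks is a collection of tick numbers during which no request is
    served. It stands in for a long garbage collection pause (hypothetical
    values, not measurements from the incident).
    Returns a tuple (served, rejected, max_depth).
    """
    queue = deque()
    served = rejected = max_depth = 0
    for tick in range(total_ticks):
        # New requests arrive; they are rejected once the queue is full.
        for _ in range(arrivals_per_tick):
            if len(queue) >= queue_capacity:
                rejected += 1
            else:
                queue.append(tick)
        max_depth = max(max_depth, len(queue))
        # Requests are served only while the node is not paused.
        if tick not in pause_ticks:
            for _ in range(min(service_per_tick, len(queue))):
                queue.popleft()
                served += 1
    return served, rejected, max_depth


if __name__ == "__main__":
    # Healthy node: every request is served and the queue stays shallow.
    print(simulate(5, 5, 20, range(0), 10))     # (50, 0, 5)
    # Node paused during ticks 2..7: the queue fills and requests fail.
    print(simulate(5, 5, 20, range(2, 8), 10))  # (20, 15, 20)
```

In the second run the arrival rate is unchanged, yet 15 of 50 requests are rejected because the queue reaches its capacity while the node is paused. This mirrors the failure pattern described in the analysis, where connections accumulated and then failed once the queue was full.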

IMPACT

During this series of events, a subset of requests to the EMEA17 Search & Navigation cluster may have experienced network timeouts (5xx errors) or slower-than-normal response times while attempting to connect.

CORRECTIVE MEASURES

Short-term mitigation

  • Restarting of unhealthy node(s). 
  • Correction of the shard allocation setting.

Long-term mitigation

  • To avoid future customer impact, new re-indexing functionality that regulates shard settings is under development.

FINAL WORDS

We are sorry! We recognize the negative impact on affected customers and regret the disruption that you sustained. We have a strong commitment to our customers and to delivering high-availability services, including Search & Navigation. We will continue to prioritize our efforts to overcome these recent difficulties and diligently apply lessons learned to avoid future events of a similar nature, ensuring that we continue to develop Raving Fans!

Posted Jun 15, 2022 - 09:09 UTC

Resolved
This incident has been resolved. We will continue our investigation to establish the cause.

A Post Mortem will be published as soon as it becomes available.
Posted May 19, 2022 - 15:54 UTC
Monitoring
A fix has been implemented and we are closely monitoring the results.
Posted May 19, 2022 - 13:26 UTC
Investigating
We are currently investigating an event that is impacting the availability of the Search & Navigation (FIND) service in the EMEA region, [EMEA17].

A subset of clients will be experiencing high latency or 5xx-errors.

We will post additional updates hourly or when we have information available.

Thank you for your patience.
Posted May 19, 2022 - 12:47 UTC