FIND having intermittent issues
Incident Report for Optimizely Service
Postmortem

Summary

Episerver Find is a cloud-based enterprise search solution that delivers both enhanced relevance and powerful search functionality to websites. On Thursday, 13 December 2018, we had an event impacting the functionality of the service in the EU Digital Experience Cloud region. The following report describes the event in detail.

Total incident duration: 60 minutes.

Details

On Thursday 2018-12-13 at 03:51 AM CET, the Reliability Engineering team received an alert regarding one of the EU Digital Experience Cloud regions. Troubleshooting started immediately, and the investigation found that nodes in a specific cluster had problems with JVM garbage collection, which caused slow response times.

Graceful restarts were performed on the identified nodes to clear memory.

We received additional intermittent alerts, which were again resolved by node restarts. About 8 hours later, at 11:48 AM CET, the same performance issue reappeared and we identified that multiple nodes were experiencing JVM GC errors. In the weeks prior to this incident, several similar events had been mitigated by restarting individual nodes, and this was considered an approved workaround. This time, however, it was unsuccessful, so the decision was made to perform a full cluster restart to restore the service. Restarting an entire cluster is time consuming due to the number of nodes, and the service was fully restored at 13:21 CET.
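
A graceful, one-node-at-a-time restart is normally gated on cluster health, so that shards have fully recovered before the next node is taken down. The following is a minimal sketch of such a gate using the standard Elasticsearch cluster health API; the endpoint name and timeout values are illustrative assumptions, not our actual runbook.

    import requests

    # Hypothetical cluster endpoint; the real proxy and cluster addresses are internal.
    ES = "http://es-cluster.internal:9200"

    def wait_for_green(timeout_s: int = 300) -> bool:
        """Block until the cluster reports 'green' health, or give up after timeout_s."""
        resp = requests.get(
            f"{ES}/_cluster/health",
            params={"wait_for_status": "green", "timeout": f"{timeout_s}s"},
            timeout=timeout_s + 10,
        )
        # 'timed_out' is true if the requested status was not reached within the budget.
        return not resp.json().get("timed_out", True)

    # Before restarting the next node, make sure the cluster has recovered.
    if wait_for_green():
        print("cluster is green - safe to restart the next node")
    else:
        print("cluster did not reach green - pause the rolling restart")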

Timeline

December 13th, 2018

03:51 AM CET: First alert is triggered by monitoring systems.

03:56 AM CET: Alert acknowledged and troubleshooting initiated. A second alert is triggered and single nodes are restarted.

04:01 AM CET: Third alert is triggered.

04:04 AM CET: Services restored after graceful node restarts.

11:48 AM CET: Approximately 8 hours later, Reliability Engineering receives another alert for the same cluster and a problem ticket is created. The decision is quickly made to initiate a full cluster restart to resolve the incident as quickly as possible.

13:21 CET: Global monitoring system reports full functionality restored.

15:10 CET: Preliminary root cause identified; work begins on a long-term resolution.

December 14th, 2018

Expansion of proxy nodes completed.

December 18th, 2018

Upgrade of all data nodes completed without any issues, and the Java heap size in the cluster was increased to give Elasticsearch more memory to work with.
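
As a hedged illustration, a heap change like this can be verified per node through the standard Elasticsearch _nodes/stats/jvm API, which reports the maximum heap each JVM was started with. The endpoint below is a placeholder, not the real cluster address.

    import requests

    ES = "http://es-cluster.internal:9200"  # placeholder endpoint

    # _nodes/stats/jvm reports, per node, the maximum heap the JVM was started with,
    # which makes it easy to confirm that every data node picked up the new setting.
    stats = requests.get(f"{ES}/_nodes/stats/jvm", timeout=10).json()

    for node_id, node in stats["nodes"].items():
        heap_max_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024 ** 3
        print(f"{node.get('name', node_id)}: max heap {heap_max_gb:.1f} GB")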

December 18th, 2018 to January 18th, 2019

To avoid further eroding the reliability of our status updates, we remained in monitoring status until we had confirmed that our services had clearly settled back into normal performance levels for an extended period (approximately one month).

Root cause

The root cause of this incident was exhaustion of the proxy queues: search queries were evicted from the queue because of long-running queries in the back end. The long-running queries were in turn caused by failed garbage collections that drove JVM heap usage to 100% on some of the cluster nodes.
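
The proxy queues referred to here belong to the Find service's own proxy layer, but the same failure pattern is visible inside Elasticsearch itself, where each node's search thread pool exposes its queue depth and the number of rejected requests. The sketch below reads those counters from the standard _nodes/stats/thread_pool API purely as an illustration; the endpoint is a placeholder.

    import requests

    ES = "http://es-cluster.internal:9200"  # placeholder endpoint

    # Per node, the search thread pool reports how many requests are queued and how
    # many have been rejected because the queue was already full.
    stats = requests.get(f"{ES}/_nodes/stats/thread_pool", timeout=10).json()

    for node_id, node in stats["nodes"].items():
        search = node["thread_pool"]["search"]
        print(
            f"{node.get('name', node_id)}: "
            f"queued={search['queue']} rejected={search['rejected']}"
        )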

Impact on other services

During the events, a subset of customers using this specific FIND cluster would have seen network timeouts or slow response times when trying to connect to the service.

Corrective and Preventative Measures

A number of issues were identified during this analysis, and fixes have been released to production:

  • Expansion of the proxy layer with more nodes to scale out traffic load.

  • Scale-up of data node size so the JVM memory heap could be increased to 26 GB.

  • Improved alerting and monitoring of now-known limits for memory usage and garbage collection frequency and timing, with the aim of identifying potential problems and taking preventive action more quickly. A sketch of such a check follows this list.
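
The following is a minimal sketch of that kind of check, assuming a simple poller against the standard Elasticsearch _nodes/stats/jvm API: it watches heap usage and the growth of old-generation GC time between samples. The endpoint, polling interval, and thresholds are illustrative assumptions rather than the limits we actually alert on.

    import time
    import requests

    ES = "http://es-cluster.internal:9200"   # placeholder endpoint
    POLL_SECONDS = 60                        # illustrative polling interval
    HEAP_PCT_LIMIT = 85                      # illustrative thresholds, not our real limits
    OLD_GC_MS_PER_POLL_LIMIT = 5000

    def sample():
        """Return {node_name: (heap_used_percent, old_gc_time_ms)} for every node."""
        stats = requests.get(f"{ES}/_nodes/stats/jvm", timeout=10).json()
        result = {}
        for node_id, node in stats["nodes"].items():
            jvm = node["jvm"]
            result[node.get("name", node_id)] = (
                jvm["mem"]["heap_used_percent"],
                jvm["gc"]["collectors"]["old"]["collection_time_in_millis"],
            )
        return result

    previous = sample()
    while True:
        time.sleep(POLL_SECONDS)
        current = sample()
        for name, (heap_pct, gc_ms) in current.items():
            gc_delta = gc_ms - previous.get(name, (0, gc_ms))[1]
            if heap_pct > HEAP_PCT_LIMIT or gc_delta > OLD_GC_MS_PER_POLL_LIMIT:
                # In a real setup this would page the on-call engineer.
                print(f"ALERT {name}: heap {heap_pct}%, old GC +{gc_delta} ms since last poll")
        previous = current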

Longer term improvements

  • We are investigating the implementation of a request timeout between the proxy and the cluster to cut off long-running requests, which can eventually exhaust the memory heap (see the sketch after this list).

  • Identify, optimize, reduce, and eliminate unnecessary client calls to the back end.

  • One critical learning from these events is that we need to be much faster about communicating what's happening. We will review how we can improve our reporting to the support teams that communicate directly with negatively affected clients, without shifting focus away from the engineers working to resolve the immediate incident.
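
As an illustration of the first item above, here is a hedged sketch of what a request timeout can look like on the client side of a search call: the query is given an Elasticsearch-level time budget (after which shards return partial results) plus a hard socket timeout, so a pathological request cannot hold resources indefinitely. The endpoint, index name, and time budgets are assumptions, not the values we will deploy.

    import requests

    ES = "http://es-cluster.internal:9200"  # placeholder endpoint
    INDEX = "customer-index"                # hypothetical index name

    query = {
        "timeout": "2s",  # ask Elasticsearch to stop the search after 2 seconds
        "query": {"match": {"title": "example"}},
    }

    try:
        # Harder limit: drop the connection if no response arrives within 5 seconds,
        # so a stuck query cannot pin a proxy worker indefinitely.
        resp = requests.post(f"{ES}/{INDEX}/_search", json=query, timeout=5)
        body = resp.json()
        if body.get("timed_out"):
            print("search returned partial results after hitting its time budget")
    except requests.exceptions.Timeout:
        print("no response within 5 seconds - fail fast instead of queueing behind a slow query")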

Final Words

We place the utmost importance on achieving and sustaining the highest level of availability for our customers, and we regret any disruption in service you have experienced. We continue to work tirelessly to prevent and mitigate service disruptions, and we will use this incident to further these efforts and help ensure you receive a reliable and positive experience.

Posted Jan 30, 2019 - 13:29 UTC

Resolved
This incident has been resolved.
Posted Jan 25, 2019 - 09:46 UTC
Monitoring
The Episerver FIND issue has been resolved. All FIND regions are operational, and we are monitoring to prevent further outages.
Posted Dec 14, 2018 - 09:38 UTC
Identified
We have identified the issue and are working on a resolution. To mitigate the immediate incident, a full restart was performed on one of the Find clusters in the EU Digital Experience Cloud regions, and the service is again functional.

We will continue to update this page as soon as we have more information.
Posted Dec 13, 2018 - 14:14 UTC
Investigating
FIND clusters are experiencing intermittent issues due to high load. Our FIND team is working on a mitigation process to manage the high load.
Posted Dec 13, 2018 - 12:15 UTC