Episerver Find is a cloud-based enterprise search solution that delivers both enhanced relevance and powerful search functionality to websites. On Thursday, 13 December 2018, an event impacted the functionality of the service in the EU Digital Experience Cloud region; the following report provides additional details about the event.
Total incident duration: 60 minutes.
On Thursday 2018-12-13 at 03:51 CET, the Reliability Engineering team received an alert regarding one of the EU Digital Experience Cloud regions. Troubleshooting started immediately, and the investigation discovered that nodes in a specific cluster had problems with JVM garbage collection, which caused slow response times.
Graceful restarts were performed on the identified nodes to clear memory.
We received additional intermittent alerts, which were once again resolved by node restarts. After about 8 hours, at 11:48 CET, we began seeing the same performance issue again and identified that multiple nodes were experiencing JVM GC errors. In the weeks prior to this incident we had seen several similar events, which had been mitigated by restarting individual nodes, and this was considered an approved workaround. This time, however, the workaround was unsuccessful, so the decision was made to perform a full cluster restart to restore the service. Restarting an entire cluster is time-consuming due to the number of nodes, and the service was fully restored at 13:21 CET.
December 13th, 2018
03:51 CET: First alert is triggered by monitoring systems.
03:56 CET: Alert acknowledged and troubleshooting initiated. A second alert is triggered and individual nodes are restarted.
04:01 CET: Third alert is triggered.
04:04 CET: Service restored after graceful node restarts.
11:48 CET: Approximately 8 hours later, Reliability Engineering receives another alert for the same cluster and a problem ticket is created. The decision is quickly made to initiate a full cluster restart to resolve the incident as quickly as possible.
13:21 CET: Global monitoring reports full functionality restored.
15:10 CET: Preliminary root cause identified; work begins on a long-term resolution.
December 14th, 2018
Expansion of proxy nodes completed.
December 18th, 2018
Upgrade of all data nodes completed without any issues, and the Java heap size in the cluster was increased to give Elasticsearch more memory to work with.
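For context, an Elasticsearch heap increase of this kind is typically configured in the cluster's jvm.options file; the sketch below assumes the 26 GB figure mentioned later in this report, and the exact file and deployment mechanism used in this environment are assumptions.

```
# jvm.options (illustrative): set minimum and maximum heap to the same
# value so the JVM allocates the full heap at startup.
-Xms26g
-Xmx26g
```

Keeping the heap below roughly 30 GB also preserves compressed object pointers, which is one reason heaps are scaled up cautiously rather than simply maximized.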
December 18th, 2018 to 18th of January 2019
To avoid prematurely declaring the incident resolved and further eroding the reliability of our status updates, we remained in monitoring status until the service had clearly settled back into normal performance levels over a longer period (approximately one month).
The root cause of this incident was exhaustion of the proxy queues: search queries were evicted from the queue due to long-running queries in the back end. The long-running queries were in turn caused by failed garbage collections that drove JVM memory heap usage to 100% on some of the cluster nodes.
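The failure mode described above can be illustrated with a minimal sketch (this is not Episerver's actual proxy code): a bounded request queue in front of a back end. While the back end drains requests quickly, everything fits; when queries stall behind a GC pause, the queue fills and new requests are rejected instead of served.

```python
import queue

# Small bound for illustration; real proxy queues are far larger.
request_queue = queue.Queue(maxsize=3)

def enqueue(request):
    """Try to enqueue a request; return False if the queue is full."""
    try:
        request_queue.put_nowait(request)
        return True
    except queue.Full:
        return False

# While the back end is healthy, requests drain and all three fit.
accepted = [enqueue(f"query-{i}") for i in range(3)]
# During a GC stall nothing drains, so the next request is rejected.
rejected = enqueue("query-3")
print(accepted, rejected)  # [True, True, True] False
```

The rejected request is what a client would observe as a timeout or error, even though the proxy itself is healthy.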
During the events, a subset of customers using this specific Find cluster would have seen network timeouts or slow response times when trying to connect to the service.
A number of issues were identified during this analysis, and fixes have been released to production:
Expansion of the proxy layer with more nodes to scale out traffic load.
Scale up of data node size so the JVM memory heap could be increased to 26GB.
Improved alerting and monitoring of now-known limits for memory usage and garbage collection frequency and duration, with the aim of identifying potential problems and taking preventive action more quickly.
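The alerting improvement above can be sketched as a simple per-node check; the threshold names and values here are illustrative assumptions, not Episerver's actual monitoring rules.

```python
# Hypothetical thresholds learned from this incident: alert well before
# heap usage reaches 100%, and on unusually long GC pauses.
HEAP_USAGE_ALERT = 0.85
GC_PAUSE_MS_ALERT = 1_000

def check_node(heap_used_bytes, heap_max_bytes, last_gc_pause_ms):
    """Return a list of alert names for a single cluster node."""
    alerts = []
    if heap_used_bytes / heap_max_bytes > HEAP_USAGE_ALERT:
        alerts.append("heap-usage")
    if last_gc_pause_ms > GC_PAUSE_MS_ALERT:
        alerts.append("gc-pause")
    return alerts

# A node at ~92% of a 26 GB heap with a 1.5 s GC pause trips both alerts.
print(check_node(24 * 2**30, 26 * 2**30, 1500))  # ['heap-usage', 'gc-pause']
```

Alerting on these leading indicators lets operators restart or drain a node before queries start queueing behind GC pauses.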
Longer term improvements
We are currently investigating implementing a request timeout between the proxy and the cluster to prevent long-running requests, which can eventually exhaust the memory heap.
Identify, optimize, reduce, and eliminate unnecessary client calls to the back end.
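The proposed proxy-to-cluster timeout can be sketched as a wrapper that bounds each back-end call (a minimal illustration, not the actual Find implementation; the timeout value is an assumption). Failing fast means a query stuck behind a GC pause cannot hold a proxy slot indefinitely.

```python
import concurrent.futures
import time

def with_timeout(fn, timeout_seconds, *args):
    """Run fn(*args), returning None if it exceeds the timeout."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout_seconds)
        except concurrent.futures.TimeoutError:
            return None  # fail fast instead of queueing behind a stall

def fast_query():
    return "hits"

def slow_query():
    time.sleep(2)  # simulates a query stuck behind a GC pause
    return "hits"

print(with_timeout(fast_query, 0.5))  # hits
print(with_timeout(slow_query, 0.5))  # None
```

The caller then surfaces an explicit, fast error to the client rather than letting requests pile up in the proxy queue.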
One critical learning from these events is that we need to be much faster at communicating what's happening. We will review how we can improve reporting to the support teams that communicate directly with negatively affected clients, without shifting focus from the engineers working to resolve the immediate incident.
We place the utmost importance and pride on achieving and sustaining the highest level of availability for our customers, and we regret any disruption in service you have experienced. We continue to work tirelessly to ensure that service disruptions are prevented or mitigated, and we will use this incident to further those efforts and help ensure you receive a reliable and positive experience.