Issues with Episerver Find EU
Incident Report for Optimizely Service
Postmortem

Summary

On Monday morning, one of our FIND clusters in the EMEA region experienced an incident. This report describes the event in more detail.

Episerver FIND is a platform service that extends the search features for Episerver, allowing you to build advanced filtering and faceted navigation based on the behaviour of website visitors. The service is used by both on-premise hosted applications and Digital Experience Cloud hosted applications.

Details

The first alert was triggered at 2016-10-17 08:48 CEST by our monitoring system. The alert was sent through our automated alert triage system to the technical team, which acts on these alerts.

The technical team started troubleshooting at 08:52 and found the problem to be isolated to a specific Elasticsearch cluster. Initial investigation showed that several nodes in the cluster were unresponsive due to problems with garbage collection. This resulted in a so-called "split brain" scenario, where nodes fall out of the cluster configuration. The decision was made to restart the whole cluster to get back to normal functionality as quickly as possible. The restart was executed at 09:05. After the restart the cluster remained in split brain, so a second restart was executed at 09:10. At 09:18 the global monitoring system reported the cluster to be functional again.

08:48 CEST - First alarm is triggered.

08:52 CEST - The technical team starts investigation of the issue.

08:58 CEST - It is found that several nodes in the cluster are unresponsive, which has resulted in a split brain scenario.

09:05 CEST - The whole cluster is restarted to get back to normal functionality as quickly as possible.

09:10 CEST - The cluster remains in split brain, so another restart is executed.

09:18 CEST - The global monitoring system reports that the cluster is functional again.
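
For readers unfamiliar with the term, the following is a minimal sketch of how a split brain can be recognised from the outside, assuming each node can be asked which node it currently considers to be the master. The function names, the node names and the input format are hypothetical and do not describe the monitoring used for FIND. The quorum helper shows the majority rule commonly used to guard against split brain (in Elasticsearch versions of this period, the `discovery.zen.minimum_master_nodes` setting); this report does not state how that setting was configured on the affected cluster.

```python
from typing import Dict, Optional


def quorum(master_eligible_nodes: int) -> int:
    """Return the smallest majority of the master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def is_split_brain(master_seen_by: Dict[str, Optional[str]]) -> bool:
    """Return True when responsive nodes disagree about the elected master.

    master_seen_by maps a node name to the master that node reports,
    or to None if the node did not answer (for example, during a long
    garbage collection pause).
    """
    masters = {m for m in master_seen_by.values() if m is not None}
    return len(masters) > 1


# Hypothetical snapshot: node-3 has elected itself, node-4 is unresponsive.
views = {
    "node-1": "node-1",
    "node-2": "node-1",
    "node-3": "node-3",
    "node-4": None,
}
split = is_split_brain(views)
```

With three master-eligible nodes, `quorum(3)` is 2, so a single isolated node cannot elect itself master. In the hypothetical snapshot above, two different masters are reported, so `split` is `True`.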

Impact on other services

During the event, applications using this specific cluster of FIND would have seen network timeouts or slow response times trying to connect to the service.

Corrective and Preventative Measures

This issue is related to the Elasticsearch component of Episerver FIND. Elasticsearch will be upgraded to a newer version during the fall, and this issue will be fixed as part of that upgrade. The issue is also linked to periods of high load on the cluster, so work is ongoing to move indices from this cluster to others to spread the load.

Final Words

We apologize for the impact to affected customers. While we are proud of the availability we have on FIND, we know how critical this service is to customers. For us, availability is the most important feature, and we will do everything we can to learn from the event and avoid a recurrence.

Posted Nov 28, 2016 - 12:25 UTC

Resolved
This incident has been resolved.
Posted Oct 17, 2016 - 10:20 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 17, 2016 - 07:44 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 17, 2016 - 07:11 UTC