On Wednesday morning we had a FIND event in the EMEA region affecting one of our clusters. This report provides additional details about that event.
Episerver FIND is a platform service that extends the search features for Episerver, allowing you to build advanced filtering and faceted navigation based on the behaviour of website visitors. The service is used by both on-premise hosted applications and Digital Experience Cloud hosted applications.
The first alert was triggered by our monitoring system at 2016-10-12 07:28 CEST. The alert was sent through our automated alert triage system to the technical team, which acts on these alerts.
The technical team started troubleshooting at 07:31 and found the issue to be isolated to a specific Elasticsearch cluster. Initial investigation showed that several nodes in the cluster were not answering requests because of problems with garbage collection. The technical team attempted a gentle restart of these nodes, but they were unresponsive. The decision was made to do a hard reset of the whole cluster to get back to normal functionality as quickly as possible. The restart was executed at 07:50 CEST. At 08:05 the global monitoring system reported the cluster to be functional again, and at 09:17 the cluster reported full functionality.
07:28 CEST - First alarm is triggered.
07:31 CEST - The technical team starts investigation of the issue.
07:42 CEST - It is found that some of the nodes in the cluster don't respond to requests due to problems with garbage collection. A gentle restart of these nodes is attempted but fails.
07:50 CEST - A hard reset of the whole cluster is performed to get back to normal functionality as quickly as possible.
08:05 CEST - The global monitoring system reports that the cluster is functional again.
09:17 CEST - The cluster reports full functionality restored.
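For readers who want the elapsed times rather than the clock times, the timeline above can be turned into offsets from the first alarm. The following is a minimal sketch in Python; the milestone names and the function name are our own labels for illustration, and only the timestamps are taken from the timeline.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all on 2016-10-12, CEST).
# The keys are illustrative labels, not names used by any monitoring system.
TIMELINE = {
    "first_alarm": "07:28",
    "investigation_started": "07:31",
    "cause_identified": "07:42",
    "hard_reset": "07:50",
    "cluster_functional": "08:05",
    "full_functionality": "09:17",
}


def minutes_since_first_alarm(timeline):
    """Return the whole minutes elapsed from the first alarm to each milestone."""

    def parse(hhmm):
        # All events fall on the same day, so the time of day is sufficient.
        return datetime.strptime(hhmm, "%H:%M")

    start = parse(timeline["first_alarm"])
    return {
        name: int((parse(hhmm) - start).total_seconds() // 60)
        for name, hhmm in timeline.items()
    }


if __name__ == "__main__":
    for name, minutes in minutes_since_first_alarm(TIMELINE).items():
        print(f"{name}: +{minutes} min")
```

Read this way, the cluster was reported functional 37 minutes after the first alarm, and full functionality was reported 109 minutes after it.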
During the event, applications using this specific FIND cluster would have seen network timeouts or slow response times when trying to connect to the service.
This issue is related to the Elasticsearch component of Episerver FIND. Elasticsearch will be upgraded to a newer version during the fall, and this issue will be fixed as part of that upgrade. The issue is also linked to periods of high load on the cluster. Work is ongoing to move indices from this cluster to others to spread the load.
We apologize for the impact on affected customers. While we are proud of the availability we have on FIND, we know how critical this service is to customers. For us, availability is the most important feature, and we will do everything we can to learn from this event and avoid a recurrence.