Episerver Find is a Cloud-based Enterprise Search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday the 7th of October 2019 we experienced an event which impacted the functionality of the service in the EU and APAC Digital Experience Cloud regions. Details of the incident are described below.
On Monday 2019-10-07 at 00:00 AM UTC, the Engineering team received an alert regarding FIND in one of the APAC Digital Experience Cloud regions. Troubleshooting started immediately upon receiving the alert.
Initial analysis did not reveal any underlying issues with the infrastructure, and no abnormal number of client 5xx errors was seen. Some search requests were still being processed successfully, which was one of the reasons it took time to identify the cause of the incident. Continued investigation, together with client error reports, identified an expired SSL certificate on four FIND Service API endpoints in the EU and APAC regions.
However, resolution was delayed due to issues with the SSL renewal process. Once a new certificate was issued, it was installed on one endpoint to ensure everything worked as expected. As the rollout continued on the remaining API endpoints, alerts cleared and the previously affected clusters started to respond to FIND requests over HTTPS again.
All services were fully restored at 2019-10-07 04:06 AM UTC.
October 7th, 2019
00:00 AM UTC: Alert triggered and acknowledged within 1 minute. Investigation initiated immediately on the specific alerting Find Cluster, APFINDPROD03 (v1 in APAC region).
00:32 AM UTC: First analysis did not reveal any underlying issue with the platform; cluster health was good and no abnormal number of client 5xx errors was seen on the back-end. Investigation continued to identify the root cause.
01:24 AM UTC: Identified that the cause of the issue was an expired multi-domain SSL certificate on four FIND Service API endpoints in the APAC and EU regions for v1 Infrastructure.
02:24 AM UTC: Experienced issues during the renewal process for the expired SSL certificate.
02:25 AM UTC: Attempts to report the ongoing incident on the Episerver Status page were unsuccessful.
02:49 AM UTC: Alert triggered for another Find Cluster and acknowledged within 1 minute. The cause was already known, and work was under way to generate a new SSL certificate.
03:50 AM UTC: New SSL certificate generated and rolled out starting with one Find endpoint for verification. Upon confirmation that the new certificate worked as expected, the three remaining endpoints were updated with new SSL certificates.
04:06 AM UTC: All services fully operational and monitoring alerts closed. - RESOLVED
04:20 AM UTC: Customers who had switched from HTTPS traffic to HTTP as a workaround during the incident were advised to revert the change.
07:00 AM UTC: Retrospective meeting to go through incident timelines and mitigation actions.
10:28 AM UTC: Episerver Status page updated to reflect and acknowledge the incident occurrence earlier in the day, since the update was not successful during the incident.
12:00 PM UTC: Completed audit of the existing SSL certificates on all endpoints related to the FIND service.
03:00 PM UTC: Completed implementation of proactive and reactive monitoring for SSL validity on all endpoints.
The investigation identified an expired SSL certificate as the cause of the incident.
We had received a warning about an expiring SSL certificate, but it appeared to concern a domain that was no longer in use, so no further action was taken. Further analysis showed why the renewal notification was misinterpreted: it listed only the primary domain, when the certificate was in fact a multi-domain certificate covering domains still in use on our FIND Service API endpoints in the EU and APAC regions.
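As an illustration of the kind of check that avoids this misreading, the sketch below lists every domain a certificate covers, not only its primary name, and reports the days left until expiry. It is a minimal Python example under stated assumptions and not the monitoring we deployed: the function names and the example.com hostnames are hypothetical, and only the standard library ssl and socket modules are used.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_cert(host, port=443, timeout=5.0):
    """Fetch the peer certificate as a dict over a verified TLS handshake.

    Note: with default verification, an already expired certificate makes
    the handshake fail, so this suits pre-expiry checks only.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def covered_domains(cert):
    """Return every DNS name in the certificate's subjectAltName field."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def days_until_expiry(cert, now):
    """Return whole days between `now` (timezone-aware) and the notAfter date."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - now).days


def needs_renewal(cert, now, threshold_days=30):
    """True when the certificate expires within `threshold_days` (hypothetical policy)."""
    return days_until_expiry(cert, now) <= threshold_days
```

A scheduled job could call fetch_cert for each endpoint and raise an alert listing covered_domains(cert) whenever needs_renewal(cert, datetime.now(timezone.utc)) is true, so that whoever receives the warning sees every domain that depends on the certificate.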
During the event, requests to the affected FIND clusters over SSL/TLS could not complete a successful SSL/TLS handshake and received a Security Authentication Exception.
Immediate remediation
We have undertaken the following activities to prevent this incident from occurring again.
Long term preventative measures
We sincerely apologise for the impact to affected customers. Customer experience is of high priority for us, and we have a strong commitment to delivering high availability for our services. We will do everything we can to learn from the event to avoid a recurrence in the future.