(Resolved) FIND Incident in EU and APAC region
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Find is a Cloud-based Enterprise Search solution that delivers enhanced relevance and powerful search functionality to websites. On Monday the 7th of October 2019, we experienced an event which impacted the functionality of the service in the EU and APAC Digital Experience Cloud regions. Details of the incident are described below.

DETAILS

On Monday 2019-10-07 00:00 AM UTC the Engineering team received an alert regarding FIND in one of the APAC Digital Experience Cloud Regions. Troubleshooting started immediately upon receiving the alert.

Initial analysis did not reveal any underlying issues with the infrastructure, and no abnormal number of client 5xx errors was seen. Some search requests were still being processed successfully, which was one of the reasons it took time to identify the cause of the incident. Continued investigation, together with client error reports, identified an expired SSL certificate on four FIND Service API endpoints in the EU and APAC regions.

However, resolution was delayed due to issues with the SSL renewal process. Once a new certificate was issued, it was installed on one endpoint to ensure everything worked as expected. As the rollout continued on the remaining API endpoints, alerts cleared and the previously affected clusters started to respond to FIND queries over HTTPS again.

All services were fully restored at 2019-10-07 04:06 AM UTC.

TIMELINE

October 7th, 2019

00:00 AM UTC: Alert triggered and acknowledged within 1 minute. Investigation initiated immediately on the specific alerting Find Cluster, APFINDPROD03 (v1 in APAC region).

00:32 AM UTC: First analysis did not reveal any underlying issue with the platform; cluster health was good and no abnormal number of client 5xx errors was seen on the back-end. Investigation continued to identify the root cause.

01:24 AM UTC: Identified that the cause of the issue was an expired multi-domain SSL certificate on four FIND Service API endpoints in the APAC and EU regions for v1 Infrastructure.

02:24 AM UTC: Issues encountered during the renewal process for the expired SSL certificate.

02:25 AM UTC: Attempts to report the ongoing incident on the Episerver Status page were unsuccessful.

02:49 AM UTC: Alert triggered for another Find Cluster and acknowledged within 1 minute. The cause was already known and work to generate a new SSL certificate was in progress.

03:50 AM UTC: New SSL certificate generated and rolled out starting with one Find endpoint for verification. Upon confirmation that the new certificate worked as expected, the three remaining endpoints were updated with new SSL certificates.

04:06 AM UTC: All services fully operational and monitoring alerts closed. - RESOLVED

04:20 AM UTC: Customers who had switched from HTTPS traffic to HTTP as a workaround during the incident were advised to revert the change.

07:00 AM UTC: Retrospective meeting to go through incident timelines and mitigation actions.

10:28 AM UTC: Episerver Status page updated to reflect and acknowledge the incident occurrence earlier in the day, since the update was not successful during the incident.

12:00 PM UTC: Completed audit of the existing SSL certificates on all endpoints related to the FIND service.

03:00 PM UTC: Completed implementation of proactive and reactive monitoring for SSL validity on all endpoints.

ANALYSIS

The investigation identified an expired SSL certificate as the cause of the incident.

We had received a warning about an expiring SSL certificate that appeared to cover only a domain no longer in use, so no further action was taken. Further analysis showed that the renewal notification was misinterpreted because it listed only the primary domain, when the certificate was in fact a multi-domain certificate covering domains still in use on our FIND Service API endpoints in the EU and APAC regions.
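The misreading described above can be guarded against by listing every domain a certificate covers, rather than relying on the primary domain shown in a renewal notice. The following is a minimal sketch in Python, not the tooling used by the Engineering team; the function names `dns_names` and `fetch_cert` are hypothetical, and the certificate is assumed to be in the dictionary form returned by the standard library's `ssl.SSLSocket.getpeercert()`.

```python
import socket
import ssl


def dns_names(cert: dict) -> list:
    """Return every DNS name a certificate covers.

    `cert` is a dict in the shape returned by ssl.SSLSocket.getpeercert().
    subjectAltName entries come first, followed by the subject commonName
    if it is not already listed.
    """
    names = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName" and value not in names:
                names.append(value)
    return names


def fetch_cert(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Fetch the peer certificate from a live endpoint.

    Uses default verification, so this raises an ssl.SSLError subclass if the
    certificate has already expired; it is meant for audits before expiry.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


if __name__ == "__main__":
    # Hypothetical certificate: the primary domain is retired, but the
    # certificate also covers a domain that is still serving traffic.
    sample = {
        "subject": ((("commonName", "retired.example.com"),),),
        "subjectAltName": (
            ("DNS", "retired.example.com"),
            ("DNS", "api.example.com"),
        ),
    }
    print(dns_names(sample))
```

Running such a check against the certificate named in a renewal notice shows whether any listed domain is still in use before the notice is dismissed.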

IMPACT

During the event, requests to the affected FIND clusters over SSL/TLS could not complete a successful SSL/TLS handshake and received a Security Authentication Exception.

CORRECTIVE MEASURES

Immediate remediation

We have undertaken the following activities to prevent this incident from occurring again.

  • To improve detection and speed up mitigation, we have introduced additional SSL monitoring for all FIND service URLs that will help identify such issues faster. (Completed)
  • Performed a review of all assets that should have SSL monitoring enabled. (Completed)
  • Modified and verified automation for management escalation during high impact incidents, and performed additional internal training of hierarchical escalation procedures. (Completed)
  • Introduced an additional monitoring system for SSL certificate validity, with automated notifications for each endpoint 60, 30, 14, 7, 5, 4, 3, 2 and 1 days in advance. (Completed)
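The notification schedule in the last item can be illustrated with a short sketch. This is an assumption about how such a check might look, not a description of the monitoring system that was deployed; `NOTIFY_DAYS`, `days_until_expiry` and `should_notify` are hypothetical names, and the certificate is again assumed to be a dict as returned by Python's `ssl.SSLSocket.getpeercert()`.

```python
import ssl
from datetime import datetime, timezone

# Days before expiry on which a notification is sent (schedule from the report).
NOTIFY_DAYS = (60, 30, 14, 7, 5, 4, 3, 2, 1)


def days_until_expiry(cert: dict, now: datetime) -> int:
    """Whole days (rounded down) between `now` and the certificate's notAfter.

    `now` must be timezone-aware. The result is zero or negative once less
    than a full day remains or the certificate has expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - now).days


def should_notify(days_left: int) -> bool:
    """Notify on each scheduled day, and always once under a day remains."""
    return days_left <= 0 or days_left in NOTIFY_DAYS


if __name__ == "__main__":
    # Hypothetical certificate expiry date, for illustration only.
    sample = {"notAfter": "Dec 31 00:00:00 2030 GMT"}
    left = days_until_expiry(sample, datetime.now(timezone.utc))
    print(left, should_notify(left))
```

Run once a day per endpoint, a check like this notifies on each scheduled day and keeps notifying after expiry, so a missed warning does not go silent.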

Long term preventative measures

  • Review and implement improvements to the internal SSL certificate renewal process and procedures, not only for FIND services but for all Episerver services. (Ongoing)
  • During the incident, we also encountered issues with updating the Episerver Status page, our communication channel for providing visibility on service status. We failed in this commitment due to access constraints and unclear procedures. We will complete a review of the users authorized to update the Episerver Status page to improve incident reporting. We will also perform internal training and improve documentation on the reporting of incidents for better client visibility. (Ongoing)

FINAL WORDS

We sincerely apologise for the impact to affected customers. Customer experience is of high priority for us, and we have a strong commitment to delivering high availability for our services. We will do everything we can to learn from the event to avoid a recurrence in the future.

Posted Oct 11, 2019 - 07:30 UTC

Resolved
Between 2019-10-07 02:00 AM CEST and 2019-10-07 06:03 AM CEST we had a Find incident in the EU and APAC Digital Experience Cloud regions.

We are still investigating exactly what happened and why, so that we can establish the root cause and determine how to prevent it from happening again. An incident report will be published as soon as it becomes available.
Posted Oct 07, 2019 - 10:28 UTC