Service availability issue
Incident Report for Optimizely Service
Postmortem

Summary

Episerver Digital Experience Cloud™ Service (DXC Service) is the cloud-based offer from Episerver based on Microsoft cloud technology. It delivers high availability and performance, easy connectivity with other cloud services and existing systems, the ability to manage spikes in customer demand, and a platform that is ready to seamlessly adopt the latest technology updates.

Between December 11th and December 14th, 2018, a subset of customers hosted in the Europe and West US Digital Experience Cloud regions experienced HTTP 500 responses on their websites. This report describes the event in more detail.

Timeline

December 11th, 2018

6:46 AM CET The first alert is received and an incident support ticket is created with level 1 support.

3:59 PM CET It is identified that the issue is not isolated to a single client but affects a subset of clients hosted in the Digital Experience Cloud Europe region. Since the issue surfaced on a Patch Tuesday and the error message indicates an OS-related issue, a priority A case is opened with Microsoft. The error messages show a failure to load the third-party component React.

December 12th, 2018

08:30 AM CET It is identified that upgrading React to the latest version resolves the error. This is communicated to affected clients as a workaround while we continue to investigate the root cause with Microsoft.

4:29 PM CET The first alert related to an affected client in the Digital Experience Cloud West US region is received.

December 13th, 2018

09:00 AM CET: A preliminary root cause is identified. The likely cause of the interruptions is a critical security update of Microsoft's JavaScript engine ChakraCore (CVE-2018-8624), which Microsoft is rolling out on Azure App Services across all Azure regions. The majority of client services have recovered by implementing the workaround, but we are still receiving occasional reports from clients.

December 14th, 2018

10:30 AM CET: No more incidents have been reported and the incident is marked as resolved. We continue to work with Microsoft to map the events leading up to the incident.

January 7th, 2019

11:21 AM CET: The full root cause analysis report is received from Microsoft and the Incident ticket is closed.

Root cause

The investigation determined that the default JavaScript engine, MSIE, loaded properly before the latest update. After the update, MSIE failed to load and the default JavaScript engine fell back to V8, which did not have all the components required to support all features of older versions of React. The resulting exception was therefore misleading. The underlying problem was that the new release did not include the newer chakra.dll (the native library backing the MSIE engine).

The issue mainly affected React implementations that use server-side components.
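
The failure mode described above, where a silent fallback hides the real load error, can be illustrated with a minimal sketch. This is a simplified model written for this report, not the actual engine-switching code: all names (`selectWithFallback`, `selectExplicit`, `render`, and the `legacy-api` feature) are hypothetical and do not mirror the API of any real library.

```typescript
// Hypothetical model of engine selection. The names below are illustrative
// and do not mirror the API of any real engine-switching library.
type EngineName = "MSIE" | "V8";

interface Engine {
  name: EngineName;
  features: string[]; // capabilities the engine offers to scripts
}

// A factory returns a ready engine or throws if the engine cannot load.
type EngineFactory = () => Engine;

// Implicit selection: try factories in order, silently skipping load failures.
function selectWithFallback(factories: EngineFactory[]): Engine {
  for (const create of factories) {
    try {
      return create();
    } catch {
      // The load error is discarded here, so the real cause never surfaces.
    }
  }
  throw new Error("no engine could be loaded");
}

// Explicit selection: load exactly the requested engine and fail fast.
function selectExplicit(
  name: EngineName,
  factories: Record<EngineName, EngineFactory>
): Engine {
  return factories[name]();
}

// Rendering fails if the chosen engine lacks a feature the script needs.
function render(engine: Engine, requiredFeature: string): string {
  if (engine.features.indexOf(requiredFeature) === -1) {
    throw new Error(engine.name + " does not support " + requiredFeature);
  }
  return "rendered on " + engine.name;
}

// Simulated post-update state: the first engine cannot load its native library.
const factories: Record<EngineName, EngineFactory> = {
  MSIE: () => {
    throw new Error("MSIE failed to load: native library missing");
  },
  V8: () => ({ name: "V8", features: ["modern-api"] }),
};

// Runs an action and returns the error message it produced, if any.
function messageOf(action: () => unknown): string {
  try {
    action();
    return "no error";
  } catch (err) {
    return err instanceof Error ? err.message : String(err);
  }
}

// With fallback, the reported error points at the script, not the load failure.
const fallbackError = messageOf(() =>
  render(selectWithFallback([factories.MSIE, factories.V8]), "legacy-api")
);

// With an explicit engine, the load failure itself is reported.
const explicitError = messageOf(() =>
  render(selectExplicit("MSIE", factories), "legacy-api")
);
```

In this model, the fallback path reports a missing feature on V8 and never mentions the engine that failed to load, while the explicit path reports the load failure directly.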

Corrective and Preventative Measures

Episerver worked closely with Microsoft on this issue and has been informed that the following corrective and preventive actions have been or will be undertaken by Microsoft.

Engineers reviewed the deployment release package for the latest update and found a bug that prevented the newer chakra.dll from being utilized properly. A fix was immediately put together, tested and deployed on December 13, 2018 at 11:53:29 AM to mitigate the issue for all customers who could have been affected.

A postmortem of this incident has included repair items for the App Service release process and code reviews.

Final Words

We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our services and we will do everything we can to learn from the event and to avoid a recurrence in the future.

We also want to emphasize the importance of regularly upgrading the third-party components used in customer solutions. The workaround for this incident was to upgrade to the latest version of React, and solutions already using the latest version remained unaffected by the event.

Posted Jan 29, 2019 - 09:28 UTC

Resolved
This issue seems to be resolved; no more incidents have been reported.

Engineers will continue to investigate together with Microsoft to establish the root cause. Once the investigation is complete, the findings will be published on the status page.
Posted Dec 14, 2018 - 09:32 UTC
Update
We have identified the likely cause of the interruptions to be a critical security update of Microsoft's JavaScript engine ChakraCore (CVE-2018-8624), which Microsoft is rolling out on Azure App Services worldwide.

To resolve the issue, JavaScriptEngineSwitcher.Core must be updated to the latest version and an engine must be specified (V8, MSIE, Vroom, Chakra).

Most services have recovered, but we are still receiving occasional reports from clients. Please call support if you have any further questions.
Posted Dec 13, 2018 - 08:45 UTC
Update
We are continuing to monitor for any further issues.
Posted Dec 13, 2018 - 03:18 UTC
Update
We are continuing to monitor for any further issues.
Posted Dec 12, 2018 - 18:09 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Dec 12, 2018 - 14:37 UTC
Identified
A fix has been identified and we are working closely with the affected clients to resolve the current issues.

We will provide more updates as soon as they become available.
Posted Dec 12, 2018 - 12:48 UTC
Investigating
Episerver is investigating intermittent service outages for a subset of websites hosted in Europe and the United States.

We have identified an issue affecting a subset of customers hosted in Europe and the United States who are experiencing service outages. We are currently investigating this together with Microsoft and will provide updates to this page as they become available.
Posted Dec 12, 2018 - 10:15 UTC