Microsoft Azure App Service Intermittent Error (North Central US)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Episerver Customer-Centric Digital Experience Platform (DXP; formerly Digital Experience Cloud™ Service - DXC Service) is Episerver's cloud-based offering built on Microsoft cloud technology. It delivers high availability and performance, easy connectivity with other cloud services and existing systems, the ability to manage spikes in customer demand, and a platform that is ready to seamlessly adopt the latest technology updates.

Between 16:43 and 20:42 UTC on 19 Mar 2020, a limited subset of customers using App Service in North Central US may have received intermittent HTTP 500-level response codes on their sites. Root cause analysis has been provided by Microsoft, and the following report describes additional details about the event.

DETAILS

On March 19th 2020 at 16:43 UTC, Episerver received the first alert indicating that a subset of North Central US client sites were down. Troubleshooting started immediately and revealed that the timing coincided with an outage of the Microsoft Azure platform. The issue was triggered by a sudden rise in incoming traffic to the front ends in one of the scale units, which resulted in failures and timeouts, while some clients also experienced high latency. Resources were increased in the affected scale unit, and the service was fully recovered at 20:42 UTC.

TIMELINE

2020-03-19 15:10 UTC - Microsoft reported an issue with their Azure Platform.

2020-03-19 16:43 UTC - First alerts for client websites are received and investigation is initiated by Episerver.

2020-03-19 16:54 UTC - Support ticket raised with Microsoft.

2020-03-19 18:00 UTC - StatusPage updated.

2020-03-19 20:42 UTC - The affected App Services are fully recovered.

2020-03-20 15:03 UTC - Incident closed. Microsoft continues the investigation to establish the full root cause.

2020-03-26 00:16 UTC - Microsoft officially provided root cause analysis.

ANALYSIS

The issue was caused by a sudden rise in incoming traffic to the front ends in one of the scale units.

Upon detecting the increased load, automated throttling triggered, which largely prevented broadly impacting failures and timeouts, but some customers still experienced high latency. Once the App Service team was engaged, engineers performed additional mitigation by increasing resources in the impacted scale unit.
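The throttling behavior described above can be sketched, purely as an illustration, with a simple token-bucket limiter. This is not the Azure platform's actual mechanism; all names and parameters below are assumptions:

```python
import time
import threading

class TokenBucket:
    """Illustrative token-bucket throttle (NOT Azure's implementation):
    admit a request only when a token is available, refilling tokens
    at a fixed rate so bursts above the limit are shed early."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec          # sustained requests per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False  # over the limit: shed or queue the request

# Hypothetical front-end usage: reject excess traffic instead of timing out.
bucket = TokenBucket(rate_per_sec=100, capacity=20)
if not bucket.allow():
    pass  # e.g. respond 429/503 rather than letting the request pile up
```

The design trade-off the incident hints at is the granularity of such a limiter: a coarse, scale-unit-wide bucket protects the front ends but can add latency for customers who were not the source of the traffic spike, which matches Microsoft's corrective measure of making throttling "more targeted".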

IMPACT

A subset of customers may have received HTTP 500-level response codes, experienced timeouts, or seen high latency when accessing App Service (Web, Mobile, and API Apps).
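During a window like this, intermittent 500-level responses and timeouts are typically treated as transient by callers. As an illustration only (this is not part of DXP or the Azure platform), a client-side retry with exponential backoff might look like:

```python
import random
import time
import urllib.error
import urllib.request

def backoff_delay(attempt: int, base: float = 0.5) -> float:
    """Exponential backoff schedule: base * 2**attempt seconds."""
    return base * (2 ** attempt)

def fetch_with_retry(url: str, attempts: int = 4) -> bytes:
    """Retry transient HTTP 5xx responses and timeouts; re-raise
    non-transient errors (e.g. 4xx) immediately."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code < 500 or attempt == attempts - 1:
                raise  # client error, or out of retries
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts - 1:
                raise
        # Sleep with jitter so many retrying clients don't synchronize
        # and re-create the traffic spike they are retrying around.
        time.sleep(backoff_delay(attempt) * (0.5 + random.random() / 2))
```

The jitter is the important detail here: synchronized retries against an already-overloaded scale unit would add to exactly the kind of traffic surge described in the analysis above.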

CORRECTIVE MEASURES

Since the root cause was discovered, the necessary fixes have been implemented to prevent the issue from recurring.

Microsoft is continuously taking steps to improve the Microsoft Azure platform and its processes to help ensure such incidents do not occur in the future. In this case, these steps include (but are not limited to):

  • Enhancing automated throttling to make it more targeted.
  • Enhancing auto-scaling measures for faster mitigation time in the future.

FINAL WORDS

We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our services and we will do everything we can to learn from the event and to avoid a recurrence in the future.

Posted Apr 13, 2020 - 04:20 UTC

Resolved
The affected App Services are operational and the issue is resolved. We are awaiting the RCA from Microsoft and will post it here when we receive it.
Posted Mar 20, 2020 - 15:03 UTC
Update
Microsoft has provided the RCA for one of the two affected applications. Regarding the other application, it was identified that the high CPU consumption of the web app on the instance could cause requests sent to the instance to fail. We are working on mitigation steps and will keep the progress updated.
Posted Mar 20, 2020 - 03:35 UTC
Monitoring
The affected App Services are running without error. Microsoft is investigating the root cause.
Posted Mar 19, 2020 - 20:42 UTC
Investigating
Microsoft reports: we can see that around 19 Mar 2020 15:24 UTC, you have been identified as a customer using App Service in North Central US who may receive intermittent HTTP 500-level response codes, experience timeouts or high latency when accessing App Service hosted in this region. Our engineers are actively investigating a possible issue in this region.

Episerver also had one similar occurrence in EMEA, and is investigating.
Posted Mar 19, 2020 - 18:00 UTC