Issue with Azure APIs that affects deployments
Incident Report for Optimizely Service
Postmortem

Summary

Episerver Digital Experience Cloud™ Service (DXC Service) is the cloud-based offering from Episerver built on Microsoft cloud technology. It delivers high availability and performance, easy connectivity with other cloud services and existing systems, the ability to manage spikes in customer demand, and a platform that is ready to seamlessly adopt the latest technology updates.

On June 19, 2019, DXC Service (DXC-S) customers were unable to run deployments. Microsoft provided the following root cause analysis.

Details

Between 05:06 and 07:46 UTC on 19 June 2019, a subset of customers may have experienced latency, timeouts, or HTTP 500-level response codes while performing service management operations such as "site create", "delete", and "move resources". Auto-scaling and the loading of site metrics may also have been impacted. Azure Resource Manager (ARM) deployments containing App Service resources may have failed with the error message "Internal Server Error".
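
For context, operations such as these are issued against the Azure Resource Manager (ARM) REST API. The sketch below is not from Microsoft or Episerver; it is a minimal illustration, assuming a hypothetical App Service resource URL and a pre-acquired Azure AD bearer token, of how a deployment script might retry transient 500-level responses with exponential backoff instead of failing on the first attempt.

```python
import time
import requests

# Hypothetical ARM endpoint for an App Service site; the api-version and
# resource path are illustrative, and a real call needs a valid AAD token.
ARM_URL = ("https://management.azure.com/subscriptions/{sub}/resourceGroups/"
           "{rg}/providers/Microsoft.Web/sites/{site}?api-version=2019-08-01")

def get_site_with_retry(sub, rg, site, token, max_attempts=5):
    """GET an App Service resource, retrying transient HTTP 5xx responses."""
    url = ARM_URL.format(sub=sub, rg=rg, site=site)
    headers = {"Authorization": f"Bearer {token}"}
    for attempt in range(1, max_attempts + 1):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code < 500:          # success or non-retryable client error
            resp.raise_for_status()         # raises on 4xx, passes on 2xx
            return resp.json()
        if attempt == max_attempts:         # out of retries: surface the 5xx
            resp.raise_for_status()
        time.sleep(2 ** attempt)            # backoff: 2s, 4s, 8s, ...
```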

Root cause

Microsoft determined that, as part of ongoing work to drive platform resilience and ensure service stability, a configuration change was performed on the App Service Resource Provider, the part of the service architecture that processes service management requests such as "create", "delete", and "update". The initial configuration change was applied successfully, but a follow-up update caused an unexpected impact to the systems that handle service management requests, and a subset of customers experienced failures as a result. Existing App Service resources were not impacted by this issue, but auto-scale operations may have failed during this time, which could have impacted a site's ability to scale to meet demand.

The specific configuration that caused the issue relates to how App Service processes management requests between regions. Microsoft updates this logic continuously to increase availability and resiliency on a region-by-region basis, but during this specific update the logic encountered unexpected data (observable only in the production environment). This unexpected data was not handled gracefully, causing the logic to crash for a certain percentage of incoming requests (~1%). As a result, the impacted management requests failed before they could be processed.
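
Microsoft has not published the exact shape of the unexpected data, so the following is a purely hypothetical, minimal sketch of the failure mode described above: a routing function that assumes well-formed request metadata crashes on an unexpected payload, while a defensive variant degrades gracefully instead of failing the request outright.

```python
def route_request_fragile(request: dict) -> str:
    # Assumes routing metadata is always present and well-formed; an
    # unexpected payload raises (KeyError/AttributeError) and the request
    # fails before it can be processed, analogous to the incident above.
    return request["routing"]["target_region"].lower()

def route_request_defensive(request: dict, default_region: str = "westeurope") -> str:
    # Handles unexpected data gracefully: malformed or missing routing
    # metadata falls back to a default instead of crashing the handler.
    routing = request.get("routing") or {}
    region = routing.get("target_region")
    return region.lower() if isinstance(region, str) else default_region

# route_request_fragile({}) raises KeyError; the defensive variant
# returns "westeurope" for the same payload.
```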

Due to the nature of the App Service's service management architecture, the impact on overall customer requests was limited, but it was not confined to any one specific region.

Resolution and Recovery

To mitigate, Microsoft isolated the specific update that had caused the issue and rolled it back, which restored service management functionality. Microsoft then monitored the platform for an extended period to ensure that full service had been restored for customers.

Corrective and Preventative Measures

  • Review of pre-deployment processes to ensure that test cases fully replicate the production environment and catch update issues prior to roll-out.
  • Enhancement of the test cases related to platform updates to cover the specific circumstances of this update.
  • Inclusion of the case observed during this incident in further testing, along with a review of whether any similar combinations of data/payload could push the system into a similar faulty state (see the test sketch after this list).
  • Review of detection and auto-mitigation logic to reduce the time to detect and, where possible, mitigate before customers experience impact.
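
To make the third measure concrete, here is a hedged, pytest-style sketch of the kind of regression test it implies. All names are invented and the incident's actual payloads are not public; the point is simply to exercise a request handler against malformed payload variants rather than only the happy path.

```python
import pytest

def resolve_region(request: dict, default: str = "westeurope") -> str:
    # Minimal, hypothetical stand-in for the routing logic under test.
    routing = request.get("routing") or {}
    region = routing.get("target_region")
    return region if isinstance(region, str) and region else default

# Representative malformed variants; the real incident payload is not public.
@pytest.mark.parametrize("payload", [
    {},                                   # routing metadata missing entirely
    {"routing": None},                    # present but null
    {"routing": {}},                      # missing target_region
    {"routing": {"target_region": 42}},   # wrong type
])
def test_routing_never_crashes_on_malformed_payloads(payload):
    assert resolve_region(payload) == "westeurope"
```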

Final Words

Microsoft sincerely apologizes for the impact to affected customers and for any inconvenience this may have caused. They are continuously taking steps to improve the Microsoft Azure platform and their processes to help ensure such incidents do not occur in the future.

Posted Jul 03, 2019 - 14:06 UTC

Resolved
This incident has been resolved, and we are awaiting a root cause analysis (RCA) from Microsoft, which we will publish once received.
Posted Jun 19, 2019 - 11:38 UTC
Monitoring
We have no further reports of failing deployments, but we are still monitoring the service. Please contact support@episerver.com if you encounter any issues.
Posted Jun 19, 2019 - 09:49 UTC
Identified
The issue has been identified and we can now see that deployments are completing as expected.

Please contact Managed Services (support@episerver.com) if you experience any further issues.
Posted Jun 19, 2019 - 08:49 UTC
Update
We have now also received reports of the same issue from the US region.

Investigation continues with the support of Microsoft.
Posted Jun 19, 2019 - 07:56 UTC
Investigating
We are currently investigating an ongoing issue with Azure APIs that affects deployments. We recommend that you do not start any new deployments at this time, and that you reach out to Managed Services (support@episerver.com) if you need to reschedule.

We will provide more updates as soon as they become available.
Posted Jun 19, 2019 - 06:59 UTC