PaaS Portal deployment issue
Incident Report for Optimizely Service
Postmortem

Summary

Episerver Digital Experience Cloud™ Service (DXC Service) is Episerver's cloud-based offering, built on Microsoft cloud technology. It delivers high availability and performance, easy connectivity with other cloud services and existing systems, the ability to manage spikes in customer demand, and a platform that is ready to seamlessly adopt the latest technology updates.

On May 9, 2019, DXC Service (DXC-S) customers were unable to run deployments. This report provides the details of the incident.

Timeline

2019-05-09 10:30 CEST: The inability to run deployments was reported and an investigation was started immediately.

2019-05-09 11:07 CEST: After an initial investigation into the root cause, Episerver engineers created and submitted a B-priority support case to Microsoft.

2019-05-09 11:49 CEST: Episerver published a notification about the issue for DXC-S management portal users.

2019-05-09 11:58 CEST: An incident was posted at status.episerver.com.

2019-05-09 12:03 CEST: A new high-priority, A-level support case was opened with Microsoft.

2019-05-09 12:40 CEST: A Microsoft support engineer confirmed that there was an issue with Azure Automation in West Europe, which other customers had also reported.

2019-05-09 13:10 CEST: The active Automation account was switched to North Europe and a hotfix was deployed to the DXC-S management portal. This mitigated the issue for customers starting new deployments.

2019-05-09 15:00 CEST: The Episerver engineering team mitigated the issue for customers who had tried to run deployments during the incident.

2019-05-09 15:15 CEST: Microsoft reported that the issue was mitigated, and this was reviewed and confirmed by Episerver engineers.

Root cause

Microsoft support provided the following as the preliminary root cause: Azure Engineers determined that some instances of a backend service responsible for processing runbook requests had reached an operational threshold, preventing requests from completing.

Resolution and Recovery

The active Automation account was switched to the North Europe region. The Azure engineering team resolved the issue with Automation in the West Europe region.
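
The report does not describe how the switch was carried out. Purely as an illustration of the idea of falling back to an Automation account in another region, a minimal sketch could look like the following. The account names, the `AutomationAccount` type, the `pick_active_account` helper, and the health check are all hypothetical and are not taken from the DXC-S management portal or from any Azure SDK.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class AutomationAccount:
    name: str    # hypothetical account name, for illustration only
    region: str  # Azure region the account lives in


def pick_active_account(
    accounts: Sequence[AutomationAccount],
    is_healthy: Callable[[AutomationAccount], bool],
) -> AutomationAccount:
    """Return the first account, in priority order, that passes the health check."""
    for account in accounts:
        if is_healthy(account):
            return account
    raise RuntimeError("no healthy Automation account available")


# Priority order: primary region first, fallback region second (assumed setup).
ACCOUNTS = [
    AutomationAccount("automation-weu", "West Europe"),
    AutomationAccount("automation-neu", "North Europe"),
]


def west_europe_degraded(account: AutomationAccount) -> bool:
    # Stand-in health check that simulates the outage described in this report.
    return account.region != "West Europe"


active = pick_active_account(ACCOUNTS, west_europe_degraded)
print(active.region)
```

In a real system the health check would need to probe the service itself, for example by tracking whether recently submitted runbook jobs complete; the stand-in above only simulates the West Europe degradation.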

Corrective and Preventative Measures

This issue was filed as a bug report for the DXC-S management portal so that similar outages can be mitigated in a timely and efficient manner in the future.

Final Words

We apologize for the impact on affected customers. We are strongly committed to delivering high availability for our services, and we will do everything we can to learn from this event and avoid a recurrence.

Posted May 15, 2019 - 07:43 UTC

Resolved
The incident has been resolved.
Posted May 09, 2019 - 14:43 UTC
Monitoring
We are continuing to monitor the results and addressing failed deployments that occurred during this event.
Posted May 09, 2019 - 13:15 UTC
Identified
The issue has been identified and a workaround has been implemented. We are currently monitoring the behavior.
Posted May 09, 2019 - 11:43 UTC
Investigating
There is an ongoing issue with Azure Automation that affects deployments. We are investigating the issue.
Posted May 09, 2019 - 09:58 UTC