PaaS Portal deployment issue
Incident Report for Optimizely Service
Postmortem

Summary

For a subset of deployments in the PaaS portal, the status was not updated and the deployment appeared to be stuck at 0%, although it was still running in the background.

Timeline

2019-02-11 13:52 CET: Received reports about issues with hanging deployments
2019-02-11 14:02 CET: After a review of the system logs, the preliminary root cause was determined to be a service bus issue (timeouts)
2019-02-11 14:06 CET: Started to manually force a status update of affected deployments
2019-02-11 14:45 CET: Switched over to another service bus; new deployments from this time on were unaffected
2019-02-11 15:10 CET: Last deployment affected by the incident was fixed (status in sync again)
2019-02-11 15:15 CET: Confirmed that no new service bus timeouts had occurred since the switch
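
The manual status update mentioned in the timeline can be pictured roughly as follows. This is only an illustrative sketch, not the tooling that was actually used: the `Deployment` record, its fields, and the `resync` function are hypothetical names, and it assumes the back end can report the real progress of each deployment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    # Hypothetical record; field names are assumptions for illustration.
    deployment_id: str
    displayed_progress: int  # percent shown in the portal
    actual_progress: int     # percent reported by the back end


def resync(deployments: List[Deployment]) -> List[str]:
    """Copy back-end progress to the displayed status.

    Returns the ids of the deployments whose displayed status was changed.
    """
    fixed = []
    for d in deployments:
        if d.displayed_progress != d.actual_progress:
            d.displayed_progress = d.actual_progress
            fixed.append(d.deployment_id)
    return fixed


deployments = [
    Deployment("a", displayed_progress=0, actual_progress=60),    # stuck at 0%
    Deployment("b", displayed_progress=100, actual_progress=100), # already in sync
]
fixed_ids = resync(deployments)
```

In this sketch only deployment "a" is touched, since its displayed status lags behind what the back end reports.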

Root cause

The root cause was a problem with the service bus used by the PaaS portal, which was intermittently unreachable during the day. While the service bus experienced issues for most of the day, the problem appeared to escalate and affect more deployments in the afternoon (CET).

Resolution and Recovery

We redeployed the PaaS portal to a slot that used another service bus, then swapped the new slot in to avoid downtime for customers (the timeouts affected only a subset of deployments). No new timeouts occurred once the new service bus went into production.
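
The recovery described here was a manual redeploy and slot swap. Purely as an illustration of the underlying idea (switching to a second service bus when the first one times out), a minimal sketch could look like the one below. It assumes hypothetical sender callables and a hypothetical `BusTimeout` exception, and it does not use or represent any real service bus SDK.

```python
from typing import Callable, Sequence


class BusTimeout(Exception):
    """Hypothetical error raised when a bus does not respond in time."""


def publish_status(message: dict, senders: Sequence[Callable[[dict], None]]) -> int:
    """Try each sender in order and return the index of the one that succeeded."""
    last_error = None
    for index, send in enumerate(senders):
        try:
            send(message)
            return index
        except BusTimeout as exc:
            last_error = exc  # remember the failure and try the next bus
    raise RuntimeError("all service buses timed out") from last_error


delivered = []


def flaky_primary(message: dict) -> None:
    # Stand-in for the intermittently unreachable bus.
    raise BusTimeout("primary bus unreachable")


def healthy_secondary(message: dict) -> None:
    # Stand-in for the replacement bus.
    delivered.append(message)


used = publish_status(
    {"deployment": "example", "progress": 40},
    [flaky_primary, healthy_secondary],
)
```

Here the status message ends up on the second sender after the first one times out. Whether an automatic fallback of this kind is appropriate depends on ordering and duplicate-delivery requirements that this sketch ignores.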

Corrective and Preventative Measures

We plan to upgrade and migrate to a new service bus infrastructure.

Final Words

We apologize for the impact on affected customers. We have a strong commitment to delivering high availability for our services, and we will do everything we can to learn from the event and to avoid a recurrence.

Posted Feb 18, 2019 - 10:22 UTC

Resolved
This incident has been resolved.
Posted Feb 11, 2019 - 16:13 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 11, 2019 - 15:17 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Feb 11, 2019 - 15:08 UTC
Investigating
The PaaS portal is currently experiencing issues with displaying deployment progress. The deployments are still running in the back end, but no information about their progress is shown.
Posted Feb 11, 2019 - 14:55 UTC