A subset of the deployments in PaaS portal didn't have an updated status and appeared to be stuck at 0%, but the deployments were in fact still running in the background.
2019-02-11 13:52 CET: Received reports about issues with hanging deployments
2019-02-11 14:02 CET: After going through the system logs, the preliminary root cause was determined to be a service bus issue (timeouts)
2019-02-11 14:06 CET: Started to manually force a status update of affected deployments
2019-02-11 14:45 CET: Switched over to another service bus; deployments started from this point on were unaffected
2019-02-11 15:10 CET: Last deployment affected by the incident was fixed (status in sync again)
2019-02-11 15:15 CET: Confirmed no new timeouts against the service bus had occurred since the switch to another one
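The "manually force a status update" step in the timeline amounts to finding deployments whose status has gone silent and resyncing them. A minimal sketch of such a stuck-deployment detector is below; the `Deployment` model, field names, and the ten-minute threshold are illustrative assumptions, not the portal's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical model of a portal deployment record; the field names are
# illustrative and do not reflect the actual PaaS portal schema.
@dataclass
class Deployment:
    deployment_id: str
    progress: int                 # percentage shown in the portal
    last_status_update: datetime  # last update delivered via the service bus

def find_stuck_deployments(deployments, now, max_silence=timedelta(minutes=10)):
    """Return deployments whose status has not advanced within max_silence.

    These are the candidates for a forced status resync, mirroring the
    manual step taken at 14:06 CET.
    """
    return [
        d for d in deployments
        if d.progress < 100 and now - d.last_status_update > max_silence
    ]

now = datetime(2019, 2, 11, 14, 6)
deployments = [
    Deployment("a", 0, datetime(2019, 2, 11, 13, 40)),    # silent since 13:40
    Deployment("b", 100, datetime(2019, 2, 11, 13, 50)),  # already finished
    Deployment("c", 40, datetime(2019, 2, 11, 14, 5)),    # updating normally
]
stuck = find_stuck_deployments(deployments, now)
```

In this sketch only deployment "a" is flagged: it is unfinished and has not reported in over ten minutes, while "b" is complete and "c" updated a minute ago.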
The root cause was that the service bus used by PaaS portal was intermittently unreachable during the day. While the service bus experienced issues for the greater part of the day, the problem escalated and affected more deployments in the afternoon (CET).
We redeployed PaaS portal to a slot that used a different service bus and swapped the new slot into production to avoid downtime for customers (the timeouts affected only a subset of deployments). No new timeouts have occurred since the new service bus went into production.
We plan to upgrade and migrate to a new service bus infrastructure.
We apologize for the impact to affected customers. We have a strong commitment to delivering high availability for our services and we will do everything we can to learn from the event and to avoid a recurrence in the future.