Total service interruption
Resolved
Jan 08 at 10:52am CET
Timeline of the first incident
At 08:30 we received the first reports from users that logging in to the application was no longer possible.
Monitoring started to show timeouts at 08:31 and waited 5 minutes before automatically creating an incident, which happened at 08:36.
By this time it was clear that the outage affected all users of MP, and emergency troubleshooting was started.
After an initial investigation of the basic network infrastructure, a failed infrastructure update that had been triggered at 08:00 was identified as the root cause.
The failed update had initially been ruled out because of the difference in timing, but it turned out that the actual changes were only applied to production at around 08:28.
An attempt to roll back the changes was unsuccessful, since the failed update had left the network infrastructure in an inconsistent state, so we had to identify the faulty settings manually.
At 09:10 the issue was identified; by 09:13 the application had returned to a normal state and monitoring confirmed the resolution of the incident.
Timeline of the second incident
At 10:30 monitoring reported another outage, which users began reporting at 10:40. Since the team was still investigating the previous outage, the root cause was quickly traced back to the same deployment.
The second incident was resolved by 10:46 and the system remained stable for the rest of the day.
Total service interruption
With the two incidents combined, the service was interrupted for 1h 4min, resulting in a total availability of 99.95% over the last 90 days.
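For reference, the availability figure follows directly from the stated downtime over the 90-day measurement window; a minimal Python check (using only the numbers quoted above):

```python
# Sanity check of the availability figure quoted above.
window_minutes = 90 * 24 * 60     # 90-day window = 129,600 minutes
downtime_minutes = 64             # 1h 4min across both incidents
availability = 1 - downtime_minutes / window_minutes
print(f"{availability:.4%}")      # 99.9506% -> reported as 99.95%
```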
Technical details
During the aforementioned infrastructure update, Terraform attempted to drop and recreate the VNET peering between the K8s network and our ApplicationGateway network, as well as the access policy that grants AppGW access to our KeyVault resource (where the SSL certificates are stored).
After the resources had been dropped, an unexpected manual change to the K8s cluster crashed the deployment pipeline and prevented the recreation of the previously destroyed resources.
This means the Terraform deployment strategy appears to have applied the changes in a faulty sequence.
Based on these findings, we recreated the VNET peering and the access policy manually.
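As an illustration only, the manual recreation could look roughly like the sketch below using the Azure management SDK for Python. All resource names, IDs, and the permission set are placeholders rather than our actual configuration, and the exact call shapes depend on the azure-mgmt-network / azure-mgmt-keyvault versions in use.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.keyvault import KeyVaultManagementClient

SUBSCRIPTION_ID = "<subscription-id>"              # placeholder
credential = DefaultAzureCredential()

# 1) Recreate the VNET peering from the K8s network to the AppGW network.
#    (Azure peerings are per-direction; the reverse peering is recreated analogously.)
network = NetworkManagementClient(credential, SUBSCRIPTION_ID)
network.virtual_network_peerings.begin_create_or_update(
    resource_group_name="rg-network",              # placeholder names
    virtual_network_name="vnet-k8s",
    virtual_network_peering_name="k8s-to-appgw",
    virtual_network_peering_parameters={
        "remote_virtual_network": {"id": "<appgw-vnet-resource-id>"},
        "allow_virtual_network_access": True,
        "allow_forwarded_traffic": True,
    },
).result()

# 2) Re-add the KeyVault access policy so AppGW can read the SSL certificates again.
keyvault = KeyVaultManagementClient(credential, SUBSCRIPTION_ID)
keyvault.vaults.update_access_policy(
    resource_group_name="rg-keyvault",
    vault_name="kv-certificates",
    operation_kind="add",
    parameters={
        "properties": {
            "access_policies": [{
                "tenant_id": "<tenant-id>",
                "object_id": "<appgw-managed-identity-object-id>",
                "permissions": {"secrets": ["get"], "certificates": ["get"]},
            }]
        }
    },
)
```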
Updates

Updated
Jan 08 at 10:30am CET
All services are down.

Updated
Jan 08 at 09:16am CET
All services are running again.

Created
Jan 08 at 08:37am CET
All services down.