Google’s Compute Engine went down for 18 minutes across all regions on 13 April at 19:09 Pacific Time owing to bugs in its network management software.
The outage was attributed to two bugs in the software that routes the traffic to instances of the Compute Engine. The fault caused a failure in these traffic routes, bringing down the Compute Engine service.
The outage was triggered when Google engineers, looking to boost the performance of Compute Engine, removed an unused IP block from the network configuration. IP blocks are what allow external systems to locate services on the Google Cloud Platform (GCP).
Removing redundant IP blocks makes GCP faster and more efficient, because traffic can take the shortest path between customers and the nearest GCP instance.
Benjamin Treynor Sloss, Google’s vice president of engineering, explained on the Google Cloud Status blog that a “timing quirk” occurred in the IP block’s removal when the engineers tried to roll out the new configuration for Compute Engine.
“The IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management,” he said.
Such a failure in propagation normally triggers Google’s failsafe systems to revert to a previous configuration. However, in this case it activated another “previously unseen” bug that pushed out the new configuration but with all IP blocks removed, effectively cutting off the Compute Engine from external systems.
The problem was further compounded when the configuration spread to other data centres, meaning that the usual failsafe of rerouting traffic when services in one data centre go down could not be brought into effect. The result was a complete outage.
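To illustrate the kind of failsafe described, the following is a minimal sketch in Python rather than a reconstruction of Google’s software. It assumes two configuration files that must agree before a rollout, and it adds a guard against pushing out an empty set of IP blocks. The names NetworkConfig and choose_config, and the example address ranges, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkConfig:
    """Hypothetical stand-in for the set of IP blocks a config announces."""
    ip_blocks: frozenset


def choose_config(primary, secondary, last_known_good):
    """Return the configuration that is safe to push out.

    primary and secondary model the two configuration files mentioned
    in the article; this logic is an illustrative assumption only.
    """
    # The two files disagree: revert to the previous configuration.
    if primary.ip_blocks != secondary.ip_blocks:
        return last_known_good
    # Extra guard: never announce a configuration with no IP blocks.
    if not primary.ip_blocks:
        return last_known_good
    return primary


# Documentation-only address ranges, used purely as sample data.
old = NetworkConfig(frozenset({"198.51.100.0/24", "203.0.113.0/24"}))
file_a = NetworkConfig(frozenset({"198.51.100.0/24"}))  # block already removed
file_b = old  # the removal has not yet propagated to this file

safe = choose_config(file_a, file_b, old)
```

In this sketch the mismatch between file_a and file_b causes the previous configuration to be kept, which is the behaviour the failsafe was meant to provide. The second check covers the case that the “previously unseen” bug produced, a configuration with every IP block stripped out.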
Compute Engine is a major part of GCP, and supports the running of virtual machines and infrastructure on Google’s globe-spanning cloud. So even a short outage can have severe repercussions for companies relying on Google’s cloud services.
Sloss confirmed that Google will look at ways to mitigate such problems in the future. “We take all outages seriously, but we are particularly concerned with outages which affect multiple zones simultaneously because it is difficult for our customers to mitigate the effect,” he said.
“We will conduct an internal investigation and make appropriate improvements to our systems to prevent or minimise future recurrence.”
The incident underlines the fact that cloud platforms are now intrinsic to many organisations' operations.
V3 is hosting a Cloud and Infrastructure event next week looking at some of the key issues, including data backup and recovery. The event is free to register and takes place online on 20 and 21 April.