Emergency systems cause Google Cloud outage

Google has revealed that a UPS system failure caused a recent six-hour outage in one of its cloud regions.

According to The Register, the outage began on March 29. More than twenty Google Cloud services in the us-east5-c zone, located in the Columbus, Ohio, area, suffered degraded performance or were unavailable.

According to the incident report, the disruption began when the affected zone lost its utility power. Hyperscale data centers are normally built to withstand this: UPS systems take over the moment the grid fails and carry the load on battery until diesel generators come online. In this case, however, the batteries in those very UPS systems suffered a critical failure, so they supplied no power. The report also indicates that the failed UPS systems probably kept the generators from powering the load as well. Technicians had to bypass the UPS systems before power could be restored.
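The failure chain described above can be illustrated with a deliberately simplified model. This is a sketch only, not Google's actual power design: the class `PowerChain`, its fields, and the rule that a failed UPS blocks the generator path until it is bypassed are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PowerChain:
    """Hypothetical, simplified model of grid -> UPS -> generator failover."""
    grid_ok: bool
    ups_battery_ok: bool
    generator_running: bool
    ups_bypassed: bool = False

    def source(self) -> str:
        """Return the source feeding the load, or 'none' if it is dark."""
        if self.grid_ok:
            return "grid"
        if self.ups_bypassed:
            # With the UPS bypassed, the generator feeds the load directly.
            return "generator" if self.generator_running else "none"
        if not self.ups_battery_ok:
            # Assumption: a failed UPS sits in the power path and also
            # keeps the generator from reaching the load.
            return "none"
        # Healthy UPS: the battery bridges until the generator is running.
        return "generator" if self.generator_running else "ups_battery"


# Scenario resembling the incident: grid lost, UPS batteries failed.
incident = PowerChain(grid_ok=False, ups_battery_ok=False, generator_running=True)
before_bypass = incident.source()
incident.ups_bypassed = True
after_bypass = incident.source()
print(before_bypass, "->", after_bypass)
```

In this model the load stays dark even with a running generator until the bypass flag is set, which mirrors the manual intervention the report describes.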

Manual actions

Technicians were alerted to the problem at 12:54 p.m. Pacific Time. The generators did not begin supplying power until 2:49 p.m. Google said that most of the affected cloud services were back up and running shortly afterward. Some services took longer to recover, however, because manual actions were needed to restore full functionality.
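For perspective, the gap between the alert and the return of generator power can be worked out from the two timestamps above. A minimal Python sketch, assuming both times fall on the same day in Pacific Time; the variable names are illustrative.

```python
from datetime import datetime

# Timestamps as reported in the article (same day, Pacific Time assumed).
FMT = "%I:%M %p"
alerted = datetime.strptime("12:54 PM", FMT)
generator_power = datetime.strptime("2:49 PM", FMT)

gap = generator_power - alerted
gap_minutes = int(gap.total_seconds() // 60)
print(f"Alert to generator power: {gap_minutes} minutes")
```

That is just under two hours. The rest of the roughly six-hour outage was spent on service recovery, including the manual steps mentioned above.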

The company said it regretted the incident and emphasized that it would make every effort to prevent a recurrence. To that end, Google wants to improve its power supply and cluster recovery process so that power is available more quickly and predictably after an outage.

In addition, Google will review the systems that did not fail over automatically and remediate any shortcomings. The company will also work with the UPS supplier to better understand and resolve the cause of the battery problems.

Emergency power supplies and disaster recovery procedures

Hyperscalers such as Google generally promise resilience and usually deliver it. Yet this incident shows that even the best-prepared systems are not infallible. The most important lesson is that regularly testing emergency power supplies and disaster recovery procedures, including plans for when public cloud providers themselves fail, is not a luxury but a necessity for every organization.