Google explains Google Apps datacentre failure

by Iain Thomson

09 Mar 2010

Google admitted that poor planning led to the 24 February outage

Google has published a post-mortem of an incident in February in which Google App Engine went down for over two hours.

All Google App Engine applications were "degraded" from 7:48am to 10:09am PST on 24 February after a power failure at the company's main datacentre, the firm said.

About 25 per cent of the servers failed within five minutes owing to a delay in back-up power generation. Google's message boards started showing questions from users almost immediately.

"By this time, our primary on-call engineer had determined that App Engine is down," the report said.

"The on-call engineer, according to procedure, paged our product managers and engineering leads to handle communicating the outage to users. A few minutes later, the first post from the App Engine team about this outage is made on the external group."

There was confusion about the instructions for switching to a back-up datacentre, and the decision-maker for the crossover could not be found. The team then received data suggesting that the datacentre was recovering and that a changeover was not necessary.

However, the data turned out to be inaccurate, and this extended the outage considerably. By the time the move to the back-up servers had been made, Google's App Engine had been down for more than two hours.

The report found that Google had not developed plans for a partial datacentre failure, nor for determining whether the datacentre was able to continue running on such a reduced server count.

The company will now hold regular drills for failure, with a wider spectrum of possible situations, and a bi-monthly audit of all operations documents.

Google claimed that a similar failure today would cause a service slowdown for a maximum of 20 minutes with the new procedures, rather than a complete outage.
