Google has apologised for yet another mass outage of its Gmail service
yesterday evening, which left users unable to access emails for over 90 minutes.
Ben Treynor, vice president of engineering at Google, wrote in a blog post
that the outage was caused by recent changes which were, ironically,
designed to improve service availability.
Treynor admitted that the Google team "slightly underestimated" the load
which these changes placed on the request routers during a routine period of
upgrade work in which some of the Gmail servers were taken offline.
"At about 12:30 pm PST [8.30pm BST] a few of the request routers became
overloaded and in effect told the rest of the system 'stop sending us traffic,
we're too slow!'," he explained.
"This transferred the load onto the remaining request routers, causing a few
more to become overloaded, and within minutes nearly all of the request routers
were overloaded. As a result, people were unable to access Gmail via the web
interface because their requests could not be routed to a Gmail server."
According to Treynor, the Google team has taken several actions to ensure
that the problem does not happen again, including increasing request router
capacity and ensuring that the routers degrade "gracefully". In other words,
they should "get slower instead of refusing to accept traffic and shifting
their load".
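The chain reaction Treynor describes, and the "graceful" alternative, can be illustrated with a toy model. The sketch below is not Google's routing logic: the function names, the even split of traffic between routers and all of the capacity and load figures are assumptions chosen purely for illustration.

```python
def routers_left_after_shedding(total_load, capacities):
    """Count routers still serving when overloaded routers refuse traffic.

    Hypothetical model: load is split evenly across active routers, and any
    router whose share exceeds its capacity drops out, pushing its share
    onto the routers that remain.
    """
    active = list(capacities)
    while active:
        share = total_load / len(active)
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # every remaining router can carry its share
        active = survivors  # overloaded routers said "stop sending us traffic"
    return len(active)


def worst_slowdown(total_load, capacities):
    """Worst slowdown factor if every router stays up and simply gets slower."""
    share = total_load / len(capacities)
    return max(1.0, max(share / c for c in capacities))


# Made-up figures: eight routers that can handle 100 units, two that handle 90.
capacities = [100] * 8 + [90] * 2

collapsed = routers_left_after_shedding(920, capacities)
healthy = routers_left_after_shedding(880, capacities)
slowdown = worst_slowdown(920, capacities)
```

With these made-up numbers, a load of 880 units leaves all ten routers serving, while 920 units first knocks out the two smaller routers and then, once their share is redistributed, the remaining eight. If instead every router keeps accepting traffic, the same 920 units leaves the slowest router running only about 2% over its capacity.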
This is not the first time this year that Gmail has suffered a major outage.
Google was forced to apologise in February for a two-and-a-half-hour
outage which the firm put down to datacentre maintenance.