Google’s Glitch

As many users of Google’s Gmail and search services certainly noticed, the company suffered a severe service disruption between roughly 10:30 and 11:30 AM Thursday (US Eastern time). The issue, which apparently involved a routing problem that pushed too much traffic toward servers based in Asia, caused delays in the all-important search function, as well as problems on YouTube and other services. The effects reached well beyond Google’s own properties: “many Web sites took twice as long to load and were twice as likely to fail during Google’s disruption,” according to one report.

The main issue here is that Google has become the go-to provider for search and a range of other services. Since its services are “used by hundreds of millions of people, even a breakdown affecting a small percentage of its audience can have a huge impact. Google’s search engine, by far the most popular on the Internet, fields more than 9 billion monthly search requests in the United States alone.” Ergo, if Google goes down, even for a short period, a lot of things that depend on it simply stop working.
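To put that 9 billion figure in perspective, here is a rough back-of-the-envelope sketch in Python. Only the monthly search count comes from the quote above; the 30-day month, the evenly spread traffic, and the affected percentages are illustrative assumptions, not reported numbers.

```python
# Back-of-the-envelope estimate of US searches affected by a short outage.
# Only MONTHLY_US_SEARCHES comes from the quoted report; everything else
# is an assumption made for illustration.
MONTHLY_US_SEARCHES = 9_000_000_000
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month, traffic spread evenly


def searches_affected(outage_hours, affected_fraction):
    """Estimate how many searches fall inside the outage window and are affected."""
    searches_per_hour = MONTHLY_US_SEARCHES / HOURS_PER_MONTH
    return searches_per_hour * outage_hours * affected_fraction


# Hypothetical fractions of the audience hit during a one-hour disruption.
for fraction in (0.01, 0.05, 0.10):
    estimate = searches_affected(1, fraction)
    print(f"{fraction:.0%} of users affected -> about {estimate:,.0f} searches")
```

Under these assumptions the US alone sees about 12.5 million searches an hour, so even a one percent problem touches more than a hundred thousand requests. Real traffic peaks on weekday mornings, so the true figure for this window was probably higher.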

Like many large Internet and technology companies, Google has distributed its services worldwide to guard against a major meltdown at any one facility. In this case, however, nothing actually went down. Instead, a routing issue overloaded one data center until the problem was corrected. Basically, they’re “doing it right” from an overall design standpoint, but somehow managed to put too much load on one location.

The bad news is that this interrupted services worldwide for about an hour and certainly caused a lot of user frustration. The good news is that annoyed users did the sensible thing and went elsewhere. Rather than using Google, they used Yahoo or another search engine until the problem was resolved.

Another good thing (though bad from a production computing standpoint, since outages like this are embarrassing) is that this sort of problem helps big companies test the resiliency of their services. Companies can run as many simulated disaster drills as they want; sometimes it takes a real emergency to find holes in the recovery process and areas for improvement.

Will someone lose their job over this? Possibly. Will Google learn from it? Hopefully.
