Proverbial 'software bug' sent a spiral of bad configurations to other systems
Following one of the longer cross-service outages for Google in recent memory, the search and software giant sent out an apology and explanation for today's occurrences. According to the Official Google Blog, an internal system that sends out configuration information for systems beyond it encountered a software bug that sent out incorrect commands to several areas.
It only took from 10:55am PT when the bug was first seen to 11:02am when users began seeing massive outages in Gmail, Google+, Drive and other services. Roughly 12 minutes later while engineers were still in the process of figuring out what was happening, the initial system that sent out the bad information had self-corrected and began to properly configure other systems. Google claims nearly all users' services were back up and running by 11:30, which seems to be consistent with the general consensus among users.
As you would expect, the post gives some detail on what's being done to prevent this from happening in the future. More checks are being put in place so that improper configurations, if they are generated by bugs, aren't so easily sent out to other systems. Additionally, Google plans to improved targeted searching for issues during service failures.
Needless to say, we don't think we're going to be seeing outages like this with any higher frequency than we already experience now.
Source: Google Blog