(Cross-posted from the Google Official Enterprise Blog)
Posted by Sabrina Farmer, Senior Site Reliability Engineering Manager for Gmail
On September 23rd, many Gmail users received an unwelcome surprise: some of their messages were arriving slowly, and some of their attachments were unavailable. We’d like to start by apologizing—we realize that our users rely on Gmail to be always available and always fast, and for several hours we didn’t deliver. We have analyzed what happened, and we’ll tell you about it below. In addition, we’re taking several steps to prevent a recurrence.
The message delivery delays were triggered by a dual network failure. This is a very rare event in which two separate, redundant network paths both stop working at the same time. The two network failures were unrelated, but in combination they reduced Gmail’s capacity to deliver messages to users, and, beginning at 5:54 a.m. PST, messages started piling up. Google’s automated monitoring alerted the Gmail engineering team within minutes, and they began investigating immediately. Together with the networking team, the Gmail team restored some of the lost network capacity and worked to repurpose additional capacity, clearing much of the accumulated message backlog by 1:00 p.m. PST and the remainder shortly before 4:00 p.m. PST.
The impact on users’ Gmail experience varied widely. Most messages were unaffected: 71% of messages had no delay, and of the remaining 29%, the average delivery delay was just 2.6 seconds. However, about 1.5% of messages were delayed more than two hours. Users who attempted to download large attachments on affected messages encountered errors. Throughout the event, Gmail remained otherwise available: users could log in, read messages that had been delivered, send mail, and access other features.
What’s next? Our top priority is ensuring that Gmail users get the experience they expect: fast, highly available email, anytime they want it. We’re taking steps to ensure that there is sufficient network capacity, including backup capacity for Gmail, even in the event of a rare dual network failure. We also plan to make Gmail message delivery more resilient to a network capacity shortfall in the unlikely event that one occurs in the future. Finally, we’re updating our internal practices so that we can respond to network issues more quickly and effectively. We’ll be working on all of these improvements and more over the next few weeks. Even including this event, Gmail remains well above 99.9% available, and we intend to keep it that way!
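For readers who want to put the 99.9% figure in perspective, here is a small sketch of the arithmetic. It assumes availability is a simple fraction of time over a non-leap year. That is an assumption made for illustration, not a description of how Gmail availability is actually measured, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability: float, period_hours: float) -> float:
    """Hours of complete unavailability a target permits over a period.

    Assumes a purely time-based definition of availability
    (illustrative only; not taken from the post).
    """
    return (1.0 - availability) * period_hours


budget = downtime_budget_hours(0.999, HOURS_PER_YEAR)
print(f"99.9% over one year allows about {budget:.2f} hours of downtime")
```

Under that assumption, a 99.9% target leaves roughly 8.76 hours per year. A partial degradation like the one described above would count for less than its full duration under any measure that weights by affected users or messages.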