Facebook Explains Monday’s Global Outage
All of Facebook’s services (Instagram, Messenger, WhatsApp), its platform for companies, and its internal network were affected by a huge outage that began with ordinary maintenance.
According to Infrastructure Vice President Santosh Janardhan, a command issued during that maintenance took down the backbone that connects all of Facebook’s data centres worldwide.
As if that weren’t bad enough, nobody could reach Facebook at all because the DNS and BGP routing information pointing to its servers suddenly disappeared. According to Janardhan, when Facebook’s DNS servers lost their connection to the backbone, they stopped advertising the BGP routes that let every other machine on the internet find them. The DNS servers were still up and running, but no one could reach them.
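The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Facebook's actual implementation: the class names, the `sync_routes` method, and the documentation prefix `192.0.2.0/24` are all hypothetical, and real BGP involves far more than a dictionary of prefixes.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingTable:
    """Toy stand-in for the internet's view of BGP routes (prefix -> origin)."""
    routes: dict = field(default_factory=dict)

    def announce(self, prefix: str, origin: str) -> None:
        self.routes[prefix] = origin

    def withdraw(self, prefix: str) -> None:
        self.routes.pop(prefix, None)

    def is_reachable(self, prefix: str) -> bool:
        return prefix in self.routes


@dataclass
class DnsServer:
    """Hypothetical DNS server that advertises its own prefix via BGP."""
    name: str
    prefix: str
    running: bool = True

    def sync_routes(self, table: RoutingTable, backbone_up: bool) -> None:
        # Assumption: the server advertises its route only while it can
        # reach the backbone, and withdraws it otherwise.
        if self.running and backbone_up:
            table.announce(self.prefix, self.name)
        else:
            table.withdraw(self.prefix)


def resolve(table: RoutingTable, server: DnsServer, hostname: str):
    """Return an answer only if the rest of the internet has a route to the server."""
    if not table.is_reachable(server.prefix):
        return None  # the server may be healthy, but nobody can find it
    return f"answer for {hostname} from {server.name}"


if __name__ == "__main__":
    table = RoutingTable()
    dns = DnsServer(name="dns-a", prefix="192.0.2.0/24")

    dns.sync_routes(table, backbone_up=True)
    print(resolve(table, dns, "example.com"))

    dns.sync_routes(table, backbone_up=False)
    print(dns.running, resolve(table, dns, "example.com"))
```

In the second call the server is still marked as running, yet the lookup fails because its route has been withdrawn, which mirrors the situation Janardhan describes.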
Yesterday’s outage across our products was a bad one, so we’re sharing some more detail here on exactly what happened, how it happened, and what we’re learning from it: https://t.co/IXRt572h4c
— Mike Schroepfer (@schrep) October 5, 2021
With the network connections broken and DNS gone, engineers could not communicate with the servers. As we learned on the day of the outage, the same failure also knocked out many of the tools they normally use for repair and communication.
Engineers faced additional challenges because of the hardware’s physical and system security, as noted in the blog post. By “activating the secure access protocols” (presumably not a code euphemism for “chop open the server door with an angle grinder”), they were able to bring the backbone back online and restore services in stages under gradually increasing load. Because turning everything on at once would have strained power and computing capacity, that staged approach may explain why some users regained access later than others.
That’s all there is to it. This time, there are no wild conspiracy theories or hysterical techs breaking into secure facilities to reactivate Mark Zuckerberg’s baby’s account. For six hours, services connecting billions of people were unavailable because of a faulty command that an audit tool should have caught but, because of a bug in the tool, did not.