We all had a few chuckles around here when we learned the reasons for the Facebook outage on Monday.
Datacenters use something called BGP (Border Gateway Protocol), which is how routers on the internet tell each other which networks they can reach, pick the best path for traffic to take, notice when a circuit is down, and re-route traffic as needed. We run BGP in our datacenter, and it’s pretty important.
What someone at Facebook did was the equivalent of cutting themselves off at the knees. Their BGP route advertisements were withdrawn, so the rest of the internet no longer knew how to reach Facebook, and essentially everything was down. Hilarious, at least to the IT crowd, because we understand what they are going to have to go through to fix it.
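For anyone who wants to see why a missing route is so final, here is a minimal sketch of the idea. It is a toy model, not real BGP: the `RoutingTable` class, the `peer-a` label, and the `192.0.2.0/24` documentation prefix are all made up for illustration. The only point is that once a prefix is withdrawn and nothing else covers the address, a lookup has nowhere to send the traffic.

```python
import ipaddress


class RoutingTable:
    """Toy routing table (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps an advertised network prefix to a next-hop label.
        self.routes = {}

    def advertise(self, prefix, next_hop):
        # A neighbor announces that it can reach this prefix.
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # The announcement is taken back; the route disappears.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Pick the most specific (longest) prefix that contains the address.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the destination is unreachable
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.advertise("192.0.2.0/24", "peer-a")
print(table.lookup("192.0.2.10"))  # peer-a

table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.10"))  # None
```

The servers behind that prefix can be perfectly healthy in the second lookup. Nobody outside can find them, which is roughly the position Facebook was in.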
Facebook’s VP of Infrastructure blogged about this here, and said in part: “it was not possible to access our data centers through our normal means because their networks were down… we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers.”
So someone had to break in (or use the break-glass account), and even then there were additional layers of physical and logical security to get through. We’ve had to do similar things in the past, but not at this scale. I can recall an incident where a backup power system had failed and we needed to get into our datacenter to fix it, but the access control system was powered by the failed system, so we couldn’t get in. That setup has since been made redundant.
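Both stories are circular dependencies: the thing you need in order to fix the problem depends on the thing that is broken. As a rough sketch, you can write the dependencies down and check them for a loop before an outage finds it for you. The `find_cycle` function and the dependency names below are hypothetical, and the second map is only our simplified guess at what "made redundant" looks like in this model.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of names, or None if there is none."""
    visiting = []  # the chain of items currently being followed
    done = set()   # items already known to be cycle-free

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # We came back to something already in the chain: a loop.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in depends_on.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for name in depends_on:
        cycle = visit(name)
        if cycle:
            return cycle
    return None


# Simplified model of the incident: each item lists what it needs to work.
before = {
    "repair backup power": ["enter datacenter"],
    "enter datacenter": ["door access control"],
    "door access control": ["backup power"],
    "backup power": ["repair backup power"],
}

# Assumed fix: door access control no longer relies on the failed system.
after = dict(before, **{"door access control": ["independent power"]})

print(find_cycle(before))
print(find_cycle(after))  # None
```

The first call returns the loop that locked us out; the second finds nothing because the door no longer waits on the system that needs repairing.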
Facebook’s VP closed his post with the following statement: “We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making.”
Yes, we can relate. 🙂