
Facebook Engineer Explains “Worst Outage in Over Four Years” In Response To Facebook Outages

I was looking at Google Trends today (http://google.com/trends/hottrends) and noticed something very interesting. The top 20 Google trends included:

1. dns failure
2. facebook dns error
4. service unavailable dns failure
6. why isnt facebook working right now
7. facebook login problems
9. what is wrong with facebook today
11. facebook outage
14. why is facebook down

Eight of the top 20 Google trends for the day were due to Facebook having an outage. I found this quite interesting. It means that 40% of the top trending searches on Google on September 23, 2010 were related to an outage at Facebook.

This makes me think further: what would happen if Facebook went down for a day? A week? A month?

Would this 40% share of the top trends turn into 90% or even more? Would people’s lives stand still the moment they couldn’t log in to Facebook and check their status updates, or see who has tagged them in photos?

This is a question that is going to be very hard to answer, because I am one of the believers who thinks Facebook is here to stay.

It still blows me away that the main search terms on Google were “dns failure”, “facebook dns error”, “service unavailable dns failure”, “why isnt facebook working right now”, “facebook login problems”, “what is wrong with facebook today”, “facebook outage” and “why is facebook down”, to the point where people didn’t have anything else to do but work out what was going wrong with Facebook.

It makes you think even further: in the time that Facebook was having this hiccup, how many hours of employers’ time were wasted by people refreshing their Facebook page? These are more questions that can’t really be answered, but I hope they stir up the emotions inside of you to ask yourself, “Do I really need Facebook to continue living my life?”

I know that as an internet marketer, with a bunch of my traffic coming from social media and hundreds of marketing friends around the world, Facebook makes it easy for me to stay connected to those I know and love. What does Facebook mean to you?

With my curiosity piqued by this trend, I went looking around and wound up on Mashable.com.

It turns out that there were two issues in the past two days. In the first, one of the main backbone providers was routing traffic to facebook.com incorrectly, which caused all these DNS issues, whereas today’s issue was caused by an invalid value in the site’s own configuration. I find these things fascinating, as you can probably already see. I grabbed a copy of the article from Mashable.com and put it below for you to take a look at:

Facebook Software Engineering Director Robert Johnson was kind enough to explain to a curious public exactly why Facebook went down earlier today, calling the mishap “the worst outage we’ve had in over four years.”

In a brief blog post, Johnson discussed today’s downtime, which occurred from 11:30 a.m. PST. The site wasn’t functioning again for most users until around 3 p.m. PST.

Today’s outage was unrelated to another period of downtime yesterday, when issues with a third-party networking provider caused problems for some users trying to connect to Facebook.

Johnson said the downtime today was caused by “an unfortunate handling of an error condition” involving an automated system designed to verify configuration values in the cache and replace invalid values with updated values from the persistent store.

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued.

The automated system for correcting configuration values has been turned off for now, and Facebook is reportedly exploring more, ahem, “graceful” methods of handling this in the future.

Johnson also notes that getting the feedback loop to stop was “quite painful,” saying that the entire site had to be turned off to stop traffic to a particular database cluster.

We don’t envy Facebook the at-scale disaster the site has just survived; 500 million users and a feedback loop adds up to some nasty business however you slice it. And Facebook’s downtime problems aren’t nearly as persistent and severe as those of other social media staples out there.

If you have any opinions on the subject — or horror stories of your own to share — please leave us a comment and let us know about them.

Taken from: http://mashable.com/2010/09/23/facebook-downtime-explained/
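
To make the feedback loop Johnson describes a bit more concrete, here is a rough Python sketch. It is only a toy model, not Facebook's actual code: the names (DB_CAPACITY, CLIENTS, is_valid, run_tick, simulate), the numbers, and the idea of time moving in "ticks" are all my own assumptions for illustration. In each tick every client reads the cache at once, the database can only serve a few queries, and I assume the failed queries are handled after the successful ones. The second run shows one possible gentler approach (leaving the cache entry alone when the database errors out), which is my guess at what "graceful" might look like, not something the article confirms.

```python
DB_CAPACITY = 5   # hypothetical: queries the database cluster can serve per tick
CLIENTS = 100     # hypothetical: clients that check the config value each tick


def is_valid(value):
    """Stand-in validity check; the real rule is not described in the article."""
    return value == "good"


def run_tick(cache, db_value, delete_on_error):
    """Simulate one tick and return how many database queries were issued.

    All clients read the cache at the same moment, so they act on the same
    snapshot. Queries beyond DB_CAPACITY fail, and (toy assumption) the
    failures are handled after the successes.
    """
    snapshot = cache.get("config")
    if snapshot is not None and is_valid(snapshot):
        return 0  # cache holds a valid value, so nobody touches the database

    queries = CLIENTS  # every client tries to repair the value itself
    for i in range(queries):
        if i < DB_CAPACITY:
            cache["config"] = db_value   # successful query refreshes the cache
        elif delete_on_error:
            cache.pop("config", None)    # error is mistaken for an invalid value
        # otherwise: on error, leave the cache entry alone
    return queries


def simulate(delete_on_error, ticks=4):
    cache = {"config": "bad"}  # the invalid value has reached the cache
    db_value = "bad"
    history = []
    for tick in range(ticks):
        if tick == 1:
            db_value = "good"  # operators fix the persistent copy
        history.append(run_tick(cache, db_value, delete_on_error))
    return history


print("delete key on error:", simulate(delete_on_error=True))
print("keep key on error:  ", simulate(delete_on_error=False))
```

In this model the first run keeps sending 100 queries per tick even after the persistent value is fixed, because every error wipes the cache key again. The second run settles down to zero queries once a good value lands in the cache.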

Are these two errors in the space of two days connected? I don’t think so. Does this mean Facebook is going to keel over and die like an aged old dog sometime in the near future? I don’t think so either. But what it does mean is that, as humans, we rely on this virtual structure to validate our egos, connect with other human beings artificially and entertain us. Is this a good thing? You’re the one to judge that.

I hope you enjoyed this article as much as I enjoyed writing it. If you found value in it, share it with those you think will find it interesting.

Until Next Time,

Cheers,
Mitch Sanders
