What can tech companies and businesses learn from Facebook’s outage?


The incident has shown how a single risk event can have a devastating effect and bring a global organisation to a complete standstill.


The Facebook outage last week has highlighted the need to have better recovery processes in place, ensure safeguards and stay vigilant about business continuity.

While network outages are impossible to prevent and mistakes happen, the costs and stakes are higher when you serve more than 3 billion users who depend on your social networks in their daily lives.

The impact of the outage on business owners has yet to be quantified but estimates put the losses at $100 million, underscoring how vital the company has become for both businesses and consumers.

What really went wrong?

In a statement, Facebook attributed the shutdown to configuration changes on the backbone routers that co-ordinate network traffic between data centres.

Santosh Janardhan, the company’s VP of infrastructure, explained: “During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network.”

The disruption to network traffic also had a cascading effect on the way data centres communicate, bringing the company’s services to a halt.

The faulty configuration then affected the company’s internal systems and complicated attempts to resolve the problem. Facebook staff were also unable to access offices and conference rooms that required a security badge, because access grants were tied to the same domain, Facebook.com.


“Facebook is maintaining millions of servers to provide different offerings such as the Facebook platform itself, Instagram, WhatsApp and Oculus Rift VR services to their users. Part of this maintenance is also changing certain server settings that define how the server and the services on it are working,” said Boris Cipot, Senior Security Engineer, Synopsys Software Integrity Group, sharing his perspective on the incident.

“In this outage, the problem was caused due to the change of the Domain Name Service settings. If this setting is wrong, then your servers will no longer be reachable as other computers will not be able to find them. In simpler terms, think about a phone number. If you have a phone number, other people can call you,” he added.

“If you hand out the wrong phone number, others will not be able to contact you - and this is what has happened to Facebook. Due to a misconfiguration of those DNS settings, the Facebook servers were no longer accessible and therefore took all services offline.” 
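Cipot’s phone-number analogy can be illustrated with a toy lookup table. This is only a sketch: the `resolve` function, the `facebook.example` hostname and the `192.0.2.10` address are hypothetical placeholders, and a real DNS resolver is far more involved.

```python
from typing import Dict, Optional


def resolve(records: Dict[str, str], hostname: str) -> Optional[str]:
    """Look up a hostname in a toy record table; None means 'not found'."""
    # Normalise case and drop a trailing dot before looking the name up.
    return records.get(hostname.lower().rstrip("."))


# Hypothetical record: a reserved example name and a documentation-only address.
records = {"facebook.example": "192.0.2.10"}
print(resolve(records, "facebook.example"))  # the "phone number" is found

# Simulate the record disappearing after a bad configuration change.
del records["facebook.example"]
print(resolve(records, "facebook.example"))  # nothing to "call" any more
```

The servers behind the address may be running perfectly well in the second case; other computers simply have no way to find them.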

Cipot said that, as with any programming language, the best way to avoid glitches on a large scale is to make changes step by step.

“That way you are rolling out the changes in a small, controlled environment, and be able to contain any possible threat in a sustainable manner. Why was this not possible here and why did this happen is a question only Facebook can answer.

“Changes of any kind should be done in smaller steps until it’s confirmed that everything is working. After the confirmation, you can up the scale, however all services at once is a big change.”
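The step-by-step approach Cipot describes might look something like the sketch below. It does not reflect Facebook’s actual tooling; `staged_rollout`, `apply_change`, `is_healthy` and the stage sizes are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    servers: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.1, 0.5, 1.0),
) -> Tuple[List[str], bool]:
    """Apply a change in growing stages, halting if a health check fails.

    Returns the servers that received the change and whether the
    rollout reached every stage.
    """
    changed: List[str] = []
    for fraction in stage_fractions:
        # How many servers should have the change by the end of this stage.
        target = max(1, int(len(servers) * fraction))
        for server in servers[len(changed):target]:
            apply_change(server)
            changed.append(server)
        # Confirm everything changed so far still works before scaling up.
        if not all(is_healthy(s) for s in changed):
            return changed, False
    return changed, True


# Hypothetical fleet of 100 servers where "s5" breaks after the change.
fleet = [f"s{i}" for i in range(100)]
changed, completed = staged_rollout(
    fleet,
    apply_change=lambda server: None,
    is_healthy=lambda server: server != "s5",
)
print(len(changed), completed)
```

In this example the rollout stops at the second stage, so the fault is contained to a small part of the fleet instead of reaching all services at once.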


What are the risk management implications?

“This should also spur organisations to take a long look at their processes and identify ones that require technology – even once unthinkable technology failures – because this outage demonstrates that the unthinkable happens,” said Forrester Vice President and Principal Analyst Jeff Pollard.

Mike Proulx, another Forrester analyst and VP, Research Director, said that over the past two years Facebook had consolidated its disparate app ecosystem onto one backend infrastructure – a move that creates operational efficiencies for the company and insulates Facebook from a potential breakup by regulators.

However, at the same time, this also exposes Facebook to concentration risk. “A single risk event that produces a cascading effect – like old school Christmas lights where one goes out, they all go out. This strategy comes at the expense of redundancy and impairs the company’s resilience,” he said.

“This outage has widespread implications to the advertising ecosystem given the fact that ads weren’t being served for over six hours across Facebook and Instagram, which command the lion’s share of social media ad revenue. This not only affects Facebook’s revenue but also brands’ bottom lines.” 

Proulx added that the Facebook outage will not be the last we will see. “It’s a reminder to advertisers to have proactive mitigation plans in place to avoid the scramble of trying to figure out what to do in the moment.” 

Speaking on the ‘what ifs’, Laura Petrone, principal analyst at GlobalData, in an interview with Tech Monitor, mused: "We might wonder whether such an incident would have happened if there was more competition in the social media market and other companies were able to innovate and to offer the best services to consumers."

© iTnews Asia