December 2, 2022

Cloudflare confirms outage caused by data center network configuration update error


Cloudflare has confirmed that the short-lived outage that took hundreds of websites offline on Tuesday, June 21 was caused by a planned network configuration change at 19 of its data centers and was not the result of malicious activity.

As previously reported by Computer Weekly, a wide range of consumer and business websites and online services were temporarily taken offline during the downtime incident, which took the web application security company just over an hour to resolve.

In a blog post published the same day as the outage, Cloudflare said the incident was the result of a network configuration change, deployed in 19 of its data centers, as part of a wider set of works designed to increase the resilience of its services in its “busiest locations”.

These facilities include multiple data centers in North and South America, Europe and Asia-Pacific, which is why one of the defining characteristics of the outage was the high number of large-scale web properties and online services affected by it.

“Over the past 18 months, Cloudflare has been working to convert all of our busiest sites to a more flexible and resilient architecture,” the blog post reads. “During this period, we have converted 19 of our data centers to this architecture.

“An essential part of this new architecture… is an additional layer of routing that creates a mesh of connections. This mesh allows us to easily disable and enable parts of the internal network in a data center for maintenance or to deal with a problem.”

The new architecture has made Cloudflare’s data center network more robust, which matters because those 19 data centers carry a significant amount of its traffic. That same concentration of traffic is also why the outage had such wide-reaching effects, the blog added.

“This new architecture has given us significant improvements in reliability, while allowing us to perform maintenance on these sites without disrupting customer traffic,” the blog post said.

“As these locations also carry a significant portion of Cloudflare traffic, any issues here can have a very wide impact, and unfortunately, that’s what happened today.”

Following the incident, the company identified several areas for improvement to prevent a recurrence, and “will continue to work to uncover any other deficiencies that may cause a recurrence,” the blog added.

“We are deeply sorry for the disruption to our customers and any users who were unable to access internet properties during the outage. We have already started working on [making] changes and will continue our due diligence to ensure this does not happen again,” the blog post concluded.
