September 22, 2022

Problems with AWS Network Devices Caused Widespread Cloud Outage

Amazon Web Services data centers in Loudoun County, Virginia. (Photo: Rich Miller)

Problems with multiple network devices in Northern Virginia caused a major outage at Amazon Web Services, with ripples spreading across the internet to disrupt many popular web services that run their infrastructure on the AWS cloud.

The lengthy outage highlighted the critical role cloud platforms like AWS play in supporting the web operations of at least 1 million corporate customers. The AWS issues have been blamed for performance problems at Netflix, Disney+, Ring, Ticketmaster, Venmo, Roku, Fidelity Investments, Hootsuite and many more. The outage interrupted online finals for students using the Canvas learning management platform, and even deliveries to Amazon warehouses, because it affected the applications needed to analyze packages and plan delivery routes.

The AWS outage was centered on US-East-1, a Northern Virginia-based service region that is home to the largest concentration of Amazon data center infrastructure. The problems began around 12:30 p.m. EST, when users started having trouble accessing AWS services. About five hours later, at 5:47 p.m., AWS reported that it had “mitigated the underlying issue” and that services were starting to recover.

“The root cause of this problem is a corruption of several network devices in the US-EAST-1 region,” AWS said on its status page. At 7:30 p.m. EST, AWS said the network device issues had been resolved and that it was “now working to recover all corrupted services.”

Large-scale IT service outages can be costly. A 2021 Uptime Institute survey found that data center outages cost businesses an average of $100,000 per incident, with about a third of those polled citing costs of $1 million or more.

The stakes could be even higher for Amazon Web Services, which is the world’s largest cloud computing platform. AWS had $16.4 billion in revenue in the third quarter of 2021, which is roughly $7.4 million per hour. While cloud workloads running outside of US-East-1 were apparently unaffected, an outage of more than six hours in the largest cloud region would add up quickly, although such “losses” at service providers are often accounted for through customer credits.
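The per-hour figure above follows from simple division. The sketch below reproduces it under two assumptions that are not from AWS: that the third quarter has 92 days (July through September), and that revenue accrues evenly across every hour. The six-hour total is revenue for all of AWS over that span, so it overstates what a single region's outage would put at risk.

```python
# Back-of-the-envelope check of the hourly revenue figure.
# Assumptions (illustrative, not from AWS): a 92-day quarter and
# revenue spread evenly across all hours.
QUARTERLY_REVENUE_USD = 16.4e9        # AWS revenue, Q3 2021 (from the article)
DAYS_IN_QUARTER = 92                  # July + August + September
HOURS_IN_QUARTER = DAYS_IN_QUARTER * 24


def revenue_per_hour(quarterly_revenue: float, hours: int = HOURS_IN_QUARTER) -> float:
    """Average revenue per hour, assuming revenue accrues evenly."""
    return quarterly_revenue / hours


def revenue_over_span(quarterly_revenue: float, span_hours: float) -> float:
    """Average revenue earned over a span of hours (not an actual loss figure)."""
    return revenue_per_hour(quarterly_revenue) * span_hours


hourly = revenue_per_hour(QUARTERLY_REVENUE_USD)
six_hours = revenue_over_span(QUARTERLY_REVENUE_USD, 6)

print(f"Revenue per hour: ${hourly / 1e6:.1f} million")         # $7.4 million
print(f"Revenue over six hours: ${six_hours / 1e6:.1f} million")  # $44.6 million
```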

Why networks are so important

The rise of cloud computing underscores the importance of networks and the way they are configured. Network and software issues are overtaking power outages as the most common causes of data center downtime, according to 2021 outage data from the Uptime Institute. This trend reflects the growing role of cloud computing and SaaS (software as a service) applications, which often use architectures that can route around physical failures of electrical components such as UPS systems, transfer switches and generators.

When Amazon Web Services has reliability issues, they often involve US-East-1, which is not surprising since it is the largest AWS Region and also the oldest, as Amazon has had data centers in Virginia since 2004. AWS has spent $35 billion on its cloud computing infrastructure in Northern Virginia over the past 10 years, and operates approximately 50 data centers in the region. It is the largest concentration of corporate data centers in the world, located near a strategic Internet intersection in Ashburn, which serves as a global hub for data traffic.

Network issues are complicated by the highly automated nature of cloud platforms. Data traffic flows are designed to be large and fast and to operate without human intervention, which makes them difficult to tame when humans do need to intervene. Some of the biggest outages affecting cloud platforms and social media have been linked to network issues. Some examples:

  • On October 4, 2021, a configuration error interrupted Facebook’s connection to a key network backbone, disconnecting all of its data centers from the Internet and leaving its DNS servers inaccessible, the company said.
  • A long Google outage in 2019 was caused by unusual network congestion in its operations in the eastern United States. In its incident report, Google said YouTube measured a 10% drop in overall views during the incident, while Google Cloud Storage measured a 30% drop in traffic.

Resilience is always a challenge

At DCF, we’ve often noticed how cloud computing is changing the way businesses approach availability, introducing architectures that create resiliency using software and network connectivity (see “Rethinking Redundancy”). This strategy, pioneered by cloud providers, creates new ways of designing applications. Historically, data center availability has been achieved through layers of redundant power infrastructure, including uninterruptible power systems (UPS) and emergency backup generators.

Cloud providers like Google have been leaders in creating failover scenarios that move workloads between data centers, distribute applications and backup systems across multiple data centers, and use sophisticated software to detect failures and redirect data traffic to bypass hardware failures and power outages.

Amazon Web Services pioneered this effort by popularizing the use of Availability Zones (AZs), clusters of data centers within a region that allow customers to run instances of an application in multiple isolated locations to avoid a single point of failure. These architectures enable sophisticated approaches to failover and application backup. But even a distributed availability plan can fail if the network goes down, interrupting the flow of data through the cloud infrastructure.
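The point about the network as a shared dependency can be illustrated with a toy model. This is a simplified sketch, not how AWS actually routes traffic: the `Zone` class, the `pick_zone` function and the zone names are all hypothetical, and the whole regional network is reduced to a single on/off flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Zone:
    name: str      # hypothetical zone label
    healthy: bool  # True if the zone's own power and hardware are working


def pick_zone(zones: list[Zone], network_up: bool) -> Optional[str]:
    """Return the first zone able to serve traffic, or None if none can."""
    if not network_up:
        # Shared dependency: without the network, no zone is reachable,
        # no matter how many healthy zones exist.
        return None
    for zone in zones:
        if zone.healthy:
            return zone.name
    return None


zones = [Zone("zone-a", healthy=False), Zone("zone-b", healthy=True)]

# A failure inside one zone is absorbed by failing over to another.
print(pick_zone(zones, network_up=True))   # zone-b

# A network failure defeats the redundancy entirely.
print(pick_zone(zones, network_up=False))  # None
```

In this model, adding more zones protects against failures inside any one zone but does nothing once the network flag is off, which is the pattern the outage exposed.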

