September 22, 2022

Overwhelmed network devices took down Amazon Web Services, affecting millions

The news: Amazon Web Services (AWS) provided a post-mortem on the massive outage that took down a host of services for seven hours last week.

How we got here: AWS said that a sharp increase in connection activity and congestion overwhelmed servers and network connections, leading to a widespread outage.

  • The AWS outage last week originated in the US-East Region in Northern Virginia, which is notorious for breaking down. Known internally as IAD, the site opened in 2006, just as Amazon was launching AWS, according to Insider.
  • Services running on AWS that went offline included Disney+ and Delta Airlines; online games like PUBG, League of Legends, and Valorant; various Amazon services like Kindle ebooks, Amazon Music, and Ring cameras; as well as Tinder, Roku, Coinbase, and Venmo, among others.
  • US-East has the largest concentration of AWS data centers in the world, but has become a joke among some employees for often needing patches. A major AWS customer told Insider that IAD data centers are typically “a significant point of failure” for those who rely on them as their primary AWS Region.
  • At least nine of the 17 biggest outages in AWS history came from IAD data centers, according to AWS Maniac, a blog that tracks AWS service outages.
  • AWS is responding by disabling the scaling activities that triggered last week’s event.

The problem: We are beginning to see cracks in the ability of cloud infrastructure providers to keep a growing list of Internet services and applications online. The recent AWS outage exposes the fragility of busy data center regions.

  • The proliferation of streaming services, online gaming platforms, IoT devices, and online services is taking its toll on internet infrastructure that is decades old or simply cannot scale fast enough to keep up with demand.
  • Recent outages are also taking more time to resolve, indicating that massive growth is rapidly becoming unmanageable.

The bigger picture: We are seeing the effects of an overwhelmed, highly centralized data center region and an overreliance on a handful of vendors that can take out large swaths of the Internet when their systems fail.

AWS hopes to be able to better track outages of this magnitude. The company said it plans to release a new version of its Service Health Dashboard early next year that will make it easier to understand service impact, along with a new support system architecture spanning multiple AWS Regions.
