Federal agencies and regional data center operators, including one operated by Amazon Web Services, are still taking stock of the impact of widespread power outages that began Friday night and continue to leave large swaths of the greater Washington, D.C., region without electrical power.

Many federal buildings are still without power as a result of unusually violent wind gusts Friday evening, prompting the U.S. Office of Personnel Management to give federal employees in the Washington, D.C., area the option to take unscheduled leave or work remotely. Among the agencies taking advantage of the policy is the U.S. Patent and Trademark Office, where 7,000 of the agency’s 10,000 employees now routinely telework at least one day a week, according to Danette R. Campbell, senior advisor at the PTO’s Telework Office.

What is less clear, however, is how widely the power outages may have affected federal IT and communications systems used to support employees.

Northern Virginia and the greater Washington area are home not only to millions of residents but also to some of the nation’s largest data center operations. The storm forced data centers in the region to scramble Friday night, including an Amazon Web Services facility, which began seeing “elevated error rates impacting a limited number” of customers, according to status reports issued on a company blog and subsequently reported by The New York Times.

In response to the incident, an Amazon spokesperson told Breaking Gov Monday:

“Severe thunderstorms caused us to lose primary and backup generator power to a portion of a single Availability Zone in our US-East Region Friday night. For perspective, in our US-East Region in Virginia, we have in excess of 10 data centers.

“In the thunderstorm on Friday night, several of our data centers had their utility power impacted, but in only one of them did the redundant power not operate correctly (which ended up impacting a single digit percentage of our Amazon EC2 instances in the US-East Region). We began restoring service to most of the impacted customers Friday night and the remainder were restored on Saturday,” said the spokesperson.

Amazon’s was not the only data center that lost power.

“Our data center in Reston lost external power for a relatively short time,” said Unisys Group Vice President for Federal Systems Peter Gallagher. “Generators of course kicked in – so we were never off-line in Reston despite the storm. Our primary data centers in Salt Lake City and Eagan, Minn., were similarly not in the storm zone so our contingency operations for all customers – federal and commercial – were unaffected,” he said.

But as a growing number of organizations come to rely on Amazon and others for storage, the weekend’s shutdown demonstrates once again why cloud computing systems are still not a complete answer for organizations that require dedicated backup options.

“Having systems in the cloud doesn’t mean that those systems are fault tolerant,” said Jason Lewis, chief scientist, Lookingglass Cyber Solutions.

“Amazon has multiple zones that can be setup for failover if a problem in one (zone) occurs. If a customer only has a presence in one location and a weather event takes down that location, then they didn’t plan for failure. I think there is a misconception that everything in the ‘cloud’ will be fault tolerant and redundant. This isn’t the case,” he said.
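Lewis’s point is straightforward to act on. As a minimal sketch, using the current Python SDK for AWS (boto3) with a hypothetical AMI ID and an illustrative instance type, spreading instances across two Availability Zones in US-East looks roughly like this; a deployment confined to a single zone simply omits the second placement:

import boto3  # AWS SDK for Python; assumes credentials are already configured

ec2 = boto3.client("ec2", region_name="us-east-1")

AMI_ID = "ami-12345678"  # hypothetical image ID; substitute your own

# Launch one instance in each of two Availability Zones so that losing a
# single zone, as happened June 29, does not take down the whole service.
for zone in ("us-east-1a", "us-east-1b"):
    ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType="t2.micro",  # illustrative instance type
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )

The same reasoning extends to data: snapshots or replicas kept in a second zone are what make the failover Lewis describes possible.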

“The Dulles area has a huge Internet presence,” Lewis said. “I’d expect a lot of affected customers are looking at expanding their backup/fail-over plans. While recent events are rare, planning for them is part of highly available design.”

According to Amazon’s own blog posts, however, the power outage served as a reminder that organizations relying on large-scale data and cloud computing centers may face more than a temporary suspension of service when regional power fails. The data itself can also come into question, as appeared to be the case when Amazon reported at 10:36 PM PDT that some Elastic Block Store storage volumes “may have inconsistent data” and would in effect be “Impaired.”
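For customers trying to answer that question without waiting on the console, the same “Impaired” status Amazon describes is exposed through the EC2 API. A minimal sketch, again using the current Python SDK (boto3) and a hypothetical volume ID, checks a volume’s status and, if it is impaired, performs the API equivalent of the console’s “Enable Volume IO” action:

import boto3  # AWS SDK for Python; assumes credentials are already configured

ec2 = boto3.client("ec2", region_name="us-east-1")

VOLUME_ID = "vol-12345678"  # hypothetical; substitute an affected volume

# DescribeVolumeStatus reports the same check shown in the console's
# "Status Checks" column; an affected volume is reported as "impaired".
response = ec2.describe_volume_status(VolumeIds=[VOLUME_ID])
state = response["VolumeStatuses"][0]["VolumeStatus"]["Status"]
print(VOLUME_ID, state)

if state == "impaired":
    # API equivalent of clicking "Enable Volume IO" in the console. Even after
    # IO is re-enabled, the filesystem should still be checked with fsck or chkdsk.
    ec2.enable_volume_io(VolumeId=VOLUME_ID)

Re-enabling IO only restores access; whether the data on the volume is still consistent is exactly the question Amazon’s guidance leaves to the customer.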

Amazon, which supports a variety of Web-based services for consumer, commercial and government customers, provides regular blog posts about the status of its Elastic Compute Cloud (EC2) and other data center operations.

Those blog posts captured a glimpse of how the storm affected Amazon’s Northern Virginia server operations as millions of area residents and businesses suddenly lost power:

Friday, June 29, 8:21 PM PDT | (11:21 PM EDT) We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.

8:31 PM PDT We are investigating elevated error rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.

8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.

8:49 PM PDT Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.

9:20 PM PDT | (Saturday, June 30, 12:20 AM EDT) We are continuing to work to bring the instances and volumes back online. In addition, EC2 and EBS APIs are currently experiencing elevated error rates.

9:54 PM PDT EC2 and EBS APIs are once again operating normally. We are continuing to recover impacted instances and volumes.

10:36 PM PDT We continue to bring impacted instances and volumes back online. As a result of the power outage, some EBS volumes may have inconsistent data. As we bring volumes back online, any affected volumes will have their status in the “Status Checks” column in the Volume list in the AWS console listed as “Impaired.” If your instances or volumes are not available, please login to the AWS Management Console and perform the following steps:

1) Navigate to your EBS volumes. If your volume was affected and has been brought back online, the “Status Checks” column in the Volume list in the console will be listed as “Impaired.”

2) You can use the console to re-enable IO by clicking on “Enable Volume IO” in the volume detail section.

3) We recommend you verify the consistency of your data by using a tool such as fsck or chkdsk.

4) If your instance is unresponsive, depending on your operating system, resuming IO may return the instance to service.

5) If your instance still remains unresponsive after resuming IO, we recommend you reboot the instance from within the Management Console.

More information is available at: http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/monitoring-volume-status.html

11:19 PM PDT We continue to make progress in recovering affected instances and volumes. Approximately 50% of impacted instances and 33% of impacted volumes have been recovered.

Jun 30, 12:15 AM PDT | (3:15 AM EDT) We continue to make steady progress recovering impacted instances and volumes. Elastic Load Balancers were also impacted by this event. ELBs are still experiencing delays in provisioning load balancers and in making updates to DNS records.

Jun 30, 12:37 AM PDT ELB (Elastic Load Balancing) is currently experiencing delayed provisioning and propagation of changes made in API requests. As a result, when you make a call to the ELB API to register instances, the registration request may take some time to process. As a result, when you use the DescribeInstanceHealth call for your ELB, the state may be inaccurately reflected at that time. To ensure your load balancer is routing traffic properly, it is best to get the IP addresses of the ELB’s DNS name (via dig, etc.) then try your request on each IP address. We are working as fast as possible to get provisioning and the API latencies back to normal range.
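The workaround Amazon describes in that update can be scripted. As a rough illustration in Python, assuming a hypothetical load balancer hostname and a plain HTTP endpoint at the root path, the snippet below resolves every address behind the ELB’s DNS name, much as dig would, and probes each one directly:

import socket
import http.client

ELB_HOST = "my-elb-123456.us-east-1.elb.amazonaws.com"  # hypothetical ELB DNS name

# Resolve every address behind the ELB's DNS name (the equivalent of running dig).
infos = socket.getaddrinfo(ELB_HOST, 80, proto=socket.IPPROTO_TCP)
addresses = sorted({info[4][0] for info in infos})

# Probe each address directly, sending the ELB hostname in the Host header,
# to see which load balancer nodes are actually answering.
for addr in addresses:
    try:
        conn = http.client.HTTPConnection(addr, 80, timeout=5)
        conn.request("GET", "/", headers={"Host": ELB_HOST})
        print(addr, conn.getresponse().status)
        conn.close()
    except OSError as exc:
        print(addr, "unreachable:", exc)

Amazon’s status updates continued through the night and into Saturday: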

Jun 30, 1:42 AM PDT We have now recovered the majority of EC2 instances and are continuing to work to recover the remaining EBS volumes. ELBs continue to experience delays in propagating new changes.

Jun 30, 3:04 AM PDT We have now recovered the majority of EC2 instances and EBS volumes. We are still working to recover the remaining instances, volumes and ELBs.

Jun 30, 4:42 AM PDT We are continuing to work to recover the remaining EC2 instances, EBS volumes and ELBs.

Jun 30, 7:14 AM PDT We are continuing to make progress towards recovery of the remaining EC2 instances, EBS volumes and ELBs.

Jun 30, 8:38 AM PDT We are continuing our recovery efforts for the remaining EC2 instances and EBS volumes. We are beginning to successfully provision additional Elastic Load Balancers.

As a result of the power outage, some EBS volumes may have inconsistent data. As we bring volumes back online, any affected volumes will have their status in the “Status Checks” column in the Volume list in the AWS console listed as “Impaired.” If your instances or volumes are not available, please login to the AWS Management Console and perform the following steps:

1) Navigate to your EBS volumes. If your volume was affected and has been brought back online, the “Status Checks” column in the Volume list in the console will be listed as “Impaired.”

2) You can use the console to re-enable IO by clicking on “Enable Volume IO” in the volume detail section.

3) We recommend you verify the consistency of your data by using a tool such as fsck or chkdsk.

4) If your instance is unresponsive, depending on your operating system, resuming IO may return the instance to service.

5) If your instance still remains unresponsive after resuming IO, we recommend you reboot the instance from within the Management Console.

6) If your instance is EBS-backed (NOTE: not instance-store), you may need to perform a stop/start operation to return your instance to service.

More information is available at: http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/monitoring-volume-status.html

Jun 30, 10:25 AM PDT The majority of affected EC2 instances with no impaired EBS volumes attached have been recovered. Instances with an impaired EBS volume attached may still be unavailable. Creation of EBS recovery volumes is complete for the vast majority of recoverable volumes. For instances in this state follow the steps outlined in our last update. There remain a small subset of EBS volumes that are currently stuck in the affected Availability Zone. ELB load balancers are provisioning successfully, but some may be delayed while we process our backlog.

Jun 30, 11:42 AM PDT We are continuing to work on processing our provisioning backlog for ELB load balancers. We are also continuing to work on restoring IO for the remaining small number of stuck EBS volumes. Customer action is required for EBS volumes that do not have IO currently enabled – if you have not already chosen to Enable Volume IO, outlined in the instructions above, please follow those steps to re-enable IO on your EBS volumes.

Jun 30, 12:56 PM PDT EC2 instances and EBS volumes are operating normally. Some EBS volumes may have inconsistent data as a result of the power outage. Affected volumes will have their status in the “Status Checks” column in the Volume list in the AWS Management Console listed as “Impaired.” Please login to the AWS Management Console and perform the steps described above to Enable Volume IO on the affected volumes. We are continuing to work on processing our provisioning backlog for ELB load balancers.

Jun 30, 2:31 PM PDT We are making steady progress on processing our provisioning backlog for ELB load balancers.

Jun 30, 4:10 PM PDT We have now completed processing the vast majority of our provisioning backlog for ELB load balancers. ELB APIs are operating normally.

Jun 30, 4:43 PM PDT The service is now operating normally. We will post back here with an update once we have details on the root cause analysis.

Amazon in many ways demonstrated a level of transparency and support that many federal agencies would not likely see from their own data center operations.

Others, however, contend that there is an important distinction between suffering a service outage and ending up with corrupted or questionable data.

That kind of uncertainty may have federal IT analysts busy poring over log files long after the power comes back on in Washington.