This week, many of the world's biggest websites and services stopped working because of a problem with Amazon Web Services, the platform that the retailer provides to power people's websites. One of its data centres went offline because of the issue – and since thousands of websites rely on it, including many of the world's biggest, they immediately went completely offline or stopped working.
Websites like Quora, Trello and some of the world's biggest news sites went offline or stopped working properly when the issue happened. It even emerged that people's houses broke down – internet-enabled ovens, lights and front gates stopped working as a result of the outage.
Now it has emerged that all of those problems were the result of just one small typo on a set of servers.
The team at Amazon's Simple Storage Service (S3) was working to remove a small set of servers from its system, according to a technical report into the incident. As it did so, someone entered the wrong command and "a larger set of servers was removed than intended".
One of the two subsystems that were affected was one that manages the "metadata and location information of all S3 objects" in the Virginia data centre, according to the post. That meant that many of the core processes broke down and the server centre could no longer be used.
To fix that problem, Amazon had to fully restart all of the affected systems. But that was a huge operation – Amazon said it hadn't done one for "many years" – and it took even longer than expected, meaning the problem couldn't be fixed quickly.
The company says it has now added a range of fixes to stop the issue happening again. It has modified the tools that deal with such problems to cope more efficiently, and is auditing the rest of the systems, it said.
It also announced that it would make changes to the status page that shows whether the service is actually online. Even as Amazon Web Services and the websites that rely on it were falling over, that page said that everything was fine – because it relied on S3 itself and so was affected by the very outage it was meant to report. It will now rely on different data centres so that it won't buckle if just one goes down, it said.

Independent