Amazon Web Services is using the open-source Apache Hadoop distributed computing technology to make it easier for users to access large amounts of computing power to run data-intensive tasks.
AWS (Amazon Web Services) on April 2 announced the public beta of its Amazon Elastic MapReduce initiative, a service designed for businesses, researchers and analysts who have large number-crunching projects such as Web indexing, data mining, financial analysis and scientific simulations, according to AWS officials.
Using a hosted Hadoop framework, users can instantly provision as much compute capacity as they need from Amazon’s EC2 (Elastic Compute Cloud) platform to perform the tasks, and pay only for what they use.
Hadoop, an open-source implementation of the MapReduce model that Google introduced, is already being used by such companies as Yahoo and Facebook. Google only uses Hadoop internally.
Efforts are under way to increase the use of Hadoop in enterprise data centers. Most recently, a startup, Cloudera, which calls itself the commercial Hadoop company, announced March 16 the availability of its first product, the Cloudera Distribution for Hadoop. The product lets users store and process petabytes of data that is often distributed across thousands of servers.
Cloudera also created a portal to help users install and use the company’s free product.
“Cloudera is advancing Hadoop technology to make it easier for everyone to store and process the same types of big data that large Web companies are successfully using in their businesses,” Christophe Bisciglia, the founder of Cloudera and former manager of Google’s Hadoop cluster, said in a statement at the time of Cloudera’s announcement.
According to AWS officials, running Hadoop and other MapReduce-based clusters on the Amazon EC2 cloud computing platform had been a difficult task that forced users to do their own setup, management and cluster tuning. With Amazon Elastic MapReduce, those tasks are less time-consuming and more affordable, and users can build up and take down Hadoop-based clusters on EC2 in moments.
AWS also is offering sample applications and tutorials to help users get more comfortable with the new service. Amazon Elastic MapReduce automatically deploys and configures the number of EC2 instances users ask for, then launches a Hadoop implementation of the MapReduce tool. MapReduce then loads the data from Amazon S3 (Simple Storage Service) and divides it so it can be processed in a parallel fashion. The data is then recombined after processing, with the end results put back into S3.
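The divide, process and recombine flow described above can be illustrated with a minimal Python sketch. This is not Amazon's or Hadoop's code. The function names (split_input, map_chunk, reduce_counts, word_count) are hypothetical, an in-memory list stands in for data held in S3, and local threads stand in for EC2 instances.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, num_chunks):
    """Divide the input into roughly equal chunks, one per worker."""
    num_chunks = max(1, num_chunks)
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk):
    """Map step: count the words in a single chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, num_workers=4):
    """Run the split, map and reduce steps over a list of text lines."""
    chunks = split_input(lines, num_workers)
    # Assumption: threads stand in for the separate machines of a real cluster.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


# Assumption: this small list stands in for input data loaded from storage.
sample = [
    "hadoop runs on ec2",
    "mapreduce loads data from s3",
    "hadoop writes results to s3",
]
result = word_count(sample, num_workers=2)
print(result["hadoop"], result["s3"])  # prints: 2 2
```

In a real cluster each chunk would be processed on a different machine and the merged result written back to storage. The sketch keeps everything in one process so the three stages are easy to follow.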
“Some researchers and developers already run Hadoop on Amazon EC2, and many of them have asked for even simpler tools for large-scale data analysis,” Adam Selipsky, vice president of product management and developer relations at AWS, said in a statement.