NEW YORK - Amazon Web Services has moved to make the Hadoop framework for handling large data sets easier to use, more manageable and more productive by adding support for Apache Hive to its Amazon Elastic MapReduce service.
According to an AWS blog post on the news, “Hive builds on Hadoop to provide tools for data summarization, ad hoc querying and analysis of large data sets stored in Amazon S3 [Simple Storage Service]. Hive uses a SQL-based language called Hive QL with support for map/reduce functions and complex extensible user defined data types such as JSON and Thrift. You can use Hive to process structured or unstructured data sources such as log files or text files. Hive is great for data warehousing applications such as data mining and click-stream analysis.”
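To give a flavor of the click-stream summarization the post describes, here is a minimal sketch in Python. It uses the standard library's sqlite3 module purely as a stand-in for a SQL-style engine: it is not Hive, the query is ordinary SQL rather than Hive QL, and the table name, column names and sample log lines are all invented for illustration.

```python
import sqlite3

# Hypothetical click-stream log lines in "<user_id> <page>" form (invented data).
LOG_LINES = [
    "alice /home",
    "bob /home",
    "alice /products",
    "carol /home",
    "bob /checkout",
]


def page_view_counts(lines):
    """Load log lines into an in-memory table and count views per page."""
    conn = sqlite3.connect(":memory:")  # stand-in engine, not Hive
    try:
        conn.execute("CREATE TABLE clicks (user_id TEXT, page TEXT)")
        # Split each line once: the first field is the user, the rest is the page.
        rows = [tuple(line.split(" ", 1)) for line in lines]
        conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
        # An ad hoc summarization query of the kind a SQL-based tool allows.
        query = (
            "SELECT page, COUNT(*) AS views "
            "FROM clicks GROUP BY page "
            "ORDER BY views DESC, page ASC"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for page, views in page_view_counts(LOG_LINES):
        print(page, views)
```

On a real Hadoop cluster, a query of this shape would run as distributed jobs over files in storage rather than over an in-memory table on one machine.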
Explaining AWS’ motivation for adding the new Hadoop support, Sirota said, “We want to enable customers to cost effectively manage vast amounts of data.” Elastic MapReduce uses Hadoop.
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop enables applications to work with thousands of nodes and petabytes of data. It was inspired by Google’s MapReduce and GFS (Google File System) papers.
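The map/reduce model behind Hadoop can be illustrated with a small, single-process sketch in Python. This is not the Hadoop API and it leaves out distribution, scheduling and fault tolerance entirely; the function names and the word-count task are assumptions chosen only to show the map, shuffle and reduce steps.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group all values by key, as happens between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    """Add up the counts emitted for one word."""
    return sum(values)


def word_count(lines):
    """Run the three steps in order on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

In a real cluster, the map and reduce steps run in parallel on many nodes and the grouping step moves data across the network, which is the part a managed service takes off the customer's hands.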
“Large-scale data processing has a lot of muck and we want to remove it for our customers,” Sirota said. He noted that it is hard to manage compute clusters and also hard to tune Hadoop. “So we decided to make Hadoop simple and easy,” he said.
One new feature in Elastic MapReduce is support for Apache Pig, an open-source platform for analyzing large data sets. Other features include support for batch and interactive modes, and concurrent access to multiple file systems.
Sirota said he would like to see the Hadoop ecosystem continue to grow. With that in mind, he announced, “Amazon Elastic MapReduce is now supported by Karmasphere Studio for Hadoop, a NetBeans-based integrated development environment (IDE) that makes it easy to develop, debug and deploy job flows from your desktop directly to Amazon Elastic MapReduce.”
Sirota “also announced … a private beta release of Amazon Elastic MapReduce support for Cloudera’s Hadoop distribution. Cloudera customers can obtain a support contract to gain access to custom Hadoop patches and help with the development and optimization of processing pipelines,” according to the AWS blog post. It continued:
“These new tools give you the power to process gigantic data sets while keeping you at arm’s length from some of the more complex aspects of parallel programming. You don’t have to find a server cluster, install a bunch of software, coordinate and synchronize processes, copy data between servers, or negotiate with your colleagues for access to shared resources. These tools reduce the distance between the problem and the solution and allow you to spend your time on the more interesting aspects of your work.”