The Hadoop framework for handling large data sets becomes easier to use, more manageable and more productive with new support in the Amazon Elastic MapReduce technology, according to Amazon.com.
NEW YORK - Amazon Web Services has moved to
make the Hadoop framework for handling large data sets easier to use, more
manageable and more productive with new support in the Amazon Elastic MapReduce
technology.
At the Hadoop World: NYC
conference here on Oct. 2, Peter Sirota, Amazon.com's general manager for
Elastic MapReduce, announced that Elastic MapReduce now supports Apache Hive.
According to an
AWS blog post on the news, "Hive builds on Hadoop to provide tools for
data summarization, ad hoc querying and analysis of large data sets stored in
Amazon S3 [Simple Storage Service]. Hive uses a SQL-based language called Hive
QL with support for map/reduce functions and complex extensible user defined
data types such as JSON and Thrift. You can use Hive to process structured or
unstructured data sources such as log files or text files. Hive is great for
data warehousing applications such as data mining and click-stream
analysis."
Sirota explained AWS' motivation for adding the new Hadoop support: "We
want to enable customers to cost effectively manage vast amounts of data."
Elastic MapReduce uses Hadoop.
The Apache Hadoop project develops open-source software for reliable,
scalable, distributed computing. Hadoop enables applications to work with
thousands of nodes and petabytes of data. It was inspired by Google's MapReduce
and GFS (Google File System) papers.
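For readers unfamiliar with the model behind Hadoop, the map, shuffle and reduce steps can be sketched in a few lines of plain Python. This is a single-process illustration only and does not use any Hadoop or Elastic MapReduce API; the log format ("ip url status"), the sample data and all function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (url, 1) pair for each well-formed log line.

    Assumes a hypothetical 'ip url status' format; malformed lines are skipped.
    """
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical click-stream sample, invented for illustration.
SAMPLE_LOG = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /cart 200",
    "10.0.0.1 /home 304",
    "malformed",
]

if __name__ == "__main__":
    print(run_job(SAMPLE_LOG))
```

In a real Hadoop cluster the map and reduce steps run in parallel across many nodes and the framework performs the shuffle between them; the sketch only shows the shape of the computation.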
"Large-scale data processing has a lot of muck and we want to remove it
for our customers," Sirota said. He noted that it is hard to manage
compute clusters and also hard to tune Hadoop. "So we decided to make Hadoop
simple and easy," he said.
One new feature in Elastic MapReduce is support for Apache Pig, an open-source
platform for analyzing large data sets. Other features include support for
batch and interactive modes, and concurrent access to multiple file systems.
Sirota said he would like to see the Hadoop ecosystem continue to grow. With
that in mind, he announced, "Amazon Elastic MapReduce is now supported by
Karmasphere Studio for Hadoop, a NetBeans-based integrated development
environment (IDE) that makes it easy to
develop, debug and deploy job flows from your desktop directly to Amazon Elastic
MapReduce."
Sirota "also announced ... a private beta release of Amazon Elastic
MapReduce support for Cloudera's Hadoop distribution. Cloudera customers can
obtain a support contract to gain access to custom Hadoop patches and help with
the development and optimization of processing pipelines," according to
the AWS blog post. It continued:
"These new tools give you the power to process gigantic data sets while
keeping you at arm's length from some of the more complex aspects of parallel
programming. You don't have to find a server cluster, install a bunch of
software, coordinate and synchronize processes, copy data between servers, or
negotiate with your colleagues for access to shared resources. These tools
reduce the distance between the problem and the solution and allow you to spend
your time on the more interesting aspects of your work."