Cloudera Impala 1.0 Brings SQL to Hadoop for Real-Time Queries

By Darryl K. Taft  |  Posted 2013-05-02

Cloudera, a provider of Apache Hadoop solutions for the enterprise, recently announced the general availability of Cloudera Impala, its open-source, interactive SQL query engine for analyzing data stored in Hadoop clusters in real time.

Cloudera claims to have been first to market with a SQL-on-Hadoop offering, releasing Impala to open source as a public beta offering in October 2012. Since that time, the company has worked closely with customers and open-source users, testing and refining the platform in real-world applications to deliver a production-hardened and customer-validated release, designed from the ground up for enterprise workloads, said Mike Olson, CEO of Cloudera.

In an interview with eWEEK, Justin Erickson, senior product manager for Impala at Cloudera, said adoption of the platform has been strong, with more than 40 enterprise customers and open-source users using Impala today, including 37signals, Expedia, Six3 Systems, Stripe and Trion Worlds. With its 1.0 release, Impala extends Cloudera's unified Platform for Big Data, which is designed specifically to bring different computation frameworks and applications to a single pool of data, using a common set of system resources. Cloudera Impala 1.0 can be downloaded here.

"At Ovum, we believe that for Hadoop to cross over to the enterprise, it must become a first-class citizen with IT, the business and the data center," said Tony Baer, principal analyst of software and enterprise solutions at market research firm Ovum, in a statement. "A large part of making Hadoop a first-class citizen in the enterprise is making it accessible to the large base of SQL developers and applications that already exist. With Impala, Cloudera has decisively planted the stake in bringing the worlds of Hadoop and enterprise SQL together. And it has done so in a way that addresses the expectations for performance that are taken for granted in the enterprise SQL world."

"Cloudera's Impala is perhaps the most widely known SQL-on-Hadoop solution," said Joseph Turian, Ph.D., a research analyst at GigaOm Research. "Cloudera has chosen to build its system from the ground up. This will allow it to optimize every part of the solution. It believes that by avoiding legacy, it can actually make a better architecture that is superior, both for end users and the ops staff."

Olson said Cloudera invested more than two years of intensive research and development to build Impala from the ground up, delivering a massively parallel processing (MPP) query engine that is native to Hadoop.

"Impala represents a major advance for Cloudera and the Hadoop ecosystem as a whole," Olson said in a statement. "We've invested years of research and development and devoted a team comprised of the world's top engineering talent to execute it. We are immensely proud to be releasing a fully tested and production-hardened Impala to general availability, and to be shattering industry forecasts for its delivery timetable.

"Cloudera was first to recognize that Apache Hadoop would be a catalyst for business transformation in the 21st century," he continued. "We have worked tirelessly to support the rapid development of the platform to form a viable and open enterprise solution, with a rich and vibrant ecosystem to support it. We will continue to be a primary driver behind the evolution of a 100-percent open source Hadoop platform by setting a high bar that pushes the boundaries of what's possible to exceed the high expectations of our enterprise customers."

In a blog post about Impala 1.0, Erickson and Marcel Kornacker, lead developer of Impala, wrote that the integration of Impala and Hadoop "allows Impala users to take advantage of the time-tested cost, flexibility, and scale advantages of Hadoop for interactive SQL queries, and makes SQL a first-class Hadoop citizen alongside MapReduce and other frameworks. The net result is that all your data becomes available for interactive analysis simultaneously with all other types of processing, with no ETL [Extract, Transform, Load] delays needed."

With Impala, users can query data stored in the Hadoop Distributed File System (HDFS) and HBase directly. The framework supports all standard file and data formats available, so users can choose the format that best suits their use case, including the latest in analytics-focused columnar formats like Parquet, and can promote data sharing and reuse across all computing workloads—from batch to interactive SQL—all from a single dataset.
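To make the appeal of a columnar format concrete, here is a minimal Python sketch contrasting a row-oriented layout with a column-oriented one. It is only an illustration of the general idea, not a description of how Parquet or Impala store data internally; the sample records, the field names and the to_columns helper are all hypothetical.

```python
# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"user_id": 1, "country": "US", "amount": 20.0},
    {"user_id": 2, "country": "DE", "amount": 35.5},
    {"user_id": 3, "country": "US", "amount": 12.5},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column.

    Assumes every record carries the same set of fields.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every complete record is visited to total a single field.
row_total = sum(record["amount"] for record in rows)

# Column layout: only the values of the "amount" column are read.
column_total = sum(columns["amount"])

print(row_total, column_total)
```

Both totals are the same, but the column-oriented version reads only the one field the query needs, which is the property that tends to make columnar formats attractive for analytic queries over wide tables.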

This approach eliminates the need to migrate datasets into specialized systems or proprietary formats for analytics purposes and reduces system redundancy and latency that would exist in a legacy data warehouse environment. The Impala framework is optimized for use with CDH, Cloudera's 100 percent open-source distribution of Hadoop and related applications.

Colin Marc, developer at Stripe, said his company must be able to quickly ingest and detect patterns in data coming from banks and its own systems. "Impala is an excellent tool for that," Marc said in a statement, "and its ability to perform speed-of-thought exploratory queries has been useful, both for analytics and development."

Cloudera Enterprise Real-Time Query (RTQ) is an optional subscription module that adds technical support and management automation to Impala for Cloudera Enterprise customers, according to Olson. It moves Apache Hadoop "beyond batch," enabling users to handle real-time workloads that previously required investment in dedicated enterprise data warehouse (EDW) solutions.

Powered by Impala, Cloudera Enterprise with RTQ offers a single, massively scalable system that improves the economics and performance of large-scale enterprise data management, enabling petascale processing and interaction with that data in real time to deliver "speed-of-thought" insights.

Six3 Systems, an industry leader in cyber-security, has turned to Hadoop—and Impala—as well. "Our business is all about comparing current activity with observed historical norms, identifying non-obvious patterns in data, correlation of large/fast moving disparate data sources, and automating threat detection," said Wayne Wheeles, senior network forensics analytic/enrichment developer at Six3 Systems, in a statement. "The larger the data sets that our algorithms can run on, the greater the cyber security threat awareness that can be provided to decision makers, making Hadoop a great fit.

"Impala integrates fully with our existing technologies, infrastructure and analytics providing a smooth transition into real-time, interactive data querying," he said. "With its innovative 'beyond batch' capabilities, we are now able to ask more sophisticated questions and gain actionable intelligence more quickly and efficiently, eliminating traditional data analysis bottlenecks and complexity."

"The ability to query data at the speed of thought is becoming a must-have in today's fast-paced gaming industry," said David Green, director of data services at Trion Worlds, in a statement. "We're deploying Cloudera Impala to empower our support organization to access and analyze issues that customers experience on the fly, while they're connected. The ability to address customer challenges in real time will drive a happier and more loyal customer base that is crucial to our business success."

Erickson noted that Cloudera Impala has been widely embraced by Cloudera's partner ecosystem, with numerous companies certifying their solutions for integration with the platform, including Alteryx, Capgemini, IBM Cognos, Karmasphere, MicroStrategy, Pentaho, QlikView, SAP, Splunk and Tableau.

"Our successful collaboration with Cloudera empowers organizations to unlock valuable business insights hidden in large, complex data sets in compelling new ways," said Paul Zolfaghari, president at MicroStrategy, in a statement.

"We are very excited about Cloudera's continuing innovation in the SQL-on-Hadoop market. In our independent testing of Cloudera Impala, we experienced a massive performance increase in the accessibility of data stored in Hadoop," said Zolfaghari. "Through our platform integration with Impala, customers can now perform sophisticated point and click analytics on data stored in Hadoop directly from MicroStrategy applications."

Another partner, Tableau Software, has noticed great improvements in query performance when using Impala, making Hadoop "more valuable to our customers as they adopt it broadly and give more people interactive access," said Dan Jewett, vice president of product management at Tableau, in a statement.

Impala 1.0 offers significant performance improvements over MapReduce/Hive for a wide range of business intelligence (BI) and analytic queries, making BI over Hadoop feasible, Erickson said.

Kornacker and Erickson's post mentioned the following Impala 1.0 features:

  • Support for a subset of ANSI-92 SQL (compatible with Hive SQL), including CREATE, ALTER, SELECT, INSERT, JOIN and subqueries

  • Support for partitioned joins, fully distributed aggregations and fully distributed top-n queries

  • Support for a variety of data formats: Hadoop native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP or uncompressed); text (uncompressed or LZO-compressed); and Parquet (Snappy or uncompressed), the new state-of-the-art columnar storage format

  • Support for all CDH4 64-bit packages: RHEL 6.2/5.7, Ubuntu, Debian, SLES

  • Connectivity via JDBC, ODBC, Hue GUI or command-line shell

  • Kerberos authentication and MR/Impala resource isolation
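The "fully distributed" aggregations and top-n queries in the list above can be pictured with a small Python sketch. It shows the general pattern only (each node produces a partial result and a coordinator merges them) and makes no claim about Impala's actual implementation. The node_data sample and every function name below are hypothetical.

```python
import heapq
from collections import Counter

# Hypothetical data: each inner list stands for the (key, value) rows held by one node.
node_data = [
    [("US", 20), ("DE", 35), ("US", 12)],
    [("FR", 50), ("DE", 5)],
    [("US", 8), ("JP", 40), ("FR", 1)],
]


def partial_sum(rows):
    """Per-node step: sum values by key using only that node's rows."""
    totals = Counter()
    for key, value in rows:
        totals[key] += value
    return totals


def merge_partials(partials):
    """Coordinator step: add the per-node partial sums together."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts for matching keys
    return merged


def local_top_n(values, n):
    """Per-node step: keep only the n largest values seen locally."""
    return heapq.nlargest(n, values)


def global_top_n(node_values, n):
    """Coordinator step: the global top n must be among the local top n lists."""
    candidates = [v for values in node_values for v in local_top_n(values, n)]
    return heapq.nlargest(n, candidates)


totals = merge_partials(partial_sum(rows) for rows in node_data)
top2 = global_top_n([[value for _, value in rows] for rows in node_data], 2)

print(dict(totals))
print(top2)
```

The point of the pattern is that each node ships a small partial result rather than its raw rows, so the coordinator's work grows with the number of groups or with n instead of with the full data size.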

Cloudera Impala is an Apache-licensed open-source project. The platform is open to community contributions, and the source code is available for free download on GitHub. For more information, or to join Cloudera and other open-source contributors in the development of the Impala platform, visit:

