Cloudera Impala 1.0 Brings SQL to Hadoop for Real-Time Queries
In a blog post about Impala 1.0, Erickson and Marcel Kornacker, lead developer of Impala, wrote that the integration of Impala and Hadoop "allows Impala users to take advantage of the time-tested cost, flexibility, and scale advantages of Hadoop for interactive SQL queries, and makes SQL a first-class Hadoop citizen alongside MapReduce and other frameworks. The net result is that all your data becomes available for interactive analysis simultaneously with all other types of processing, with no ETL [Extract, Transform, Load] delays needed."
With Impala, users can query data stored in the Hadoop Distributed File System (HDFS) and HBase directly. The framework supports all standard file and data formats available, so users can choose the format that best suits their use case, including the latest analytics-focused columnar formats such as Parquet. It also promotes data sharing and reuse across all computing workloads, from batch to interactive SQL, all from a single dataset. This approach eliminates the need to migrate datasets into specialized systems or proprietary formats for analytics purposes, and it reduces the system redundancy and latency that would exist in a legacy data warehouse environment. The Impala framework is optimized for use with CDH, Cloudera's 100 percent open-source distribution of Hadoop and related applications.
Colin Marc, developer at Stripe, said his company must be able to quickly ingest and detect patterns in data coming from banks and its own systems. "Impala is an excellent tool for that," Marc said in a statement, "and its ability to perform speed-of-thought exploratory queries has been useful, both for analytics and development."
Powered by Impala, Cloudera Enterprise with RTQ offers a single, massively scalable system that improves the economics and performance of large-scale enterprise data management, enabling petascale processing and interaction with that data in real time to deliver "speed-of-thought" insights.
Six3 Systems, an industry leader in cyber-security, has turned to Hadoop and Impala as well. "Our business is all about comparing current activity with observed historical norms, identifying non-obvious patterns in data, correlation of large/fast moving disparate data sources, and automating threat detection," said Wayne Wheeles, senior network forensics analytic/enrichment developer at Six3 Systems, in a statement. "The larger the data sets that our algorithms can run on, the greater the cyber security threat awareness that can be provided to decision makers, making Hadoop a great fit.
"Impala integrates fully with our existing technologies, infrastructure and analytics providing a smooth transition into real-time, interactive data querying," he said. "With its innovative 'beyond batch' capabilities, we are now able to ask more sophisticated questions and gain actionable intelligence more quickly and efficiently, eliminating traditional data analysis bottlenecks and complexity."
"The ability to query data at the speed of thought is becoming a must-have in today's fast-paced gaming industry," said David Green, director of data services at Trion Worlds, in a statement. "We're deploying Cloudera Impala to empower our support organization to access and analyze issues that customers experience on the fly, while they're connected. The ability to address customer challenges in real time will drive a happier and more loyal customer base that is crucial to our business success."
Cloudera Enterprise Real-Time Query (RTQ) is an optional subscription module that adds technical support and management automation to Impala for Cloudera Enterprise customers, according to Olson. It moves Apache Hadoop "beyond batch," enabling users to handle real-time workloads that previously required investment in dedicated enterprise data warehouse (EDW) solutions.