DataStax Soups Up Hadoop with Apache Cassandra

 
 
By Darryl K. Taft  |  Posted 2011-03-23
 
 
 

NEW YORK - At the Structure Big Data conference, DataStax, the commercial sponsor of Apache Cassandra, unveiled Brisk, a new distribution that enhances the Hadoop and Hive platform with scalable low-latency data capabilities.

DataStax used the Structure Big Data event here on March 23 as the platform to launch its new solution for low-latency applications and Hadoop and Hive analytics.

In an interview with eWEEK at the Structure event, Matt Pfeil, CEO and co-founder of DataStax, said the Brisk platform can act as the low-latency database for extremely high-volume Web and real-time applications while providing tightly coupled Hadoop and Hive analytics. The Structure Big Data conference enabled the big data community to discuss the best technologies for managing and harnessing ever-increasing volumes of data.

"The challenge of 'big data' is twofold," Pfeil said in a statement. "The analytical side is well-understood and served by Hadoop and Hive. However, we live in a real-time world and the ability for applications to interact with big data at low-latency is equally important. Apache Cassandra was bred for big data, real-time scenarios, and using it to power Apache Hive and Apache Hadoop gives users a single solution that serves both needs."

DataStax' Brisk is an enhanced open-source Hadoop and Hive distribution that uses Cassandra for many of its core services, Pfeil said. Brisk provides integrated Hadoop MapReduce, Hive, and job- and task-tracking capabilities, along with a Hadoop Distributed File System (HDFS)-compatible storage layer powered by Cassandra. It also exposes the full power of Cassandra for real-time applications. The result, according to DataStax, is a single integrated solution that provides increased reliability, simpler deployment and lower total cost of ownership (TCO) than traditional Hadoop solutions.

A key benefit of DataStax' Brisk is the tight feedback loop it allows between a real-time application and the analytics that follow. Traditionally, users have been forced either to move data between systems via complex extract, transform and load processes, or to perform both functions on the same system at the risk of one impacting the other. Brisk will be available under an Apache open-source license within 45 days of this announcement.

"By marrying the power of Cassandra, including its simplicity, scalability and speedy reads/writes, to Hadoop, DataStax has created a powerful system that speeds up the time between data creation and analysis," Tim Estes, CEO of Digital Reasoning, said in a statement. "We can count on some of Cassandra's unique capabilities to aid projects that have multiple data center locations, and large and complex bulk ingest demands. We've been thrilled to work with the DataStax team to push its capabilities to some of the most demanding customers, particularly in the Defense and Intelligence Community."

Michael Weir, vice president of marketing at DataStax, explained some key uses of Brisk:

    High-volume Websites: Provide real-time data access and storage for millions of simultaneous users. Directly perform Hive analysis on the latest data, and immediately feed analytic insights back into the application behavior.

    Retail: Maintain real-time summaries and aggregates to allow a continuously up-to-date view of important business metrics. Send alerts when anomalies occur.

    High-volume event processing: Track and react instantly to millions of sensors or other distributed feeds, while allowing deeper analytic questions to be asked of the historical data at any moment.

    Finance and capital markets: Process, store and trigger actions based on a high-volume real-time event stream. Perform analytics on historical data, and update models directly into the application.

The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. Cassandra was open-sourced by Facebook in 2008, and is now developed by Apache committers and contributors from many companies. Cassandra is in use at Digg, Facebook, Twitter, Reddit, Rackspace, Cloudkick, Cisco, SimpleGeo, Ooyala, OpenX and more companies that have large, active data sets. The largest production cluster has over 100TB of data in over 150 machines.

"Not much else can compete with Cassandra in terms of performance," Pfeil said.

Pfeil said he and a former Rackspace colleague decided to leave the hosting company to create a startup around Cassandra after having worked with the database there.

"We continue to support the open-source project," Pfeil said. "We employ 80 percent of the people working on it. And we'll continue to build products that help users use Cassandra more easily and effectively."

Explaining the value of DataStax' Brisk and Cassandra, Weir said, "It would be as if Watson was not just taking cues from its vast knowledge base, but was also taking in all the other variables around him, like the other players and how they're playing, and assessing all of that in real time. You can run the real-time processing and the analytics at the same time. We're bridging that gap between real time and analytics."

Moreover, keying in on the emerging importance of technologies like Cassandra, Pfeil said Rackspace is "storing a large amount of really small files on commodity hardware, and we had to expect failure to happen, so we had to find ways to scale horizontally. You don't need a supercomputer to make everything work anymore; you can use cheap commodity computers."

With Cassandra, data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.

"The tide has turned," Pfeil said. "Big data for enterprises used to be a problem; now, it's an opportunity."

 

 

