When to Select Apache Spark, Hadoop or Hive for Your Big Data Project

Written By

Sep 16, 2015

3 minute read

Apache Spark is making remarkable gains at the expense of the original Hadoop ecosystem. Here’s a guide to help you decide between Spark and other Hadoop engines.

Sparking the Hadoop Industry

Spark has been gaining major traction in the Hadoop community: IBM recently announced huge investments in the technology, and numerous big data vendors are integrating Spark into their offerings. According to Ashish Thusoo, CEO of big data-as-a-service provider Qubole, a recent survey conducted at the company found that 42 percent of 288 respondents either are using Spark or plan to integrate it into their infrastructure strategy in the next two years.

Hadoop vs. Spark

There’s been an ongoing debate about whether Spark will replace Hadoop altogether because of its advantages around machine learning and its ability to process some workloads up to 100 times faster than MapReduce. Spark also addresses certain shortcomings of Hadoop, such as its batch-oriented, disk-intensive processing model. However, Qubole is tool-agnostic and does not believe that Spark will dominate the big data ecosystem. There are still distinct trade-offs in terms of use cases, infrastructure cost and relative maturity.

Streaming Data

A key use case for Apache Spark is processing streaming data. With so much data being generated every day, it has become essential for companies to be able to stream and analyze it in real time, and Apache Spark has the capability to handle this extra workload. Some experts even theorize that Spark could become the go-to platform for stream-computing applications, no matter the type. By supporting streaming analytics of multiple kinds, Apache Spark shows its versatility, making it a strong choice for many use cases. That versatility extends to streaming applications such as fraud detection and log processing.
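To make the log-processing case concrete, here is a minimal sketch in plain Python (not Spark's API) of the kind of windowed computation a streaming engine performs: counting error lines per fixed time window as events arrive. The function name, the 10-second window and the sample events are all hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from typing import Iterable, Iterator, Tuple


def count_errors_per_window(
    events: Iterable[Tuple[float, str]],
    window_seconds: float = 10.0,
) -> Iterator[Tuple[int, int]]:
    """Yield (window_index, error_count) each time a window closes.

    Assumes events are (timestamp_in_seconds, log_line) pairs that
    arrive in timestamp order. Windows with no events are skipped.
    """
    current_window = None
    errors = 0
    for timestamp, line in events:
        window = int(timestamp // window_seconds)
        if current_window is None:
            current_window = window
        if window != current_window:
            # The previous window is complete, so emit its result.
            yield current_window, errors
            current_window, errors = window, 0
        if "ERROR" in line:
            errors += 1
    if current_window is not None:
        # Emit the final, still-open window when the input ends.
        yield current_window, errors


# Hypothetical sample events for illustration only.
sample = [
    (0.5, "INFO start"),
    (3.2, "ERROR disk full"),
    (9.9, "ERROR timeout"),
    (12.0, "INFO retry"),
    (25.1, "ERROR timeout"),
]
result = list(count_errors_per_window(sample))
```

A real streaming engine distributes this work across a cluster and handles late or out-of-order data, which this sketch deliberately ignores.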

Machine Learning

Another of the many Apache Spark use cases is machine learning. Spark lets users run repeated queries against the same dataset, which is essentially what iterative machine learning algorithms do. Spark’s machine learning library, MLlib, covers areas such as clustering, classification and dimensionality reduction, among many others. All this enables Spark to be used for some common big data functions, such as predictive intelligence, customer segmentation for marketing purposes and sentiment analysis.
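The "repeated queries" pattern is easiest to see in a clustering algorithm. The following is a minimal plain-Python sketch (not Spark's library) of one-dimensional k-means: every iteration rescans the same in-memory points, which is the access pattern that benefits from keeping data in memory. The function name, starting centers and sample values are hypothetical.

```python
from typing import List


def kmeans_1d(
    points: List[float],
    centers: List[float],
    iterations: int = 10,
) -> List[float]:
    """Run a fixed number of k-means iterations on 1-D data."""
    for _ in range(iterations):
        # Each iteration is a full pass over the same dataset.
        clusters: List[List[float]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers


# Hypothetical data: two obvious groups, e.g. low and high spenders.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = kmeans_1d(points, centers=[0.0, 5.0])
```

With a disk-based batch engine, each of those passes would reread the input from storage; holding `points` in memory is what makes the loop cheap.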

Interactive Analysis

MapReduce was built to handle batch processing, and higher-level Hadoop engines such as Hive and Pig are frequently too slow for interactive analysis. Apache Spark, however, is fast enough to perform exploratory queries without sampling. Spark also offers interfaces for a number of languages, including SQL, R and Python.

Fog Computing

Connected objects in the Internet of Things collect massive amounts of data, process it, and deliver revolutionary new features and applications for people to use in their everyday lives. All that processing, however, is tough to manage with the current analytics capabilities in the cloud. That’s where fog computing and Apache Spark come in. Fog computing decentralizes data processing and storage, performing those functions at the edge of the network instead. Apache Spark, with its streaming analytics engine and interactive query tools, is well-suited to analyzing and processing this type of data.

When Not to Use Spark

Apache Spark is versatile, but its in-memory capabilities are not necessarily the best fit for all use cases. For example, Spark was not designed as a multiuser environment. Spark users need to know whether the memory they have access to is sufficient for a dataset. Adding more users complicates this further, since they will have to coordinate memory usage to run projects concurrently. For this reason, users may want to consider an alternative engine, such as Apache Hive, for large batch projects.
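The memory question can be framed as a back-of-the-envelope check. This plain-Python sketch is only illustrative: the function name, the even split among users and the 0.5 usable fraction are assumptions made for the example, not Spark settings or a sizing rule.

```python
def fits_in_memory(
    dataset_gb: float,
    executor_memory_gb: float,
    num_executors: int,
    concurrent_users: int = 1,
    usable_fraction: float = 0.5,  # assumed share of memory available for data
) -> bool:
    """Roughly estimate whether a dataset fits in each user's memory share."""
    usable_gb = executor_memory_gb * num_executors * usable_fraction
    # Assume memory is split evenly among concurrent users.
    per_user_gb = usable_gb / concurrent_users
    return dataset_gb <= per_user_gb


# Hypothetical cluster: 40 executors with 8 GB each, and a 100 GB dataset.
single_user = fits_in_memory(100, executor_memory_gb=8, num_executors=40)
two_users = fits_in_memory(100, executor_memory_gb=8, num_executors=40,
                           concurrent_users=2)
```

In this example the dataset fits for one user but not once a second user needs an equal share, which is the coordination problem described above.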

The Future of Spark

Over time, Apache Spark will continue to develop its own ecosystem, becoming even more versatile. In a world where big data has become the norm, organizations will need to find the best way to utilize it. Judging from these Apache Spark use cases, there should be many opportunities in the coming years to see how powerful Spark truly is.

Chris Preimesberger