When to Select Apache Spark, Hadoop or Hive for Your Big Data Project


Apache Spark is making remarkable gains at the expense of the original Hadoop ecosystem. Here's a guide to help decide between Spark and other Hadoop engines.

Sparking the Hadoop Industry

Spark has been gaining major traction in the Hadoop community: IBM recently announced major investments in the technology, and numerous big data vendors are integrating Spark into their offerings. According to CEO Ashish Thusoo, a recent survey conducted by big data-as-a-service provider Qubole found that 42 percent of 288 respondents are either using Spark or planning to integrate it into their infrastructure strategy within the next two years.

Hadoop vs. Spark

There's been an ongoing debate about whether Spark will replace Hadoop altogether, given its advantages for machine learning and its ability to process some workloads up to 100 times faster. Spark also addresses shortcomings of Hadoop's MapReduce model, such as its batch-oriented, disk-intensive processing. However, Qubole is tool-agnostic and does not believe Spark will dominate the big data ecosystem: there are still distinct trade-offs in terms of use cases, infrastructure cost and relative maturity.

Streaming Data

Apache Spark's key use case is processing streaming data. With so much data generated every day, companies increasingly need to analyze it as it arrives, and Spark is built to handle that workload in real time. Some experts even theorize that Spark could become the go-to platform for stream-computing applications, no matter the type. Its support for multiple kinds of streaming analytics, from fraud detection to log processing, shows Spark's versatility and makes it a strong choice in many use cases.

Machine Learning

Another major Apache Spark use case is machine learning. Spark excels at running repeated queries over the same data held in memory, the iterative access pattern at the heart of most machine learning algorithms. Spark's machine learning library covers areas such as clustering, classification and dimensionality reduction, among many others. All of this lets Spark handle common big data functions such as predictive intelligence, customer segmentation for marketing purposes and sentiment analysis.

Interactive Analysis

MapReduce was built to handle batch processing, and engines layered on it, such as Hive (SQL-on-Hadoop) or Pig, are frequently too slow for interactive analysis. Apache Spark, however, is fast enough to perform exploratory queries over a full dataset without sampling. Spark also exposes APIs in a number of languages, including SQL, R and Python.

Fog Computing

Connected devices in the Internet of Things collect massive amounts of data, process it, and deliver revolutionary new features and applications for people to use in their everyday lives. All that processing, however, is hard to manage with current cloud-based analytics capabilities. That's where fog computing and Apache Spark come in. Fog computing decentralizes data processing and storage, performing those functions at the edge of the network instead. With its streaming analytics engine and interactive real-time query tools, Apache Spark is well suited to analyzing and processing this type of data.

When Not to Use Spark

Versatile as it is, Apache Spark's in-memory processing is not the best fit for every use case. For example, Spark was not designed as a multiuser environment: users need to know whether the memory they have access to is sufficient for a dataset, and adding more users complicates this further, since they must coordinate memory usage to run projects concurrently. For large batch projects, users may therefore want to consider an alternative engine, such as Apache Hive.

The Future of Spark

Over time, Apache Spark will continue to develop its own ecosystem, becoming even more versatile. In a world where big data has become the norm, organizations will need to find the best way to utilize it. Judging from these Apache Spark use cases, there should be many opportunities in the coming years to see how powerful Spark truly is.
