2. Jobs Crash or Fail to Be Completed on Time
Hadoop deployments often start as “sandbox” environments. Over time, the workload expands, and an increasing number of jobs on the cluster support production applications that are governed by service-level agreements. Ad hoc queries and other low-priority jobs compete with business-critical applications for system resources, causing high-priority jobs to fail to complete when needed.
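On YARN-based clusters, one rough way to catch SLA jobs being starved is to poll the ResourceManager's REST API for applications stuck waiting on containers. A minimal sketch in Python; the ResourceManager address and the "production" queue name are assumptions, not defaults:

```python
import json
import urllib.request

# Assumed ResourceManager address and SLA queue name -- adjust for your cluster.
RM = "http://resourcemanager.example.com:8088"
SLA_QUEUE = "production"

def starved_apps():
    """Return apps in the SLA queue still waiting for containers (state ACCEPTED)."""
    url = f"{RM}/ws/v1/cluster/apps?queue={SLA_QUEUE}&states=ACCEPTED"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # YARN returns {"apps": null} when nothing matches.
    apps = (data.get("apps") or {}).get("app", [])
    return [(a["id"], a["name"], a["elapsedTime"]) for a in apps]

for app_id, name, waited_ms in starved_apps():
    print(f"WARNING: {name} ({app_id}) has waited {waited_ms / 1000:.0f}s for resources")
```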
3. No Ability to Monitor Cluster Performance in Real Time
Hadoop diagnostic tools are static, providing log-file information about what happened to jobs on the cluster after they have run. Hadoop does not provide the ability to monitor, at a sufficiently granular level, what's happening while multiple jobs are running. As a result, it is difficult, and often impossible, to take corrective action before operational problems occur.
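To illustrate the gap: about the closest stock Hadoop comes is letting operators poll the ResourceManager's cluster-metrics endpoint themselves, which yields aggregate numbers only. A sketch, assuming a YARN cluster (the host name is a placeholder):

```python
import json
import time
import urllib.request

RM = "http://resourcemanager.example.com:8088"  # assumed host

def cluster_metrics():
    with urllib.request.urlopen(f"{RM}/ws/v1/cluster/metrics") as resp:
        return json.load(resp)["clusterMetrics"]

# Sample every 10 seconds. Note this shows aggregate load only -- there is
# no per-job or per-task breakdown at this endpoint, which is the point above.
for _ in range(30):
    m = cluster_metrics()
    used_pct = 100.0 * m["allocatedMB"] / m["totalMB"]
    print(f"apps running: {m['appsRunning']:>4}  pending: {m['appsPending']:>4}  "
          f"memory used: {used_pct:5.1f}%")
    time.sleep(10)
```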
4. Lack of Macro-Level Visibility, Control Over the Cluster
Various Hadoop diagnostic tools provide the ability to analyze individual job statistics and examine activity on individual nodes on the cluster. In addition, developers can tune their code to ensure optimal performance for individual jobs. What’s lacking, however, is the ability to monitor, analyze and control what’s happening with all users, jobs and tasks running on the entire cluster, including use of each hardware resource.
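A cluster-wide roll-up has to be assembled by hand. One sketch of doing so: pull every running application from the ResourceManager REST API and total allocated memory and vCPUs per user (the host name is assumed):

```python
import json
from collections import defaultdict
import urllib.request

RM = "http://resourcemanager.example.com:8088"  # assumed host

# Pull every RUNNING application and roll usage up by user -- a crude
# cluster-wide view that stock Hadoop tooling does not present directly.
with urllib.request.urlopen(f"{RM}/ws/v1/cluster/apps?states=RUNNING") as resp:
    apps = (json.load(resp).get("apps") or {}).get("app", [])

usage = defaultdict(lambda: {"mb": 0, "vcores": 0, "jobs": 0})
for a in apps:
    u = usage[a["user"]]
    u["mb"] += a.get("allocatedMB", 0)
    u["vcores"] += a.get("allocatedVCores", 0)
    u["jobs"] += 1

for user, u in sorted(usage.items(), key=lambda kv: -kv[1]["mb"]):
    print(f"{user:<15} {u['jobs']:>3} jobs  {u['mb']:>8} MB  {u['vcores']:>4} vcores")
```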
5. Insufficient Ability to Set and Enforce Job Priorities
6. Underutilized, Wasted Capacity
7. Insufficient Ability to Control Allocation of Resources Across a Cluster in Real Time
8. Lack of Granular View Into How Cluster Resources Are Used
When jobs crash or fail to complete on time, Hadoop operators and administrators have difficulty diagnosing performance problems. Hadoop does not provide a way of monitoring and analyzing cluster performance with sufficient context and detail. For example, it is impossible to isolate problems by user, job or task, or to pinpoint bottlenecks related to network, memory or disk.
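Task-level detail, when it exists at all, lives in the MapReduce JobHistory server and must be stitched together after the job has finished. A sketch against its REST API; the history-server host and job ID below are placeholders:

```python
import json
import urllib.request

JHS = "http://historyserver.example.com:19888"   # assumed JobHistory server
JOB_ID = "job_1400000000000_0001"                # placeholder job ID

# List the job's tasks and surface the slowest ones -- the kind of per-task
# digging that has to be done by hand, and only after the job has finished.
url = f"{JHS}/ws/v1/history/mapreduce/jobs/{JOB_ID}/tasks"
with urllib.request.urlopen(url) as resp:
    tasks = json.load(resp)["tasks"]["task"]

for t in sorted(tasks, key=lambda t: -t["elapsedTime"])[:10]:
    print(f"{t['id']}  type={t['type']}  state={t['state']}  "
          f"elapsed={t['elapsedTime'] / 1000:.1f}s")
```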
9. Inability to Predict When a Cluster Will Max Out
More jobs, different kinds of jobs, expanding data volumes, different data types, more complex queries and many other variables continuously increase the load on cluster resources over time. Often, the need for additional cluster resources isn't apparent until a disaster occurs (for example, a customer-facing website goes down or a mission-critical report doesn't run). Consequences can include unsatisfactory customer experiences, missed business opportunities, unplanned capital expense requests and more.
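Even a crude trend line beats learning about the ceiling from an outage. A back-of-the-envelope sketch that fits a line to daily utilization samples and extrapolates to 100 percent; the sample figures are illustrative, not measurements:

```python
# Extrapolate cluster memory utilization to estimate days until it maxes out.
# The daily samples below are illustrative, not real measurements.
daily_util_pct = [52.0, 53.1, 55.4, 56.0, 58.2, 59.9, 61.3, 62.8]

n = len(daily_util_pct)
days = range(n)
mean_x = sum(days) / n
mean_y = sum(daily_util_pct) / n

# Ordinary least-squares slope: percentage points of growth per day.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(days, daily_util_pct))
    / sum((x - mean_x) ** 2 for x in days)
)

if slope > 0:
    days_left = (100.0 - daily_util_pct[-1]) / slope
    print(f"Growing {slope:.2f} pts/day; ~{days_left:.0f} days until the cluster maxes out")
else:
    print("Utilization is flat or falling; no max-out projected")
```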
10. HBase and MapReduce Contention
Contention for system resources by HBase and MapReduce jobs can affect overall performance significantly. The inability to optimize resource utilization while these different types of workloads run concurrently leads many organizations to suffer the expense of deploying separate, dedicated clusters.
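On a co-located node, the contention often reduces to simple arithmetic: the RegionServer heap plus the memory YARN is allowed to hand out can exceed physical RAM. A sketch of that sanity check (every figure below is a hypothetical example):

```python
# Hypothetical per-node figures for a co-located HBase + MapReduce node.
node_ram_mb            = 65536   # physical memory
os_and_daemons_mb      = 8192    # OS, DataNode, NodeManager, etc.
hbase_regionserver_mb  = 16384   # RegionServer heap (-Xmx)
yarn_nm_allocatable_mb = 49152   # yarn.nodemanager.resource.memory-mb

committed = os_and_daemons_mb + hbase_regionserver_mb + yarn_nm_allocatable_mb
if committed > node_ram_mb:
    print(f"Overcommitted by {committed - node_ram_mb} MB: under concurrent "
          f"HBase and MapReduce load this node will swap or kill processes")
else:
    print(f"{node_ram_mb - committed} MB of headroom")
```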
11. Lack of Key Visual Dashboards
Key visual dashboards enable interactive exploration and fast diagnosis of performance-related issues on the cluster. The static reports and detailed log files provided with Hadoop schedulers and resource managers are not conducive to fast or easy problem diagnosis. Culling through voluminous data while troubleshooting can waste hours or even days. Hadoop operators need the ability to quickly visualize, analyze and understand the causes of performance problems and identify opportunities to optimize resource use.
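The building blocks for such a view are not exotic; what's missing is anything assembled out of the box. For instance, samples gathered by a polling loop like the one sketched earlier can be charted in a few lines of Python with matplotlib (the data here is illustrative):

```python
import matplotlib.pyplot as plt

# Minutes of cluster memory utilization, e.g. captured by a polling loop
# like the one sketched earlier. These values are illustrative only.
minutes = list(range(12))
memory_used_pct = [41, 43, 48, 71, 94, 96, 95, 93, 70, 55, 50, 47]

plt.plot(minutes, memory_used_pct, marker="o")
plt.axhline(90, linestyle="--", label="pressure threshold")
plt.xlabel("minutes")
plt.ylabel("cluster memory used (%)")
plt.title("Cluster memory utilization over time")
plt.legend()
plt.show()
```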