Why Hadoop Analytics Projects Often Fall Short of Their Goals

by Chris Preimesberger

Jobs Crash or Fail to Be Completed on Time

Hadoop deployments often start as "sandbox" environments. Over time, the workload expands, and an increasing number of jobs on the cluster support production applications that are governed by service-level agreements. Ad hoc queries and other low-priority jobs compete with business-critical applications for system resources, causing high-priority jobs to fail to complete when needed.

No Ability to Monitor Cluster Performance in Real Time

Hadoop diagnostic tools are static, providing log file information about what happened to jobs on the cluster after they have run. Hadoop does not provide the ability to monitor at a sufficiently granular level what's happening while multiple jobs are running. As a result, it is difficult and often impossible to take corrective action to prevent operational problems before they occur.
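
To make the gap concrete, the short Python sketch below shows the kind of polling a team typically has to build for itself, assuming a YARN-based cluster and a ResourceManager at the hypothetical host rm.example.com. It lists which applications are running, who submitted them and how much memory each holds, but not the task-level, real-time detail described as missing here.

    # Minimal sketch, assuming YARN's ResourceManager REST API
    # (/ws/v1/cluster/apps); rm.example.com is a placeholder host.
    import json
    import time
    import urllib.request

    RM_APPS_URL = "http://rm.example.com:8088/ws/v1/cluster/apps?state=RUNNING"

    def poll_running_apps():
        with urllib.request.urlopen(RM_APPS_URL, timeout=10) as resp:
            payload = json.load(resp)
        apps = (payload.get("apps") or {}).get("app", [])
        for app in apps:
            # Report who is running what, in which queue, and how much memory it holds.
            print(app["id"], "user=" + app["user"], "queue=" + app["queue"],
                  "allocatedMB=" + str(app.get("allocatedMB", "n/a")))

    if __name__ == "__main__":
        while True:  # crude polling loop; a real monitor would aggregate, chart and alert
            poll_running_apps()
            time.sleep(30)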

Lack of Macro-Level Visibility, Control Over the Cluster

Various Hadoop diagnostic tools provide the ability to analyze individual job statistics and examine activity on individual nodes on the cluster. In addition, developers can tune their code to ensure optimal performance for individual jobs. What's lacking, however, is the ability to monitor, analyze and control what's happening with all users, jobs and tasks running on the entire cluster, including use of each hardware resource.
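
For contrast with the per-job statistics those tools expose, the sketch below reads the ResourceManager's cluster-wide counters (again assuming a YARN-based cluster and the hypothetical host rm.example.com). It yields only aggregate totals, such as memory allocated across the cluster, not the per-user, per-job, per-resource breakdown this slide describes as missing.

    # Minimal sketch of a cluster-wide snapshot via /ws/v1/cluster/metrics.
    import json
    import urllib.request

    METRICS_URL = "http://rm.example.com:8088/ws/v1/cluster/metrics"

    def cluster_snapshot():
        with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
            m = json.load(resp)["clusterMetrics"]
        memory_used_pct = 100.0 * m["allocatedMB"] / m["totalMB"] if m["totalMB"] else 0.0
        return {
            "apps_running": m["appsRunning"],
            "active_nodes": m["activeNodes"],
            "containers_allocated": m["containersAllocated"],
            "memory_used_pct": round(memory_used_pct, 1),
        }

    if __name__ == "__main__":
        print(cluster_snapshot())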

Insufficient Ability to Set and Enforce Job Priorities

While job schedulers and resource managers provide basic capabilities such as job sequencing, time- and event-based scheduling, and node allocation, they fall short of ensuring that cluster resources are used as efficiently as possible while jobs are actually running.
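
On YARN clusters, for example, the main built-in lever is a static split of capacity into scheduler queues. The fragment below is an illustrative capacity-scheduler.xml sketch; the queue names "production" and "adhoc" and the 80/20 split are assumptions for illustration, and the shares are fixed ahead of time rather than enforced dynamically while jobs run.

    <!-- Illustrative CapacityScheduler queue layout; names and percentages are
         assumptions, not recommendations. -->
    <configuration>
      <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>production,adhoc</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.production.capacity</name>
        <value>80</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
        <value>20</value>
      </property>
      <property>
        <!-- Cap how far ad hoc work can expand beyond its guaranteed share. -->
        <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
        <value>40</value>
      </property>
    </configuration>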

Underutilized, Wasted Capacity

Organizations typically size their clusters for peak workloads. The resulting extra capacity is rarely used, yet it can be very expensive and is often unnecessary.

Insufficient Ability to Control Allocation of Resources Across a Cluster in Real Time

When rogue jobs, inefficient or expensive queries, or other processes running on the cluster adversely impact performance, Hadoop operators often discover the problem too late to take corrective action before service-level agreements are missed.

Lack of Granular View Into How Cluster Resources Are Used

When jobs crash or fail to complete on time, Hadoop operators and administrators have difficulty diagnosing performance problems because Hadoop does not provide a way to monitor and analyze cluster performance with sufficient context and detail. For example, it is impossible to isolate problems by user, job or task, or to pinpoint bottlenecks related to network, memory or disk.

Inability to Predict When a Cluster Will Max Out

More jobs, different kinds of jobs, expanding data volumes, different data types, more complex queries and many other variables continuously increase the load on cluster resources over time. Often, the need for additional cluster resources isn't apparent until a disaster occurs (for example, a customer-facing Website goes down or a mission-critical report doesn't run). Consequences can include unsatisfactory customer experiences, missed business opportunities, unplanned capital expense requests and more.

HBase and MapReduce Contention

Contention between HBase and MapReduce jobs for system resources can significantly degrade overall performance. Because there is no way to optimize resource utilization while these different types of workloads run concurrently, many organizations bear the expense of deploying separate, dedicated clusters.

Lack of Key Visual Dashboards

Missing here are dashboards that enable interactive exploration and fast diagnosis of performance issues on the cluster. The static reports and detailed log files provided with Hadoop schedulers and resource managers do not lend themselves to quick or easy troubleshooting, and combing through voluminous data can waste hours or even days. Hadoop operators need to be able to quickly visualize, analyze and understand the causes of performance problems and identify opportunities to optimize resource use.
