Why Hadoop Analytics Projects Often Fall Short of Their Goals

 
 
By Chris Preimesberger  |  Posted 2014-08-08
 
 
 
 
 
 
 
 

No enterprise IT solution is perfect; for every advantage a product brings to the table, there is usually a trade-off of some kind. While the batch-analytics framework Hadoop has proven to be a valuable resource to businesses across a range of industries, its limitations also have come to the fore. Though these concerns often seem minor at first, they can soon become problematic: some of Hadoop's most common operational challenges create serious and costly business obstacles. Sean Suchter, co-founder and CEO of Pepperdata, which makes real-time cluster-optimization software, managed the first commercial use of Hadoop while on Yahoo's Web search engineering team and is all too familiar with the technology's business-impacting limitations. To optimize Hadoop, companies must be well-acquainted ahead of time with its most common shortcomings, many of which affect reliability, predictability and visibility. This slide show, developed with eWEEK reporting and insight from Suchter, examines key reasons Hadoop can fall short of desired performance.

 
 
 
  • Jobs Crash or Fail to Be Completed on Time

    Hadoop deployments often start as "sandbox" environments. Over time, the workload expands, and an increasing number of jobs on the cluster support production applications that are governed by service-level agreements. Ad hoc queries and other low-priority jobs then compete with business-critical applications for system resources, so high-priority jobs fail to complete when they are needed.
  • No Ability to Monitor Cluster Performance in Real Time

    Hadoop diagnostic tools are static, providing log file information about what happened to jobs on the cluster only after they have run. Hadoop does not provide the ability to monitor, at a sufficiently granular level, what is happening while multiple jobs are running. As a result, it is difficult and often impossible to take corrective action in time to head off operational problems.
  • Lack of Macro-Level Visibility, Control Over the Cluster

    Various Hadoop diagnostic tools provide the ability to analyze individual job statistics and examine activity on individual nodes on the cluster. In addition, developers can tune their code to ensure optimal performance for individual jobs. What's lacking, however, is the ability to monitor, analyze and control what's happening with all users, jobs and tasks running on the entire cluster, including use of each hardware resource.
  • Insufficient Ability to Set and Enforce Job Priorities

    While job schedulers and resource managers provide basic capabilities such as job sequencing, time- and event-based scheduling and node allocation, they do not ensure that cluster resources are used in the most efficient manner while jobs are running.
  • Underutilized, Wasted Capacity

    Organizations typically size their clusters for peak workloads. The extra capacity is rarely used, which makes it very expensive and often unnecessary.
  • Insufficient Ability to Control Allocation of Resources Across a Cluster in Real Time

    When rogue jobs, inefficient or expensive queries, or other processes running on the cluster adversely impact performance, it is often too late for Hadoop operators to take the necessary corrective actions before service-level agreements are missed.
  • Lack of Granular View Into How Cluster Resources Are Used

    When jobs crash or fail to complete on time, Hadoop operators/administrators have difficulty diagnosing performance problems. Hadoop does not provide a way of monitoring and analyzing cluster performance with sufficient context and detail. For example, it is impossible to isolate problems by user, job, or task and pinpoint bottlenecks related to network, memory or disk.
  • Inability to Predict When a Cluster Will Max Out

    More jobs, different kinds of jobs, expanding data volumes, different data types, more complex queries and many other variables continuously increase the load on cluster resources over time. Often, the need for additional cluster resources isn't apparent until a disaster occurs (for example, a customer-facing Website goes down or a mission-critical report doesn't run). Consequences can include unsatisfactory customer experiences, missed business opportunities, unplanned capital expense requests and more.
  • HBase and MapReduce Contention

    Contention for system resources by HBase and MapReduce jobs can affect overall performance significantly. The inability to optimize resource utilization while these different types of workloads run concurrently leads many organizations to suffer the expense of deploying separate, dedicated clusters.
  • Lack of Key Visual Dashboards

    What is missing are dashboards that enable interactive exploration and fast diagnosis of performance-related issues on the cluster. The static reports and detailed log files provided with Hadoop schedulers and resource managers are not conducive to fast or easy problem diagnosis. Combing through voluminous data while troubleshooting can waste hours or even days. Hadoop operators need the ability to quickly visualize, analyze and understand the causes of performance problems and identify opportunities to optimize resource use.
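
The capacity-planning problem described under "Inability to Predict When a Cluster Will Max Out" can be illustrated with a small sketch. It is not part of Hadoop or of any vendor tool; it assumes an operator has already collected one peak-utilization sample per day (as a fraction of total cluster capacity) from some monitoring source, fits a straight line to those samples and extrapolates to the point where the line reaches full capacity. The function name `days_until_full` and the sample figures are hypothetical, and real workloads rarely grow this smoothly.

```python
from typing import Optional, Sequence


def days_until_full(daily_peak_utilization: Sequence[float],
                    capacity: float = 1.0) -> Optional[float]:
    """Estimate days from the last sample until the trend reaches `capacity`.

    Fits an ordinary least-squares line to one sample per day and
    extrapolates it. Returns None when the trend is flat or falling.
    """
    n = len(daily_peak_utilization)
    if n < 2:
        raise ValueError("need at least two daily samples")

    # Day indices are 0, 1, ..., n - 1, so their mean is (n - 1) / 2.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_peak_utilization) / n

    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_peak_utilization))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no upward trend, so no projected max-out date

    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope  # day index where the line hits capacity
    return max(0.0, day_full - (n - 1))        # measured from the most recent sample


if __name__ == "__main__":
    # Hypothetical data: peak utilization growing two points per day.
    samples = [0.50, 0.52, 0.54, 0.56, 0.58]
    print(round(days_until_full(samples)))
```

With the made-up samples above, the fitted line reaches 100 percent roughly 21 days after the last measurement. A straight-line fit is the simplest possible model; it is used here only to show that a trend has to be tracked over time, since the shortfall otherwise goes unnoticed until a job fails.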
 
 
 
 
 
 
 
 
 
 
 
