The Next Challenge for Hadoop: Quality of Service
What Is QoS for Hadoop?
Quality of service (QoS) for Hadoop is the ability to ensure performance service levels for applications running on Hadoop by prioritizing critical jobs and addressing problems like resource contention, missed deadlines and sluggish cluster performance. By avoiding bottlenecks and contention, multiple jobs can run side by side, effectively and without interference.
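For reference, stock Hadoop already exposes a coarse prioritization knob at the job level, though whether the scheduler honors it depends on how the cluster is configured. The following is a minimal Java sketch, not taken from any product described here; the job name is a hypothetical placeholder, and mapper, reducer and path setup are omitted:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobPriority;

public class CriticalJobSubmit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "billing-rollup" is a hypothetical name for a business-critical job.
        Job job = Job.getInstance(conf, "billing-rollup");

        // Ask the scheduler to favor this job when containers are scarce.
        // Whether this hint is honored depends on the configured scheduler.
        job.setPriority(JobPriority.VERY_HIGH);

        // Mapper, reducer and input/output paths would be set here as usual.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A static priority hint like this is exactly the kind of blunt instrument the rest of this piece argues is insufficient on its own: it says nothing about what each job is actually doing to CPU, memory, disk and network at any given second.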
Why QoS for Hadoop?
Many companies run into roadblocks when they try to guarantee performance: priority jobs aren't completed on time even while clusters sit underutilized. Resource contention is inevitable in today's multi-tenant, multi-workload clusters, especially as big data applications scale. Why is this a problem? On the business side, companies waste time and money fixing cluster performance issues that keep them from gaining the competitive advantages of their big data initiatives or realizing their full ROI. From a technological perspective, unreliable Hadoop performance means late jobs, missed service-level agreements, overbuilt clusters and underutilized hardware.
Hadoop, We Have a Problem
As organizations grow more advanced in their Hadoop use and run business-critical applications in multi-tenant clusters, they can no longer afford to let an increasingly difficult class of performance challenges obscure what is actually happening in their clusters, especially if they want to make the most of their distributed computing investments. Complex frameworks like YARN already place performance pressure on systems, and emerging compute platforms such as Mesos, OpenStack and Docker will eventually run into the same set of problems. It's vital that organizations get ahead of these issues now.
Getting Around Workarounds
Once a Hadoop cluster hits a performance wall, admins need a resolution but are discovering that traditional best practices and manual tuning workarounds just don't work. Over-provisioning, siloing and tuning are not lasting solutions; they are also expensive and create needless overhead. Purchasing additional nodes when hardware utilization is well below 100 percent is a costly, temporary fix that addresses performance symptoms, not the fundamental limitations of Hadoop. Similarly, cluster isolation is costly, doubles operational complexity and simply isn't viable at scale. Finally, tuning is by definition a response to problems that have already occurred, and it's impossible for a human to make the thousands of decisions necessary to tune settings in real time to adjust to constantly changing cluster conditions: even a modest 100-node cluster reporting a few dozen metrics per node each second generates thousands of data points every second, far more than any person can act on.
Going Real Time
The most effective answer to resource contention is to monitor, and act on, hardware resource usage in real time. Sampling the hardware resources of each node in the cluster second by second lets you see which job holds which resources at any moment and weigh that against each job's priority across the cluster. That is what makes it possible to give all jobs equitable access to cluster hardware while business-critical jobs finish on time, thereby guaranteeing QoS for Hadoop.
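To make the per-node side of this concrete, here is a minimal, hypothetical Java sketch that samples CPU and memory once per second using the JDK's com.sun.management.OperatingSystemMXBean. The class name and the 90 percent thresholds are illustrative assumptions; a real QoS controller would correlate these samples with the containers running on the node and throttle low-priority work rather than just logging:

```java
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

public class NodeResourceSampler {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            // System-wide CPU load, 0.0-1.0 (negative if unavailable).
            double cpu = os.getSystemCpuLoad();
            // Fraction of physical memory in use.
            double mem = 1.0 - (double) os.getFreePhysicalMemorySize()
                               / os.getTotalPhysicalMemorySize();
            System.out.printf("cpu=%.2f mem=%.2f%n", cpu, mem);

            // Illustrative policy hook: a real controller would map load
            // back to the jobs owning each container and slow the
            // low-priority ones so critical jobs keep their resources.
            if (cpu > 0.9 || mem > 0.9) {
                System.out.println("contention detected: protect high-priority jobs");
            }
            Thread.sleep(1000); // the second-by-second cadence described above
        }
    }
}
```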
QoS for Hadoop in Production
Companies like Trulia, Chartboost and Upsight are implementing systems that guarantee QoS for Hadoop and reaping the benefits. Trulia has successfully disrupted a decades-old industry by analyzing real-time data to deliver customized insights straight to consumers. With many teams writing Hadoop jobs or using Hive or Spark, Trulia has to ensure reliability in its multi-tenant, multi-workload environment. Previously, when delayed or unpredictable jobs affected its customer push-notification programs, Trulia would intentionally underutilize its clusters to ensure jobs completed on time and traffic wasn't negatively affected. Now, Trulia uses Pepperdata to actively monitor and control all of its Hadoop clusters.