2. Promote Collaboration Around Performance
When there’s a performance problem on Hadoop, there can be several culprits: your code, your data, your hardware or the way you’re sharing resources (your cluster configuration). At a startup, a single person—data scientist, developer and operator all rolled into one—might be responsible for all that. But at a large enterprise, multiple teams have to cooperate to figure out what went wrong and how to fix it. If you’re managing a big data operation at a large, distributed organization, nurture collaboration by giving your team tools that let developers, operators and managers work together to address performance issues.
3. Make It Easy to Share Application Context Around Errors
4. Monitor the Fleet, Not the Vehicle
To an operator running hundreds or thousands of applications on a Hadoop cluster, all of them look the same—until there’s a problem. So you need tools that let you look at performance over groups of applications. Ideally, you should be able to segment performance tracking by application types, departments, teams and data-sensitivity levels.
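Segmenting fleet metrics by tags can be sketched in a few lines. This is a minimal illustration, not a real monitoring API: the record fields (`team`, `type`, `runtime_s`) and sample values are assumptions chosen to show the grouping idea.

```python
from collections import defaultdict

# Hypothetical per-application metric records; field names are illustrative.
apps = [
    {"app_id": "a1", "team": "fraud",  "type": "etl",   "runtime_s": 420},
    {"app_id": "a2", "team": "fraud",  "type": "etl",   "runtime_s": 480},
    {"app_id": "a3", "team": "retail", "type": "adhoc", "runtime_s": 95},
]

def segment_avg_runtime(apps, key):
    """Average runtime per segment (team, application type, etc.)."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [sum, count]
    for app in apps:
        acc = totals[app[key]]
        acc[0] += app["runtime_s"]
        acc[1] += 1
    return {seg: total / count for seg, (total, count) in totals.items()}

print(segment_avg_runtime(apps, "team"))
# {'fraud': 450.0, 'retail': 95.0}
```

The same function works for any tag dimension: pass `"type"` instead of `"team"` to roll up by application type, or add a `sensitivity` field to each record and segment on that.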
5. Define and Enforce Service-Level Agreements
Monitoring a fleet still means knowing when an individual vehicle performs poorly. Similarly, operators need to set SLA bounds on performance and define alerts and escalation paths for when they’re violated. SLA bounds should incorporate both raw metadata, such as job status, and business-level events, such as sensitive data access. Successful practitioners of operational readiness also set up metrics that help predict future SLA violations, so they can proactively address and avoid them.
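An SLA check over both kinds of signal might look like the following sketch. The rule names, thresholds, and job-record fields are all assumptions for illustration; a real system would pull these from the scheduler and an audit log.

```python
# Illustrative SLA rules: thresholds and field names are assumptions.
SLA = {"max_runtime_s": 3600, "allowed_statuses": {"SUCCEEDED"}}

def sla_violations(job):
    """Return violation messages for one job record, mixing raw job
    metadata (status, runtime) with business-level events (data access)."""
    violations = []
    if job["runtime_s"] > SLA["max_runtime_s"]:
        violations.append(
            f"runtime {job['runtime_s']}s exceeds {SLA['max_runtime_s']}s")
    if job["status"] not in SLA["allowed_statuses"]:
        violations.append(f"status {job['status']} violates SLA")
    if job.get("accessed_sensitive_data") and not job.get("access_approved"):
        violations.append("unapproved access to sensitive data")
    return violations

job = {"runtime_s": 4000, "status": "SUCCEEDED",
       "accessed_sensitive_data": True}
for v in sla_violations(job):
    print("ALERT:", v)  # feeds an alerting / escalation system
```

Predictive metrics fit the same shape: track runtime trends per application and alert when a job is on pace to cross `max_runtime_s`, before the SLA is actually breached.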
6. Understand Inter-App Dependencies
Large, traditional enterprises tend to run their Hadoop clusters as a shared service across many lines of business. As a result, each application has at least a few “roommates” in the cluster, some of which can be detrimental to its own performance. To understand the errant behavior of one Hadoop application, operators must understand what others were doing on the cluster when it ran. Therefore, provide your operations team with as much cluster-related context as possible.
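Reconstructing that context starts with a simple question: which applications overlapped the problem job's run window? A minimal interval-overlap sketch, with hypothetical run records:

```python
def concurrent_apps(target, runs):
    """Return IDs of apps whose [start, end) window overlaps the target's run.
    Two intervals overlap iff each starts before the other ends."""
    return [
        r["app_id"] for r in runs
        if r["app_id"] != target["app_id"]
        and r["start"] < target["end"]
        and target["start"] < r["end"]
    ]

# Illustrative run records (timestamps in minutes for readability).
runs = [
    {"app_id": "etl-1",   "start": 0,  "end": 60},
    {"app_id": "adhoc-2", "start": 30, "end": 90},
    {"app_id": "ml-3",    "start": 70, "end": 120},
]
print(concurrent_apps(runs[0], runs))  # ['adhoc-2']
```

Joining those concurrent app IDs against their resource usage is what turns "my job was slow at 2 a.m." into "my job was slow because a noisy roommate held most of the cluster at 2 a.m."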
7. Ration Your Cluster
To optimize cluster use and ROI, operators must ration resources on the cluster and enforce the limits. An operator can budget mappers for the execution of a particular application, and if the application doesn’t stay within those limits, rationing rules should prevent it from being deployed. Establishing and enforcing the rules for rationing cluster resources is vital for achieving meaningful operational readiness and meeting SLA commitments.
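The enforcement side of rationing reduces to an admission check at deploy time. A minimal sketch, assuming per-team mapper budgets (the budget numbers and team names are invented for illustration):

```python
# Hypothetical per-team mapper budgets; real budgets would come from
# scheduler configuration (e.g. queue capacities).
BUDGETS = {"fraud": 200, "retail": 50}

def can_deploy(team, requested_mappers, in_use):
    """Admit a deployment only if the team's total mapper usage,
    including this request, stays within its budget."""
    budget = BUDGETS.get(team, 0)  # unknown teams get no capacity
    return in_use + requested_mappers <= budget

print(can_deploy("retail", requested_mappers=30, in_use=10))  # True
print(can_deploy("retail", requested_mappers=60, in_use=10))  # False
```

In practice this is the role capacity-aware schedulers play: the check runs automatically on every submission rather than relying on teams to police themselves.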
8. Trace Data Access at the Operational Level
Good Hadoop management isn’t only about rationing compute resources; it also means regulating access to sensitive data, especially in industries with heightened privacy concerns like health care, insurance and financial services. Solving for data lineage and governance in an unstructured environment like Hadoop is difficult. Traditional techniques of manually maintaining a metadata dictionary quickly lead to stale repositories, and they offer no way to prove that a production dataset is dependent on some fields and not on others. As a result, visibility and enforcement on the use of data fields are required at the operational level. If you can reliably track if and when a data field is accessed by an app, your compliance teams will be happy.
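One way to picture field-level tracking is to instrument record access itself, so the audit trail records which fields an application actually read. This is a toy sketch of the idea, not a production governance tool; the class and field names are illustrative.

```python
class AuditedRecord(dict):
    """Dict wrapper that logs every field an application reads,
    providing operational-level proof of which fields were touched."""
    access_log = []

    def __getitem__(self, field):
        AuditedRecord.access_log.append(field)
        return super().__getitem__(field)

record = AuditedRecord({"name": "Ada", "ssn": "000-00-0000", "zip": "94105"})
_ = record["name"]
_ = record["zip"]
print(AuditedRecord.access_log)  # ['name', 'zip'] -- 'ssn' was never read
```

That last line is exactly the evidence compliance teams want: a positive record that the production job depended on `name` and `zip` and never touched `ssn`, without anyone hand-maintaining a metadata dictionary.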
9. Record Data Misfires
Compliance professionals at large enterprises also want proof that a Hadoop application processed every record in a dataset, and they look for documentation when it fails to do so. Failures can result from format changes in upstream data sets or plain old data corruption. Keeping track of all records that the application failed to process is particularly vital in regulated industries.
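Capturing misfires means refusing to let a bad record fail silently or kill the whole job: every record either processes or lands in an audit list with its reason. A minimal sketch (the processing step here, parsing integers, is a stand-in for real record handling):

```python
def process_all(records, process):
    """Process every record; collect failures with their positions and
    reasons so compliance can see exactly what was skipped and why."""
    processed, misfires = [], []
    for i, rec in enumerate(records):
        try:
            processed.append(process(rec))
        except Exception as exc:  # e.g. upstream format change, corruption
            misfires.append({"index": i, "record": rec, "error": str(exc)})
    return processed, misfires

# Stand-in data: the third record simulates a corrupted upstream value.
ok, bad = process_all(["42", "17", "not-a-number"], int)
print(len(ok), len(bad))  # 2 1
print(bad[0]["record"])   # not-a-number
```

The misfire list doubles as documentation: counts per job prove completeness when they're zero, and point straight at the offending records when they're not.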
10. Tune Your Engine Before You Replace It
With new compute fabrics emerging all the time, teams are sometimes too quick to junk their old ones in pursuit of better performance. However, it’s often the case that you can achieve equal or greater performance gains just by optimizing code and data flows on your existing fabrics. That way, you can avoid expensive infrastructure upgrades unless they’re truly necessary.