Nine Best Practices for Working With Hadoop in the Enterprise


by Chris Preimesberger

Promote Collaboration Around Performance

When there's a performance problem on Hadoop, there can be several culprits: your code, your data, your hardware or the way you're sharing resources (your cluster configuration). At a startup, a single person—data scientist, developer and operator all rolled into one—might be responsible for all that. But at a large enterprise, multiple teams have to cooperate to figure out what went wrong and how to fix it. If you're managing a big data operation at a large, distributed organization, nurture collaboration by giving your team tools that let developers, operators and managers work together to address performance issues.

Make It Easy to Share Application Context Around Errors

When execution errors do arise, teams can spend hours tracking down what went wrong if all they've got are Hadoop's log files and Job Tracker. Invest in tools that help your team quickly connect errors to application context—where in your code they're happening—and share that information easily.

Monitor the Fleet, Not the Vehicle

To an operator running hundreds or thousands of applications on a Hadoop cluster, all of them look the same—until there's a problem. So you need tools that let you look at performance over groups of applications. Ideally, you should be able to segment performance tracking by application types, departments, teams and data-sensitivity levels.
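Fleet-level monitoring amounts to rolling individual job records up by whatever segment matters to you. A minimal sketch in Python, with entirely invented job records and field names (no real Hadoop API is being used here):

```python
from collections import defaultdict

# Hypothetical job records; the fields are illustrative, not from any real tool.
jobs = [
    {"app": "etl-sales",  "team": "finance",   "runtime_s": 840,  "status": "SUCCEEDED"},
    {"app": "etl-sales",  "team": "finance",   "runtime_s": 2100, "status": "SUCCEEDED"},
    {"app": "ml-scoring", "team": "marketing", "runtime_s": 300,  "status": "FAILED"},
    {"app": "ml-scoring", "team": "marketing", "runtime_s": 320,  "status": "SUCCEEDED"},
]

def fleet_summary(jobs, key):
    """Aggregate run counts, failures and runtime over one segment key
    (team, application type, data-sensitivity level, ...)."""
    summary = defaultdict(lambda: {"runs": 0, "failures": 0, "total_runtime_s": 0})
    for job in jobs:
        bucket = summary[job[key]]
        bucket["runs"] += 1
        bucket["failures"] += job["status"] == "FAILED"
        bucket["total_runtime_s"] += job["runtime_s"]
    return dict(summary)

by_team = fleet_summary(jobs, "team")  # the same call segments by "app" instead
```

The point of the sketch is the `key` parameter: the same aggregation serves every segmentation the slide mentions, so the hard work is tagging jobs consistently, not writing the rollup.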

Define and Enforce Service-Level Agreements

Monitoring a fleet still means knowing when an individual vehicle performs poorly. Similarly, operators need to set SLA bounds on performance and define alerts and escalation paths for when those bounds are violated. SLA bounds should incorporate both raw metadata, such as job status, and business-level events, such as sensitive data access. Successful practitioners of operational readiness also set up metrics that help predict future SLA violations, so they can proactively address and avoid them.
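Both kinds of SLA bound, and the predictive angle, can be sketched in a few lines of Python. The thresholds, job fields and the naive linear trend below are all illustrative assumptions, not part of any real monitoring product:

```python
# Illustrative SLA definition; thresholds are made up.
SLA = {"max_runtime_s": 1800}

def check_sla(job):
    """Return the list of SLA violations for one completed job."""
    violations = []
    # Raw-metadata bound: did the job run too long?
    if job["runtime_s"] > SLA["max_runtime_s"]:
        violations.append("runtime")
    # Business-level bound: did it touch sensitive data without approval?
    if job.get("touched_sensitive_data") and not job.get("access_approved"):
        violations.append("sensitive-data-access")
    return violations

def predict_runtime(history, horizon=1):
    """Naive linear projection of recent runtimes, to flag jobs that are
    trending toward an SLA violation before they actually cross the line."""
    if len(history) < 2:
        return history[-1] if history else 0
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + slope * horizon
```

For example, a job whose last three runs took 1,000, 1,200 and 1,400 seconds projects to 1,600 seconds next run: still inside the 1,800-second bound, but close enough to warrant attention before the violation happens.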

Understand Inter-App Dependencies

Large, traditional enterprises tend to run their Hadoop clusters as a shared service across many lines of business. As a result, each application has at least a few "roommates" in the cluster, some of which can be detrimental to its own performance. To understand the errant behavior of one Hadoop application, operators must understand what others were doing on the cluster when it ran. Therefore, provide your operations team with as much cluster-related context as possible.

Ration Your Cluster

To optimize cluster use and ROI, operators must ration resources on the cluster and enforce the limits. An operator can, for example, budget the mappers available for a particular application; if the application won't stay within that budget, rationing rules should prevent it from being deployed. Establishing and enforcing the rules for rationing cluster resources is vital for achieving meaningful operational readiness and meeting SLA commitments.
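In Hadoop itself, this kind of rationing is typically expressed through scheduler configuration (queue capacities and the like); the enforcement logic, though, is simple enough to sketch as a pre-deployment gate. The team budgets and job descriptor below are invented for illustration:

```python
# Hypothetical mapper budgets per team; numbers are illustrative.
MAPPER_BUDGETS = {"finance": 200, "marketing": 80}

def may_deploy(job):
    """Refuse deployment when a job requests more mappers than its
    team's ration allows. Unknown teams get no budget at all."""
    budget = MAPPER_BUDGETS.get(job["team"], 0)
    return job["requested_mappers"] <= budget

may_deploy({"team": "marketing", "requested_mappers": 120})  # -> False: over budget
```

The design choice worth noting is that the gate runs before deployment rather than after a violation: rationing that only detects overuse after the fact has already cost the cluster the resources it was supposed to protect.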

Trace Data Access at the Operational Level

Good Hadoop management isn't only about rationing compute resources; it also means regulating access to sensitive data, especially in industries with heightened privacy concerns like health care, insurance and financial services. Solving for data lineage and governance in an unstructured environment like Hadoop is difficult. Traditional techniques that manually maintain a metadata dictionary quickly lead to stale repositories, and they offer no way to prove that a production dataset depends on some fields and not on others. As a result, visibility and enforcement on the use of data fields are required at the operational level. If you can reliably track if and when a data field is accessed by an app, your compliance teams will be happy.
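One way to get that field-level visibility is to wrap records so that every field read is logged as it happens, rather than trusting a hand-maintained dictionary. The sketch below is a toy illustration of the idea, with an in-memory list standing in for a real audit sink:

```python
import time

class AuditedRecord:
    """Illustrative wrapper that records which fields an app actually reads.
    A real deployment would write to a durable audit log, not a list."""

    def __init__(self, record, audit_log):
        self._record = record
        self._audit_log = audit_log

    def __getitem__(self, field):
        # Log the field name and access time before returning the value.
        self._audit_log.append((field, time.time()))
        return self._record[field]

audit_log = []
row = AuditedRecord({"ssn": "000-00-0000", "zip": "94105"}, audit_log)
_ = row["zip"]  # only "zip" is read, so only "zip" appears in the log

accessed_fields = {field for field, _ts in audit_log}
```

Because the log reflects actual reads, it can also prove the negative the slide mentions: a production dataset whose jobs never log an access to a field demonstrably does not depend on that field.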

Record Data Misfires

Compliance professionals at large enterprises also want proof that a Hadoop application processed every record in a dataset, and they look for documentation when it fails to do so. Failures can result from format changes in upstream datasets or plain old data corruption. Keeping track of all records that the application failed to process is particularly vital in regulated industries.
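The usual pattern is to quarantine malformed records with enough context to document the failure, instead of silently dropping them. A minimal sketch, with an invented comma-separated record format:

```python
def process_records(lines):
    """Parse each input line; route malformed records to a quarantine
    list (with line number and error) instead of dropping them."""
    good, quarantined = [], []
    for lineno, line in enumerate(lines, 1):
        try:
            user_id, amount = line.split(",")
            good.append((user_id, float(amount)))
        except ValueError as err:
            # Keep the evidence compliance teams will ask for.
            quarantined.append((lineno, line, str(err)))
    return good, quarantined

good, bad = process_records(["alice,10.5", "corrupted-row", "bob,notanumber"])
```

After the run, `len(good) + len(bad)` equals the input record count, which is exactly the proof of complete processing the slide describes: every record is accounted for, either as output or as a documented misfire.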

Tune Your Engine Before You Replace It

With new compute fabrics emerging all the time, teams are sometimes too quick to junk their old ones in pursuit of better performance. However, it's often the case that you can achieve equal or greater performance gains just by optimizing code and data flows on your existing fabrics. That way, you can avoid expensive infrastructure upgrades unless they're truly necessary.
