10 Best Practices for Managing Modern Data in Motion
Limit Hand Coding As Much As Possible
It has been commonplace to write custom code to ingest data from sources into your data store. This practice is inadequate given the unstable nature of big data. Custom code makes dataflows brittle: minor changes to the data schema can cause the pipeline to drop data or fail altogether. Today, there are ingest systems that provide code-free, plug-and-play connectivity between data source types, intermediate processing systems (such as Kafka and other message queues) and your data store. The benefits of such a system are flexibility instead of brittleness, visibility instead of an opaque black box, and the agility to upgrade parts of your operation without manual coding.
Replace Specifying Schema With Capturing Intent
While full schema specification is standard practice in the traditional data world, applying it to big data wastes engineering time and resources. Consuming applications will make use of only a few important fields for analysis, and in many cases the data will have no schema, or its schema will be poorly defined and unstable over time. Rather than relying on full schema specification, dataflow systems must be intent-driven: you specify conditions for, and transformations on, only those fields that matter to analysis and simply pass through the rest of the data as is. Being intent-driven is a minimalist approach that reduces the work and time required to develop and implement pipelines.
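As a minimal sketch of the intent-driven idea, the following Python example transforms only the fields named in an intent mapping and passes every other field through unchanged. The function name `apply_intent`, the field names and the transformations are hypothetical, chosen for illustration only; they are not taken from any particular dataflow product.

```python
from typing import Any, Callable, Dict

# Hypothetical "intent": only the fields listed here are touched.
Intent = Dict[str, Callable[[Any], Any]]


def apply_intent(record: Dict[str, Any], intent: Intent) -> Dict[str, Any]:
    """Transform the fields named in `intent`; pass all other fields through as is."""
    out = dict(record)  # shallow copy so unknown fields survive untouched
    for field, transform in intent.items():
        if field in out:  # a field that is absent is simply skipped
            out[field] = transform(out[field])
    return out


# Illustrative intent: analysis needs a numeric amount and an upper-case country code.
intent: Intent = {
    "amount": float,
    "country": lambda value: str(value).upper(),
}

record = {"amount": "19.99", "country": "us", "session": {"id": "abc"}, "new_field": 1}
result = apply_intent(record, intent)
```

Because nothing outside the intent is validated, a source that starts emitting an extra field such as `new_field` does not break the pipeline under this approach.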
Sanitize Before Storing
In the Hadoop world, it has been an article of faith that you should store untouched immutable raw data in your store. The downside of this practice is twofold: First, it can leave you with known dirty data in the store that engineers must inefficiently clean for each consumption activity. Second, it means you will have sensitive information in your data store that creates security risks and compliance violations. With modern dataflow systems, you can and should sanitize your data upon ingest. Sanitizing data as close to the data source as possible makes data scientists more productive by providing consumption-ready data to the multiple applications that may need it.
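One way to picture sanitizing on ingest is a small function applied to each record before it reaches the store. The sketch below is an assumption-laden illustration: the sensitive field names, the masking token and the simple email pattern are all hypothetical, and a real deployment would need rules that match its own compliance requirements.

```python
import re
from typing import Any, Dict

# Hypothetical list of fields that must never reach the store in clear text.
SENSITIVE_FIELDS = {"ssn", "credit_card"}

# Deliberately simple pattern for illustration; it is not a full email validator.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def sanitize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Mask sensitive fields and tidy free-text values before storage."""
    clean: Dict[str, Any] = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            clean[key] = "<masked>"  # drop the sensitive value entirely
        elif isinstance(value, str):
            # Trim stray whitespace and redact emails embedded in free text.
            clean[key] = EMAIL_RE.sub("<redacted-email>", value.strip())
        else:
            clean[key] = value  # non-string values pass through unchanged
    return clean


raw = {"ssn": "123-45-6789", "note": "  mail bob@example.com now ", "n": 3}
stored = sanitize(raw)
```

Masking here is irreversible by design; if downstream consumers need to join on a sensitive value, tokenization or a keyed hash would be a different design choice.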
Expect and Deal With Data Drift
An insidious challenge of big data management is dealing with data drift: the unpredictable, unavoidable and continuous mutation of data characteristics caused by the operation and maintenance of the systems producing the data. It shows up in three forms: structural drift, semantic drift and infrastructure drift. Data drift erodes data fidelity, the reliability of data operations and ultimately the productivity of your data scientists and engineers. It increases your costs, delays your time to analysis and leads to poor decision-making. To deal with drift, you must implement tools and processes that detect unexpected changes in data and trigger the needed remedial action, such as raising alerts, coercing data to appropriate values or trapping data for special handling.
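The three remedial actions named above (alerting, coercing and trapping) can be sketched for the structural case as follows. This is only an illustration under stated assumptions: the expected field types in `EXPECTED` and the function name `handle_drift` are hypothetical, and semantic or infrastructure drift would need different checks.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical expectation for the fields the pipeline depends on.
EXPECTED = {"user_id": int, "amount": float}


def handle_drift(
    record: Dict[str, Any], expected: Dict[str, type] = EXPECTED
) -> Tuple[Optional[Dict[str, Any]], List[str]]:
    """Return (record, alerts); the record is None when it must be trapped."""
    alerts: List[str] = []
    fixed = dict(record)

    # Unknown fields are reported but passed through.
    for name in sorted(set(record) - set(expected)):
        alerts.append(f"new field: {name}")

    for name, typ in expected.items():
        if name not in record:
            alerts.append(f"missing field: {name}")
            return None, alerts  # trap: a required field has disappeared
        value = record[name]
        if not isinstance(value, typ):
            try:
                fixed[name] = typ(value)  # coerce to the expected type
                alerts.append(f"coerced {name} from {type(value).__name__}")
            except (TypeError, ValueError):
                alerts.append(f"cannot coerce {name}: {value!r}")
                return None, alerts  # trap: the value cannot be repaired
    return fixed, alerts


ok_record, ok_alerts = handle_drift({"user_id": "7", "amount": 3, "plan": "pro"})
bad_record, bad_alerts = handle_drift({"user_id": "abc", "amount": 1.0})
```

Trapped records would typically be routed to a separate location for inspection rather than discarded, so that nothing is silently lost.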
Avoid Managed File Transfer as Your Ingestion Strategy
Data sets are increasingly unbounded and continuous. They consist of ever-changing logs, clickstreams and sensor output. Relying on file transfers or other rudimentary mechanisms is a fragile choice that will eventually force a redesign of your infrastructure. Because the contents of files vary in size, structure and format, files are very hard to inspect on the fly. If you’re intent on using a file transfer mechanism, consider pre-processing that standardizes the data format to allow inspection, or adopt an ingestion tool or framework that does this for you.
Instrument Everything in Your Dataflows
You can never have too much visibility into a complex dataflow system. Instrumentation guides you as you contend with evolving dataflows and helps keep you a step ahead of these inevitable changes. It is needed not only for time-series analysis of a single dataflow to tease out changes over time. More importantly, instrumentation lets you correlate data across flows to identify interesting events in real time. Organizations must capture details of every aspect of the system to the extent possible without introducing significant overhead.
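A lightweight way to instrument a dataflow is to wrap each stage so that successes, errors and elapsed time are recorded without changing the stage itself. The class below is a minimal sketch; the `StageMetrics` name and the metric keys are hypothetical, and a production system would more likely export these values to a metrics backend than hold them in memory.

```python
import time
from collections import defaultdict
from typing import Any, Callable, DefaultDict


class StageMetrics:
    """Collects per-stage counters and cumulative processing time."""

    def __init__(self) -> None:
        self.counts: DefaultDict[str, int] = defaultdict(int)
        self.seconds: DefaultDict[str, float] = defaultdict(float)

    def instrument(self, stage: str, func: Callable[[Any], Any]) -> Callable[[Any], Any]:
        """Return a wrapper around `func` that records outcomes under `stage`."""

        def wrapped(record: Any) -> Any:
            start = time.perf_counter()
            try:
                result = func(record)
                self.counts[f"{stage}.ok"] += 1
                return result
            except Exception:
                self.counts[f"{stage}.error"] += 1
                raise  # instrumentation must not hide failures
            finally:
                self.seconds[stage] += time.perf_counter() - start

        return wrapped


metrics = StageMetrics()
parse = metrics.instrument("parse", int)  # `int` stands in for a real parsing stage

for raw in ["1", "2", "oops"]:
    try:
        parse(raw)
    except ValueError:
        pass  # the error is already counted by the wrapper
```

Because every stage reports under its own name, counters from several flows can later be compared side by side, which is the cross-flow correlation the paragraph above describes.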
Don’t Just Count Packages, Inspect the Contents
Not that the TSA is a popular metaphor, but would you feel more secure if they solely counted passengers and luggage “ingested” rather than actually scanning for unusual contents? Yet the traditional metrics for data movement measure only the throughput and latency of the packages. The reality of data drift means that you’re much better off if you also analyze the values of the data itself as it flows into your systems. Otherwise, you leave yourself blind to the fact that your dataflow operation is at risk because the format or meaning of the data has changed without your knowledge.
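To illustrate the difference between counting and inspecting, the sketch below profiles the values of one field in two batches that have the same record count. The `profile` and `looks_suspicious` helpers, the `temp` field and the threshold are hypothetical and only show the general shape of a content check.

```python
from collections import Counter
from typing import Any, Dict, List


def profile(records: List[Dict[str, Any]], field: str) -> Dict[str, Any]:
    """Summarize the values of `field`: record count, null rate and value types."""
    values = [r.get(field) for r in records]
    total = len(values)
    nulls = sum(v is None for v in values)
    return {
        "count": total,
        "null_rate": nulls / total if total else 0.0,
        "types": Counter(type(v).__name__ for v in values if v is not None),
    }


def looks_suspicious(
    baseline: Dict[str, Any], current: Dict[str, Any], max_null_increase: float = 0.1
) -> bool:
    """Flag a batch whose null rate jumped or whose value types changed."""
    null_jump = current["null_rate"] - baseline["null_rate"] > max_null_increase
    type_change = set(current["types"]) != set(baseline["types"])
    return null_jump or type_change


baseline = profile([{"temp": 20.5}, {"temp": 21.0}], "temp")
current = profile([{"temp": "20.5"}, {"temp": None}], "temp")

same_count = baseline["count"] == current["count"]  # throughput alone looks healthy
flagged = looks_suspicious(baseline, current)       # the contents say otherwise
```

Here both batches contain two records, so a throughput metric would report nothing unusual, while the content profile shows that numbers have turned into strings and nulls.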
Implement a DevOps Approach to Data in Motion
The DevOps paradigm of an agile workflow with tight linkages between those who design a system and those who run it is well suited to the complex systems required for big data. In a world where data sources, consumption use cases and data-processing systems continually evolve, data pipelines will need to be adjusted frequently. Fortunately, unlike traditional data tools, which date back to when waterfall design was the norm, and unlike design-centric big data ingest frameworks such as Sqoop and Flume, modern dataflow tools provide an integrated development environment for continual use throughout the evolving dataflow life cycle.
Decouple for Continual Modernization
Unlike the monolithic solutions found in traditional data architectures, big data infrastructure is marked by a need for coordination across best-of-breed components for specialized functions such as ingest, message queues, storage, search and analytics. These components evolve and need to be upgraded on their own independent schedules. This means the large and expensive lockstep upgrades you’re used to in the traditional world are untenable and will need to be replaced by an ongoing series of more frequent changes to individual components. To keep your data operation up to date, you will be best served by decoupling each stage of data movement so you can upgrade each as you see fit.
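Decoupling can be pictured as stages that share nothing except a queue between them. The sketch below uses Python's standard `queue.Queue` as an in-process stand-in for a message queue; the `ingest` and `store` functions and the `SENTINEL` end marker are hypothetical, and a real deployment would typically place a system such as Kafka between the stages.

```python
import queue
from typing import Iterable, List

SENTINEL = None  # hypothetical end-of-stream marker agreed on by both stages


def ingest(source: Iterable[str], out_q: "queue.Queue") -> None:
    """Producer stage: knows only about the queue, not about the consumer."""
    for line in source:
        out_q.put(line)
    out_q.put(SENTINEL)


def store(in_q: "queue.Queue", sink: List[str]) -> None:
    """Consumer stage: knows only about the queue, not about the producer."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        sink.append(item.upper())  # stand-in for writing to a data store


buffer: "queue.Queue" = queue.Queue()  # unbounded, so the stages can run one after the other
sink: List[str] = []

ingest(["a", "b"], buffer)
store(buffer, sink)
```

Since each stage depends only on the queue contract, either one can be replaced or upgraded without touching the other.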
Create a Center of Excellence for Data in Motion
The movement of data is evolving from a stove-piped model to one that resembles a traffic grid. You can no longer get by with a fire-and-forget approach. In such a world you must formalize the management of the overall operation to ensure it functions reliably and meets internal service-level agreements (SLAs). This means adding tools that provide real-time visibility into the state of traffic flows, with the ability to get warnings and act on issues that may violate contracts around data delivery, completeness and integrity. Otherwise, you are trying to navigate a busy city using a paper map, and risk your data arriving late, incomplete or not at all.
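A contract around data delivery can be expressed as explicit thresholds that are checked against observed statistics. The following sketch assumes a hypothetical `DeliveryStats` structure and illustrative limits for completeness and latency; the actual terms would come from your own SLAs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeliveryStats:
    """Hypothetical per-flow statistics gathered over one monitoring window."""
    sent: int
    received: int
    max_latency_seconds: float


def sla_violations(
    stats: DeliveryStats,
    max_latency_seconds: float = 60.0,  # illustrative threshold
    min_completeness: float = 0.99,     # illustrative threshold
) -> List[str]:
    """Return a human-readable warning for each SLA term the flow violates."""
    violations: List[str] = []
    completeness = stats.received / stats.sent if stats.sent else 1.0
    if completeness < min_completeness:
        violations.append(
            f"completeness {completeness:.2%} below {min_completeness:.2%}"
        )
    if stats.max_latency_seconds > max_latency_seconds:
        violations.append(
            f"latency {stats.max_latency_seconds:.1f}s above {max_latency_seconds:.1f}s"
        )
    return violations


healthy = sla_violations(DeliveryStats(sent=1000, received=1000, max_latency_seconds=5.0))
degraded = sla_violations(DeliveryStats(sent=1000, received=900, max_latency_seconds=120.0))
```

Returning a list of warnings rather than a single pass/fail flag lets an operator see which part of the contract is at risk.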