
    10 Best Practices for Managing Modern Data in Motion

    By Darryl K. Taft - May 30, 2016


      Modern data systems need to be managed differently from old-school database models. Here are 10 best practices for managing data in motion.

      1. Limit Hand Coding As Much As Possible

      It has been commonplace to write custom code to ingest data from sources into your data store. This practice is inadequate given the unstable nature of big data. Custom code creates brittleness in dataflows: Minor changes to the data schema can cause the pipeline to drop data or fail altogether. Today, there are ingest systems that create code-free plug-and-play connectivity between data source types, intermediate processing systems (such as Kafka and other message queues) and your data store. The benefits you get from a system like this are flexibility instead of brittleness, visibility instead of an opaque black box, and agility in terms of being able to upgrade parts of your operation without manual coding.
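      The brittleness of hand-coded ingest can be seen in a minimal sketch (the feed layout and field names here are hypothetical): positional parsing silently breaks the moment the source inserts a column, while name-based parsing survives the same change.

```python
import csv
import io

# Hypothetical feed: the source later inserts a new "region" column.
OLD_FEED = "user_id,amount\nu1,9.99\n"
NEW_FEED = "user_id,region,amount\nu1,eu,9.99\n"

def ingest_positional(raw):
    """Brittle hand-coded ingest: assumes 'amount' is always column 2."""
    rows = raw.strip().splitlines()[1:]
    return [float(line.split(",")[1]) for line in rows]

def ingest_by_name(raw):
    """Name-based ingest: tolerates column insertion or reordering."""
    return [float(row["amount"]) for row in csv.DictReader(io.StringIO(raw))]

print(ingest_positional(OLD_FEED))   # works on the old layout
# ingest_positional(NEW_FEED) raises ValueError once "region" appears
print(ingest_by_name(NEW_FEED))      # still extracts the amounts
```

      A plug-and-play ingest system generalizes this idea: connectivity is declared by field name and type rather than wired into custom parsing code, so schema changes surface as visible events instead of silent data loss.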

      2. Replace Specifying Schema With Capturing Intent

      While full schema specification is standard practice in the traditional data world, for big data it leads to wasted engineering time and resources. Consuming applications will make use of only a few important fields for analysis, and in many cases the data will have no schema, or the schema will be poorly defined and unstable over time. Rather than relying on full schema specification, dataflow systems must be intent-driven, whereby you specify conditions for, and transformations on, only those fields that matter to analysis and simply pass through the rest of the data as is. Being intent-driven is a minimalist approach that reduces the work and time required to develop and implement pipelines.
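      An intent-driven transform can be sketched in a few lines (the rule set and field names are illustrative, not from any particular tool): rules are declared only for the fields the analysis cares about, and everything else passes through untouched.

```python
# Declare intent only for the fields that matter to analysis;
# every other field passes through unmodified.
RULES = {
    "amount": float,       # coerce to a number
    "country": str.upper,  # normalize case
}

def apply_intent(record, rules=RULES):
    out = dict(record)     # pass-through by default
    for field, transform in rules.items():
        if field in out:
            out[field] = transform(out[field])
    return out

rec = {"amount": "19.90", "country": "de", "free_text": "anything"}
print(apply_intent(rec))
# amount coerced, country normalized, free_text untouched
```

      Because no full schema is asserted, new or renamed fields outside the rule set flow through without breaking the pipeline.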

      3. Sanitize Before Storing

      In the Hadoop world, it has been an article of faith that you should store untouched immutable raw data in your store. The downside of this practice is twofold: First, it can leave you with known dirty data in the store that engineers must inefficiently clean for each consumption activity. Second, it means you will have sensitive information in your data store that creates security risks and compliance violations. With modern dataflow systems, you can and should sanitize your data upon ingest. Sanitizing data as close to the data source as possible makes data scientists more productive by providing consumption-ready data to the multiple applications that may need it.
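      As a minimal sketch of ingest-time sanitization (field names and the hashing choice are assumptions for illustration): sensitive values are replaced with a one-way hash token and obvious dirt such as stray whitespace is cleaned before anything reaches the store.

```python
import hashlib

SENSITIVE = {"email", "ssn"}  # hypothetical sensitive fields

def sanitize(record):
    """Ingest-time sanitizer sketch: strip whitespace from strings and
    replace sensitive values with a short one-way hash token so raw PII
    never lands in the data store."""
    clean = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if key in SENSITIVE:
            value = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        clean[key] = value
    return clean

rec = {"email": "a@b.com", "name": "  Ada  "}
print(sanitize(rec))  # name cleaned, email replaced by a token
```

      In production you would salt or key the hash (plain hashing of low-entropy values like emails is reversible by dictionary attack), but the placement is the point: sanitization happens once, as close to the source as possible, not per consumer.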

      4. Expect and Deal With Data Drift

      An insidious challenge of big data management is dealing with data drift, the unpredictable, unavoidable and continuous mutation of data characteristics caused by the operation and maintenance of systems producing the data. It shows up in three forms: structural drift, semantic drift or infrastructure drift. Data drift erodes data fidelity, data operations reliability and ultimately the productivity of your data scientists and engineers. It increases your costs, delays your time to analysis and leads to poor decision-making. To deal with drift, you must implement tools and processes to detect unexpected changes in data and drive the needed remedial activity such as alerts, coercing data to appropriate values or trapping data for special handling.
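      The remedial activities named above, coercing values and trapping bad records, can be sketched as follows (the expected-type table is a hypothetical example):

```python
# Expected types for the fields the pipeline depends on.
EXPECTED = {"user_id": str, "amount": float}

def handle(record, quarantine):
    """Drift-handling sketch: coerce known fields to their expected types;
    trap records that fail coercion for special handling instead of
    letting them poison the store."""
    try:
        return {k: EXPECTED[k](v) if k in EXPECTED else v
                for k, v in record.items()}
    except (TypeError, ValueError):
        quarantine.append(record)  # trap for alerting / manual review
        return None

bad_rows = []
print(handle({"user_id": 42, "amount": "3.5"}, bad_rows))   # coerced
print(handle({"user_id": "u9", "amount": "N/A"}, bad_rows))  # trapped
```

      A real system would also alert on the quarantine rate, since a sudden spike is itself a drift signal.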

      5. Avoid Managed File Transfer as Your Ingestion Strategy

      Data sets are increasingly unbounded and continuous. They consist of ever-changing logs, clickstreams and sensor output. Assuming that you will use file transfers or other rudimentary mechanisms is a fragile assumption that will eventually call for a redesign of your infrastructure. Files, because their contents vary in size, structure and format, are impossible to introspect on the fly. If you’re intent on using a file transfer mechanism, consider pre-processing that standardizes the data format to allow inspection, or adopt an ingestion tool or framework that does this for you.
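      If file transfer is unavoidable, the pre-processing step suggested above can be as simple as normalizing every payload to one self-describing record per line. A sketch, using CSV-to-JSON-Lines as the assumed standardization:

```python
import csv
import io
import json

def csv_to_jsonl(raw_csv):
    """Pre-processor sketch: normalize an arbitrary CSV payload to JSON
    Lines so every downstream stage sees one self-describing record per
    line, regardless of the original file's layout."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return "\n".join(json.dumps(row) for row in reader)

raw = "host,status\nweb1,200\nweb2,500\n"
print(csv_to_jsonl(raw))
```

      Once records are line-delimited and self-describing, they can be inspected on the fly, which raw files of varying size, structure and format cannot.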

      6. Instrument Everything in Your Dataflows

      No amount of visibility in a complex dataflow system is enough. Instrumentation will guide you while you contend with the challenge of evolving dataflows and thus help keep you a step ahead of these inevitable changes. This instrumentation is not just needed for time-series analysis of a single dataflow to tease out changes over time. More importantly, instrumentation can correlate data across flows to identify interesting events in real time. Organizations must capture details of every aspect of the system to the extent possible without introducing significant overhead.
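      Low-overhead instrumentation can be retrofitted onto any stage with a wrapper; a minimal sketch (the stage name and metrics shape are assumptions, not a specific tool's API):

```python
import time
from collections import defaultdict

# Per-stage record counts and wall time, keyed by stage name.
METRICS = defaultdict(lambda: {"count": 0, "seconds": 0.0})

def instrumented(stage_name):
    """Decorator sketch: make every hop in the dataflow observable by
    recording how many records a stage processed and how long it took."""
    def wrap(fn):
        def inner(record):
            start = time.perf_counter()
            out = fn(record)
            m = METRICS[stage_name]
            m["count"] += 1
            m["seconds"] += time.perf_counter() - start
            return out
        return inner
    return wrap

@instrumented("enrich")
def enrich(record):
    return {**record, "source": "feed-a"}

enrich({"id": 1})
print(METRICS["enrich"])
```

      Emitting these counters to a time-series store enables both the single-flow trend analysis and the cross-flow correlation the text describes.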

      7. Don’t Just Count Packages, Inspect the Contents

      Not that the TSA is a popular metaphor, but would you feel more secure if it solely counted the passengers and luggage "ingested" rather than actually scanning for unusual contents? Yet the traditional metrics for data movement measure only the throughput and latency of the packages. The reality of data drift means that you're much better off if you also analyze the values of the data itself as it flows into your systems. Otherwise, you leave yourself blind to the risk that the format or meaning of the data has changed without your knowledge.
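      One cheap form of content inspection is profiling a field's value types per batch (field names here are illustrative): a silent semantic change, such as amounts arriving as locale-formatted strings instead of numbers, shows up as a shifted distribution rather than going unnoticed.

```python
from collections import Counter

def profile(records, field):
    """Content-inspection sketch: distribution of a field's value types
    across a batch. A drifting source stands out as a new type bucket
    even while throughput and latency look perfectly healthy."""
    return Counter(type(r.get(field)).__name__ for r in records)

batch = [{"amount": 3.5}, {"amount": 4.0}, {"amount": "7,25"}]
print(profile(batch, "amount"))  # floats dominate; the stray string stands out
```

      Richer profiles (value ranges, cardinality, null rates) follow the same pattern and can feed the alerting described under data drift.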

      8. Implement a DevOps Approach to Data in Motion

      The DevOps paradigm of an agile workflow with tight linkages between those who design a system and those who run it is well-suited to the complex systems required for big data. In a world of continually evolving data sources, consumption use cases and data-processing systems, data pipelines will need to be adjusted frequently. Fortunately, unlike traditional data tools that date back to the era of waterfall design, or design-centric big data ingest frameworks such as Sqoop and Flume, there are now modern dataflow tools that provide an integrated development environment for continual use throughout the evolving dataflow life cycle.

      9. Decouple for Continual Modernization

      Unlike monolithic solutions found in traditional data architectures, big data infrastructure is marked by a need for coordination across best-of-breed components for specialized functions such as ingest, message queues, storage, search and analytics. These components evolve and need to be upgraded on their own independent schedules. This means the large and expensive lockstep upgrades you’re used to in the traditional world are untenable and will need to be replaced by an ongoing series of more frequent changes to componentry. To keep your data operation up-to-date, you will be best served by decoupling each stage of data movement so you can upgrade each as you see fit.

      10. Create a Center of Excellence for Data in Motion

      The movement of data is evolving from a stove-piped model to one that resembles a traffic grid. You can no longer get by with a fire-and-forget approach. In such a world you must formalize the management of the overall operation to ensure it functions reliably and meets internal service-level agreements (SLAs). This means adding tools that provide real-time visibility into the state of traffic flows, with the ability to get warnings and act on issues that may violate contracts around data delivery, completeness and integrity. Otherwise, you are trying to navigate a busy city using a paper map, and risk your data arriving late, incomplete or not at all.
