Eight Essential Boxes to Check for Managing Analytics Data Pipeline

May 11, 2017

Enterprises of every type are updating their IT in order to use data more efficiently. While the importance of analytics and business intelligence for successfully operating a company is no longer in question, the right data management processes are essential to providing accurate and reliable information. The number of data sources and the volume of data are both growing rapidly, making it essential that your data is analytics-ready at the start of the data pipeline. These processes will also help ensure that your pipeline is future-proofed against changing analytic needs and emerging technologies. In this eWEEK slide show, we present a checklist from Pentaho, a Hitachi Group company, for how to manage a data pipeline to make sure your data is analytics-ready.

Checklist Item #1: Data Connectivity

To manage your data pipeline effectively, it’s essential that your tools have the right connectivity to both traditional and emerging sources of structured, semi-structured and unstructured data. When evaluating potential vendors, it’s important to ask questions about connectivity capabilities to different types of data sources.
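One way to picture this kind of connectivity is a small registry that maps each source type to a reader and returns records in a common shape. The sketch below is illustrative only: the names `CONNECTORS` and `connect`, and the three source types (CSV, JSON lines, free text), are assumptions made for the example and are not taken from Pentaho or any other vendor's product.

```python
import csv
import io
import json


def read_structured(csv_text):
    """Structured source: CSV text becomes one dict per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def read_semi_structured(json_lines):
    """Semi-structured source: one JSON document per non-empty line."""
    return [json.loads(line) for line in json_lines.splitlines() if line.strip()]


def read_unstructured(text):
    """Unstructured source: each non-empty line is wrapped in a record."""
    return [{"text": line.strip()} for line in text.splitlines() if line.strip()]


# Hypothetical registry: source type -> reader function.
CONNECTORS = {
    "csv": read_structured,
    "jsonl": read_semi_structured,
    "text": read_unstructured,
}


def connect(kind, payload):
    """Dispatch to the reader for this source type, or fail loudly."""
    try:
        reader = CONNECTORS[kind]
    except KeyError:
        raise ValueError(f"no connector for source type: {kind}") from None
    return reader(payload)
```

A question to put to a vendor is, in effect, how long this registry is and how hard it is to add an entry for a source that does not exist yet.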

Checklist Item #2: Data Engineering

Data engineering requires more than just connecting to or loading data. It involves managing a changing array of data sources, establishing repeatable processes at scale, and maintaining control and governance. Whether an organization is implementing an ongoing process for ingesting hundreds of data sources into Hadoop or enabling business users to upload diverse data without IT assistance, onboarding projects tend to create major obstacles.

Checklist Item #3: Data Delivery

Getting data where it needs to go is essential. Some solutions perform better with traditional data warehouses and some with newer technologies. It’s important to consider how to future-proof your solution so that you are not stuck with outdated technology when innovative companies and the open-source community develop something new and the data landscape changes.

Checklist Item #4: Data Preparation

According to Ventana Research, big data projects require organizations to spend 46 percent of their time preparing data and 52 percent of their time checking for data quality and consistency. Stand-alone tools that help with data preparation may lack the flexibility to blend both traditional and new unstructured data sources. The more stand-alone vendors you use, the more likely you are to run into problems going from one stage of the data pipeline to another.

Checklist Item #5: Analytics

As your business needs evolve, it’s important to have a platform that can evolve with you. Vendors that provide a fixed library of analytics options may not have the flexibility you need. Being able to use the best of predictive analytics and to embed analytics into your existing business processes—or even into your software—is critical for getting the most business value.

Checklist Item #6: Pipeline Automation and Management

As the saying goes, “Excel ETL” isn’t scalable. The methodology one person follows may not be followed by colleagues in other business units, which leads to nonstandard reporting. It’s vital to be able to automate as much of your data pipeline as possible to make the most of your team’s resources.
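As a rough illustration of the difference between a manual spreadsheet routine and an automated one, the sketch below writes the extract, transform and load steps as functions, so every run and every business unit applies the same logic. The column names (`region`, `revenue`), the cleanup rules and the sample data are assumptions invented for the example.

```python
import csv
import io


def extract(raw_csv):
    """Read CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows):
    """Apply one agreed cleanup rule set: trim and upper-case regions, parse revenue."""
    return [
        {"region": row["region"].strip().upper(), "revenue": float(row["revenue"])}
        for row in rows
    ]


def load(rows):
    """Aggregate revenue per region (a stand-in for writing to a real target)."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["revenue"]
    return totals


def run_pipeline(raw_csv):
    """Run the same three steps, in the same order, on any input."""
    return load(transform(extract(raw_csv)))


# Hypothetical input with the kind of inconsistencies manual entry produces.
SAMPLE = "region,revenue\n east ,100.5\nWest,200\neast,50\n"
```

Because the rules live in code rather than in one person's habits, `run_pipeline(SAMPLE)` can be scheduled, reviewed and rerun without depending on who prepared the report.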

Checklist Item #7: Governance and Security

Data governance and security are not optional—and it’s best to have a security plan rather than handle damage control after a breach. If you’re working in a regulated industry, it’s especially important to use a data pipeline platform that captures the flow of who did what, with what data, and when. As you’re evaluating vendors, review their capabilities when it comes to data governance and security.
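To make "who did what, with what data, and when" concrete, here is a minimal sketch of an append-only audit trail. The `AuditEvent` and `AuditLog` classes and their fields are hypothetical and do not represent any particular platform's API. A production system would also need durable, tamper-evident storage, which this example does not attempt.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One immutable record: who did what, to which dataset, and when."""
    user: str
    action: str
    dataset: str
    timestamp: datetime


class AuditLog:
    """In-memory, append-only list of audit events (illustration only)."""

    def __init__(self):
        self._events = []

    def record(self, user, action, dataset):
        """Append an event stamped with the current UTC time and return it."""
        event = AuditEvent(user, action, dataset, datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def by_dataset(self, dataset):
        """Return all events that touched the given dataset, oldest first."""
        return [event for event in self._events if event.dataset == dataset]


log = AuditLog()
log.record("alice", "read", "customers")
log.record("bob", "update", "orders")
log.record("alice", "export", "orders")
```

Calling `log.by_dataset("orders")` then answers the question an auditor in a regulated industry is likely to ask: which users touched this data, what they did, and in what order.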

Checklist Item #8: Extensibility and Scalability

The big data ecosystem surrounding Apache Hadoop includes dozens of tools, each of which is constantly evolving. Much of the innovation in data management over the last few years, especially with big data, has taken place in the open-source community. Accordingly, if you’re considering a vendor whose product is based on proprietary rather than open-source code, you might get left behind when tools evolve. To stay flexible, consider a vendor’s extensibility and scalability features.