Eight Essential Boxes to Check for Managing the Analytics Data Pipeline
Enterprises of every type are updating their IT in order to use data more efficiently. While the importance of analytics and business intelligence for successfully operating a company is no longer in question, the right data management processes are essential to providing accurate and reliable information. The number of data sources and the volume of data are exploding, making it critical that your data is analytics-ready at the start of the data pipeline. These processes will also help ensure that your pipeline is future-proofed against changing analytic needs and emerging technologies. In this eWEEK slide show, we present a checklist from Pentaho, a Hitachi Group company, for managing a data pipeline so that your data is analytics-ready.
Checklist Item #1: Data Connectivity
To manage your data pipeline effectively, it’s essential that your tools have the right connectivity to both traditional and emerging sources of structured, semi-structured and unstructured data. When evaluating potential vendors, it’s important to ask questions about connectivity capabilities to different types of data sources.
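To make the connectivity question concrete, the sketch below (plain Python with standard-library modules; all table, file and field names are hypothetical) shows the kind of work a pipeline tool does behind the scenes: pulling the same logical records from a traditional relational table, a semi-structured JSON feed and a flat-file extract, then normalizing them into one shape for downstream steps.

```python
import csv
import io
import json
import sqlite3

# Traditional structured source: a relational table (in-memory SQLite here)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
db.execute("INSERT INTO orders VALUES (1, 19.99)")
rows_sql = db.execute("SELECT id, amount FROM orders").fetchall()

# Emerging semi-structured source: JSON, where fields can vary per record
rows_json = json.loads('[{"id": 2, "amount": 5.00, "note": "rush"}]')

# Legacy flat-file source: a CSV extract
rows_csv = list(csv.DictReader(io.StringIO("id,amount\n3,7.50\n")))

# Normalize all three source types into one record shape for the pipeline
records = (
    [{"id": i, "amount": a} for i, a in rows_sql]
    + [{"id": r["id"], "amount": r["amount"]} for r in rows_json]
    + [{"id": int(r["id"]), "amount": float(r["amount"])} for r in rows_csv]
)
print(len(records))  # → 3
```

A vendor's connectors do exactly this normalization at scale; the evaluation question is how many source types they cover natively.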
Checklist Item #2: Data Engineering
Data engineering requires more than just connecting to or loading data. It involves managing a changing array of data sources, establishing repeatable processes at scale, and maintaining control and governance. Whether an organization is implementing an ongoing process for ingesting hundreds of data sources into Hadoop or enabling business users to upload diverse data without IT assistance, onboarding projects tend to create major obstacles.
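One common way to make onboarding repeatable at scale is metadata-driven ingestion: each source is described by a small configuration record, and a single governed routine handles every source instead of a hand-coded job per feed. The sketch below illustrates the pattern; the source names and the stubbed ingest step are assumptions, not any vendor's actual API.

```python
# Metadata-driven onboarding sketch: each source is a config record, and one
# repeatable routine processes all of them. Names here are illustrative only.
SOURCES = [
    {"name": "crm_contacts", "format": "csv",  "path": "/landing/crm.csv"},
    {"name": "web_clicks",   "format": "json", "path": "/landing/clicks.json"},
    {"name": "erp_invoices", "format": "csv",  "path": "/landing/erp.csv"},
]

def ingest(source):
    """One governed, auditable path for every source (stubbed for the sketch)."""
    # A real pipeline would read source["path"], validate the schema, and land
    # the data in Hadoop or a warehouse; here we only record the outcome.
    return {"source": source["name"], "status": "loaded"}

results = [ingest(s) for s in SOURCES]
print(results)
```

Adding the hundredth source then means adding one config record, not writing and governing a hundredth bespoke job.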
Checklist Item #3: Data Delivery
Getting data where it needs to go is essential. Some solutions perform better with traditional data warehouses and some with newer technologies. It’s important to consider how to future-proof your solution to avoid getting stuck with outdated technology when innovative companies and the open-source community develop something new, requiring you to adapt to the changing data landscape.
Checklist Item #4: Data Preparation
According to Ventana Research, big data projects require organizations to spend 46 percent of their time preparing data and 52 percent of their time checking for data quality and consistency. Stand-alone tools that help with data preparation may lack the flexibility to blend both traditional and new unstructured data sources. The more stand-alone vendors you use, the more likely you are to run into problems going from one stage of the data pipeline to another.
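The blending problem the paragraph describes can be sketched in a few lines: joining a traditional tabular extract with a semi-structured JSON feed on a shared key, tolerating records the feed does not cover. All names and values below are hypothetical.

```python
import json

# Traditional tabular extract (e.g., from a warehouse or CRM export)
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]

# Semi-structured JSON feed: fields are optional and may vary per record
feed = json.loads('[{"customer_id": 1, "tags": ["priority"]}]')
by_id = {r["customer_id"]: r for r in feed}

# Blend the two sources on customer_id; customers absent from the feed
# simply get an empty tag list rather than breaking the pipeline
blended = [
    {**c, "tags": by_id.get(c["customer_id"], {}).get("tags", [])}
    for c in customers
]
print(blended)
```

Stand-alone preparation tools that handle only one of these two shapes force exactly the kind of hand-off friction the paragraph warns about.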
Checklist Item #5: Analytics
As your business needs evolve, it’s important to have a platform that can evolve with you. Vendors that provide a fixed library of analytics options may not have the flexibility you need. Being able to use the best of predictive analytics and to embed analytics into your existing business processes—or even into your software—is critical for getting the most business value.
Checklist Item #6: Pipeline Automation and Management
Checklist Item #7: Governance and Security
Data governance and security are not optional—and it’s best to have a security plan rather than handle damage control after a breach. If you’re working in a regulated industry, it’s especially important to use a data pipeline platform that captures the flow of who did what, with what data, and when. As you’re evaluating vendors, review their capabilities when it comes to data governance and security.
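The "who did what, with what data, and when" requirement maps naturally to an append-only audit trail wrapped around data operations. The sketch below (plain Python; the decorator, log store and function names are assumptions for illustration) shows the shape of such a trail, not any particular platform's implementation.

```python
import datetime
import functools

AUDIT_LOG = []  # in practice this would be durable, append-only storage

def audited(action):
    """Record who did what, with what data, and when, before running the op."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(user, dataset, *args, **kwargs):
            AUDIT_LOG.append({
                "who": user,
                "what": action,
                "data": dataset,
                "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            })
            return fn(user, dataset, *args, **kwargs)
        return inner
    return wrap

@audited("export")
def export_dataset(user, dataset):
    # Stub: a real version would actually move the data
    return f"{dataset} exported by {user}"

export_dataset("jdoe", "q3_sales")
print(AUDIT_LOG[0]["who"], AUDIT_LOG[0]["what"], AUDIT_LOG[0]["data"])
```

When evaluating vendors, the question is whether this trail is captured automatically across the whole pipeline rather than bolted on per job, as in this sketch.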
Checklist Item #8: Extensibility and Scalability
The big data ecosystem surrounding Apache Hadoop includes dozens of tools, each of which is constantly evolving. Much of the innovation in data management over the last few years, especially with big data, has taken place in the open-source community. Accordingly, if you choose a vendor built on proprietary rather than open-source code, you risk getting left behind as those tools evolve. To stay flexible, consider a vendor's extensibility and scalability features.