1. How to Successfully Deploy an Enterprise Data Lake
As enterprises try to extract more value from their data, the notion of building a “data lake” has gained traction. Data lakes are repositories of large amounts of structured and unstructured data that can be processed without the restrictions of a traditional data warehouse. By pooling a variety of data types, an enterprise ostensibly can uncover new insights to improve competitiveness. However, taking every scrap of data a company has and storing it all in one place is generally slow and inefficient. For actionable insights, you need to: a) have the right data; b) draw it from the right sources; and c) process it in the right way. Simply installing Hadoop is not a strategy. This eWEEK slide show offers suggestions from Mark Gibbs, senior product manager at SnapLogic, on how to make a data lake work.
2. Know Your Sources
Most enterprises employ a broad set of software-as-a-service (SaaS) applications and devices that yield high volumes of data. Salesforce, Workday and Zendesk, for example, all hold key business insights. Streaming data from internet of things (IoT) sensors provides information about the health of equipment, and social streams reveal how customers perceive your product. But you need to combine this data with traditional, on-premises sources to get a full view of your business, according to Mark Gibbs of SnapLogic.
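As a rough illustration of combining SaaS data with an on-premises source, the sketch below joins three small record sets on a shared account identifier. The record layouts, field names (account_id, status, amount) and the function name combine_sources are assumptions made for this example; they are not taken from Salesforce, Zendesk or any specific product.

```python
from collections import defaultdict


def combine_sources(crm_accounts, support_tickets, erp_orders):
    """Build one per-account view from three hypothetical sources.

    crm_accounts    -- SaaS CRM export: dicts with account_id, name
    support_tickets -- SaaS help desk export: dicts with account_id, status
    erp_orders      -- on-premises order system: dicts with account_id, amount
    """
    view = defaultdict(
        lambda: {"name": None, "open_tickets": 0, "order_total": 0.0}
    )
    for account in crm_accounts:
        view[account["account_id"]]["name"] = account["name"]
    for ticket in support_tickets:
        # Only open tickets count toward the current support load.
        if ticket["status"] == "open":
            view[ticket["account_id"]]["open_tickets"] += 1
    for order in erp_orders:
        view[order["account_id"]]["order_total"] += order["amount"]
    return dict(view)


# Made-up sample records, for illustration only.
crm = [{"account_id": "A1", "name": "Acme"}]
tickets = [
    {"account_id": "A1", "status": "open"},
    {"account_id": "A1", "status": "closed"},
]
orders = [
    {"account_id": "A1", "amount": 1200.0},
    {"account_id": "A1", "amount": 300.0},
]
print(combine_sources(crm, tickets, orders))
```

In practice the join key is rarely this clean; agreeing on a shared identifier across SaaS and on-premises systems is usually the first piece of work.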
3. Know Your Objectives
Knowing what you hope to achieve will help you decide which of these numerous data sources to ingest. You don’t need to know in advance which insights you’ll get, but you should decide whether your primary objective is to improve your product, refine your supply chain, make the company more efficient or something else. Having broad goals in mind will tell you which data sources to prioritize.
4. Exercise Control
Avoid “dumping,” so that your data lake doesn’t become a swamp of useless information. Enterprises are often tempted to throw everything into their data lake without a coherent strategy. While it’s not necessary to thoroughly clean every bit of data, organizations should establish a governed process for introducing new information to avoid messiness.
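One way to picture a governed intake process is a catalog that refuses any new data set arriving without basic metadata. The sketch below is a minimal illustration under that assumption; the Catalog class, the required fields (name, owner, source_system, objective) and the error messages are hypothetical and do not come from any particular governance tool.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical minimum metadata every new data set must carry.
REQUIRED_FIELDS = ("name", "owner", "source_system", "objective")


@dataclass
class Catalog:
    """In-memory stand-in for a data lake catalog."""

    entries: dict = field(default_factory=dict)

    def register(self, metadata: dict) -> None:
        """Admit a data set only if it is described and not a duplicate."""
        missing = [f for f in REQUIRED_FIELDS if not metadata.get(f)]
        if missing:
            raise ValueError(f"rejected: missing metadata {missing}")
        if metadata["name"] in self.entries:
            raise ValueError(
                f"rejected: {metadata['name']!r} is already registered"
            )
        self.entries[metadata["name"]] = {
            **metadata,
            "registered_on": date.today().isoformat(),
        }


catalog = Catalog()
catalog.register(
    {
        "name": "support_tickets",
        "owner": "customer-success",
        "source_system": "help desk export",
        "objective": "improve product",
    }
)

try:
    # No owner, source or objective: this is the "dumping" case.
    catalog.register({"name": "mystery_dump"})
except ValueError as err:
    print(err)
```

The check does not clean the data itself; it only ensures that everything entering the lake has an owner and a stated purpose.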
5. Use the Cloud Wisely
Along with SaaS applications, cloud storage is becoming the norm for new projects, whether that’s a data warehouse such as Snowflake or Amazon Redshift, or an object storage service such as Microsoft Azure Blob Storage (WASB). Cloud storage largely removes the need to manage backups and data replication yourself, since they’re included as part of the cloud service, and it allows you to scale a data store quickly and cost-effectively. Take advantage of cloud economics, Mark Gibbs of SnapLogic advises.
6. Hadoop-as-a-Service Is Your Friend
While Hadoop and HDFS are the most common ways to store and process large, complex data sets, they require specialized skills. What’s more, Hadoop clusters are typically on-premises systems that require a significant capital investment. Instead, consider a managed Hadoop service such as Amazon EMR, Microsoft HDInsight or Cloudera Altus. These can work directly against underlying cloud storage systems such as Amazon S3 and Microsoft Azure Data Lake Storage (ADLS), making it easier to ingest and deliver the data without having to move or copy it to HDFS.
7. Mind the Skills Gap
One of the biggest obstacles to deploying data lakes is a lack of skills for managing Hadoop and other big data technologies. Consider retraining data management employees to fill the necessary roles, or bring in technical consultants to help with the integration and training process. Be sure to choose user-friendly platforms so teams can leverage existing skills, which reduces the need to go on a Hadoop-specialist hiring blitz.
8. Keep Building
The beauty of a data lake is what you can layer on top. New, top-level analysis techniques allow you to see across silos and unlock the real value of big data. Predictive analytics are now embedded in many enterprise applications, and machine learning is increasingly part of cloud infrastructures. The possibilities are endless, from getting ahead of market trends to predicting factory machine failures to automating customer segmentation for marketing and sales, Mark Gibbs of SnapLogic said.