Computers, mobile devices and smart sensors are generating trillions of pieces of data around the world at an unprecedented rate. Tiny sensors collect soil data for farmers, wearable devices accumulate data about our health and scanners track consumer purchases. The list of data sources goes on and on.
But how much of all that data is organized and easily accessible? The rise of big data has also created huge demand for data scientists to help companies gain insights from all the data they're collecting. But data scientists are in short supply, and custom data analysis applications aren't always flexible enough to accommodate new data sources.
The big data dilemma is often compared to bodies of water. Ideally, you want a clear data lake that lets you see everything regardless of depth. But with disparate systems and means of collecting and processing data, what you often end up with is a murky data swamp.
Data integration company Informatica has been in the data management business for decades. Later this year it plans to release Data Lake Management, described as an out-of-the-box system for managing big data. The system will be previewed at the Strata + Hadoop World conference in New York Sept. 27-29, with availability expected by the end of this year.
"The biggest competition for this is the do-it-yourselfers, the idea that all you need is Hadoop and a few guys in a basement to make it work," Murthy Mathiprakasam, director of product marketing for data lake management at Informatica, told eWEEK. "Dealing with terabytes of data is hard, especially as you have to scale."
Hadoop is an open-source, Java-based programming framework designed to support the processing and storage of extremely large data sets in a distributed computing environment. It's been gaining popularity as a data analysis tool, but not every organization has the skills and knowledge to make it work effectively.
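Hadoop's core processing model, MapReduce, splits work into a map phase that emits key-value pairs and a reduce phase that aggregates them by key. A toy, single-machine sketch of that pattern in plain Python (no Hadoop involved; the word-count example and all names here are illustrative):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, value) pair for each item -- here, (word, 1) per word.
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Shuffle: group values by key; reduce: aggregate each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data lakes hold data"]
counts = reduce_phase(map_phase(lines))  # e.g. counts["data"] == 3
```

On a cluster, Hadoop runs the map and reduce steps on many machines in parallel and handles the shuffle, storage and failure recovery between them; that distributed machinery is what makes the framework hard to operate without specialized skills.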
Mathiprakasam said there are four problems Informatica was hearing from customers that led to the creation of its Data Lake Management application.
"The first is that creating solutions on top of Hadoop is extremely difficult because at some point you need reusability, maintainability and some way to automate some of the processes," said Mathiprakasam.
Second is finding an easier way to understand all the data a company may be collecting.
"Companies have enough data, but they don't know all of what they have and the business analysts that need it aren't getting the value of big data. We also talk to customers in regulated industries like health care and finance, where you need to know what you have and where it is," he said.
The third problem is time. Some companies have developed applications that work, but the preparation and cleansing of data from disparate sources so it can be presented to business analysts can take weeks.
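Part of why that preparation drags on is that each source uses its own field names and formats, and everything must be mapped to one canonical schema before analysts can use it. A minimal illustration of that mapping step in Python (the sources, field names and records are hypothetical):

```python
def normalize(record, field_map):
    # Rename source-specific fields to the canonical schema.
    return {canonical: record[source] for canonical, source in field_map.items()}

# Two hypothetical sources describing the same person with different schemas.
crm_row = {"cust_name": "Ada Lovelace", "cust_email": "ADA@EXAMPLE.COM"}
web_row = {"fullName": "Ada Lovelace", "email": "ada@example.com"}

crm_map = {"name": "cust_name", "email": "cust_email"}
web_map = {"name": "fullName", "email": "email"}

unified = [normalize(crm_row, crm_map), normalize(web_row, web_map)]

# Cleanse: lowercase emails so the two sources agree.
for row in unified:
    row["email"] = row["email"].lower()
```

Hand-writing and maintaining mappings like this for dozens of sources is exactly the repetitive work that data lake management tools aim to automate.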
The fourth issue is what Mathiprakasam calls insufficient trust. Data collection often overlaps (multiple customer accounts for the same person, for example) and can be inaccurate. "A lot of times the data being collected doesn't meet audit and compliance requirements regulated industries have," he said.
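The overlapping-accounts problem can be pictured as a deduplication pass. Production master-data tools use far fuzzier matching, but keying records on a normalized identifier shows the basic idea (a sketch with made-up data, not Informatica's method):

```python
def dedupe(accounts):
    # Keep one record per normalized email; merging is naive "first wins".
    seen = {}
    for acct in accounts:
        key = acct["email"].strip().lower()
        seen.setdefault(key, acct)
    return list(seen.values())

accounts = [
    {"id": 1, "email": "pat@example.com"},
    {"id": 2, "email": "Pat@Example.com "},  # same person, second account
    {"id": 3, "email": "lee@example.com"},
]
unique = dedupe(accounts)  # two distinct customers remain
```

Real systems also have to decide which record's fields to trust when merging, and to keep an audit trail of those decisions, which is where the compliance requirement comes in.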
The Data Lake Management package is an integrated suite built on a single metadata-driven platform.
Components include the Informatica Enterprise Information Catalog that offers intelligent self-service tools powered by machine learning and artificial intelligence, according to the company.
Informatica Intelligent Streaming is designed to help organizations more efficiently capture and process big data (such as machine data, social media feeds and website click streams) and real-time events to gain timely insight for business initiatives such as the internet of things (IoT), marketing and fraud detection.
Another component, Informatica Blaze, is said to dramatically increase Hadoop processing performance with intelligent data pipelining, job partitioning, job recovery and scaling.
For users of Microsoft's Azure cloud, Informatica Big Data Management can be deployed with the click of a button via the Microsoft Azure Marketplace to integrate, govern and secure big data at scale in Hadoop, according to company officials.
A related component, Informatica Cloud Microsoft Azure Data Lake Store Connector, helps customers achieve faster business insights with self-service connectivity to integrate and synchronize diverse data sets into Microsoft Azure Data Lake Store.
"There are so many point solutions on the market that solve one issue like delivery, but not trust and then you have the challenge of integrating that solution with another one," said Mathiprakasam.
To that end, he said, Informatica is also working with other vendors to offer connectors to Informatica Data Lake Management. For example, the company has partnered with Tableau on a connector from Informatica to Tableau's business intelligence application, making data and insights accessible to a business or marketing analyst without requiring a data scientist to build a solution.