Google is making it easier for software developers to write and integrate applications with its Cloud Dataflow managed service for processing large data sets.
The company on Dec. 18 released a Java software development kit for Cloud Dataflow into the open-source community as part of what it described as an effort to spur application development around the technology.
Making the SDK available as open source is also meant to help developers port Cloud Dataflow to other programming languages and service execution environments, Google software engineer Sam McVeety said in a blog post.
“Reusable programming patterns are a key enabler of developer efficiency,” McVeety wrote. “The Cloud Dataflow SDK introduces a unified model for batch and stream data processing” that developers can take advantage of in new ways, he said.
“We look forward to collaboratively building a system that enables distributed data processing for users from all backgrounds,” McVeety said.
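To illustrate the unified model McVeety describes, a minimal word-count pipeline written against the open-sourced Java SDK might look like the sketch below. It follows the pattern of the SDK's published examples; the Cloud Storage paths are placeholders, and the class name WordCountSketch is invented for illustration.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class WordCountSketch {
  public static void main(String[] args) {
    // Pipeline options determine where the pipeline executes
    // (for example, locally or on the managed Cloud Dataflow service).
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://your-bucket/input.txt"))   // placeholder input location
     .apply(ParDo.of(new DoFn<String, String>() {
       @Override
       public void processElement(ProcessContext c) {
         // Split each line into words and emit each word as an element.
         for (String word : c.element().split("[^a-zA-Z']+")) {
           if (!word.isEmpty()) {
             c.output(word);
           }
         }
       }
     }))
     .apply(Count.<String>perElement())                       // count occurrences of each word
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://your-bucket/output"));      // placeholder output location

    p.run();
  }
}
```

Because the same collection-based transforms apply whether the input is a bounded file or an unbounded stream, a pipeline along these lines would, in principle, carry over to streaming sources simply by swapping the I/O connectors, which is the point of the unified model.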
Google announced Cloud Dataflow at the Google I/O conference in June as a managed service to help enterprises ingest and analyze massive data sets both in real time and in batch mode.
The company has described Cloud Dataflow as technology that builds on MapReduce and more recent technologies like Flume and MillWheel, all of which Google has used internally to analyze very large data stores.
By combining elements of all these technologies, Google hopes to deliver a data processing service that gives companies the flexibility to do batch analysis on large data sets as well as near real-time analysis on data as it streams in. It will also let companies ingest and stage data for consumption by other analytics tools and services, such as Google’s own BigQuery.
Such capabilities are considered crucial for companies looking to extract business value from big data. The proliferation of cloud services, mobile devices and sensor technologies has allowed businesses to gather increasingly large volumes of data from myriad sources. The challenge has been to find a way to organize and manage that data so as to derive business value from it.
Amazon, one of the biggest cloud service providers, already offers a managed service called Kinesis that is similar to the one Google plans to launch with Cloud Dataflow. Amazon bills Kinesis as a service for real-time processing of streaming data at massive scale. It is designed to help companies capture, store and analyze terabytes of data pulled in from online transactions, Web logs, social media feeds and mobile devices.
With Cloud Dataflow, Google hopes to offer developers and businesses similar capabilities. “The value of data lies in analysis—and the intelligence one generates from it,” McVeety noted in his blog post.
“Turning data into intelligence can be very challenging as data sets become large and distributed across disparate storage systems. Add to that the increasing demand for real-time analytics, and the barriers to extracting value from data sets become a huge challenge for developers,” he said.