Developers using Google Cloud Dataflow to write pipelines that combine batch and stream processing tasks now have the option of running their programs on the new Apache Flink distributed processing engine.
Data Artisans, the Berlin, Germany-based maker of Flink, has released a Cloud Dataflow runner for Flink that allows any Dataflow program to execute on a Flink cluster located in the cloud or installed on-premise.
Flink is a new top-level project from the Apache Software Foundation that offers a distributed processing engine for running batch and stream processing applications. Data Artisans describes it as an alternative to Hadoop’s MapReduce component that is capable of working completely independently of the Hadoop ecosystem.
Google’s Cloud Dataflow is a programming model for combining batch and stream processing tasks on large data sets. The technology is designed for companies looking to extract business value from both data at rest and data in motion. Some use cases for streaming analytics include real-time data visualization, real-time alerting and security monitoring.
The Flink announcement expands to three the number of platforms that are now available to developers for running batch and stream processing applications using Dataflow.
Cloud Dataflow was originally released as a service on Google’s Cloud Platform. Then in December, Google released a Cloud Dataflow Software Development Kit (SDK) for developers looking to port the programming model to other processing engines. In January, Google and Cloudera announced the availability of Dataflow on Cloudera’s popular Apache Spark platform.
In a blog post, Data Artisans described the new runner as a tool that would enable Dataflow users to more easily leverage Apache Flink as an execution backend for their programs.
“Flink and Cloud Dataflow are very well aligned, as they both share the vision of natively unifying stream and batch processing at the engine level,” the blog noted.
The addition of Flink to the runners available for Dataflow gives users more choice for running hybrid batch and stream analytics both in the cloud and on premise, the blog post said.
According to Data Artisans, the new Flink runner supports all the batch functionality of Dataflow. The team is currently working on building streaming analytics support into the runner, the blog noted, without specifying a time frame.
In the blog post announcing the new development, Google Senior Product Manager William Vambenepe said the Flink runner boosts the portability and performance of Dataflow pipelines.
“[Flink] provides a robust execution engine with custom memory management and a cost-based optimizer,” Vambenepe said. “And best of all, you have the assurance that your Dataflow pipelines are portable beyond Google Cloud Dataflow.”
Analyst firms like Forrester expect demand for streaming analytics services and technologies to grow in the next few years as more organizations try to extract value from the huge volumes of data being generated these days from transactions, Web clickstreams, mobile applications and cloud services.
Google’s major cloud rivals Amazon and Microsoft both have real-time stream processing services that are similar to Dataflow. Amazon’s technology is called Kinesis and is touted by the company as a service that is capable of helping businesses capture and analyze terabytes of data per hour. Microsoft’s Stream Analytics event processing engine is similarly designed to help companies gain real-time business insights from data captured from applications, data, sensors and devices.