Google announced it has updated its cloud platform to make handling big data in the cloud much easier for the everyday programmer or data analyst.
The announcement includes a series of new services and improvements to existing ones, including a beta release of Google Cloud Dataflow and key enhancements to Google BigQuery.
Over the last 10 years, Google has created, and come to rely on, many of the big data innovations in use today.
“We pride ourselves on being among the key innovators in big data,” Tom Kershaw, director of product management for Google Cloud Platform, told eWEEK. “We created things like MapReduce, Flume and a bunch of technologies to deal with the volumes of data that we see in the Internet world that we just never saw before.”
The amount of data that organizations are dealing with is exploding through point-of-sale devices, mobile devices, the Internet of Things (IoT), log files and more, and to survive in that world, you have to be able to harness that data quickly and transform it into intelligence. There are plenty of tools that allow you to do that, but the problem with those tools is they’re just too complicated, Kershaw said. “They are very difficult to use. Stringing together map reductions can be very hard,” he said. “And for the average startup or the average Java developer or the average data analyst in a large company, these tools have remained out of reach.”
Enter Google with new solutions to simplify things. Google Cloud Dataflow, now in beta, is a tool that lets you create big data applications using simple programming languages and simple SDKs, Kershaw said. In a blog post from last year’s Google I/O event, Greg DeMichillie, another director of product management for the Google Cloud Platform, said, “Cloud Dataflow is a fully managed service for creating data pipelines that ingest, transform and analyze data in both batch and streaming modes. Cloud Dataflow is a successor to MapReduce, and is based on our internal technologies like Flume and MillWheel.”
Cloud Dataflow provides unified programming primitives for both batch and stream-based data analysis. The SDK makes the Cloud Dataflow programming model broadly available, so developers can benefit from writing simple, extensible data-processing pipelines that describe both stream and batch processing tasks.
“Cloud Dataflow makes it easy for you to get actionable insights from your data while lowering operational costs without the hassles of deploying, maintaining or scaling infrastructure,” DeMichillie said. “You can use Cloud Dataflow for use cases like ETL [Extract, Transform, Load], batch data processing and streaming analytics, and it will automatically optimize, deploy and manage the code and resources required.”
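The ingest-transform-analyze pattern DeMichillie describes can be illustrated with a stdlib-only Python sketch. This is a conceptual stand-in, not the Dataflow SDK itself; the field names and sample data are invented for illustration:

```python
import csv
import io
from collections import defaultdict

# Ingest: raw CSV text standing in for data arriving from a source.
RAW = "store,item,amount\n12,coffee,3.50\n12,tea,2.75\n7,coffee,4.00\n"

def extract(raw_text):
    """Parse raw lines into records (the 'E' in ETL)."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(records):
    """Clean and type the fields (the 'T' in ETL)."""
    return [{"store": int(r["store"]), "item": r["item"], "amount": float(r["amount"])}
            for r in records]

def load(records):
    """Aggregate revenue per item (the 'L'/analysis step)."""
    totals = defaultdict(float)
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)

result = load(transform(extract(RAW)))
print(result)  # {'coffee': 7.5, 'tea': 2.75}
```

In the real service, each stage would be a pipeline step that Dataflow deploys, optimizes and scales for you; here the stages are just chained function calls.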
In a blog post, William Vambenepe, product manager for big data at Google, said that nothing stands between you and the satisfaction of seeing your processing logic applied, in streaming or batch mode, via a fully managed processing service.
“Just write a program, submit it, and Cloud Dataflow will do the rest," he said. "No clusters to manage - Cloud Dataflow will start the needed resources, auto-scale them (within the bounds you choose), and terminate them as soon as the work is done.”
Kershaw said users should view Cloud Dataflow as a Python tool that lets you identify data from all kinds of sources, specify which data you need, prepare that data by anonymizing it or removing what you don’t care about, and then run high-scale analytics against that information.
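The prepare-then-analyze workflow Kershaw outlines, anonymizing records and dropping unneeded fields before running analytics, can be sketched in plain Python. The record shape and the hashing choice are illustrative assumptions, not Dataflow code:

```python
import hashlib
from collections import Counter

# Sample events; 'user' is identifying, 'debug' is data we don't care about.
events = [
    {"user": "alice@example.com", "page": "/home", "debug": "x1"},
    {"user": "bob@example.com", "page": "/checkout", "debug": "x2"},
    {"user": "alice@example.com", "page": "/checkout", "debug": "x3"},
]

def anonymize(event):
    # Replace the identifying field with a one-way hash and
    # keep only the fields the analysis needs.
    hashed = hashlib.sha256(event["user"].encode()).hexdigest()[:12]
    return {"user": hashed, "page": event["page"]}

prepared = [anonymize(e) for e in events]

# High-scale analytics stand-in: count page views per page.
page_counts = Counter(e["page"] for e in prepared)
print(page_counts["/checkout"])  # 2
```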
“Think about it as a Java- and Python-based toolkit for writing complex database analytics applications really easily,” Kershaw said. “The other thing Dataflow will do that we think is really going to change the game is it allows you to use the same programming language and the same application for both streaming and batch information. Most big data work has been batch analytics of historical data, such as looking at point-of-sale data for the month of February for the last five years. What that misses is the real-time data that’s current now. Dataflow allows you to do streaming and batch on the same runtime and the same analysis. You can unify historical and real-time information in the same simple program.”
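Kershaw's point about one program serving both modes can be shown with a toy sketch: the same analysis function consumes a finite batch (a list of historical records) and a live stream (a generator standing in for real-time events). This is plain Python illustrating the idea, not the Dataflow runtime, and the sample figures are invented:

```python
def pipeline(records):
    """One analysis definition: total sales, applied to any iterable source."""
    total = 0.0
    for sale in records:
        total += sale["amount"]
    return total

# Batch mode: historical point-of-sale data already at rest.
february_history = [{"amount": 10.0}, {"amount": 5.5}]

# Streaming mode: a generator standing in for events arriving in real time.
def live_feed():
    for amount in (3.0, 2.0):
        yield {"amount": amount}

batch_total = pipeline(february_history)   # 15.5
stream_total = pipeline(live_feed())       # 5.0
print(batch_total + stream_total)          # 20.5, historical plus real-time
```

The design point is that `pipeline` never knows whether its input is bounded or unbounded; in Dataflow, the runtime makes that distinction, not your code.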