Google announced it has updated its cloud platform to make handling big data in the cloud much easier for the everyday programmer or data analyst.
The announcement includes a series of new services and improvements to existing ones, including a beta release of Google Cloud Dataflow and key enhancements to Google BigQuery.
Over the last 10 years, Google has created and relied on a lot of the big data innovations in use today.
“We pride ourselves on being among the key innovators in big data,” Tom Kershaw, director of product management for Google Cloud Platform, told eWEEK. “We created things like MapReduce, Flume and a bunch of technologies to deal with the volumes of data that we see in the Internet world that we just never saw before.”
The amount of data that organizations are dealing with is exploding through point-of-sale devices, mobile devices, the Internet of Things (IoT), log files and more, and to survive in that world, you have to be able to harness that data quickly and transform it into intelligence. There are plenty of tools that allow you to do that, but the problem with those tools is they’re just too complicated, Kershaw said. “They are very difficult to use. Stringing together map reductions can be very hard,” he said. “And for the average startup or the average Java developer or the average data analyst in a large company, these tools have remained out of reach.”
Enter Google with new solutions to simplify things. Google Cloud Dataflow, now in beta, is a tool that lets you create big data applications using simple programming languages and simple SDKs, Kershaw said. In a blog post from last year’s Google I/O event, Greg DeMichillie, another director of product management for the Google Cloud Platform, said, “Cloud Dataflow is a fully managed service for creating data pipelines that ingest, transform and analyze data in both batch and streaming modes. Cloud Dataflow is a successor to MapReduce, and is based on our internal technologies like Flume and MillWheel.”
Cloud Dataflow provides unified programming primitives for both batch and stream-based data analysis. The SDK allows the Cloud Dataflow programming model to be widely used, so that developers can benefit from the productivity of writing simple and extensible data processing pipelines which can describe both stream and batch processing tasks.
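The key idea behind those unified primitives is that one pipeline definition can run over a bounded (batch) collection or an unbounded (streaming) source without changes. The sketch below is plain Python with hypothetical names (`make_pipeline`, `run`), not the actual Cloud Dataflow SDK; it only illustrates how the same transform chain can serve both modes.

```python
# Illustrative sketch of the unified batch/stream model; the names here
# are hypothetical and do not come from the real Cloud Dataflow SDK.

def make_pipeline(*transforms):
    """Compose transforms into one function that works on any iterable."""
    def run(records):
        for record in records:
            for transform in transforms:
                record = transform(record)
                if record is None:  # a transform filtered the record out
                    break
            else:
                yield record
    return run

parse = lambda line: line.strip().lower()
keep_sales = lambda word: word if word.startswith("sale") else None

pipeline = make_pipeline(parse, keep_sales)

# Batch mode: a finite, in-memory collection.
batch = ["Sale:100\n", "refund:5\n", "SALE:20\n"]
print(list(pipeline(batch)))          # ['sale:100', 'sale:20']

# Streaming mode: the same pipeline over a generator standing in
# for an unbounded source.
def stream():
    yield from ["sale:7\n", "noise\n"]
print(list(pipeline(stream())))       # ['sale:7']
```

Because the pipeline is just a composition over an iterable, nothing in it cares whether the input ever ends, which is the property the Dataflow model exploits.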
“Cloud Dataflow makes it easy for you to get actionable insights from your data while lowering operational costs without the hassles of deploying, maintaining or scaling infrastructure,” DeMichillie said. “You can use Cloud Dataflow for use cases like ETL [Extract, Transform, Load], batch data processing and streaming analytics, and it will automatically optimize, deploy and manage the code and resources required.”
In a blog post, William Vambenepe, product manager for big data at Google, said that nothing stands between you and the satisfaction of seeing your processing logic applied, in streaming or batch mode, via a fully managed processing service.
“Just write a program, submit it, and Cloud Dataflow will do the rest,” he said. “No clusters to manage – Cloud Dataflow will start the needed resources, auto-scale them (within the bounds you choose), and terminate them as soon as the work is done.”
Kershaw said users should view Cloud Dataflow as a Python tool that lets you identify data from all kinds of sources, specify the data you want, prepare it by anonymizing it or removing what you don’t care about, and then run high-scale analytics against that information.
“Think about it as a Java- and Python-based toolkit for writing complex database analytics applications really easily,” Kershaw said. “The other thing Dataflow will do that we think is really going to change the game is it allows you to use the same programming language and the same application for both streaming and batch information. Most big data has been batch analytics of historical data, such as looking at point-of-sale data for the month of February for the last five years. What that’s missing is the real-time data that’s current now. Dataflow allows you to do streaming and batch on the same runtime and the same analysis. You can unify historical and real-time information in the same simple program.”
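The prepare-then-analyze flow Kershaw describes — anonymize sensitive fields, drop what you don’t need, then aggregate historical and live records together — can be sketched in a few lines of plain Python. The record layout and field names below are invented for illustration; only the shape of the workflow comes from the article.

```python
import hashlib
from collections import Counter

# Hypothetical point-of-sale records; field names are illustrative only.
historical = [{"card": "4111-1111", "store": "NYC", "amount": 40},
              {"card": "4222-2222", "store": "SFO", "amount": 15}]
live       = [{"card": "4111-1111", "store": "NYC", "amount": 25}]

def anonymize(record):
    """Prepare step: replace the card number with a short one-way hash."""
    clean = dict(record)
    clean["card"] = hashlib.sha256(record["card"].encode()).hexdigest()[:8]
    return clean

def totals_by_store(records):
    """Analytics step: aggregate amounts per store."""
    totals = Counter()
    for r in records:
        totals[r["store"]] += r["amount"]
    return dict(totals)

# The same code path handles batch (historical) and streaming (live) input.
prepared = [anonymize(r) for r in historical + live]
print(totals_by_store(prepared))   # {'NYC': 65, 'SFO': 15}
```

The point of the unification Kershaw describes is exactly this: the aggregation never needs to know which records were historical and which arrived moments ago.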
Kershaw also noted that Google’s goal with Cloud Dataflow was to create an environment where any programmer or any analyst could take the power of big data and be able to transform their business quickly and easily.
“The intent is if you can do basic Java programming you can now write big data applications,” he said. “A few years ago that was not possible. It was impossible for the average programmer to be able to deal with the complexity of stringing together map reductions.”
Making big data easier, along with the natural advantages of the cloud, will drive a transformation in how people approach these problems. In that regard, big data and the cloud are natural bedfellows. Doing big data in the cloud helps organizations be more productive when building applications, delivering faster and better insights without the need to worry about the underlying infrastructure.
“It’s very difficult to do an on-prem model where you have to buy, set up and run machines to suit the growing needs of your big data environment,” Kershaw said. “So the on-demand compute model and big data just go together hand in hand. There’s the operations piece, there’s the ability to scale and run different workloads, and there’s the issue of security and collaboration and how you can take information and share it across the organization. We think the cloud collaboration model and the new tools we’re delivering to make big data easy are going to change the game in how organizations use big data. So it’s no longer just the realm of the data scientist; it’s really going to be accessible to any developer anywhere at any time.”
Meanwhile, Google also updated BigQuery.
BigQuery is a large-scale analytics engine that allows you to run through massive volumes of data with a SQL front end. It is Google’s flagship product for integrating large-scale analytics with off-the-shelf business tools. Google also announced the availability of BigQuery in Europe, allowing users to store their data in Google Cloud Platform’s European data centers, along with support for data residency, which lets users specify the continent on which their data should be stored; Google will make sure it stays there.
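BigQuery itself is a managed service queried over the network, and its SQL dialect and scale are its own; the in-memory sqlite3 sketch below, with an invented `sales` table, only illustrates what a "SQL front end" over structured data means for the analyst: ordinary SQL in, aggregated rows out.

```python
import sqlite3

# Illustrative stand-in for a SQL front end over structured data.
# This is NOT BigQuery; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("NYC", 40), ("SFO", 15), ("NYC", 25)])

# The analyst writes ordinary SQL; the engine handles the scanning.
rows = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(rows)   # [('NYC', 65), ('SFO', 15)]
```

Off-the-shelf business tools can plug into such a front end precisely because the interface is plain SQL rather than a bespoke API.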
Google also enhanced BigQuery’s ingestion capability, so it can now ingest 100,000 rows per second per table. And the company introduced row-level permissions, a new security feature that controls which rows of a shared dataset a given user can access.
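One common way to reason about row-level permissions is as a filtered view: the user is granted access to a view that exposes only permitted rows, never the underlying table. The sqlite3 sketch below is a hypothetical illustration of that idea, not BigQuery's actual permission mechanism.

```python
import sqlite3

# Hypothetical sketch of the row-level-permission idea: a shared table
# is exposed to a user only through a view filtered to permitted rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, value INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("emea", 1), ("amer", 2), ("emea", 3)])

# The EMEA analyst is granted the view, not the underlying table.
conn.execute("CREATE VIEW emea_events AS "
             "SELECT * FROM events WHERE region = 'emea'")
print(conn.execute("SELECT value FROM emea_events").fetchall())  # [(1,), (3,)]
```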
“BigQuery is the ideal platform for storing, analyzing, and sharing structured data,” Vambenepe said. “It also supports repeated records and querying inside JSON objects for loosely structured data.”
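Addressing fields inside nested and repeated records is the capability Vambenepe mentions for loosely structured data. The stdlib sketch below mimics that with a dotted-path lookup into a parsed JSON object; the record and the `get_path` helper are invented for illustration and are not BigQuery syntax.

```python
import json

# Hypothetical JSON record with a nested object and a repeated field.
record = json.loads('{"user": {"id": 7, "tags": ["a", "b"]}, "amount": 12}')

def get_path(obj, path):
    """Walk a dotted path like 'user.id' into a nested JSON object."""
    for key in path.split("."):
        obj = obj[key]
    return obj

print(get_path(record, "user.id"))     # 7
print(get_path(record, "user.tags"))   # ['a', 'b']
```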
Meanwhile, Google Cloud Pub/Sub is designed to provide scalable, reliable, and fast event delivery as a fully managed service, he said.
“Along with BigQuery streaming ingestion and Dataflow stream processing, it completes the platform’s end-to-end support for low-latency data processing,” Vambenepe added. “Whether you’re processing customer actions, application logs, or IoT events, Google Cloud Platform allows you to process them in real time, the cloud way. Leave Google in charge of all the scaling and administration tasks so you can focus on what needs to happen, not how.”
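Cloud Pub/Sub is a managed, network-scale service; the thread-and-queue sketch below only illustrates the publish/subscribe delivery pattern it provides — a publisher pushes events to a topic and a decoupled subscriber consumes them. All names here are illustrative, not the Pub/Sub API.

```python
import queue
import threading

# Illustrative publish/subscribe pattern; not the Cloud Pub/Sub API.
topic = queue.Queue()
received = []

def subscriber():
    while True:
        event = topic.get()
        if event is None:          # sentinel: stop consuming
            break
        received.append(event)

worker = threading.Thread(target=subscriber)
worker.start()

for event in ["click", "view", "purchase"]:  # the publisher side
    topic.put(event)
topic.put(None)                    # signal end of the demo stream
worker.join()

print(received)   # ['click', 'view', 'purchase']
```

In the real service, the queueing, scaling and delivery guarantees are Google's problem rather than the application's, which is the "fully managed" half of the pitch.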