Data warehousing vendor Greenplum claims its new technology can push data loading speeds as high as four terabytes an hour. The key is what the company calls Scatter/Gather Streaming, which Greenplum is banking on to speed the loading process for companies with large data warehouses.
Greenplum's massively parallel processing (MPP) Scatter/Gather Streaming (SG Streaming) technology is designed to eliminate the bottlenecks associated with other approaches to data loading. At its core is a parallel-everywhere approach in which data flows from one or more source systems to every node of the database.
The technology is part of the company's bid to challenge players such as Oracle and Netezza. Customers are running into cost and performance constraints with competing solutions and are looking for scalable software to meet their needs, said Paul Salazar, vice president of marketing.
According to Greenplum, this differs from the traditional bulk-loading technologies used by most mainstream database and MPP appliance vendors, which push data from a single source, often over one or a small number of parallel channels. That approach can create bottlenecks and drive up load times.
"With our approach we hit fully linear parallelism because we take
all the source systems and we essentially do what we call scatter the
data," explained Ben Werther, director of product management at
Greenplum. "We break it up into chunks that are sprayed across hundreds
or thousands of parallel streams into the database and received...by all
the nodes of the database in parallel. The essence of it is we
eliminate all the bottlenecks."
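The scatter step Werther describes can be illustrated with a toy sketch. The code below is an assumption-laden simplification, not Greenplum's implementation: it round-robins rows from several source systems into many parallel streams, then drains all streams concurrently so no single channel becomes a choke point.

```python
# Toy illustration of "scatter" loading (not Greenplum's actual code):
# rows from multiple sources are chunked across many parallel streams,
# which are then ingested concurrently.
from concurrent.futures import ThreadPoolExecutor
from itertools import chain, cycle

def scatter(sources, n_streams):
    """Round-robin rows from all sources into n_streams parallel streams."""
    streams = [[] for _ in range(n_streams)]
    for stream, row in zip(cycle(streams), chain.from_iterable(sources)):
        stream.append(row)
    return streams

def load_stream(stream):
    # Stand-in for one database node ingesting one stream;
    # a real loader would write the rows to that node's storage.
    return len(stream)

# Three hypothetical source systems of different sizes.
sources = [range(0, 100), range(100, 250), range(250, 300)]
streams = scatter(sources, n_streams=8)
with ThreadPoolExecutor(max_workers=8) as pool:
    loaded = sum(pool.map(load_stream, streams))
print(loaded)  # 300 -- every row arrives, but over 8 channels, not 1
```

Because the streams are independent, adding nodes (and therefore streams) raises aggregate throughput, which is the "fully linear parallelism" the quote refers to.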
Performance scales with the number of Greenplum Database nodes, and
the technology supports both large batch and continuous near-real-time
loading patterns, company officials said. Data can be transformed and
processed in-flight, leveraging all nodes of the database in parallel.
Final gathering and storage of data to disk takes place on all nodes
simultaneously, with data automatically partitioned across nodes and
optionally compressed, Greenplum officials explained.
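The final gather step, as described, hash-partitions data across nodes and optionally compresses it before storage. A minimal sketch, assuming a simple hash-on-key partitioning scheme and zlib compression purely for illustration (neither is confirmed to be what Greenplum uses):

```python
# Toy sketch of the "gather" step: partition rows across nodes by key,
# then optionally compress each node's segment before storing it.
import zlib

def partition(rows, n_nodes):
    """Assign each (key, value) row to a node by hashing its key."""
    nodes = {i: [] for i in range(n_nodes)}
    for key, value in rows:
        nodes[hash(key) % n_nodes].append((key, value))
    return nodes

def store(node_rows, compress=True):
    # Serialize a node's rows; compression is optional, as in the article.
    blob = repr(node_rows).encode()
    return zlib.compress(blob) if compress else blob

rows = [(f"cust-{i}", i * 10) for i in range(1000)]
nodes = partition(rows, n_nodes=4)
segments = {n: store(r) for n, r in nodes.items()}
```

Since every node partitions and writes its own segment at the same time, the storage phase runs on all nodes simultaneously rather than funneling through a coordinator.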
"Our objective as we go through the product evolution...is to build
out a range of capabilities that are just again appealing to the
customers who we have today who want in many cases ever-increasing
rates of speed and loading, speed of query response, flexibility of
doing embedded analytics and really to most easily access very vast
volumes of data without having to do a lot of manipulation or a lot of
moving of data," Salazar said.