How to Scale the Storage and Analysis of Data Using Distributed Data Grids - Analyze Grid-Based Data Using Simple Analysis Codes
Step No. 3: Analyze grid-based data using simple analysis codes and the MapReduce programming pattern
Once a collection of objects (such
as a Website's shopping carts or a financial company's pool of stock
histories) has been hosted in a distributed data grid, it's important
to be able to scan all of this data for significant patterns and trends.
Over the last 25 years, researchers have developed a powerful two-step
method now popularly called "MapReduce" for analyzing large volumes of
data in parallel.
In the first step, each object in the collection is analyzed for a
pattern of interest by writing and running a simple algorithm that
looks at just one object at a time. This algorithm is run in parallel
on all objects so that all of the data is analyzed quickly. In the
second step, the results generated by running this algorithm are
combined to determine an overall result, which ideally reveals an
important trend.
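The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stock symbols, prices, and the function names analyze and combine are hypothetical and do not come from any grid product. The first function looks at a single stock history; the second merges two results into one.

```python
from functools import reduce

# Hypothetical sample data: each object is one stock's price history.
stock_histories = {
    "AAA": [10.0, 10.5, 9.8, 11.2],
    "BBB": [20.0, 19.0, 18.5, 18.0],
    "CCC": [5.0, 5.1, 5.3, 5.2],
}

def analyze(symbol, prices):
    """Step 1: examine ONE object and emit a small result.

    Here the result is the percentage change from first to last price.
    """
    change = (prices[-1] - prices[0]) / prices[0] * 100.0
    return (symbol, change)

def combine(a, b):
    """Step 2: merge two results into one.

    Here we keep whichever stock gained the most.
    """
    return a if a[1] >= b[1] else b

# Step 1 applied to every object (sequentially here; a grid would do
# this in parallel), then step 2 folded over the per-object results.
mapped = [analyze(symbol, prices) for symbol, prices in stock_histories.items()]
best = reduce(combine, mapped)
```

Because analyze never touches more than one object, nothing in it has to change when the per-object calls are spread across many machines.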
For example, an e-commerce developer could write a simple routine that
analyzes each shopping cart to rate which product categories are
generating the most interest. This routine could be run on all
shopping carts several times during the day (or perhaps after a
marketing blitz on the Website has been launched) to identify
important shopping trends.
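A minimal sketch of such a shopping-cart analysis follows. The cart layout (a list of product, category, quantity tuples), the sample items, and the function names are assumptions made for illustration; a real application would use its own cart objects.

```python
from collections import Counter

# Hypothetical carts: each cart is a list of (product, category, quantity).
carts = [
    [("tent", "outdoors", 1), ("lantern", "outdoors", 2)],
    [("novel", "books", 2)],
    [("boots", "outdoors", 1), ("atlas", "books", 1)],
]

def categories_in_cart(cart):
    """Per-cart analysis: count the items in one cart by product category."""
    counts = Counter()
    for _product, category, quantity in cart:
        counts[category] += quantity
    return counts

def merge_counts(per_cart_counts):
    """Combine the per-cart counts into totals for the whole site."""
    total = Counter()
    for counts in per_cart_counts:
        total.update(counts)
    return total

# Categories ordered from most to least interest.
ranking = merge_counts(categories_in_cart(cart) for cart in carts).most_common()
# ranking is [("outdoors", 4), ("books", 3)]
```

Rerunning the same two functions later in the day, or after a marketing campaign, yields a fresh ranking that can be compared with the earlier one.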
Distributed data grids offer an ideal platform for analyzing data
using the MapReduce programming pattern. Because they store data as
memory-based objects, the analysis code can be written and debugged as
ordinary "in-memory" code that examines one object at a time.
Programmers do not need to learn parallel programming techniques or
understand how the grid works internally. Distributed data grids also
provide the infrastructure needed to automatically run this analysis
code on all grid servers in parallel and then combine the results. The
net result is that, by using a distributed data grid, the application
developer can easily and quickly harness the full scalability of the
grid to rapidly discover data patterns and trends that are vital to a
company's success.
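The division of labor described here can be imitated on a single machine. The sketch below is not any vendor's API: partition, run_on_server, and parallel_analyze are hypothetical names, the round-robin placement is an assumed policy, and a thread pool merely stands in for separate grid servers (in CPython, threads do not give true CPU parallelism). The point is that the developer supplies only the last, plain function.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(objects, server_count):
    """Spread objects across servers round-robin (an assumed placement policy)."""
    return [objects[i::server_count] for i in range(server_count)]

def run_on_server(local_objects, analyze):
    """Work done on ONE server: analyze each local object and merge locally."""
    local_result = Counter()
    for obj in local_objects:
        local_result.update(analyze(obj))
    return local_result

def parallel_analyze(objects, analyze, server_count=3):
    """What the grid would automate: fan out, run concurrently, combine."""
    shards = partition(objects, server_count)
    with ThreadPoolExecutor(max_workers=server_count) as pool:
        partials = list(pool.map(lambda shard: run_on_server(shard, analyze), shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# The only code the application developer writes: a single-object analysis.
def categories_in(cart):
    return Counter(cart)

# Hypothetical carts, each reduced to a list of category names.
carts = [["books", "toys"], ["books"], ["garden"], ["toys", "books"]]
result = parallel_analyze(carts, categories_in)
```

Since the per-server results are merged by simple addition, the totals do not depend on how the carts happen to be distributed among servers.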
As companies become ever more
pressed to manage increasing data volumes and quickly respond to
changing market conditions, they are turning to distributed data grids
to obtain the "scalability" boost they need. As clouds become an
integral part of enterprise infrastructures, distributed data grids
should further prove their value in harnessing the power of scalable
computing to provide an essential competitive edge.
William L. Bain is founder and CEO of ScaleOut Software.
He founded the company in 2003. He has worked at Bell Labs Research,
Intel and Microsoft. William founded and ran three startup companies prior
to joining Microsoft. In the most recent company (Valence Research), he
developed a distributed Web load balancing software solution that was
acquired by Microsoft and is now called Network Load Balancing within
the Windows Server operating system. William holds several patents in
computer architecture and distributed computing. As a member of the
screening committee for the Seattle-based Alliance of Angels, William
is actively involved in entrepreneurship and the angel community. He
has a PhD in Electrical Engineering/Parallel Computing from Rice
University. He can be reached at wbain@scaleoutsoftware.com.