Analyze Grid-Based Data Using Simple Analysis Codes
Step No. 3: Analyze grid-based data using simple analysis codes and the MapReduce programming pattern
Once a collection of objects (such as a Website's shopping carts or a financial company's pool of stock histories) has been hosted in a distributed data grid, it is valuable to be able to scan all of this data for patterns and trends. Over the last 25 years, researchers have developed a powerful two-step method, now popularly called "MapReduce," for analyzing large volumes of data in parallel.
In the first step, each object in the collection is examined for a pattern of interest by writing and running a simple algorithm that looks at just one object at a time. This algorithm is run in parallel on all objects so that all of the data is analyzed quickly. In the second step, the results generated by these runs are combined to produce an overall result, which ideally identifies an important trend.
For example, an e-commerce developer could write a simple routine that analyzes each shopping cart to determine which product categories are generating the most interest. This routine could be run on all shopping carts several times during the day (or perhaps after a marketing blitz on the Website has been launched) to identify important shopping trends.
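The shopping cart analysis can be sketched in a few lines of Python. This is only an illustration of the two-step pattern, not the API of any particular data grid product: the cart layout, the sample data, and the names analyze_cart and combine are all assumptions made for the example.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data: each cart is a list of
# (product, category, quantity) tuples.
carts = [
    [("trail shoes", "footwear", 1), ("rain jacket", "outerwear", 1)],
    [("sandals", "footwear", 2)],
    [("wool hat", "accessories", 1), ("parka", "outerwear", 1),
     ("boots", "footwear", 1)],
]


def analyze_cart(cart):
    """First step: look at one cart only and count items per category."""
    counts = Counter()
    for _product, category, quantity in cart:
        counts[category] += quantity
    return counts


def combine(left, right):
    """Second step: merge two partial results into one."""
    return left + right


# Run the per-cart analysis on every cart, then combine the results.
partial_results = [analyze_cart(cart) for cart in carts]
category_interest = reduce(combine, partial_results, Counter())

# Categories ordered from most to least interest.
ranking = category_interest.most_common()
print(ranking)
```

Because analyze_cart never looks beyond the single cart it is given, the calls are independent of one another and could run on any number of carts at the same time. Only the combine step needs to see more than one result.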
Distributed data grids offer an ideal platform for analyzing data using this MapReduce programming pattern. Because they store data as memory-based objects, the analysis code can be written and debugged as ordinary "in-memory" code. Programmers do not need to learn parallel programming techniques or understand how the grid works. In addition, distributed data grids provide the infrastructure needed to automatically run this analysis code on all grid servers in parallel and then combine the results. The net result is that, by using a distributed data grid, the application developer can quickly harness the full scalability of the grid to discover data patterns and trends that are vital to a company's success.
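To make the division of labor concrete, the following Python sketch imitates what the grid infrastructure does on the developer's behalf. It is a simulation under stated assumptions: a dictionary of hypothetical server names stands in for the grid's partitions, and a local thread pool stands in for the grid servers. A real data grid would run the analysis on separate machines and expose its own API for doing so.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: server name -> carts held in that server's
# memory. Each cart is simplified to a list of product categories.
grid_partitions = {
    "server-a": [["books", "music"], ["books"]],
    "server-b": [["music", "games", "books"]],
    "server-c": [["games"], ["games", "music"]],
}


def analyze_cart(cart):
    """Developer-written code: plain in-memory analysis of one cart."""
    return Counter(cart)


def run_on_server(local_carts):
    """Infrastructure role: apply the analysis to one server's carts."""
    local_total = Counter()
    for cart in local_carts:
        local_total += analyze_cart(cart)
    return local_total


def run_on_grid(partitions):
    """Infrastructure role: run all servers in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(run_on_server, partitions.values()))
    merged = Counter()
    for partial in partials:
        merged += partial
    return merged


totals = run_on_grid(grid_partitions)
print(totals)
```

The only function the application developer would write in this sketch is analyze_cart, which contains no threading or networking logic. Everything in run_on_server and run_on_grid represents work that the grid takes over.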
As companies become ever more pressed to manage increasing data volumes and quickly respond to changing market conditions, they are turning to distributed data grids to obtain the "scalability" boost they need. As clouds become an integral part of enterprise infrastructures, distributed data grids should further prove their value in harnessing the power of scalable computing to provide an essential competitive edge.
William L. Bain is founder and CEO of ScaleOut Software. He founded the company in 2003. He has worked at Bell Labs research, Intel and Microsoft. Bill founded and ran three startup companies prior to joining Microsoft. In the most recent company (Valence Research), he developed a distributed Web load balancing software solution that was acquired by Microsoft and is now called Network Load Balancing within the Windows Server operating system. William holds several patents in computer architecture and distributed computing. As a member of the screening committee for the Seattle-based Alliance of Angels, William is actively involved in entrepreneurship and the angel community. He has a PhD in Electrical Engineering/Parallel Computing from Rice University. He can be reached at firstname.lastname@example.org.