How to Scale the Storage and Analysis of Data Using Distributed Data Grids

Data-parallel programming on a distributed data grid is an important new method for overcoming performance bottlenecks in a broad class of applications. It is expected to have important applications in cloud computing over the next few years. Here, Knowledge Center contributor William L. Bain discusses how a distributed data grid can be used to implement powerful, Java-based applications for parallel data analysis.


A hallmark of the Information Age is the incredible amount of business data that companies have to store and analyze. The ability to efficiently search data for important patterns can provide an essential competitive edge. For example, an e-commerce Website needs to be able to monitor online shopping carts to see which products are selling quickly. A financial services company needs to hone its equity trading strategy as it optimizes its response to fast-changing market conditions.

Businesses that face challenges such as these have turned to distributed data grids (also called distributed caches) to scale their ability to manage fast-changing data and to comb through that data for patterns and trends requiring a timely response. Distributed data grids offer three key advantages.

First, they store data in memory instead of on disk for fast access. Second, they run seamlessly across a farm of servers to scale performance. But perhaps best of all, they provide a fast, easy-to-use platform for running "what if" analyses on the data they store. By analyzing data in parallel across the grid's servers instead of sequentially on one machine, they can take performance to a level that stand-alone database servers cannot match.

Software architects and developers often say, "OK, I see the advantages, but how do I incorporate a distributed data grid into my data storage architecture? And how could it help me to analyze my data?" The following are three simple steps for building a fast, scalable data storage and analysis solution using a distributed data grid:

Step No. 1: Store fast-changing business data directly in a distributed data grid instead of a database server

Distributed data grids are designed to plug directly into the business logic of today's enterprise applications and services. By storing data as collections of objects instead of relational database tables, they match the in-memory view of data already used by business logic. This makes distributed data grids exceptionally easy to integrate into existing applications using simple APIs, which are available for widely used languages such as C#, Java and C++.
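To make the idea of storing business objects through a simple API concrete, here is a minimal Java sketch. Everything in it is an assumption made for illustration: the DataGrid interface, the LocalGrid class and the ShoppingCart object are hypothetical names, not the API of any particular data grid product, and LocalGrid is only a single-process stand-in backed by a ConcurrentHashMap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these types come from a real data grid product.
class GridStorageSketch {

    // Assumed minimal grid API: create/update, read and remove by key.
    interface DataGrid<K, V> {
        void put(K key, V value);
        V get(K key);
        void remove(K key);
    }

    // Single-process stand-in; a real grid would spread entries across servers.
    static class LocalGrid<K, V> implements DataGrid<K, V> {
        private final Map<K, V> entries = new ConcurrentHashMap<>();
        public void put(K key, V value) { entries.put(key, value); }
        public V get(K key) { return entries.get(key); }
        public void remove(K key) { entries.remove(key); }
    }

    // Business object stored as-is, with no mapping to relational tables.
    static class ShoppingCart {
        final String customerId;
        final List<String> productIds = new ArrayList<>();
        ShoppingCart(String customerId) { this.customerId = customerId; }
        void add(String productId) { productIds.add(productId); }
    }

    public static void main(String[] args) {
        DataGrid<String, ShoppingCart> grid = new LocalGrid<>();

        // Create a cart and store it under the customer's ID.
        ShoppingCart cart = new ShoppingCart("customer-42");
        cart.add("sku-1001");
        grid.put(cart.customerId, cart);

        // Read the cart back, update it and write it back to the grid.
        ShoppingCart stored = grid.get("customer-42");
        stored.add("sku-2002");
        grid.put(stored.customerId, stored);

        System.out.println("Items in cart: " + grid.get("customer-42").productIds.size());

        // Remove the cart once the order is complete.
        grid.remove("customer-42");
    }
}
```

The business logic works with the same ShoppingCart object it already holds in memory; the only grid-specific code is the put, get and remove calls. A real grid would also have to serialize the object and send it over the network, which this local stand-in does not attempt to model.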

Because distributed data grids run on server farms, their storage capacity and throughput scale simply by adding more grid servers. When hosted on a large server farm or in the cloud, a distributed data grid's ability to store and quickly access large volumes of data can grow well beyond that of a stand-alone database server.
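One way to picture why capacity grows with the number of servers is to look at how stored objects might be divided among them. The Java sketch below assumes simple hash-modulo placement, which is an illustrative choice and not a description of how any particular grid product works; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of spreading keys over grid servers; not a real product's algorithm.
class GridPartitionSketch {

    // Maps a key to one of serverCount servers using hash-modulo placement.
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static int serverFor(String key, int serverCount) {
        return Math.floorMod(key.hashCode(), serverCount);
    }

    // Counts how many of the keys "cart-0" .. "cart-(objectCount - 1)" land on each server.
    static int[] objectsPerServer(int objectCount, int serverCount) {
        int[] counts = new int[serverCount];
        for (int i = 0; i < objectCount; i++) {
            counts[serverFor("cart-" + i, serverCount)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print the per-server object counts for farms of increasing size.
        for (int servers : new int[] {2, 4, 8}) {
            System.out.println(servers + " servers: "
                + Arrays.toString(objectsPerServer(100_000, servers)));
        }
    }
}
```

Each key lands on exactly one server, so the same set of objects is divided among however many servers the farm contains, and each server's memory only has to hold its own share. Plain modulo placement moves most keys when the server count changes, so a production grid would likely use a scheme that limits how much data is moved when servers join or leave.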