By Rick Abbott and Bob Zurek  |  Posted 2010-04-28
 
 
 

How to Achieve Greener Data Storage and Analysis


In a world where business is transacted 24/7 across every available channel, companies need to collect, store, track and analyze enormous volumes of data: everything from clickstream data and event logs to mobile call records and more. But this all comes at a cost to both businesses and the environment. Data warehouses and the sprawling data centers that house them use a huge amount of power, both to run legions of servers and to cool them. Just how much? A whopping 61 billion kilowatt-hours of electricity, at an estimated cost of $4.5 billion annually.

The IT industry has begun to address energy consumption in the data center through a variety of approaches including the use of more efficient cooling systems, virtualization, blade servers and storage area networks (SANs). But a fundamental challenge remains. As data volumes explode, traditional, appliance-centric data warehousing approaches can only continue to throw more hardware at the problem. This can quickly negate any green gains seen through better cooling or more tightly packed servers.

To minimize their hardware footprint, organizations also need to shrink their "data footprint" by addressing how much server space and resources their information analysis requires in the first place. A combination of new database technologies expressly designed for analysis of massive quantities of data and affordable, resource-efficient, open-source software can help organizations save money and become greener.

Organizations can do so in the following three key areas: reduced data footprint, reduced deployment resources, and reduced ongoing management and maintenance. Let's take a look at each more closely:

1. Reduced data footprint

In recent years, column-oriented databases have come to be widely regarded as the preferred architecture for high-volume analytics. A column-oriented database stores data column by column instead of row by row. There are many advantages to this. Most analytic queries involve only a subset of the columns in a table, so a column-oriented database retrieves only the data that is required. This speeds queries and reduces disk I/O and computing resources.
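To make the row-versus-column distinction concrete, here is a minimal Python sketch. The toy table, its column names and the "values touched" counter are all hypothetical illustrations, not the internals of any particular database product; the counter is only a rough stand-in for disk I/O.

```python
# Toy table: each row is (user_id, country, page, latency_ms).
# All names and values here are made up for illustration.
rows = [
    (1, "US", "/home", 120),
    (2, "DE", "/cart", 340),
    (3, "US", "/home", 95),
    (4, "FR", "/search", 210),
]
COLUMNS = ("user_id", "country", "page", "latency_ms")


def to_column_store(rows):
    """Pivot a list of row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def avg_latency_row_store(rows):
    """Row layout: whole rows are read, so every field is touched."""
    values_touched = sum(len(row) for row in rows)
    total = sum(row[3] for row in rows)
    return total / len(rows), values_touched


def avg_latency_column_store(columns):
    """Column layout: only the latency_ms column is read."""
    latency = columns["latency_ms"]
    return sum(latency) / len(latency), len(latency)


columns = to_column_store(rows)
row_avg, row_touched = avg_latency_row_store(rows)
col_avg, col_touched = avg_latency_column_store(columns)
print(row_avg, row_touched)  # same answer, 16 values touched
print(col_avg, col_touched)  # same answer, 4 values touched
```

Both layouts return the same average, but the column layout reads a quarter of the values for this four-column table. The gap widens as tables get wider and queries stay narrow.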

Furthermore, these databases enable efficient data compression because each column stores a single data type, as opposed to rows that typically contain several data types. Compression can be optimized for each particular data type, reducing the amount of storage needed for the database. Column orientation also greatly accelerates query processing, which significantly increases the number of concurrent queries a server can process.
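As a rough illustration of why single-type columns compress well, the sketch below applies run-length encoding, one simple scheme among many, to a hypothetical country column and to the same data interleaved with IDs the way a row layout would store it. Real products use more sophisticated, type-specific encodings; the data and function names here are assumptions for illustration.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical column: many repeated values stored next to each other.
country = ["US"] * 6 + ["DE"] * 3 + ["FR"] * 3
encoded_column = run_length_encode(country)

# The same data stored row-wise: ids and countries interleaved,
# so no two adjacent values are ever equal and no runs form.
interleaved = [v for pair in zip(range(12), country) for v in pair]
encoded_rows = run_length_encode(interleaved)

print(len(country), "->", len(encoded_column))    # 12 values -> 3 pairs
print(len(interleaved), "->", len(encoded_rows))  # 24 values -> 24 pairs
```

The column collapses to three pairs and decodes back losslessly, while the interleaved layout gains nothing from the same encoder.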

There are a variety of column-oriented solutions on the market. Some duplicate data and require as large a hardware footprint as traditional row-based systems. Others combine column orientation with other technologies, which eliminates the need for data duplication. This means that users don't need as many servers or as much storage to analyze the same volume of data.

For example, some column-oriented databases can achieve compression results ranging from 10:1 (a 10TB database becomes a 1TB database) to more than 40:1, depending on the data. With this level of compression, a distributed server environment can be reduced by a factor of 20 to 50 and brought down to a single box, slashing heat, power consumption and carbon emissions.
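The arithmetic behind these ratios can be sketched in a few lines of Python. The 10TB starting size and the 10:1 and 40:1 ratios come from the text; the 0.5TB of usable storage per server is a purely hypothetical figure chosen to make the example concrete.

```python
import math


def compressed_size_tb(raw_tb, ratio):
    """Size after compression at ratio:1."""
    return raw_tb / ratio


def servers_needed(size_tb, tb_per_server):
    """Whole servers required to hold size_tb (at least one)."""
    return max(1, math.ceil(size_tb / tb_per_server))


RAW_TB = 10.0        # from the text: a 10TB database
TB_PER_SERVER = 0.5  # hypothetical usable capacity per server

for ratio in (1, 10, 40):
    size = compressed_size_tb(RAW_TB, ratio)
    print(f"{ratio}:1 -> {size} TB on {servers_needed(size, TB_PER_SERVER)} server(s)")
```

Under that assumed per-server capacity, the uncompressed data needs 20 servers, 10:1 compression needs two, and 40:1 fits on one.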

Virtual data marts are also coming on the scene, leveraging Enterprise Information Integration (EII) technologies to create specialized views of data sets without the need for physical storage. The downside to this approach is that complex queries can be sluggish, which can be a problem when analytic needs call for close to real-time insight.
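The core idea, a specialized view that stores a query rather than a copy of the data, can be sketched with an ordinary SQL view in Python's built-in sqlite3 module. This is a deliberate simplification: an EII product would federate several separate source systems, whereas this example uses one in-memory database, and the table, view and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('EMEA', 'widget', 100.0),
        ('EMEA', 'gadget', 250.0),
        ('APAC', 'widget', 75.0);
    -- The "virtual mart": a stored query, not a second copy of the rows.
    CREATE VIEW emea_sales AS
        SELECT product, amount FROM sales WHERE region = 'EMEA';
""")

emea_total = conn.execute("SELECT SUM(amount) FROM emea_sales").fetchone()[0]

# A new row in the base table is visible through the view immediately,
# because the view is re-evaluated on every query.
conn.execute("INSERT INTO sales VALUES ('EMEA', 'widget', 50.0)")
emea_total_after = conn.execute("SELECT SUM(amount) FROM emea_sales").fetchone()[0]

print(emea_total, emea_total_after)
conn.close()
```

Because the view is recomputed each time it is queried, it costs no extra storage, which is also why complex queries against such views can be slow.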

Open-source software takes efficient resource utilization a step further as it typically does not require proprietary hardware or specialized appliances.

2. Reduced deployment resources

New database technology, combined with open source, also enables simpler, "do it yourself" testing and deployment models. This greatly reduces the amount of resources involved in getting an analytic solution up and running.

Consider the resource requirements potentially involved in the acquisition and deployment of a traditional, proprietary solution: a lengthy product evaluation process will likely be followed by on-site visits from the vendor to set up and configure hardware and equipment. The costs, from both an environmental and a bottom-line perspective, include travel (plane trips, car rentals and hotel accommodations) and hardware (multiple servers, cooling equipment and connectors), as well as personnel (a full team of experts may be required to customize the solution).

With new open-source technologies, software can be downloaded online. Plus, it's designed to be easy to install so that one person can handle setup and deployment. Support needs are also less involved, which means that issues can be handled via conference calls rather than through more costly and carbon-consuming in-person travel.

3. Reduced ongoing management and maintenance

Because traditional data warehousing solutions are generally built to handle specific types of queries, they are not particularly well-suited for environments where data management needs are constantly changing and real-time analysis is critical. (And the reality is that real-time, dynamic requirements are pretty much the norm in a Web-dominated world.) Retrofitting these solutions to handle ad hoc queries requires an enormous amount of manual fine-tuning and results in a huge drain on IT resources.

For instance, trying to run a set of complex analytic queries alongside constantly changing schemas on a warehouse designed to store large volumes of IT-oriented log data is like trying to use a dictionary to find driving directions. It involves a complete reconfiguration of the underlying data structures, requiring database designers to create indexes and data partitions. Indexing and partitioning also increase data size, in some cases by a factor of two or more.
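The storage overhead of indexing is easy to observe on a small scale. The sketch below uses Python's built-in sqlite3 module with a hypothetical events table and measures the database's page count before and after adding two indexes. SQLite stands in here only because it ships with Python; the exact growth depends on the engine, the data and the indexes chosen, so no specific figure is claimed.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "events.db"))
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, user TEXT, url TEXT)"
    )
    # Hypothetical log-style data: 20,000 rows, 500 distinct users.
    conn.executemany(
        "INSERT INTO events (user, url) VALUES (?, ?)",
        [(f"user{i % 500}", f"/page/{i}") for i in range(20000)],
    )
    conn.commit()

    def page_count(connection):
        """Number of pages currently allocated to the database file."""
        return connection.execute("PRAGMA page_count").fetchone()[0]

    pages_before = page_count(conn)

    # Indexes added to support ad hoc lookups on non-key columns.
    conn.execute("CREATE INDEX idx_events_user ON events (user)")
    conn.execute("CREATE INDEX idx_events_url ON events (url)")
    conn.commit()

    pages_after = page_count(conn)
    growth = pages_after / pages_before
    conn.close()

print(f"pages before: {pages_before}, after: {pages_after}, growth: {growth:.2f}x")
```

Each index is an extra on-disk structure holding a sorted copy of the indexed column plus row references, so every index added for a new query pattern enlarges the footprint.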

In contrast, some of the new analytic database products eliminate these manual and ongoing efforts to provide a more "Google-like" experience, so that users can easily leverage the software to answer many types of questions. This level of flexibility presents the potential to reduce ongoing maintenance and operational support by as much as 90 percent. A business that needs to build only one solution to serve many users makes better use of its staff, time and money. In addition, greater data analysis productivity means less hardware can be used without sacrificing performance.

Some companies are even moving to outsourced operations management, further reducing the number of resources required on-site. Analytic, integration and intelligence solutions that are simple to use and maintain lend themselves well to this kind of arrangement, as they can be easily administered by outsourced vendors.

Rick Abbott is President of 360DegreeView, LLC. Rick has over 19 years of information management and technology experience, including private and public sector work. On the commercial side, Rick has significant experience in both the telecommunications and financial services industries. Rick has over eight years of "Big 5" experience, including an associate partnership position with Deloitte Consulting. Rick's primary focus over the past 13 years has been on large-scale business intelligence initiatives. He has direct experience in all aspects of business intelligence and data warehouse projects including business case development, strategic planning and business alignment, business requirements, and technical architecture and design. He possesses over 10 years of large, IT-related project management experience. Rick also has significant experience in assisting clients in negotiating large technology product, service, and outsourcing contracts. He can be reached at rick@360degreeview.com.

  

Bob Zurek is Chief Technology Officer and Vice President of Product Management at Infobright. Bob is also responsible for client services including sales engineering and implementation services. Bob has over 25 years of proven success in software development, technology research, and product management. He also possesses deep expertise in database management systems, business intelligence, and open-source technologies. Prior to joining Infobright, Bob was vice president of products and CTO at EnterpriseDB. While at EnterpriseDB, Bob led the company's technology and product management operations for their open-source product line. Prior to EnterpriseDB, Bob held management positions at IBM, Ascential Software and other technology companies where he consistently demonstrated the ability to define and deliver market-leading products with strong competitive differentiation. He can be reached at bob.zurek@infobright.com.

 
