Scaling Toward the Petabyte

Applications must be capable of handling huge warehouses.

If there's any doubt that enterprises are facing a data explosion, consider the next major threshold on the horizon for large databases. Within 18 months or so, experts say, a company or organization somewhere in the world will reach a petabyte of storage for a database.

More than likely, it will be for a massive data warehouse, not the transactional systems running day-to-day operations, and it likely will fall within retailing, financial services, health care or government. With the petabyte (1,000 terabytes) will come a growing need for data management systems that can scale even larger, perform even faster and manage themselves more automatically. "There's not an intrinsic technical meaning to the milestone," said Richard Winter, president of Winter Corp., in Waltham, Mass., a large-database consultancy. "It's something that makes people ... take notice."

It will also force database vendors, no matter how much they say their systems are ready, to improve their software, Winter said. Vendors such as Teradata, a division of NCR Corp.; IBM; and Oracle Corp. already are focusing on key areas crucial to managing such huge databases. Data management systems must be able to scale and grow rapidly through technologies such as clustering, they must maintain top performance even as demands on them increase, and they must provide self-management capabilities so that the workload on database administrators does not become unmanageable, Winter said.

IBM, for one, will continue adding self-management capabilities to its DB2 Universal Database. The company launched new self-managing database tools earlier this month and plans further features later this year with Version 8. Such advances will be critical to support databases reaching the petabyte range and maintaining high availability, said Jeff Jones, director of strategy at IBM, in San Jose, Calif.

Perhaps the largest data warehouse today belongs to Wal-Mart Stores Inc. Teradata, Wal-Mart's database vendor, estimates that the warehouse can hold 200 terabytes of data today. Winter said the figure includes the database's storage capacity, not the total amount of data in it, which is closer to 50 terabytes. Often, to improve performance, large data warehouses need more storage capacity than the actual amount of data collected. Still, at that size and with leading-edge companies doubling the size of their data warehouses every year, Winter predicted the first data warehouse with more than a petabyte of storage will be in production in 2004.
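The arithmetic behind that kind of forecast can be sketched in a few lines of Python. The 200-terabyte capacity, the 50 terabytes of actual data and the yearly doubling come from the figures above; the assumption that growth stays at a steady doubling every year, and the function name years_to_reach, are illustrative only and are not Winter's own method.

```python
import math

PETABYTE_TB = 1000  # the article's definition: 1 petabyte = 1,000 terabytes


def years_to_reach(current_tb: float, target_tb: float,
                   annual_growth_factor: float = 2.0) -> float:
    """Years of steady compound growth needed to go from current_tb to target_tb.

    Hypothetical helper: assumes size multiplies by annual_growth_factor
    every year, which is a simplification of real warehouse growth.
    """
    if current_tb <= 0 or target_tb <= 0:
        raise ValueError("sizes must be positive")
    if annual_growth_factor <= 1:
        raise ValueError("growth factor must be greater than 1")
    if current_tb >= target_tb:
        return 0.0
    # Solve current_tb * factor**years == target_tb for years.
    return math.log(target_tb / current_tb) / math.log(annual_growth_factor)


if __name__ == "__main__":
    # Storage capacity (200 TB) versus data actually held (about 50 TB).
    print(f"capacity: {years_to_reach(200, PETABYTE_TB):.1f} years")  # 2.3
    print(f"data:     {years_to_reach(50, PETABYTE_TB):.1f} years")   # 4.3
```

Under those assumptions, a warehouse with 200 terabytes of capacity crosses the petabyte mark in a little over two years, while the roughly 50 terabytes of actual data would need about two more doublings, which is why the distinction between capacity and stored data matters to the forecast.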

Teradata Chief Technology Officer Stephen Brobst agreed, saying he expected a customer to deploy a petabyte database within 18 months. Brobst, in Dayton, Ohio, said the holdup isn't technological but rather customer readiness. Teradata this month certified the database components necessary to build a petabyte database, Brobst said. "It's all about how much detailed information can you have that makes economic sense," he said.

Enterprises such as CNN, a division of Turner Broadcasting Inc., are eyeing a petabyte. CNN has used IBM's DB2 to underpin an archive of its news footage. The archive still is in its early stages of transferring some 1.2 petabytes into the system, which could take at least five years, said Gordon Castle, senior vice president of CNN technology, in Atlanta. Within a year, Castle expects to be archiving as much as 240 terabytes of data every year from new footage. Technically, the system won't be a full-fledged petabyte database because the database itself won't store the data objects but will point to them through metadata, he said. One of his biggest concerns is making sure the system can continue to perform well with fast query responses as demands increase. The archive won't be static but will instead be accessed daily.

"This is somewhat heavy lifting for a database, and there's a lot of access and controlled movement of content, and the files are quite large," Castle said.

Even after it is reached, a petabyte database will remain the exception for most companies years beyond its unveiling. Most enterprises don't have compelling-enough business reasons to build and manage such massive databases. United Airlines Inc. is a good example. The airline is building a 6-terabyte data warehouse but has been taking a slower-growth approach by focusing more on analyzing real business issues than on the warehouse's sheer size, said Casey Hossa, director of sales, marketing and call center technology at United, in Elk Grove Village, Ill. "There's a lot of [data] you can capture, but we're ... trying to find those things that are core to the organization," Hossa said.