Content-addressed storage systems arrived just in time to help organizations deal with what was then a new challenge—compliance. For harried IT managers desperate to bring their companies into compliance with regulatory mandates, the proprietary nature of these systems was a necessary evil.
That was then.
CAS stormed onto the scene in 2002, helping IT managers bring their organizations into compliance with a host of regulatory mandates by making their content searchable and tamper-proof. Probably the most widely known CAS product is EMC's Centera, which was the first to enter the market; other vendors have been trying to catch up with it since.
So, what exactly is a CAS solution? It's an intelligent repository used to store and preserve business data, such as documents and e-mail messages. CAS solutions can be used effectively by a wide range of organizations, but they are best suited for the storage of compliance-sensitive documents, such as medical records, blueprints, invoices and e-mail messages.
Through the use of content-derived identifiers, IT managers can ensure that sensitive business content is not altered, preserving the “paper trail” within a paperless environment, which is a compliance requirement in many industries.
Another core element of CAS products is the ability to perform high-speed searches through the repository. Since CAS products use hard drive-based arrays to store data—as opposed to slower and more cumbersome tape and optical archive technologies—auditors can search the contents of a CAS system at a very fast rate. This is key because the ability to swiftly retrieve information is another requirement for compliance with many regulations.
Vendors in this space offer a couple of ways for IT managers to move data into CAS units—from file servers and from applications.
Most CAS vendors support CIFS (Common Internet File System), NFS (Network File System) and other common protocols, allowing IT managers to easily move files from file servers to the CAS systems.
Migrating data from applications to various CAS solutions, however, is not so straightforward. CAS vendors publish open APIs that application vendors can use to establish links between their products and various CAS systems. Unfortunately, because every CAS vendor has its own set of APIs, application vendors must modify their wares specifically for each one.
As data is fed into a CAS system, a unique identifier is created for each piece of data. The identifier is derived from a hash value of the data being archived, and this information is stored in a repository for safekeeping. Because the identifier is derived from the content itself, any change to the original content spawns the creation of a new identifier.
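To make the mechanism concrete, here is a minimal sketch of a content-addressed store. It is an illustration only, not any vendor's implementation: the `ContentStore` class, its method names and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, for illustration)."""

    def __init__(self, algorithm="sha256"):
        # The hash algorithm is an assumption; real products make their own choice.
        self._algorithm = algorithm
        self._objects = {}

    def _address(self, data: bytes) -> str:
        # The identifier is derived purely from the bytes being archived.
        return hashlib.new(self._algorithm, data).hexdigest()

    def put(self, data: bytes) -> str:
        address = self._address(data)
        # Identical content maps to the same address, so it is stored only once.
        self._objects.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._objects[address]
        # Re-hash on retrieval: a mismatch means the stored bytes were altered.
        if self._address(data) != address:
            raise ValueError("stored content no longer matches its identifier")
        return data


store = ContentStore()
original = store.put(b"Invoice 1001: total $250.00")
duplicate = store.put(b"Invoice 1001: total $250.00")
altered = store.put(b"Invoice 1001: total $950.00")
```

In this sketch, `original` and `duplicate` are the same identifier, while `altered` is a different one, so an edited invoice can never silently take the place of the archived one.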
However, there also is no standard for creating the unique identifiers, so vendors create their hashes using different algorithms. EMC uses MD5, for example, while Hewlett-Packard's StorageWorks RISS (Reference Information Storage System) appliance uses SHA-1.
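The practical effect of that split is easy to demonstrate. The sketch below hashes the same record with both algorithms using Python's standard `hashlib` module; the `content_id` helper and the sample record are hypothetical, and real products may combine the hash with other information when forming an identifier.

```python
import hashlib


def content_id(data: bytes, algorithm: str) -> str:
    """Return a hex identifier for data under the named hash algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


record = b"Patient record 42: blood type O+"

md5_id = content_id(record, "md5")    # 128-bit digest, 32 hex characters
sha1_id = content_id(record, "sha1")  # 160-bit digest, 40 hex characters

print("MD5  :", md5_id)
print("SHA-1:", sha1_id)
```

The two identifiers differ in both length and value, so an identifier issued by one system means nothing to a system that uses the other algorithm, even when the archived bytes are identical.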
Proprietary hardware is another unfortunate characteristic of CAS solutions, and one that doesn't seem likely to go away any time soon.
From a physical standpoint, there is nothing exotic about the hardware in CAS solutions, which are basically built out of commodity servers and storage. However, as storage and processing power run low, IT managers must go back to their CAS vendors to purchase additional hardware; they can't just throw cheap hardware at the problem. HP's RISS 1.4, for example, scales to impressive heights, but only with proprietary SmartCell units.
When you factor in this limitation, along with the fact that archive solutions and the data they store are long-term investments, it is clear that IT managers should never rush into a CAS implementation.
Standards SOS
Despite its limitations, CAS is an important technology with the potential to ease much of the compliance burden for IT shops. So, eWEEK Labs is glad to see standards emerging that promise to simplify at least the movement of data from applications to CAS solutions.
Currently being developed by members of the Storage Networking Industry Association is XAM (Extensible Access Method), a storage interface that is designed to provide a standard method for applications to talk to CAS solutions and move data without the use of proprietary APIs.
Within the XAM specification will be provisions for metadata that will allow applications to tell CAS devices how long they want to retain the migrated information and what type of security needs to be maintained for that content.
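As a rough illustration of the idea, the sketch below attaches a retention period and a security label to content before it is archived. It does not use the XAM interface, which was still in development; `ArchiveRequest`, its field names and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ArchiveRequest:
    """Hypothetical envelope an application might hand to an archive.

    The field names are illustrative and are not taken from the XAM specification.
    """
    content: bytes
    retention_days: int   # how long the archive must keep the content
    access_level: str     # security label the archive should enforce

    def retain_until(self, archived_on: date) -> date:
        # Last day of the retention period, counted from the archive date.
        return archived_on + timedelta(days=self.retention_days)

    def may_delete(self, archived_on: date, today: date) -> bool:
        # Deletion is allowed only once the retention period has elapsed.
        return today >= self.retain_until(archived_on)


request = ArchiveRequest(
    content=b"Q3 audit report",
    retention_days=7 * 365,
    access_level="auditors-only",
)
archived_on = date(2006, 1, 1)
```

Carrying the policy alongside the content, rather than configuring it separately on each storage system, is what would let an application state its requirements once and leave enforcement to the archive.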
An SDK (software development kit) for XAM could be available as early as this year.
There also is room for hope in software. Because the hardware used in CAS solutions is not unique, there are opportunities for software players to create a CAS application with no hardware lock-in. One vendor with some potential here is Caringo, a CAS startup founded by Paul Carpentier, who invented the technology on which EMC's Centera was built.
It is important to note that, in the grand scheme of things, the long-term management of content is a complex problem that transcends the storage world. Solving it will require cooperation between application development teams and storage administrators.
Looking at things from an ILM (information lifecycle management) perspective, documents and other forms of content get created by people using a wide array of applications, and this content eventually completes its life cycle in an archive system—be it a tape or optical library or a CAS system.
Many applications do not have shared data repositories, so IT managers need to spend a significant amount of resources plotting out the ILM path for content coming from the various applications.
CAS can be implemented fairly quickly to deal with specific problems, such as e-mail and document archiving, but to create a more comprehensive corporate archive, we suggest that IT managers perform in-depth analyses of their applications before committing to a CAS solution. For example, administrators should make a conscious effort to eliminate outdated applications and file shares and consolidate application data when possible.
A broader standard, but one that could ease the consolidation of data before it's dumped into CAS solutions, is JSR (Java Specification Request)-170. Also known as the Content Repository API for Java technology, JSR-170 is a powerful standard that allows IT managers to consolidate data stores and to simplify and standardize the movement of data among applications.
JSR-170 was finalized last June, and work is being done on its successor, JSR-283, which should be complete in May 2007.
While it was originally designed for Java content, JSR-170 is not limited to Java applications—support has been added to allow PHP and .Net applications to work with JSR-170 repositories.
The first and most long-lasting benefit of a standard like JSR-170 is that it will give IT managers the flexibility to change when technology changes. With JSR-170 in place, new applications can connect to legacy repositories, allowing IT managers and developers to move forward without sacrificing old data and code.
Repository consolidation is another key benefit that can come from JSR-170. Often, many versions of a document can be found throughout a network. (For example, different versions of the same file might be sitting on a file server, portal and CMS [content management system].) With a centralized repository, users can easily find the most up-to-date version of the document instead of hunting through multiple locations.
From a storage management perspective, a single repository also makes it easier to maintain backups and manage storage resources.
JSR-283 will feature improvements to the management of access control and retention policies. (For more information, go to www.jcp.org/en/jsr/detail?id=283.)
Senior Analyst Henry Baltazar can be contacted at henry_baltazar@ziffdavis.com.