Five 'Dirty Little Secrets' to Know When Buying a Data Archive

By Chris Preimesberger  |  Posted 2008-10-10

It turns out there are some so-called dirty little secrets that not every vendor will tell you about archiving products. They fall into five categories of "secrets": scalability, data protection, performance, data migration and energy efficiency.

Dirty Little Secret No. 1: Scalability. CAS (content-addressable storage) archives have a hard limit on the number of objects that can be stored.

This is a very different metric from the total amount of usable storage a system might have.

"What nobody tells you is that as you grow the number of your stored objects, you're going to run into a few challenges," said Bob Woolery, senior vice president of marketing at Nexsan, which makes SANs (storage area networks) and archiving packages. "Let's say you have 5 terabytes of space. You say, 'Great, when I run out of 5TB, I'll buy 5TB more.' And you purchase it based on that. But the other constraint is your object count.

"Why this is important is that you can grow your archive so large in terms of object count that the system will give you an 'all full up,' when you still may have plenty of capacity left," Woolery said.

So you call up your local vendor and tell him that your system thinks it is full when you still have, say, 2TB of capacity left. "That's when you find out that the object count is what really determines how much capacity you use," Woolery said.

An object limit can be reached long before the actual storage limit is reached, which means customers now have to invest in a second expensive database even though they technically still have space available.

A good example of this is e-mail. A company may archive all e-mail for compliance purposes. The vast majority of these e-mail objects may be small in size, but the sheer volume may max out the archive's object limit quickly, leaving gigabytes or terabytes of storage space unused. This is usually a big shock for companies.
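
The e-mail case can be put into rough numbers. The following is a minimal sketch, not any vendor's sizing tool. The 5TB capacity echoes Woolery's example, while the object limit (50 million) and the average message size (50KB) are hypothetical figures chosen purely for illustration.

```python
def first_limit_hit(capacity_bytes, object_limit, avg_object_bytes):
    """Report which limit an archive reaches first and how much capacity is stranded.

    All inputs are illustrative assumptions, not figures for any real product.
    """
    # Number of objects that would fit if raw capacity were the only constraint.
    objects_by_capacity = capacity_bytes // avg_object_bytes
    if objects_by_capacity <= object_limit:
        # Capacity fills first (or both limits coincide); nothing is stranded.
        return "capacity", 0
    # The object limit is reached first; the remaining capacity cannot be used.
    used_bytes = object_limit * avg_object_bytes
    return "object count", capacity_bytes - used_bytes


TB = 10**12
KB = 10**3

# Hypothetical archive: 5TB usable, 50 million object limit, 50KB average e-mail.
limit, unused = first_limit_hit(5 * TB, 50_000_000, 50 * KB)
print(limit, unused / TB)  # prints: object count 2.5
```

Under these assumed numbers the archive reports full with half of its capacity still free, which is the situation Woolery describes.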

Dirty Little Secret No. 2: Performance degradation. As objects pile up in an archive, the archive slows down tremendously.

"What they don't want to tell you is that all of a sudden when you get near your object limit, you get this 'crawl' effect," Woolery said. "When you look under the hood of an archive, you see a single database. With the exception of [Nexsan's] Assureon, which has a dual [database], all of those systems have a single database. It can be a small or a large one, but it is still a single database."

The database simply becomes overwhelmed by managing a high number of objects and all their corresponding metadata.

"Because it had to manage an ever-growing number of objects and process them, the processors within the archive end up spending so much time managing those objects that they're not able to take in as many files and push them out the door when you need them," Woolery said.

A dual-database setup alleviates this issue, he said.

Dirty Little Secret No. 3: Data protection. The existence of the commodity hardware "back door."

Many archive companies allow a customer to use "commodity" hardware (standard storage arrays from companies such as Dell, Hewlett-Packard, EMC and IBM) along with their software as part of the "system solution."

"They then will claim that those archives are secure and that they maintain data protection and integrity," Woolery said. "But there is a huge back door, and that is this: Because it's commodity storage, it's anybody's storage, and you can't control that storage. The software is not tied directly to the hardware."

An example of this: An administrator could delete all data at the hardware level, effectively bypassing the software-based security. A good archiving solution should manage this at both the hardware and software layers.

"Someone can go in there and delete a complete RAID set, if they want to, without the application security being able to stop him," Woolery said. "The software doesn't know you deleted the RAID set, and because it's commodity storage, the software and hardware are not necessarily linked all the way down to the hardware level. That's a lot of work to do if you're using anybody's commodity storage."

Dirty Little Secret No. 4: Data migration. When an archive is moved, files can become orphaned and the entire process can become exceedingly slow.

When most archive applications ingest a file, they create a CAS address at the application level, not the archive level. As a result, the application is the only thing that knows how to find that file.
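
As a rough illustration of what an application-level content address means, the sketch below derives an address from a file's bytes and keeps the only name-to-address table inside a toy application class. It assumes SHA-256 as the hash, which the article does not specify, and the names ToyApplication, ingest and fetch are hypothetical.

```python
import hashlib


def content_address(data: bytes) -> str:
    # Assumption: a SHA-256 digest serves as the address; real products may differ.
    return hashlib.sha256(data).hexdigest()


class ToyApplication:
    """Hypothetical application that owns the only name-to-address index."""

    def __init__(self, store: dict):
        self.store = store  # the archive: address -> bytes
        self.index = {}     # application-level map: file name -> address

    def ingest(self, name: str, data: bytes) -> str:
        addr = content_address(data)
        self.store[addr] = data
        self.index[name] = addr
        return addr

    def fetch(self, name: str) -> bytes:
        return self.store[self.index[name]]


archive = {}
app = ToyApplication(archive)
app.ingest("memo.txt", b"quarterly results")

# The archive holds the bytes, but only app.index maps "memo.txt" to its
# address. Discard the application and the object is effectively orphaned.
```

In this toy model the archive itself never sees the file name, so any migration has to go back through the application's index to locate each object.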

"So, let's say down the line you want to change your application, or your archive system is growing old and you want to upgrade it and create a new one, then you must migrate that data off the old system," Woolery said.

"The problem is, you have to migrate it off, but you have to do it back through the old application, or you orphan the data. Let's say it's an e-mail database [such as Exchange]. So that means that while you're backing it out during that migration, that e-mail application will not be able to be used by the company."

And we're usually talking about a lot of objects, especially when it comes to e-mail.
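
A back-of-the-envelope estimate shows how object count drives migration time. This is only a sketch: the 100 million objects and the rate of 500 objects per second through the old application are hypothetical figures, not measurements of any product.

```python
SECONDS_PER_DAY = 86_400


def migration_days(object_count, objects_per_second):
    """Estimate migration time in days, assuming a constant per-object rate."""
    seconds = object_count / objects_per_second
    return seconds / SECONDS_PER_DAY


# Hypothetical e-mail archive: 100 million objects at 500 objects per second.
days = migration_days(100_000_000, 500)
print(round(days, 1))  # prints: 2.3
```

Even at this assumed rate, the application would be tied up for more than two days.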

"This will take longer than hours," Woolery said. "It could take days. Can you afford to have your e-mail app down for days? It's not possible. What if you're a hospital with lots of image files, which are big files? Due to HIPAA regulations, these files must be kept for seven years. That's a huge amount of data to migrate, and that's why it's so difficult. Vendors won't tell you this ahead of time."

Dirty Little Secret No. 5: Energy efficiency. Most archive systems are not very energy-efficient.

Since archiving systems usually have one database running across all the system disk drives, it is very difficult to be energy-efficient.

"If you request one file in the archive and it's on one disk, you will have to spin up every single disk in that system to get that one file," Woolery said.

"If you have 100 disks and there's one small file you need to access, or a diagnostic you need to perform, they all spin up if you have a single database. Whether it's a large, robust DB or a small one, it still requires spinning up all 100 disks."

Seems like something the archive industry should be trying to remedy, don't you think? In fact, all five of these issues perhaps should be researched a little more closely.

