In an effort to improve the understanding of human speech patterns and early child development, a group of vendors is joining forces to test the scalability boundaries of storage by constructing a petabyte-scale IP storage system at the Media Lab at the Massachusetts Institute of Technology.
The unique storage array, being built using integrated technology components from Bell Microproducts, Marvell, Seagate Technology, and Zetera, will be used to collect and analyze video and audio data for the Media Lab's Human Speechome Project at MIT.
By the middle of 2008, the project will have collected more than 1.4PB (about 1,400TB) of A/V data, according to MIT associate professor Deb Roy, who is serving as team lead for the Speechome project.
Over the past year, the research project has accumulated several terabytes of data per week from digital audio and video recordings of early childhood learning and socialization.
The goal is to study how children acquire language based on human input within the home environment. The information and footage are being recorded through 14 omni-directional cameras and microphones installed in homes.
The A/V data being accumulated and stored will be processed and analyzed by several hundred parallel processing devices as part of the exhaustive scientific analysis of long-term infant learning patterns.
According to Roy, a specialized suite of data mining tools is currently being built by his team to sift through the bulk of the A/V data. Over 200 servers are being used to analyze incoming A/V data patterns.
As part of the project, about 400,000 hours of video and data will be collected and stored over the next few years. Each night, Roy's department is tasked with moving two large chunks of data.
One is a collection volume, which is the main source of data and where all the information is being sorted. The other is a migrated copy of that data sent to another volume for analysis purposes.
The project's basic requirements proved too burdensome for existing storage technologies: high-performance reads and writes in excess of 160Gbps (gigabits per second); massive shared volumes growing past several hundred terabytes; and the ability to scale from an initial 50TB to more than a petabyte of storage.
In addition, the human speech study must support a fully virtualized storage fabric, file access from computers running multiple operating systems, and low-cost, high-capacity SATA (Serial ATA) drives.
The underlying storage fabric for the Media Lab Human Speechome Project storage collaboration is Zetera's Z-SAN. Built into Bell Microproducts' Hammer Z-Rack storage enclosures, Zetera's SOIP (storage over IP) technology is combined with the Hammer's ability to aggregate for capacity and performance via core and edge Ethernet switches from Marvell.
For its part, Marvell is supplying the SOIP processing nodes and the XGE (10 Gigabit Ethernet) connectivity fabric, while Seagate is lending its SATA drives to the project.
Upon completion, the storage system will feature over 3,500 Seagate SATA drives, more than 300 Hammer Z-Rack storage enclosures, over 100 Marvell-based 10-Gigabit Ethernet switches, and 400 blade processors, said Ryan Malone, senior director of marketing for Zetera, based in Irvine, Calif.
The system's high-performance I/O is expected to handle the processing of 700TB of data during each 12-hour overnight analytical operation, with RAID 10 protecting against data loss by keeping duplicate copies of the raw video data, transformed data and metadata files.
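As a rough sanity check on those figures, the nightly workload can be converted into an average data rate and compared with the 160Gbps read/write requirement cited earlier. The sketch below is only back-of-envelope arithmetic on the numbers quoted in this article; the function name is illustrative, and it assumes decimal units (1TB = 10^12 bytes) and an evenly sustained transfer across the full 12 hours, neither of which the vendors have stated.

```python
# Back-of-envelope check using figures quoted in the article.
# Assumptions (not from the vendors): decimal units, 1 TB = 10**12 bytes,
# and a perfectly even transfer rate across the whole window.

TB = 10**12  # bytes per terabyte (decimal)


def sustained_gbps(terabytes: float, hours: float) -> float:
    """Average rate, in gigabits per second, needed to move `terabytes` in `hours`."""
    total_bits = terabytes * TB * 8
    seconds = hours * 3600
    return total_bits / seconds / 1e9


# 700TB processed in each 12-hour overnight run
nightly = sustained_gbps(700, 12)
print(f"{nightly:.1f} Gbps")  # prints "129.6 Gbps"
```

Under those assumptions the overnight run averages roughly 130Gbps, which sits below the stated 160Gbps requirement and leaves some headroom for bursts and for the RAID 10 duplicate writes.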
The Human Speechome Project is being supported in part through a grant from the National Science Foundation.