With Amazon Web Services able to host 200 terabytes of genetic data in the cloud, medical researchers hope to spot the sequences leading to illnesses such as breast cancer and Parkinson's disease.
At a White House Big Data Summit on March 29, Amazon and the National Institutes of Health announced that they will make the full 1000 Genomes Project available as a free public data set on the company's Simple Storage Service (S3) and Elastic Block Store (EBS) services. Researchers can search the data for free from Amazon's Elastic Compute Cloud (EC2) and Elastic MapReduce (EMR) platforms.
The cloud database will allow medical researchers to predict the risks of illnesses such as diabetes, heart disease, sickle cell anemia and breast cancer.
The 1000 Genomes Project is an international research effort started in 2008 that holds anonymized genetic data for more than 1,700 people, the largest amount of genomic material available to researchers, according to Amazon. The genomic database will hold the genomes of 2,600 people from 26 populations, the company reported.
"The goal was to build up the world's largest map of human genetic variation," Dr. Matt Wood, product manager for big data and high-performance computing at AWS, told eWEEK.
The 200 terabytes of genetic data in the 1000 Genomes Project is comparable to 16 million file cabinets of text, or more than 30,000 standard DVDs, according to the White House Office of Science and Technology Policy.
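The DVD comparison can be checked with a rough calculation. The sketch below assumes decimal units (1 terabyte = 10^12 bytes) and a single-layer disc capacity of 4.7 gigabytes; neither figure comes from the announcement, and the helper name `dvds_needed` is purely illustrative.

```python
# Back-of-the-envelope check of the "more than 30,000 DVDs" comparison.
# Assumptions (not stated in the article): decimal units and
# 4.7 GB single-layer DVDs.
TB = 10**12  # bytes in one decimal terabyte

def dvds_needed(total_bytes: int, dvd_bytes: int = 47 * 10**8) -> int:
    """Return the number of discs needed to hold total_bytes, rounding up."""
    return -(-total_bytes // dvd_bytes)  # ceiling division on integers

dataset_bytes = 200 * TB
print(dvds_needed(dataset_bytes))
```

Under those assumptions the count comes out above 40,000 discs, which is consistent with the "more than 30,000" figure; a larger assumed disc capacity would lower the count.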
Amazon tracked the progress of the 1000 Genomes Project during its pilot stage and noticed both the gain in sequencing speed and the clearer genetic patterns that can now be traced.
The first human genome took 13 years to sequence, but with next-generation sequencing technology, the work can be done in weeks rather than years, Wood noted. "This is a real quantum leap," he said.
"In comparison to the pilot data, this data is of real biological importance," said Wood.
One area researchers will look at is genetic patterns in the BRCA2 gene, which has been linked to breast and ovarian cancer. Researchers will also search for patterns in hypertension, vascular conditions and Parkinson's disease, said Wood.
"We're really allowing researchers to start to look at the genetics that cause disease," said Wood. "The 1000 Genome Project and the data that's been made available on Amazon Web Services are all part of this continual shift of genomics getting closer and closer to clinical practice."
Low-cost DNA sequencing will enable genomics to directly impact clinical outcomes, according to Wood.
"In addition to providing insight into disease processes for early identification of risk factors, this new era of genomics is allowing clinicians and informaticians to work together on individual patient cases to influence clinical outcomes," Wood explained.
Researchers can search the genomic data to spot geographic patterns, such as those of Chinese people who live in Denver or those of people with Mexican or European ancestry, said Wood.
Scientists will not only be able to compare genomic data for populations and subpopulations of humans but also compare the human genome with patterns found in other species such as the gorilla or duck-billed platypus, said Wood.
In addition to studying genes that cause disease, researchers will examine genetic information that can help prevent illnesses and protect individuals, said Wood.
"There's a lot of scope here to look at both classes of research," he said.
Cloud computing is an area of growth for big data in health care. "Data sets can be considered big data when they exceed researchers' experience in dealing with them," Wood noted.
On Nov. 10, 2011, Dell announced a donation of cloud infrastructure for pediatric cancer research trials, and IBM announced Clinical Genomics, a data-analytics platform, on March 14.