By hosting the human genetics catalog in the cloud, AWS gives researchers instant access to the complete 1000 Genomes Project, enabling scientists to accelerate disease research.
Amazon Web Services (AWS) and the U.S. National Institutes of Health (NIH)
announced that the complete 1000 Genomes Project is now available on AWS as a
publicly available data set.
AWS and NIH announced the news at the
White House Big Data Summit on March 29. The announcement makes the largest
collection of human genetic data available to researchers worldwide, free of
charge. The 1000 Genomes Project is an international research effort
coordinated by a consortium of 75 companies and organizations to establish the
most detailed catalog of human genetic variation, AWS officials said.
The project has grown to 200 terabytes
of genomic data, including DNA sequenced from more than 1,700 individuals that
researchers can now access on AWS for use in disease research. The 1000 Genomes
Project aims to include the genomes of more than 2,600 individuals from 26
populations around the world, and the NIH will continue to add the remaining
genome samples to the public data set this year.
The 1000 Genomes Project started out with pilot phases in 2008 that included
just a couple of terabytes of data, AWS told eWEEK. In 2010, NIH made a small
portion of that data available on AWS as a public data set, and due to the
positive feedback from scientists, it decided to make the 1000 Genomes Project
as it stands today, at more than 200TB of data, fully accessible on AWS. The
amount of data produced by the 1000 Genomes Project is unprecedented in
biomedical research, NIH officials said. NIH, part of the U.S. Department of
Health and Human Services, serves as one of the data coordinators for the 1000
Genomes Project.
"Previously, researchers wanting access to public data sets such as the 1000
Genomes Project had to download them from government data centers to their own
systems, or have the data physically shipped to them on discs," said Lisa D.
Brooks, Ph.D., program director for the Genetic Variation Program, National
Human Genome Research Institute, a part of NIH, in a statement. "This process
took a long time, and that's assuming a lab had the bandwidth to download the
data and sufficient storage and compute infrastructure to hold and analyze the
data once they had it. We are happy that the 1000 Genomes Project data are on
AWS to give researchers anywhere in the world a simple way to access the data
so they can put the data to work in their research."
"Putting the data in the AWS cloud provides a tremendous opportunity for
researchers around the world who want to study large-scale human genetic
variation but lack the computer capability to do so," said Richard Durbin,
Ph.D., co-director of the 1000 Genomes Project and joint head of human genetics
at the Wellcome Trust Sanger Institute, in Hinxton, England.
AWS said it would take researchers weeks to months to download the complete
1000 Genomes Project to their own servers, and that's assuming they had the
bandwidth to download the data and enough hardware and storage to hold it. To
do meaningful analysis on the data, researchers often needed access to very
large, high-performing compute resources, which cost hundreds of thousands and
sometimes millions of dollars, AWS officials said. The NIH was selected as one
of the data coordinators for the 1000 Genomes Project, and it wanted to remove
this friction and make the data as widely accessible as possible, so
researchers can immediately start analyzing and crunching the data, even if
they don't have the large budgets that are traditionally required for this
level of data analytics, AWS said.
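As a rough check on the download-time estimate, the short Python sketch below computes how long it would take to move a 200-terabyte data set over a sustained network link. The link speeds are illustrative assumptions, not figures supplied by AWS or NIH, and the calculation ignores protocol overhead and interruptions.

```python
def transfer_days(size_terabytes: float, link_megabits_per_s: float) -> float:
    """Days needed to move size_terabytes over a link sustained at the given rate."""
    bits = size_terabytes * 1e12 * 8              # decimal terabytes to bits
    seconds = bits / (link_megabits_per_s * 1e6)  # megabits per second to bits per second
    return seconds / 86_400                       # seconds per day


DATASET_TB = 200  # data set size reported in the article

# Link speeds below are hypothetical examples chosen for illustration.
for mbps in (100, 1_000):
    print(f"{mbps} Mbit/s: about {transfer_days(DATASET_TB, mbps):.0f} days")
```

Under these assumptions the transfer takes roughly six months at 100 Mbit/s and a little under three weeks at 1 Gbit/s, which is in line with the weeks-to-months range AWS cited.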
Public Data Sets on AWS provide a
centralized repository of public data stored in Amazon Simple Storage Service
(Amazon S3) and Amazon Elastic Block Store (Amazon EBS). The data can then be
directly accessed from AWS services such as Amazon Elastic Compute Cloud
(Amazon EC2) and Amazon Elastic MapReduce (Amazon EMR), eliminating the need
for organizations to move the data in-house and then procure enough technology
infrastructure to analyze the data effectively, AWS said.
For its part, AWS's highly scalable compute resources are being used to power
big data and high-performance computing applications such as those found in
science and research. NASA's Jet Propulsion Laboratory, Langone Medical Center
at New York University, Unilever, Numerate, Sage Bionetworks and Ion Flux are
among the organizations leveraging AWS for scientific discovery and research.
AWS is storing the public data
sets at no charge to the community. Researchers pay only for the additional AWS
resources they need for further processing or analysis of the data.
"It took more than 10 years and billions of dollars to sequence and publish the
very first human genome. Recent advances in genome sequencing technology have
enabled researchers to tackle projects like the 1000 Genomes by collecting far
more data, faster," said Deepak Singh, Ph.D., principal product manager for
Amazon Web Services, in a statement.
"This has created a growing need for powerful and instantly available
technology infrastructure to analyze that data. We're excited to help
scientists gain access to this important data set by making it available to
anyone with access to the Internet. This means researchers and labs of all
sizes and budgets have access to the complete 1000 Genomes Project data and can
immediately start analyzing and crunching the data without the investment it
would normally require in hardware, facilities and personnel. Researchers can
focus on advancing science, not provisioning the resources required for their
research."
AWS said the 1000 Genomes is a prime example
of big data, where data sets become so massive that few researchers have
access to the compute power in their own data centers to analyze and process
the data. Yet, a key point here is that the 1000 Genomes data will be sitting
right next to the compute power researchers need to derive value from the data.
In a matter of minutes, scientists can spin up as much compute power as they
need to crunch the massive data sets. Researchers will pay only for the
additional AWS resources needed for further processing or analysis of the data,
AWS said.
For more information about Public Data Sets on AWS, go to
http://aws.amazon.com/publicdatasets/.