Sunday, April 1, 2012

Big Data on Amazon's Cloud

The world’s largest set of data on human genetic variation, produced by the 1000 Genomes Project, is now available on the Amazon Web Services (AWS) cloud, NIH and AWS jointly announced. This means that researchers and labs of all sizes and budgets have access to 1000 Genomes Project data and can immediately start analyzing the data without the investment in hardware, facilities, and personnel that it would normally require.

The data being released in the cloud includes results from sequencing the DNA of 1,700 people; the remaining 900 samples are to be sequenced in 2012. The results identify genetic variations that occur in less than one percent of the study population and that may contribute to common diseases such as cancer and diabetes.

The “Big Data Initiative”, which includes NIH, NSF, DOD, and the Department of Energy, is committing more than $200 million in a collaborative effort to develop core technologies and other resources to manage and analyze enormous data sets. Among the NIH components participating in the Big Data Initiative are the National Human Genome Research Institute (NHGRI) and the National Center for Biotechnology Information (NCBI), a division of NLM.

Since the 1000 Genomes Project was launched, the data set has grown enormously. At 200 terabytes, the current 1000 Genomes Project records are an example of how data has become so massive that few researchers have enough computing power of their own to use it. Cloud access lets users analyze the data more quickly, because they can run their analyses over many servers at one time, and it eliminates the time-consuming download of the data.
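To put the 200-terabyte figure in perspective, here is a rough back-of-the-envelope sketch in Python of how long a full download would take. The link speeds are illustrative assumptions, not figures from the announcement, and the calculation assumes decimal terabytes and a perfectly sustained transfer rate.

```python
def transfer_days(size_terabytes: float, link_megabits_per_s: float) -> float:
    """Days needed to move size_terabytes over a link running at a sustained speed."""
    bits = size_terabytes * 1e12 * 8                 # decimal terabytes -> bits
    seconds = bits / (link_megabits_per_s * 1e6)     # megabits per second -> bits per second
    return seconds / 86400                           # seconds -> days


if __name__ == "__main__":
    # Hypothetical link speeds chosen only for illustration.
    for mbps in (100, 1000, 10000):
        print(f"{mbps:>6} Mbit/s: {transfer_days(200, mbps):.1f} days")
```

Under these assumptions, a 100 Mbit/s connection would need roughly 185 days to copy the full data set, which is why running the analysis next to the data is attractive.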

AWS has posted the 1000 Genomes Project data for free as a public data set. The data can be accessed through services such as Amazon Elastic Compute Cloud and Amazon Elastic MapReduce. These services provide the highly scalable resources needed to power the big data and high performance computing applications that research often requires.

The public-private collaboration to store the data in the AWS cloud allows any researcher to access and analyze the data at a fraction of the cost it would take for their institution to acquire the needed internet bandwidth, data storage, and analytical computing capacity. Researchers pay only for the additional AWS resources they need to further process or analyze the data.

As part of the “Big Data Initiative”, NIH is joining with NSF to fund the development of core technologies for data collection, management, analysis, and extraction. NIH is particularly interested in imaging, molecular, cellular, electrophysiological, chemical, behavioral, epidemiological, clinical, and other data sets related to health and disease.

Go to http://s3.amazonaws.com/1000genomes to view the 1000 Genomes Project data available through AWS. The data is also available at www.1000genomes.org and from NCBI at ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes.
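For readers who prefer a script to a browser, the following is a minimal Python sketch that lists a few objects from the bucket URL above. It assumes the bucket still allows anonymous listing and returns the standard S3 ListBucketResult XML; the function names and the limit of 10 keys are illustrative choices, not part of the announcement.

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 bucket listings.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}
BUCKET_URL = "http://s3.amazonaws.com/1000genomes"


def parse_listing(xml_text):
    """Return (key, size_in_bytes) pairs from an S3 ListBucketResult document."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.findall("s3:Contents", S3_NS):
        key = item.findtext("s3:Key", namespaces=S3_NS)
        size = int(item.findtext("s3:Size", default="0", namespaces=S3_NS))
        entries.append((key, size))
    return entries


def list_bucket(max_keys=10):
    """Fetch the first max_keys entries of the public bucket (assumes anonymous access)."""
    url = f"{BUCKET_URL}?max-keys={max_keys}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_listing(resp.read())


if __name__ == "__main__":
    for key, size in list_bucket():
        print(f"{size:>14,d}  {key}")
```

Only the standard library is used here, so no AWS account or credentials are needed for listing. Analyses that run on Amazon Elastic Compute Cloud would more typically read the same objects with an AWS SDK.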

NIH also funds many projects to develop new computational tools for analyzing genomic data. For example, NHGRI just provided $1.5 million to fund the development of Galaxy (https://main.g2.bx.psu.edu), an open source software suite for data analysis in the life sciences that was developed at Pennsylvania State University and Emory University.