The scientists who study which genes are "talking" in cells and tissues do a good job of sharing their stories in the form of freely available data. Not so for the folks studying proteins. These experts don't share much, and when they do, they tend to use different formats.
The gene experts have a history of being open. Pat Brown of Stanford University, the fellow largely credited with inventing the microarray, or "gene chip," absolutely refused to patent it or the software necessary to interpret the resulting data, and his scientific progeny have followed in his footsteps.
Genetic scientists are also some of the driving forces behind the Public Library of Science journals, which publish peer-reviewed articles but don't, on principle, charge for subscriptions. There are some 100 million freely available measurements of mRNA expression, which indicate which genes are actively issuing their particular instructions in a certain cell or tissue.
The comparable data for proteins are some 10,000-fold fewer, says a group of scientists from the Center for Systems and Synthetic Biology and the Institute for Cellular and Molecular Biology at the University of Texas at Austin. In an article published in the journal Nature Biotechnology this month, John Prince and colleagues call for a public proteomics repository, saying the lack of one hampers scientific progress in understanding which proteins are around in various cell types, tissues and disease states.
In contrast, publicly available mRNA expression data have allowed scientists to figure out systems of genes that work together and to find evidence of post-transcriptional modification (ways to amend instructions that don't have to do directly with DNA).
Such opportunities are practically unavailable to the protein crowd because very few data are public. To be fair, collecting protein data is more difficult and less developed than collecting mRNA data. Hence the need for a concerted effort to collect them.
The most robust data come from mass spectra, which you get when you break up proteins into fragments and shoot the fragments through an electric field under a vacuum in tiny, twisted tubes. Differences in mass and charge make some fragments move faster than others. Complicated algorithms and smart scientists can chomp on the arrival times to identify proteins, but not always reliably.
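For readers who want to see why arrival times carry information about mass and charge, here is a minimal sketch in Python. It assumes an idealized linear time-of-flight instrument, which is only one kind of mass spectrometer and is not named in the article. The function names, the 20,000-volt accelerating voltage and the one-meter drift length are all illustrative assumptions, not details from Prince and colleagues.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms per dalton


def flight_time(mass_da, charge, voltage=20_000.0, drift_length=1.0):
    """Seconds an ion needs to cross the drift tube.

    Assumes an idealized instrument: every ion starts at rest, gains
    kinetic energy charge * e * voltage, then coasts drift_length meters.
    """
    mass_kg = mass_da * DALTON
    kinetic_energy = charge * ELEMENTARY_CHARGE * voltage
    speed = math.sqrt(2.0 * kinetic_energy / mass_kg)
    return drift_length / speed


def mass_to_charge(time_s, voltage=20_000.0, drift_length=1.0):
    """Invert flight_time: recover mass per unit charge, in daltons."""
    mass_per_charge_kg = (
        2.0 * ELEMENTARY_CHARGE * voltage * (time_s / drift_length) ** 2
    )
    return mass_per_charge_kg / DALTON


if __name__ == "__main__":
    # Hypothetical fragments given as (mass in daltons, charge).
    for mass_da, charge in [(500.0, 1), (2000.0, 1), (2000.0, 2)]:
        t = flight_time(mass_da, charge)
        print(f"{mass_da:7.1f} Da, charge {charge}: {t * 1e6:.2f} microseconds")
```

In this simplified picture the arrival time fixes only the ratio of mass to charge, so a heavy fragment carrying two charges lands at the same time as a fragment half its mass carrying one. Real spectra are much messier, which is part of why the interpretation takes sophisticated software.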
Prince and his colleagues estimated that fewer than one in four spectra from a large-scale yeast study yielded interpretable results. Better algorithms would help, but statisticians need more data to draft more sophisticated software.
So, for both cultural and scientific reasons, few public protein data are hanging out waiting to be reanalyzed. This fact sets up a vicious cycle: Too few data are public to develop common systems, and too few common systems mean less incentive to use or contribute public data.
But some researchers are trying to change this. The Human Proteome Organization, which has recruited a number of labs for its project to characterize all of the proteins in human plasma, is pushing for standards to exchange protein mass spectra. And Prince and colleagues invite their peers to deposit mass spectra sets into an Open Proteomics Database hosted at their university.
Others studying proteomes or massive data sets will doubtless have more suggestions.
Please e-mail them to me at firstname.lastname@example.org.