Now, scientists at Harvard and Stanford have created a software application that overcomes some of these barriers.
The program, called Genotext, trawled through publicly available data and came back with genes implicated in aging, leukemia and injury, as described this month in the journal "Nature Biotechnology."
The program automatically analyzes text descriptions of different experiments. It then identifies which genes were turned on or off, up or down, in various diseases or environmental conditions.
That's no easy task, since a single experiment can collect millions of data points, and descriptions of very similar experiments can vary widely.
"This is a real advance," said John Wilbanks, head of Science Commons, a nonprofit group dedicated to helping scientists find productive ways to share data. "The use of annotation and knowledge to understand functional relationships between genes is where the field has to go."
Researchers are searching for the genes responsible for everything from making people fat to making wounds heal. To find them, scientists routinely use microarrays, also known as "gene chips," to sample the activity of tens of thousands of genes, then home in on just a few.
This approach necessarily excludes some information. Genes with small effects or small changes are likely to be overlooked, and specialists in one disease might not notice results that could be valuable for other disease specialists.
"People have been using microarrays to study diseases for over a decade," said Atul Butte, a Stanford bioinformatics specialist and the study's lead author, "but they only study one disease at a time, so we decided, 'Why don't we try to study everything simultaneously?'"
Scientific journals often require researchers to deposit their microarray data in publicly available databases. Though the data formats that describe which genes are turned up, and to what level, are fairly standard, the same can't be said for descriptions of the conditions and tissues in which the genes are measured. That makes it difficult to compare experiments that probe how the environment might change gene activity or how gene activity differs between sickness and health.
"We've all agreed on how to represent the genes, but we haven't agreed on how to represent what we actually did in the experiments," Butte said. That's one of the problems that Butte, along with Isaac Kohane, a bioinformaticist at Harvard, set out to solve by creating Genotext.
First, Genotext analyzes the text describing each experiment and links the experiment to a predefined set of medical concepts, the Unified Medical Language System (UMLS). Then, the software links the genes from the experimental results to those concepts.
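The two steps above can be sketched in a few lines of Python. This is only an illustration, not Genotext's actual code: the three-entry vocabulary stands in for UMLS, simple substring matching stands in for real text analysis, and the function names, gene names and experiment records are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy vocabulary (phrase -> concept); the real UMLS is vastly larger.
CONCEPT_PHRASES = {
    "leukemia": "Leukemia",
    "aging": "Aging",
    "wound": "Wounds and Injuries",
}


def concepts_for(description):
    """Step 1: map a free-text experiment description to a set of concepts."""
    text = description.lower()
    return {concept for phrase, concept in CONCEPT_PHRASES.items() if phrase in text}


def link_genes_to_concepts(experiments):
    """Step 2: attach each experiment's changed genes to its concepts."""
    gene_sets = defaultdict(set)
    for exp in experiments:
        for concept in concepts_for(exp["description"]):
            gene_sets[concept].update(exp["changed_genes"])
    return dict(gene_sets)


# Made-up experiment records; gene names are placeholders.
experiments = [
    {
        "description": "Bone marrow samples from leukemia patients",
        "changed_genes": {"GENE_A", "GENE_B"},
    },
    {
        "description": "Skin wound healing time course in aging mice",
        "changed_genes": {"GENE_B", "GENE_C"},
    },
]

links = link_genes_to_concepts(experiments)
for concept, genes in sorted(links.items()):
    print(concept, sorted(genes))
```

Because the concepts come from the descriptions rather than from the original study design, the second experiment contributes its genes to both "Aging" and "Wounds and Injuries," whatever question it was first meant to answer.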
Finally, the researchers determined whether the same genes were found in experiments with humans and mice and assessed the likelihood of false positives. The technique identified genes associated with disease, environment, age and experimental conditions. It was able to find these associations regardless of the questions the original experiment had been designed to answer.
The software did make mistakes. When an experiment used equipment from Axon Instruments, a part of Molecular Devices Corp., in Foster City, Calif., Genotext mapped the experiment to brains and foster care. If researchers pasted long lists of general abbreviations or their entire scientific papers into experimental descriptions, the software assigned too many concepts to the experiment.
The software designers were able to fix major sources of error, but Butte said software can only go so far, particularly with the quantity of deposited data doubling every year. "If we don't standardize how we describe our experiments, we're going to have a big mess," Butte said.
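A short Python sketch shows how this kind of mistake can arise when words are matched without context. The vocabulary, the ignore list and the sample description are all hypothetical, and masking known vendor and place names is just one plausible remedy; the article does not say how Genotext's designers actually handled it.

```python
import re

# Hypothetical word -> concept vocabulary that triggers the false matches.
CONCEPT_WORDS = {"axon": "Brain", "foster": "Foster Care"}

# Hypothetical list of non-biological phrases (vendors, places) to mask out.
IGNORED_PHRASES = ["axon instruments", "foster city"]


def map_concepts(description, ignored=()):
    """Return concepts whose trigger word appears in the description."""
    text = description.lower()
    for phrase in ignored:
        # Blank out known vendor and place names before matching.
        text = text.replace(phrase, " ")
    words = set(re.findall(r"[a-z]+", text))
    return {concept for word, concept in CONCEPT_WORDS.items() if word in words}


description = "Arrays were scanned with an Axon Instruments scanner (Foster City, Calif.)"

naive = map_concepts(description)                      # matches on isolated words
filtered = map_concepts(description, IGNORED_PHRASES)  # vendor and city masked first

print(sorted(naive))
print(sorted(filtered))
```

The naive call tags a description of scanning equipment with both concepts, while the masked call tags it with none. An ignore list like this only catches phrases someone has already thought of.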
Butte also plans to write software that will help scientists assign the proper concepts to each experiment.
Though Genotext is available for free over the Web, researchers need some programming experience to use it. Plans to create a user-friendly query interface are underway.