Secondary Protein Database Spreads Data, Boosts Technology

The PDBbind database shows that freely available data expands useful information.

The Protein Data Bank was established in 1972 as a public repository for solved protein structures; this information reveals the precise shape of proteins and is invaluable in figuring out how they work and in designing drug molecules that bind to proteins in certain ways. Today, scientists at drug companies routinely search the PDB for everything from how to make a protein to ideas for designing better proteins and drugs.

In the first year of the PDB's existence, only two structures were deposited. Last year, nearly 5,000 were, and more than 25,000 protein structures are available today; all are accessible for free. This openness is leading to a triumph of the commons as scientists create and share ways to make the available information even more useful.

Recently, a group of scientists in Shaomeng Wang's laboratory at the University of Michigan undertook a tedious task that will make freely available protein data much more useful in drug design. The use of computers to screen "virtual libraries" for potential drug candidates has become commonplace. One of the most accurate and efficient methods relies on so-called "scoring" functions. In short, a computer examines the structures of molecules that bind a protein in a manner that might make for a good drug. This generates a set of rules by which to assess the structures of other compounds, so that only those with the greatest apparent potential as drugs are actually synthesized.
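To give a feel for what "generating a set of rules" can mean, here is a deliberately tiny sketch of an empirical scoring function: fit a straight line relating one structural descriptor to measured affinity, then use the fitted line to rank untested compounds. Everything in it is an assumption made for illustration. The descriptor (a hydrogen-bond count), the compound names and all the numbers are invented, and real scoring functions combine many terms; this is not the method used by the Michigan group.

```python
# Toy empirical "scoring function": one descriptor, ordinary least squares.
# All numbers and names below are invented for illustration only.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training set: hydrogen bonds seen in known complexes,
# paired with measured affinity expressed as -log10(Kd).
training_hbonds = [1, 2, 3, 4]
training_affinity = [4.0, 5.0, 6.0, 7.0]

slope, intercept = fit_line(training_hbonds, training_affinity)

def score(hbonds):
    """Predicted affinity (-log10 Kd) for a compound with this many hydrogen bonds."""
    return slope * hbonds + intercept

# Hypothetical virtual library: compound name -> predicted hydrogen-bond count.
candidates = {"cmpd_A": 2, "cmpd_B": 5, "cmpd_C": 3}

# Rank candidates from highest to lowest predicted affinity.
ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
```

The quality of such a fit depends entirely on the training data, which is why a large, carefully checked set of structures with measured affinities matters so much.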

The strengths and limitations of scoring techniques lie in their reliance on data. While the PDB contains many structures with small molecules bound to proteins, information about binding strength is spotty. Such information is important, since two compounds that bind to the same site on a protein may differ in binding affinity by many orders of magnitude, and only the more potent binders have a shot at becoming drugs. Led by Shaomeng Wang and Renxiao Wang, scientists have created a secondary database based on the PDB, named PDBbind, that supplies this missing binding-strength information. More laudably, they are following in the tradition of making their work freely available to academics, a gift that could lead to better science, technology and medicine.
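To put "many orders of magnitude" in energetic terms, the sketch below uses the standard thermodynamic relation between a dissociation constant and the binding free energy, ΔG = RT ln(Kd), with Kd expressed in molar units. The temperature of 298.15 K and the function name are assumptions chosen for this example; nothing here is taken from PDBbind itself.

```python
import math

R_KCAL_PER_MOL_K = 0.0019872041  # gas constant in kcal/(mol*K)
TEMPERATURE_K = 298.15           # assumed room temperature

def binding_free_energy(kd_molar, temperature_k=TEMPERATURE_K):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L.

    Uses dG = R * T * ln(Kd), relative to a 1 M standard state.
    More negative values mean tighter binding.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_molar)

# A nanomolar binder versus a millimolar binder at the same site:
tight = binding_free_energy(1e-9)
weak = binding_free_energy(1e-3)

# Each tenfold change in Kd is worth roughly 1.36 kcal/mol at this temperature.
per_decade = R_KCAL_PER_MOL_K * TEMPERATURE_K * math.log(10)
```

A difference of six orders of magnitude in Kd therefore corresponds to only about 8 kcal/mol, which is one reason small errors in predicted energies translate into large errors in predicted potency.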

Creating the new database was not trivial. Indeed, the researchers were surprised that even searching the PDB for bound structures required some creativity. Multiple searches using seemingly comprehensive terms like "complex," "ligand" or "inhibitor" missed too many eligible structures. So the researchers instead excluded the structures most likely to be ineligible, like complexes of a protein and a nucleic acid. Then they looked at every structure containing a protein and an organic molecule to see whether the organic molecule was simply present in the structure or was interacting with the protein in a meaningful way. They also excluded common but non-drug-like cofactors (such as heme, NAD, CoA and FAD) that might bind in a very specific orientation to a protein but that would not yield useful information for assessing molecules' potential as drugs. After these exclusions, 5,671 protein structures were left.
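The exclusion steps described above can be sketched as a simple filter. This is only an illustration of the logic, not the researchers' actual procedure: the record layout, field names and example entries are hypothetical, the cofactor list contains only the four examples named in this article, and the judgment about whether a molecule interacts meaningfully with the protein was made by inspection, so it appears here as a flag supplied by a reviewer.

```python
from dataclasses import dataclass, field

# Only the cofactors named in the article; the real exclusion list may be longer.
EXCLUDED_COFACTORS = {"heme", "NAD", "CoA", "FAD"}

@dataclass
class Structure:
    """Hypothetical summary of one deposited structure."""
    entry_id: str
    has_protein: bool
    has_nucleic_acid: bool
    organic_molecules: list = field(default_factory=list)
    meaningful_contact: bool = False  # set by a human reviewer, not computed

def is_candidate(s):
    """Apply the exclusions: keep protein structures with a drug-like bound molecule."""
    if not s.has_protein or s.has_nucleic_acid:
        return False
    # Drop common cofactors that say little about drug-like binding.
    ligands = [m for m in s.organic_molecules if m not in EXCLUDED_COFACTORS]
    return bool(ligands) and s.meaningful_contact

# Made-up entries for demonstration.
entries = [
    Structure("demo1", True, False, ["inhibitor_x"], meaningful_contact=True),
    Structure("demo2", True, True, ["inhibitor_y"], meaningful_contact=True),
    Structure("demo3", True, False, ["heme"], meaningful_contact=True),
    Structure("demo4", True, False, ["buffer_z"], meaningful_contact=False),
]

kept = [s.entry_id for s in entries if is_candidate(s)]
```

Filtering by exclusion rather than by keyword search reflects the point made above: the search terms missed eligible structures, so it was safer to start from everything and remove what clearly did not belong.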

Next came the hard part. Researchers often report some measure of the affinity, or binding strength, of a small molecule bound to a protein, but the measure varies: it may be the molecule's dissociation constant, its inhibition constant or the concentration at which it shows 50 percent inhibition of an enzyme. Worse, this information is not usually included when researchers deposit structures into the PDB. To standardize this data for PDBbind, the researchers obtained and manually reviewed over 3,000 original scientific articles. Finally, they examined the available data to determine which structures would have information useful to docking and scoring studies. For example, the resolution of the crystal structure had to be 2.5 angstroms or better, only one ligand per structure could occupy a binding site, and the ligand could not be covalently bound or contain unusual elements (such as boron or beryllium) that molecular modeling software might not be able to interpret.
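The final selection criteria, and one common way of putting affinity values on a comparable scale, can be sketched as follows. The record layout and names are hypothetical. The list of unusual elements holds only the two examples mentioned in this article. Expressing affinities as the negative base-10 logarithm of a molar value is a widely used convention, but treating it as the standardization step here is an assumption. The type of measurement is kept alongside the number because a dissociation constant, an inhibition constant and a 50 percent inhibition concentration are different quantities.

```python
import math
from dataclasses import dataclass

MAX_RESOLUTION_ANGSTROM = 2.5    # lower numbers mean sharper structures
UNUSUAL_ELEMENTS = {"B", "Be"}   # boron and beryllium: examples only, not a full list

@dataclass
class Complex:
    """Hypothetical record for one protein-ligand complex."""
    resolution: float      # crystal structure resolution in angstroms
    ligands_in_site: int   # number of ligands occupying the binding site
    covalent: bool         # True if the ligand is covalently bound
    ligand_elements: set   # element symbols present in the ligand
    measure: str           # "Kd", "Ki" or "IC50"
    value_molar: float     # reported value converted to mol/L

def passes_quality_filters(c):
    """Return True if the complex meets the criteria described in the text."""
    return (
        c.resolution <= MAX_RESOLUTION_ANGSTROM
        and c.ligands_in_site == 1
        and not c.covalent
        and not (c.ligand_elements & UNUSUAL_ELEMENTS)
    )

def log_affinity(c):
    """Return (measure, -log10(value)); larger numbers mean stronger binding."""
    if c.value_molar <= 0:
        raise ValueError("affinity value must be positive")
    return c.measure, -math.log10(c.value_molar)

# Made-up examples.
good = Complex(2.0, 1, False, {"C", "H", "N", "O"}, "Kd", 1e-9)
blurry = Complex(3.1, 1, False, {"C", "H", "O"}, "Ki", 5e-7)
boronic = Complex(1.8, 1, False, {"C", "H", "O", "B"}, "IC50", 2e-6)
```

Here `good` passes every filter, while `blurry` fails on resolution and `boronic` fails because its ligand contains boron.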

The final list contained 800 entries, only a tiny fraction of the total in the PDB, but considerably more than other efforts to create databases with similar information. And unlike other databases, the information for PDBbind is compiled from primary literature rather than secondary sources. It is the first attempt to collect experimental binding affinity data over the entire PDB. Its construction is described in the June 3 issue of the Journal of Medicinal Chemistry. Soon, I expect, more papers will be describing its use.

The database is accessible to the public at