Great tools will always be highly cited, because many different micro-specialities will all use them. A quick glance at the top of the table confirms it. Protein databases are at #2, #3, and #4, although PROSITE at #3 is exactly a year older than both the Protein Data Bank (PDB) at #2 and Pfam at #4. In fact, as this publication noted in its previous cover story, PDB and Pfam were among the most-cited papers of 2000, with PDB at #2 in that list too, and Pfam not far behind at #7 (see Science Watch 12[2]:1-2, March-April 2001). So they are undeniably hot. PDB is where researchers first deposit newly solved structures, while PROSITE and Pfam are tools that allow scientists to read meaningful sentences from the alphabet soup of letters gushing daily from banks of gene-sequencers around the world.
The problem is awesome. Out of the machines pour streams of sequence. Some contain genes, which can be converted into proteins. Others are control regions, or have no known function. Many new protein sequences are related to other proteins that have already been investigated to some degree; Pfam and PROSITE will tell you which. But there will also be differences of detail. It isn't as simple as lining up two texts and comparing them letter by letter or line by line. Chunks will be missing, or rearranged. Words will be spelled in different ways. Whole paragraphs might have come from another chapter. The cutting edge of genomics is no longer reading the sequence; it is understanding it. Each of the databases offers slightly different aids to comprehension.
But why are several tools needed? Wouldn't a single overarching tool make more sense? Alex Bateman, group leader of Pfam, offered Science Watch a historical perspective. "Each of the protein family databases started with their own specializations. Some specialize in very high-quality data, so they add only a few [new entries] a year. Our goal is to try and cover everything.
Our quality is lower, but we have good quality control." Pfam, which is administered from the Sanger Centre outside Cambridge, was designed originally to support the effort to sequence the nematode Caenorhabditis elegans. "We were really obsessed with not having any false positives," Bateman says.
Almost all users access Pfam over the Internet, so Bateman's group set out to make the web pages nice to look at and, even more important, easy to use. "For the biologist," he tells Science Watch, "it is reasonably intuitive." Development included sitting users down in front of a computer and asking them to perform various tasks. "We watched how they made mistakes," Bateman says, just as the giant software corporations do. And users contribute too, not only information about the sequences but also suggestions for how to present the data. The web site tempts contributions with the chance to win a t-shirt. Users wanted Pfam to show them active site residues in enzymes, which it now does, and Bateman says his group is looking at providing similar visual pointers for, say, metal binding sites. "We want users to look at a page and say 'Oh, that's what's happening' without having to get out highlighter pens."
Pfam can be accessed at three main sites: Cambridge, the Karolinska Institute in Sweden, and Washington University in St. Louis. Each has a different interface that allows users to interrogate the underlying data. The three teams don't share code, but they do share ideas. St. Louis gets the most hits, Bateman tells Science Watch, "but we have a much nicer interface." That, like almost everything about these databases, is hard for an outsider to judge. If you don't have a specific question, how do you assess a system designed to provide answers? But even genuine users do not have to understand the underlying mechanics of pattern matching and identification. They merely have to use them.
So why has PDB accumulated 127 citations to Pfam's 84? It's an unfair question. As Helen Berman, who runs PDB at Rutgers University, tells Science Watch, PDB is "the single archive for all macromolecular structures," so everyone solving protein structures ought to use it. The more important question, says Pfam's Bateman, is why PDB gets so few citations. "People take [it] so much for granted that they are not citing PDB when they use it," observes Bateman. "I certainly know of tens of papers where the authors have used Pfam in figures and tables but have not cited it." While he concedes that this may sound like sour grapes, Bateman should be heartened. The long-lasting, true measure of the finest tools is to be used, not cited.
Dr. Jeremy Cherfas is Science Writer
Science Watch® is an editorial component of Essential Science Indicators
(c) 2008 The Thomson Corporation.