"If you use GenBank as a tool in your published research, we ask that this paper be cited." Are there any molecular biologists who don't use GenBank, the congressionally mandated public database of all known nucleotide and protein sequences? No wonder the 1998 review of GenBank's activities has stormed into the bottom (see paper #10) of an otherwise barely changed Top Ten.

GenBank is the main product of the National Center for Biotechnology Information (NCBI), which the U.S. Congress created in 1988 to underpin efforts to sequence the genome. It consists of an intricate set of interdependent databases and the tools to interrogate them, with the raw information streaming in from laboratories around the world. (Each day, GenBank in Bethesda talks to the European Bioinformatics Institute and the DNA Data Bank of Japan to ensure that all three have the latest versions of the sequences each has collected.)

Researchers submit sequences to GenBank, which tries to make sense of them. It sorts sequences according to what species they come from, what proteins they code for, what other sequences they resemble, and whether they are unique. All that information is then made available to members of the molecular biology community (and anyone else who cares to point a browser at http://www.ncbi.nlm.nih.gov), who can use it to attempt to understand the endless letters that pour out of DNA sequencers.

The sheer scale of the enterprise is astounding. The #10 paper by the Dennis A. Benson group (Benson insists on sharing credit, telling Science Watch "it's humbling to be the first author, an honor I got by virtue of the alphabet") enumerates 690,000 new sequences in the previous year, with the database now holding more than 1 billion bases from 1.6 million sequences. The database was doubling in size every 18 months, but lately the doubling time has fallen to 15 months.
(The whole thing used to be available on a CD-ROM, but ease of access across the Internet, coupled with the fact that the database now spans several CD-ROMs, prompted the NCBI to abandon that format.)

Lest you think that this is no more than a story of some big machines and clever programs, Benson tells Science Watch that there are highly trained staff, known as sequence annotators, who review the entire input stream of sequence data. "Approximately 20 individuals, with masters and Ph.D.s in molecular biology...are responsible as the gatekeepers for the database, reviewing records and interacting with scientist-submitters."

What is more, the NCBI is more than just a data warehouse. "A lot of computational biology research goes on here," says Benson. "It's not informatics for just informatics' sake; we try to be very biologically oriented in the computer work we do." To what extent people use the additional facilities is not yet clear, but something like Entrez, which integrates DNA sequence information with published references, taxonomic information, and protein data, all over the Internet, is clearly very handy.

The #10 paper is actually the second in what has become a series of annual reports from GenBank. The latest, third, version (Nucleic Acids Research, 27:12-17, 1999) reads in part like boilerplate, updated on the fly. Where the 1998 version added 690,000 sequences, by 1999 the figure was 770,000. Likewise, 1999's edition holds 1.6 billion bases from 1.6 million sequences. And the number of species represented jumped from 30,000 to 40,000.

Only one figure is significantly down: the number of complete genomes currently being sequenced of which GenBank is aware. In 1998 it was 32, but the 1999 paper mentions only 20. As it happens, 10 complete genomes were added in 1998 (compared to two in 1996 and six in 1997), so the drop in organisms being sequenced probably represents a real move away from the smaller "model" genomes and into the "real" genomes of mouse and human.
Even now the Delphic injunction to "know thyself" is fully reflected in GenBank's memory; well over half of all the sequences come from our own, singular species.

The role of GenBank and other sequence databases in the "genome revolution" has been largely ignored. In a sense that is as it should be. Engines that power enquiry are taken for granted no less than engines that power automobiles; as long as they work, who cares exactly how? From time to time, though, it is good to peer beneath the hood and pay tribute to a fine piece of engineering.

Oh, and there's another complete sequence at #11. It doesn't cite GenBank.

Science Watch®, May/June 1999, Vol. 10, No. 3. Citing URL: http://www.sciencewatch.com/may-june99/sw_may-june99_page8.htm