Altschul: That is the most interesting aspect of that paper. It’s an acronym for "Position Specific Iterated" BLAST. And it's really, underneath, a quite different sort of program. Basically if you have a multiple alignment of a number of related sequences, and it's an accurate alignment, PSI-BLAST will make it much easier to find distant relatives. The reason is that you can look in a given column of that alignment and see that a certain residue is very highly conserved. You might see, for instance, that there’s always a glycine at a certain position. Whereas in another position, the original sequence might have glycine, but you see from the multiple alignment that virtually any other amino acid can go there. PSI-BLAST exploits this with a scoring system that, in the position where glycine is completely conserved, gives a very high score for aligning a glycine and a large negative score for everything else. Whereas in the position where there’s great variability, pretty much every residue gets a neutral score. That ends up being much more sensitive for finding related sequences.
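[Note: the position-specific scoring idea described above can be illustrated with a minimal sketch. It is not PSI-BLAST's actual scoring procedure, which the interview does not spell out. The uniform background frequency, the fixed pseudocount, the toy alignment, and the function names `column_scores`, `build_pssm`, and `score_sequence` are all assumptions made for illustration.]

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_scores(column, pseudocount=1.0):
    """Log-odds score (in bits) for each amino acid in one alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {
        aa: math.log2(((counts[aa] + pseudocount) / total) / background)
        for aa in AMINO_ACIDS
    }

def build_pssm(alignment, pseudocount=1.0):
    """One score table per column of an ungapped, equal-length alignment."""
    return [column_scores(col, pseudocount) for col in zip(*alignment)]

def score_sequence(pssm, sequence):
    """Sum the position-specific scores of a sequence as long as the profile."""
    return sum(col[aa] for col, aa in zip(pssm, sequence))

# Toy alignment (hypothetical): position 1 is always glycine,
# position 2 shows every amino acid exactly once.
alignment = ["G" + aa for aa in AMINO_ACIDS]
pssm = build_pssm(alignment)

print(pssm[0]["G"])  # strongly positive: glycine is fully conserved here
print(pssm[0]["A"])  # negative: anything else is penalized at this position
print(pssm[1]["G"])  # neutral: this position tolerates any residue
```

[Note: in this toy case the conserved column scores glycine at roughly 3.4 bits and every other residue at -1 bit, while every residue in the variable column scores 0. A smaller pseudocount would make the penalty at the conserved position larger.]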
Altschul: This idea has actually been around since the mid-1980s, and a lot of people have developed programs based on it. As I’ve said a number of times, PSI-BLAST is like the Model T Ford of this kind of sequence comparison, in that there were a lot of cars before the Model T, and perhaps even better cars, but the Model T was accessible to everyone. There have been a number of programs similar to PSI-BLAST, but they have tended to require a fair amount of expertise to use, and to take a long time to run. What we did with PSI-BLAST is to completely automate the process. With PSI-BLAST, you put in your sequence and run a regular BLAST search, which finds sequences likely to be related. Then the program constructs a multiple alignment, creates a position-specific scoring system from it, and searches the database again for more distantly related sequences. And this can be iterated an arbitrary number of times, with no user intervention.
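[Note: the automated loop described above (search, build a multiple alignment, derive position-specific scores, search again) can be sketched as follows. This is a toy under stated assumptions, not the PSI-BLAST implementation: sequences are ungapped and of equal length, the first round scores against a profile built from the query alone rather than running a regular BLAST search, and hits are accepted by a fixed score threshold. The names `psi_search`, `build_pssm`, and `score`, the threshold value, and the example sequences are hypothetical.]

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Log-odds scores per column, assuming a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, sequence):
    """Ungapped score of a sequence that has the same length as the profile."""
    return sum(col[aa] for col, aa in zip(pssm, sequence))

def psi_search(query, database, threshold=2.0, max_rounds=5):
    """Iterate search -> alignment -> profile until the hit list stops changing."""
    hits = []
    history = []  # hit list found in each round, for inspection
    for _ in range(max_rounds):
        # Round 1 uses the query alone; later rounds add the previous hits.
        pssm = build_pssm([query] + hits)
        new_hits = [seq for seq in database if score(pssm, seq) >= threshold]
        history.append(new_hits)
        if new_hits == hits:  # converged: no change since the last round
            break
        hits = new_hits
    return hits, history

query = "GAVLKE"
database = ["GAVLKD", "GAVIRD", "GSVIRD", "WWPPCC"]  # made-up sequences
hits, history = psi_search(query, database)

for round_number, round_hits in enumerate(history, start=1):
    print(round_number, round_hits)
```

[Note: in this toy database the most divergent family member, GSVIRD, falls below the threshold in the first round and is picked up only in the second, once the profile has learned from the closer relatives which positions vary. The unrelated sequence is never included.]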
Altschul: There are now a number of efforts by different groups, including our own, to build databases of multiple alignments or of patterns. So that rather than having to search a complete database of individual sequences, you will search a database of patterns or of domains. The hope is that the universe of protein domains is relatively small–that even though we keep getting more sequences from different organisms, the number of new domains or new patterns is not going to grow much. So rather than comparing your new sequence to all sequences known, you’ll compare it just to a database of domains. A number of people have already created such databases. And we have a recent paper in Bioinformatics on a program to search the sort of pattern that PSI-BLAST generates. [Note: see A.A. Schäffer, et al., "IMPALA: Matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices," Bioinformatics, 15(12):1000-11, 1999.] This program has already been used to analyze chromosome 4 of Arabidopsis.
Altschul: When you use PSI-BLAST to search a database, it generates Position Specific Scoring Matrices, which can then be built into a database of patterns. Then you just search one of these databases with a new sequence. One of the difficulties in doing this is curating the database. In a regular sequence database, you just keep throwing in new sequences, whereas with one of these pattern databases, you have to periodically go back and redo the patterns and try to consolidate them and so forth. It takes a lot of effort to keep up to date.
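[Note: searching a new sequence against a database of patterns, rather than a database of sequences, might look like the sketch below. It is not IMPALA's algorithm. It slides each profile along the sequence without gaps and ranks the profiles by their best window score. The profile names, the toy alignments, and the functions `best_window_score` and `search_profiles` are hypothetical.]

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Log-odds scores per column, assuming a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def best_window_score(pssm, sequence):
    """Best ungapped score of the profile over all windows, with its start index."""
    width = len(pssm)
    best_score, best_start = float("-inf"), None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        window_score = sum(col[aa] for col, aa in zip(pssm, window))
        if window_score > best_score:
            best_score, best_start = window_score, start
    return best_score, best_start

def search_profiles(profile_db, sequence):
    """Rank every profile in the database by its best match to the sequence."""
    results = [
        (name, *best_window_score(pssm, sequence))
        for name, pssm in profile_db.items()
    ]
    return sorted(results, key=lambda r: r[1], reverse=True)

# A tiny "pattern database" built from made-up alignments.
profile_db = {
    "domain_A": build_pssm(["GAVLKE", "GAVLKD", "GAVIRD"]),
    "domain_B": build_pssm(["WWPPCC", "WFPPCC", "WWPACC"]),
}

results = search_profiles(profile_db, "MMTGSVIRDHH")
for name, window_score, start in results:
    print(name, round(window_score, 2), start)
```

[Note: the curation problem mentioned above shows up even here. Adding a new family member means rebuilding that family's profile, not just appending a sequence.]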
Altschul: A couple have already been produced as research projects, and Steve Bryant is coordinating an effort to build one that will be maintained over the long term.
Altschul: It’s not clear to me that I will be. It is a pretty well-plowed field at this point–although I've thought that in the past and people keep finding new things to do. On the other hand, there is a lot of excitement in trying to figure out how to analyze the expression data that is now being generated. And this is a virgin field. There are few really good ideas on how to analyze the data being generated by the new expression chips or expression arrays.
Altschul: It's generated to analyze which genes are turned on and off in different cells. There is now technology that allows you to look, for instance, at cancer cells versus normal cells, or normal cells versus cells exposed to a certain drug, and see what genes are expressed–whether and to what degree they are making messenger RNA. You can analyze data for all the genes in the cell simultaneously. There’s a lot of excitement in this area, because by seeing how genes are regulated, you can hope to find which ones are important in different diseases or growth processes. There is a huge amount of expression data that’s going to be flowing from this, and people are just beginning to think about how to analyze it.
Altschul: I have done some work on it. It is certainly interesting, and whether I get more involved really depends on whether I can come up with some good ideas.
Science Watch®, July/August 2000, Vol. 11, No. 4. Citing URL: http://www.sciencewatch.com/july-aug2000/sw_july-aug2000_page4.htm
Science Watch® is an editorial component of Essential Science Indicators
(c) 2008 The Thomson Corporation.