Chuong B. Do talks with
ScienceWatch.com and answers a few questions about
this month's New Hot Paper in the field of Computer
Science.
Article Title: CONTRAfold: RNA secondary structure
prediction without physics-based models
Authors: Do, CB; Woods, DA; Batzoglou, S
Journal: BIOINFORMATICS
Volume: 22
Issue: 14
Page: E90-E98
Year: JUL 2006
* Stanford Univ, Dept Comp Sci, Stanford, CA 94305
USA.
Why do you think your paper is highly
cited?
Functional noncoding RNA genes are an important class of genomic elements,
which perform numerous catalytic and regulatory roles in living cells. The
function of RNA genes is dictated by their secondary structure, i.e., the
patterns of base-pairings that form between nucleotides of an RNA molecule.
This paper describes CONTRAfold (CONditional TRAining for RNA secondary
Structure Prediction), a novel machine-learning approach to predicting the
secondary structure of a single RNA sequence, which makes significantly
more accurate predictions than all previous methods.
Over the last several decades, the most accurate methods have relied on
physics-based models, whose energy terms were measured through laborious
experiments. Our approach is the first competitive method that allows
automated estimation of parameters without the need for direct experimental
measurements.
Does it describe a new discovery, methodology, or
synthesis of knowledge?
In most computational approaches to RNA secondary structure prediction, the
energy of a structure is modeled as the summation of local interaction
terms describing small portions of the global base-pairing configuration;
the predicted RNA secondary structure is the one achieving the minimum free
energy. Obtaining the free energies for each type of local interaction term
that could occur in an RNA secondary structure, however, is a difficult
endeavor, often involving carefully calibrated optical melting experiments.
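As a rough editorial illustration of the scoring scheme just described, the sketch below sums local terms over a structure and keeps the candidate with the lowest total. It is a deliberate simplification: the terms here are attached to individual base pairs only, the numbers in PAIR_ENERGY are made up rather than measured, and the function names are hypothetical. Real thermodynamic models use a much richer set of local terms.

```python
# Toy illustration only: these values are invented, not measured free energies.
PAIR_ENERGY = {
    frozenset("GC"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,
}
NONCANONICAL_PENALTY = 5.0  # assumed penalty for any other pairing


def parse_pairs(dot_bracket):
    """Return the (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' in structure")
            pairs.append((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def structure_energy(sequence, dot_bracket):
    """Energy of a structure as a sum of local (here: per-pair) terms."""
    total = 0.0
    for i, j in parse_pairs(dot_bracket):
        pair = frozenset((sequence[i], sequence[j]))
        total += PAIR_ENERGY.get(pair, NONCANONICAL_PENALTY)
    return total


def minimum_energy_structure(sequence, candidates):
    """Pick the candidate structure with the lowest total energy."""
    return min(candidates, key=lambda s: structure_energy(sequence, s))
```

For the sequence "GGGAAACCC", the fully stacked candidate "(((...)))" scores three G-C terms and so beats both the unpaired structure and "((.....))". Enumerating candidates like this only works for toy inputs; practical predictors search the space of structures with dynamic programming.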
In this paper, we adapt an existing probabilistic modeling technique, known
as conditional log-linear models (CLLMs), to the problem of modeling RNA
secondary structure. Unlike previous applications of machine learning to
the problem of RNA secondary structure prediction, our model uses
parameters which closely mirror the local interaction terms of
thermodynamics-based models. Using discriminative learning techniques, we
estimate these parameters directly from databases of RNAs with known
structure, without relying on optical melting data.
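To make the idea of a conditional log-linear model concrete, here is a minimal editorial sketch, not CONTRAfold's actual code. It assumes each candidate structure has already been reduced to a dictionary of feature counts (the feature names are hypothetical), assigns each candidate a probability proportional to the exponential of its weighted feature sum, and computes the gradient of the conditional log-likelihood as observed counts minus expected counts.

```python
import math


def score(weights, features):
    """Linear score: weighted sum of a structure's feature counts."""
    return sum(weights.get(name, 0.0) * count for name, count in features.items())


def conditional_probabilities(weights, candidate_features):
    """P(structure | sequence) for each candidate, via a stable softmax."""
    scores = [score(weights, feats) for feats in candidate_features]
    top = max(scores)  # subtract the max before exponentiating for stability
    exp_scores = [math.exp(s - top) for s in scores]
    normalizer = sum(exp_scores)
    return [e / normalizer for e in exp_scores]


def log_likelihood_gradient(weights, candidate_features, true_index):
    """Gradient of log P(true structure | sequence) with respect to the weights.

    For a log-linear model this is the feature counts of the known
    structure minus the expected feature counts under the model.
    """
    probs = conditional_probabilities(weights, candidate_features)
    gradient = {
        name: float(count)
        for name, count in candidate_features[true_index].items()
    }
    for prob, feats in zip(probs, candidate_features):
        for name, count in feats.items():
            gradient[name] = gradient.get(name, 0.0) - prob * count
    return gradient
```

Following this gradient raises the weights of features that appear in the known structure more often than the model currently expects, which is what "discriminative" training means here. The explicit list of candidates is an assumption made for readability; with a realistic number of possible structures the normalizer and expectations would have to be computed by dynamic programming instead.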
Would you summarize the significance of your paper
in layman's terms?
CONTRAfold is not the first method to use machine-learning techniques for
estimating RNA secondary structure models from structural databases. Past
applications of machine learning to the problem of RNA secondary structure
determination focused on generative probabilistic grammar-based models of
RNA secondary structure (in particular, stochastic context-free grammars or
SCFGs).
These methods, however, failed to reach the accuracies of
thermodynamics-based models. As a result, machine-learning methods have
often been considered second-rate approaches to parameter estimation,
useful only in specialized circumstances where more general RNA folding
models do not apply.
By demonstrating that effective parameter estimation from databases of RNAs
with known structure is in fact possible, CONTRAfold provides a promising
alternative to the thermodynamics-based modeling techniques that have
dominated RNA secondary structure modeling for several decades.
How did you become involved in this research, and
were there any problems along the way?
Before this work, our research dealt with applications of CLLMs to pairwise
protein sequence alignment. Based on our successes in protein alignment, we
expected that similar techniques could also succeed in the problem of RNA
secondary structure prediction.
Initially, we worked with grammar-based models of RNA, which were, at the
time, the machine-learning technique of choice for RNA secondary structure
modeling. Grammars turned out to be overly restrictive and cumbersome for
incorporating all the various types of local interactions needed in
modeling RNA secondary structures, and our first attempts at applying our
discriminative learning algorithms gave disappointing results.
While working on these models, however, we realized that we could build a
CLLM using the parameterization of local interaction terms in existing RNA
thermodynamic models as a starting point. By constructing our model in this
way, our algorithm would closely mirror the scoring scheme of the existing
state-of-the-art methods while retaining the flexibility to learn new
parameters via discriminative machine learning. This key insight led to the
CONTRAfold program.
Where do you see your research leading in the
future?
We believe that discriminative machine-learning techniques hold much
promise for computational analysis of RNAs, beyond structure prediction. In
particular, we are interested in developing models for identifying novel
candidates for functional noncoding RNAs in whole genomes. For this task,
we are looking at extensions of the learning algorithms used in CONTRAfold
for distinguishing functional structured RNAs from nonfunctional
transcripts. This supervised learning approach differs from more standard
computational screens in which the free energy of RNAs is compared to the
free energies for randomly shuffled RNA sequences.
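For readers unfamiliar with the shuffle-based screens mentioned above (the standard approach, not the supervised one proposed here), a minimal editorial sketch follows. It compares a sequence's energy with the energies of randomly shuffled copies and reports a z-score. The energy function is passed in as a parameter; mirror_pairing_energy is a made-up stand-in used only so the example runs, and all names are hypothetical.

```python
import random
import statistics


def mirror_pairing_energy(sequence):
    """Toy stand-in for a folding energy: -1 per G-C pair at mirrored positions."""
    n = len(sequence)
    pairs = sum(
        1
        for i in range(n // 2)
        if {sequence[i], sequence[n - 1 - i]} == {"G", "C"}
    )
    return -float(pairs)


def shuffle_z_score(sequence, energy_fn, num_shuffles=100, seed=0):
    """Z-score of a sequence's energy against shuffled versions of itself.

    A strongly negative value means the original sequence scores lower
    (more stable) than shuffles with the same nucleotide composition.
    """
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    native = energy_fn(sequence)
    shuffled_energies = []
    for _ in range(num_shuffles):
        letters = list(sequence)
        rng.shuffle(letters)
        shuffled_energies.append(energy_fn("".join(letters)))
    spread = statistics.stdev(shuffled_energies)
    if spread == 0:
        return 0.0  # every shuffle scored the same; no signal either way
    return (native - statistics.mean(shuffled_energies)) / spread
```

With the toy energy, "GGGGAAAACCCC" already has every G mirrored by a C, so its z-score against shuffles is negative. In a real screen the stand-in would be replaced by an actual folding energy.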
Chuong B. Do
Department of Computer Science
Stanford University
Stanford, CA, USA
Keywords: functional noncoding RNA genes, CONTRAfold,
CONditional TRAining for RNA secondary Structure Prediction, RNA
secondary structure prediction.