EBI's Ewan Birney: Quest for the Genomic Dragons

AUTHOR COMMENTARIES - 2009
January/February 2009

Science Watch® Newsletter Interview

When the human genome project was completed in 2003, it provided the DNA sequences of the 3 billion base pairs that make up the genome, but that was effectively all it did. Researchers were left to translate that sequence data into meaningful information about the protein-coding genes that have been the focus of molecular biology for the past 40-odd years. They were left to guess at what this meant for the region of the genome that has traditionally been written off as "junk"—the DNA that sits between genes and inside of genes but plays no apparent role in the coding of proteins.

In 2004 an international collaboration known as ENCODE (Encyclopedia of DNA Elements) set out to address these issues by defining all functional elements in a representative 1% of the human genome, whether in the protein-coding regions or in the erstwhile junk. The results were published in June of 2007 in a summary paper in Nature and in 23 more detailed articles in the journal Genome Research. The Nature article, "Identification and analysis of functional elements in 1% of the human genome" (447[7146]: 799-816, 2007), quickly became a Hot Paper, racking up nearly 300 citations in roughly a year and a half and duly taking up residence in this publication's Top Ten in Biology, where it currently ranks at #3. In the process, the paper helped catapult Ewan Birney, a scientist at the European Bioinformatics Institute in Hinxton, U.K., and first author on the ENCODE paper, into the #9 position in the current Thomson Reuters Essential Science Indicators℠ ranking of the hottest researchers in molecular biology & genetics. Birney’s elite selection of 42 ESI-covered papers in the field since 1998, representing just a portion of his career output of more than 100 reports, has collectively tallied over 13,000 citations, averaging an amazing 307 citations per paper.

Birney, at 36, is already a veteran of a host of genome projects, from the human, mouse, and rat genomes to those of the platypus and Anopheles gambiae, the mosquito that carries malaria. He received his bachelor’s degree in biochemistry from Balliol College, Oxford, in 1996. He then spent the next four years working with Richard Durbin at the Sanger Centre, where he received his Ph.D. in 2000. Since then, Birney has been a senior scientist at the EBI, where he heads the Nucleotide Data division.

Birney spoke to Science Watch from his office at the EBI, just outside Cambridge.

You started young as a PI at the European Bioinformatics Institute. What was your first major project?

I co-founded, with Tim Hubbard and Michelle Clamp, a project called Ensembl, which was and still is one of the major resources for using genomic information on the web. It presents the human genome to researchers—it's a major access portal. I’m still one of the principal investigators for that project and it still dominates my life quite a lot.

How did ENCODE get started, and what was the original plan?

ENCODE was one of a series of projects that followed on the human genome project. At the end of 2003, people knew the human genome was going to be finished; they knew the path forward for the mouse and rat genomes. There was a kind of logical follow-up for other important organisms. The question was, what else do we need to do to really enable genomics? One project, for instance, which is kind of orthogonal to ENCODE, is the HapMap project to discover how variations in the genome occur in different humans. That’s one thing we need to know. There are lots and lots of subtle differences. What are they? Let’s just build a catalog.

The other side was understanding the genome better. Although we have some reasonable appreciation of protein-coding genes, everything else was a here-be-dragons kind of thing.

Here be dragons?

You know those old maps, where the known world ends and they just write "here be dragons." We have these huge expanses of genome and we just don’t know what’s going on. This is the noncoding DNA, in the parlance. And we want to know what this stuff is doing.

Is this "junk DNA" we’re talking about?

Well, the phrase "junk DNA" has morphed over time. Back in the 1970s, with the discovery of introns, junk DNA implied these large chunks of DNA that got transcribed into RNA, cut out, and then seemingly thrown away. As people in the 1980s began to rather painstakingly put together big chunks of genomic DNA, they started to see the layout of genes in the genome, and one of the first things they realized was that the protein-coding stuff doesn’t make up much of it. Even in the densest part of the genome, it makes up maybe 10%. On average it makes up 2%. Then you also find these dispersed repeats, these parasitic elements that are found in every large genome, whether plants, fish, or humans. They have their own set of specific genome parasites, copying themselves happily across genomes. About half of our genome comes from these repeats. The phrase "junk DNA" started to be used interchangeably to mean these dispersed repeats selfishly copying themselves, as well as all these other parts of the genome that we didn’t understand. The phrase started to have a life of its own. It’s not a very scientific term, although many scientists use it.

What phrase do you prefer?

"Dark matter of the genome" is a better term. It implies that we don’t know what this stuff does.

So back to ENCODE. How was the collaboration put together, and who decided what techniques to use to analyze the genome?

The project was saying, in effect, let’s just throw the kitchen sink of experimental techniques at this problem of understanding what the non-coding DNA does. Dream up any kind of useful experiment, propose it to NHGRI; peer review then says yes or no, and off we go to basically chart the here-be-dragons part of the genome, to discover what’s going on in the dark matter. When the project started in 2004, we really didn’t know which experiments were going to work and which would be too expensive to do in a whole genome, so the pilot project focused on 1% of the genome, divided up into 44 distinct regions. That sounds like a small percentage, but it’s a lot of DNA. It’s like if you want to study the Atlantic Ocean and you study just 1%, you’ll probably learn a lot about it. Beginning in 2005, a dozen experimental groups were funded to look at this 1%. And one of the key rules was that everybody had to use the same 1%. That was critical. If we didn’t do that—if everybody had chosen their own bit—it would have been a disaster.

How did you end up first author?

Highly Cited Papers by Ewan Birney and Colleagues, Published Since 2000 (Ranked by total citations)
Rank	Papers	Cites
1	E.S. Lander, et al., "Initial sequencing and analysis of the human genome," Nature, 409(6822): 860-921, 2001.	6,409
2	R.H. Waterston, et al., "Initial sequencing and comparative analysis of the mouse genome," Nature, 420(6915): 520-62, 2002.	2,158
3	A. Bateman, et al., "The Pfam protein families database," Nucl. Acids Res., 30(1): 276-80, 2002.	1,289
4	A. Bateman, et al., "The Pfam protein families database," Nucl. Acids Res., 28(1): 263-6, 2000.	836
5	G.M. Rubin, et al., "Comparative genomics of the eukaryotes," Science, 287(5461): 2204-15, 2000.	811
SOURCE: Clarivate Web of Science®

I am not first author—the first author is the "ENCODE Project Consortium." I am first in the list of equals after that. About halfway through the project—and this is typical in biology these days—the real headaches shift from the experiments to the bioinformatics. Once you collect the data, you have to aggregate it and store it sensibly. That’s mostly boring plumbing and engineering; it’s very tedious, but you have to get it right. Then the far more interesting aspect of understanding the data—which again is mainly bioinformatics—starts. I was originally funded to do a very small bit in ENCODE, but when everybody else assumed that the integration of the data would just magically happen, some of us from different laboratories essentially put our hands up and said, yes, we will dedicate our own effort to making this work. I was one of them. It turned into a pretty painful two years of work, and by the end, I was the first amongst equals in the author list, though many, many people contributed to the paper. The thing to stress is that the paper did have 308 authors.

So what did ENCODE tell us about the nature of the genome?

ENCODE didn’t have the neatest take-home message, although we did have bullet points in the Nature paper explaining what we had learned. When I discussed all this with a U.K. science journalist, he ended up using the phrase "boffins are baffled," which about captures it.

The major thing we learned is that the genome is a very complex place, including all these here-be-dragons areas, this dark matter. It’s complex and multilayered. We certainly removed the idea that the genome is a simple thing, with very discrete units of genes and discrete regulatory information not in the genes, all very neatly packaged. When you look at the ENCODE data, it doesn’t hold up. Introns of many genes, for instance, are just alive with regulatory information. One of the more challenging things we discovered is that we see lots more RNA being transcribed than we ever expected, and we really don’t have a proper way of understanding and classifying this RNA. So RNA is being transcribed not just in the traditional way, into messenger RNA that makes proteins, but it's also happening in bits between genes and, weirder still, it’s crossing between gene boundaries, going from one region that we’re pretty sure is not involved with protein coding, and then going into a gene or out of a gene.

Can you describe that in a little more detail?

Some RNA transcripts start outside a gene and then go into the intron of another gene, but they don’t actually seem to make an exon. Rather they overlap with the exon of that gene. Some transcripts go the other way: they start inside the intron of one gene, go through a couple of that gene's introns, and then stop. Even more amazing, some transcripts start in the intron of one gene and go for one megabase, crossing four or five other genes, and then somehow, in a way we don’t understand, start making exons in another gene. They join two genes which have four genes in between them at a very long genomic distance.

Any chance this is being interpreted incorrectly?

When I started, I was one of the real arch skeptics of this RNA data. I now firmly believe that these RNAs exist. But what they do, if anything, is something I can get into very long debates about. So this will lead into what I think is the most interesting discovery from ENCODE.

Which is?

Well, the genome is alive with what are called DNaseI hypersensitive sites, which are sites on the genome that bind transcription factors. This was not a surprise. It was completely expected that these sites are all over the genome, even in lots of places that didn’t have a gene. The surprise is that only about half the sequences under these DNaseI hypersensitive regions are conserved across mammals. The other half are not. The weird RNA described above is even less conserved than this. That’s pretty odd, because, for protein-coding genes, maybe 95% are conserved. So this is really weird. We just didn’t expect this. What on earth is going on?

So what on earth is going on?

What I believe, and what we put in the paper as the last of the bullet points, is that, in fact, many of these elements arise by chance, and once having done so, they’re neither good nor bad for the organism. Then they just hang around until, by chance again, they disappear. So this is a very interesting idea, that there is this big pool of neutral stuff that’s biochemically active but not actually important now for the organism.

If you looked again in 10 million years, the 50% that’s not conserved would look entirely different?

That would be the logic. Interestingly enough, very similar results have been seen in Drosophila. One interpretation of this is that maybe biology has tuned itself to create a pool of elements that could be used by evolution in the future. The trouble with that hypothesis, and the reason why theoreticians of evolutionary biology will pooh-pooh it, is that evolution isn’t supposed to have foresight. It can’t say, for example, that we'd better get evolving these opposable thumbs because we might need them in the future to hold a telephone. Evolution works on the instantaneous here and now. On the other hand, the counterargument to this is that if you take the hard line, you can’t justify sex. You can’t justify why an individual would throw away half its genes to combine with some other individual’s half. Surely that’s a bad idea from the gene's perspective, to roll the dice every time whether it will make it to the next generation. Surely it’s not a good thing for the gene. We hint at this idea in the paper, again without getting stuck in it. But for me, this is the most interesting thing that has come out of ENCODE.

When people wrote about ENCODE, most of them picked up on the idea that we are redefining the word "gene," what with the RNA running in and out of genes. That was the easiest thing to understand and put across. This evolutionary stuff is genuinely harder to get your head around, but, from my point of view, it’s the most important thing.

What can we expect from ENCODE in the future?

It’s being scaled up to cover 100% of the genome. The grants were awarded last year. One of the great things about genomics is that it will cost about the same to do this, perhaps even a little less, than it did to do 1% a few years ago. And we’ll do it twice as fast. The reason is we now have all these ultra-high-throughput sequencing machines. That’s just completely changed the game and made these experiments really cost-effective.

Related information: Ewan Birney talks with ScienceWatch.com and answers a few questions about his Fast Moving Front in the field of Computer Science in July 2006.

Keywords: Ewan Birney, EBI, European Bioinformatics Institute, ENCODE, junk DNA, introns, RNA transcription, DNaseI, Pfam, Genomics.
