ENCODE Project: Junking the Idea of Junk DNA - May/June 2008

WHAT'S HOT IN... BIOLOGY, May/June 2008

by Jeremy Cherfas

Biology Top Ten Papers
Rank	Papers	Cites Nov-Dec 07	Rank Sep-Oct 07
1	E. Bettelli, et al., "Reciprocal developmental pathways for the generation of pathogenic effector T_H17 and regulatory T cells," Nature, 441(7090): 235-8, 11 May 2006. [Harvard Med. Sch., Boston, MA] *040YP	52	3
2	R. Redon, et al., "Global variation in copy number in the human genome," Nature, 444(7118): 444-54, 23 November 2006. [12 institutions worldwide] *108BQ	51	2
3	M. Veldhoen, et al., "TGFβ in the context of an inflammatory cytokine milieu supports de novo differentiation of IL-17-producing T cells," Immunity, 24(2): 179-89, February 2006. [MRC Natl. Inst. Med. Res., London, U.K.; Howard Hughes Med. Inst., U. Calif., San Francisco] *014KN	39	5
4	The ENCODE Project Consortium (E. Birney, et al.), "Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project," Nature, 447(7146): 799-816, 14 June 2007. [80 institutions worldwide] *178FV	37	†
5	P.R. Mangan, et al., "Transforming growth factor-β induces development of the T_H17 lineage," Nature, 441(7090): 231-4, 11 May 2006. [U. Alabama, Birmingham; NIDCD, NIH, Bethesda, MD] *040YP	34	†
6	S.F.A. Grant, et al., "Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes," Nature Genetics, 38(3): 320-3, March 2006. [deCODE genetics, Reykjavik, Iceland; Icelandic Heart Assoc., Kopavogur; Natl. U. Hosp., Reykjavik; U. Penn. Sch. Med., Philadelphia; Ctr. Clin. & Basic Res., Ballerup, Denmark] *015ZG	34	†
7	T.I. Lee, et al., "Control of developmental regulators by polycomb in human embryonic stem cells," Cell, 125(2): 301-13, 21 April 2006. [7 institutions worldwide] *038RT	33	†
8	D.L. Barber, et al., "Restoring function in exhausted CD8 T cells during chronic viral infection," Nature, 439(7077): 682-7, 9 February 2006. [5 U.S. institutions] *010KH	28	8
9	M. Neumann, et al., "Ubiquitinated TDP-43 in frontotemporal lobar degeneration and amyotrophic lateral sclerosis," Science, 314(5796): 130-3, 6 October 2006. [11 institutions worldwide] *091LU	28	†
10	C.L. Day, et al., "PD-1 expression on HIV-specific T cells is associated with T-cell exhaustion and disease progression," Nature, 443(7109): 350-4, 21 September 2006. [7 institutions worldwide] *085RL	24	†
SOURCE: Clarivate's Hot Papers Database.

Things used to be so simple, when DNA produced RNA which produced proteins. Then came a deeper understanding of regulatory genes and processing of RNA and other niceties. Then the first DNA sequences astounded onlookers: the stretches that actually coded for proteins were scattered like sparse raisins in a rather dull pudding of DNA that apparently did nothing. What was this do-nothing DNA? Junk.

That hubristic notion—that just because the brightest minds in molecular biology couldn't work out what much of the genome did, therefore it was junk—gets soundly knocked on the head by the paper at #4. A massive team called the ENCODE Project Consortium took a detailed look at a small part of the human genome and in so doing upset all kinds of preconceived ideas.

ENCODE, as its name indicates, set out to create an Encyclopedia Of DNA Elements. As proof of concept, the team decided to focus on about 30 megabases of sequence, roughly 1% of the total human genome. About half lies in regions that are already pretty well known; the rest is a random sample of 30 regions spread across the genome. The ENCODE pilot project brought together multiple teams using multiple methods to see what they could find out about the genome. The redundancy built into this approach gives confidence that the discoveries are real.

One big surprise is that the human genome is, as the team puts it, "pervasively transcribed." It had been thought that only coding regions and regulatory sequences were turned into RNA. In fact, almost every base in the sequence is associated with at least one piece of RNA. Many of the newly identified pieces of RNA extend transcripts well beyond the stretch that codes for a given gene. Many of the new non-protein-coding sequences discovered are from regions of the genome that, until now, had been thought to be mute. And many of the new transcripts overlap existing regions that do code for proteins.

Another surprise is just how many transcription start sites there are. There are start sites just upstream of genes, obviously, but there are also many others throughout the genome, which could at least partly explain why so much of the DNA is in fact transcribed into RNA. These additional start sites share some characteristics with the promoter sites upstream of genes, but they differ in significant ways too, which indicates different functions that have yet to be elucidated.

There is much else besides, but perhaps the biggest surprise of all concerns conserved sequences. Stretches of DNA in which the sequence is near identical in all individuals, or among several related species, are taken as evidence of evolution selecting against any mutations that adversely affect performance. Conserved sequences thus indicate areas of the DNA with important functions, and by the same token, areas with important functions ought to be highly conserved. Indeed, about 5% of the bases are conserved, and are thus presumably constrained by natural selection. For about 60% of the conserved sequences, there is evidence too of a biological function. That still leaves a lot of conserved sequences that have no known function. Yet.

Some of those may turn out to be functional in some cell types and under some circumstances that just happen not to have been studied to date. More interesting, there are sequences that are obviously functional, but that are not actually conserved. For most non-coding functional elements, about half of the individual sequences seem to be unconstrained across all mammals studied to date. The project team offers five biological reasons that might account for this discovery. Among these is the idea that there are functional elements that perform some biochemical role but that are essentially neutral. This pool of neutral elements, the team suggests, could provide a kind of warehouse of variability on which natural selection might act. If this were the case, the function—what the sequence does—might be conserved while the exact sequence underlying that function might vary from species to species.

Where to now? In October 2007, just four months after the pilot project's paper, the National Human Genome Research Institute put more than $80 million towards letting ENCODE tackle the entire human genome. Additional grants will focus on new experimental and analytical techniques to enable ENCODE to reach its goals. If the larger effort is as successful as the pilot study, molecular biologists, faced with the increased complexity of an ever-deepening understanding of the genome, may find themselves pining for the days when 99% of the sequence could be dismissed as junk.

Dr. Jeremy Cherfas is Science Writer at Bioversity International, Rome, Italy.

Keywords: ENCODE Project, DNA, gene sequencing, junk DNA, human genome, transcription start sites, conserved sequences.
