E. Bettelli, et al., "Reciprocal developmental pathways for the generation
of pathogenic effector TH17 and regulatory T cells," Nature, 441(7090):
235-8, 11 May 2006. [Harvard Med. Sch., Boston, MA] *040YP
52
3
2
R. Redon, et al., "Global variation in copy number in the human genome,"
Nature, 444(7118): 444-54, 23 November 2006. [12 institutions worldwide]
*108BQ
51
2
3
M. Veldhoen, et al., "TGFβ in the context of an inflammatory cytokine
milieu supports de novo differentiation of IL-17-producing T cells,"
Immunity, 24(2): 179-89, February 2006. [MRC Natl. Inst. Med. Res., London,
U.K.; Howard Hughes Med. Inst., U. Calif., San Francisco] *014KN
39
5
4
The ENCODE Project Consortium (E. Birney, et al.), "Identification and
analysis of functional elements in 1% of the human genome by the ENCODE
pilot project," Nature, 447(7146): 799-816, 14 June 2007. [80 institutions
worldwide] *178FV
37
†
5
P.R. Mangan, et al., "Transforming growth factor-β induces development of
the TH17 lineage," Nature, 441(7090): 231-4, 11 May 2006. [U. Alabama,
Birmingham; NIDCD, NIH, Bethesda, MD] *040YP
34
†
6
S.F.A. Grant, et al., "Variant of transcription factor 7-like 2 (TCF7L2)
gene confers risk of type 2 diabetes," Nature Genetics, 38(3): 320-3, March
2006. [deCODE genetics, Reykjavik, Iceland; Icelandic Heart Assoc.,
Kopavogur; Natl. U. Hosp., Reykjavik; U. Penn. Sch. Med., Philadelphia;
Ctr. Clin. & Basic Res., Ballerup, Denmark] *015ZG
34
†
7
T.I. Lee, et al., "Control of developmental regulators by polycomb in human
embryonic stem cells," Cell, 125(2): 301-13, 21 April 2006. [7 institutions
worldwide] *038RT
33
†
8
D.L. Barber, et al., "Restoring function in exhausted CD8 T cells during
chronic viral infection," Nature, 439(7077): 682-7, 9 February 2006. [5
U.S. institutions] *010KH
28
8
9
M. Neumann, et al., "Ubiquitinated TDP-43 in frontotemporal lobar
degeneration and amyotrophic lateral sclerosis," Science, 314(5796): 130-3,
6 October 2006. [11 institutions worldwide] *091LU
28
†
10
C.L. Day, et al., "PD-1 expression on HIV-specific T cells is associated
with T-cell exhaustion and disease progression," Nature, 443(7109): 350-4,
21 September 2006. [7 institutions worldwide] *085RL
Things used to be so simple, when DNA produced RNA, which produced proteins.
Then came a deeper understanding of regulatory genes and processing of RNA
and other niceties. Then the first DNA sequences astounded onlookers: the
stretches that actually coded for proteins were scattered like sparse
raisins in a rather dull pudding of DNA that apparently did nothing. What
was this
do-nothing DNA? Junk.
That hubristic notion—that just because the brightest minds in
molecular biology couldn't work out what much of the genome did, therefore
it was junk—gets soundly knocked on the head by the paper at #4. A
massive team called the ENCODE Project Consortium took a detailed look at a
small part of the human genome and in so doing upset all kinds of
preconceived ideas.
ENCODE, as the name suggests, set out to create an Encyclopedia Of DNA
Elements. As proof of concept, the team decided to focus on about 30
megabases of sequence, roughly 1% of the total human genome. About half is
in regions that are already pretty well known; the rest is a random sample
of 30 regions spread across the genome. The ENCODE pilot project brought
together multiple teams to use multiple methods to see what they could find
out about the genome. The built-in redundancy of this approach makes it
more likely that discoveries are real.
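The scale of the pilot can be checked with a little arithmetic. The sketch below uses the figures given above (30 megabases, 30 random regions, about half the sequence in each category) together with one assumption that is not from the ENCODE paper or this column: a haploid human genome of roughly 3,000 megabases. The function names are illustrative only.

```python
# Back-of-the-envelope check of the pilot project's scale.
# ASSUMPTION (not stated in the article): the haploid human genome
# is roughly 3,000 megabases long.
GENOME_MB = 3000.0

PILOT_MB = 30.0       # from the text: about 30 megabases studied
RANDOM_REGIONS = 30   # from the text: 30 randomly sampled regions
RANDOM_SHARE = 0.5    # from the text: "about half" is well known, the rest random


def pilot_fraction(pilot_mb=PILOT_MB, genome_mb=GENOME_MB):
    """Share of the whole genome covered by the pilot sequence."""
    return pilot_mb / genome_mb


def mean_random_region_mb(pilot_mb=PILOT_MB,
                          n_regions=RANDOM_REGIONS,
                          random_share=RANDOM_SHARE):
    """Average size of one random region, if the random half is split evenly."""
    return pilot_mb * random_share / n_regions


if __name__ == "__main__":
    print(f"Pilot covers about {pilot_fraction():.1%} of the genome")
    print(f"Each random region averages about {mean_random_region_mb():.2f} Mb")
```

Under these assumptions, the 30 random regions would average about half a megabase each, which gives a feel for how thinly the sample was spread across the genome.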
One big surprise is that the human genome is, as the team puts it,
"pervasively transcribed." It had been thought that only coding regions and
regulatory sequences were turned into RNA. In fact, almost every base in
the sequence is associated with at least one piece of RNA. Many of the
newly identified pieces of RNA extend transcripts well beyond the stretch
that codes for a given protein. Many of the new non-protein-coding sequences
discovered are from regions of the genome that, until now, had been thought
to be mute. And many of the new transcripts overlap existing regions that
do code for proteins.
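What "almost every base is associated with at least one piece of RNA" means in practice can be illustrated with a toy calculation. The sketch below is not the consortium's method; it simply merges overlapping transcript intervals and reports what fraction of a region they cover. The coordinates are invented for illustration, and intervals are assumed to be half-open (start included, end excluded).

```python
def covered_fraction(region_length, transcripts):
    """Fraction of bases in [0, region_length) lying under at least one transcript.

    `transcripts` is an iterable of (start, end) pairs, half-open.
    Overlapping or touching intervals are merged so no base is counted twice.
    """
    if region_length <= 0:
        raise ValueError("region_length must be positive")

    # Clip every interval to the region, then sort by start position.
    clipped = sorted(
        (max(0, start), min(region_length, end)) for start, end in transcripts
    )

    covered = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if start >= end:
            continue  # empty, or entirely outside the region after clipping
        if cur_end is None or start > cur_end:
            # Gap before this interval: close out the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps or touches the current block: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start

    return covered / region_length


if __name__ == "__main__":
    # Hypothetical 1,000-base region. "Annotated" stands in for the old view
    # (only known genes transcribed); "observed" adds overlapping and
    # extended transcripts of the kind the pilot project reports.
    annotated = [(100, 200), (700, 800)]
    observed = [(0, 300), (250, 600), (580, 950)]
    print(f"annotated coverage: {covered_fraction(1000, annotated):.0%}")
    print(f"observed coverage:  {covered_fraction(1000, observed):.0%}")
```

Merging before counting matters here because so many of the new transcripts overlap one another and overlap known genes; summing raw transcript lengths would overstate how much of the region is transcribed.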
Another surprise is just how many transcription start sites there are.
There are start sites just upstream of genes, obviously, but there are also
many others throughout the genome, which could at least partly explain why
so much of the DNA is in fact transcribed into RNA. These additional start
sites share some characteristics with the promoter sites upstream of genes,
but they differ in significant ways too, which indicates different
functions that have yet to be elucidated.
There is much else besides, but perhaps the biggest surprise of all
concerns conserved sequences. Stretches of DNA in which the sequence is
near identical in all individuals, or among several related species, are
taken as evidence of evolution selecting against any mutations that
adversely affect performance. Conserved sequences thus indicate areas of
the DNA with important functions, and by the same token, areas with
important functions ought to be highly conserved. Indeed, about 5% of the
bases are conserved, and are thus presumably constrained by natural
selection. For about 60% of the conserved sequences, there is also evidence
of a biological function. That still leaves a lot of conserved sequences
that have no known function. Yet.
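To put rough numbers on "a lot," the two percentages above can be combined. This is a sketch using only the rounded figures quoted in this column (5% of bases conserved, 60% of those with evidence of function, 30 megabases studied), so the results are approximate; the function name is illustrative.

```python
CONSERVED_FRACTION = 0.05          # from the text: about 5% of bases conserved
FUNCTION_EVIDENCE_FRACTION = 0.60  # from the text: about 60% of those show function
PILOT_MB = 30.0                    # from the text: about 30 megabases studied


def conserved_breakdown(conserved=CONSERVED_FRACTION,
                        with_evidence=FUNCTION_EVIDENCE_FRACTION):
    """Split conserved bases into (with evidence, without evidence),
    each expressed as a fraction of all bases studied."""
    explained = conserved * with_evidence
    unexplained = conserved * (1.0 - with_evidence)
    return explained, unexplained


if __name__ == "__main__":
    explained, unexplained = conserved_breakdown()
    print(f"conserved with evidence of function: {explained:.0%} of bases")
    print(f"conserved with no known function:    {unexplained:.0%} of bases")
    print(f"unexplained conserved sequence: about {unexplained * PILOT_MB:.1f} Mb")
```

On these rounded figures, roughly 2% of the bases studied are conserved yet have no known function, or something over half a megabase within the pilot regions alone.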
Some of those may turn out to be functional in some cell types and under
some circumstances that just happen not to have been studied to date. More
interesting, there are sequences that are obviously functional, but that
are not actually conserved. For most non-coding functional elements, about
half of the individual sequences seem to be unconstrained across all
mammals studied to date. The project team offers five biological reasons
that might account for this discovery. Among these is the idea that there
are functional elements that perform some biochemical role but that are
essentially neutral. This pool of neutral elements, the team suggests,
could provide a kind of warehouse of variability on which natural selection
might act. If this were the case, the function—what the sequence
does—might be conserved while the exact sequence underlying that
function might vary from species to species.
Where to now? In October 2007, just four months after the pilot project's
paper, the National Human Genome Research Institute put more than $80
million towards letting ENCODE tackle the entire human genome. Additional
grants will focus on new experimental and analytical techniques to enable
ENCODE to reach its goals. If the larger effort is as successful as the
pilot study, molecular biologists, faced with the increased complexity of
an ever-deepening understanding of the genome, may find themselves pining
for the days when 99% of the sequence could be dismissed as
junk.
Dr. Jeremy Cherfas is Science Writer at Bioversity International,
Rome, Italy.