The 1000 Genomes Project Identifies Strategies to Count On

What's Hot in September/October 2011

by Jeremy Cherfas

First, there were the individual human genomes, the "complete" sequences published in 2001. These provided a reference sequence against which variants could be compared to try to match mutations to medical malfunction. Then there were the genome-wide association studies (GWAS), which looked across thousands of individuals in search of genetic associations that weren’t so clear cut (see the January/February 2011 issue). Now there is the 1000 Genomes Project, which adds power to GWAS and ushers in a new era of genome studies.

While GWAS often identify informative DNA variants, they also suffer from what geneticists call missing heritability: the variants they detect account for only a small share of the inherited variation in the diseases and traits under study. A paper at #8 shows that, at least for human height, most of this heritability is not so much missing as hidden.

Taking all the variants associated with differences in stature into account, not just those that are strongly associated, vastly increases the amount of variability in height that can be accounted for by genetic differences. That study may affect the way GWAS are conducted in future, although it is likely to be eclipsed by the paper at #4, the first results from the 1000 Genomes Project Consortium.

The 1000 Genomes Project bills itself on its website as "a deep catalog of human genetic variation." As its name implies, the Project aims to assemble full sequence data for a large number of individuals. That is still a way off, but in this first pilot phase the Consortium reports on three studies that were intended to develop and compare different strategies for genome-wide sequencing.

From the 1000 Genomes Project page, Wiki, and Wiki Commons figure.

The first strategy was a detailed whole-genome sequence of two family trios (mother, father and daughter): one Yoruba family from Ibadan in Nigeria and one European family living in Utah. All were sequenced using the best technology available, and the daughters’ genomes were sequenced in several centers, using different methods. Comparison with their parents’ sequences enabled the Consortium to estimate how often new mutations arise.

The vast majority, around 95%, of new mutations detected in the offspring but not their parents probably arose during the growth of the cells that provided the DNA for sequencing, both in the individual and in cell culture. The mutation rate in reproductive cells was estimated at 1.2 × 10⁻⁸ (for the European family) and 1.0 × 10⁻⁸ (for the Yoruba family) per base pair per generation, very similar to estimates from other studies.
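To give a feel for what those per-base rates mean, the short Python sketch below turns them into a rough count of new germline mutations per child. The genome length used here (about 3.1 billion base pairs per haploid copy) is an assumption brought in for illustration and does not come from the Consortium's paper, and the function name is hypothetical.

```python
# Assumed for illustration, not taken from the article: approximate length
# of one haploid copy of the human genome, in base pairs.
HAPLOID_GENOME_BP = 3.1e9


def new_mutations_per_child(rate_per_bp, genome_bp=HAPLOID_GENOME_BP):
    """Rough expected number of new germline mutations in one child.

    rate_per_bp is the mutation rate per base pair per generation.
    A child inherits two haploid genomes (one per parent), hence the factor 2.
    """
    return rate_per_bp * genome_bp * 2


# Rates reported for the two trios in the pilot study.
rates = {"European trio": 1.2e-8, "Yoruba trio": 1.0e-8}

for family, rate in rates.items():
    count = new_mutations_per_child(rate)
    print(f"{family}: roughly {count:.0f} new germline mutations per child")
```

Under that assumed genome length, both rates work out to a few dozen new mutations per generation, which is why they are so hard to pick out from the far larger number of changes that arise in cultured cells.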

The second strategy was a low-coverage sequence for 179 individuals from four populations. In all sequencing, small random pieces of DNA are read and the resulting short sequences fed to software that assembles them into a whole. Coverage refers to the number of times any one letter in the sequence is likely to be read. If coverage is high, the short sequences will contain many overlaps and duplications, which will result in a more complete and more accurate final assembly. The Consortium estimates that a person’s complete sequence requires each letter to be read 28 times on average, or 28x coverage. Lower coverage, such as the 2-6x of this part of the project, means that sequences are less complete and less accurate, but much quicker and cheaper to obtain.
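The trade-off between coverage and completeness can be sketched in a few lines of Python. The sketch assumes that reads land at random, so that the number of reads covering any one letter follows a Poisson distribution; that is a common simplification, not the Consortium's own model, and the function names are hypothetical.

```python
import math


def mean_coverage(num_reads, read_length, genome_length):
    """Average number of times each letter is read."""
    return num_reads * read_length / genome_length


def prob_base_covered(coverage, min_reads=1):
    """Chance that a given letter is read at least min_reads times.

    Assumes read depth at a letter is Poisson-distributed with mean
    equal to the coverage (a simplifying assumption).
    """
    p_fewer = sum(
        math.exp(-coverage) * coverage ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


# Compare the low-coverage range (2-6x) with the 28x quoted for a full sequence.
for depth in (2, 4, 6, 28):
    once = prob_base_covered(depth)
    four_times = prob_base_covered(depth, min_reads=4)
    print(f"{depth:>2}x: read at least once {once:.3f}, at least 4 times {four_times:.3f}")
```

At low coverage a noticeable fraction of letters is read only once or not at all, so errors cannot be cross-checked within one person's data; pooling information across many individuals is what makes the cheaper approach workable.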

The low-coverage approach nevertheless uncovered more than 15 million single nucleotide polymorphisms (SNPs), roughly 8 million of which had not been detected before. When some of these relatively common variants were compared with data from previous GWAS of diseases, new variants that were strongly associated with the disease in question showed up. In some cases the variants were newly discovered, while in others they had not been included in previous reference panels of SNPs; either way, the 1000 Genomes Project will increase the power of GWAS.

The third strategy, like the first, constructed a detailed sequence, but focused on those parts of the DNA, the exons, that are actually translated into proteins. Genes were selected at random and then sequenced at high coverage in almost 700 individuals from seven populations. This identifies variants of all frequencies from common to rare, but only within the chosen genes.

A few European samples were sequenced in both the low-coverage phase and the exon phase, offering an opportunity to compare the two approaches. When the low-coverage data were enhanced using information from other sources, such as the most common patterns of variants in other individuals, discovery of SNPs was 15% higher with roughly comparable accuracy, validating the lower-cost approach. When full, detailed sequences from the mothers in the trio studies were downsampled to 4x coverage, variants were still identified in more than 95% of cases.

The 1000 Genomes Project Consortium used the results from this pilot phase to explore many aspects of the basic biology of the human genome, finding evidence of recent selection, differences among populations, and the effect of recombination on mutation rates. Even greater understanding will surely flow from phase two, which will continue 4x coverage and targeted sequencing of exons in 2,500 individuals from seven populations, using blood-derived DNA rather than cultured cells to minimize artefacts.

Dr. Jeremy Cherfas is Senior Science Writer at Bioversity International, Rome, Italy.

What's Hot in Biology
Rank	Paper	Cites This Period (Mar-Apr 11)	Rank Last Period (Jan-Feb 11)
1	J.J. Qin, et al., "A human gut microbial gene catalogue established by metagenomic sequencing," Nature, 464(7285): 59-65, 4 March 2010. [14 institutions worldwide] *563GZ	59	†
2	Y. Tanaka, et al., "Genome-wide association of IL28B with response to pegylated interferon-α and ribavirin therapy for chronic hepatitis C," Nature Genetics, 41(10): 1105-9, October 2009. [17 Japanese institutions] *500UG	57	6
3	C. Choudhary, et al., "Lysine acetylation targets protein complexes and co-regulates major cellular functions," Science, 325(5942): 834-40, 14 August 2009. [Max Planck Inst. Biochem., Martinsried, Germany; U. Copenhagen, Denmark] *487AK	51	3
4	The 1000 Genomes Project Consortium (D.L. Altshuler, et al.), "A map of human genome variation from population-scale sequencing," Nature, 467(7319): 1061-73, 28 October 2010. [78 institutions worldwide] *671XW	48	†
5	D.E. Harrison, et al., "Rapamycin fed late in life extends lifespan in genetically heterogeneous mice," Nature, 460(7253): 392-5, 16 July 2009. [7 U.S. institutions] *470MO	44	†
6	L.A. Hindorff, et al., "Potential etiologic and functional implications of genome-wide association loci for human diseases and traits," PNAS, 106(23): 9362-7, 9 June 2009. [NIH, Bethesda, MD] *456CN	44	5
7	H.L. Guo, et al., "Mammalian microRNAs predominantly act to decrease target mRNA levels," Nature, 466(7308): 835-40, 12 August 2010. [Whitehead Inst., Cambridge, MA; Howard Hughes Med. Inst.; MIT, Cambridge; U. Calif., San Francisco; Calif. Inst. Quantitative Biosci., San Francisco] *636TT	40	†
8	J. Yang, et al., "Common SNPs explain a large proportion of the heritability for human height," Nature Genetics, 42(7): 565-9, July 2010. [Queensland Inst. Med. Res., Brisbane, Australia; U. Melbourne, Australia; Washington U., St. Louis, MO] *616WO	39	†
9	J. Dupuis, et al., "New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk," Nature Genetics, 42(2): 105-16, February 2010. [176 institutions worldwide] *549WB	38	9
10	S. Geisler, et al., "PINK1/Parkin-mediated mitophagy is dependent on VDAC1 and p62/SQSTM1," Nature Cell Biol., 12(2): 119-31, February 2010. [U. Tubingen, Germany] *550PD	38	†
SOURCE: Thomson Reuters Hot Papers Database. Only papers indexed by Clarivate since May 2009 are tracked. A dagger indicates that the paper was not ranked in the Top Ten during the last period. In the event that two or more papers collected the same number of citations in the most recent bimonthly period, total citations to date determine the rankings.
