Mass General's Mark J. Daly on HapMap
and Disease Genes
Featured Science Watch® Newsletter
Computational biologist Mark J. Daly of
Massachusetts General Hospital and the Broad Institute,
Boston, discusses his work, particularly on a catalogue
of genetic variation known as HapMap, a resource which
has assisted in elucidating the genetic underpinnings
of type 2 diabetes and other diseases. The concept of
haplotypes and HapMap has made Daly one of the most
highly cited researchers in biology. He currently ranks
among the top dozen most-cited authors in molecular
biology & genetics in the Thomson Science Indicators℠ database,
based on papers published in the last decade.
In 2002, when geneticists began pushing for the funding to
create a new type of genomic map—known as the haplotype map, or
HapMap—the project was controversial, to say the least. Its opponents
described it as a $110 million boondoggle in the making. The project,
however, went ahead. Just three years later, in October 2005, HapMap was
published in Nature, while the data were made freely available
online. Now, less than three years after that, the payoff has been
remarkable. HapMap may have set a new record for the time necessary to
cover the ground from scientific controversy to unambiguous success.
The original proponents of HapMap included some of the biggest names in
genetics and molecular biology, among them Francis Collins of the Human
Genome Project and
Eric Lander of the Whitehead Institute Center for
Genome Research (now part of the Broad Institute). But the initial spark
of inspiration emerged from the experimental insight of one of
Lander’s students, Mark J. Daly. Since then, the concept of
haplotypes and HapMap has made Daly one of the most highly cited
researchers in biology. He currently ranks among the top dozen
most-cited authors in molecular biology & genetics in the Thomson
Science Indicators℠ database, based on papers
published in the last decade.
Furthermore, in this publication's recent survey of high-impact biology
between 2002 and 2006, Daly ranked high among the featured authors, thanks
to 10 top-cited papers (Science Watch, 19: 1-2,
January/February 2008). And the latest update to the
Hot Papers Database includes eight reports from Daly and
colleagues published over the last two years. Daly's past Hot Papers
include the original 2005 HapMap report from Nature (see the
table below, paper #5), which debuted at #1 in the Biology Top Ten in the
September/October 2006 issue of Science
Watch and stayed there until March/April 2008, when it reached the
Hot Papers mandatory two-year retirement age.
Daly, 40, received his bachelor of science degree from MIT in 1989 and his
Ph.D. in genetics from Leiden University in 2004. Between 2001 and 2005 he
was also a Pfizer fellow in computational biology at the Whitehead
Institute, where he originally trained with Lander. He is currently an
assistant professor of medicine at Harvard Medical School, an assistant
geneticist at Massachusetts General Hospital, and a senior associate member
of the Broad Institute, where he leads computational biology in medical and population genetics.
Daly spoke to Science Watch from his office at
Mass General in Boston.
Okay, first question before we proceed: I do have
to ask what exactly a haplotype is.
A haplotype is simply the collection of alleles at variable positions
near each other on a chromosome. Imagine you have variation at one
position on a chromosome where the base could be either A or C, and then,
100 bases later, another variable base at which chromosomes can bear either
G or T. That combination of two alleles carried on a single chromosome (in
this case either A-G, A-T, C-G, or C-T) is what we call a
haplotype—essentially, the genotype of an individual chromosome as
expressed at multiple sites along the chromosome.
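Daly's definition is easy to make concrete. The short Python sketch below, using made-up phased data, treats each chromosome as a string of alleles at two variable sites and counts the distinct combinations, i.e., the haplotypes:

```python
from collections import Counter

# Hypothetical phased chromosomes: each string lists the allele carried
# at two nearby variable positions (A or C at site 1, G or T at site 2).
chromosomes = ["AG", "AG", "AT", "CG", "AG", "CG", "AT", "AG"]

# A haplotype is just the combination of alleles on one chromosome,
# so counting distinct strings counts distinct haplotypes.
haplotype_counts = Counter(chromosomes)
print(haplotype_counts)

# Four combinations (A-G, A-T, C-G, C-T) are possible in principle, but
# this sample carries only three of them; real populations typically
# carry far fewer haplotypes than the combinatorics would allow.
```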
How did you come to realize that such a thing
existed? What was your insight, in other words?
I started out as a student in Eric Lander's laboratory in the late 1980s,
before there was a genome project and at a time when we were still holding
out hope that methods of family-based linkage analysis—the tool that
works on these rare Mendelian, single-gene disorders—would work for
more complex diseases as well. That’s where my initial efforts in
computational biology were focused in those early years.
The idea then emerged experimentally. We were studying a particular region
of chromosome 5 that had been implicated in Crohn’s disease. We came
to focus in that region on a set of genes that we had high confidence were
involved in the disease. When we looked at the sequence, though, we
didn’t find any obvious Mendelian-like, smoking-gun mutations. We
began to look more broadly at the sequences from a number of patients, and
we discovered this unexpected correlation between alleles in that region.
In other words, if we found a single nucleotide polymorphism (SNP) anywhere
in that region, we would be able to predict, with different degrees of
certainty, what other SNPs would be found throughout that region. In some
cases this was true for SNPs hundreds of kilobases away. What this
reflected was that, despite many polymorphic sites in the region, there
were actually a surprisingly small number of haplotypes shared by everyone
in the population.
This was very much unexpected, and it suggested to us that we could
possibly accomplish much of what we needed to find disease genes by
choosing only a small number of polymorphisms in a particular region and
using those as surrogates or proxies for the many other remaining
polymorphisms in that region. By studying that particular region on
chromosome 5 in more detail, and then other regions across the genome, we
and other groups were able to come upon this model demonstrating
considerable structure in the relationship between polymorphisms discovered
in the genome. That, in turn, led us very quickly to the formation of the
HapMap project, cataloguing genetic variations in a fashion that could then
be used by medical geneticists to study any region of the genome
efficiently and thoroughly.
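The proxy idea Daly describes is conventionally quantified with the linkage-disequilibrium statistic r², the squared correlation between alleles at two sites. A minimal sketch, with hypothetical phased haplotypes over four SNP sites:

```python
from itertools import combinations

# Hypothetical phased haplotypes over four SNP sites (0/1 = the two
# possible alleles at each site).
haplotypes = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
]

def r_squared(x, y):
    """Standard LD statistic r^2 between two biallelic sites."""
    n = len(x)
    px = sum(x) / n                             # allele-1 frequency, first site
    py = sum(y) / n                             # allele-1 frequency, second site
    pxy = sum(a * b for a, b in zip(x, y)) / n  # frequency of the 1-1 haplotype
    d = pxy - px * py                           # LD coefficient D
    return d * d / (px * (1 - px) * py * (1 - py))

sites = list(zip(*haplotypes))  # transpose: one tuple of alleles per site
for i, j in combinations(range(len(sites)), 2):
    print(f"r^2(site {i}, site {j}) = {r_squared(sites[i], sites[j]):.2f}")

# Sites 0, 1, and 2 are perfectly correlated (r^2 = 1.0), so genotyping
# any one of them serves as a proxy for the other two.
```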
The project was very controversial at the outset.
Why were some geneticists so adamant that it wouldn’t work?
There wasn't an enormous amount of empirical data at the beginning. When we
first started studying this region on chromosome 5, the results were quite
unexpected. There really wasn’t a great deal of data from human
populations with respect to what a high density of genetic variation looks
like across tens or hundreds of kilobases of the genome. To propose then
that there were these correlations or structures to the data was to suggest
that aspects of some of the more traditional work that had been done to
model human genetic variation might not have been completely accurate. So
one controversy was simply over the fact that we lacked conclusive data on
many of these questions. What we began to observe and describe in a few
anecdotal regions originally was not completely consistent with the genetic
model of the day. All of that eventually harmonized. There’s little
controversy on any of these grounds now because we have much greater
appreciation of the origins of human genetic-variation patterns.
In particular, a primary point of debate early on was whether recombination
hotspots were really present throughout the genome and whether they were
required to explain the data we were observing. This was all resolved by
much more detailed studies and analysis of the HapMap data and finding
that, indeed, much of human recombination occurs in discrete hotspots.
What do you mean by a recombination hotspot?
You can imagine that recombination between two chromosomes could happen
anywhere along the chromosome, creating new combinations of alleles, or new
haplotypes, that span the crossover point. Since humans have been around
for tens of thousands of generations, if recombination did happen anywhere
in the chromosome, it would be constantly creating new assortments of
alleles, and we wouldn’t see the significant correlations of SNPs
that we observed. So what turns out to be the case is that from generation
to generation recombination tends to happen preferentially at particular
points on the chromosome: hotspots. That leaves long segments of the
genome—tens to hundreds of kilobases, in some cases—where
recombination essentially never happens. In those segments there’s a
great deal of redundancy in the genetic information provided by each
individual polymorphism.
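The effect Daly describes can be illustrated with a toy population simulation (all parameters below are hypothetical): when crossovers are confined to a single hotspot, the segment on either side of it keeps its founder haplotypes intact, whereas uniform recombination shuffles alleles freely.

```python
import random

random.seed(1)
N_SITES = 10   # variable sites per chromosome
HOTSPOT = 5    # in the hotspot model, crossovers happen only at this point
FOUNDERS = [tuple([0] * N_SITES), tuple([1] * N_SITES)]  # two ancestral haplotypes

def evolve(pop, generations, hotspot_only):
    """Random mating, one crossover per offspring chromosome."""
    for _ in range(generations):
        nxt = []
        for _ in range(len(pop)):
            a, b = random.sample(pop, 2)
            cut = HOTSPOT if hotspot_only else random.randrange(1, N_SITES)
            nxt.append(a[:cut] + b[cut:])
        pop = nxt
    return pop

start = [random.choice(FOUNDERS) for _ in range(200)]
hot = evolve(start, 50, hotspot_only=True)
uni = evolve(start, 50, hotspot_only=False)

# With hotspot-only recombination, the segment left of the hotspot is
# never broken up, so only the two founder patterns can appear there;
# uniform recombination creates many new haplotypes in the same segment.
print(len({h[:HOTSPOT] for h in hot}), "left-block haplotypes (hotspot model)")
print(len({h[:HOTSPOT] for h in uni}), "left-block haplotypes (uniform model)")
```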
Is there a reliable theory for what determines a hotspot?
Cited Papers by Mark J. Daly and Colleagues, Published Since 2001
(Ranked by total citations)
1. Mouse Genome Sequencing Consortium (R.H. Waterston, et al.), "Initial sequencing and comparative analysis of the mouse genome," Nature, 420(6915): 520-62, 2002.
2. S.B. Gabriel, et al., "The structure of haplotype blocks in the human genome," Science, 296(5576): 2225-9, 2002.
3. The International SNP Map Working Group (R. Sachidanandam, et al.), "A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms," Nature, 409(6822): 928-33, 2001.
4. J.C. Barrett, B. Fry, J. Maller, M.J. Daly, "Haploview: analysis and visualization of LD and haplotype maps," Bioinformatics, 21(2): 263-5, 2005.
5. The International HapMap Consortium (D. Altshuler, et al.), "A haplotype map of the human genome," Nature, 437(7063): 1299-320, 2005.
There are a number of theories, but nothing’s been demonstrated to
explain the whole picture yet. Certainly nothing as simple as a universal
signature sequence pattern. There are correlations with specific repetitive
elements of DNA, but nothing that explains very conclusively why a certain
region is a hotspot. It may have to do more with the structure of how DNA
is packaged in the chromosome and where it’s open enough to receive
recombination machinery. There are likely to be epigenetic influences.
HapMap was published in 2005. How would you
characterize its success so far?
It’s been an invaluable tool in the development of the genome-wide
association genotyping arrays that are now in use in large-scale studies,
and it continues to play a critical role in the analysis of those data. In
essence, it’s one of the components that have spearheaded what can
only be described as a revolution in our ability to discover genes for
complex human diseases. Just two or three years ago, we had only a handful
of conclusive associations to complex diseases. In the studies just
published in the last two years, there are clearly in excess of a hundred
new, validated associations—and evidence that we've only scratched
the surface! That number will continue to go up by leaps and bounds as the
technology improves and as groups work more collaboratively to bring larger
and larger sample sizes to bear on the challenges in different disease areas.
Prior to HapMap, the percentage of purported
associations that were replicated in later studies was extremely low.
Has that changed significantly with HapMap?
When I cite a number in excess of a hundred, these are all genes that meet
the most rigorous level of statistical significance and have been
conclusively replicated. I expect that number to go up considerably,
because I’m not even counting the number of promising studies that
are just coming to publication for the first time. That’s the biggest
shift in the field: findings are published and there’s a dramatic
improvement in the rigor of the analyses and replication in these studies.
It’s become quite clear now why the literature prior to the last
couple of years had such a spotty performance. The studies being done now
involve very, very large sample sizes, and the effects being discovered
that are reliably replicated are weak ones. Up until recently, even if the
researchers had guessed the genes correctly, the studies were typically not
done in sample sizes large enough to document the modest effect
associations. Consequently, most of what was published early on were
statistical fluctuations, which is what inevitably happens when
you’re not testing enough SNPs and when the studies are not
adequately powered to find the true effects you’re looking for.
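Daly's point about underpowered studies can be made concrete with a rough normal-approximation power sketch (the odds ratio, allele frequency, and genome-wide significance threshold below are illustrative choices, not figures from the interview):

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_allelic_test(n_per_group, p0, odds_ratio, alpha):
    """Approximate power of a two-sided allelic case/control test.

    Normal approximation to a two-proportion z-test on allele counts
    (2*n alleles per group). A rough sketch, not a full power analysis.
    """
    odds1 = odds_ratio * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)  # implied risk-allele frequency in cases
    se = math.sqrt(p0 * (1 - p0) / (2 * n_per_group)
                   + p1 * (1 - p1) / (2 * n_per_group))
    # two-sided critical value: invert the normal CDF by bisection
    lo, hi = 0.0, 10.0
    while hi - lo > 1e-9:
        mid = (lo + hi) / 2
        if phi(mid) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    z_alpha = (lo + hi) / 2
    return phi(abs(p1 - p0) / se - z_alpha)

# A modest effect (odds ratio 1.2, risk-allele frequency 0.3) at a
# genome-wide significance threshold (alpha = 5e-8): a few hundred
# samples give essentially zero power; thousands are required.
for n in (500, 2000, 10000):
    print(n, round(power_allelic_test(n, 0.3, 1.2, 5e-8), 3))
```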
Are there specific diseases for which HapMap has
been particularly useful?
The diseases for which this method has been most successful are those in
which there have been multiple scans, usually three or more studies
undertaken simultaneously, and then the results or the data pooled to
perform more comprehensive assessments—type 2 diabetes, for instance.
There are some new publications on genes related to lipid levels—HDL
and LDL cholesterol and triglycerides—and some on adult stature that
will be coming out soon. Crohn’s disease has seen some particularly
noteworthy results. But these conditions have had a lot of effort put into
them. Coronary artery disease and
breast cancer also have some very noteworthy
results. It remains to be seen how extensible that is to other diseases.
We’re very keen to see how tractable or intractable psychological
diseases are—bipolar disorder and
schizophrenia, for example. As yet, the experiments
are not quite so advanced there as they are in autoimmune disease or in
cardiovascular disease. For many diseases, studies are just now
getting underway, or maybe one study has been completed. In most cases
we’ve found the greatest success after two or three groups have
agreed to work together and have merged scan data and pooled resources
for definitive follow-up studies. We have high hopes as the technology
continues to improve; genotyping technology is dramatically improved
from what we had to work with when the first studies began, say a year
and a half ago.
Could you play devil’s advocate for a second
and tell us where these analyses are most likely to go wrong, to lead
to erroneous results, if they do so?
These studies can go awry in a number of ways. Most of them involve
inadequate attention to study-design quality and data quality. It’s
not so much the sophistication of the analysis, but that the really
successful studies have been those that paid very close attention to the
quality and accuracy of the lab work, of the randomization procedures for
cases and controls in the lab, etc. They’ve paid keen attention to
quality control of the data coming out of the lab. One problem is that many
studies are done with very new genotyping technology, because everyone is
keen to do their studies with the latest chips. So, almost by definition, many
studies will be done with genotyping products and algorithms that have only
existed for a couple of months. This can be a problem if careful attention
is not paid to every element, from DNA through the lab process through
computational analysis. You have to pay real attention to DNA quality and
laboratory procedures, and scrutinize the data quality in every possible
way. That’s the hallmark of the successful studies.
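The kind of data scrutiny Daly describes typically includes per-SNP quality-control filters. A minimal sketch with illustrative thresholds (the call-rate, minor-allele-frequency, and Hardy-Weinberg cutoffs are assumptions for the example, not values from the interview):

```python
# Drop SNPs with poor call rates, very rare alleles, or genotype counts
# far from Hardy-Weinberg expectations (a classic signal of genotyping
# error). The chi-square cutoff 23.9 corresponds to p ~ 1e-6 with 1 df.
def passes_qc(genotypes, max_missing=0.05, min_maf=0.01, max_hwe_chi2=23.9):
    called = [g for g in genotypes if g is not None]  # g = 0, 1, or 2 alt alleles
    if len(called) < (1 - max_missing) * len(genotypes):
        return False                                  # too many failed calls
    n = len(called)
    p = sum(called) / (2 * n)                         # alt-allele frequency
    if min(p, 1 - p) < min_maf:
        return False                                  # too rare to test reliably
    # Hardy-Weinberg expected genotype counts: n(1-p)^2, 2np(1-p), np^2
    obs = [called.count(0), called.count(1), called.count(2)]
    exp = [n * (1 - p) ** 2, n * 2 * p * (1 - p), n * p ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp) if e > 0)
    return chi2 <= max_hwe_chi2

good = [0, 1, 0, 2, 1, 1, 0, 1, 2, 0] * 10  # roughly HWE-consistent
bad = [1] * 100                              # all heterozygous: error signal
print(passes_qc(good), passes_qc(bad))       # → True False
```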
Should we expect these studies to someday clarify
the environmental influences on diseases like type 2 diabetes and other complex conditions?
What people have to realize is that these gene-discovery projects simply
give pointers to genes and regions that may be involved. We then have to
embark on much more detailed studies of those genes and regions to identify
the precise causal variants and what they’re doing. When we get to
that point, we can begin to ask questions: does this potentially
interact with environmental covariates? Does it open up targets for
therapeutics and so forth? But that’s a long way off. There’s a
lot of follow-up work to do—although that’s a good problem to
have. We’re very enthusiastic about the results to date and the
potential for increasing our understanding of these diseases. But
there’s a lot of work to do. We don’t want to get too far ahead of ourselves.
Keywords: Mark J. Daly, haplotypes, HapMap, Eric
Lander, recombination, disease genes.