Catherine Lozupone & Rob Knight on Developing UniFrac
Fast Moving Fronts Commentary, March 2011
Article: UniFrac: a new phylogenetic method for comparing microbial communities
Authors: Lozupone, C; Knight, R
Catherine Lozupone & Rob Knight talk with ScienceWatch.com and answer a few questions about this month's Fast Moving Fronts paper in the field of Microbiology.
Would you summarize the significance of your paper in layman's terms?
This paper allowed us to combine Darwin's insight, that a universal phylogenetic tree is the best way to organize biological diversity, with the concept of a metric for measuring differences between the microbes in environmental samples. We call this metric UniFrac. To illustrate this principle, if we think of different bacterial assemblages as different baskets of fruit, UniFrac would consider a basket that contained Granny Smith and Braeburn apples to be more similar to a basket that contained Fuji and Golden Delicious apples, than to one that contained oranges and grapefruit (whereas other measures would consider all three baskets to be equally different because no fruit is exactly the same in different baskets).
Like many key ideas in science, this is not a very complex idea, and UniFrac is certainly not a complex diversity measure: it simply measures the distance between two bacterial assemblages as the fraction of branch length in a phylogenetic tree that is unique (leading to sequences from only one assemblage) versus shared (leading to sequences from both assemblages). Before we introduced UniFrac, however, microbial ecologists usually compared the microbes in different environments in ways that did not take into account that closely related organisms are more likely to have similar biological properties, thus providing a less clear picture of how environmental samples relate to each other.
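To make the definition above concrete, here is a minimal Python sketch of the unweighted calculation applied to the fruit-basket analogy. It is an illustration under stated assumptions, not the authors' implementation: the toy tree, its branch lengths, and the names FRUIT_TREE, branches_leading_to, and unweighted_unifrac are all hypothetical.

```python
# Hypothetical toy tree: each non-root node maps to (parent, branch length).
# The branch lengths are invented purely for illustration.
FRUIT_TREE = {
    "apples": ("root", 1.0),
    "citrus": ("root", 1.0),
    "granny_smith": ("apples", 0.1),
    "braeburn": ("apples", 0.1),
    "fuji": ("apples", 0.1),
    "golden_delicious": ("apples", 0.1),
    "orange": ("citrus", 0.1),
    "grapefruit": ("citrus", 0.1),
}


def branches_leading_to(tree, tips):
    """Return every branch (named by its child node) on a tip-to-root path."""
    covered = set()
    for node in tips:
        while node in tree:  # the root has no entry, so the walk stops there
            covered.add(node)
            node = tree[node][0]
    return covered


def unweighted_unifrac(tree, tips_a, tips_b):
    """Fraction of observed branch length that leads to only one assemblage."""
    in_a = branches_leading_to(tree, tips_a)
    in_b = branches_leading_to(tree, tips_b)

    def total_length(branches):
        return sum(tree[b][1] for b in sorted(branches))

    observed = total_length(in_a | in_b)  # branches leading to either assemblage
    unique = total_length(in_a ^ in_b)    # branches leading to exactly one
    return unique / observed if observed else 0.0


basket_1 = ["granny_smith", "braeburn"]
basket_2 = ["fuji", "golden_delicious"]
basket_3 = ["orange", "grapefruit"]

d_apples_vs_apples = unweighted_unifrac(FRUIT_TREE, basket_1, basket_2)
d_apples_vs_citrus = unweighted_unifrac(FRUIT_TREE, basket_1, basket_3)
print(d_apples_vs_apples, d_apples_vs_citrus)
```

With these invented branch lengths, the two apple baskets share the long "apples" branch and come out at roughly 0.29, while the apple and citrus baskets share no branch length at all and come out at 1.0.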
Does it describe a new discovery, methodology, or synthesis of knowledge?
This paper primarily describes a new methodology for bioinformatics analysis of sequence data, but it is also a synthesis of knowledge, since we illustrated UniFrac's effectiveness by analyzing published sequence libraries from the bacterial communities in marine sediment, water, and ice. We were able to demonstrate that by comparing sequences generated by different research groups from environmental samples across diverse environments, we could provide insights that went beyond the analysis of each of these environments in isolation.
We thus relied on the public databases of knowledge from many different investigators to make new discoveries. Part of what made this possible is that all organisms have DNA, and the DNA sequences can be related to each other in ways that other information about the organisms cannot. Unfortunately, these databases, such as NCBI's SRA (Sequence Read Archive), have not been able to keep up with the flood of sequence data now being produced, making it difficult to fully realize the potential for new insights that these sequences bring.
Another key problem, both then and now, is storing the sample information so that it can be accessed easily. In an expanded analysis of 111 papers that we performed in 2007 (Lozupone and Knight, PNAS, 104:11436-40, 2007), we had to read each paper to work out which sequences were associated with each type of sample surveyed. This was a great deal of work, but a properly organized database could do it in seconds!
Why do you think your paper is highly cited?
This paper is highly cited primarily because UniFrac has become a standard method that researchers in the microbial ecology field use to analyze their sequence data. Microbial ecologists typically evaluate microbial communities by sequencing their genes directly from environmental samples. To perform comparisons of diversity with sequences, one can bin the sequences based on sequence identity and compare how many "bins" are shared, or with UniFrac, use more of the information in these sequences by utilizing a phylogenetic tree that represents their degree of similarity. UniFrac thus allows researchers to perform more powerful analyses using information that they already have.
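For contrast, the bin-based comparison mentioned above can be sketched as follows. The source does not name a specific bin-based measure, so this example assumes a Jaccard-style distance over shared bins; the function name shared_bin_distance and the fruit labels standing in for sequence bins are hypothetical.

```python
def shared_bin_distance(bins_a, bins_b):
    """Jaccard-style distance: 1 minus the fraction of bins found in both samples."""
    a, b = set(bins_a), set(bins_b)
    union = a | b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(a & b) / len(union)


# Each label stands in for one bin of near-identical sequences.
sample_1 = ["granny_smith", "braeburn"]
sample_2 = ["fuji", "golden_delicious"]
sample_3 = ["orange", "grapefruit"]

bin_d_12 = shared_bin_distance(sample_1, sample_2)
bin_d_13 = shared_bin_distance(sample_1, sample_3)
print(bin_d_12, bin_d_13)
```

Because no bin is shared between any pair of samples, both distances are 1.0: the bin-based view cannot tell that two samples of apples are closer to each other than either is to a sample of citrus, which is the information a phylogenetic tree adds.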
UniFrac is also widely applied because we made it easy for people to use. Shortly after this initial introduction of UniFrac, we worked together with Micah Hamady, another Ph.D. student in Rob Knight's lab at the time, to create a user-friendly web interface.
We have also made major efforts to keep UniFrac scaling with next-generation sequencing technologies. Once 454 pyrosequencing became available, we updated our UniFrac code so that it could efficiently process larger trees and, together with Micah, introduced a new "Fast UniFrac" web interface that also had enhancements in visualization.
We also understood that this large number of sequences produced difficulties with the upstream processing steps needed to make a phylogenetic tree for analysis with UniFrac. The Knight lab therefore produced a tool called QIIME to meet this need; QIIME also implements a great deal of additional functionality for analyzing microbial community sequences. Most recently, we have moved QIIME (and therefore UniFrac) to TeraGrid and to Amazon's EC2 cloud, thus enabling people without their own supercomputers to perform large-scale analyses.
This has made UniFrac analysis accessible to essentially everyone, from big groups who want to run very large datasets through their own compute resources using QIIME, to a researcher with little computational expertise or resources, who can run an analysis on Knight lab or other computer resources, free of charge, using a convenient web interface.
UniFrac is also highly cited because of the breadth of its user base. Although UniFrac is most often used to analyze 16S ribosomal RNA gene libraries from microbial communities, it can be used to analyze any type of organism using phylogenetic trees made from any type of gene. It has thus also been used outside of the community of microbial ecologists whom we originally targeted for the application. As an example, we were delighted to see that Dr. Walter Fitch used UniFrac to cluster populations of influenza A H5N1, the bird flu virus, based on their gene sequences. UniFrac helped to reveal biogeographical patterns that were used to understand which factors affect the virus' spread (Wallace and Fitch, PLoS One, 3: e1697, 2008).
How did you become involved in this research, and how would you describe the particular challenges, setbacks, and successes that you've encountered along the way?
This work was part of my (Catherine Lozupone's) Ph.D. work, which was done under the guidance of Rob Knight. At the time, all of my prior research involved laboratory experiments, including a Master's degree project in the field of microbial ecology. This project coupled my biological research interests with my desire to gain the computer programming and biostatistical expertise needed to enter the field of bioinformatics.
Rob conceptualized UniFrac, partly on the basis of conversations with members of Norm Pace's lab, including Scott Kelley, Kirk Harris, and Jeff Walker, and suggested that I implement it as a rotation project before I joined a specific lab. When I joined Rob's lab, the main project that I was supposed to work on was to examine Type III secretion in Salmonella.
However, I got less and less excited about the lab work for that project, and more and more excited about the idea of using UniFrac to perform a global analysis of bacterial community sequences that had been deposited in GenBank for my Ph.D. thesis (Lozupone and Knight, PNAS, 104:11436-40, 2007). This just goes to show that as a graduate student, you should follow your passion rather than what you're "supposed" to be working on!
Developing UniFrac was essentially how I first learned computer programming, and so it came with all of the challenges of learning a completely new skill. Rob provided amazing guidance along every step of the way, however, which in retrospect made acquiring this skill surprisingly easy. Collaborating with a computer science Ph.D. student with expertise in user interface design and computer cluster administration helped move this work beyond something that was "a good idea" to something that thousands of people actually use.
Note from Rob: I told Cathy at the time that she could switch projects but that I thought she was making a big mistake and would be sorry when the Type III secretion project got on the cover of Science. Of course, we never published anything or ever got a grant on Type III secretion, whereas UniFrac has really taken off. Cathy's prescient decision about what to work on really changed my lab's focus—it's important to listen to your graduate students when they have great ideas!
Where do you see your research leading in the future?
I am currently a joint postdoctoral researcher in the labs of Dr. Rob Knight and Dr. Jeffrey Gordon at Washington University in St. Louis. My research focus is on the complex community of bacteria that inhabit the human gastrointestinal tract. As a biologist, this field excites me because gut bacteria have diverse and important impacts on our health, and so providing a better understanding of their function has enormous implications. As a bioinformatician, I find this field exciting because it is at the forefront of research leveraging massive amounts of sequence data: new informatics techniques pioneered here can also be applied in other systems.
I am currently using community survey information to determine which commensal gut bacteria have distribution patterns that suggest they are opportunistic pathogens. Analysis of the genome sequences of these suspected pathogens has revealed that they have virulence factors, known to be important for other pathogens, that their phylogenetic relatives lack.
Do you foresee any social or political implications for your research?
As UniFrac is widely used by a broad group of scientists, the overall implications of its application are difficult to predict. Through its improved ability to relate microbial communities to one another, it could facilitate discoveries that impact human health, especially given that in addition to causing disease themselves, human-associated microbes affect our sensitivity to pharmaceuticals (Clayton, et al., PNAS, 106:14728-33, 2009), can protect us from infection, and can be used to cure diseases that antibiotics can't, such as infection with Clostridium difficile (Garborg, et al., Scand. J. Infect. Dis. 42:11-12, 2010).
The application of UniFrac also has the potential to impact conservation biology, such as through the GEOBON project, which uses phylogenetic information to prioritize which nature reserves to protect.
Catherine A. Lozupone
Department of Chemistry and Biochemistry
University of Colorado at Boulder
Boulder, CO, USA
Rob Knight
Department of Chemistry and Biochemistry
University of Colorado at Boulder
Boulder, CO, USA
KEYWORDS: UniFrac, PHYLOGENETICS, MICROBIAL COMMUNITIES, RIBOSOMAL RNA SEQUENCES, SEA-ICE BACTERIA, ARCTIC OCEAN, DIVERSITY, BACTERIOPLANKTON, SEDIMENTS, WATER, BIODIVERSITY, BIOGEOGRAPHY, POPULATIONS.