Archive ScienceWatch



Jill Mesirov & Eric Lander answer a few questions about this month's Fast Moving Front in the field of Computer Science. The authors have also sent along images of their work.
Article: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles
Authors: Subramanian, A; Tamayo, P; Mootha, VK; Mukherjee, S; Ebert, BL; Gillette, MA; Paulovich, A; Pomeroy, SL; Golub, TR; Lander, ES; Mesirov, JP
Journal: PROC NAT ACAD SCI USA, 102 (43): 15545-15550 OCT 25 2005
Addresses: MIT, Broad Inst, 320 Charles St, Cambridge, MA 02141 USA.
MIT, Broad Inst, Cambridge, MA 02141 USA.
Harvard Univ, Cambridge, MA 02141 USA.
Harvard Univ, Sch Med, Dept Syst Biol, Boston, MA 02446 USA.
Duke Univ, Inst Genome Sci & Policy, Ctr Interdisciplinary Engn Med & Appl Sci, Durham, NC 27708 USA.

Why do you think your paper is highly cited?

The paper described the methodology, and announced the availability of a software implementation, for a new paradigm for extracting biological insights from genome-wide transcription profiles. The paradigm was first introduced in Mootha, VK et al. "PGC-1alpha responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes," Nat. Genet. 34:267-73, 2003. Until that time, such data were commonly analyzed at the level of individual genes.

Figure 1

Gene Set Enrichment Analysis (GSEA) derives its power by considering sets of genes that correspond to a biological process, share a chromosomal location, or have a common regulatory pattern. GSEA addresses a number of problems associated with the single-gene approach to interpreting expression data. It enables the identification of subtle but relevant biological differences in phenotypes, and it aids in the interpretation of data through the use of annotated sets in the accompanying Molecular Signatures Database (MSigDB). Importantly, its results also tend to be more reproducible across data sets and technology platforms.

There are thousands of users of the GSEA software and the related, annotated MSigDB. GSEA has become the method of choice for the initial analysis of global expression data, much as BLAST (Basic Local Alignment Search Tool) is commonly used for sequence data.

Would you summarize the significance of your paper in layman's terms?

The method and the software implementation help biomedical researchers to identify the underlying biological processes associated with cellular states. Through this approach they may gain insights into the mechanisms of disease.

How did you become involved in this research and were there any particular problems encountered along the way?

This research began with the work described in the Mootha, VK et al. Nat. Genet. paper, which sought to identify significant differences between samples from patients with type II diabetes and samples from people with normal glucose tolerance. While single-gene approaches showed no significant difference between the two groups, GSEA was able to identify a set of genes associated with the oxidative phosphorylation pathway that was significantly down-regulated in the diabetic patients.

The problems we faced were associated with generalizing the method for use with other data sets. In particular, the initial approach identified sets of genes that exhibited any nonrandom behavior, while the method described in the cited paper detects sets of genes that are coordinately up- or down-regulated.

Given expression data for samples of two different cellular states or phenotypes, GSEA starts by ranking the genes according to their correlation with those states. The goal of GSEA is to determine whether the members of a gene set, corresponding to a biological pathway or process or to a cytogenetic band, are randomly distributed throughout the ranked list or over-represented at the top or bottom. Sets related to the distinction between the states should tend to show the latter behavior.
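
The ranking step described above could be sketched as follows. This is a minimal illustration, not the GSEA software itself: the signal-to-noise metric, the function names, the gene names, and the toy expression values are all assumptions made for the example; the text says only that genes are ranked by their correlation with the two states.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """One possible ranking metric (an assumption here):
    difference of class means divided by the sum of class standard deviations."""
    spread = stdev(values_a) + stdev(values_b)
    return (mean(values_a) - mean(values_b)) / spread if spread else 0.0


def rank_genes(expression, labels, state_a, state_b):
    """Return (gene, score) pairs sorted from the gene most associated
    with state_a (top) to the gene most associated with state_b (bottom)."""
    scored = []
    for gene, row in expression.items():
        a = [v for v, lab in zip(row, labels) if lab == state_a]
        b = [v for v, lab in zip(row, labels) if lab == state_b]
        scored.append((gene, signal_to_noise(a, b)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: three genes measured in six samples.
labels = ["diabetic", "diabetic", "diabetic", "normal", "normal", "normal"]
expression = {
    "GENE_A": [1.0, 1.2, 0.9, 2.0, 2.1, 1.9],  # lower in diabetic samples
    "GENE_B": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],  # roughly unchanged
    "GENE_C": [5.0, 5.2, 4.8, 2.0, 2.2, 1.8],  # higher in diabetic samples
}
ranked = rank_genes(expression, labels, "diabetic", "normal")
for gene, score in ranked:
    print(gene, round(score, 2))
```

In this toy ranking, a gene set containing only GENE_A would sit at the bottom of the list, which is the kind of non-random placement the method looks for.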

An "enrichment score" for the gene set is calculated by walking down the ranked list of genes, increasing the score by an amount proportional to the gene's correlation with the phenotype of interest if the gene is in the set, and decreasing the score if the gene is not in the set. The significance of the score is estimated by a permutation test. The use of the weighted step identifies sets of genes that are over-represented at the top (or bottom) of a gene list, rather than sets showing other non-random behaviors, which would also be detected with the Kolmogorov-Smirnov statistic.
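
The running-sum calculation and the permutation test described above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the released implementation: the function names, the `weight` parameter, and the toy ranked list are hypothetical. The text does not say what is permuted, so this sketch draws random gene sets of the same size as a simple stand-in, and it compares absolute scores rather than treating positive and negative scores separately.

```python
import random


def enrichment_score(ranked, gene_set, weight=1.0):
    """Walk down a ranked list of (gene, correlation) pairs.

    Step up by |correlation|**weight (normalized) for genes in the set,
    step down by a constant for genes outside it, and return the running
    sum at its largest absolute deviation from zero."""
    hits = [abs(score) ** weight if gene in gene_set else 0.0
            for gene, score in ranked]
    hit_total = sum(hits)
    n_miss = sum(1 for gene, _ in ranked if gene not in gene_set)
    if hit_total == 0 or n_miss == 0:
        return 0.0
    running = best = 0.0
    for (gene, _), hit in zip(ranked, hits):
        if gene in gene_set:
            running += hit / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked, gene_set, n_perm=1000, seed=0):
    """Estimate significance by scoring random gene sets of the same size
    (an assumed, simplified null model)."""
    rng = random.Random(seed)
    genes = [gene for gene, _ in ranked]
    size = len(set(genes) & set(gene_set))
    observed = abs(enrichment_score(ranked, gene_set))
    extreme = 0
    for _ in range(n_perm):
        null_set = set(rng.sample(genes, size))
        if abs(enrichment_score(ranked, null_set)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Hypothetical ranked list: ten genes ordered by correlation with a phenotype.
toy_ranked = [("g1", 0.9), ("g2", 0.8), ("g3", 0.7), ("g4", 0.4), ("g5", 0.1),
              ("g6", -0.1), ("g7", -0.3), ("g8", -0.5), ("g9", -0.7), ("g10", -0.9)]
top_set = {"g1", "g2", "g3"}
print(enrichment_score(toy_ranked, top_set))
print(permutation_p_value(toy_ranked, top_set))
```

Setting `weight=0` makes every in-set step the same size, which reduces the score to an unweighted, Kolmogorov-Smirnov-like running sum; the default weighted step gives more credit to set members with strong correlations near the top or bottom of the list.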

Eric S. Lander, Ph.D.
Broad Institute of Massachusetts Institute of Technology and Harvard
Cambridge, MA, USA

Jill P. Mesirov, Ph.D.
Associate Director, Chief Informatics Officer
Director, Computational Biology and Bioinformatics
Broad Institute of Massachusetts Institute of Technology and Harvard
Cambridge, MA, USA

Keywords: genome-wide expression profiles, genome-wide transcription profiles, Gene Set Enrichment Analysis (GSEA), Molecular Signatures Database (MSigDB), BLAST (Basic Local Alignment Search Tool), biological pathway, cytogenetic band, Kolmogorov-Smirnov statistic.

2008 : May 2008 - Fast Moving Fronts : Jill Mesirov & Eric Lander