by Gina Shaw
Figure 1. Preparing samples for sequencing.
First, we sequenced the human genome. Now, the human proteome is firmly in the crosshairs of science, although because it is vastly larger than the genome, completing the initial sequencing will take much longer. Next on the agenda: how about the human microbiome?
Microbial cells outnumber human cells by a factor of about ten to one, but these microbial communities have been almost completely ignored by research. What do they do? How do they affect our growth and development, disease and wellness? With the exception of a few nasty pathogens like E. coli, the short answer is: we really don't know.
In fiscal 2007, the National Institutes of Health began funding projects aimed at building a data resource for the new Human Microbiome Project, including sequencing the genomes of 200 microbes that have been isolated from the human body. Four sequencing centers — at Baylor, the Broad Institute, the J. Craig Venter Institute, and Washington University — will generate the data for this part of the project.
The rise of metagenomics
Still in its infancy, the Human Microbiome Project will build on a foundation constructed largely by environmental scientists and microbiologists who have been studying microbial genomic diversity in situ rather than by cultivation-based methods, a technique known as metagenomics.
Figure 2. A comparison view generated using MEGAN2, comparing metagenomic data from three ancient DNA studies published in the last couple of years: Mammoth, Cave Bear and Neanderthal.
Instead of using conventional DNA analysis — culturing identical cells in a lab — metagenomics sequences genetic material en masse from uncultured, environmental samples, allowing researchers to study organisms that are not easily cultured in laboratories (that's more than 99% of microbes), and revealing an amazing amount of microbial diversity that had previously gone unrecognized.
In the April 3, 2008, edition of Nature, Elizabeth Dinsdale et al. published a metagenomic mother lode: 15 million sequences from 45 distinct microbiomes and, for the first time, 42 distinct viromes. The multi-center team, based at San Diego State University, showed strongly discriminatory metabolic profiles across environments.
Another fascinating finding involved phage—viruses that infect bacteria. "Phage are even more numerous than bacteria, and our analysis showed that phage were also carrying a wider array of genes than first thought. That means because phage infect bacteria, put their DNA in and make bacteria replicate, they're introducing new genetic material into the bacteria population and may control the way bacteria are acting in a community. So metagenomic analysis tells us that the influence of phage is much greater than we first thought."
Massively-parallel pyrosequencing
Dinsdale's team used 454 Sequencing, the massively parallel pyrosequencing system from 454 Life Sciences, to generate these findings.
"The biggest advantage is that we don't have to clone the sequences," says Dinsdale. "We take the DNA and it goes directly into sequencing without the cloning step, actually placing the sequences onto a bead and replicating them within an individual well. This saves time, which is hugely important given the amount of data we have: one sample is about 300,000 sequences." It took 3.5 weeks of computer time just to do the initial BLAST search against the NCBI database for the 15 million sequences covered in the Nature paper.
The fact that the 454 system does not require cloning also allows Dinsdale and her colleagues to sequence pieces of the bacterial genome that just won't clone for various reasons. "There are some parts of the genome that you would never see from a cloned sequence," she says.
An often-quoted limitation of the 454 system, short sequence reads, doesn't bother Dinsdale. "First, their sequence reads have gotten longer, and our paper has shown we can obtain a good description of the microbial environment using a shorter sequence read in any case. Without having to grow the bacteria in many different cultures — a high carbon environment, a high silicon environment, and so on — we were able to obtain a fairly accurate picture of the microbial genes carried across the whole community, and interpret what sorts of functions they would be able to do."
Signature genes
Figure 3. MEGAN's microbial attributes window.
The biggest barrier for most metagenomic researchers, Dinsdale says, is computing capacity. In less than a year since her team first ran their analysis of their data, the number of characterized metagenomes exploded from 87 to more than 800. "The amount of data will keep increasing, because it's such a window into the microbial world that hasn't been seen before."
Publicly accessible, computer-based tools aim to help with some of that data overload. One of them is Signature (http://www.cmbi.ru.nl/signature), a Web server that allows the input of genes from an unknown source, such as a metagenomics sample.
"The Web server then tells you if your input contains 'signature genes' for clades in the tree of life. Signature genes are only found in one clade, but within that clade are widely distributed," explains one of the lead developers of Signature, Bas E. Dutilh, of the Center for Molecular and Biomolecular Informatics at Nijmegen Center for Molecular Life Sciences, Radboud University, Nijmegen Medical Centre, in the Netherlands. "If you find many signature genes for a certain clade, it is likely that a species from that clade is represented in your sample."
The signature genes on which the server is based (8,362 signature genes specific for 112 prokaryotic taxa) can answer phylogenetic questions on the basis of gene content even where complete genomes are not available. "By now, there are so many completed genomes available that together they have some predictive power for the ones that are not yet sequenced," explains Dutilh. "If all the complete genomes that are available agree that a certain gene is a signature for a clade, then it is highly unlikely that we will find it in a species outside that clade. So conversely, if we find this signature gene in a certain species, then it is highly likely that it belongs in that clade. As the signature genes approach can be used for incomplete genomes, it opens the possibility of using the phylogenetic power in gene content in cases where we cannot obtain a complete genome sequence, like for 'un-culturable' species or metagenomic samples. In principle, finding one signature gene should already be enough."
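To make the definition concrete, here is a minimal Python sketch of the idea Dutilh describes: a gene counts as a signature if it occurs in only one clade and is widespread within that clade. This is an illustration under stated assumptions, not the Signature server's actual code. The genome names, gene names, clade labels, and the `min_fraction` cutoff standing in for "widely distributed" are all invented for the example.

```python
from collections import defaultdict


def find_signature_genes(genomes, clade_of, min_fraction=0.8):
    """Return {gene: clade} for genes seen in exactly one clade and present
    in at least min_fraction of that clade's genomes.

    genomes:  {genome_name: set of gene names}   (hypothetical input format)
    clade_of: {genome_name: clade label}
    min_fraction: assumed cutoff for "widely distributed" within a clade
    """
    # Number of complete genomes available per clade.
    clade_sizes = defaultdict(int)
    for genome in genomes:
        clade_sizes[clade_of[genome]] += 1

    # For each gene, count how many genomes in each clade carry it.
    carriers = defaultdict(lambda: defaultdict(int))
    for genome, genes in genomes.items():
        for gene in set(genes):
            carriers[gene][clade_of[genome]] += 1

    signatures = {}
    for gene, per_clade in carriers.items():
        if len(per_clade) != 1:
            continue  # seen in more than one clade, so not a signature
        (clade, count), = per_clade.items()
        if count / clade_sizes[clade] >= min_fraction:
            signatures[gene] = clade
    return signatures


def count_signature_hits(sample_genes, signatures):
    """Count how many distinct signature genes of each clade appear in a sample."""
    hits = defaultdict(int)
    for gene in set(sample_genes):
        if gene in signatures:
            hits[signatures[gene]] += 1
    return dict(hits)


# Toy data: two clades with two complete genomes each (all names invented).
genomes = {
    "genome_A1": {"g1", "g2", "g9"},
    "genome_A2": {"g1", "g3", "g9"},
    "genome_B1": {"g4", "g5", "g9"},
    "genome_B2": {"g4", "g6", "g9"},
}
clade_of = {
    "genome_A1": "clade_A",
    "genome_A2": "clade_A",
    "genome_B1": "clade_B",
    "genome_B2": "clade_B",
}

signatures = find_signature_genes(genomes, clade_of)
print(signatures)
print(count_signature_hits(["g1", "g9", "g7"], signatures))
```

In the toy data, `g9` occurs in both clades and is excluded, while `g2` is clade-specific but too rare to pass the cutoff. A sample carrying `g1` therefore points to `clade_A`, which mirrors the reasoning that a single signature gene can in principle be enough.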
Most methods that assess species distribution in a sequence sample do so on the basis of universal marker genes that contain a good phylogenetic signal. But, Dutilh notes, while there are only a few good phylogenetic marker genes, there are thousands of signature genes. "By finding which signature genes taxonomically identify a sample, Signature immediately pinpoints the genes that characterize the clade. So Signature can also be useful to find out what genes actually characterize a specific group of species."
For example, Dutilh's team has used the signature gene approach to analyze the genome of Kuenenia stuttgartiensis, a planctomycete that oxidizes ammonium anaerobically. By testing which clade shares the most signature genes with Kuenenia and the other Planctomycetes, the researchers discovered that the Planctomycetes are most closely related to the Chlamydiae. "One of the signature genes we discovered for this Planctomycetes-Chlamydiae superphylum is a gene that is highly similar to the 60-kDa cysteine-rich outer membrane protein of Chlamydiae, and they share more properties of their cell envelopes," says Dutilh. "So Signature is not only useful to find the taxonomic relatives of a new species, but it also immediately identifies those genes that are characteristic of the clade."
Analysis of metagenomic datasets
A complementary tool is MEGAN (MEtaGenome ANalyzer), a computer program that allows optimized analysis of large metagenomic datasets in order to identify the species they contain. Co-developers Daniel Huson, of the Center for Bioinformatics at Tübingen University in Germany, and Stephan Schuster, of Penn State University, designed MEGAN to compare DNA sequence fragments from an environmental sample with gene sequences from GenBank at NCBI.
Fragments of DNA from an environmental sample, such as ocean waters or soil, are compared against databases of known DNA sequences using BLAST or another algorithmic bioinformatics tool to assemble the segments into discrete comparable sequences. MEGAN is then used to compare the resulting sequences with gene sequences from GenBank at NCBI. Originally designed to investigate the DNA of a mammoth found in the Siberian permafrost, and described in a 2007 paper in Genome Research, MEGAN has since garnered hundreds of registered users. "The main use is probably to obtain a first analysis of a new dataset, which is then subsequently followed up by more detailed analyses. In some sequencing centers, MEGAN is routinely used to check whether a sequence project contains the types of sequences that it is supposed to," says Huson.
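The article does not spell out how database hits become a species-level picture. One widely used idea, and the one generally associated with MEGAN, is to place each read at the lowest common ancestor (LCA) of all the taxa it matched. The Python sketch below is a toy illustration of that idea only, not MEGAN's code. The tiny taxonomy, the read names and the hit lists are invented, and a real analysis would also filter hits by score.

```python
def lineage(taxon, parent):
    """Return the path from the root of the taxonomy down to taxon."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def lowest_common_ancestor(taxa, parent):
    """Return the deepest taxon shared by the lineages of all given taxa."""
    paths = [lineage(t, parent) for t in taxa]
    lca = None
    # Walk down from the root one level at a time while all lineages agree.
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


def assign_reads(hits_by_read, parent):
    """Map each read with at least one hit to the LCA of the taxa it matched."""
    return {
        read: lowest_common_ancestor(taxa, parent)
        for read, taxa in hits_by_read.items()
        if taxa
    }


# Simplified toy taxonomy as child -> parent links (intermediate ranks omitted).
parent = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Salmonella enterica": "Proteobacteria",
    "Bacillus subtilis": "Firmicutes",
}

# Hypothetical database hits per read, e.g. parsed from a BLAST report.
hits_by_read = {
    "read_1": ["Escherichia coli"],
    "read_2": ["Escherichia coli", "Salmonella enterica"],
    "read_3": ["Escherichia coli", "Bacillus subtilis"],
    "read_4": [],
}

for read, taxon in assign_reads(hits_by_read, parent).items():
    print(read, "->", taxon)
```

A read that matches a single species stays at that species, while a read that matches distantly related organisms is pushed up to a broader group. That conservative behavior suits the kind of first-pass, exploratory analysis Huson describes.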
Huson and his team are now at work on MEGAN 2.0, which will aim at comparative analysis of multiple datasets. "Also, we are about to release a companion program, MetaSim, that can be used to simulate a metagenome sequencing project, controlling both the exact mix of genomes and also the type of sequencing technology used," he says.
Analysis of metagenomic data sets is still in its infancy, says Huson. "So a tool like MEGAN, which aims particularly at interactive exploration of data, rather than trying to cook a final answer, will help people to discover and understand how such an analysis should best be done."
A long way to go
The main message of metagenomics so far, according to Huson, is that the tree of life has a lot more branches than we thought. "Only about 300,000 species are represented by at least one gene sequence in the NCBI databases; however, nearly two million species have already been named and metagenome projects are confirming that the true spectrum of different types of organisms is probably much, much larger.