Bioscience Technology




Filling The Gaps In The Human Genome

The human genome was sequenced at the beginning of this decade, and it was thought that the 20,000 or so genes it contains were all known. However, researchers are now finding that the gene catalog is far from complete. A recent paper published in Genome Research (Dec. 2007) by Adam Siepel and his collaborators identifies as many as 300 or so new human genes that were missed in the initial sequencing studies.

Most functional elements (such as protein-coding genes, cis-regulatory elements, and RNA genes) are fairly well conserved in vertebrate genomes, and they appear in the same positions when orthologous regions of genomes are aligned. As a result, statistical methods that look across species can detect unannotated elements by their characteristic patterns of evolutionary change. (Source: Adam Siepel, Cornell University)

“The way most genes are found is through large-scale sequencing of messenger RNA (mRNA) sequences and the problem with that is you have a strong bias toward the more abundant mRNAs,” says Siepel, an assistant professor in the Department of Biological Statistics and Computational Biology at Cornell University. “So our starting hypothesis for this project was that there have to be some genes out there that nobody’s been able to find through these mRNA-based methods.” Using large-scale computer clusters at Cornell’s Center for Advanced Computing, Siepel and his team compared vast amounts of genomic data from diverse species, including human, mouse, rat, and chicken, to predict genes missing from the existing human gene catalog.
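
To make the idea of detecting elements by cross-species conservation concrete, here is a minimal Python sketch. It is not the method used in the study, which the article does not describe in detail. It simply scores each column of a toy multiple alignment by how often the other species match the human base, then reports windows where the average score is high. The sequences, the window size, the threshold, and the function names are all invented for illustration.

```python
from typing import List, Tuple

def column_identity(alignment: List[str]) -> List[float]:
    """For each column, the fraction of non-reference rows matching row 0 (the reference)."""
    ref, others = alignment[0], alignment[1:]
    scores = []
    for i, base in enumerate(ref):
        if base == "-":
            # A gap in the reference carries no conservation signal here.
            scores.append(0.0)
            continue
        matches = sum(1 for seq in others if seq[i] == base)
        scores.append(matches / len(others))
    return scores

def conserved_windows(alignment: List[str], window: int = 6,
                      threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Return merged half-open intervals whose mean column identity meets the threshold."""
    scores = column_identity(alignment)
    merged: List[Tuple[int, int]] = []
    for start in range(len(scores) - window + 1):
        end = start + window
        if sum(scores[start:end]) / window < threshold:
            continue
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous hit: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical toy alignment (human first); not real genomic sequence.
toy_alignment = [
    "ATGGCCAATTTTTTGGTAAG",  # human
    "ATGGCCAAACGACGGGTAAG",  # mouse
    "ATGGCCAACAGCACGGTAAG",  # rat
    "ATGGCCAAGGACGAGGTAAG",  # chicken
]

if __name__ == "__main__":
    print(conserved_windows(toy_alignment))
```

Real comparative gene finders rely on statistical models of how sequences evolve rather than a raw identity count, but the sketch shows why a conserved stretch flanked by divergent sequence stands out once orthologous regions are aligned.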

Siepel had been working for several years on developing methods for finding genes using comparative sequence data and testing their accuracy against sets of known genes. In this study he applied his methodology to the whole genome to see whether he could find missing genes purely by examining evolutionary signatures in multiple mammalian genomes. “So that was part A,” says Siepel. “Then part B was to go after these genes experimentally and show that they were indeed expressed and spliced.” In collaboration with experimental biologists, Siepel followed up on his predictions with detailed, large-scale PCR-based analyses to show that the predicted genes did exist, and worked to establish their functional significance.
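
The article does not say which evolutionary signatures the method relies on. One widely known signature of protein-coding sequence, offered here only as an illustrative assumption, is that substitutions between species tend to cluster at the third position of each codon, where many changes leave the encoded protein unchanged. The Python sketch below counts mismatches between two aligned sequences by codon position. The sequences and function names are hypothetical.

```python
from typing import List

def mismatch_by_codon_position(seq_a: str, seq_b: str, frame: int = 0) -> List[int]:
    """Count mismatches at codon positions 1, 2 and 3 for a given reading frame."""
    counts = [0, 0, 0]
    for i in range(frame, min(len(seq_a), len(seq_b))):
        a, b = seq_a[i], seq_b[i]
        if a == "-" or b == "-":
            continue  # skip alignment gaps
        if a != b:
            counts[(i - frame) % 3] += 1
    return counts

def third_position_fraction(counts: List[int]) -> float:
    """Share of all mismatches that fall on the third codon position."""
    total = sum(counts)
    return counts[2] / total if total else 0.0

# Hypothetical aligned fragments; every difference sits at a third codon position.
human_fragment = "ATGGCTGAAAAACTG"
mouse_fragment = "ATGGCCGAGAAGCTC"

if __name__ == "__main__":
    counts = mismatch_by_codon_position(human_fragment, mouse_fragment)
    print(counts, third_position_fraction(counts))
```

In sequence that does not code for protein, mismatches would be expected to spread roughly evenly across the three positions, so a strong excess at the third position hints at a coding region in that reading frame.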

The Genome Research study reported between 164 and 327 novel human genes missing from existing genome databases, with the number depending on how the known genes are defined. “It was kind of surprising to see how incomplete the representations of some of these genes were in the databases,” says Siepel. For the most part, the genes they found were either expressed only in specific tissues and at specific times in development, or expressed at low levels. “That wasn’t unexpected, but the particular classes of genes that we found were kind of interesting,” says Siepel. For instance, the two dominant sets of missing genes were those encoding extracellular proteins and motor proteins, which was totally unexpected. “Many sequences that we found turned out to be large extensions of known genes, which was also somewhat of a surprise.”

The gene data from the study are now accessible via the University of California, Santa Cruz (UCSC) Genome Browser, one of the most widely used publicly available tools for browsing genomic data. “We’ve actually created a version of the UCSC browser specifically for this Mammalian Gene Collection (MGC) project called the MGC browser, which people can access,” says Siepel. Although the data are publicly available, the identified gene sequences are still only partial. “So there’s follow-up work going on as part of the MGC project to get full-length transcripts for these genes. Once the full-length transcripts are available, they will get incorporated into various genome databases.”

Other ongoing projects in Siepel’s lab continue to examine the impact of evolution on genomics and vice versa. While the Genome Research paper focused on genes that are conserved across mammalian species, Siepel is also looking at ways to identify genes that are only partially conserved between species or not conserved at all. “It’s a lot harder to use evolutionary information and yet allow for these changes. At the same time it is interesting to identify these genes that differ from one species to the other, as they may help distinguish the genomes in those species.” Siepel is also preparing to publish his work on positive selection in mammalian genomes. “We are identifying genes that have been under evolutionary pressure to change, where new forms of these genes have been favored by natural selection over the previous version, and we are doing a comprehensive examination of positively selected genes in all available mammalian genomes.”

After earning an undergraduate degree in agricultural and bio-engineering from Cornell University in 1994, Adam Siepel ventured out West and worked on an HIV database project at Los Alamos National Laboratory. “I really liked this combination of biology and computer science and got involved in bioinformatics, long before it was actually called bioinformatics,” says Siepel. He left Los Alamos in 1996 and pursued a master’s degree in computer science at the University of New Mexico while working as a software engineer at the National Center for Genome Resources. He then moved to the University of California, Santa Cruz, for doctoral studies. “In the course of doing my Ph.D., I realized that in some sense the heart of bioinformatics was statistics,” says Siepel. “So I started doing more and more statistics and ended up in this area where genomics, computer science and statistics meet, which is a fascinating area and of course very timely.” His team at Cornell University is made up of people with equally diverse backgrounds in computer science, mathematics, biology and genetics. “It’s nice to have the group be more heterogeneous because they learn more from each other that way,” says Siepel.



© 2008 Advantage Business Media. All rights reserved.