Bioscience Technology
100 Enterprise Drive Rockaway, NJ, 07866


Clawing Out Clusters in Temporal Gene Expression
by Mike May
Microarrays can track the activity of many genes under a variety of circumstances. But how can scientists compare data on dozens of genes at different time points and in different subjects? That question turned into a PhD thesis for Yang Huang, who worked under Martin Farach-Colton at Rutgers, the State University of New Jersey.
In clustering data, LABSTER (pink) produced a far lower error rate than 13 vector-based methods (purple). (Image courtesy of Yang Huang.)
“Right now, some groups perform experiments in temporal gene expression for many patients,” says Huang. Such studies produce a temporal gene-expression matrix (TGEM) for each patient, where each gene gets its own row and the columns represent the points in time when expression was measured. So an experiment including n genes and m time points makes a TGEM with nm cells, and an experiment that includes N subjects leads to a total of nmN cells of data. Put in real terms, a microarray experiment that tracks 100 genes at 6 time points in 1,000 patients churns out 600,000 cells. Most researchers, however, would prefer to compare the matrices themselves. That is, a scientist would ask: What is similar or different among the 1,000 TGEMs in this example, each composed of 600 cells?
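For readers who think in code, the bookkeeping behind those numbers can be sketched in a few lines of Python. This is only an illustration of the data layout described above: the variable names and the placeholder zeros are assumptions, not anything taken from Huang's software.

```python
# Dimensions taken from the example in the text.
n_genes, n_timepoints, n_subjects = 100, 6, 1000

def empty_tgem(n, m):
    """Return an n-by-m matrix: one row per gene, one column per time point."""
    return [[0.0] * m for _ in range(n)]

# One TGEM per subject; the zeros are placeholders for expression values.
tgems = [empty_tgem(n_genes, n_timepoints) for _ in range(n_subjects)]

cells_per_tgem = n_genes * n_timepoints      # n * m
total_cells = cells_per_tgem * n_subjects    # n * m * N
print(cells_per_tgem, total_cells)
```

The point of the layout is that the unit of comparison is a whole matrix per subject, not a single cell or a single column.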
As Huang explains, the goal of such research is that “we want to find how many classes the data include.” In other words, the scientists want to find the similar patients, those who respond with similar expression profiles under similar circumstances. If, for example, the data set includes patients who took a specific drug for the same disease, who did this drug help? This information would uncover classes of responders and nonresponders.
In the summer of 2006, Huang started working on a solution to this class-discovery problem with TGEMs. At that time, says Huang, many researchers tried to solve the problem by looking at a gene-expression vector for each patient, where a vector only contains expression data at one time point. “A more comprehensive approach,” says Huang, “is to measure the expression for the same patient at several different time points.” To examine such TGEMs, Huang started developing LABSTER, which stands for lattice-based clustering. Instead of looking at just one point in time per patient, LABSTER compares TGEMs for many patients. “It can handle hundreds of genes,” Huang says. By comparing the expression of genes across time and patients, LABSTER identifies classes in the data, such as the responders and nonresponders mentioned above.
The process behind LABSTER involves a series of data manipulations. The data get transformed in a couple of steps, all designed to prepare them for a Galois lattice (see sidebar). First, the continuous data from each TGEM are converted into discrete values. Each discrete matrix is then transformed to a binary matrix, from which a Galois lattice is constructed. Then, LABSTER calculates the distances between the lattices, basically a measurement of similarity. Finally, the program uses the calculated distances to make clusters of TGEMs, essentially groups of subjects with similar gene-expression profiles. Those clusters make up the classes in the data.
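The sequence of steps above can be mimicked on toy data. The Python sketch below is not LABSTER itself, and the article does not spell out its internals. The thresholds, the one-hot encoding, the Jaccard-style lattice distance, and the simple threshold clustering are all assumptions chosen to keep the example small, and every function name is hypothetical.

```python
from itertools import combinations

def discretize(tgem, low=-0.5, high=0.5):
    """Map each value to 0 (down), 1 (flat), or 2 (up); thresholds are assumed."""
    return [[0 if v < low else 2 if v > high else 1 for v in row] for row in tgem]

def binarize(discrete):
    """Binary matrix as attribute sets: each gene owns the (time, level) pairs it shows."""
    return [frozenset(enumerate(row)) for row in discrete]

def concept_intents(rows):
    """Intents of the Galois lattice: all intersections of the genes' attribute sets."""
    intents = set(rows)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(intents), 2):
            if (a & b) not in intents:
                intents.add(a & b)
                changed = True
    return intents

def lattice_distance(a, b):
    """Assumed distance: 1 minus the share of intents the two lattices have in common."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def cluster(distances, n, threshold):
    """Subjects closer than the threshold end up with the same label."""
    labels = list(range(n))
    for (i, j), d in distances.items():
        if d < threshold:
            old, new = labels[j], labels[i]
            labels = [new if lab == old else lab for lab in labels]
    return labels

# Four made-up subjects, two genes, three time points each.
tgems = [
    [[0.1, 0.9, 1.2], [0.0, -0.8, -1.0]],
    [[0.2, 0.7, 1.5], [0.1, -0.9, 0.0]],
    [[0.0, 0.1, -0.2], [0.3, 0.2, 0.1]],
    [[0.1, 0.0, 0.2], [-0.1, 0.4, 0.0]],
]
lattices = [concept_intents(binarize(discretize(t))) for t in tgems]
distances = {(i, j): lattice_distance(lattices[i], lattices[j])
             for i, j in combinations(range(len(lattices)), 2)}
labels = cluster(distances, len(lattices), threshold=0.6)
print(labels)
```

In this toy run the first two subjects, whose genes move the same way over time, share a label, and the two flat profiles form the second group. A real implementation would need a more careful discretization and a faster lattice construction than this brute-force closure.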
To test out LABSTER, Huang started with simulation data. The results looked promising, because the program did find the classes in the simulated data. To be useful, though, LABSTER must show the same class-discovery capabilities on real data.
To find such data, Huang turned to the January 2005 PLoS Biology and a study by Sergio E. Baranzini of the School of Medicine at the University of California, San Francisco, and his colleagues. They examined data from time-series microarrays generated from 70 genes in 52 multiple sclerosis patients who were treated with recombinant human interferon beta (rIFN). In those data, Baranzini’s team found just two classes: good and poor responders.
Huang turned LABSTER loose on data from 27 of the patients, those with complete data for seven time points. Among those patients, Baranzini and his colleagues had found 19 good responders and 8 poor ones. For the most part, LABSTER placed the same patients in the same classes, with an error rate of just 7.4 percent. But Huang warns, “There are many ways to measure the error rate.” Then he adds, “We did compare our results with vector-based clustering methods and found our results were the best.” In fact, LABSTER classified the patients better than 13 other methods, some of which gave error rates as high as 40 percent.
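One common way to score a clustering against known classes is the fraction of subjects that land in the wrong group after the cluster labels are matched to the classes as favorably as possible. Under that assumed definition, 7.4 percent of 27 patients corresponds to 2 patients. The Python sketch below reproduces the arithmetic. Which patients were misplaced is not stated in the article, so the predicted labels here are hypothetical, and the function assumes there are no more clusters than reference classes.

```python
from itertools import permutations

def clustering_error(reference, predicted):
    """Fraction misassigned under the best one-to-one renaming of cluster labels."""
    ref_labels = sorted(set(reference))
    pred_labels = sorted(set(predicted))
    best = len(reference)
    for perm in permutations(ref_labels, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        errors = sum(mapping[p] != r for p, r in zip(predicted, reference))
        best = min(best, errors)
    return best / len(reference)

# Reference classes from the text: 19 good responders and 8 poor ones.
reference = ["good"] * 19 + ["poor"] * 8
# Hypothetical clustering with one patient misplaced in each direction.
predicted = [0] * 18 + [1] + [1] * 7 + [0]

rate = clustering_error(reference, predicted)
print(f"{rate:.1%}")
```

As Huang cautions, other error measures exist, and a different definition could give a different number for the same clustering.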
More work lies ahead, such as developing faster techniques for computing the Galois lattices and making even better use of the temporal information in data. Moreover, Huang expects to apply his technique to forms of data beyond gene expression. He will continue some of those investigations at the National Center for Biotechnology Information.
Why go Galois?
Using Galois lattices is not new. “These lattices are used to organize a set of objects with different attributes,” says Huang. “People use Galois lattices to organize Web pages, retrieve information, and represent various knowledge concepts.” For example, François Rioult of the University of Caen in France and his collaborators applied Galois lattices to gene-expression matrices in 2003.
“We used this lattice,” says Huang, “because our target is class discovery in gene-expression matrices.” These Galois lattices create a new representation of the data from temporal gene-expression matrices. Specifically, the Galois lattices make it easier, mathematically speaking, to measure the differences between the matrices. Without the organizational help from the Galois lattices, making such measurements can turn into a mess.
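To make the idea of organizing objects by their attributes concrete, here is a tiny, made-up example in Python. The three "pages" and their keywords are hypothetical. Each node of the Galois lattice pairs a set of objects with exactly the attributes they all share, and the sketch finds those pairs by brute force, which is only practical for very small inputs.

```python
from itertools import combinations

# Hypothetical context: which attributes (keywords) each object (web page) has.
context = {
    "page1": {"genes", "clustering"},
    "page2": {"genes", "microarray"},
    "page3": {"genes", "clustering", "microarray"},
}
all_attrs = set().union(*context.values())

def common_attrs(objects):
    """Attributes shared by every object in the set (all attributes if it is empty)."""
    shared = set(all_attrs)
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)

def objects_with(attrs):
    """Objects that have every attribute in the set."""
    return frozenset(obj for obj, owned in context.items() if attrs <= owned)

# A lattice node is a pair (objects, attributes) in which each side determines the other.
concepts = set()
for k in range(len(context) + 1):
    for objs in combinations(context, k):
        intent = common_attrs(objs)
        concepts.add((objects_with(intent), intent))

for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Ordering the resulting pairs by set inclusion gives the lattice. Two data sets can then be compared through their lattices, which is the kind of representation the sidebar describes.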