Bioscience Technology

Big Data, Big Science
Pioneers of biological computing are making it easier for researchers to sort through and dig deeper into data.
By Gina Shaw
There are approximately 20,000 genes in the human genome. These genes, in turn, code for hundreds of thousands of proteins. Ultimately, that means that there are literally trillions of possible interactions among genes and proteins within cells. And there, in essence, is one of the key problems of current bioscience research: we don’t have too little information, but way too much of it. Or, at least much more than we can effectively use to answer critical questions about biology, pathophysiology, the etiology of disease, and the effectiveness of therapies.
In reality, the sheer magnitude of the data sets is not the heart of the matter. It’s what researchers need to do with those data sets. “It takes us on the order of two weeks of supercomputing time to get a full analysis of a complex network from 190 megabytes of data,” says Andrea Califano, PhD, professor of biomedical informatics at Columbia University Medical Center, New York. “It would take the National Security Agency about four or five seconds to go through the same amount of data and look for keywords. The problem is that what we are looking for is not linear and straightforward, but extraordinarily complex.”
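To see why the search is "not linear," it helps to count candidates. The short Python sketch below is only an illustration of combinatorial growth, not the Columbia team's method: it assumes the article's round figure of 20,000 genes and simply counts how many gene pairs and gene triples exist. A keyword scan touches each item once, whereas a network analysis may have to weigh combinations of items.

```python
from math import comb

# Assumption: the article's approximate human gene count.
N_GENES = 20_000

# A keyword-style scan looks at each item once, so its cost grows linearly.
linear_items = N_GENES

# Candidate interactions grow combinatorially with the number of genes.
pairs = comb(N_GENES, 2)    # every unordered pair of genes
triples = comb(N_GENES, 3)  # every unordered group of three genes

print(f"items to scan linearly: {linear_items:,}")
print(f"possible gene pairs:    {pairs:,}")    # 199,990,000
print(f"possible gene triples:  {triples:,}")  # 1,333,133,340,000
```

Even before proteins are counted, three-way combinations among genes alone already exceed a trillion.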
Dr. Andrea Califano in his lab.
So the challenge for today’s bioinformatics experts is not so much how to store, organize, or manage these vast data sets, but how to integrate the information in them to generate useful answers to scientific questions. That’s the raison d’etre for Columbia’s National Center for Multi-Scale Analysis of Genetic and Cellular Networks (MAGNet), part of a nationwide network of seven centers created by the National Institutes of Health (NIH) to develop the computational and scientific infrastructure needed to make effective use of the massive amounts of data coming out of the Human Genome Project and other major research undertakings.
MAGnet-ized
Created in September, 2005 with an $18.5 million, five-year grant from the NIH, MAGnet is part of a larger interdepartmental center at Columbia, the Center for Computational Biology and Bioinformatics (C2B²), which brings together biology and the computational and physical sciences in efforts such as the modeling of regulatory, signaling, and metabolic networks. It’s all about what Califano, who directs MAGnet, calls an integrated view of biology, in which data does not sit in isolated chunks but talks to other bits of data via advances in computational biology.
“Consider a large data set—say, about 100 megabytes of gene expression profile data,” says Califano. “For a library, that quantity of data would be minute. But from it, we can extract an extraordinarily large amount of processed ‘meta data.’ You don’t necessarily want to store it anywhere, but maybe just hold it locally while doing analysis and extracting a final result, such as the analysis of what pathways or networks of interaction in a cell may be.”
Some of the techniques for leveraging these oceans of information have been developed, and others are still in their infancy. C2B² takes a global approach: instead of looking at a particular data modality, such as sequence data or gene expression analysis, it focuses on how to bring a variety of different data modalities together and see where all that information points when it’s studied in the context of a network of interactions.
Califano suggests a comparison that would hit home with millions of New Yorkers. “If I gave you information about traffic patterns in New York, and a list of streets where there is traffic congestion, you might have a lot of information—but if you have no idea what the street grid looks like, it would be extraordinarily difficult for you to make sense of it,” he says. “But if I give you a map that shows you which streets intersect, and you can see that 40% of congested streets are in a tight location on the map, you can understand that there’s a portion of Manhattan that’s not working well. If I also give you flood information, fire information, electricity flow, and other types of information, you can start putting together an integrated view of a day in Manhattan and distinguish a good day from bad.”
That’s exactly what C2B², aided by computational advances from MAGnet, aims to do at the cellular level and the organism level. “We’re mapping out questions such as what pathways are dysregulated in a certain cancer and how a particular drug perturbation will affect those pathways, resolving or complicating the situation.”
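The traffic analogy can be made concrete with a small sketch. The Python below is a toy under stated assumptions, not C2B² software: the node names, the edge list, and the helper `flagged_clusters` are all hypothetical. It takes a map (which nodes connect to which) and a list of flagged nodes (congested streets, or dysregulated genes), then groups the flagged nodes that sit next to one another on the map.

```python
from collections import deque

def flagged_clusters(edges, flagged):
    """Group flagged nodes into clusters that are connected on the map.

    edges   -- iterable of (node, node) pairs describing the map
    flagged -- nodes of interest (e.g. congested streets)
    Returns a list of sets, largest cluster first.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    unvisited = set(flagged)
    clusters = []
    while unvisited:
        start = unvisited.pop()
        cluster = {start}
        queue = deque([start])
        # Breadth-first search that only walks through flagged nodes.
        while queue:
            node = queue.popleft()
            for neighbor in adjacency.get(node, ()):
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    cluster.add(neighbor)
                    queue.append(neighbor)
        clusters.append(cluster)
    return sorted(clusters, key=len, reverse=True)

# Hypothetical map: a chain n1..n8 with one side street n9 off n3.
edges = [("n1", "n2"), ("n2", "n3"), ("n3", "n4"), ("n4", "n5"),
         ("n5", "n6"), ("n6", "n7"), ("n7", "n8"), ("n3", "n9")]
flagged = {"n2", "n3", "n4", "n9", "n7"}

clusters = flagged_clusters(edges, flagged)
share_in_largest = len(clusters[0]) / len(flagged)
print(clusters[0], share_in_largest)
```

Without the edge list, the five flagged nodes are just a list. With it, four of the five turn out to be neighbors, which is the kind of pattern the map makes visible.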
Califano and his colleagues, including MAGnet’s co-director, Barry Honig, PhD, have already developed a number of novel computational methods and tools that sort through genomic, proteomic, and network information in new and targeted ways. These include first-of-their-kind genome-wide computational techniques that can identify all the post-transcriptional modulators of a given transcription factor.
“It’s relatively easy today to figure out targets at the genetic level—transcription factors that activate or deactivate the expression of genes,” says Califano. “It’s much harder to figure out, in the cell, how molecules interact to send signals through the membrane down into the nucleus.”
But the new methods from Columbia drill down in just that way. “I can give you a list of all the druggable proteins you can hit to shut down or to activate a particular protein in a particular cell,” Califano says. Few laboratories are doing this at the computational level in human cells, and fewer still have biochemically validated all their computational predictions, as C2B² does.
Turned on its head
The emerging complexity of molecular interactions in the cell calls for a new level of sophistication in the design of genome-wide computational approaches.
Another experimental method developed at C2B² turns traditional approaches to the study of gene expression in cancer—in this case, glioblastoma—on their head. Typically, genetic research in oncology involves studying the expression of genes, whether in tumors or in normal tissue, and then compiling a list of those genes that are differentially expressed. Within that list, there may be particular genes of interest—culprits responsible for the malignant phenotype, targets overexpressing certain proteins, all potential drug targets.
But Honig, Califano and their team approach the question from an entirely different angle. “Rather than asking what genes in a signature differentiate normal cells from tumor cells, we’ve developed a computational algorithm that identifies the regulators—the small group of genes that make this happen,” says Califano. In glioblastoma, the Columbia scientists have identified a small cluster of six transcription factors that seem to regulate the malignant signature. “If you modify two of them, you see a strong, invasive phenotype, completely consistent with the glioblastoma signature.”
This, says Califano, changes the paradigm for cancer research. “Instead of looking through lists, you say ‘What’s upstream of these lists?’” He offers another analogy. “If you see that people are wearing a particular T-shirt, you look for who launched the trend, rather than who’s wearing the T-shirt. We’re looking for root causes, rather than downstream effects.”
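One generic way to ask "what's upstream of these lists?" is an enrichment test: for each candidate regulator, check whether its known targets show up in the differential signature more often than chance would predict. The Python sketch below uses a standard hypergeometric tail probability for that purpose. It is an assumption-laden illustration, not the algorithm the Columbia team developed; the regulator names (`TF_A`, `TF_B`, `TF_C`), the gene labels, and the toy network are invented.

```python
from math import comb

def hypergeom_tail(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n are in the signature."""
    total = comb(M, N)
    upper = min(n, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total

def rank_regulators(network, signature, n_genes):
    """Rank regulators by how strongly their targets overlap the signature."""
    results = []
    for regulator, targets in network.items():
        overlap = len(targets & signature)
        p_value = hypergeom_tail(overlap, n_genes, len(signature), len(targets))
        results.append((regulator, overlap, p_value))
    # Smallest p-value first: the most surprising overlap ranks highest.
    return sorted(results, key=lambda row: row[2])

# Hypothetical toy data: 100 genes, 10 of them differentially expressed.
N_GENES = 100
signature = {f"g{i}" for i in range(10)}
network = {
    "TF_A": {f"g{i}" for i in range(8)} | {"g50", "g51"},  # 8 of 10 targets in signature
    "TF_B": {f"g{i}" for i in range(40, 50)},              # 0 of 10 targets in signature
    "TF_C": {"g0", "g60", "g61", "g62", "g63"},            # 1 of 5 targets in signature
}

ranking = rank_regulators(network, signature, N_GENES)
for regulator, overlap, p_value in ranking:
    print(f"{regulator}: overlap={overlap}, p={p_value:.3g}")
```

In this toy case the regulator whose targets account for most of the signature ranks first, which mirrors the T-shirt analogy: the ranking points at who launched the trend rather than who is wearing the shirt.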
The integration of “big data” at Columbia is about to get bigger. The medical center has just received a $12 million grant from New York State to build a large supercomputing center in cancer biology, which will ultimately be populated by 3000 CPUs.
All of the information generated by these interlocking efforts will ultimately be widely accessible thanks to the cancer Biomedical Informatics Grid (caBIG), an open-solutions development community created by the National Cancer Institute that currently includes 50-plus cancer centers as well as more than 30 federal, nonprofit, and industry organizations. Columbia runs one of the “grid nodes” for caGRID, caBIG’s underlying infrastructure—geWorkbench, an open-source software platform for genomic data integration.
Launched in February 2004 “in recognition of the fact that the exponential increase in information about cancer was overwhelming the existing information technology infrastructure” (according to the caBIG Web site), caBIG is essentially one giant, open-source, biomedical computing development network. It’s been dubbed the World Wide Web of cancer research. It’s organized into virtual “workspaces,” teams developing and refining technologies in key areas of interest, such as interoperable clinical trials management tools, interfaces to integrate between biomedical informatics applications and data, tools for extracting and sharing information from imaging data, and systems to track and mine information from tissue samples.
“The data scientists generate through our research is distributed around the world,” says Califano. “What is needed is a transparent way of moving all that information to where it’s most convenient for the people who are trying to reach it, whether it’s academic researchers or oncologists in clinical practice. Some of the tools being developed through caBIG will allow scientists to treat the Internet as a vast grid resource. More broadly, it’s an attempt to create a strong and unified approach to managing these large volumes of data.”
About the Author
Gina Shaw has been writing about health, medicine, and science for more than 10 years. Her work has also appeared in scientific and consumer publications.