Speak, RNA

Charles Perou views scanned gene-expression microarray images with Lisa Cary and Katherine Hoadley in his lab at the University of North Carolina at Chapel Hill. (Photo: UNC Lineberger Comprehensive Cancer Center)

If you want to know what a cell, tissue, or organism is doing, molecularly speaking, you don’t look at its genome; you’ve got to look further downstream.

One option: the transcriptome. One step beyond the genome, a transcriptome represents the sum total of RNAs expressed in a cell (or tissue, or organ, or organism) under a given set of conditions. If the genome is a set of instructions for all the proteins and regulatory molecules an organism can produce, the transcriptome indicates which ones it actually produces. And these days, collecting such data is almost trivial.

Generated by using either gene-expression microarrays or “next-generation” DNA sequencing (an application called “RNA-Seq”), transcriptome data can be used to pinpoint genes whose...

Such analyses are, if not exactly simple, then certainly manageable for the researchers who do them routinely. But what about the rest of the scientific community, those with biological questions that gene-expression analysis would address, but without the informatics staff and experimental expertise? It turns out you don’t necessarily need to run your own experiments; there’s data galore out there already. And you don’t need to be a bioinformatics expert, either; in many cases, researchers’ questions can be addressed by simple reanalysis of existing data sets.

“I think you can go pretty far without a bioinformatician,” says Joseph Pickrell, a graduate student in human genetics at the University of Chicago who collects and analyzes RNA-seq data sets, “as long as you have a way to ‘sanity check’ yourself and make sure the genes you think are differentially expressed aren’t due to a confounding factor.”

The Scientist spoke with transcriptome and bioinformatics experts to find out how scientists who lack formal bioinformatics training can make the most of online databases.

Finding expression data sets

Before you can reanalyze a data set, you have to find it. Researchers sometimes post gene-expression data sets on their own websites, but the primary repositories for such data are the Gene Expression Omnibus (GEO) database at the National Center for Biotechnology Information and ArrayExpress at the European Bioinformatics Institute. Papers generally include a GEO or ArrayExpress accession number (e.g., GEO: GSE10083), which you can enter in a search box on the repository's home page.
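If you have a list of accession numbers pulled from papers, turning them into links can be scripted. The snippet below is a minimal sketch, assuming GEO's accession viewer remains at the address shown; the function name is illustrative, not part of any NCBI tool.

```python
import re
from urllib.parse import urlencode

# Address of GEO's accession viewer (assumed stable; check the GEO site).
GEO_ACCESSION_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"

# GEO accessions are a short prefix followed by digits:
# GSE = series, GDS = curated DataSet, GSM = sample, GPL = platform.
_ACCESSION_PATTERN = re.compile(r"^(GSE|GDS|GSM|GPL)\d+$")


def geo_record_url(accession: str) -> str:
    """Return the GEO web address for an accession such as 'GSE10083'."""
    cleaned = accession.strip().upper()
    if not _ACCESSION_PATTERN.match(cleaned):
        raise ValueError(f"not a GEO accession: {accession!r}")
    return f"{GEO_ACCESSION_URL}?{urlencode({'acc': cleaned})}"


print(geo_record_url("GSE10083"))
```

The pattern check is deliberately strict, so a PubMed ID or a mistyped accession fails early instead of producing a dead link.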

You also can go the other way. References in NCBI’s PubMed database that include transcriptome data sets include a “GEO DataSets” link that will take you to the relevant record. For example, for GEO record GSE10083, the corresponding PubMed article is PMID: 20420666, “Aryl hydrocarbon receptor (AHR)-regulated transcriptomic changes in rats sensitive or resistant to major dioxin toxicities.” (BMC Genomics, 11:263-278, 2010).

What’s my favorite gene doing in this data set?

GEO SURFING: To probe the power of the GEO database, enter “lung cancer and smoking” in the main search page (top). The resulting hits include clickable “heat maps,” providing a global view of gene expression (middle; highly expressed genes shown in fuchsia). Alternatively, query the activity of a single gene in the dataset to see how its expression differs between smokers and nonsmokers (bottom; 7 smokers shown at right). (Source: www.ncbi.nlm.nih.gov/geo)

The simplest question a researcher can ask of a gene-expression data set is whether a gene is up- or downregulated. Tanya Barrett, GEO’s group leader of curation, suggests the following exercise:

Suppose you were interested in genes that are affected by lung cancer and smoking. Type “lung cancer and smoking” into the “Query DataSets” search box on the main GEO page, and click Go. At the time of this writing, the search returned nearly 750 hits.

Select the first record, GDS3309, “Cigarette smoking effect on the nasal epithelium,” and click on the record number. The next page provides information on the study, a full citation of the published paper with a link to the PubMed abstract, and the specific array design. As indicated in the “sample count” field, GDS3309 includes 15 samples; click on the “Sample Subsets” button at the top of the page to discover that this number breaks down to eight controls and seven experimental samples. Under “Cluster Analysis” is a searchable and clickable “heat map,” which clusters the genes based on similar expression patterns (with expression levels represented from fuchsia [high] to green [low]).

Select the “Data Analysis Tools” button at the top of the page. Four tools are available. Under “Find Genes,” enter CYP1B1 in the “Find gene name or symbol” box. That search returns data on the four Affymetrix probes corresponding to that gene. (Affymetrix Gene Chip microarrays generally represent each transcript with multiple short oligonucleotides.) At the right of each record is a clickable bar graph, which in these cases shows that expression of CYP1B1, which codes for a cytochrome P450 enzyme, is significantly higher in the nasal epithelia of the seven “smoking” samples than in the eight controls.

Of course, GEO is not a full-fledged array-analysis tool, Barrett explains; its primary mission is as a data repository. Such an analysis is possible with GDS3309 because her team has essentially precomputed the answers on this data set; that is, they have pre-run a set of standard analyses on the data and made the results available. But not all GEO data sets have been subjected to such treatment; in fact, according to Barrett, only about 10% have been processed so far. If your data set is not among those, you’ll need to download and analyze the data yourself.

Several file formats are available; which you choose depends on the analysis tool you’ll be using. A popular choice for commercial array analysis is Agilent Technologies’ GeneSpring software. For those who prefer freeware, Charles Perou, a professor of genetics and pathology at the University of North Carolina at Chapel Hill, who posts his lab’s gene-expression data on the UNC Microarray Database, recommends either one of many Bioconductor packages or SAM, an Excel add-in that can perform “simple supervised analyses,” he says. (A supervised analysis is one that ranks genes based on their correlation with a particular variable, such as metastasis, prognosis, and so on.)
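To make the idea of a supervised analysis concrete, here is a rough sketch that ranks genes by how strongly their expression separates two sample groups. It uses a plain Welch t-statistic, which is a stand-in chosen for brevity and not the algorithm SAM or any Bioconductor package implements. The gene names and expression values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic: difference in means scaled by its standard error."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error


def rank_genes(expression, labels, case="smoker"):
    """Rank genes from most to least different between case and other samples.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of group labels in the same sample order
    """
    scores = {}
    for gene, values in expression.items():
        cases = [v for v, lab in zip(values, labels) if lab == case]
        others = [v for v, lab in zip(values, labels) if lab != case]
        scores[gene] = welch_t(cases, others)
    # Sort by absolute score so strong decreases rank alongside strong increases.
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up log2 expression values, for illustration only.
labels = ["control", "control", "control", "smoker", "smoker", "smoker"]
expression = {
    "GENE_A": [5.1, 4.9, 5.0, 8.0, 8.3, 7.9],  # higher in smokers
    "GENE_B": [6.0, 6.2, 5.9, 6.1, 5.8, 6.1],  # essentially unchanged
    "GENE_C": [9.0, 9.2, 8.8, 7.0, 7.3, 6.9],  # lower in smokers
}

for gene, score in rank_genes(expression, labels):
    print(f"{gene}\t{score:+.2f}")
```

A positive score means higher expression in the "case" group. Real tools add safeguards this sketch omits, such as handling genes with zero variance and correcting for the number of genes tested.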

What other genes have a similar expression pattern to my gene?

Once you see what your gene is doing under a given set of biological conditions, you may want to identify other similarly behaving genes. Such genes, Barrett says, could be co-regulated or part of the same pathway. “It’s a good way of trying to infer the function for a gene,” she says.

With pre-analyzed GEO DataSets, answering this question is relatively straightforward. Returning to our example: in data set GDS3309, enter CYP1B1 in the “Find Genes” search box. The list of results contains four entries, each of which includes a “Profile Neighbors” link, which allows you to find similar expression patterns. Selecting the link for the first CYP1B1 probe in the list retrieves 11 hits, including such genes as TIMP3 and FGF13, all of which are more highly expressed in the smoking samples than in controls.
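For readers curious about what a "neighbors" search involves, the sketch below ranks genes by the Pearson correlation of their expression profiles with a query gene. This illustrates the general idea only; it is an assumption that correlation is a reasonable proxy here, and it is not a description of how GEO computes its Profile Neighbors. The gene names and values are placeholders.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def profile_neighbors(expression, query, top_n=10):
    """Return the top_n genes whose profiles correlate best with the query gene."""
    reference = expression[query]
    scored = [(gene, pearson(reference, values))
              for gene, values in expression.items() if gene != query]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Placeholder profiles across six samples, for illustration only.
expression = {
    "QUERY":  [1, 2, 3, 8, 9, 10],
    "GENE_X": [2, 3, 4, 9, 10, 11],   # rises and falls with the query
    "GENE_Y": [10, 9, 8, 3, 2, 1],    # mirror image of the query
    "GENE_Z": [5, 6, 5, 6, 5, 6],     # largely unrelated
}

for gene, r in profile_neighbors(expression, "QUERY", top_n=2):
    print(f"{gene}\t{r:.2f}")
```

Sorting in the opposite direction would surface anticorrelated genes, which can be just as informative when looking for repressors or opposing pathways.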

To perform the same analysis on samples that GEO hasn’t pre-analyzed, you can download expression data sets and analyze them yourself. Barrett recommends downloading array data in “Series Matrix” format, a tab-delimited text format. Such files can then be read and processed in GeneSpring, SAM, dChip, and so on. Or, check out one of the many online expression analysis tools, including Oncomine (free for academic users), which deals with cancer-related data sets, and NextBio (free basic package), which has broader coverage. In Oncomine, for instance, you can find all stored data sets in which CYP1B1 is among the top 1 percent most differentially expressed genes.
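Because Series Matrix files are plain tab-delimited text, they are also easy to read with a few lines of code. The sketch below assumes the usual layout, in which metadata lines begin with "!" and the expression table sits between "!series_matrix_table_begin" and "!series_matrix_table_end" markers. The sample and probe identifiers in the example are invented.

```python
import csv
import io


def _to_float(text):
    """Convert a table cell to float; blank or non-numeric cells become None."""
    try:
        return float(text)
    except ValueError:
        return None


def read_series_matrix(handle):
    """Parse the expression table from a Series Matrix file.

    Returns (sample_ids, table), where table maps probe ID -> list of values.
    """
    in_table = False
    table_lines = []
    for line in handle:
        marker = line.strip().lower()
        if marker == "!series_matrix_table_begin":
            in_table = True
        elif marker == "!series_matrix_table_end":
            break
        elif in_table:
            table_lines.append(line)
    if not table_lines:
        raise ValueError("no expression table found")

    rows = csv.reader(table_lines, delimiter="\t")
    sample_ids = next(rows)[1:]          # first column is the probe ID header
    table = {row[0]: [_to_float(cell) for cell in row[1:]]
             for row in rows if row}
    return sample_ids, table


# A tiny invented file, for illustration only.
SAMPLE = (
    '!Series_title\t"Hypothetical example series"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM_A"\t"GSM_B"\n'
    '"probe_1"\t7.5\t8.25\n'
    '"probe_2"\t6.0\t\n'
    '!series_matrix_table_end\n'
)

sample_ids, table = read_series_matrix(io.StringIO(SAMPLE))
print(sample_ids)
print(table)
```

Downloaded files are typically gzip-compressed, so in practice you would pass `gzip.open(path, "rt")` instead of the in-memory example used here.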

What about RNA-Seq data sets?

Naturally, all the analyses that can be run on array data can also be performed using RNA-Seq data sets. Because sequencing data sets, with their millions upon millions of short reads, are so much richer than array data—not to mention less biased (arrays can only see what their probes allow them to see)—there’s more a researcher can glean from them than mere transcript abundance. “It’s silly to just use RNA-Seq data as counting data,” says Eric Olivares, a staff scientist at Pacific Biosciences and founder of SEQanswers.com.

One example is alternative splice site usage, the relative expression of different splicing isoforms. Arul Chinnaiyan, in whose lab at the University of Michigan School of Medicine Oncomine was developed, explains that RNA-Seq is considerably harder to analyze than arrays, because the data sets are larger and more complicated, experimental design is so variable, and also because the field is so young and dynamic. As a result, there are few, if any, really user-friendly tools for next-gen RNA sequencing analysis. “That’s certainly a major issue,” he says.

But there are a handful of options, including the sartorially inspired software trio of Bowtie, TopHat, and Cufflinks (www.cbcb.umd.edu/software). “That’s about as user-friendly as it gets at this point,” says Pickrell. Bowtie aligns short reads to a reference genome; TopHat (which actually uses Bowtie) aligns RNA-Seq reads to a mammalian reference genome to identify splice junctions; and Cufflinks “assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples,” according to the software’s website.

These tools are command-line driven—that is, they have no graphical user interface, explains Cole Trapnell, the Harvard postdoc who wrote TopHat and Cufflinks and helped to develop Bowtie. “The target audience is people who are relatively experienced bioinformaticians.” But the tools shouldn’t be too difficult for nonveterans, he adds.

First, download the RNA-Seq FASTQ datafiles (which contain the raw-sequence reads and associated quality metrics); on GEO, look for SRA files, which can be converted to FASTQ format using the SRA Toolkit software, says Barrett. Align the resulting files to a reference genome “index,” a kind of preformatted genome file, using TopHat. Then pass the resulting data files to Cuffdiff (part of the Cufflinks package) along with a set of gene annotations, to produce a data table that can be loaded into Excel or another plotting or visualization tool (Trapnell favors Hadley Wickham’s ggplot2 plotting and visualization library for the R computing environment [had.co.nz/ggplot2]).
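The workflow above can be sketched as a short script that assembles the command lines without running them. Several things here are assumptions: that the reads are single-end, that the SRA Toolkit's `fastq-dump` writes a `.fastq` file named after the input, that TopHat writes its alignments to `accepted_hits.bam` in the output directory (older versions produced a SAM file), and that the `-o` output-directory option is available in your installed versions. All file and directory names are hypothetical.

```python
import os
import shlex


def rnaseq_commands(sra_files, index, annotation_gtf):
    """Build commands for the SRA -> FASTQ -> TopHat -> Cuffdiff workflow.

    sra_files:      dict mapping a condition name -> path to its .sra file
    index:          base name of the preformatted reference genome index
    annotation_gtf: gene annotation file passed to Cuffdiff
    """
    if len(sra_files) < 2:
        raise ValueError("differential expression needs at least two conditions")
    commands = []
    alignments = []
    for condition, sra_path in sra_files.items():
        # fastq-dump is assumed to write <name>.fastq to the working directory.
        fastq = os.path.splitext(os.path.basename(sra_path))[0] + ".fastq"
        out_dir = f"tophat_{condition}"
        commands.append(["fastq-dump", sra_path])
        commands.append(["tophat", "-o", out_dir, index, fastq])
        alignments.append(f"{out_dir}/accepted_hits.bam")
    # Cuffdiff takes the annotation first, then one alignment file per condition.
    commands.append(["cuffdiff", "-o", "cuffdiff_out", annotation_gtf, *alignments])
    return commands


# Hypothetical inputs, for illustration only.
commands = rnaseq_commands(
    {"control": "SRR_control.sra", "smoker": "SRR_smoker.sra"},
    index="genome_index",
    annotation_gtf="genes.gtf",
)
for command in commands:
    print(shlex.join(command))
```

Printing the commands first lets you check them against each tool's manual before committing hours of compute time; once they look right, each list can be handed to `subprocess.run(command, check=True)`.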

Another command-line option is the “Flux Capacitor,” developed by Micha Sammeth in the lab of Roderic Guigó at the Center for Genomic Regulation in Barcelona, Spain. Given a set of aligned sequence reads and an annotated genome, says Guigó, the software “deconvolutes” the reads based on the expression of individual exons, “so I can have an idea of the relative abundance of each splice form.”

For those who prefer not to deal with the command line, the Galaxy project at Pennsylvania State University provides a graphical interface for TopHat and Cufflinks. Or, for real point-and-click functionality, the cloud-based service DNAnexus can perform alternative splicing and other RNA-Seq analyses for a fee.

Transcribed from the experts

Much as it would be nice to believe you can just download a data set and start studying it, the truth is it’s not that simple. “These are still complicated data sets and there is a learning curve,” says Charles Perou, a professor of genetics and pathology at the University of North Carolina at Chapel Hill. “It’s like repairing your car. You could go to the parts store, buy a part, and fix your car; that doesn’t mean you’re going to fix it in the optimal way. You’re still going to need some expert advice and input.”

As a result, don’t biocompute alone, he advises. Collaborate with a bioinformatician or statistician, or at least have them double-check your work and logic. “And it’s an iterative process,” he adds: don’t just run the analysis once and forget it; there are multiple algorithms and variables, so see how tinkering with the parameters changes results.

Kai Wang, a bioinformatician at the Harvard School of Public Health, offers another point: be careful when combining different data sets of the same tissues (such as ovarian cancer) to perform large meta-analyses. There’s simply too much variability: different array designs, analysis algorithms, sample preparation methods, and so on. “Any array data has lots of noise,” he says. “But if you compare arrays done by two different people, labs, or platforms, you increase the risk of noise,” an observation that’s even more true of sequencing data sets.

Be wary also of odd-man-out findings, says Arul Chinnaiyan of the University of Michigan School of Medicine, such as those found in only one ovarian cancer data set, say, but not in others. “If you interrogate the entire [Oncomine] database, you have so much data that it could just be a false discovery,” he says.
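The arithmetic behind that warning is simple. If you test, say, 20,000 genes (a hypothetical round number) at a p-value cutoff of 0.05, about 1,000 would be expected to pass by chance alone even if nothing were truly different. One common safeguard is to adjust p-values for the number of tests. The sketch below implements the Benjamini-Hochberg adjustment as one illustration; it is not a description of what Oncomine does, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the raw p-values get smaller.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values for four genes, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3f}   adjusted = {q:.3f}")
```

An adjusted value estimates the false discovery rate you would accept by calling that gene, and every gene ranked above it, significant.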

Finally, says SEQanswers’ Eric Olivares, test your processes by first seeing if you can replicate the original paper’s findings before branching out into your own work. “It’s a question of believability: are you in the ballpark of making biologically relevant results?”
