RNA-Seq

considerations and analysis walk-thru

To begin, I would like to reference RNA-seqlopedia, a website that goes into great detail about RNA-seq experimentation and analysis. If you are new to RNA-seq, I would strongly recommend visiting this website before you begin. Of course, there are many other discussions (recommendations) rolling out continually (e.g., "A survey of best practices for RNA-seq data analysis"), which will certainly evolve as additional RNA-seq experiments are done, analyzed, and reported.

An amazingly underappreciated consideration that has come up in discussions I have had with other researchers is the type of RNA species being analyzed. Such information is imperative, for it will determine the library preparation kit needed (e.g., see the numerous Illumina RNA-seq library kits) and, of course, the eventual analysis procedure.

On a related note, as I was conducting RNA-seq analysis for a fellow researcher using the defined-transcriptome analysis software eXpress, I received this warning:

WARNING: The observed alignments appear disporportionately in the reverse-forward order (2231075 vs. 154840). If your library is strand-specific, you should use the --rf-stranded option to avoid incorrect results.

The warning states that my data displayed 93.5% "strandedness" (= 2231075/(2231075 + 154840)), and that I should define the stranded nature within the analysis parameters for accurate quantitation. Unknown to the researcher was the fact that the company that performed the sequencing had used a stranded kit while preparing the mRNA (poly-A) libraries!

It's not my purpose to simply repeat what others have published in refereed articles, but to provide a step-by-step description of how I have conducted RNA-seq analysis. At the end of this post I have listed useful references.

Aligning and counting against a specific transcriptome

Much of what I describe below follows the default procedures described in the eXpress "Getting Started" page.

Step 1: Preparation of the transcriptome index.
There are many resources from which you can download defined transcriptome data (e.g., a single file harboring fasta sequences for all predicted gene transcripts). Some of the sites I have used in the past include Ensembl's Biomart, the UCSC Table Browser, and the UCSC goldenPath repository.

In the example below I used the UCSC goldenPath to obtain the hg19 transcriptome, noting the download date as the site states "sequence data is updated once a week via automatic GenBank updates". Specifically, I downloaded refMrna (i.e., refMrna.fa.gz) from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/ on 4-1-2015.
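As a sketch, the download and decompression can also be done from the command line (assuming wget is available; the rename to hg19_refMrna.fa is my assumption, simply to match the file name used in the indexing command below):

# Retrieve the refMrna transcriptome fasta from the UCSC goldenPath repository
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/refMrna.fa.gz
# Decompress, then rename to match the index name used in the bowtie-build command below
gunzip refMrna.fa.gz
mv refMrna.fa hg19_refMrna.fa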

I created the index from the directory that harbored the single transcriptome fasta file (hg19_refMrna.fa), as outlined in "Getting Started" on the eXpress site.


bowtie-build --offrate 1 hg19_refMrna.fa hg19_refMrna

Note, the "1" at the beginning of the command is not part of the actual command, but is included within the HTML generated by the "Source code beautifier". The number is present when commands are retained on a single line (not wrapped). Such line numbering is useful to reference the location of specific 'chunks' of code (e.g., see my Python page).

Step 2: Download RNA-seq data from GEO DataSets.
I am using the GEO Data Set GSE51005 linked to Rajan et al. 2014 ("Identification of a candidate prognostic gene signature by transcriptome analysis of matched pre- and post-treatment prostatic biopsies from patients with advanced prostate cancer", PMID: 25519703). 

I used prefetch from the SRA Toolkit to download the sra files:


SRR993713 = tax-Pre-R1 (Pre treatment sample from patient 1)
SRR993714 = tax-Pre-R2 (Pre treatment sample from patient 2)
SRR993715 = tax-Pre-R3 (Pre treatment sample from patient 3)
SRR993716 = tax-Pre-R4 (Pre treatment sample from patient 4)
SRR993717 = tax-Pre-R5 (Pre treatment sample from patient 5)
SRR993718 = tax-Pre-R6 (Pre treatment sample from patient 6)
SRR993719 = tax-Post-R1 (Post treatment sample from patient 1)
SRR993720 = tax-Post-R2 (Post treatment sample from patient 2)
SRR993721 = tax-Post-R3 (Post treatment sample from patient 3)
SRR993722 = tax-Post-R4 (Post treatment sample from patient 4)
SRR993723 = tax-Post-R5 (Post treatment sample from patient 5)
SRR993724 = tax-Post-R6 (Post treatment sample from patient 6)

Once downloaded, the sra files were 'unpacked'. Since I'm analyzing paired-end sequence data, I included "--split-files" within the fastq-dump command of the SRA Toolkit, producing two separate files for read 1 (filename_1.fastq) and read 2 (filename_2.fastq).
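As a rough sketch, the download and 'unpacking' steps can be wrapped in a simple loop (this assumes prefetch and fastq-dump from the SRA Toolkit are on your PATH, and that prefetch deposits the .sra files where fastq-dump can find them):

# Loop over the 12 accessions; bash brace expansion generates SRR993713 through SRR993724
for ACC in SRR9937{13..24}
do
    prefetch $ACC                    # download the .sra file
    fastq-dump --split-files $ACC    # write ${ACC}_1.fastq and ${ACC}_2.fastq
done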

Step 3: Align fastq files to the bowtie-indexed transcriptome.
I followed the eXpress tutorial specifications for aligning to the transcriptome.

For example, to analyze the tax-Pre-R1 (SRR993713) sample, the command is:



bowtie -p 8 -aS -X 800 --offrate 1 /path/to/bowtie_index/hg19_refMrna -1 /path/to/SRR993713_1.fastq -2 /path/to/SRR993713_2.fastq /path/to/bowtie/SRR993713.sam

I simply repeat the above commands for each additional sample.
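For instance, a loop along these lines (paths are placeholders) runs the same bowtie command for each accession:

# Align each sample against the hg19_refMrna bowtie index, writing one SAM file per sample
for ACC in SRR9937{13..24}
do
    bowtie -p 8 -aS -X 800 --offrate 1 /path/to/bowtie_index/hg19_refMrna \
        -1 /path/to/${ACC}_1.fastq -2 /path/to/${ACC}_2.fastq \
        /path/to/bowtie/${ACC}.sam
done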

Step 4: Determine alignment level to assess "quality" of data.
Because I would like to use bamtools 'stats' to derive alignment summary statistics, I first convert my sam outputs to bam using samtools, with a simple command like:


samtools view -bS SRR993713.sam > SRR993713.bam

Here's an example of the command and 'stats' summary from bamtools:

bamtools stats -in SRR993713.bam

**********************************************
Stats for BAM file(s): 
**********************************************

Total reads:       83368008

Mapped reads:      60750352 (72.8701%)
Forward strand:    52992832 (63.5649%)
Reverse strand:    30375176 (36.4351%)
Failed QC:         0 (0%)
Duplicates:        0 (0%)
Paired-end reads:  83368008 (100%)
'Proper-pairs':    60750352 (72.8701%)
Both pairs mapped: 60750352 (72.8701%)
Read 1:            41684004
Read 2:            41684004

Singletons:        0 (0%)
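
To generate these summaries for all 12 samples, the conversion and stats commands can be wrapped in a loop along the following lines (a sketch; file locations are placeholders):

# Convert each SAM to BAM and save the bamtools 'stats' summary to a per-sample text file
for ACC in SRR9937{13..24}
do
    samtools view -bS ${ACC}.sam > ${ACC}.bam
    bamtools stats -in ${ACC}.bam > ${ACC}_stats.txt
done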

The table below displays select 'stats' for all 12 samples. 

Sample       Treatment    Total Reads   Forward Reads   % of Total   Reverse Reads   % of Total   "Proper-pairs"   % of Total
SRR993713    tax-Pre-R1   83,368,008    52,992,832      63.6%        30,375,176      36.4%        60,750,352       72.9%
SRR993714    tax-Pre-R2   92,867,382    56,198,835      60.5%        36,668,547      39.5%        73,337,094       79.0%
SRR993715    tax-Pre-R3   81,215,374    55,126,265      67.9%        26,089,109      32.1%        52,178,218       64.2%
SRR993716    tax-Pre-R4   75,367,502    53,597,091      71.1%        21,770,411      28.9%        43,540,822       57.8%
SRR993717    tax-Pre-R5   83,472,200    53,294,021      63.8%        30,178,179      36.2%        60,356,358       72.3%
SRR993718    tax-Pre-R6   84,656,488    53,892,263      63.7%        30,764,225      36.3%        61,528,450       72.7%
SRR993719    tax-Post-R1  86,477,582    55,714,604      64.4%        30,762,978      35.6%        61,525,956       71.1%
SRR993720    tax-Post-R2  84,390,810    55,074,961      65.3%        29,315,849      34.7%        58,631,698       69.5%
SRR993721    tax-Post-R3  71,676,108    55,134,602      76.9%        16,541,506      23.1%        33,083,012       46.2%
SRR993722    tax-Post-R4  86,768,734    55,464,261      63.9%        31,304,473      36.1%        62,608,946       72.2%
SRR993723    tax-Post-R5  85,550,272    55,984,158      65.4%        29,566,114      34.6%        59,132,228       69.1%
SRR993724    tax-Post-R6  86,818,842    56,755,408      65.4%        30,063,434      34.6%        60,126,868       69.3%

Two samples, SRR993716 and SRR993721, had low alignment levels and thus warrant further investigation. As the point of this blog page is to provide an example of the RNA-Seq analysis procedure, I will analyze all twelve samples, ignoring the low alignment levels.

Step 5: Derive count data, etc. from eXpress.
Following the default recommendations outlined within the eXpress tutorial, to analyze the SRR993713 (tax-Pre-R1) sam file, the command is:


express -o /path/to/eXpress/SRR993713/ /path/to/bowtie_index/hg19_refMrna.fa /path/to/bowtie/SRR993713.sam

Again, I simply repeat this step for all other samples.


Important note! I created separate directories for each sample prior to invoking eXpress, into which the eXpress output files are deposited (params.xprs and results.xprs). This is necessary because eXpress does not appear to prefix its output files with a sample name (at least, this has been my experience).
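A sketch of how this can be scripted, creating the per-sample directory and then invoking eXpress for each sample (paths are placeholders):

# Create a separate output directory per sample so params.xprs/results.xprs are not overwritten
for ACC in SRR9937{13..24}
do
    mkdir -p /path/to/eXpress/${ACC}
    express -o /path/to/eXpress/${ACC}/ /path/to/bowtie_index/hg19_refMrna.fa /path/to/bowtie/${ACC}.sam
done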

Step 6: Determine significantly changed transcripts from the eXpress results.
This step has been given much consideration, leading to numerous reviews (see below), guidelines, and interesting blog titles. The mathematics is best left up to bioinformaticians, after which we biologists can validate predictions experimentally. For example, to derive an empirically determined false discovery rate (FDR), one can conduct qRT-PCR analysis against 10+ randomly chosen gene transcripts predicted (determined) to be significantly differentially regulated by the RNA-Seq analysis software. Note that I emphasize randomly, for this is the legitimate means of selecting candidates to determine an empirical FDR. That is, picking gene transcripts of interest, those displaying the most extreme differences in transcript abundance between treatment and control, and/or those with the lowest p-values is not random selection!
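As a minimal illustration of what "random" means here, the draw can be done in R (sig_ids is a hypothetical character vector holding the identifiers of all transcripts called significant; with the DESeq2 results generated below, it could be rownames(res.sig)):

# sig_ids (hypothetical) holds the identifiers of all significantly altered transcripts
set.seed(42)                       # make the random draw reproducible
candidates <- sample(sig_ids, 10)  # randomly select 10 transcripts for qRT-PCR validation
candidates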

A couple of obvious "normalization" considerations are inherent differences among samples, such as read depth (raw sequencing depth, differing quality/filtering), and the fact that longer transcripts have a greater probability of being sampled within the dataset. GC content is another commonly considered attribute (e.g., GC content can impact the efficacy of cDNA synthesis and thus influence inclusion within the derived sequencing library). It's always good practice to read the articles (or website) related to the tool you plan to use so that you are aware of the biases and normalization procedures included (or excluded) within the software.

In the current example I used the popular RNA-seq differential expression tool DESeq2 (PMID: 25516281), which is implemented within R (https://www.r-project.org/).

As stated in the "Getting Started" section of the eXpress website, "if you use a tool such as DESeq2 or EdgeR, use the effective counts" (note, the eff_counts column is located within the 'results.xprs' output). Moreover, you will need to use non-decimal count values, so I rounded the eff_counts values within Excel (i.e., when I used eff_count values with decimals, R gave me the error message "some values in assay are not integers").

I also filtered for gene transcripts with ≥1 count in the Pre and/or Post eff_counts for all patients. That is, I do not want to exclude instances where a transcript may be 'absent' in one condition but 'present' in the other (e.g., 0 rounded eff_counts in the Pre sample, and 14 rounded eff_counts in the Post sample). This criterion left 31,615 of the 49,836 gene transcripts considered in the analysis (i.e., 63%).

In many instances, researchers limit their analysis to gene transcripts with ≥1 count in all samples. I think this may leave out some interesting findings, eliminating gene transcripts that are rarely sampled ('absent') in one condition but 'present' in the other. Indeed, in the current example, if I had limited my analysis exclusively to gene transcripts with ≥1 count in all samples, many significantly altered gene transcripts implicated in relevant biological functions would not have been found. Importantly, such 'absent'/'present' gene transcripts can be validated with qRT-PCR or another technique.
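If you would rather do the rounding and filtering in R instead of Excel, a minimal sketch is shown below. It assumes the eff_counts columns from the 12 results.xprs files have already been merged into a single tab-delimited table, with the six Pre samples in columns 1-6 and the six Post samples in columns 7-12 (the input file name is hypothetical), and it implements the criterion as "≥1 rounded count in all six Pre samples and/or in all six Post samples":

# Hypothetical merged table: rows = gene transcripts, columns = 12 samples (Pre 1-6, then Post 1-6)
eff <- read.delim("eXpress_effCounts_merged.txt", header=TRUE, row.names=1)

# Round the effective counts to integers, as DESeq2 requires whole-number counts
eff <- round(eff)

# Keep transcripts with >=1 count in all Pre samples and/or >=1 count in all Post samples
keep <- apply(eff[, 1:6] >= 1, 1, all) | apply(eff[, 7:12] >= 1, 1, all)
eff <- eff[keep, ]

# Write out the filtered table used as the DESeq2 input below
write.table(eff, file="eXpress_roundedEffCounts_atLeastOnePatientCountNonZero.txt", sep="\t", quote=FALSE, col.names=NA)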

DESeq2 analysis with R   


# The "#" symbol are used to comment out that which follows it, providing a means for describing the commands below.
# Note, these commands were derived from examples provided within the edX course "KIx: KIexploRx Explore Statistics with R" and the DESeq2 manual.

# Once you open R, specify the directory in which your rounded eXpress eff_count data resides (an option within the "Misc" drop-down menu on Mac), and load the DESeq2 package as specified below.
library("DESeq2")

# Load the rounded eXpress eff_counts data, assigning it to the "counts" object; specify that the input has a header and that the gene transcript identifiers in column 1 are used as row names.
counts = read.delim("eXpress_roundedEffCounts_atLeastOnePatientCountNonZero.txt", header=TRUE, row.names=1)

# Assign experimental information (exp.info) through a data.frame that defines 2 factors, "patient" and "condition": the vector of 6 patient IDs repeats (rep) 2 times (once for "Pre", once for "Post"), and each condition level repeats (rep) 6 times (6 patients). Note, the "c" function means "combine", which you can learn about in basic R tutorials.
exp.info <-data.frame(patient=rep(c("01","02","03", "04", "05", "06"),2), condition=c(rep("Pre",6),rep("Post",6)))

# Assigning the colnames from the counts object to rownames of exp.info (make sure the samples are in the same order in both tables).
rownames(exp.info) <- colnames(counts)

# Construct a DESeqDataSet (dds) object by referencing counts, exp.info, and a ~patient+condition design (i.e., a model that reports the condition effect while controlling for the patient effect).
dds <- DESeqDataSetFromMatrix(countData=counts, colData=exp.info,design=~patient+condition)

# Begin the differential expression analysis. You will note various processes occurring once this is invoked (e.g., "estimating size factors").
dds <-DESeq(dds)

# Generate a results table through calling the results function on dds
res <- results(dds)

# Pull out the results for the gene transcripts that received an adjusted p-value (padj) <= 0.05, and assign it to an object called res.sig
res.sig <- res[which(res$padj<=0.05),]

# Order the res.sig from lowest to highest accepted padj
res.sig <- res.sig[order(res.sig$padj),]

# Write a tab-delimited output file called "significant_diff_expr_eXpress_roundedEffCounts.txt" from the res.sig object, and include gene transcript names.
write.table(res.sig,file="significant_diff_expr_eXpress_roundedEffCounts.txt", sep="\t",quote=F,row.names=T)

# To derive a table with all the results, regardless of significance, along with normalized count data, first, create resdata by merging results (res) and the normalized counts from the dds object, merged by shared gene transcript id (row.names).
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds,normalized=T)), by='row.names',sort=F)

# Next, assign the gene transcript 'id' to the names of the resdata object
names(resdata)[1] <- 'id'

# Finally, write out a comma separated value table called "eXpress_DESeq2_results-with-normalized.csv" from the resdata object
write.csv(resdata, file="eXpress_DESeq2_results-with-normalized.csv")

Of the ~31k gene transcripts, 340 reached the 5% FDR. 

We can also use R to present the results as a volcano plot.

I upload a tab-delimited text file harboring the results for all 31,615 gene transcripts:


Gene log2FC pvalue padj
PRC1 -2.38415641 7.91E-15 2.00E-10
UBE2C -3.245693248 5.12E-12 6.47E-08
C6orf48 -4.239368669 1.18E-11 9.95E-08
TPM1 1.553878476 8.10E-10 0.00000512
BEND3 -1.614446753 9.23E-09 0.0000467
.
.
.
TXNL4A -0.942489148 0.158779242 NA
POU5F1P5 -0.0998629 0.87417003 NA
LOC105377348 -0.076570288 0.895288831 NA
LOC105377348 -1.0867126 0.084847087 NA
LOC105447645 -0.096241599 0.876440652 NA
LOC105372795 -0.494178696 0.446793868 NA

Creating a simple Volcano Plot with R


# Read in the results table and inspect the first rows
res <- read.table("eXpress_DESeq2_for_volcano_plot.txt", header=TRUE)
head(res)
# Plot -log10(p-value) versus log2 fold change for all transcripts
with(res, plot(log2FC, -log10(pvalue), pch=20, main="eXpress DESeq2 Results", xlim=c(-5,5)))
axis(1, at = seq(-5, 5, by = 1))
# Highlight significant transcripts (padj <= 0.05) that are down (green) or up (red) at least 2-fold
with(subset(res, padj<=0.05 & log2FC<=-1), points(log2FC, -log10(pvalue), pch=20, col="green"))
with(subset(res, padj<=0.05 & log2FC>=1), points(log2FC, -log10(pvalue), pch=20, col="red"))
# Label the most extreme transcripts (padj < 0.01 and |log2FC| > 2.32, i.e., ~5-fold) using the calibrate package
library(calibrate)
with(subset(res, padj<.01 & abs(log2FC)>2.32), textxy(log2FC, -log10(pvalue), labs=Gene, cex=.5))


We can compare our eXpress/DESeq2 results to those reported by Rajan et al. (2014) using an xy plot within R.

Creating a simple xy plot with R



# Load the lattice package for the xyplot function
library(lattice)
# Read in a table pairing my fold changes with those reported by Rajan et al.
comparison <- read.table("All_mine_linked_to_Rajan_et_al_singificant_Results.txt", header = TRUE)
# Plot log2(Mine) against log2(Rajan_etal), showing both points ("p") and a regression line ("r")
xyplot(log(comparison$Mine,2)~log(comparison$Rajan_etal,2), xlim=c(-5,5), ylim=c(-5,5), xlab="log2FC_Rajan_etal", ylab="log2FC_Mine", main="Comparison between My and Rajan et al. log2FC", type=c("p", "r"), col="black")


As can be seen in the xy plot above, there's much agreement.

Importantly, when the list of significantly affected gene transcripts is analyzed with the web tool Genecodis (which determines statistically enriched functions from an input gene list), many of the significantly enriched functions are associated with docetaxel's mode of action (see "Mode of action of docetaxel – a basis for combination with novel anticancer agents", PMID: 12972359), an observation also made by Rajan et al. (2014).


Item         Description                      NG   TNG   NGR   TNGR    Hyp*         Genes
Kegg:04210   Apoptosis                        4    300   86    34208   0.044201     AIFM1,BID,BAX,PRKACB
GO:0007010   cytoskeleton organization (BP)   7    300   106   34208   0.00479571   PAK4,KRT8,CAPZB,DES,TPM1,ARHGAP10,KRT4
GO:0000278   mitotic cell cycle (BP)          9    300   304   34208   0.0455652    CDC20,CENPM,TUBG2,E2F3,CCP110,NEK2,DHFR,PSMB5,UBE2C

NG = number of annotated genes in the input list
TNG = total number of genes in the input list
NGR = number of annotated genes in the reference list
TNGR = total number of genes in the reference list (here, the Ensembl reference list was used)
Hyp* = corrected hypergeometric p-value

It's important to note that many gene products assigned to these functions would have been left out of the analysis, and the results, had I utilized the ≥1 count in all samples criterion (see SQLite database screengrab below).



KEGG:04210, Apoptosis

Aligning and counting against a genome

A very popular tool for aligning RNA-Seq reads is Tophat (see manual). Here, I will use Tophat to align the reads to the hg19 genome, followed by deriving counts using the HTSeq 'count' tool against an hg19 gtf (Gene Transfer Format) file (a text file that harbors gene information, such as identifiers and genomic coordinates).

Step 1: Align, process, and retrieve hit files via Tophat

Tophat uses bowtie to align fastq reads to a genome index (note hyperlinks to 'Pre-built Indexes' for many model organisms on the bowtie homepage).



tophat -p 8 -G /path/to/bowtie_index/hg19.gtf -o /path/to/tophat/SRR993713/ /path/to/bowtie_index/hg19 /path/to/SRR993713_1.fastq /path/to/SRR993713_2.fastq

I have specified an output folder with the sample identifier ("/path/to/tophat/SRR993713/") because Tophat outputs numerous files for each sample using fixed, non-sample-specific file names (e.g., "accepted_hits.bam"); thus, to keep each sample's results separate, one needs to create individual output folders. Tophat produces various output files (e.g., "accepted_hits.bam", "junctions.bed", "insertions.bed", etc.). The only file we will use for HTSeq-count is the "accepted_hits.bam" file.

I repeated the above command for each additional sample. 

Step 2: Sort the accepted_hits.bam output
As specified on the HTSeq-count website, you need to sort the bam file prior to using HTSeq-count. I use samtools to accomplish this:



samtools sort /path/to/tophat/SRR993713/accepted_hits.bam /path/to/tophat/SRR993713/sorted_SRR993713_accepted_hits

This command will produce a single sorted bam file called "sorted_SRR993713_accepted_hits.bam". 

As always, I simply repeat this for each additional sample.

Step 3: Derive counts
Please read through the HTSeq-count website description of the 3 different 'modes' the software uses to decide how a read overlapping a gene region is handled. Here, I use the union mode, which essentially equates to only counting reads that can be unambiguously assigned to a single gene.


For this analysis, you will need HTSeq and Python 2.7 installed, your sorted_accepted_hits.bam, and a gtf file. There are various locations from which one can download a gtf file (e.g., from Gencode).



htseq-count -f bam -r pos --stranded=no -m union /path/to/tophat/SRR993713/sorted_SRR993713_accepted_hits.bam /path/to/bowtie_index/hg19.gtf > htseq_count_sorted_SRR993713_accepted_hits.txt
-f <format> --> bam
-r <order> --> for paired-end data, the alignments have to be sorted either by read name or by alignment position (here, by position)
--stranded=<yes/no/reverse> --> not stranded
-m <mode> --> mode to handle reads overlapping more than one feature; here = union (no ambiguous assignments, e.g., when a read overlaps 2 or more genes)
Note the ">" character just prior to the output file name. This simply redirects the results (standard output) into that file.


HTSeq-count will produce a single column text file with a count for each feature, followed by a simple summary of alignment results. For example:
A0A183 0
A0A4Z1 5
A0A539 0
A0A542 0
A0A543 0
.
.
.
uc031tkv.1 27
uc031tkw.1 0
uc031tkx.1 0
uc031tkz.1 5
__no_feature 6020067
__ambiguous 9947408
__too_low_aQual 0
__not_aligned 0

__alignment_not_unique 3658700

I repeat these commands for each additional sample.
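For completeness, here is a sketch that wraps the sort and count steps into one loop over all samples (using the same older-style samtools sort syntax shown above; paths are placeholders):

# Sort each sample's accepted_hits.bam by position, then count reads per feature with htseq-count
for ACC in SRR9937{13..24}
do
    samtools sort /path/to/tophat/${ACC}/accepted_hits.bam /path/to/tophat/${ACC}/sorted_${ACC}_accepted_hits
    htseq-count -f bam -r pos --stranded=no -m union \
        /path/to/tophat/${ACC}/sorted_${ACC}_accepted_hits.bam \
        /path/to/bowtie_index/hg19.gtf > htseq_count_sorted_${ACC}_accepted_hits.txt
done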

Step 4: Determine significantly changed transcripts using DESeq2
As I did for the eXpress eff_count data, I filtered the data so as to only include gene transcripts that have ≥1 count in the Pre and/or Post for all patients. This led to 14,744 accepted gene transcripts out of 65,723.



library("DESeq2")
counts = read.delim("tophat_HTSeq-counts_atLeastOnePatientCountNonZero.txt", header=TRUE, row.names=1)
exp.info <-data.frame(patient=rep(c("01","02","03","04","05","06"),2), condition=c(rep("Pre",6),rep("Post",6)))
rownames(exp.info) <- colnames(counts)
dds <- DESeqDataSetFromMatrix(countData=counts, colData=exp.info,design=~patient+condition)
dds <-DESeq(dds)
res <- results(dds)
res.sig <- res[which(res$padj<0.05),]
res.sig <- res.sig[order(res.sig$padj),]
write.table(res.sig,file="significant_diff_expr_with_names_tophat_HTSeq-counts.txt", sep="\t",quote=F,row.names=T)
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds,normalized=T)), by='row.names',sort=F)
names(resdata)[1] <- 'id'
head(resdata)
write.csv(resdata, file="tophat_HTSeq-counts_DESeq2_results-with-normalized.csv")

DESeq2 analysis resulted in 73 significantly altered gene transcripts (5% FDR). 


The significantly affected gene transcripts showed great agreement with the eXpress results (figure produced with MeV).

Note, different annotations likely limited the ability to map all 73 gene transcripts determined to be significantly altered in the Tophat/HTSeq analysis. I used the Gene ID Conversion tool of DAVID to convert the gtf identifiers to official gene symbols. These official gene symbols were then linked to the RefSeq identifiers used in the eXpress analysis.


I plan to create a post related to what I call "the necessary evil", R, in the future. For now, there are numerous free tutorials (e.g., swirl, DataCamp, and online courses on edX). I call R "the necessary evil" because the vast majority of bioinformatics programs (packages) are implemented within R, which, in my opinion, is not user-friendly software. The creators of many R packages write 'Vignettes' to provide examples of their use, although the semantics of implementing their tools with your own data is often confusing and not readily worked out without a concerted effort.
