Genomic analysis by a biologist: Analyzing lots of data with Relational Databases and SQL

The first time I was tasked to analyze large amounts of data, I thought there had to be a better way than to stumble through numerous spreadsheets. For example, I was to analyze 24 total treatments using Illumina's HT12 BeadArrays that have >45,000 probes per array (8 cell line/treatment/time points)! During my bioinformatics fellowship at the University of California Irvine I learned of relational databases, which I thought would perfectly address this issue. After checking out some books from my local library on relational database and the way one queries them (Structured Query Language, SQL), I quickly realized how powerful relational databases could be, allowing for ready acquisition of results via comparisons, contrasts, associations, limitations, etc. (note, one needs to "relate" datasets through shared "keys"; e.g., official gene symbol). The book I especially like is "Learning SQL" by Alan Beaulieu.

Using a created relational database, I was able to analyze, hypothesize, and reach conclusions with my coauthors resulting in this paper (as well as many others):

Lin, J. Z., Sieglaff, D. H., Yuan, C., Su, J., Arumanayagam, A.S., Firouzbakht, S., Cantu Pompa, J.J., Reynolds, F.D., Zhou, X., Cvoro, A., and P. Webb. 2013. Gene specific actions of thyroid hormone receptor subtypes. PLoS One 8(1):e52407

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0052407

How was this accomplished?

(1) Raw data was obtained from GenomeStudio (Illumina's array software), raw data was background-subtracted, normalized, and analyzed using lumi and limma within R.
// more on microarray analysis in another post //

(2) Limma results were uploaded into a SQLite database using the Firefox add-on SQLite Manager (a Firefox web browser add-on that provides a GUI interface for the SQLite program).

Note, SQLite is an awesome open-source "standalone" relational database engine that is platform independent, and easily transferable as a single file (e.g., transfer it via email, dropbox, thumb-drive, etc.). SQLite databases are also readily manipulated at the command prompt, and in some instances (e.g., "heavy" queries), the command line option will work whereas manipulation within the SQLite Manager GUI will not (I have had SQLite Manager freeze in some instances).

For example, using this SQL (formatted using http://hilite.me/):

SELECT trb.Probe_ID "ILMN Probe_ID", gene.GeneName, trb.FC_T3TRB_vs_CtrlTRB "FC_TRB+T3", trb.FC_CHXTRB_vs_CtrlTRB "FC_TRB+CHX", trb.FC_CHXT3TRB_vs_CtrlTRB "FC_TRB+CHX+T3", ego.GO_Term, ego.GO_Termname, ego.GO_EvidenceCode 
FROM LIMMA_HepG2TRB_CHX trb INNER JOIN Ensembl_ilmn_hWGv3 ilmn
ON trb.Probe_ID = ilmn.Probe_ID AND (ROUND(FC_T3TRB_vs_CtrlTRB, 1) >= 2 OR ROUND(FC_T3TRB_vs_CtrlTRB, 2) <= 0.5) AND
(ROUND(FC_DiffCHXT3, 1) >= 2 OR ROUND(FC_DiffCHXT3, 2) <= 0.5) AND ROUND(BHAdj_Pvalue_T3TRB_vs_CtrlTRB, 2) <= 0.05 AND
ROUND(BHAdj_Pvalue_DiffCHXT3, 2) <= 0.05
INNER JOIN Ensembl_GeneName gene
ON ilmn.EnsemblGeneID = gene.EnsemblGeneID
INNER JOIN Ensembl_GO ego
ON ilmn.EnsemblGeneID = ego.EnsemblGeneID AND ego.GO_Termname LIKE '%transcription%' AND ego.GO_EvidenceCode = "IDA";

I received this result (click on image to enlarge)::

In this example I am linking 4 separate tables within the database:

1. "LIMMA_HepG2TRB_CHX"
2. "Ensembl_ilmn_hWGv3"
3. "Ensembl_GeneName"
4. "Ensembl_GO"

based upon the shared "keys" as follows:

A. "LIMMA_HepG2TRB_CHX" and "Ensembl_ilmn_hWGv3" ON 'Probe_ID'
B. "Ensembl_ilmn_hWGv3" and "Ensembl_GeneName" ON 'EnsemblGeneID'
C. "Ensembl_ilmn_hWGv3" and "Ensembl_GeneName" ON 'EnsemblGeneID'
D. "Ensembl_GeneName" and "Ensembl_GO" ON 'EnsemblGeneID'

Further, the data was filtered from the first linked table ("LIMMA_HepG2TRB_CHX") for (1) only significant "direct gene targets" (i.e., a retained 2-fold (or greater) Thyroid hormone (T3) response in the presence of cycloheximide (CHX, a protein translation inhibitor) and a Benjamini-Hochberg 5% FDR (or less) AND from the fourth linked table ("Ensembl_GO") for (2) Gene Ontology with 'transcription' within it's term AND (3) 'Inferred from Direct Assay (IDA)' (see GO evidence codes).

You'll note that genes (probes) are repeated in the result (e.g., highlighted ELF3 rows), which is caused by the presence of 'transcription' within numerous Gene Ontologies. Note, the hierarchal nature of Gene Ontologies will also produce such an effect.

The nice feature of these databases is you can add and subtract at will. You are only limited by your imagination. If someone else publishes a related study, or you happen upon a gene list (or other shared trait) of particular interest, just add it to your database and determine how these genes were effected within your experiment. It's that easy, and powerful!

Genomic analysis by a biologist

Thursday, January 21, 2016

Analyzing lots of data with Relational Databases and SQL

No comments:

Post a Comment