Genomic analysis by a biologist: Relational Databases and SQL

The first time I was asked to analyze large amounts of genomic data, I thought there had to be a better way than stumbling through numerous spreadsheets to coalesce significant findings. The answer was to find a way to "relate" numerous datasets (tables, spreadsheets), in addition to filtering for specific criteria (e.g., significantly affected gene transcripts, genes with shared ontologies, etc.). I learned that a Relational Database would readily serve this function, and along with reading some books on how to 'join' and filter such tabular data across the relational database, Structured Query Language (SQL), I quickly realized how powerful a relational databases could be. The book I especially like is "Learning SQL" by Alan Beaulieu.

My opportunity to use this powerful tool came when I was asked to analyze a dataset with 24 different conditions (6 cell lines x 4 treatments) for changes in gene transcript abundance as interpreted from Illumina (ILMN) HT12 BeadArrays (>45,000 probes/ array)! I first needed to limit the results to gene transcripts that displayed significant changes following treatment, in addition to deriving some biological relationships (e.g., shared gene ontologies/ functions).

Using the relational database approach I was able process, analyze, hypothesize, and reach conclusions with my co-authors as outlined in this article:

Lin, J. Z., Sieglaff, D. H., Yuan, C., Su, J., Arumanayagam, A.S., Firouzbakht, S., Cantu Pompa, J.J., Reynolds, F.D., Zhou, X., Cvoro, A., and P. Webb. 2013. Gene specific actions of thyroid hormone receptor subtypes. PLoS One 8(1):e52407

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0052407

The steps for such analysis"

(1) Raw array data from GenomeStudio (Illumina's array software) was background-subtracted, normalized, and analyzed using lumi and limma within R. A subset of relevant output that was uploaded to the relational database includes:

• Array probe-ids

• Change in transcript abundance relative to a control (e.g., non-treated cell line)

• Statistical assessment (e.g., p-value, FDR)

(2) The limma results were uploaded into a SQLite* database using the Firefox GUI add-on SQLite Manager. Additional tables with the database allowed the 'relation' between limma data and meaningful biological information such as:

• Identifying genes covered by the ILMN Probe_IDs:

--> LIMMA_ table x Ensembl_GeneName table

• Determine ontologies (functions) of the genes:

--> Ensembl_GeneName table x Ensembl_GO table

Below is a SQL where I relate (join) 4 separate tables in the database based upon shared "keys":

LIMMA_HepG2TRB_CHX (contains ILMN Probe_IDs and their treatment effect and significance).
Ensembl_ilmn_hWGv3 (contains ILMN Probe_IDs and associated EnsemblGeneID)
Ensembl_GeneName (contains EnsemblGeneID and associated Official GeneName)
Ensembl_GO (contains EnsemblGeneID).

Filtering on:

Being a "direct target" as defined by retaining a significant ≥2-fold Thyroid hormone (T3) response (≤5% FDR) in the presence of cycloheximide (CHX, a protein translation inhibitor)
Having 'transcription' within it's Gene Ontology term and 'Inferred from Direct Assay (IDA)' evidence (see GO evidence codes).

SELECT trb.Probe_ID "ILMN Probe_ID", gene.GeneName, trb.FC_T3TRB_vs_CtrlTRB "FC_TRB+T3", trb.FC_CHXTRB_vs_CtrlTRB "FC_TRB+CHX", trb.FC_CHXT3TRB_vs_CtrlTRB "FC_TRB+CHX+T3", ego.GO_Term, ego.GO_Termname, ego.GO_EvidenceCode 
FROM LIMMA_HepG2TRB_CHX trb INNER JOIN Ensembl_ilmn_hWGv3 ilmn
ON trb.Probe_ID = ilmn.Probe_ID AND (ROUND(FC_T3TRB_vs_CtrlTRB, 1) >= 2 OR ROUND(FC_T3TRB_vs_CtrlTRB, 2) <= 0.5) AND
(ROUND(FC_DiffCHXT3, 1) >= 2 OR ROUND(FC_DiffCHXT3, 2) <= 0.5) AND ROUND(BHAdj_Pvalue_T3TRB_vs_CtrlTRB, 2) <= 0.05 AND
ROUND(BHAdj_Pvalue_DiffCHXT3, 2) <= 0.05
INNER JOIN Ensembl_GeneName gene
ON ilmn.EnsemblGeneID = gene.EnsemblGeneID
INNER JOIN Ensembl_GO ego
ON ilmn.EnsemblGeneID = ego.EnsemblGeneID AND ego.GO_Termname LIKE '%transcription%' AND ego.GO_EvidenceCode = "IDA";

Formatted using http://hilite.me/

With this SQL I receive the results displayed below (from SQLite within SQLite Manager).

click on image to enlarge

In the provided screen-grab, you'll note that genes (probes) are repeated (see highlighted ELF3 rows), as 'transcription' is associated within numerous Gene Ontologies, simply the result of the hierarchal nature of Gene Ontologies.

The nice feature of these databases is you can add, subtract, or correct data, and repeat your queries to derive new or corrected results. For example, if someone else publishes a related study, or you happen upon a gene list (or other shared trait) of particular interest, just add this information to your pre-existing database and query! It's that easy, and powerful!

* Note, SQLite is an awesome open-source "standalone" relational database engine that is platform independent, and easily transferable as a single file (e.g., transfer it via email, dropbox, thumb-drive, etc.). SQLite databases are also readily manipulated at the command prompt, and in some instances (e.g., "heavy" queries), the command line option will work whereas manipulation within the SQLite Manager GUI will not (I have had SQLite Manager freeze in some instances).

Genomic analysis by a biologist

Relational Databases and SQL

No comments:

Post a Comment