My opportunity to use this powerful tool came when I was asked to analyze a dataset with 24 different conditions (6 cell lines x 4 treatments) for changes in gene transcript abundance as interpreted from Illumina (ILMN) HT12 BeadArrays (>45,000 probes/ array)! I first needed to limit the results to gene transcripts that displayed significant changes following treatment, in addition to deriving some biological relationships (e.g., shared gene ontologies/ functions).
Using the relational database approach I was able process, analyze, hypothesize, and reach conclusions with my co-authors as outlined in this article:
Lin, J. Z., Sieglaff, D. H., Yuan, C., Su, J., Arumanayagam, A.S., Firouzbakht, S., Cantu Pompa, J.J., Reynolds, F.D., Zhou, X., Cvoro, A., and P. Webb. 2013. Gene specific actions of thyroid hormone receptor subtypes. PLoS One 8(1):e52407
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0052407The steps for such analysis"
(1) Raw array data from GenomeStudio (Illumina's array software) was background-subtracted, normalized, and analyzed using lumi and limma within R. A subset of relevant output that was uploaded to the relational database includes:
• Array probe-ids
• Change in transcript abundance relative to a control (e.g., non-treated cell line)
• Statistical assessment (e.g., p-value, FDR)
(2) The limma results were uploaded into a SQLite* database using the Firefox GUI add-on SQLite Manager. Additional tables with the database allowed the 'relation' between limma data and meaningful biological information such as:
(2) The limma results were uploaded into a SQLite* database using the Firefox GUI add-on SQLite Manager. Additional tables with the database allowed the 'relation' between limma data and meaningful biological information such as:
• Identifying genes covered by the ILMN Probe_IDs:
--> LIMMA_ table x Ensembl_GeneName table
• Determine ontologies (functions) of the genes:
--> Ensembl_GeneName table x Ensembl_GO table
Below is a SQL where I relate (join) 4 separate tables in the database based upon shared "keys":
- LIMMA_HepG2TRB_CHX (contains ILMN Probe_IDs and their treatment effect and significance).
- Ensembl_ilmn_hWGv3 (contains ILMN Probe_IDs and associated EnsemblGeneID)
- Ensembl_GeneName (contains EnsemblGeneID and associated Official GeneName)
- Ensembl_GO (contains EnsemblGeneID).
Filtering on:
- Being a "direct target" as defined by retaining a significant ≥2-fold Thyroid hormone (T3) response (≤5% FDR) in the presence of cycloheximide (CHX, a protein translation inhibitor)
- Having 'transcription' within it's Gene Ontology term and 'Inferred from Direct Assay (IDA)' evidence (see GO evidence codes).
1 2 3 4 5 6 7 8 9 | SELECT trb.Probe_ID "ILMN Probe_ID", gene.GeneName, trb.FC_T3TRB_vs_CtrlTRB "FC_TRB+T3", trb.FC_CHXTRB_vs_CtrlTRB "FC_TRB+CHX", trb.FC_CHXT3TRB_vs_CtrlTRB "FC_TRB+CHX+T3", ego.GO_Term, ego.GO_Termname, ego.GO_EvidenceCode FROM LIMMA_HepG2TRB_CHX trb INNER JOIN Ensembl_ilmn_hWGv3 ilmn ON trb.Probe_ID = ilmn.Probe_ID AND (ROUND(FC_T3TRB_vs_CtrlTRB, 1) >= 2 OR ROUND(FC_T3TRB_vs_CtrlTRB, 2) <= 0.5) AND (ROUND(FC_DiffCHXT3, 1) >= 2 OR ROUND(FC_DiffCHXT3, 2) <= 0.5) AND ROUND(BHAdj_Pvalue_T3TRB_vs_CtrlTRB, 2) <= 0.05 AND ROUND(BHAdj_Pvalue_DiffCHXT3, 2) <= 0.05 INNER JOIN Ensembl_GeneName gene ON ilmn.EnsemblGeneID = gene.EnsemblGeneID INNER JOIN Ensembl_GO ego ON ilmn.EnsemblGeneID = ego.EnsemblGeneID AND ego.GO_Termname LIKE '%transcription%' AND ego.GO_EvidenceCode = "IDA"; |
Formatted using http://hilite.me/
With this SQL I receive the results displayed below (from SQLite within SQLite Manager).
In the provided screen-grab, you'll note that genes (probes) are repeated (see highlighted ELF3 rows), as 'transcription' is associated within numerous Gene Ontologies, simply the result of the hierarchal nature of Gene Ontologies.
click on image to enlarge
In the provided screen-grab, you'll note that genes (probes) are repeated (see highlighted ELF3 rows), as 'transcription' is associated within numerous Gene Ontologies, simply the result of the hierarchal nature of Gene Ontologies.
The nice feature of these databases is you can add, subtract, or correct data, and repeat your queries to derive new or corrected results. For example, if someone else publishes a related study, or you happen upon a gene list (or other shared trait) of particular interest, just add this information to your pre-existing database and query! It's that easy, and powerful!
* Note, SQLite is an awesome open-source "standalone" relational database engine that is platform independent, and easily transferable as a single file (e.g., transfer it via email, dropbox, thumb-drive, etc.). SQLite databases are also readily manipulated at the command prompt, and in some instances (e.g., "heavy" queries), the command line option will work whereas manipulation within the SQLite Manager GUI will not (I have had SQLite Manager freeze in some instances).
No comments:
Post a Comment