Quality Control of NGS data

Starting with good data

Given that your next-generation sequencing (NGS) data will be used to test and generate hypotheses, it's best to begin with reliable, high-quality data ("junk in, junk out", right?). Thus, you want to assess the quality of the data and, if necessary, trim away the undesirable sequence while still retaining what is usable.

Quality Assessment
In my experience, the most popular and widely used program is FastQC, an easy-to-use GUI tool written in Java. Please read through the FastQC website for details (Andrews, 2010), and note that it provides examples of both "good" and "bad" outputs.

I first look at the "Per base sequence quality" bar graph. Quality scores in NGS data are given as Phred values; a popular cut-off is Q30, though some consider Q20 acceptable. A Q-value is simply -10 times the log10 of the probability that the base was called incorrectly (see Illumina's description): the higher the Q-value, the lower the probability of an incorrectly identified base. Convince yourself by plugging error probabilities into the equation, e.g., in Excel, 10*(-LOG(0.01, 10)) = 20.
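
If you'd rather sanity-check this in code than in Excel, here is a minimal Python sketch of the Phred relationship:

import math

def phred_q(p_error):
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(p_error)

def error_prob(q):
    """Convert a Phred quality score back to an error probability."""
    return 10 ** (-q / 10)

print(phred_q(0.01))   # 20.0 -> 1-in-100 chance the base call is wrong
print(phred_q(0.001))  # 30.0 -> 1-in-1000 chance the base call is wrong
print(error_prob(30))  # 0.001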

I have been told that sequencing probabilities (Q-values) were determined empirically. That is, the company's own experiments likely told them the "confidence" one could have in "calling a base" at a particular signal value (e.g., Illumina sequencing = fluorescence signal, Thermo Fisher Scientific's Ion Torrent technology = detection of pH change). As with microarray technology, noise can certainly influence such measurements. You'll note that the Q-scores go down as the sequencing reaction progresses 5' to 3'. This reduction in quality as sequencing progresses is a common feature of Illumina's sequencing-by-synthesis reads.
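
As a rough illustration of what FastQC's per-base quality plot summarizes, here is a minimal Python sketch that computes the mean Q-score at each read position ("reads.fastq" is a hypothetical file name, and Phred+33 encoding is assumed; more on encodings in the quality-trimming section below):

# Sketch: mean Phred quality per read position, assuming Phred+33 encoding.
from collections import defaultdict

totals = defaultdict(int)   # position -> summed quality
counts = defaultdict(int)   # position -> number of reads covering it

with open("reads.fastq") as fh:  # hypothetical file name
    for i, line in enumerate(fh):
        if i % 4 == 3:  # every 4th line of a record is the quality string
            for pos, ch in enumerate(line.rstrip("\n")):
                totals[pos] += ord(ch) - 33  # Phred+33 offset
                counts[pos] += 1

for pos in sorted(totals):
    print(pos + 1, round(totals[pos] / counts[pos], 1))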


The other feature of the FastQC output I look at is the duplication levels (see the "Sequence Duplication Levels" chart in the report), which can indicate PCR-mediated artifacts that occurred during library preparation. Obviously, it's preferable to have a low duplication rate; otherwise the variation one analyzes within the experiment will be less 'biological' and more 'technical'.
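
For intuition, here is a simplified Python sketch that counts exact-duplicate read sequences. FastQC's actual duplication estimate is more involved (it subsamples reads), so this only captures the basic idea:

# Sketch: fraction of reads whose sequence is an exact duplicate of an
# earlier read ("reads.fastq" is a hypothetical file name).
from collections import Counter

seq_counts = Counter()
with open("reads.fastq") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:  # the 2nd line of each record is the sequence
            seq_counts[line.rstrip("\n")] += 1

total = sum(seq_counts.values())
duplicates = total - len(seq_counts)  # reads beyond the first copy of each sequence
print(f"duplication rate: {duplicates / total:.1%}")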

An additional assessment I've heard of is to simply determine the percentage of reads that align (map) to the respective reference genome. Once you have the resulting BAM alignment file (a binary SAM file), you can run a stats tool such as the bamtools stats command to obtain this information.
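
If you prefer to compute the mapped percentage yourself, here is a short sketch assuming the pysam library is installed ("aln.bam" is a hypothetical file name); bamtools stats reports the same kind of summary from the command line:

# Sketch: percent of reads mapped, using pysam.
# Note: secondary/supplementary alignments are not filtered out here.
import pysam

mapped = unmapped = 0
with pysam.AlignmentFile("aln.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped:
            unmapped += 1
        else:
            mapped += 1

total = mapped + unmapped
print(f"{mapped}/{total} reads mapped ({mapped / total:.1%})")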

Quality-trimming
The fastq file is a 4-lines-per-sequence (read) format (1. header, beginning with '@'; 2. sequence; 3. separator line, beginning with '+' and optionally repeating the header; 4. quality scores). See below for an example, as well as the format's wiki page.


@SRR993713.354 FCD0LFYACXX:8:1101:18466:2233 length=90
TGGGTCTAAAGGTGGACTGTGGTACCCGGGGGCAGGAAGAGGGACAGTTAGGCCTAGATGTAAGGAGACCACTATGCTGCTTCTTCATCT
+SRR993713.354 FCD0LFYACXX:8:1101:18466:2233 length=90
CCCFFFFFHHHHAEHIIIIFHIFHHIIIIIIIGGIHFHHFFFF>BDDDDDDDCDDDDDDDDDDDDDDBDDDDDDDDCECDDDDDDDEEED


It's important to note that each base has its own individual quality score, which is used during the quality-trimming process. You'll note that the quality scores (line 4) within the fastq file are not "20", "30", "22", etc., but ASCII characters that correspond to specific Q-scores (see the fastq wiki). The important thing to note is that trimming tools will know what the ASCII characters mean, as long as you tell them which encoding was used (e.g., Phred+33 for current Illumina data).
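
To make this concrete, here is a small Python snippet that decodes the quality string from the example read above, assuming the Phred+33 encoding used by current Illumina data:

# Decode a Phred+33 quality string into per-base Q-scores.
qual = "CCCFFFFFHHHHAEHIIIIFHIFHHIIIIIIIGGIHFHHFFFF>BDDDDDDDCDDDDDDDDDDDDDDBDDDDDDDDCECDDDDDDDEEED"
scores = [ord(ch) - 33 for ch in qual]
print(scores[:10])  # Q-scores of the first ten bases: [34, 34, 34, 37, 37, 37, 37, 37, 39, 39]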

I use Sickle to trim away poor-quality sequence, but of course, there are many other open-source trimming programs available. Note that these programs may use different methods to decide what sequence to trim.
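
For illustration, here is a toy Python version of the sliding-window idea that Sickle is based on; the window size, threshold, and details below are illustrative, not Sickle's actual implementation:

# Toy sliding-window quality trimmer, loosely modelled on Sickle's approach.
def trim_3prime(seq, quals, window=5, min_q=20):
    """Cut the read at the first window whose mean quality drops below min_q."""
    for start in range(0, max(len(quals) - window + 1, 1)):
        win = quals[start:start + window]
        if sum(win) / len(win) < min_q:
            return seq[:start], quals[:start]
    return seq, quals

seq = "TGGGTCTAAAGGTGGACTGT"  # made-up example read
quals = [34, 34, 34, 37, 12, 10, 9, 37, 39, 39, 40, 40, 38, 12, 11, 10, 9, 8, 8, 7]
print(trim_3prime(seq, quals))  # trims the low-quality 3' tail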

Reference:
Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
