Many of my projects have required repetitive processing of large amounts of experimental data, a task well suited to a programmatic scripting language. To me, Python is an intuitive scripting language, with a feel similar to the flow of 'natural' language. Python is object-oriented and readily employs functions, which makes coding much easier (for me).
I am essentially self-taught, having read and worked through examples in books (e.g., "Think Python: How to Think Like a Computer Scientist") and online Python courses (e.g., edX's Learn to Program Using Python). Another great way to learn Python is to search the web for the task you want to accomplish; you'll find many helpful methods on discussion boards and in online tutorials.
Two instances in which my custom Python scripts were used for data analysis in co-authored scientific publications were: (1) grouping treatment effects into defined patterns (PMID: 23300972), and (2) calculating a correlation between two separate treatments (PMID: 22067320).
Instance 1, grouping treatment effects into patterns:
When biologists conduct transcriptome analysis, whether by microarray or RNA-seq, the simple question is whether their treatment(s) caused gene transcripts to increase or decrease in abundance (a.k.a. "up-regulation" and "down-regulation"). My task was to organize treatment effects into pre-defined patterns of down-, up-, or no regulation (e.g., 0.5-, 2-, and 1-fold change relative to control). These 3 transcript-abundance effects could then be grouped into all possible combinations across treatments. For example, 3 separate treatments (A, B, C) with the 3 possible effects would equate to 27 possible combinations (3^3):
Pattern | A | B | C
pattern_0 | 0.5 | 0.5 | 0.5
pattern_1 | 0.5 | 0.5 | 1
pattern_2 | 0.5 | 0.5 | 2
pattern_3 | 0.5 | 1 | 0.5
pattern_4 | 0.5 | 1 | 1
pattern_5 | 0.5 | 1 | 2
pattern_6 | 0.5 | 2 | 0.5
pattern_7 | 0.5 | 2 | 1
pattern_8 | 0.5 | 2 | 2
pattern_9 | 1 | 0.5 | 0.5
pattern_10 | 1 | 0.5 | 1
pattern_11 | 1 | 0.5 | 2
pattern_12 | 1 | 1 | 0.5
pattern_13 | 1 | 1 | 1
pattern_14 | 1 | 1 | 2
pattern_15 | 1 | 2 | 0.5
pattern_16 | 1 | 2 | 1
pattern_17 | 1 | 2 | 2
pattern_18 | 2 | 0.5 | 0.5
pattern_19 | 2 | 0.5 | 1
pattern_20 | 2 | 0.5 | 2
pattern_21 | 2 | 1 | 0.5
pattern_22 | 2 | 1 | 1
pattern_23 | 2 | 1 | 2
pattern_24 | 2 | 2 | 0.5
pattern_25 | 2 | 2 | 1
pattern_26 | 2 | 2 | 2
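As an aside, the 27 pattern vectors above need not be typed by hand. Here is a minimal sketch (the variable names are my own) using the standard library's itertools.product, which reproduces the table in the same order:

from itertools import product

effects = [0.5, 1, 2]  # down-, no, and up-regulation fold changes

# all 3^3 = 27 ordered combinations across treatments A, B, C
patterns = [list(p) for p in product(effects, repeat=3)]
for i, p in enumerate(patterns):
    print("pattern_" + str(i), p, sep="\t")  # pattern_0 through pattern_26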
[Figure: a heatmap version of the pattern table, derived through MeV (MultiExperiment Viewer).]
For example, we could determine the closest matching pattern for an observed treatment effect of 1.56-, 2.7-, and 0.24-fold change relative to the control. Using Euclidean distance as the metric of similarity between patterns, we calculate the distances between the treatment effect vector (1.56, 2.7, 0.24) and all possible pattern vectors (see the table below); the minimal Euclidean distance between vectors is the best match.
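As a quick sanity check of the arithmetic, the standard library's math.dist (available since Python 3.8) computes exactly this Euclidean distance between two points:

from math import dist

# distance between the observed effect vector and pattern_24
print(dist([1.56, 2.7, 0.24], [2, 2, 0.5]))  # ~0.8667, the minimal distance in the table below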
Table P1. Euclidean distances for all comparisons (sorted from minimal to maximal Euclidean distance).
Pattern | Vector | Euclidean_distance
pattern_24 | 2, 2, 0.5 | 0.866717947
pattern_15 | 1, 2, 0.5 | 0.933380951
pattern_25 | 2, 2, 1 | 1.123031611
pattern_16 | 1, 2, 1 | 1.175244655
pattern_6 | 0.5, 2, 0.5 | 1.296610967
pattern_7 | 0.5, 2, 1 | 1.480270246
pattern_21 | 2, 1, 0.5 | 1.775161964
pattern_12 | 1, 1, 0.5 | 1.808645902
pattern_22 | 2, 1, 1 | 1.913426246
pattern_13 | 1, 1, 1 | 1.944530792
pattern_26 | 2, 2, 2 | 1.944530792
pattern_17 | 1, 2, 2 | 1.975145564
pattern_3 | 0.5, 1, 0.5 | 2.02019801
pattern_4 | 0.5, 1, 1 | 2.142708566
pattern_8 | 0.5, 2, 2 | 2.170529889
pattern_18 | 2, 0.5, 0.5 | 2.258583627
pattern_9 | 1, 0.5, 0.5 | 2.28499453
pattern_19 | 2, 0.5, 1 | 2.368797163
pattern_10 | 1, 0.5, 1 | 2.393992481
pattern_0 | 0.5, 0.5, 0.5 | 2.455850158
pattern_23 | 2, 1, 2 | 2.486201923
pattern_14 | 1, 1, 2 | 2.510219114
pattern_1 | 0.5, 0.5, 1 | 2.557576978
pattern_5 | 0.5, 1, 2 | 2.666683333
pattern_20 | 2, 0.5, 2 | 2.851525907
pattern_11 | 1, 0.5, 2 | 2.872490209
pattern_2 | 0.5, 0.5, 2 | 3.010182719
We can see that the best matching pattern is "pattern_24" (2, 2, 0.5), which makes sense since the treatment effect vector is ("up", "up", "down").
Rather than calculate these distances one by one, it would be preferable to automate the task, something Python (or another scripting language) can easily accomplish.
The steps below depict the methodical building of code toward a final script that takes two input files (a patterns file and a gene fold-change file; see Step 6) and determines the 'matching' pattern (minimal Euclidean distance) for each gene transcript's fold-change vector. Following the "building blocks" concept, each step repeats the previous step with additional code appended.
Step 1: Build a function that calculates the Euclidean distance when provided 2 vectors of the same length:
from math import sqrt

pattern = [2, 2, 0.5]
result = [1.56, 2.7, 0.24]

def euclidean_distance(a, b):
    distance_sqrd = []
    for i in range(len(a)):
        x, y = a[i], b[i]  # extract the two ordinates
        diff_sqrd = (x - y) * (x - y)
        distance_sqrd.append(diff_sqrd)
    sum_distance_sqrd = sqrt(sum(distance_sqrd))
    return sum_distance_sqrd

print(euclidean_distance(result, pattern))
Result:
0.8667179472008182
If the above looks foreign to you, there's a fantastic site, Visualize Python, that will walk you through the program step by step. Simply copy and paste the code above into the website to observe the calculation of the Euclidean distance between the treatment effect (result) and pattern_24 (pattern) vectors. You will need to specify that the above code is written in Python 3.
Please note the indented nature of the Python script, which defines the order and procedural relations (steps) of the program. The indentation reminds me of hierarchical outlines, another feature of Python that I find appealing (intuitive); the code reads much like pseudocode, making a mental walk-through of the program much easier.
Step 2: Add all possible patterns to the patterns list (now, a multidimensional list), and loop through the Euclidean distance function to calculate the distances between the single result vector and all patterns:
from math import sqrt

results = [1.56, 2.7, 0.24]

patterns = [[0.5, 0.5, 0.5], [0.5, 0.5, 1], [0.5, 0.5, 2], [0.5, 1, 0.5], [0.5, 1, 1], [0.5, 1, 2],
            [0.5, 2, 0.5], [0.5, 2, 1], [0.5, 2, 2], [1, 0.5, 0.5], [1, 0.5, 1], [1, 0.5, 2],
            [1, 1, 0.5], [1, 1, 1], [1, 1, 2], [1, 2, 0.5], [1, 2, 1], [1, 2, 2],
            [2, 0.5, 0.5], [2, 0.5, 1], [2, 0.5, 2], [2, 1, 0.5], [2, 1, 1], [2, 1, 2],
            [2, 2, 0.5], [2, 2, 1], [2, 2, 2]]

def euclidean_distance(a, b):
    distance_sqrd = []
    for i in range(len(a)):
        x, y = a[i], b[i]  # extract the two ordinates
        diff_sqrd = (x - y) * (x - y)
        distance_sqrd.append(diff_sqrd)
    sum_distance_sqrd = sqrt(sum(distance_sqrd))
    return sum_distance_sqrd

print("results", "pattern", "Euclidean Distance", sep="\t")
for p in patterns:
    print(results, p, round(euclidean_distance(results, p), 4), sep="\t")
Result:
results pattern Euclidean Distance
[1.56, 2.7, 0.24] [0.5, 0.5, 0.5] 2.4559
[1.56, 2.7, 0.24] [0.5, 0.5, 1] 2.5576
[1.56, 2.7, 0.24] [0.5, 0.5, 2] 3.0102
.
.
.
[1.56, 2.7, 0.24] [2, 2, 0.5] 0.8667
[1.56, 2.7, 0.24] [2, 2, 1] 1.123
[1.56, 2.7, 0.24] [2, 2, 2] 1.9445
Here, I introduce a loop (the final two lines of the script) that plugs each individual pattern 'p' into the euclidean_distance function, calculating the distance between each 'p' and the single result vector. This script was used to derive Table P1 above.
Step 3: Add additional result vectors to the results list (also a multidimensional list), looping through both 2D lists, calculating Euclidean distance for each comparison:
from math import sqrt

results = [[1.56, 2.7, 0.24], [2, 4, 0.6], [1.2, 0.8, 4], [1.3, 6, 2.5], [6.5, 1, 0.9], [0.43, 2.1, 2], [0.88, 0.67, 0.5]]

patterns = [[0.5, 0.5, 0.5], [0.5, 0.5, 1], [0.5, 0.5, 2], [0.5, 1, 0.5], [0.5, 1, 1], [0.5, 1, 2],
            [0.5, 2, 0.5], [0.5, 2, 1], [0.5, 2, 2], [1, 0.5, 0.5], [1, 0.5, 1], [1, 0.5, 2],
            [1, 1, 0.5], [1, 1, 1], [1, 1, 2], [1, 2, 0.5], [1, 2, 1], [1, 2, 2],
            [2, 0.5, 0.5], [2, 0.5, 1], [2, 0.5, 2], [2, 1, 0.5], [2, 1, 1], [2, 1, 2],
            [2, 2, 0.5], [2, 2, 1], [2, 2, 2]]

def euclidean_distance(a, b):
    distance_sqrd = []
    for i in range(len(a)):
        x, y = a[i], b[i]  # extract the two ordinates
        diff_sqrd = (x - y) * (x - y)
        distance_sqrd.append(diff_sqrd)
    sum_distance_sqrd = sqrt(sum(distance_sqrd))
    return sum_distance_sqrd

print("results", "pattern", "Euclidean Distance", sep="\t")
for r in results:
    for p in patterns:
        print(r, p, round(euclidean_distance(r, p), 4), sep="\t")
Result:
results pattern Euclidean Distance
[1.56, 2.7, 0.24] [0.5, 0.5, 0.5] 2.4559
[1.56, 2.7, 0.24] [0.5, 0.5, 1] 2.5576
[1.56, 2.7, 0.24] [0.5, 0.5, 2] 3.0102
.
.
.
[0.88, 0.67, 0.5] [2, 2, 0.5] 1.7388
[0.88, 0.67, 0.5] [2, 2, 1] 1.8092
[0.88, 0.67, 0.5] [2, 2, 2] 2.2964
As you can see in the final three lines, we now have a nested loop, which passes each individual result vector 'r' and each pattern vector 'p' into the euclidean_distance function to calculate the distance. (There are good YouTube videos covering the basics of nested loops.) The above script produces 7 x 27 = 189 results. Now you can see how valuable programmatic scripting can be. Just imagine if you wanted to analyze the results of 500 gene transcripts against these 27 patterns (= 13,500 Euclidean distances!).
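Incidentally, a nested loop like this can be collapsed into a single loop with itertools.product. A drop-in sketch, assuming the results, patterns, and euclidean_distance definitions from the script above:

from itertools import product

# one loop over every (result, pattern) pair, equivalent to the nested loop
for r, p in product(results, patterns):
    print(r, p, round(euclidean_distance(r, p), 4), sep="\t")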
Step 4: Identifying the minimal Euclidean distance from the comparisons between all patterns and a single result vector.
from math import sqrt

results = [[1.56, 2.7, 0.24]]

patterns = [[0.5, 0.5, 0.5], [0.5, 0.5, 1], [0.5, 0.5, 2], [0.5, 1, 0.5], [0.5, 1, 1], [0.5, 1, 2],
            [0.5, 2, 0.5], [0.5, 2, 1], [0.5, 2, 2], [1, 0.5, 0.5], [1, 0.5, 1], [1, 0.5, 2],
            [1, 1, 0.5], [1, 1, 1], [1, 1, 2], [1, 2, 0.5], [1, 2, 1], [1, 2, 2],
            [2, 0.5, 0.5], [2, 0.5, 1], [2, 0.5, 2], [2, 1, 0.5], [2, 1, 1], [2, 1, 2],
            [2, 2, 0.5], [2, 2, 1], [2, 2, 2]]

def euclidean_distance(a, b):
    distance_sqrd = []
    for i in range(len(a)):
        x, y = a[i], b[i]  # extract the two ordinates
        diff_sqrd = (x - y) * (x - y)
        distance_sqrd.append(diff_sqrd)
    sum_distance_sqrd = sqrt(sum(distance_sqrd))
    return sum_distance_sqrd

for r in results:
    euc = []
    for p in patterns:
        euc.append(euclidean_distance(r, p))
    print(patterns[euc.index(min(euc))], round(min(euc), 4), sep='\t')
Result:
[2, 2, 0.5] 0.8667
The above script is an extension of the script provided in Step 2, adding code at the bottom that identifies and pulls out the minimal Euclidean distance. To accomplish this I first initiate a new empty list ("euc = []"). As the loop proceeds, the euclidean_distance function produces the distances, which are appended to the "euc" list. Once all distances are added, the minimal (min) Euclidean distance is identified within the "euc" list and its index (location) noted, which lets me grab the 'matching' pattern from the original "patterns" list (here, [2, 2, 0.5] = index 24). Important note: Python indices start at 0, not 1, reading from left to right, which is why the tables above specify "pattern_0" through "pattern_26".
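As an aside, the index and value of the minimum can also be pulled out in a single pass with the built-in min() over enumerate(); this is just an alternative idiom, not what the published script used:

euc = [2.4559, 2.5576, 3.0102, 0.8667, 1.123]  # a toy distance list for illustration

# pair each distance with its index, then take the minimum keyed on the distance
best_index, best_distance = min(enumerate(euc), key=lambda pair: pair[1])
print(best_index, best_distance)  # 3 0.8667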
Step 5: Obtaining the minimal Euclidean distance for multiple results:
Because we want to associate each specific results vector (i.e., gene transcript) with its 'matching' pattern, I need to introduce the Python dictionary data type. The two features of a dictionary are the "key" and the "value". Each key has its own value(s), and keys must be unique. Here, the key is the gene transcript and the value is its results vector.
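If dictionaries are new to you, here is a minimal illustration, using two of the gene/vector pairs from the script below:

# keys (gene names) must be unique; values are the fold-change vectors
results_dictionary = {"SPOCK": [1.56, 2.7, 0.24], "LCTL": [2, 4, 0.6]}

print(results_dictionary["SPOCK"])    # look up a value by its key
for gene, vector in results_dictionary.items():
    print(gene, vector, sep="\t")     # iterate over key/value pairs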
from math import sqrt

results_dictionary = {"SPOCK": [1.56, 2.7, 0.24], "LCTL": [2, 4, 0.6], "PYCARD": [1.2, 0.8, 4],
                      "DLGAP1": [1.3, 6, 2.5], "TRIB1": [6.5, 1, 0.9], "VANGL1": [0.43, 2.1, 2],
                      "SMURF1": [0.88, 0.67, 0.5]}

patterns = [[0.5, 0.5, 0.5], [0.5, 0.5, 1], [0.5, 0.5, 2], [0.5, 1, 0.5], [0.5, 1, 1], [0.5, 1, 2],
            [0.5, 2, 0.5], [0.5, 2, 1], [0.5, 2, 2], [1, 0.5, 0.5], [1, 0.5, 1], [1, 0.5, 2],
            [1, 1, 0.5], [1, 1, 1], [1, 1, 2], [1, 2, 0.5], [1, 2, 1], [1, 2, 2],
            [2, 0.5, 0.5], [2, 0.5, 1], [2, 0.5, 2], [2, 1, 0.5], [2, 1, 1], [2, 1, 2],
            [2, 2, 0.5], [2, 2, 1], [2, 2, 2]]

def euclidean_distance_part(a, b):
    distance_sqrd = []
    for i in range(len(a)):
        x, y = a[i], b[i]  # extract the two ordinates
        diff_sqrd = (x - y) * (x - y)
        distance_sqrd.append(diff_sqrd)
    sum_distance_sqrd = sqrt(sum(distance_sqrd))
    return sum_distance_sqrd

print("gene", "results", "pattern_number", "pattern", "euclidean_distance", sep='\t')
for gene, results in results_dictionary.items():
    euc = []
    for p in patterns:
        euc.append(euclidean_distance_part(results, p))
    print(gene, results, euc.index(min(euc)), patterns[euc.index(min(euc))], round(min(euc), 4), sep='\t')
Result:
gene results pattern_number pattern euclidean_distance
SPOCK [1.56, 2.7, 0.24] 24 [2, 2, 0.5] 0.8667
DLGAP1 [1.3, 6, 2.5] 17 [1, 2, 2] 4.0423
SMURF1 [0.88, 0.67, 0.5] 9 [1, 0.5, 0.5] 0.2081
PYCARD [1.2, 0.8, 4] 14 [1, 1, 2] 2.0199
VANGL1 [0.43, 2.1, 2] 8 [0.5, 2, 2] 0.1221
TRIB1 [6.5, 1, 0.9] 22 [2, 1, 1] 4.5011
LCTL [2, 4, 0.6] 24 [2, 2, 0.5] 2.0025
The above script simply loops through each key and value (gene and results vector, respectively) against each pattern, derives the 'matching' pattern (minimal Euclidean distance), and prints a tab-delimited result for each key/value pair.
Step 6: Reading the results and patterns from input files, and looping through all patterns and results to calculate the Euclidean distances:
This step simply reads the results and patterns from 2 separate files, populating the "results_dict" dictionary and the "patterns" list with their contents. The files are tab-delimited text files, which I am calling '2-fold_3-by-3_patterns.txt' and 'HepG2_significant-T3_6h.txt', culled from PMID: 23300972.
Here's what the files look like (note: no headers on the first line of either file):
2-fold_3-by-3_patterns.txt
0.5 0.5 0.5
0.5 0.5 1
0.5 0.5 2
0.5 1 0.5
.
.
.
2 1 2
2 2 0.5
2 2 1
2 2 2
HepG2_significant-T3_6h.txt
IL4R 1.024474886 2.219374904 1.49
PFKFB4 1.104901159 3.226603156 2.15
UBASH3B 0.792117728 0.307732139 0.528
EGFLAM 0.959569797 1.520448169 2.48
ACSL3 0.95956302 1.75749976 2.2
.
.
.
CLIP1 1.229766122 1.791323367 2.24
DNMT3L 2.353985909 2.163907179 3.01
ABCC6 1.244322488 1.759081551 2.04
PLAUR 0.934067161 2.489250673 3.2
TOX3 1.066779442 0.466628681 0.505
from math import sqrt

pattern_file = open("2-fold_3-by-3_patterns.txt", 'r').readlines()
results_file = open("HepG2_significant-T3_6h.txt", 'r').readlines()

# populate the patterns list (a 2D list of floats)
patterns = []
for line in pattern_file:
    temp = line.split()
    patterns.append(list(map(float, temp)))

# populate the results dictionary: gene -> fold-change vector
results_dict = {}
for line in results_file:
    temp = line.split()
    results_dict[temp[0]] = list(map(float, temp[1:]))

def euclidean_distance_part(a, b):
    distance_sqrd = []
    for i in range(len(a)):
        x, y = a[i], b[i]  # extract the two ordinates
        diff_sqrd = (x - y) * (x - y)
        distance_sqrd.append(diff_sqrd)
    sum_distance_sqrd = sqrt(sum(distance_sqrd))
    return sum_distance_sqrd

print("gene", "results", "pattern_number", "pattern", "euclidean_distance", sep='\t')
for gene, results in results_dict.items():
    euc = []
    for p in patterns:
        euc.append(euclidean_distance_part(results, p))
    print(gene, results, euc.index(min(euc)), patterns[euc.index(min(euc))], min(euc), sep='\t')
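If you would rather capture the output in a file than copy it from the console, print() accepts a file handle. A minimal sketch, assuming the results_dict, patterns, and euclidean_distance_part definitions from the script above (the output filename is my own choice):

# write the tab-delimited matches to a file instead of the console
with open("matched_patterns.txt", "w") as out:
    print("gene", "results", "pattern_number", "pattern", "euclidean_distance", sep="\t", file=out)
    for gene, results in results_dict.items():
        euc = [euclidean_distance_part(results, p) for p in patterns]
        print(gene, results, euc.index(min(euc)), patterns[euc.index(min(euc))], min(euc), sep="\t", file=out)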
Instance 2, correlating treatment effects:
I was tasked to determine the similarity (or difference) between two Thyroid Hormone Receptor ligands (one natural = T3, and the other synthetic = GC-1). Perusing the internet and refereed articles, I found that correlation metrics such as Spearman's rank correlation coefficient and Pearson correlation have been used for such a task (e.g., determining the similarity in fold-change effect between 2 different treatments).
I used Python to create a Spearman's rank correlation coefficient function as outlined within the supplemental section of the paper (PMID: 22067320).
Spearman's rank correlation coefficient analysis
To determine whether there are differences between the microarray data obtained from the T3 and GC-1 treatments, we used the Spearman rank-order correlation coefficient, rs:

rs = 1 - (6 * Σd^2) / (n * (n^2 - 1))

where d is the difference between a probe-set's rank in one dataset (T3) and its rank in the other (GC-1), and n is the number of probe-sets being compared.
We decided to conduct the analysis with Affymetrix probe-set identifications instead of aggregating probe-sets into unified identifiers (e.g., UniGene), to eliminate the impact of splice variants.
Affymetrix probe-sets in both datasets were ordered based upon fold induction of ligand-treated versus vehicle. We analyzed both the entire Affymetrix probe-set list ("complete probe-set list") and the probe-sets receiving an adjusted p-value <= 0.05 in both datasets ("significant probe-sets").
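For intuition, the rank-ordering step can be expressed compactly in Python: sort the (probe, fold change) pairs by fold change and number them. A toy sketch (made-up probe IDs, ignoring tie handling):

data = [("probe_A", 0.62), ("probe_B", 1.23), ("probe_C", 0.30)]  # (probe_ID, fold change)

# rank 1 = lowest fold change; enumerate(..., start=1) numbers the sorted pairs
ranks = {probe: rank for rank, (probe, fc) in enumerate(sorted(data, key=lambda pair: pair[1]), start=1)}
print(ranks)  # {'probe_C': 1, 'probe_A': 2, 'probe_B': 3}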
To determine whether the calculated correlation coefficient, rs, is greater than would be expected by chance, we performed an empirical permutation analysis, in which the rank ordering of the genes in one list is randomly shuffled, followed by recalculation of rs. The permutation analysis was done 10,000 times using a custom Python script. The p-value is simply the proportion of times that the rs value from a shuffled list is greater than or equal to the correlation coefficient calculated for the experimental dataset.
Using this procedure in the aforementioned paper, I obtained a highly significant p-value (< 0.0001 = 1/10,000), because in no instance did the permuted rankings produce a greater correlation than that calculated for the experimental set.
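As a cross-check (not part of the original workflow), SciPy's spearmanr computes the same coefficient, with an analytic rather than permutation p-value, if SciPy is installed:

from scipy.stats import spearmanr

treatment_1 = [0.62, 0.55, 0.63, 1.23, 0.64]  # toy fold changes
treatment_2 = [0.66, 0.59, 0.61, 1.12, 0.57]  # toy fold changes

rho, p_value = spearmanr(treatment_1, treatment_2)
print(rho, p_value)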
For the current example, I will use another study we published that compared the anti-androgen effects of a clinically utilized drug, flutamide, to a novel chemistry, MJC13 (Suh et al. 2015. Similarities and Distinctions in Actions of Surface-Directed and Classic Androgen Receptor Antagonists. PLoS One 10(9):e0137103, PMID: 26332122). The commonly used prostate cancer cell line, LNCaP, was treated with the androgen DHT (dihydrotestosterone) alone or in combination with one of the 2 anti-androgen chemistries. As a line plot of the data shows, the two anti-androgens behaved very similarly.
Input is a 3-column, tab-delimited text file (Flutamide_MJC13_Effects.txt):
ILMN_1671486 0.619053839 0.655653049
ILMN_1661637 0.546773417 0.593758193
ILMN_2211739 0.628942245 0.614651425
ILMN_1687768 1.232997126 1.121925249
ILMN_1713952 0.639418101 0.567393595
ILMN_1731785 0.296007271 0.320239937
ILMN_2270015 0.740351356 0.732066055
.
.
.
import random
from decimal import Decimal

file = open('Flutamide_MJC13_Effects.txt', 'r').readlines()

data_1 = []  # a nested 2D list that will be appended with column 1 (probe_ID) and column 2 (foldChange_1)
data_2 = []  # a nested 2D list that will be appended with column 1 (probe_ID) and column 3 (foldChange_2)
for line in file:
    temp = line.split()
    data_1.append([temp[0], float(temp[1])])
    data_2.append([temp[0], float(temp[2])])

foldChange_1 = []  # a list that will be appended with each foldChange_1
for d_1 in data_1:
    foldChange_1.append(d_1[1])
sorted_foldChange_1 = sorted(foldChange_1)  # sort the foldChange_1 list from lowest to highest

foldChange_2 = []  # a list that will be appended with each foldChange_2
for d_2 in data_2:
    foldChange_2.append(d_2[1])
sorted_foldChange_2 = sorted(foldChange_2)  # sort the foldChange_2 list from lowest to highest

probes_ranked_1 = []  # data_1 probe_IDs ranked in order according to sorted_foldChange_1
for s_fc_1 in sorted_foldChange_1:
    for d_1 in data_1:
        if s_fc_1 == d_1[1]:
            probes_ranked_1.append(d_1[0])

rankings_2 = []  # a nested 2D list of [probe_ID, rank] according to sorted_foldChange_2
count = 0
for s_fc_2 in sorted_foldChange_2:
    for d_2 in data_2:
        if s_fc_2 == d_2[1]:
            count = count + 1
            rankings_2.append([d_2[0], count])

relative_rankings_2 = []  # the rank (index location) of data_2 (rankings_2) relative to data_1 (probes_ranked_1), used for the subsequent correlation analysis
for probes_1 in probes_ranked_1:
    for r_2 in rankings_2:
        if probes_1 == r_2[0]:
            relative_rankings_2.append(rankings_2.index(r_2) + 1)

straight_rank = []  # a list of the numbers 1-81 (the length of the current input); used for the experimental (actual) and permuted correlation analyses
count = 0
for p in range(len(probes_ranked_1)):
    count = count + 1
    straight_rank.append(count)

def subtracted(a, b):
    # takes 2 input lists and calculates the difference at each index
    # (i.e., list_1[0] - list_2[0], list_1[1] - list_2[1], etc.); returns the differences as a list
    diffs = []
    for i in range(len(a)):
        x, y = a[i], b[i]  # extract the two ordinates
        difference = x - y
        diffs.append(difference)
    return diffs

def spearmans(a, b):
    # takes the differences list and calculates and returns the Spearman's rank correlation coefficient
    differences = subtracted(a, b)  # using the above 'subtracted' function to derive the differences
    sqr_diffs = []
    for d in differences:
        dsq = d ** 2
        sqr_diffs.append(dsq)
    sum_sqr_diffs = sum(sqr_diffs)
    numerator = 6 * sum_sqr_diffs
    list_length = len(a)
    denominator = list_length * (list_length ** 2 - 1)
    spears = Decimal(1 - (Decimal(numerator) / Decimal(denominator)))
    return spears

SRCC = spearmans(straight_rank, relative_rankings_2)  # calling the spearmans function
print("Spearman's rank correlation coefficient = ", float(SRCC))

copy = straight_rank[:]  # a copy of the original 'straight_rank' list, so the copy can be permuted, leaving 'straight_rank' alone
all_shuffled = []  # a nested 2D list that will harbor all the permuted rank lists (here, 1000 total)
count = 0
permutations = 1000
while count < permutations:
    count += 1
    random.shuffle(copy)
    all_shuffled.append(copy[:])

results = []  # all calculated Spearman's rank correlation coefficients for straight_rank against each permuted rank
for i in all_shuffled:
    test_shuffled = spearmans(straight_rank, i)
    results.append(test_shuffled)

greater = 0  # count the instances in which the SRCC of the permuted data is greater than the actual ("experimental") SRCC
for r in results:
    if abs(r) >= SRCC:
        greater = greater + 1

if greater == 0:
    print("p-value < ", 1 / permutations)  # prints if no permuted-rank SRCC was greater than the actual SRCC; the p-value is simply 1 / the number of permutations
else:
    print("p-value = ", Decimal(greater / permutations))  # prints if at least one permuted-rank SRCC was greater than the actual; p-value = number greater / number of permutations
Result:
Spearman's rank correlation coefficient = 0.9459801264679314
p-value < 0.001