Python

the intuitive scripting language

Many of my projects have required repetitive processing of large amounts of experimental data, a task well suited to a scripting language. To me, Python is an intuitive scripting language, with a feel similar to the flow of 'natural' language. Python is object-oriented and readily employs functions, which makes coding much easier (for me).

I am essentially self-taught, having read and worked through the examples in books (e.g., "Think Python: How to Think Like a Computer Scientist") and online Python courses (e.g., edX's Learn to Program Using Python). Another great way to learn Python is to run a Google search for the task you want to accomplish; you'll find many helpful methods on discussion boards and in online tutorials.

Two instances in which my custom Python scripts supported data analysis in co-authored scientific publications: (1) grouping treatment effects into defined patterns (PMID: 23300972), and (2) calculating the correlation between two separate treatments (PMID: 22067320).

Instance 1, grouping treatment effects into patterns:
When biologists conduct transcriptome analysis, whether microarray or RNA-seq, the basic question is whether their treatment(s) caused gene transcripts to increase or decrease in abundance (a.k.a. "up-regulation" and "down-regulation"). My task was to organize treatment effects into pre-defined patterns of down-, up- or no regulation (e.g., 0.5-, 2- and 1-fold change relative to control). These 3 transcript abundance effects can then be grouped into all possible combinations across treatments. For example, 3 separate treatments (A, B, C) with 3 possible effects each equates to 27 possible combinations (3^3):



Pattern     A    B    C
pattern_0   0.5  0.5  0.5
pattern_1   0.5  0.5  1
pattern_2   0.5  0.5  2
pattern_3   0.5  1    0.5
pattern_4   0.5  1    1
pattern_5   0.5  1    2
pattern_6   0.5  2    0.5
pattern_7   0.5  2    1
pattern_8   0.5  2    2
pattern_9   1    0.5  0.5
pattern_10  1    0.5  1
pattern_11  1    0.5  2
pattern_12  1    1    0.5
pattern_13  1    1    1
pattern_14  1    1    2
pattern_15  1    2    0.5
pattern_16  1    2    1
pattern_17  1    2    2
pattern_18  2    0.5  0.5
pattern_19  2    0.5  1
pattern_20  2    0.5  2
pattern_21  2    1    0.5
pattern_22  2    1    1
pattern_23  2    1    2
pattern_24  2    2    0.5
pattern_25  2    2    1
pattern_26  2    2    2
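
As an aside, the pattern table itself need not be typed by hand; here's a minimal sketch (not part of the original scripts) that generates all 27 combinations with itertools.product, in the same order as the table above:

from itertools import product

effects = [0.5, 1, 2] # down-, no- and up-regulation fold changes

# enumerate all 3^3 = 27 ordered combinations across treatments A, B and C
for i, combo in enumerate(product(effects, repeat=3)):
    print("pattern_%d" % i, *combo, sep="\t")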

A heatmap version of this pattern table was also derived through MeV (MultiExperiment Viewer).


For example, we could determine the closest matching pattern to an observed treatment effect of 1.56-, 2.7- and 0.24-fold change relative to the control. Using Euclidean distance as the metric of similarity between patterns, we calculate the distances between the treatment effect vector (1.56, 2.7, 0.24) and all possible pattern vectors (see Table P1 below); the minimal Euclidean distance identifies the best match.
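
For instance, for pattern_24 (2, 2, 0.5) the distance works out as:

sqrt((1.56 - 2)² + (2.7 - 2)² + (0.24 - 0.5)²) = sqrt(0.1936 + 0.49 + 0.0676) = sqrt(0.7512) ≈ 0.8667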

Table P1. Euclidean distances for all comparisons (sorted from minimum to maximum Euclidean distance).

Pattern     Vector         Euclidean_distance
pattern_24  2, 2, 0.5      0.866717947
pattern_15  1, 2, 0.5      0.933380951
pattern_25  2, 2, 1        1.123031611
pattern_16  1, 2, 1        1.175244655
pattern_6   0.5, 2, 0.5    1.296610967
pattern_7   0.5, 2, 1      1.480270246
pattern_21  2, 1, 0.5      1.775161964
pattern_12  1, 1, 0.5      1.808645902
pattern_22  2, 1, 1        1.913426246
pattern_13  1, 1, 1        1.944530792
pattern_26  2, 2, 2        1.944530792
pattern_17  1, 2, 2        1.975145564
pattern_3   0.5, 1, 0.5    2.02019801
pattern_4   0.5, 1, 1      2.142708566
pattern_8   0.5, 2, 2      2.170529889
pattern_18  2, 0.5, 0.5    2.258583627
pattern_9   1, 0.5, 0.5    2.28499453
pattern_19  2, 0.5, 1      2.368797163
pattern_10  1, 0.5, 1      2.393992481
pattern_0   0.5, 0.5, 0.5  2.455850158
pattern_23  2, 1, 2        2.486201923
pattern_14  1, 1, 2        2.510219114
pattern_1   0.5, 0.5, 1    2.557576978
pattern_5   0.5, 1, 2      2.666683333
pattern_20  2, 0.5, 2      2.851525907
pattern_11  1, 0.5, 2      2.872490209
pattern_2   0.5, 0.5, 2    3.010182719



We can see that the best matching pattern is "pattern_24" (2, 2, 0.5), which makes sense since the treatment effect vector is ("up", "up", "down").

It would be preferable not to calculate these distances one by one but to automate the task, which Python (or another scripting language) can easily accomplish.

The steps below depict the methodical building of code, culminating in a final script that takes two input files ("2-fold_3-by-3_patterns.txt" and "HepG2_significant-T3_6h.txt") and determines the 'matching' pattern (minimal Euclidean distance) for each gene transcript fold change vector. Following the "building blocks" concept, each step repeats the previous step with additional code appended.

Step 1: build a function that calculates Euclidean distance when provided 2 vectors of the same length:


from math import sqrt

pattern = [2, 2, 0.5]
result = [1.56, 2.7, 0.24]

def euclidean_distance(a, b):
    squared_diffs = []
    for i in range(len(a)):
        x, y = a[i], b[i] # extract the two ordinates
        diff_sqrd = (x - y) * (x - y) # squared difference for this ordinate
        squared_diffs.append(diff_sqrd)
    distance = sqrt(sum(squared_diffs)) # square root of the summed squared differences
    return distance
    
print(euclidean_distance(result, pattern)) 

Result:
0.8667179472008182

If the above looks foreign to you, there's a fantastic site, Visualize Python, that will walk you through the program step by step. Simply copy and paste the code above into the website to observe the calculation of the Euclidean distance between the treatment effect (result) and pattern_24 (pattern) vectors. You will need to specify that the code is written in Python 3.

Note the indentation of the Python script, which defines the order and procedural relations (steps) of the program. The indentation reminds me of hierarchical outlines, another feature of Python that I find appealing (intuitive); the code reads much like pseudocode, making a mental walk-through of the program much easier.

Step 2: Add all possible patterns to the patterns list (now a multidimensional list), and loop over the patterns, using the Euclidean distance function to calculate the distance between the single result vector and each pattern:



from math import sqrt

results = [1.56, 2.7, 0.24]

patterns = [[0.5, 0.5, 0.5], [0.5, 0.5, 1], [0.5, 0.5, 2], [0.5, 1, 0.5], [0.5, 1, 1], [0.5, 1, 2], [0.5, 2, 0.5], [0.5, 2, 1], [0.5, 2, 2], [1, 0.5, 0.5], [1, 0.5, 1], [1, 0.5, 2], [1, 1, 0.5], [1, 1, 1], [1, 1, 2], [1, 2, 0.5], [1, 2, 1], [1, 2, 2], [2, 0.5, 0.5], [2, 0.5, 1], [2, 0.5, 2], [2, 1, 0.5], [2, 1, 1], [2, 1, 2], [2, 2, 0.5], [2, 2, 1], [2, 2, 2]]

def euclidean_distance(a, b):
    squared_diffs = []
    for i in range(len(a)):
        x, y = a[i], b[i] # extract the two ordinates
        diff_sqrd = (x - y) * (x - y) # squared difference for this ordinate
        squared_diffs.append(diff_sqrd)
    distance = sqrt(sum(squared_diffs)) # square root of the summed squared differences
    return distance

print("results", "pattern", "Euclidean Distance", sep="\t")

for p in patterns:
    print(results, p, round(euclidean_distance(results, p), 4), sep="\t")

Result:
results pattern Euclidean Distance
[1.56, 2.7, 0.24] [0.5, 0.5, 0.5] 2.4559
[1.56, 2.7, 0.24] [0.5, 0.5, 1] 2.5576
[1.56, 2.7, 0.24] [0.5, 0.5, 2] 3.0102
.
.
.
[1.56, 2.7, 0.24] [2, 2, 0.5] 0.8667
[1.56, 2.7, 0.24] [2, 2, 1] 1.123
[1.56, 2.7, 0.24] [2, 2, 2] 1.9445

Here, I introduce a loop (the final two lines of the script) that plugs each individual pattern 'p' into the euclidean_distance function, calculating the distance between each 'p' and the single result vector. This script was used to derive Table P1 above.

Step 3: Add additional result vectors to the results list (also a multidimensional list), looping through both 2D lists, calculating Euclidean distance for each comparison:



from math import sqrt

results = [[1.56, 2.7, 0.24], [2, 4, 0.6], [1.2, 0.8, 4], [1.3, 6, 2.5], [6.5, 1, 0.9], [0.43, 2.1, 2], [0.88, 0.67, 0.5]]

patterns = [[0.5, 0.5, 0.5], [0.5, 0.5, 1], [0.5, 0.5, 2], [0.5, 1, 0.5], [0.5, 1, 1], [0.5, 1, 2], [0.5, 2, 0.5], [0.5, 2, 1], [0.5, 2, 2], [1, 0.5, 0.5], [1, 0.5, 1], [1, 0.5, 2], [1, 1, 0.5], [1, 1, 1], [1, 1, 2], [1, 2, 0.5], [1, 2, 1], [1, 2, 2], [2, 0.5, 0.5], [2, 0.5, 1], [2, 0.5, 2], [2, 1, 0.5], [2, 1, 1], [2, 1, 2], [2, 2, 0.5], [2, 2, 1], [2, 2, 2]]

def euclidean_distance(a, b):
    squared_diffs = []
    for i in range(len(a)):
        x, y = a[i], b[i] # extract the two ordinates
        diff_sqrd = (x - y) * (x - y) # squared difference for this ordinate
        squared_diffs.append(diff_sqrd)
    distance = sqrt(sum(squared_diffs)) # square root of the summed squared differences
    return distance

print("results", "pattern", "Euclidean Distance", sep="\t")

for r in results:
    for p in patterns:
        print(r, p, round(euclidean_distance(r, p), 4), sep="\t")

Result:
results pattern Euclidean Distance
[1.56, 2.7, 0.24] [0.5, 0.5, 0.5] 2.4559
[1.56, 2.7, 0.24] [0.5, 0.5, 1] 2.5576
[1.56, 2.7, 0.24] [0.5, 0.5, 2] 3.0102
.
.
.
[0.88, 0.67, 0.5] [2, 2, 0.5] 1.7388
[0.88, 0.67, 0.5] [2, 2, 1] 1.8092
[0.88, 0.67, 0.5] [2, 2, 2] 2.2964

As you can see in the nested loop at the end of the script, we now pass each individual result vector 'r' and each pattern vector 'p' into the euclidean_distance function to calculate the distance. (There are good YouTube videos covering the basics of nested loops.) The above script produces 7 x 27 = 189 results. Now you can see how valuable programmatic scripting can be: just imagine analyzing the results of 500 gene transcripts against these 27 patterns (13,500 Euclidean distances!).

Step 4: Identifying the minimal Euclidean distance from the comparisons between all patterns and a single result vector.


from math import sqrt

results = [[1.56, 2.7, 0.24]]

patterns = [[0.5, 0.5, 0.5], [0.5, 0.5, 1], [0.5, 0.5, 2], [0.5, 1, 0.5], [0.5, 1, 1], [0.5, 1, 2], [0.5, 2, 0.5], [0.5, 2, 1], [0.5, 2, 2], [1, 0.5, 0.5], [1, 0.5, 1], [1, 0.5, 2], [1, 1, 0.5], [1, 1, 1], [1, 1, 2], [1, 2, 0.5], [1, 2, 1], [1, 2, 2], [2, 0.5, 0.5], [2, 0.5, 1], [2, 0.5, 2], [2, 1, 0.5], [2, 1, 1], [2, 1, 2], [2, 2, 0.5], [2, 2, 1], [2, 2, 2]]

def euclidean_distance(a, b):
    squared_diffs = []
    for i in range(len(a)):
        x, y = a[i], b[i] # extract the two ordinates
        diff_sqrd = (x - y) * (x - y) # squared difference for this ordinate
        squared_diffs.append(diff_sqrd)
    distance = sqrt(sum(squared_diffs)) # square root of the summed squared differences
    return distance
    
for r in results:
    euc = []
    for p in patterns:
        euc.append(euclidean_distance(r, p))
    print(patterns[euc.index(min(euc))], round(min(euc), 4), sep='\t') 

Result:
[2, 2, 0.5] 0.8667

The above script extends the script provided in Step 2, adding code that identifies and pulls out the minimal Euclidean distance. To accomplish this I first initialize a new empty list ("euc = []"). As the loop proceeds, the euclidean_distance function produces the distances, which are appended to the "euc" list. Once all distances are added, the minimal (min) Euclidean distance is identified within the "euc" list and its index (location) noted, by which I grab the best-matching pattern from the original "patterns" list (here, [2, 2, 0.5] = index 24). Important note: Python indices start at 0, not 1, counting left to right, which is why the tables above run from "pattern_0" through "pattern_26".
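
As an aside, the same best-match lookup can be written more compactly with Python's built-in min() and a key function; a sketch, assuming the same results, patterns and euclidean_distance definitions as above:

for r in results:
    best = min(patterns, key=lambda p: euclidean_distance(r, p)) # the pattern with the smallest distance to r
    print(best, round(euclidean_distance(r, best), 4), sep='\t')

This avoids building the intermediate "euc" list, at the cost of hiding the index bookkeeping.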

Step 5: Obtaining the minimal Euclidean distance for multiple results:

Because one wants to associate a specific results vector (i.e., gene transcript) with its 'matching' pattern, I need to introduce the Python dictionary data type. The two features of a dictionary are the "key" and the "value". Each key has its own value(s), and keys must be unique. Here, the key is the gene transcript and the value is the results vector.
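
A minimal sketch of the key/value idea, using two of the gene entries from the script below:

results_dictionary = {"SPOCK": [1.56, 2.7, 0.24], "LCTL": [2, 4, 0.6]}

print(results_dictionary["SPOCK"]) # look up a value by its key -> [1.56, 2.7, 0.24]

for gene, fold_changes in results_dictionary.items(): # iterate over key-value pairs
    print(gene, fold_changes, sep='\t')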


from math import sqrt

results_dictionary = {"SPOCK": [1.56, 2.7, 0.24], "LCTL": [2, 4, 0.6], "PYCARD": [1.2, 0.8, 4], "DLGAP1": [1.3, 6, 2.5], "TRIB1":[6.5, 1, 0.9], "VANGL1": [0.43, 2.1, 2], "SMURF1": [0.88, 0.67, 0.5]}

patterns = [[0.5, 0.5, 0.5], [0.5, 0.5, 1], [0.5, 0.5, 2], [0.5, 1, 0.5], [0.5, 1, 1], [0.5, 1, 2], [0.5, 2, 0.5], [0.5, 2, 1], [0.5, 2, 2], [1, 0.5, 0.5], [1, 0.5, 1], [1, 0.5, 2], [1, 1, 0.5], [1, 1, 1], [1, 1, 2], [1, 2, 0.5], [1, 2, 1], [1, 2, 2], [2, 0.5, 0.5], [2, 0.5, 1], [2, 0.5, 2], [2, 1, 0.5], [2, 1, 1], [2, 1, 2], [2, 2, 0.5], [2, 2, 1], [2, 2, 2]]

def euclidean_distance_part(a, b):
    squared_diffs = []
    for i in range(len(a)):
        x, y = a[i], b[i] # extract the two ordinates
        diff_sqrd = (x - y) * (x - y) # squared difference for this ordinate
        squared_diffs.append(diff_sqrd)
    distance = sqrt(sum(squared_diffs)) # square root of the summed squared differences
    return distance

print("gene", "results", "pattern_number", "pattern", "euclidean_distance", sep='\t')

for gene, results in results_dictionary.items():
    euc = []
    for p in patterns:
        euc.append(euclidean_distance_part(results, p))
    print(gene, results, euc.index(min(euc)), patterns[euc.index(min(euc))], round(min(euc), 4), sep ='\t')

Result:
gene results pattern_number pattern euclidean_distance
SPOCK [1.56, 2.7, 0.24] 24 [2, 2, 0.5] 0.8667
DLGAP1 [1.3, 6, 2.5] 17 [1, 2, 2] 4.0423
SMURF1 [0.88, 0.67, 0.5] 9 [1, 0.5, 0.5] 0.2081
PYCARD [1.2, 0.8, 4] 14 [1, 1, 2] 2.0199
VANGL1 [0.43, 2.1, 2] 8 [0.5, 2, 2] 0.1221
TRIB1 [6.5, 1, 0.9] 22 [2, 1, 1] 4.5011
LCTL [2, 4, 0.6] 24 [2, 2, 0.5] 2.0025

The above script simply loops through each key and value (gene and results vector, respectively) against each pattern, derives the 'matching' pattern (minimal Euclidean distance), and prints a tab-delimited result for each key-value pair.

Step 6: Reading the results and patterns from input files, and looping through all patterns and results to calculate the Euclidean distances:
This step simply reads the results and patterns from 2 separate files, populating the "results_dict" dictionary and "patterns" list with their contents. The files are tab-delimited text files, which I am calling '2-fold_3-by-3_patterns.txt' and 'HepG2_significant-T3_6h.txt', culled from PMID: 23300972.

Here's what the files look like (note: no headers on the first line of either file):

2-fold_3-by-3_patterns.txt

0.5 0.5 0.5
0.5 0.5 1
0.5 0.5 2
0.5 1 0.5
.
.
.
2 1 2
2 2 0.5
2 2 1
2 2 2

HepG2_significant-T3_6h.txt
IL4R 1.024474886 2.219374904 1.49
PFKFB4 1.104901159 3.226603156 2.15
UBASH3B 0.792117728 0.307732139 0.528
EGFLAM 0.959569797 1.520448169 2.48
ACSL3 0.95956302 1.75749976 2.2
.
.
.
CLIP1 1.229766122 1.791323367 2.24
DNMT3L 2.353985909 2.163907179 3.01
ABCC6 1.244322488 1.759081551 2.04
PLAUR 0.934067161 2.489250673 3.2
TOX3 1.066779442 0.466628681 0.505


from math import sqrt

pattern_file = open("2-fold_3-by-3_patterns.txt", 'r').readlines()

results_file = open("HepG2_significant-T3_6h.txt", 'r').readlines()

patterns = []
for line in pattern_file:
    temp = line.split()
    patterns.append(list(map(float, temp))) # convert each whitespace-split field to float

results_dict = {}
for line in results_file:
    temp = line.split()
    results_dict[temp[0]] = list(map(float, temp[1:])) # key = gene, value = fold change vector

def euclidean_distance_part(a, b):
    squared_diffs = []
    for i in range(len(a)):
        x, y = a[i], b[i] # extract the two ordinates
        diff_sqrd = (x - y) * (x - y) # squared difference for this ordinate
        squared_diffs.append(diff_sqrd)
    distance = sqrt(sum(squared_diffs)) # square root of the summed squared differences
    return distance

print("gene", "results", "pattern_number", "pattern", "euclidean_distance", sep='\t')

for gene, results in results_dict.items():
    euc = []
    for p in patterns:
        euc.append(euclidean_distance_part(results, p))
    print(gene, results, euc.index(min(euc)), patterns[euc.index(min(euc))], round(min(euc), 4), sep='\t')

Instance 2, correlating treatment effects:
I was tasked with determining the similarity (or difference) between two Thyroid Hormone Receptor ligands (one natural = T3, the other synthetic = GC-1). Perusing the internet and refereed articles, I found that correlation metrics such as Spearman's rank correlation coefficient and the Pearson correlation have been used for such a task (e.g., determining the similarity in fold-change effect between 2 different treatments).

I used Python to create a Spearman's rank correlation coefficient function as outlined within the supplemental section of the paper (PMID: 22067320).

Spearman's rank correlation coefficient analysis
To determine whether there are differences between the microarray data received from the T3 and GC-1 treatments, we used Spearman's rank order correlation coefficient, rs:

rs = 1 - (6 * Σd²) / (n * (n² - 1))

where d is the difference between the probe-set rank in one dataset (T3) and that of the other (GC-1), and n is the number of probe-sets being compared.

We decided to conduct the analysis with Affymetrix probe-set identifications instead of aggregating probe-sets into unified identifiers (e.g., Unigene) to eliminate the impact of splice variants. 
Affymetrix probe-sets for both datasets were ordered based upon fold induction of ligand-treated versus vehicle. We analyzed both the entire Affymetrix probe-set list ("complete probe-set list") and the probe-sets receiving an adjusted p-value <= 0.05 in both datasets ("significant probe-sets").
To determine if the calculated correlation coefficient, rs, is greater than would be expected by chance, we performed empirical permutation analysis, where the rank ordering of the genes in one list is randomly shuffled, followed by rs calculation. The permutation analysis was done 10,000 times using a custom Python script. The p-value is simply the proportion of times that the rs value in the shuffled list is greater than or equal to the correlation coefficient calculated for the experimental dataset. 
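
As a compact illustration of the rs formula above (a sketch, not the original publication script), assuming the two inputs are already rank lists of equal length n:

def spearman_rs(ranks_a, ranks_b):
    n = len(ranks_a)
    sum_d_sqr = sum((a - b)**2 for a, b in zip(ranks_a, ranks_b)) # sum of squared rank differences
    return 1 - (6 * sum_d_sqr) / (n * (n**2 - 1))

print(spearman_rs([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])) # 0.8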

Using this procedure in the aforementioned paper, I obtained a highly significant p-value (< 0.0001 = 1/10,000), because in no instance did the permuted rankings produce a greater correlation than that calculated for the experimental set.

For the current example, I will use another study we published that compared the anti-androgen effects of a clinically utilized drug, flutamide, to a novel chemistry, MJC13 (Suh et al. Similarities and Distinctions in Actions of Surface-Directed and Classic Androgen Receptor Antagonists. PLoS One 2015, 10(9):e0137103; PMID: 26332122). The commonly used prostate cancer cell line, LNCaP, was treated with DHT (dihydrotestosterone) alone or in combination with one of the 2 anti-androgen chemistries. As can be seen in the line plot below, the anti-androgens behaved very similarly.

[Line plot: fold-change effects of flutamide versus MJC13 in DHT-treated LNCaP cells]
To test how significant this correlation is, we can employ Spearman's rank correlation coefficient and permutation analysis via my custom Python script.

Input is a 3-column, tab-delimited text file (Flutamide_MJC13_Effects.txt):

ILMN_1671486 0.619053839 0.655653049
ILMN_1661637 0.546773417 0.593758193
ILMN_2211739 0.628942245 0.614651425
ILMN_1687768 1.232997126 1.121925249
ILMN_1713952 0.639418101 0.567393595
ILMN_1731785 0.296007271 0.320239937
ILMN_2270015 0.740351356 0.732066055
.
.
.


import random

from decimal import Decimal

file = open('Flutamide_MJC13_Effects.txt', 'r').readlines()

data_1 = [] # a nested 2-dimensional list that will be appended with column1 (probe_ID) and column2 (foldChange_1)
data_2 = [] # a nested 2-dimensional list that will be appended with column1 (probe_ID) and column3 (foldChange_2)
for line in file:
    temp = line.split()
    data_1.append([temp[0], float(temp[1])])
    data_2.append([temp[0], float(temp[2])])

foldChange_1 = [] # a list that will be appended with each foldChange_1
for d_1 in data_1:
    foldChange_1.append(d_1[1])

sorted_foldChange_1 = sorted(foldChange_1) # sort foldChange_1 list from lowest to highest

foldChange_2 = [] # a list that will be appended with each foldChange_2
for d_2 in data_2:
    foldChange_2.append(d_2[1])

sorted_foldChange_2 = sorted(foldChange_2) # sort foldChange_2 list from lowest to highest

probes_ranked_1 = [] # a list that will be appended with data_1 probe_IDs ranked in order according to sorted_foldChange_1
for s_fc_1 in sorted_foldChange_1:
    for d_1 in data_1:
        if s_fc_1 == d_1[1]:
            probes_ranked_1.append(d_1[0])

rankings_2 = [] # a nested 2-dimensional list that will be appended with probe_ID and rank according to sorted_foldChange_2
count = 0
for s_fc_2 in sorted_foldChange_2:
    for d_2 in data_2:
        if s_fc_2 == d_2[1]:
            count = count + 1
            rankings_2.append([d_2[0], count])

relative_rankings_2 = [] # a list of the rankings (index locations) of data_2 (rankings_2) relative to data_1 (probes_ranked_1), used for the subsequent correlation analysis
for probes_1 in probes_ranked_1:
    for r_2 in rankings_2:
        if probes_1 == r_2[0]:
            relative_rankings_2.append(rankings_2.index(r_2) + 1)

straight_rank = [] # a list of numbers 1-81 (the length of the current input); used for the subsequent experimental (actual) and permuted correlation analyses
count = 0
for p in range(len(probes_ranked_1)):
    count = count + 1
    straight_rank.append(count)

def subtracted(a, b): # a function that takes 2 input lists and calculates the difference at each index (i.e., list_1[0] - list_2[0], list_1[1] - list_2[1], etc.), returning the differences as a list
    diffs = []
    for i in range(len(a)):
        x, y = a[i], b[i] # extract the two ordinates
        diffs.append(x - y)
    return diffs

def spearmans(a, b): # a function that takes two rank lists and calculates and returns the Spearman's rank correlation coefficient
    differences = subtracted(a, b) # using the above 'subtracted' function to derive differences
    sqr_diffs = []
    for d in differences:
        sqr_diffs.append(d**2)
    sum_sqr_diffs = sum(sqr_diffs)
    numerator = 6 * sum_sqr_diffs
    list_length = len(a)
    denominator = list_length * (list_length**2 - 1)
    spears = Decimal(1 - (Decimal(numerator) / Decimal(denominator)))
    return spears

SRCC = spearmans(straight_rank, relative_rankings_2) # calling the spearmans function

print("Spearman's rank correlation coefficient = ", float(SRCC))

copy = straight_rank[:] # creating a copy of the original 'straight_rank' list (1-81), so the copy can be permuted, leaving the 'straight_rank' list alone

permuted = [] # a nested 2-dimensional list that will harbor all the permuted rank lists (here, 1000 total)
count = 0
permutations = 1000
while count < permutations:
    count += 1
    random.shuffle(copy) # shuffle the copied rank list in place
    permuted.append(copy[:]) # store a snapshot of the current shuffle

results = [] # a list harboring all calculated Spearman's rank correlation coefficients for the straight_rank against each permuted rank
for shuffled_rank in permuted:
    results.append(spearmans(straight_rank, shuffled_rank))

greater = 0 # count the instances where the SRCC of the permuted data is greater than or equal to the actual ("experimental") SRCC
for r in results:
    if abs(r) >= SRCC:
        greater = greater + 1

if greater == 0:
    print("p-value < ", 1/permutations) # printed if no permuted-rank SRCC reached the actual SRCC; the p-value is then bounded by 1/(number of permutations)
else:
    print("p-value = ", Decimal(greater/permutations)) # printed if at least one permuted-rank SRCC was >= the actual SRCC; p-value = (count >= actual)/(number of permutations)

Result:
Spearman's rank correlation coefficient =  0.9459801264679314
p-value <  0.001
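
If you have SciPy installed, you can cross-check the coefficient (an assumption on my part; SciPy was not used in the original analysis). Note that scipy.stats.spearmanr ranks the raw fold changes itself and returns an asymptotic p-value rather than a permutation p-value:

from scipy.stats import spearmanr

rho, p_value = spearmanr(foldChange_1, foldChange_2) # the fold change lists as read in by the script above
print(rho, p_value)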
