Statistical Genetics Project: Paper Demo

library(ggplot2)
library(PLIS)
data(GWAS.SampleData)
data = GWAS.SampleData
head(data)

     CHR        SNP      BP     OR  CHISQ       P
1017   1   rs707580 5659206 0.9657 0.1047 0.74630
1018   1   rs771071 5666739 0.7556 3.0630 0.08008
1019   1  rs1763330 5668279 0.8105 1.8840 0.16990
1020   1   rs770701 5677775 0.8158 2.1300 0.14440
1021   1   rs770702 5679187 0.8578 2.2960 0.12970
1022   1 rs10492949 5679229 1.4860 3.5220 0.06057

The sample dataset used in our demo lies in the PLIS package (Wei et al. 2009), which contains 400 SNPs, with 200 SNPs from chromosome 1 and the other 200 from chromosome 6. The variables in this dataset are:

CHR: Chromosome ID
SNP: SNP’s rs ID
BP: base-pair location (physical position)
OR: Odds Ratio, which is the ratio between the odds of individuals having a phenotype associated with a specific allele and the odds of the same phenotype for individuals who do not have that same allele
CHISQ: 1 degree of freedom Chi Square test Statistic
P: p-value of 1 degree of freedom Chi Square test Statistic

Before going into the details of how to carry out the PLIS method, we would like the introduce the idea behind LIS, proposed by Sun and Cai (Sun and Tony Cai 2009). LIS is the probability of whether a case is null given the observed data. In the GWAS context, it essentially carries out the below hypothesis test:

\(H_0\): given all the z-values(observations), this SNP is NOT associated with the disease

\(H_a\): given all the z-values(observations), this SNP is associated with the disease

The LIS procedure is not optimal since it only deals with a single Markov chain and cannot be applied directly to GWAS. Therefore, the PLIS method is supposed to be better given that it deals with the underlying LD among SNPs using Hidden Markov Model (HMM). The PLIS procedure is carried out as below:

Step 1: Gather p-values from Chi-Square tests
Step 2: Calculate z-values from p-values
Step 3: Given the distribution assumption: For non-associated SNPs, \(z \sim N(0,1)\); or associated SNPs, \(z \sim\) a normal mixture, which is a summation of different normal distribution. Based on the z-values of the observed data and this distribution assumption, apply EM algorithm to infer the HMM parameters from the genetics data (Ephraim and Merhav 2002).
Step 4: Based on the HMM model from last step, get LIS for each SNP, where LIS is the probability of whether the SNP is not associated with the disease given all of the observations and the parameters obtained from the HMM in Step 3.
Step 5: Rank the SNPs by LIS. The smaller LIS is, the more the SNP is associated with the disease.

The code below is done using the PLIS package. The PLIS package can be found here: https://cran.r-project.org/web/packages/PLIS/PLIS.pdf. Explanation for the code is attached as the hashtag comments.

# Create a new vector to store all Z-scores of the two-tailed z-test
ZSCORE = c()

# Find the z-score for each SNP 
for (i in 1:400) {
  # If Odd Ratio > 1, z-score is the integrated area above the p-value
  if(data$OR[i] > 1){
    ZSCORE[i] = qnorm(data$P[i]/2,0,1,lower.tail=FALSE)
  }
  # Otherwise, z-score is the integrated area below the p-value
  else{
    ZSCORE[i] = qnorm(data$P[i]/2,0,1,lower.tail=TRUE)
  }
}

# Append z-score to our dataset
data$ZSCORE = ZSCORE
head(data)

     CHR        SNP      BP     OR  CHISQ       P     ZSCORE
1017   1   rs707580 5659206 0.9657 0.1047 0.74630 -0.3235219
1018   1   rs771071 5666739 0.7556 3.0630 0.08008 -1.7502221
1019   1  rs1763330 5668279 0.8105 1.8840 0.16990 -1.3725252
1020   1   rs770701 5677775 0.8158 2.1300 0.14440 -1.4596001
1021   1   rs770702 5679187 0.8578 2.2960 0.12970 -1.5152860
1022   1 rs10492949 5679229 1.4860 3.5220 0.06057  1.8766214

# Divide the dataset into two separate datasets by chromosome 1 and 6
chr1=data[which(data[,"CHR"]==1),]
chr6=data[which(data[,"CHR"]==6),]

# Make sure SNPs in the linear physical order
chr1<-chr1[order(chr1[,"BP"]),]
chr6<-chr6[order(chr6[,"BP"]),]

chr1.zval <-rep(0, nrow(chr1))
chr1.ors=(chr1[,"OR"]>1)
chr1.zval[chr1.ors]

 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[34] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[67] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

# Use EM algorithm for HMM to estimate LIS statistic for chromosome 1
# L is the number of components in the non-null mixture, with default = 2
# Note that for analyzing a chromosome in real GWAS dataset, em.hmm can take as long as 10+ hrs

# Notations: 
# pii: the initial state distribution, pii=(prob. of being 0, prob. of being 1)
# A: transition matrix, A=(a00 a01\ a10 a11)
# f0: the null distribution
# pc: probability weights of each component in the non-null mixture
# f1: an L by 2 matrix, specifying the dist. of each component in the non-null mixture
# LIS: the LIS statistics ni the number of iterations excecuted
# logL: log likelihood
# BIC: BIC score for the estimated model
# converged: Logic, Convergence indicator of the EM procedure


chr1.L2rlts=em.hmm(chr1$ZSCORE, L = 4)
chr1.L2rlts

$pii
[1] 1.000000e+00 5.564545e-68

$A
          [,1]       [,2]
[1,] 0.9670906 0.03290937
[2,] 0.1241455 0.87585450

$pc
[1] 0.76263163 0.07889652 0.05256901 0.10590284

$f0
[1] 0 1

$f1
            mus        sds
[1,] -0.5119362 0.18951634
[2,] -1.0524945 0.01069731
[3,]  1.0485338 0.01780373
[4,]  1.2469554 0.05912999

$LIS
  [1] 1.000000000 1.000000000 0.999998495 0.999999810 0.999999954
  [6] 1.000000000 0.696264778 0.639115166 0.686551470 0.764953944
 [11] 0.999999763 0.999886354 0.980086646 0.999787193 0.919295092
 [16] 0.903343390 0.922080917 0.982146377 0.995549016 0.999964067
 [21] 0.999999816 0.635701850 0.531336474 0.539268861 0.579777670
 [26] 0.721597294 0.827624004 0.993833262 0.994811976 0.999999925
 [31] 0.999999998 0.999999951 0.999994059 1.000000000 1.000000000
 [36] 0.999803995 0.999951200 0.993398254 0.999999589 0.999985551
 [41] 0.994405106 0.999832759 0.845284261 0.679110938 0.658288490
 [46] 0.745203917 0.996762894 0.999962586 0.991570026 1.000000000
 [51] 0.999999917 0.999820547 0.999954798 0.999998132 0.998151455
 [56] 0.983879299 0.999970889 0.999931214 0.999533025 0.999954606
 [61] 0.999999999 0.995132194 0.992809728 1.000000000 0.999972275
 [66] 0.999999872 0.839342652 0.722434920 0.665474586 0.712872762
 [71] 0.999647213 0.658589234 0.616401751 0.671803686 0.970080698
 [76] 0.998538131 0.999912712 1.000000000 0.998179619 1.000000000
 [81] 1.000000000 0.999903472 1.000000000 0.994553209 0.995632948
 [86] 1.000000000 0.996227909 0.999996616 0.995513076 0.919107175
 [91] 0.922287025 0.999991925 1.000000000 0.999479393 1.000000000
 [96] 1.000000000 0.999999998 0.485276395 0.152087314 0.074863133
[101] 0.060858371 0.013830761 0.001749982 0.001086952 0.002358736
[106] 0.004240137 0.009253534 0.025024170 0.070492851 0.179534579
[111] 0.388934446 0.999951892 0.999501039 0.991065275 0.980572138
[116] 0.999996213 0.999999999 0.999996957 1.000000000 0.981312193
[121] 0.999961886 0.999994773 0.999999907 0.644643298 0.560032130
[126] 0.588609979 0.799138846 0.784790634 0.833969969 0.994761640
[131] 1.000000000 0.999999996 0.987281083 0.970557737 1.000000000
[136] 0.922603513 0.507646448 0.201857154 0.084530730 0.028200002
[141] 0.010849212 0.007316257 0.006418581 0.009928070 0.009015282
[146] 0.004999656 0.005349377 0.003210417 0.003432744 0.002565849
[151] 0.002795330 0.004932539 0.014204695 0.016948860 0.040250957
[156] 0.127485730 0.489343644 0.738450267 0.837433438 0.907303789
[161] 0.999863961 0.984561870 1.000000000 0.984251468 0.999674976
[166] 0.999999987 0.999968973 0.999994594 1.000000000 0.999985572
[171] 0.999985572 1.000000000 1.000000000 0.985896669 0.999970421
[176] 0.999999978 0.999974426 0.998425136 1.000000000 0.999995638
[181] 0.997874188 0.999326817 0.999712965 0.978510130 0.987310511
[186] 0.997006055 0.999947256 0.994181568 1.000000000 1.000000000
[191] 0.999542914 0.999999936 0.998853532 0.999919986 0.998642423
[196] 0.974763511 0.991863405 0.999999936 0.997008148 0.999999999

$logL
[1] -239.39

$BIC
[1] -276.4782

$ni
[1] 142

$converged
[1] TRUE

# Use EM algorithm for HMM to estimate LIS statistic for chromosome 6
chr6.L2rlts=em.hmm(chr6$ZSCORE, L = 4)
chr6.L2rlts

$pii
[1] 0 1

$A
          [,1]      [,2]
[1,] 0.8933271 0.1066729
[2,] 0.1158273 0.8841727

$pc
[1] 0.18821717 0.66140719 0.11012957 0.04024607

$f0
[1] 0 1

$f1
             mus        sds
[1,] -1.62822445 0.63658332
[2,] -0.35752796 0.54397107
[3,]  0.07734776 0.02112719
[4,]  0.76881090 0.01773143

$LIS
  [1] 0.00000000 0.03496822 0.05423095 0.17797525 0.31562589
  [6] 0.47984158 0.50080354 0.54846187 0.64741071 0.81564703
 [11] 0.60961201 0.45582568 0.35896689 0.29033129 0.24115920
 [16] 0.19885636 0.17188250 0.15350268 0.14296901 0.12739896
 [21] 0.10247231 0.06501071 0.01999060 0.01962258 0.02046288
 [26] 0.07393743 0.12075470 0.19247431 0.27031433 0.37584492
 [31] 0.53373312 0.77247134 0.85287033 0.99266508 0.98101335
 [36] 0.99948829 0.99094023 0.99930813 0.98102318 0.99985974
 [41] 0.99477836 0.92730624 0.88589196 0.91999628 0.99847841
 [46] 0.80853545 0.69762258 0.37902869 0.21299151 0.16694600
 [51] 0.13401234 0.16749447 0.20991323 0.26796961 0.34580802
 [56] 0.61095484 0.99948504 0.98042109 0.99997235 0.99890169
 [61] 0.99974140 0.96865814 0.94218498 0.84704905 0.80458967
 [66] 0.80216278 0.83937813 0.93167189 0.90251967 0.86023199
 [71] 0.85901510 0.90623040 0.89436540 0.87033520 0.89864683
 [76] 0.98427654 0.70143253 0.49525314 0.49013707 0.32177435
 [81] 0.20379565 0.23090952 0.25115975 0.26520298 0.22191864
 [86] 0.20535880 0.19230506 0.18345984 0.24021010 0.44663768
 [91] 0.97790114 0.99626549 0.96203670 0.99084580 0.98263710
 [96] 0.99185500 0.93534358 0.92469242 0.62459265 0.39092353
[101] 0.20903372 0.07216448 0.05758387 0.07664055 0.08630029
[106] 0.09357265 0.09379610 0.17299922 0.27550816 0.25968441
[111] 0.16496335 0.09057955 0.12271324 0.13868886 0.16177225
[116] 0.19504726 0.17496901 0.15792244 0.14721389 0.13528750
[121] 0.12825632 0.12237270 0.11290111 0.23847779 0.34537799
[126] 0.33044616 0.43259514 0.55033572 0.66904024 0.70438352
[131] 0.77479747 0.76603646 0.73368149 0.73963117 0.75936355
[136] 0.60419702 0.51329353 0.44396855 0.38394492 0.44857800
[141] 0.55442100 0.93934395 0.94623599 0.99696740 0.92955976
[146] 0.88836383 0.96986607 0.98858359 0.79017109 0.65339694
[151] 0.31241021 0.19718953 0.19582420 0.20638815 0.16765194
[156] 0.13661611 0.10801158 0.07231538 0.05618500 0.13972766
[161] 0.20602877 0.29491098 0.31162167 0.34908501 0.41955849
[166] 0.48308188 0.59252322 0.52173792 0.65738531 0.85678060
[171] 0.90776017 0.99502341 0.99917539 0.97150455 0.88384905
[176] 0.75160460 0.58198480 0.27261625 0.23449831 0.20733916
[181] 0.23576524 0.26992795 0.31695249 0.21072565 0.19661400
[186] 0.19511358 0.21029573 0.24545468 0.30376387 0.40473042
[191] 0.55644749 0.75708873 0.80652673 0.90248249 0.95976267
[196] 0.91050905 0.91237262 0.89081570 0.56858896 0.52743317

$logL
[1] -258.6625

$BIC
[1] -295.7507

$ni
[1] 1000

$converged
[1] TRUE

# em.hmm can be run in parallel for different chromosomes before applying the PLIS procedure (the plis function)

# plis controls the global FDR at 0.01 here for the pooled hypotheses from different groups
plis.rlts=plis(c(chr1.L2rlts$LIS,chr6.L2rlts$LIS),fdr=0.01)

# gFDR stands for global FDR, denotes the corresponding global FDR if using the LIS statistic as the significance cutoff
all.Rlts=cbind(rbind(chr1,chr6), LIS=c(chr1.L2rlts$LIS,chr6.L2rlts$LIS), gFDR=plis.rlts$aLIS, fdr001state=plis.rlts$States)

# show the 10 SNPs on both chromosomes with the lowest LIS
all.Rlts[order(all.Rlts[,"LIS"])[1:10],]

       CHR       SNP      BP     OR  CHISQ      P     ZSCORE
184078   6 rs4959835 3381880 0.8440 1.8620 0.1724 -1.3645326
1120     1 rs6696861 6313873 0.7308 1.0890 0.2967 -1.0435363
1119     1 rs4908877 6307740 0.8434 1.1050 0.2933 -1.0509098
1121     1 rs7545051 6322050 0.8636 0.2195 0.6394 -0.4685379
1166     1 rs6577584 6649656 0.9539 0.2358 0.6272 -0.4856716
1167     1 rs2281190 6655256 0.9493 0.2861 0.5928 -0.5347829
1164     1 rs2235564 6647380 0.9452 0.3380 0.5610 -0.5813568
1165     1  rs200440 6647548 0.9351 0.5034 0.4780 -0.7095230
1122     1 rs3789482 6331638 0.9272 0.4158 0.5190 -0.6448876
1168     1 rs6659873 6658072 0.9498 0.2834 0.5945 -0.5323263
               LIS         gFDR fdr001state
184078 0.000000000 0.0000000000           1
1120   0.001086952 0.0005434760           1
1119   0.001749982 0.0009456448           1
1121   0.002358736 0.0012989175           1
1166   0.002565849 0.0015523039           1
1167   0.002795330 0.0017594748           1
1164   0.003210417 0.0019667522           1
1165   0.003432744 0.0021500012           1
1122   0.004240137 0.0023822385           1
1168   0.004932539 0.0026372686           1

Ephraim, Yariv, and Neri Merhav. 2002. “Hidden Markov Processes.” IEEE Transactions on Information Theory 48 (6): 1518–69.

Sun, Wenguang, and T Tony Cai. 2009. “Large-Scale Multiple Testing Under Dependence.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 393–424.

Wei, Zhi, Wenguang Sun, Kai Wang, and Hakon Hakonarson. 2009. “Multiple testing in genome-wide association studies via hidden Markov models.” Bioinformatics 25 (21): 2802–8. https://doi.org/10.1093/bioinformatics/btp476.

Paper Demo

References