P Values
P-values are most commonly used in significance testing. Specifically, a p-value is the probability of observing a test statistic at least as extreme as yours, assuming that the default (null) hypothesis is true.
The p-value is central to GWAS because the “no-effect” hypothesis (i.e., the genetic variant does not influence the phenotype being studied) is thought to be true for the vast majority of genetic variants tested, and testing “effect” versus “no effect” is well-served by calculating a p-value.
Because only a tiny fraction of genetic variants are associated, alternative approaches, such as false discovery rate (FDR) methods and Bayesian approaches that estimate the proportion of true positives genome-wide, are less commonly used in GWAS: they would add little value on top of this simple, frequentist formulation.
Generally, we take a very small p-value as evidence against the null hypothesis, which is then “rejected”, because observing data this extreme under the null hypothesis would be very unlikely.
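As a minimal sketch of this definition, the snippet below converts a test statistic into a two-sided p-value. It assumes the statistic is a z-score that follows a standard normal distribution under the null hypothesis; the function name and the example z-values are illustrative and not taken from any GWAS software.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a z-statistic, assuming Z ~ N(0, 1) under the null."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2.0))

# Illustrative z-scores: no signal, a borderline signal, and a strong signal
for z in (0.0, 1.96, 5.45):
    print(f"z = {z:4.2f}  ->  p = {two_sided_p_from_z(z):.3g}")
```

The larger the absolute z-score, the less compatible the data are with the null hypothesis and the smaller the p-value.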
There are several concepts and considerations that should be taken into account when using p-values:
● The null hypothesis
The default null hypothesis in genetic studies is that your variant of interest does not influence (has an effect size of 0 on) your desired phenotype. In this way, each of your variants will have its own null hypothesis if you are testing more than one.
● Possible errors
In statistics, two types of error are typically distinguished: a Type I error (a false positive), where the null hypothesis is rejected despite being true, and a Type II error (a false negative), where the null hypothesis is not rejected despite being false.
● The significance threshold α
In many contexts, the standard significance threshold (α) for p-values is 0.05, or 1 in 20: all p-values below it are flagged as potentially inconsistent with the null hypothesis. However, in a GWAS we perform association tests on millions of variants, and with such a liberal threshold roughly 1 in every 20 tests of a variant with no true effect would falsely reject the null hypothesis. Therefore, the “genome-wide significant” threshold is typically around 5 × 10⁻⁸ (which is where you’ll see a line when browsing Manhattan plots on the FinnGen PheWeb).
For a derivation of this threshold, which corresponds to 0.05 / 1 million independent tests, see Pe’er et al. Genet Epidemiol. 2008 May;32(4):381-5. PMID: 18348202
● P-value corrections
Although they add another level of computation, corrections to your p-values, in their many forms, are very important. There are numerous methods to improve the accuracy of your statistics (the family-wise error rate correction for α and the Bonferroni correction for p are two of the most common). Scaling approaches may also be applied to the whole distribution of test statistics when there is systematic inflation, which might arise, for example, from uncorrected population structure or cryptic relatedness.
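The threshold arithmetic and the corrections described in the points above can be sketched as follows. This is a toy illustration, not the FinnGen pipeline: the count of 1 million independent tests comes from the derivation cited above, the p-values are made up, and the inflation factor shown (often called genomic control λ) is one commonly used way to quantify systematic inflation that the text itself does not name. The sketch assumes Python 3.8+ for `statistics.NormalDist` and p-values strictly between 0 and 1.

```python
import math
from statistics import NormalDist, median

STD_NORMAL = NormalDist()

# --- Threshold arithmetic ---
alpha = 0.05                      # conventional single-test threshold
n_independent_tests = 1_000_000   # assumed number of independent tests
genome_wide_threshold = alpha / n_independent_tests
# Expected false positives if every null were true and alpha were left uncorrected
expected_false_positives = alpha * n_independent_tests
# Chance of at least one false positive at the corrected threshold,
# assuming independent tests and all nulls true: 1 - (1 - t)^m
fwer_at_threshold = -math.expm1(
    n_independent_tests * math.log1p(-genome_wide_threshold)
)

def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def genomic_inflation_factor(p_values):
    """Median observed 1-df chi-square statistic over its expected median."""
    # Convert each two-sided p-value back to a chi-square statistic (z squared)
    chi2_stats = [STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    # Median of a chi-square(1) distribution, roughly 0.455
    expected_median = STD_NORMAL.inv_cdf(0.75) ** 2
    return median(chi2_stats) / expected_median

toy_p_values = [0.5, 0.01, 2e-9, 0.8, 0.3]  # made-up values for illustration
print("genome-wide threshold:", genome_wide_threshold)
print("expected false positives at alpha = 0.05:", expected_false_positives)
print("family-wise error rate at corrected threshold:", fwer_at_threshold)
print("Bonferroni-adjusted:", bonferroni_adjust(toy_p_values))
print("inflation factor:", genomic_inflation_factor(toy_p_values))
```

An inflation factor close to 1 suggests the statistics behave as expected under the null for most variants; values well above 1 hint at systematic inflation of the kind described above.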
A common misconception is that the p-value is the probability that the null hypothesis is true, or that the p-value represents the effect size in your data. Neither of these is true: under this frequentist formulation, there is no way to calculate the probability that the null hypothesis is true, and the effect size (represented by β, which corresponds to log(OR)) is a completely separate parameter. Generally, β is fixed at 0 under the null and is then estimated by maximizing the likelihood of the data in the alternative model that is tested against the null. P-values speak only to the probability of observing data at least as extreme as yours under the null hypothesis.
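To make the separation between effect size and p-value concrete, here is a small sketch in which the same β gives very different p-values depending only on its standard error. The β and standard error values are hypothetical, and the p-value is computed from the Wald statistic z = β / SE under a normal approximation.

```python
import math

def wald_p_value(beta: float, se: float) -> float:
    """Two-sided p-value for z = beta / se, assuming z ~ N(0, 1) under the null."""
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2.0))

beta = math.log(1.2)  # hypothetical effect size: log(OR) for an odds ratio of 1.2

p_large_se = wald_p_value(beta, se=0.10)  # e.g. a small study or a rare variant
p_small_se = wald_p_value(beta, se=0.02)  # e.g. a much larger study

print(f"beta = {beta:.3f}, SE = 0.10 -> p = {p_large_se:.3g}")
print(f"beta = {beta:.3f}, SE = 0.02 -> p = {p_small_se:.3g}")
```

The effect size is identical in both cases; only the precision of its estimate changes, and with it the p-value.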
Three ways to compute a p-value
● Score test.
● Wald’s test. This is the default p-value returned by R’s summary() function.
● Likelihood ratio test.
For GWAS, software such as PLINK and SAIGE efficiently provide multiple tests that can be run across all variants in the genome.
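The three tests listed above can be compared on a deliberately simple problem. The sketch below is not how PLINK or SAIGE implement them; it assumes a toy setting of testing whether a binomial proportion equals a null value p0, where each statistic has a closed form and is referred to a chi-square distribution with 1 degree of freedom. The function names and the counts are illustrative.

```python
import math

def chi2_1df_p_value(stat: float) -> float:
    """Upper-tail probability of a chi-square(1) variable at `stat`."""
    # For X ~ chi-square(1), P(X >= stat) = erfc(sqrt(stat / 2))
    return math.erfc(math.sqrt(stat / 2.0))

def three_tests(k: int, n: int, p0: float) -> dict:
    """Wald, score and likelihood ratio tests of H0: p = p0 (requires 0 < k < n)."""
    p_hat = k / n  # maximum likelihood estimate of the proportion
    # Wald: variance evaluated at the estimate
    wald = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)
    # Score: variance evaluated at the null value
    score = (p_hat - p0) ** 2 / (p0 * (1 - p0) / n)
    # Likelihood ratio: twice the log-likelihood gain of the estimate over the null
    lrt = 2 * (k * math.log(p_hat / p0)
               + (n - k) * math.log((1 - p_hat) / (1 - p0)))
    stats = {"wald": wald, "score": score, "lrt": lrt}
    return {name: (stat, chi2_1df_p_value(stat)) for name, stat in stats.items()}

# Illustrative data: 60 successes in 100 trials, null proportion 0.5
results = three_tests(k=60, n=100, p0=0.5)
for name, (stat, p) in results.items():
    print(f"{name:5s}  statistic = {stat:.3f}  p = {p:.4f}")
```

The three statistics differ slightly in finite samples because they evaluate the evidence at different points (the estimate, the null, or both), but they agree asymptotically.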
Additional Reading
Matti Pirinen’s GWAS course notes, Week 2.