Variant PheWAS

Analyzing phenotype data for individual variant(s)

Some of the variants in FinnGen are too rare to be imputed and are therefore not available in PheWeb, so you need a different analysis to look at the phenotypes associated with them.

Using the Genotype Browser, we can download the individuals with a certain genotype and then conduct a PheWAS analysis to examine which medical codes or FinnGen endpoints might be enriched in that set of individuals. To test whether their rates differ significantly from the background rate in FinnGen (all FinnGen individuals) or from a control cohort (e.g. persons not in cases), one might use a test suitable for small counts, such as Fisher's exact test. As thousands of correlated endpoints and codes might be examined in such an exploratory analysis, top results around p = 0.001 may be expected by chance and are therefore not necessarily meaningful.
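As a minimal sketch of the test described above, the R snippet below runs Fisher's exact test on a single 2x2 table with base R's fisher.test. All counts are made up for illustration. Comparing carriers against non-carriers, rather than against all FinnGen individuals, is an assumption made here so that the two groups do not overlap.

```r
# Hypothetical counts, not FinnGen data.
carriers_with    <- 6        # carriers who have the endpoint
carriers_without <- 34       # carriers who do not
others_with      <- 8000     # non-carriers who have the endpoint
others_without   <- 392000   # non-carriers who do not

# Rows are endpoint status, columns are groups
# (matrix() fills the table column by column).
tab <- matrix(
  c(carriers_with, carriers_without, others_with, others_without),
  nrow = 2,
  dimnames = list(endpoint = c("yes", "no"),
                  group = c("carriers", "non_carriers"))
)

res <- fisher.test(tab)
res$p.value    # two-sided p-value
res$estimate   # conditional maximum-likelihood estimate of the odds ratio
res$conf.int   # 95% confidence interval for the odds ratio
```

An odds ratio above 1 indicates that the endpoint is more frequent among carriers than in the comparison group.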

A series of analyses of similarly rare synonymous and non-coding variants may provide an empirical distribution of the types of extreme p-values expected by chance.
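To get a rough feel for how many extreme p-values chance alone can produce, the arithmetic can be sketched in R as follows. The number of tests is a hypothetical value, the tests are treated as independent (a simplification, since endpoints and codes are correlated), and the Bonferroni threshold is shown only as one common, conservative reference point, not as the threshold recommended by FinnGen.

```r
# Hypothetical number of endpoints and codes examined in one scan.
n_tests <- 5000
alpha   <- 0.001

# Expected number of tests with p < alpha if no association is real.
expected_hits <- n_tests * alpha

# Probability of at least one such hit, assuming independent tests.
p_any_hit <- 1 - (1 - alpha)^n_tests

# Bonferroni threshold that keeps the family-wise error rate at 0.05.
bonferroni_threshold <- 0.05 / n_tests

c(expected_hits = expected_hits,
  p_any_hit = p_any_hit,
  bonferroni_threshold = bonferroni_threshold)
```

Under these assumptions several results near p = 0.001 are expected even when nothing is truly associated, which is why an empirical distribution from comparable variants is a useful complement.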

An easy way to conduct a PheWAS analysis with Fisher's exact test between cases and background or control cohorts is to use the Cohort Operations tool. With the Cohort Operations tool, you can also use matched controls to standardize the effect of sex and age between cases and controls.

We provide a guideline for interpreting rare variant results.

Data available for coding variants in PheWeb

For coding and LoF variants that were common enough to be included in the imputation, you can find the association tests directly in PheWeb by searching for the gene in which they are located.

Working with rare variant phenotypes

Variants rarer than those present in PheWeb can be analyzed with the CodeWAS analysis in the Cohort Operations tool. The Cohort Operations tool has a graphical user interface (GUI) that requires no coding skills from the user.

Alternatively, you may use ready-made R scripts. To use the R scripts, follow the steps below.

Step 1:

To work with these R scripts you will need to take the largest Sandbox as you will be reading large amounts of data into R.

Step 2:

Prepare a genotype file listing your individuals, using either the Genotype Browser or V3C (V3C enables you to correct rare variant calls). In the File Browser you can look up the file's name and location so that you can tell the code where to find it. The name will likely be quite long if the file comes from the Genotype Browser. (You can also make lists of individuals through other means, but this topic mainly follows variant analysis.)

Step 3:

Open RStudio via Applications>Development>RStudio

Step 4:

Next you need to bring in the R libraries for PheWAS into your IVM. (Note that you can copy and paste the pathname below to the clipboard of your Sandbox rather than needing to type it all in.)

cp /finngen/shared/fgphewasdf9/20220214_062607/files/FGphewasDF9_0.0.0.9000.tar.gz /home/ivm

Step 5:

Within RStudio you then need to install the libraries. In RStudio, on the middle right side, click "Install" and then "Browse" to select the FGphewasDF9_0.0.0.9000.tar.gz file you just copied over.
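If you prefer the Console to the "Install" dialog, the same step can probably be done with install.packages from base R. This is a sketch that assumes the archive was copied to /home/ivm as in Step 4 and that the package's dependencies are already available in the Sandbox.

```r
# Path of the archive copied in Step 4.
pkg_file <- "/home/ivm/FGphewasDF9_0.0.0.9000.tar.gz"

if (file.exists(pkg_file)) {
  # repos = NULL tells R to install from the local file instead of a repository.
  install.packages(pkg_file, repos = NULL, type = "source")
} else {
  message("Package archive not found at ", pkg_file,
          "; repeat the copy in Step 4.")
}
```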

Step 6:

From this point there are a couple of ways you can work through the code. One is to use the vignettes feature of R by typing in the RStudio Console (lower left side):

vignette("simple tutorial", package="FGphewasDF9")

vignette("compare_medical_codes", package="FGphewasDF9")
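If a vignette name is not recognised, listing the vignettes that the installed package actually ships can help. The sketch below uses only base R's vignette() and assumes nothing about the package beyond its name.

```r
# Check whether the package from Step 5 is installed before querying it.
has_pkg <- requireNamespace("FGphewasDF9", quietly = TRUE)

if (has_pkg) {
  # vignette(package = ...) returns a listing; $results holds one row per vignette.
  listing <- vignette(package = "FGphewasDF9")
  print(listing$results[, c("Item", "Title"), drop = FALSE])
} else {
  message("FGphewasDF9 is not installed; complete Step 5 first.")
}
```

The "Item" column holds the names to pass as the first argument of vignette().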

Step 7:

You will see in both these examples that you can look for enrichment of particular medical codes or FinnGen endpoints. Generally, looking at enrichment of FinnGen endpoints is the best way to go (this is the second option). This is because many synonymous medical codes are combined into one FinnGen endpoint (e.g. ICD8, ICD9 and ICD10 are all included, whereas for the code test they will be listed separately).

The libraries then use Fisher's exact test to compare cases and controls for each code or endpoint.
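To illustrate the kind of per-endpoint comparison described above, the sketch below applies base R's fisher.test to each row of a small table of counts. It is not the FGphewasDF9 implementation; the endpoint names, counts, and cohort sizes are all hypothetical.

```r
# Hypothetical cohort sizes and per-endpoint counts, not FinnGen data.
n_cases <- 40
n_ctrls <- 400000
counts <- data.frame(
  endpoint = c("ENDPOINT_A", "ENDPOINT_B", "ENDPOINT_C"),
  case_yes = c(6, 1, 3),            # cases with the endpoint
  ctrl_yes = c(8000, 12000, 30000)  # controls with the endpoint
)

# Run one Fisher's exact test per endpoint on its 2x2 table.
tests <- lapply(seq_len(nrow(counts)), function(i) {
  x_case <- counts$case_yes[i]
  x_ctrl <- counts$ctrl_yes[i]
  fisher.test(matrix(c(x_case, n_cases - x_case,
                       x_ctrl, n_ctrls - x_ctrl), nrow = 2))
})

counts$odds_ratio <- sapply(tests, function(t) unname(t$estimate))
counts$p_value    <- sapply(tests, function(t) t$p.value)

# Show the endpoints with the smallest p-values first.
counts[order(counts$p_value), ]
```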

Here is an example of the output for a codes-based analysis. Note that non-standard codes are sometimes found in the raw data. You can always check the full list of codes to see if there is a nearby code. However, there are also many typos in the original registry data, especially in the older ICD8 and ICD9 codes. Note also that some codes are Finnish-specific.

Step 8:

The next step is to interpret your results. We provide information on which p-values are significant in Interpreting rare-variant analysis results. Please note also that the Fisher's exact test used in these libraries may not be the best method for common phenotypes. For common phenotypes it is important to also take into account the principal components, which capture population structure.
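One common way to take principal components into account is a logistic regression with the components as covariates. The sketch below shows the general shape of such a model on simulated data using base R's glm. The variable names, the number of components, and the extra age and sex covariates are assumptions for illustration, not the FinnGen analysis pipeline.

```r
set.seed(1)
n <- 2000

# Simulated data standing in for carrier status, covariates and an endpoint.
dat <- data.frame(
  carrier = rbinom(n, 1, 0.05),
  PC1     = rnorm(n),
  PC2     = rnorm(n),
  age     = round(runif(n, 20, 80)),
  sex     = factor(sample(c("female", "male"), n, replace = TRUE))
)
# The endpoint depends on PC1 only, so carrier status has no true effect here.
dat$has_endpoint <- rbinom(n, 1, plogis(-2 + 0.5 * dat$PC1))

# Logistic regression of the endpoint on carrier status, adjusted for covariates.
fit <- glm(has_endpoint ~ carrier + PC1 + PC2 + age + sex,
           data = dat, family = binomial())

# Estimate, standard error, z value and p-value for the carrier term.
coef(summary(fit))["carrier", ]
```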

A message like this one means you need to shut down your current Sandbox and take the largest Sandbox:

