How to run the PRS pipeline

This pipeline calculates polygenic risk scores (PRS) for FinnGen individuals from external summary statistics. More information about the pipeline is available on its GitHub page.

Example script files

Example scripts (WDL and JSON) for running the PRS pipeline can be found in /finngen/library-green/scripts/prs/. There are two example files:

  • prs_sandbox.wdl and

  • prs_sandbox.json (inputs, needs to be edited!)

Files that are required and what should be included

First, you need to find a GWAS summary statistic file for the disease or trait you are interested in studying. The summary statistic needs to be a full summary statistic, i.e. it should contain all variants from the GWAS and not be a selected set of variants. Links to full GWAS summary statistics can be found in published articles, and a good resource for full summary statistics is the GWAS Catalog.

gwas_meta.tsv. This file contains all the meta information for the input sumstats (i.e. the column names of all relevant fields), and it’s where 99% of the work is required. Here is a sample of how the file should look:

  • The metadata file needs to be tab-separated with no spaces in any field (Cromwell specs). We recommend taking the default file referenced in the json, opening it in Excel, inserting your own data and exporting it as tsv, including the header row!

  • Mandatory fields are in bold in the sample above. Others can be filled with a placeholder, like NA (but no empty fields!)

  • chrom & pos are needed only if the variant field does not contain rsids or two integers that can be mapped to chrom_pos

  • finngen_phenocode is used for filtering regions in the score step and needs to match a key in the regions file (see below for more on the regions file)
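
The formatting rules above can be checked before the job is submitted. The following is a minimal sketch in Python and is not part of the pipeline. The function name check_meta_tsv is hypothetical, and it only tests the rules listed above (header row, tab-separated fields, no spaces, no empty fields), not whether the column names are the ones the pipeline expects.

```python
def check_meta_tsv(text):
    """Return a list of formatting problems in the metadata table (empty list = OK).

    Hypothetical helper: checks only the rules described in the text above.
    """
    rows = [line.split("\t") for line in text.splitlines()]
    if len(rows) < 2:
        return ["expected a header row and at least one data row"]

    problems = []
    n_fields = len(rows[0])  # the header defines the expected number of columns
    for line_no, row in enumerate(rows, start=1):
        if len(row) != n_fields:
            problems.append(f"line {line_no}: {len(row)} fields, header has {n_fields}")
        for col_no, value in enumerate(row, start=1):
            if value == "":
                problems.append(f"line {line_no}, column {col_no}: empty field (use NA)")
            elif " " in value:
                problems.append(f"line {line_no}, column {col_no}: contains a space")
    return problems
```

To use it, read your edited gwas_meta.tsv into a string (for example with open("gwas_meta.tsv").read()) and pass it to the function. An empty list means none of these formatting problems were found.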

The GWAS summary statistic file (referred to in the column ‘filename’) needs to be gzipped and have a .gz extension.

The pipeline looks for each file in the global input folder, at the path DATA_PATH (another input) + FILENAME.
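
As a rough illustration of the two requirements above, the sketch below checks that a file exists at DATA_PATH + FILENAME and that it really is gzipped. It assumes the input folder is readable as an ordinary local path from where you run the check. The function name check_sumstats_file is hypothetical and not part of the pipeline.

```python
import gzip
import os


def check_sumstats_file(data_path, filename):
    """Return a list of problems with the file at data_path + filename.

    Hypothetical helper. The two parts are joined by plain concatenation,
    as described above, so data_path normally needs a trailing slash.
    """
    full_path = data_path + filename
    problems = []
    if not filename.endswith(".gz"):
        problems.append(f"{filename}: missing .gz extension")
    if not os.path.isfile(full_path):
        problems.append(f"{full_path}: file not found")
        return problems
    try:
        # Reading a single byte forces the gzip header to be parsed.
        with gzip.open(full_path, "rb") as fh:
            fh.read(1)
    except (OSError, EOFError):
        problems.append(f"{full_path}: not a valid gzip file")
    return problems
```

Call it with the same DATA_PATH value that you put in the json and the value from the ‘filename’ column of the metadata file.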

In the WDL, the tsv is reduced to only a subset of required fields. You can also check manually/locally if it runs properly by running the following command (all on one line):

cat TABLE_ABOVE.tsv| sed -E 1d | cut -f 1,3,8-17 > sumstats.txt

Please check that all fields are present (with NA where needed) and that they match the expected output.
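
The same check can be sketched in Python for those who prefer to inspect the result programmatically. This mirrors the sed/cut command above (drop the header, keep columns 1, 3 and 8-17, i.e. 12 fields per row). The function names reduce_meta and check_reduced are hypothetical, and the code is an illustration rather than what the WDL actually runs.

```python
def reduce_meta(text, keep=(1, 3) + tuple(range(8, 18))):
    """Mimic `sed -E 1d | cut -f 1,3,8-17`: drop the header, keep the listed columns.

    Column numbers are 1-based, as in cut. Columns missing from a short row
    are skipped, which is also what cut does.
    """
    rows = []
    for line in text.splitlines()[1:]:  # [1:] drops the header row
        fields = line.split("\t")
        rows.append([fields[i - 1] for i in keep if i <= len(fields)])
    return rows


def check_reduced(rows, expected=12):
    """Return a list of problems: rows with the wrong field count or empty fields."""
    problems = []
    for row_no, row in enumerate(rows, start=1):
        if len(row) != expected:
            problems.append(f"row {row_no}: {len(row)} fields after the cut, expected {expected}")
        if "" in row:
            problems.append(f"row {row_no}: empty field (use NA)")
    return problems
```

Pass the contents of your metadata file to reduce_meta and then the result to check_reduced. An empty list means every row produced 12 non-empty fields.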

bim_file is the list of variants that are to be kept in the munging phase (i.e. FinnGen variants, or a subset, if one is interested only in certain variants)

bed_file points to the bed file (in general there should also be matching bim/fam files) for which the scores are calculated

regions is a file (use the current one as a template) that allows one to calculate scores excluding regions based on phenotypes (for the time being APOE for Alzheimer’s). If one specifies a mapping between phenos and regions, an extra score labeled no_regions is produced for each mapped pheno

ref_list is the list of files that localizes the reference panel (1kg Europeans) in the pipeline. One can in principle change the reference panel, but it’s recommended to first test outside the Cromwell environment

All other parameters are more or less hardcoded and should not be changed unless you have specific requirements. Please reach out on Slack if you want to change them so that we can provide feedback.

How to submit your PRS job

Once you have edited your .json file, you can submit your PRS job, for example using the following command:

finngen-cli request-workflow \
    --wdl /path/to/prs_sandbox.wdl \
    --input /path/to/prs_sandbox.json

See the section How to use the Pipelines area for general instructions on how to submit your job.

See also the section Pipelines is based on Cromwell and WDL.

Common pitfalls include:

  • The summary statistics file is not gzipped or lacks the .gz extension

  • The column names in the meta information file don’t match those in the summary statistics file

  • Alleles are in lower case instead of upper case in the summary statistic

  • The meta information file has empty fields
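
Two of these pitfalls, mismatched column names and lower-case alleles, can be caught by scanning the summary statistics before submitting. The sketch below is an illustration under assumptions: the function name find_lowercase_alleles is hypothetical, the file is assumed to be tab-separated unless sep is changed, and the allele column names must be supplied by you (the same ones you entered in the meta information file).

```python
def find_lowercase_alleles(lines, allele_columns, sep="\t", max_report=5):
    """Return up to max_report (line_number, column, value) tuples for alleles
    that are not fully upper case.

    Raises ValueError if a requested column name is missing from the header,
    which usually means the names in the meta information file don't match.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\r\n").split(sep)
    missing = [c for c in allele_columns if c not in header]
    if missing:
        raise ValueError(f"columns not found in header: {missing}")

    index = {c: header.index(c) for c in allele_columns}
    hits = []
    for line_no, line in enumerate(lines, start=2):  # line 1 is the header
        fields = line.rstrip("\r\n").split(sep)
        for column, i in index.items():
            if i < len(fields) and fields[i] != fields[i].upper():
                hits.append((line_no, column, fields[i]))
                if len(hits) >= max_report:
                    return hits
    return hits


# Example use on a gzipped file (file and column names here are placeholders):
#   import gzip
#   with gzip.open("my_sumstats.tsv.gz", "rt") as fh:
#       print(find_lowercase_alleles(fh, ["A1", "A2"]))
```

The function stops after a handful of hits because one lower-case allele is usually enough to show that the whole column needs converting.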
