How to run PRS pipeline
This pipeline calculates polygenic risk scores (PRS) for FinnGen individuals from external summary statistics. See the pipeline's GitHub page for more information.
Example scripts (WDL and JSON) for running the PRS pipeline can be found at /finngen/library-green/wdl/PRS/. There are 2 example files: prs_sandbox.wdl and prs_sandbox.json (inputs, needs to be edited!)
First, you need to find a GWAS summary statistic file for the disease or trait you are interested in studying. The summary statistic needs to be a full summary statistic, i.e. it should contain all variants from the GWAS and not be a selected set of variants. Links to full GWAS summary statistics can be found in published articles, and a good resource for full summary statistics is the GWAS Catalog.
gwas_meta.tsv. This file contains all the meta information of the input sumstats (i.e. the column names of all relevant fields) and it’s where 99% of the work is required. Here is a sample of how the file should look:
The metadata file needs to be tab-separated with no spaces in any field (Cromwell specs). We recommend taking the default file referenced in the JSON, opening it in Excel, inserting your own data and exporting it as a TSV, including the header row!
Mandatory fields are in bold in the sample above. Others can be filled with a placeholder, like NA (but no empty fields!)
chrom & pos are needed only if the variant field does not contain rsids or two integers that can be mapped to chrom_pos
finngen_phenocode is used for filtering regions in the score step and needs to match a key in the regions file (see below for more on the regions file)
GWAS summary statistic file (referred to in column ‘filename’) needs to be gzipped and have a .gz extension
Files must be located in the global input folder, at the path DATA_PATH (another input) + FILENAME
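The formatting rules for the metadata file (tab-separated, no spaces, no empty fields, gzipped sumstats) can be checked before submitting a job. The following is a minimal sketch, not part of the pipeline: the function name `check_meta` is hypothetical, and the three-column header in the demo is a shortened, made-up stand-in for the real header, which has more columns.

```python
def check_meta(text):
    """Return a list of problems found in the text of a metadata TSV."""
    problems = []
    lines = text.splitlines()
    if not lines:
        return ["file is empty"]
    header = lines[0].split("\t")
    for line_no, line in enumerate(lines, start=1):
        fields = line.split("\t")
        if len(fields) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(fields)}"
            )
            continue
        for name, value in zip(header, fields):
            if value == "":
                problems.append(f"line {line_no}: field '{name}' is empty (use NA)")
            elif " " in value:
                problems.append(f"line {line_no}: field '{name}' contains a space")
            # Line 1 is the header; the sumstats named in 'filename' must be gzipped.
            if line_no > 1 and name == "filename" and not value.endswith(".gz"):
                problems.append(f"line {line_no}: '{value}' has no .gz extension")
    return problems


if __name__ == "__main__":
    # Shortened, hypothetical header; the real file has more columns.
    demo = "filename\tvariant\tfinngen_phenocode\nmy gwas.tsv\t\tNA\n"
    for problem in check_meta(demo):
        print(problem)
```

To check a real file, pass the file contents, for example `check_meta(open("gwas_meta.tsv").read())`. An empty list means none of these particular problems were found; it does not guarantee that the column names in the file match the sumstats.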
In the WDL, the tsv is reduced to only a subset of required fields. You can also check manually/locally if it runs properly by running the following command (all on one line):
cat TABLE_ABOVE.tsv| sed -E 1d | cut -f 1,3,8-17 > sumstats.txt
Please check that all fields are present (filled with NA where needed) and that they match the expected output.
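If the shell tools are not at hand, the same reduction can be reproduced in Python. This is a rough equivalent of the `sed -E 1d | cut -f 1,3,8-17` command above, written as a sketch: the function name `reduce_meta` is hypothetical, and it assumes the file is tab-separated as required. Each output row should have 12 fields (columns 1, 3 and 8 to 17 of the metadata file).

```python
def reduce_meta(text):
    """Drop the header line and keep columns 1, 3 and 8-17 (1-based)."""
    # cut counts fields from 1, Python from 0.
    keep = [0, 2] + list(range(7, 17))
    reduced = []
    for line in text.splitlines()[1:]:  # skip the header, like `sed 1d`
        fields = line.split("\t")
        # Like cut, silently skip requested columns that a short line lacks.
        reduced.append("\t".join(fields[i] for i in keep if i < len(fields)))
    return reduced


if __name__ == "__main__":
    # Made-up 17-column table, only to show which columns survive.
    header = "\t".join(f"c{i}" for i in range(1, 18))
    row = "\t".join(f"v{i}" for i in range(1, 18))
    for line in reduce_meta(header + "\n" + row + "\n"):
        print(line)
```

A row with fewer than 12 fields in the output points to missing columns in the metadata file.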
bim_file is the list of variants that are to be kept in the munging phase (i.e. FinnGen variants, or a subset, if one is interested only in certain variants)
bed_file is the path to the bed file (in general there should be matching bim/fam files) for which the scores are calculated
regions is a file (use the current one as a template) that allows one to calculate scores excluding regions based on phenotypes (for the time being APOE for Alzheimer’s). If one specifies a mapping between phenos and regions, an extra score labeled no_regions is produced for each pheno
ref_list is the list of files that localizes the reference panel (1kg Europeans) in the pipeline. One can in principle change the reference panel, but it’s recommended to first test outside the Cromwell environment
All other parameters are more or less hardcoded and should not be changed unless you have a specific requirement. Please reach out on Slack if you want to change them so we can provide feedback on the matter.
Once you have edited your .json file, you can submit your PRS job, for example using the following command:
See the section How to use the Pipelines area for general instructions on how to submit your job.
See also the section Pipelines, which is based on Cromwell and WDL.
The summary statistics file is not gzipped (.gz extension)
The column names don’t match
Alleles are in lower case instead of upper case in the summary statistic
Meta information file has empty fields
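Two of the problems listed above can be caught with a quick look at the summary statistics file itself. The sketch below is not part of the pipeline: the function names are hypothetical, the allele column names are passed in by the caller (use the names from your own file), and it assumes the sumstats are tab-separated.

```python
import gzip


def is_gzipped(path):
    """True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


def lowercase_allele_rows(path, allele_columns, max_rows=1000):
    """Return line numbers (header is line 1) whose alleles are not upper case.

    Raises ValueError if a name in allele_columns is not in the header,
    which also flags column names that don't match.
    """
    bad_lines = []
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
        indices = [header.index(name) for name in allele_columns]
        for line_no, line in enumerate(handle, start=2):
            if line_no > max_rows + 1:
                break  # only inspect the first max_rows data rows
            fields = line.rstrip("\n").split("\t")
            for i in indices:
                if i < len(fields) and fields[i] != fields[i].upper():
                    bad_lines.append(line_no)
                    break
    return bad_lines
```

For example, `lowercase_allele_rows("my_gwas.tsv.gz", ["A1", "A2"])` lists the offending lines among the first 1000 rows, where the file name and the column names `A1` and `A2` are placeholders for your own.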