How to run the LDSC pipeline

The LDSC pipeline calculates heritabilities and genetic correlations for disease endpoints using ldsc. The complete documentation for the pipeline can be found on GitHub.

Example files

You can find example files (ldsc_sandbox.wdl and ldsc_sandbox.json) for running the pipeline in: /finngen/scripts/ldsc

In the example .json file, you will first need to define a list of endpoints of interest (given in ldsc_rg.meta_fg) and their summary statistics (files and total sample sizes), tab-separated in the meta-table format with 3 columns (phenocode, path_to_phenocode, N_total). For example:

AD_AM_EXMORE    gs://finngen-production-library-green/ldsc/test/munged/AD_AM_EXMORE.premunged.gz    11345
KRA_PSY_ANXIETY_EXMORE    gs://finngen-production-library-green/ldsc/test/munged/KRA_PSY_ANXIETY_EXMORE.premunged.gz    263812
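A quick sanity check of the meta-table before submission can catch formatting slips early. The following is a minimal sketch and not part of the pipeline: the function name check_meta_table is hypothetical, and the check assumes only what is described above (three tab-separated columns, with the total sample size as a whole number in the third).

```shell
# Hypothetical helper: verify that every line of a meta-table has exactly
# three tab-separated fields and that the third (N_total) contains only digits.
check_meta_table() {
    awk -F'\t' '
        NF != 3          { printf "line %d: expected 3 columns, found %d\n", NR, NF; bad = 1; next }
        $3 !~ /^[0-9]+$/ { printf "line %d: N_total is not a whole number: %s\n", NR, $3; bad = 1 }
        END              { exit bad + 0 }
    ' "$1"
}

# Write the two example rows into a scratch file and check it.
meta=$(mktemp)
printf 'AD_AM_EXMORE\tgs://finngen-production-library-green/ldsc/test/munged/AD_AM_EXMORE.premunged.gz\t11345\n' > "$meta"
printf 'KRA_PSY_ANXIETY_EXMORE\tgs://finngen-production-library-green/ldsc/test/munged/KRA_PSY_ANXIETY_EXMORE.premunged.gz\t263812\n' >> "$meta"
check_meta_table "$meta" && echo "meta-table looks OK"
```

The function returns a non-zero exit status and names the offending line when a row is, for instance, separated by spaces instead of tabs.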

NOTE! The number of columns in the summary statistics file is hardcoded. The variant column must be the column containing the SNP identifier, which can be either chrom/pos or rsid; the script can handle multiple formats at the same time if needed.

With this pipeline, you can:

1) calculate heritability estimates and all pairwise genetic correlations for a list of endpoints by giving just one meta-table list in the .json file in ldsc_rg.meta_fg (make sure to comment out the ldsc_rg.comparison_fg line in this case), or

2) calculate heritability estimates for a list of endpoints (given in ldsc_rg.meta_fg) and their genetic correlations with endpoints given in another list (given in ldsc_rg.comparison_fg), such as the full list of endpoints in a given DF. However, the example .json file is only for the first scenario, so you will need to generate this file yourself.

3) calculate ONLY heritability estimates for a list of endpoints, by setting the parameter 'only_het' as True in the .json file.
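The third scenario could be expressed in the inputs file roughly as follows. This is only a sketch under assumptions: the fully qualified key ldsc_rg.only_het is a guess based on the other ldsc_rg.* keys (the text above names only the parameter only_het), the lowercase JSON boolean true is assumed, the meta-table path and the output file name my_ldsc_inputs.json are hypothetical, and every other key from ldsc_sandbox.json is left out for brevity.

```shell
# Write a hypothetical inputs fragment for a heritability-only run.
# The key ldsc_rg.only_het and the path value are assumptions; compare
# against ldsc_sandbox.json before using anything like this.
cat > my_ldsc_inputs.json <<'EOF'
{
    "ldsc_rg.meta_fg": "gs://your-bucket/path/to/my_meta.tsv",
    "ldsc_rg.only_het": true
}
EOF
```

JSON has no comment syntax, so this sketch leaves ldsc_rg.comparison_fg out entirely rather than commenting it out.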

Pre-munge your summary statistics file(s):

Before running the pipeline, you need to make sure that the input summary statistics meet the requirements of ldsc for its own munging step.

The required input format is as follows:

SNP	A1	A2	BETA	P
rs74337086	A	G	0.0923	0.5059
rs76388980	A	G	0.1227	0.2945
rs562172865	T	C	-0.0262	0.8142
rs780596509	A	G	-0.2202	0.1545
rs778009914	A	G	-0.3938	0.3044
rs564223368	T	C	0.2195	0.03913
rs71628921	C	A	0.1763	0.3682
rs577189614	A	G	0.0845	0.5341
rs77357188	T	C	-0.0414	0.3383
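Whether a file already has this layout can be checked from its first line. A minimal sketch, assuming the columns are tab-separated as displayed above; the function name check_premunged_header is hypothetical and the check looks only at the header, not at the values.

```shell
# Hypothetical helper: succeed only if the first line of a plain or gzipped
# file is exactly the tab-separated header SNP A1 A2 BETA P.
check_premunged_header() {
    expected=$(printf 'SNP\tA1\tA2\tBETA\tP')
    case "$1" in
        *.gz) header=$(gzip -dc "$1" | head -n 1) ;;
        *)    header=$(head -n 1 "$1") ;;
    esac
    [ "$header" = "$expected" ]
}

# Try it on a scratch file holding the header and the first example row.
scratch=$(mktemp)
printf 'SNP\tA1\tA2\tBETA\tP\nrs74337086\tA\tG\t0.0923\t0.5059\n' > "$scratch"
check_premunged_header "$scratch" && echo "header OK"
```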

To get summary statistics (in REGENIE output format) into the right format, you can use the following example:

bash /finngen/library-green/scripts/ldsc/munge_sumstats.sh $SUM_STATS $OUT_FILE

where $SUM_STATS is a path to your input summary statistics file, and $OUT_FILE is the name of your munged summary statistics file.

Note: If your summary statistics file is not in the same format as the FG summary statistics, please change the column names in the munging script to correspond to your columns.
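The idea of mapping your own column names onto the required SNP, A1, A2, BETA, P layout can be sketched as below. This is not the contents of munge_sumstats.sh: the function name reformat_sumstats and the five input column names (rsids, alt, ref, beta, pval) are assumptions chosen for illustration, and the input is assumed to be tab-separated with a header line. In ldsc, A1 is the allele that the signed statistic refers to, so assign A1_COL to whichever allele your BETA describes.

```shell
# Hypothetical column names in the input file; change these to match your header.
SNP_COL=rsids
A1_COL=alt
A2_COL=ref
BETA_COL=beta
P_COL=pval

# $1 = tab-separated input with a header line; the reformatted table goes to stdout.
reformat_sumstats() {
    awk -F'\t' -v OFS='\t' \
        -v snp="$SNP_COL" -v a1="$A1_COL" -v a2="$A2_COL" \
        -v beta="$BETA_COL" -v p="$P_COL" '
        NR == 1 {
            # Remember the position of every column name in the header.
            for (i = 1; i <= NF; i++) idx[$i] = i
            n = split(snp " " a1 " " a2 " " beta " " p, want, " ")
            for (j = 1; j <= n; j++)
                if (!(want[j] in idx)) {
                    print "missing column: " want[j] > "/dev/stderr"
                    exit 1
                }
            print "SNP", "A1", "A2", "BETA", "P"
            next
        }
        # Emit the five selected columns in the required order.
        { print $(idx[snp]), $(idx[a1]), $(idx[a2]), $(idx[beta]), $(idx[p]) }
    ' "$1"
}
```

A call such as reformat_sumstats my_sumstats.tsv | gzip > my_sumstats.premunged.gz (both file names hypothetical) would then write a compressed file in the required layout. Looking columns up by name rather than by position keeps the sketch independent of column order.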

Submit your job

You can submit your ldsc_rg pipeline job via the command line using the following command:

finngen-cli request-workflow --wdl /path/to/ldsc_sandbox.wdl \
    --input /path/to/ldsc_sandbox.json

Output:

You'll find the heritability estimates for your endpoint(s) as one .tsv file in: /finngen/pipeline/cromwell/workflows/ldsc_rg/[WORKFLOW_ID]/call-gather_h2/[ldsc_rg.name]_[ldsc_rg_population].ldsc.heritability.tsv

and the pairwise genetic correlations, also as one .tsv file, in: /finngen/pipeline/cromwell/workflows/ldsc_rg/[WORKFLOW_ID]/call-gather_summaries/[ldsc_rg.name]_[ldsc_rg_population].ldsc.summary.tsv (genetic correlations are in column rg)
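To pull just the genetic correlations out of the summary file, the rg column can be looked up by name. A minimal sketch, assuming the .tsv file is tab-separated with a header line; the function name print_rg is hypothetical, and no columns other than rg are assumed.

```shell
# Hypothetical helper: print the values of the column named "rg" from a
# tab-separated file with a header line.
print_rg() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "rg") col = i
            if (!col) { print "no rg column found" > "/dev/stderr"; exit 1 }
            next
        }
        { print $col }
    ' "$1"
}

# Usage (replace the bracketed placeholders with your own values):
#   print_rg /finngen/pipeline/cromwell/workflows/ldsc_rg/[WORKFLOW_ID]/call-gather_summaries/[ldsc_rg.name]_[ldsc_rg_population].ldsc.summary.tsv
```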
