How to run the colocalization pipeline

This page describes how to run the new colocalization pipeline, which is built on the coloc SuSiE package.

Introduction

This pipeline takes the outputs of our finemapping pipeline and performs colocalization against the 571 resources we have gathered. These include all GWAS endpoints from FinnGen, GWAS endpoints from UKB and the GeneRISK project, the eQTL Catalogue, and proteomics studies from INTERVAL, UKB and FinnGen.

| Data Source | Data type | Description |
| --- | --- | --- |
| FinnGen-R12 | GWAS | All endpoints from FinnGen R12 |
| GeneRisk | GWAS | The GeneRISK Study is an ongoing prospective observational study focusing on genetic risk factors of cardiovascular diseases and on utilizing genetic information in preventing diseases. |
| UKB-finucane | GWAS | Some endpoints from UKB, shared by Masahiro. https://www.medrxiv.org/content/10.1101/2021.09.03.21262975v1 |
| Alasoo_2018--macrophage_naive--ge | eQTL_Catalogue | Expression QTL from the eQTL Catalogue (release 6), measured in macrophages and based on gene expression; see the eQTL Catalogue website for more information |
| ... (other ~560 more items) | eQTL_Catalogue | Other resources from the eQTL Catalogue, as indicated by the data source. The eQTL Catalogue assembles multiple data sources, e.g., tissue expression from GTEx. |
| INTERVAL | Plasma-Proteomics | Proteomics QTL from INTERVAL |
| UKB-PPP | Plasma-Proteomics | Proteomics QTL from UK Biobank (Olink) |
| FIN-R12-Olink | Plasma-Proteomics | Proteomics QTL from FinnGen R12 (Olink) |
| FIN-R12-Somascan | Plasma-Proteomics | Proteomics QTL from FinnGen R12 (Somascan) |

Example to run

  1. Download the metadata from the finemapping pipeline.

Go to Menu (Applications) -> Sandbox -> pipelines, find your successful finemapping run, and click "download metadata". The command below assumes the file is saved as Downloads/XXXX_metadata.json.

  2. Submit the colocalization job from a local terminal in the sandbox.

# Run the script with four arguments: metadata, trait_name, data_type, storage bucket (your green bucket).
# Customize these inputs for your own project; data_type can be any string without spaces.
# Change the red bucket number "N" to match your sandbox environment; you can find the red bucket URI by running "gsutil ls" in the SB terminal.
/finngen/shared_nfs/finngen/coloc/submit ~/Downloads/XXXX_metadata.json T2D GWAS gs://fg-production-sandbox-"N"-red/YOUR_PATH/T2D_Project

Check the terminal output for errors.

If no error occurs, press Enter in the terminal to open a browser where you can monitor the jobs. Refresh the page and look for your submitted job, which is named "ColocSusieDirectMulti" and carries your user name. It may take some time to appear because of the response time of the sandbox backends.
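Argument mistakes are easier to fix before the job is submitted than after. The sketch below is a hypothetical pre-flight check, not part of the pipeline: the function name check_coloc_args is made up for illustration, and it only tests what this page states about the inputs (the metadata file exists, data_type has no spaces, the bucket is a gs:// URI).

```shell
# Hypothetical helper (not shipped with the pipeline): validate the four
# arguments of the submit script before running it.
check_coloc_args() {
  # Expect exactly four arguments: metadata, trait_name, data_type, bucket.
  if [ "$#" -ne 4 ]; then
    echo "usage: check_coloc_args METADATA TRAIT_NAME DATA_TYPE BUCKET" >&2
    return 1
  fi
  # The metadata JSON downloaded from the finemapping run must exist.
  if [ ! -f "$1" ]; then
    echo "metadata file not found: $1" >&2
    return 1
  fi
  # data_type must be a non-empty string without spaces.
  case "$3" in
    "" | *" "*)
      echo "data_type must be a non-empty string without spaces: '$3'" >&2
      return 1
      ;;
  esac
  # The storage location is expected to be a Google Cloud Storage URI.
  case "$4" in
    gs://*) ;;
    *)
      echo "bucket must start with gs://: $4" >&2
      return 1
      ;;
  esac
  return 0
}

# Example use, with the same placeholders as the submit command on this page:
# check_coloc_args ~/Downloads/XXXX_metadata.json T2D GWAS "gs://fg-production-sandbox-N-red/YOUR_PATH/T2D_Project" \
#   && /finngen/shared_nfs/finngen/coloc/submit ~/Downloads/XXXX_metadata.json T2D GWAS "gs://fg-production-sandbox-N-red/YOUR_PATH/T2D_Project"
```

The check is deliberately limited to local conditions; it does not verify that the bucket exists or that the metadata belongs to a successful run.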

  3. Download the results.

The main output is labeled "ColocSusieDirectMulti.colocQC" in the outputs section of the pipeline's job details. It keeps only results with an H4 posterior probability (PP.H4.abf) > 0.5 and a valid credible set in both datasets (the threshold can be controlled in the input). Apply further filtering to this output according to your purpose, e.g., PP.H4.abf > 0.8 and the size of the overlapped region. We cannot provide a gold standard for this, as it depends on the study design and the aim of the colocalization.
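A stricter threshold such as 0.8 can be applied to the downloaded file from the terminal. The sketch below assumes the colocQC output is a tab-separated file with a header row containing the PP.H4.abf column listed under "Output formats"; the function name filter_h4 and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: keep the header plus rows whose PP.H4.abf exceeds a threshold.
# $1: tab-separated results file with a header row (assumed format)
# $2: threshold, e.g. 0.8
filter_h4() {
  awk -F'\t' -v thr="$2" '
    NR == 1 {
      # Locate the PP.H4.abf column by name instead of relying on its position.
      for (i = 1; i <= NF; i++) if ($i == "PP.H4.abf") col = i
      if (!col) { print "PP.H4.abf column not found" > "/dev/stderr"; exit 1 }
      print
      next
    }
    # Force a numeric comparison on both sides.
    ($col + 0) > (thr + 0)
  ' "$1"
}

# Example use (file names are placeholders):
# filter_h4 colocQC.tsv 0.8 > colocQC.h4_0.8.tsv
```

Looking the column up by name keeps the filter working if the column order differs from the table on this page.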

"ColocSusieDirectMulti.coloc": the raw results, without any filtering or merging.

"ColocSusieDirectMulti.hit": all the information for the top signals in the full colocalization results.

"ColocSusieDirectMulti.pairs": the overlapped regions that were run in the workflow.

Output formats

| Column | Description |
| --- | --- |
| dataset1 | Generated from your trait_name and data_type |
| dataset2 | Study--DataType in our resources |
| trait1 | Trait name in your data |
| trait2 | Trait name / molecular phenotype name from our resources |
| region1 | Region in your data |
| region2 | Overlapped region in our resources |
| cs1 | Credible set in your data |
| cs2 | Credible set in our resources |
| nsnps | Total number of overlapping variants |
| hit1 | Top signal in your data |
| hit2 | Top signal in our resources |
| PP.H4.abf | Probability of colocalization between your data and our resources |
| low_purity1 | Whether the credible set in your data has low purity (1 = low purity, 0 = high purity) |
| low_purity2 | Purity flag for the credible set in our resources |
| nsnps1 | Number of variants in the region from your data |
| nsnps2 | Number of variants in the region from our resources |
| cs1_log10bf | log10 Bayes factor for the credible set in your data |
| cs2_log10bf | log10 Bayes factor for the credible set in our resources |
| clpp | Colocalization based on CLPP |
| clpa | Colocalization based on CLPA (min of PIP) |
| cs1_size | Size of the raw credible set in your data |
| cs2_size | Size of the raw credible set in our resources |
| cs_overlap | Size of the overlapped credible set |
| topInOverlap | Indicates whether the top variant (highest PIP) of each dataset lies in the overlap of the finemapped regions of the two datasets. 1,1: both original top signals are in the overlapped region (expected for a reasonable coloc); 1,0 / 0,1: only one top signal is in the overlapped region; 0,0: neither top signal is in the overlapped region. |
| hit1_info | Information on the top signal in your data (beta, p-value) |
| hit2_info | Information on the top signal in our resources (beta, p-value) |
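The clpp and clpa columns summarize how much posterior weight two credible sets place on the same variants. As a rough illustration only, the sketch below assumes the commonly used definitions: CLPP as the sum over shared variants of the product of the two PIPs, and CLPA as the sum over shared variants of the smaller of the two PIPs. The pipeline's exact computation may differ; the function name clpp_clpa and the two-column input files (variant ID, PIP, no header) are hypothetical.

```shell
# Hypothetical helper: compute CLPP and CLPA from two per-variant PIP tables.
# $1, $2: tab-separated files with two columns (variant ID, PIP) and no header.
# Assumed definitions: CLPP = sum(PIP1 * PIP2), CLPA = sum(min(PIP1, PIP2)),
# both taken over variants present in both files.
clpp_clpa() {
  awk -F'\t' '
    # First file: remember each PIP by variant ID (stored as a number).
    NR == FNR { pip1[$1] = $2 + 0; next }
    # Second file: accumulate only over variants seen in the first file.
    ($1 in pip1) {
      p1 = pip1[$1]
      p2 = $2 + 0
      clpp += p1 * p2
      clpa += (p1 < p2) ? p1 : p2
    }
    END { printf "clpp=%.4f\tclpa=%.4f\n", clpp, clpa }
  ' "$1" "$2"
}

# Example use (file names are placeholders):
# clpp_clpa cs1_pips.tsv cs2_pips.tsv
```

Under these definitions CLPA is never smaller than CLPP, because each PIP is at most 1 and so the minimum of two PIPs is at least their product.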

The code is available on GitHub: https://github.com/FINNGEN/coloc.susie.direct
