Finemapping of Custom GWAS analyses

This page explains the following:

  1. How is finemapping performed for Custom GWAS analyses?

  2. How to get your endpoint finemapped?

  3. How to access the data?

  4. What data is available and how is it structured?

Finemapping process

The finemapping process consists of two steps: region selection, followed by fine-mapping of the selected regions.

Region selection algorithm

In short, region selection picks the regions that contain genome-wide significant variants and passes them on to finemapping. Some regions are too large to finemap; those regions are marked as not possible to finemap.

In more detail, the region selection algorithm works as follows. Taking the summary statistics as input, it expands a window around each genome-wide significant variant, using a window size of 3 Mbp and a significance threshold of 5e-8. Overlapping windows are then merged, and in the ideal case that is the end of region selection. For practical reasons, however, we cannot finemap arbitrarily large regions, so region width is capped at 6 Mbp, and merged regions sometimes exceed this limit. Regions that are too large are re-formed using a window size 10% smaller than in the previous attempt, repeating until the window size reaches 1 Mbp. In most cases the regions split into smaller, manageable regions. If the lower limit of 1 Mbp window size is reached without producing finemappable regions, we give up on that region and mark it as not possible to finemap in the outputs.
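The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, the window is assumed to be centred on each variant with the window size as its total width, all positions are assumed to lie on one chromosome, and the input is assumed to already contain only variants with p < 5e-8.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) in base pairs


def merge_windows(positions: List[int], window: int) -> List[Region]:
    """Centre a window of total width `window` on each position and merge overlaps."""
    half = window // 2
    merged: List[Region] = []
    for pos in sorted(positions):
        start, end = max(0, pos - half), pos + half
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def select_regions(positions, window=3_000_000, max_width=6_000_000,
                   min_window=1_000_000):
    """Return (start, end, window_used, status) for each region on one chromosome."""
    results = []
    for start, end in merge_windows(positions, window):
        if end - start <= max_width:
            results.append((start, end, window, "OK"))
            continue
        smaller = window * 9 // 10  # 10% smaller window, in integer arithmetic
        if smaller < min_window:
            # Assumed stopping rule: give up once the next window would be below 1 Mbp.
            results.append((start, end, window, "Failure"))
        else:
            inside = [p for p in positions if start <= p <= end]
            results.extend(select_regions(inside, smaller, max_width, min_window))
    return results


if __name__ == "__main__":
    # Made-up positions of genome-wide significant variants.
    for region in select_regions([10_000_000, 12_500_000, 15_000_000, 17_500_000]):
        print(region)
```

Retrying only the variants inside the oversized region keeps regions that were already acceptable untouched; whether the real pipeline does the same is an assumption of this sketch.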

Fine-mapping of regions

These regions are then finemapped using both FINEMAP and SuSiE. More information about the methods can be found in the release finemapping documentation in the release data bucket at green_library/finngen_R12/finngen_analysis_documentation/finngen_R12_finemap.md, as well as in the finemapping pipeline repository here.

What variants are included in the finemapping process?

Finemapping is performed on variants inside a region that meet the following criteria:

  1. They are included in the GWAS summary statistic for that endpoint

  2. Their INFO score for the data release was greater than 0.6

How to get your endpoint finemapped?

Finemapping GWAS analyses is not done by default at this point. To get your GWAS analysis finemapped, send an email to the servicedesk (finngen-servicedesk@helsinki.fi), with the following information:

  1. Request to finemap your endpoint

  2. Endpoint name

  3. FinnGen release

  4. URL to the endpoint in user results Pheweb

For example:

to:finngen-servicedesk@helsinki.fi

subject: FinnGen Custom GWAS Finemap request

Dear service desk,

Could you fine-map the following endpoint:

Phenotype: ENDPOINT X

Release: 12

GWAS results available at FinnGen User Results browser: https://userresults.finngen.fi/pheno/ENDPOINTX

Best,

Eager Finemapper

Note that, for now, only release 12 endpoints are available for finemapping.

Data availability

The finemapping results are available in two places: in the userresults pheweb browser and in the green library.

Finemapped endpoints are automatically loaded to the pheweb browser. In the pheweb browser, you can find the finemap data when examining a single genome-wide significant region. Unfortunately, fine-mapped results are not yet listed in the phenotype view.

You can get to individual regions by first going to your endpoint in userresults Pheweb, and then clicking either on a GWAS peak in the manhattan plot or on the 'locus' link in the table, as in the image below.

In the region view, the credible set data should be shown both as a listing of how many signals were found by SuSiE and by FINEMAP, and as a locuszoom plot. These have been highlighted in red in the image below.

You can find the finemapping data in the green library under green_library/finngen_R12/sandbox_custom_gwas/PHENOTYPE/finemap, given release 12 and phenotype PHENOTYPE. Note that if this phenotype has not been finemapped, the finemap subfolder does not exist.

Available files

All of the finemapping results are in the bucket folder /green_library/finngen_R12/sandbox_custom_gwas/PHENOTYPE/finemap. Some of the files are in this top-level directory, while others are in nested directories. The folder contains the region selection outputs as well as the FINEMAP and SuSiE outputs.

Here is a table describing each of those files or directories:

Filename
Description

had_results

This file tells if there were any regions to finemap in your endpoint. It will contain the text "True" if there were regions that were sent to finemapping, and "False" if there were no regions to finemap. Having regions to finemap in this context means the endpoint had genome-wide significant (GWS) variants.

PHENOTYPE.region_status

This tab-separated file (TSV) shows a brief summary of the regions identified in region selection.

too_many_regions

This file contains the word "True" if your endpoint contained too many regions to finemap (currently the limit is set to 300 regions).

finemap/

This folder contains the finemapped results of FINEMAP

susie/

This folder contains the finemapped results of SuSiE
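As a quick check before opening any result files, the marker files described above can be read programmatically. The following is a minimal sketch, assuming the finemap folder is reachable as a local directory path; the function name and the returned status strings are hypothetical and not part of the pipeline.

```python
from pathlib import Path


def finemap_status(finemap_dir):
    """Summarise the marker files of one endpoint's finemap folder."""
    root = Path(finemap_dir)
    if not root.is_dir():
        # The finemap subfolder does not exist if the phenotype was not finemapped.
        return "not finemapped"

    def flag(name):
        # A marker counts as set only if the file exists and contains "True".
        path = root / name
        return path.is_file() and path.read_text().strip() == "True"

    if flag("too_many_regions"):
        return "too many regions"
    if not flag("had_results"):
        return "no regions to finemap"
    return "results available"


if __name__ == "__main__":
    # The path below is a placeholder; substitute the folder of your own endpoint.
    print(finemap_status("finemap"))
```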

Next, the contents of the region status file and of the finemap and susie folders are described.

Region status file

The region status file is a tab-separated file that tells which regions were sent to finemapping and if there were any problems that prevented finemapping. It has the following columns:

Column name
Description

region

The span of the region, specified in chromosomal coordinates chromosome.start-end

status

Status of the region, either "OK" if the region was passed on to finemapping, or "Failure" if the region was not successfully formed.

windowsize

The window size used when determining a region. Region selection works by extending a window (in basepairs) around each genome-wide significant variant. If windows overlap each other, those windows get merged. These possibly merged windows are the resulting regions that are finemapped. In case a region is larger than the maximum allowed region size (currently 6 megabases), that region is retried with a smaller window. The final window size that was tried is the one shown here.

failure

Empty if the region was successful. If the region was not successful, the reason is given here. Most likely the region was too long and could not be formed even when the window size was lowered to its minimum value.

For example, it might be that there is a genome-wide significant region that is 10 Mbp or even 20 Mbp long. In those cases, it is likely that the region selection algorithm will not be able to narrow down the region into one that can be finemapped.
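To list the regions that could not be finemapped, the region status file can be read with Python's standard csv module. This is a sketch that assumes the file starts with a header line using the column names from the table above; the function name, the region labels and the failure text in the example are made up.

```python
import csv
import io


def failed_regions(handle):
    """Return (region, windowsize, failure) for every region whose status is not OK."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["region"], row["windowsize"], row["failure"])
        for row in reader
        if row["status"] != "OK"
    ]


# Made-up file contents, used only to show the call.
example = io.StringIO(
    "region\tstatus\twindowsize\tfailure\n"
    "REGION_A\tOK\t3000000\t\n"
    "REGION_B\tFailure\t1046034\tregion too large\n"
)

if __name__ == "__main__":
    for region, windowsize, reason in failed_regions(example):
        print(region, windowsize, reason)
```

For a real file, pass an open file handle (for example `open("PHENOTYPE.region_status", newline="")`) instead of the in-memory example.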

finemap folder

The finemap folder contains the following files and folders:

Filename
Description

PHENOTYPE.FINEMAP.config.bgz

A bgzipped, tab-separated file containing the posterior summaries for each causal configuration, one per line

PHENOTYPE.FINEMAP.region.bgz

A bgzipped, tab-separated file containing each region and the probabilities of the predicted causal variant configurations

PHENOTYPE.FINEMAP.snp.bgz

A bgzipped, tab-separated file containing the credible set status for each of the SNPs in the finemapped regions.

PHENOTYPE.FINEMAP.snp.bgz.tbi

A tabix index file for the snp file

cred_regions/

A folder containing the individual credible set predictions, with one file per model with k causal SNPs. For example, a file ending with .cred3 has the predictions for the scenario in which there are 3 independent causal variants in the region, and therefore 3 credible sets in the region.

The files are described in more detail below.
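Because bgzip output is gzip-compatible, the .bgz files can be read with Python's standard gzip module without any extra tools. The sketch below assumes each file begins with a header line holding the column names; the function name is hypothetical.

```python
import csv
import gzip


def read_bgz_tsv(path):
    """Yield one dict per data row of a bgzipped, tab-separated file with a header."""
    # gzip.open reads all concatenated gzip members, which is how bgzip stores blocks.
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


if __name__ == "__main__":
    import sys

    # Usage: python read_bgz.py PHENOTYPE.FINEMAP.config.bgz
    for row in read_bgz_tsv(sys.argv[1]):
        print(row)
```

All values are returned as strings, so numeric columns such as prob need an explicit `float(...)` conversion.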

PHENOTYPE.FINEMAP.config.bgz

This file contains posterior summaries for all of the causal configurations, one per line. The columns are described in the following table:

Column name
Description

rank

ranking of this configuration

config

the SNP identifiers

prob

posterior probability of the configuration being the causal configuration

log10bf

log10 Bayes factor of the configuration. The Bayes factor quantifies the evidence for the causal configuration over the null configuration (no causal variants)

odds

Odds of the causal configuration

k

number of SNPs in the causal configuration

prob_norm_k

posterior probability of this configuration being the causal configuration, normalized over the set of configurations with the same number of causal variants

h2

heritability contribution of SNPs

h2_0.95CI

95% credible interval of heritability contribution of SNPs

mean

mean of joint posterior effect size

sd

standard deviation of joint posterior effect size

More information can be found in http://www.christianbenner.com/

PHENOTYPE.FINEMAP.region.bgz

This bgzipped, tab-separated value file contains all of the finemapped regions for the endpoint, one region per line.

Column name
Description

trait

phenotype in question

region

finemapped region

h2g

Model-averaged heritability

h2g_sd

Model-averaged heritability, standard deviation

h2g_lower95

lower bound of the heritability 95% credible interval

h2g_upper95

upper bound of the heritability 95% credible interval

log10bf

log10 Bayes factor for the region

prob_1..LSNP

Posterior probability for the number of causal SNPs (= number of credible sets) from 1 to L, where L is the maximum number of causal SNPs considered

expectedvalue

Expected number of causal SNPs in the genomic region

More information can be found in http://www.christianbenner.com/
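As a small worked example of how the per-count probabilities relate to the expected number of causal SNPs, the sketch below takes one row of the region file and computes the most probable count and the probability-weighted average count. It assumes that the prob_1..LSNP columns are literally named prob_1SNP, prob_2SNP and so on, which is an interpretation of the table above rather than a confirmed naming; the function name and the numbers are made up.

```python
import re

# Assumed column naming: prob_1SNP, prob_2SNP, ..., prob_LSNP.
PROB_COLUMN = re.compile(r"^prob_(\d+)SNP$")


def causal_count_summary(row):
    """Return (most probable number of causal SNPs, probability-weighted mean count)."""
    probs = {}
    for name, value in row.items():
        match = PROB_COLUMN.match(name)
        if match:
            probs[int(match.group(1))] = float(value)
    most_likely = max(probs, key=probs.get)
    expected = sum(k * p for k, p in probs.items())
    return most_likely, expected


if __name__ == "__main__":
    # Made-up row: 20% one signal, 70% two signals, 10% three signals.
    row = {"trait": "ENDPOINTX", "prob_1SNP": "0.2", "prob_2SNP": "0.7", "prob_3SNP": "0.1"}
    print(causal_count_summary(row))
```

The weighted mean should be close to the expectedvalue column if that column is computed as a posterior expectation over the same probabilities, which this sketch assumes.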

PHENOTYPE.FINEMAP.snp.bgz

This tabix-indexed, bgzipped file contains finemapping information for each of the SNPs that were finemapped. The file is a tab-separated value (TSV) file with one variant per line. The columns of the file are described in the table below:

Column name
Description

trait

phenotype in question

region

finemapped region

v

variant identifier, in form chromosome:position:ref:alt

index

index

rsid

variant identifier in the form 'chr'chromosome_position_ref_alt

chromosome

chromosome of the variant, prefixed with 'chr'

position

chromosomal position of the variant

allele1

reference allele of the variant

allele2

alternate allele of the variant

maf

minor allele frequency of the variant

beta

effect size of the variant in the GWAS summary statistic

se

standard error for the variant in GWAS summary statistic

z

z-score for the variant

prob

Posterior Inclusion Probability for this variant, i.e. the probability that this variant is causal

log10bf

log10 of Bayes factor. The Bayes factor quantifies the evidence that the variant is causal.

mean

This column contains the marginalized shrinkage estimates of the posterior effect size mean for the alternate allele. The marginalized shrinkage estimate for a SNP is computed by averaging the posterior effect size means of this SNP from all causal configurations in the PHENOTYPE.FINEMAP.config.bgz file, assuming that the effect size of the SNP is zero if the SNP is not in the causal configuration.

sd

This column contains the marginalized shrinkage estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the marginalized shrinkage estimates of the posterior effect size mean.

mean_incl

This column contains the conditional estimates of the posterior effect size mean for the alternate allele. The conditional estimate for a SNP is computed by averaging the posterior effect size means of this SNP from causal configurations in PHENOTYPE.FINEMAP.config.bgz file in which it is included.

sd_incl

This column contains the conditional estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the conditional estimates of the posterior effect size mean

p

p-value of the association in the summary statistic

csx

credible set index for given number of causal variants x

PHENOTYPE.FINEMAP.snp.bgz.tbi

The tabix index file for PHENOTYPE.FINEMAP.snp.bgz. It is not used directly, but it needs to be in the same folder as the snp file if tabix is used to search the file.
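When tabix is not at hand, the rows of the snp file can also be filtered by position in plain Python, at the cost of scanning the whole file. The sketch below works on rows that have already been parsed into dicts keyed by the column names above; the function name and the example rows are hypothetical.

```python
def variants_in_range(rows, chromosome, start, end):
    """Return rows on `chromosome` with start <= position <= end, highest PIP first."""
    selected = [
        row for row in rows
        if row["chromosome"] == chromosome and start <= int(row["position"]) <= end
    ]
    # The prob column holds the posterior inclusion probability (PIP).
    return sorted(selected, key=lambda row: float(row["prob"]), reverse=True)


if __name__ == "__main__":
    # Made-up rows; the chromosome column carries the 'chr' prefix in this file.
    rows = [
        {"v": "1:100:A:G", "chromosome": "chr1", "position": "100", "prob": "0.1"},
        {"v": "1:200:C:T", "chromosome": "chr1", "position": "200", "prob": "0.8"},
        {"v": "2:150:G:A", "chromosome": "chr2", "position": "150", "prob": "0.9"},
    ]
    for row in variants_in_range(rows, "chr1", 50, 250):
        print(row["v"], row["prob"])
```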

susie folder

The susie folder contains the following files:

Filename
Description

PHENOTYPE.SUSIE.cred.bgz

A bgzipped TSV, containing SUSIE per-credible set output.

PHENOTYPE.SUSIE.cred.summary.tsv

A summary of the SUSIE credible set output, in TSV form.

PHENOTYPE.SUSIE.cred_99.bgz

A bgzipped TSV, containing SUSIE per-credible set output for 99% credible sets.

PHENOTYPE.SUSIE.snp.bgz

SUSIE output for every variant in the regions inspected

PHENOTYPE.SUSIE.snp.bgz.tbi

A tabix index file for PHENOTYPE.SUSIE.snp.bgz

PHENOTYPE.SUSIE.snp.filter.tsv

Filtered SUSIE SNP output for 95% credible sets

PHENOTYPE.SUSIE_99.cred.summary.tsv

A summary of the SUSIE 99% credible set output, in TSV form.

PHENOTYPE.SUSIE_99.snp.filter.tsv

Filtered SUSIE SNP output for 99% credible sets, in TSV form.

PHENOTYPE.SUSIE_EXTEND.cred.summary.tsv

A summary of the SUSIE 95% credible sets extended with 99% variants, in TSV form.

PHENOTYPE.SUSIE_EXTEND.snp.filter.tsv

Filtered SUSIE SNP output for 95% credible sets, extended with 99% CS variants, in TSV form.

The files are described in more detail below:

PHENOTYPE.SUSIE.cred.bgz

This file contains all of the credible sets for this phenotype. The credible sets are the 95% credible sets, i.e. under the model they have a 95% probability of containing the causal variant. The file is a bgzipped tab-separated values file, with one credible set per line. The columns are described in the following table:

Column name
Description

trait

phenotype in question

region

region which was finemapped, formatted as chrCHROMOSOME:START-END

cs

credible set index. The credible set index can be used to match credible sets with their variants.

cs_log10bf

Log10 of the credible set's Bayes factor. This quantifies the evidence for the model. For example, a value of 3 means that the model with this specific credible set had a 10^3 = 1000 larger likelihood than the null model without that credible set.

cs_avg_r2

Average r2 between credible set's variants

cs_min_r2

Minimum r2 between credible set's variants

low_purity

This will be 'TRUE' if minimum r2 between all variants in credible set was less than a given threshold, currently 0.25, and 'FALSE' otherwise.

cs_size

Size of the credible set, as the number of variants included.

PHENOTYPE.SUSIE.cred_99.bgz

This file contains the 99% credible sets for this phenotype. A 99% credible set is one which under the model contains 99% probability mass that the causal variant is part of the credible set. The columns are otherwise the same as in PHENOTYPE.SUSIE.cred.bgz.

PHENOTYPE.SUSIE.cred.summary.tsv

This file contains a summary of the credible sets for this phenotype. The credible sets are the 95% credible sets, i.e. under the model they have a 95% probability of containing the causal variant. The file is a tab-separated values file, with one credible set per line. The columns are described in the following table:

Column name
Description

trait

phenotype in question

region

region which was finemapped, formatted as chrCHROMOSOME:START-END

cs

credible set index. The credible set index can be used to match credible sets with their variants.

cs_log10bf

Log10 of the credible set's Bayes factor. This quantifies the evidence for the model. For example, a value of 3 means that the model with this specific credible set had a 10^3 = 1000 larger likelihood than the null model without that credible set.

cs_avg_r2

Average r2 between credible set's variants

cs_min_r2

Minimum r2 between credible set's variants

low_purity

This will be 'True' if minimum r2 between all variants in credible set was less than a given threshold, currently 0.25, and 'False' otherwise.

cs_size

Size of the credible set, as the number of variants included.

good_cs

This column is currently the inverse of low_purity column, and indicates whether the credible set consists of variants that are in reasonably strong LD together.

cs_id

Unique identifier of the credible set, consisting of the credible set region and credible set index in the following format: REGION_CS_INDEX

v

Credible set lead variant (largest PIP in the credible set). In format CHROMOSOME:POSITION:REF:ALT

rsid

Credible set lead variant in format chrCHROMOSOME_POSITION_REF_ALT

p

lead variant p-value

beta

lead variant effect size

sd

lead variant effect standard error

prob

lead variant PIP in the region

cs_specific_prob

lead variant PIP in this specific credible set. This and the prob column are almost always equal or very close to each other.

most_severe

most severe predicted effect of this variant.

gene_most_severe

Gene in which the most severe predicted effect of this variant is.
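A common first step with the summary file is to keep only the credible sets that are not flagged as low purity. The sketch below does this on rows already parsed into dicts. It compares the low_purity value case-insensitively, since the tables on this page show both 'TRUE' and 'True'; the function names, the optional log10 Bayes factor cutoff and the example rows are hypothetical.

```python
def is_true(value):
    """Interpret 'TRUE', 'True' or 'true' as a boolean true."""
    return value.strip().lower() == "true"


def high_purity_sets(rows, min_log10bf=0.0):
    """Keep credible sets that are not low purity and meet a log10 Bayes factor cutoff."""
    selected = []
    for row in rows:
        log10bf = float(row["cs_log10bf"])
        if is_true(row["low_purity"]) or log10bf < min_log10bf:
            continue
        selected.append({
            "cs_id": row["cs_id"],
            "lead_variant": row["v"],
            "bayes_factor": 10 ** log10bf,  # e.g. log10bf = 3 corresponds to 1000
        })
    return selected


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        {"cs_id": "REGION_A_1", "v": "1:200:C:T", "cs_log10bf": "3", "low_purity": "False"},
        {"cs_id": "REGION_A_2", "v": "1:900:G:A", "cs_log10bf": "5", "low_purity": "TRUE"},
    ]
    print(high_purity_sets(rows))
```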

PHENOTYPE.SUSIE_99.cred.summary.tsv

This file contains a summary of the credible sets for this phenotype. The credible sets are the 99% credible sets, i.e. under the model they have a 99% probability of containing the causal variant. The columns are otherwise the same as in PHENOTYPE.SUSIE.cred.summary.tsv.

PHENOTYPE.SUSIE_EXTEND.cred.summary.tsv

This file contains a summary of the credible sets for this phenotype. The credible sets are the 95% credible sets, i.e. under the model they have a 95% probability of containing the causal variant, but they have been extended with the 99% credible set variants where possible. The columns are otherwise the same as in PHENOTYPE.SUSIE.cred.summary.tsv.

PHENOTYPE.SUSIE.snp.bgz

This file contains SuSiE data for all of the variants in all of the regions. The file is a bgzipped, tabix-indexed, tab-separated value file. Each line contains one variant. The columns are described in the table below:

Column name
Description

trait

phenotype in question

region

region which was finemapped, formatted as chrCHROMOSOME:START-END

v

variant identifier in format CHROMOSOME:POSITION:REF:ALT

rsid

variant identifier in format chrCHROMOSOME_POSITION_REF_ALT

chromosome

chromosome of variant, in format chrCHROMOSOME

position

chromosomal position of the variant

allele1

variant reference allele

allele2

variant alternate allele

maf

variant alternate allele frequency

beta

variant effect size in summary statistic

se

variant effect standard error in summary statistic

p

variant p-value in summary statistic

mean

posterior mean beta after fine-mapping

sd

posterior standard deviation after fine-mapping

prob

posterior inclusion probability (PIP) in this region

cs

credible set index, can be used to reference credible set in this region.

cs_specific_prob

posterior inclusion probability (PIP) for this variant in its credible set. Almost always equal or very close to the prob column.

low_purity

Whether the minimum r2 between variants in this credible set was lower than 0.25.

lead_r2

Pearson correlation with the credible set lead variant

mean_99

posterior mean beta after finemapping, for 99% credible set

sd_99

posterior standard deviation after fine-mapping, for 99% credible set

prob_99

posterior inclusion probability (PIP) in this region, for 99% credible set

cs_99

99% credible set index

cs_specific_prob_99

posterior inclusion probability (PIP) for this variant in its 99% credible set. Almost always equal or very close to the prob_99 column.

low_purity_99

Whether the minimum r2 between variants in this 99% credible set was lower than 0.25.

lead_r2_99

Pearson correlation with the 99% credible set lead variant

alpha1..L

posterior inclusion probability for the x-th single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10)

mean1..L

posterior mean beta for the xth single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10)

sd1..L

posterior standard deviation for the xth single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10)

lbf_variable1..L

Log-Bayes factor for each variant and effect, conditional on all other signals

PHENOTYPE.SUSIE.snp.bgz.tbi

Tabix index file for the SUSIE snp file.

PHENOTYPE.SUSIE.snp.filter.tsv

This file contains the filtered SNPs for the 95% credible sets. Variants not included in the 95% credible sets are not included. Neither are those that were part of low_purity credible sets. Variants are listed one per line. The file is in tab-separated value form. The columns are described in the table below:

Column name
Description

trait

phenotype in question

region

region which was finemapped, formatted as chrCHROMOSOME:START-END

v

variant identifier in format CHROMOSOME:POSITION:REF:ALT

cs

credible set index, can be used to reference credible set in this region.

cs_specific_prob

posterior inclusion probability (PIP) for this variant in its credible set.

chromosome

chromosome of the variant

position

chromosomal position of the variant

allele1

reference allele of the variant

allele2

alternate allele of the variant

maf

alternate allele frequency for the variant

beta

variant effect size in summary statistic

p

variant p-value in summary statistic

se

variant standard error in summary statistic

most_severe

most severe predicted consequence for this variant

gene_most_severe

gene in which this consequence is shown
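To go from this variant-level file back to credible sets, the rows can be grouped by region and credible set index. The sketch below builds the group key as REGION_CS_INDEX, following the cs_id description in the credible set summary file; whether the resulting strings match the cs_id column exactly is an assumption. The function name and the example rows are made up.

```python
from collections import defaultdict


def credible_set_members(rows):
    """Group variants by credible set; members are sorted by PIP, highest first."""
    groups = defaultdict(list)
    for row in rows:
        key = f"{row['region']}_{row['cs']}"  # assumed to mirror the cs_id format
        groups[key].append((row["v"], float(row["cs_specific_prob"])))
    return {
        key: sorted(members, key=lambda member: member[1], reverse=True)
        for key, members in groups.items()
    }


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        {"region": "REGION_A", "cs": "1", "v": "1:100:A:G", "cs_specific_prob": "0.15"},
        {"region": "REGION_A", "cs": "1", "v": "1:200:C:T", "cs_specific_prob": "0.80"},
        {"region": "REGION_A", "cs": "2", "v": "1:900:G:A", "cs_specific_prob": "0.97"},
    ]
    for cs_id, members in credible_set_members(rows).items():
        print(cs_id, len(members), members[0])
```

The first member of each group is the variant with the largest PIP, which corresponds to the lead variant reported in the summary file.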

PHENOTYPE.SUSIE_99.snp.filter.tsv

This file contains the filtered SNPs for the 99% credible sets. Variants not included in the 99% credible sets are not included. Neither are those that were part of low_purity credible sets. Variants are listed one per line. The file is in tab-separated value form. This file contains the same columns as the PHENOTYPE.SUSIE.snp.filter.tsv file.

PHENOTYPE.SUSIE_EXTEND.snp.filter.tsv

This file contains the filtered SNPs for the 95% credible sets, extended with 99% credible set variants where applicable. Variants not included in the 95%/99% credible sets are not included. Neither are those that were part of low_purity credible sets. Variants are listed one per line. The file is in tab-separated value form. This file contains the same columns as the PHENOTYPE.SUSIE.snp.filter.tsv file.
