Finemapping of Custom GWAS analyses
This page explains the following:
How is finemapping performed for Custom GWAS analyses?
How to get your endpoint finemapped?
How to access the data?
What data is available and how it is structured?
Finemapping process
The finemapping process consists of two steps: Region selection and actual fine-mapping of the selected regions.
Region selection algorithm
In short, region selection picks the regions that contain genome-wide significant variants for finemapping. Some regions are too large to finemap; those are marked as not possible to finemap.
In more detail, the algorithm takes the summary statistics as input and expands a window around each genome-wide significant variant, using a window size of 3 Mbp and a significance threshold of 5e-8. Overlapping windows are merged, and in the ideal case that is the end of region selection. For practical reasons, however, we cannot finemap arbitrarily large regions, so region width is capped at 6 Mbp, which merged regions sometimes exceed. Regions that are too large are re-formed using a window 10% smaller than in the previous attempt, repeating down to a window size of 1 Mbp. In most cases the regions split into smaller, manageable ones. If the 1 Mbp lower limit is reached without producing finemappable regions, we give up on that region and mark it as not possible to finemap in the outputs.
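The steps above can be sketched in Python. This is only an illustration of the described logic, not the pipeline's actual code: the function names are hypothetical, positions are assumed to come from a single chromosome, the window is assumed to be centred on each variant, and a retry is assumed to use only the variants of the region that was too wide.

```python
from typing import List, Tuple

# (start, end, window size used, status); the layout is illustrative only.
Region = Tuple[int, int, int, str]


def merge_windows(positions: List[int], window: int) -> List[Tuple[int, int, List[int]]]:
    """Centre a window on each position and merge windows that overlap."""
    half = window // 2
    merged: List[Tuple[int, int, List[int]]] = []
    for pos in sorted(positions):
        start, end = max(0, pos - half), pos + half
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, members = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), members + [pos])
        else:
            merged.append((start, end, [pos]))
    return merged


def select_regions(
    positions: List[int],
    window: int = 3_000_000,      # initial window size from the text
    max_width: int = 6_000_000,   # maximum finemappable region width
    min_window: int = 1_000_000,  # smallest window size that is tried
) -> List[Region]:
    """Form regions around genome-wide significant positions on one chromosome."""
    results: List[Region] = []
    for start, end, members in merge_windows(positions, window):
        if end - start <= max_width:
            results.append((start, end, window, "OK"))
        elif window <= min_window:
            # Still too wide at the smallest window: give up on this region.
            results.append((start, end, window, "Failure"))
        else:
            # Retry only this region's variants with a 10% smaller window.
            smaller = max(min_window, window * 9 // 10)
            results.extend(select_regions(members, smaller, max_width, min_window))
    return results
```

For example, four significant variants spaced 2 Mbp apart first merge into one region that is too wide, and then split into four separate regions once the window has shrunk below the spacing between them.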
Fine-mapping of regions
These regions are then finemapped using both FINEMAP and SuSiE. More information about the methods can be found in the release finemapping documentation in the release data bucket, at green_library/finngen_R12/finngen_analysis_documentation/finngen_R12_finemap.md, as well as in the finemapping pipeline repository.
What variants are included in the finemapping process?
Finemapping is performed on variants inside a region that meet the following prerequisites:
They are included in the GWAS summary statistic for that endpoint
Their INFO score for the data release was greater than 0.6
How to get your endpoint finemapped?
Finemapping GWAS analyses is not done by default at this point. To get your GWAS analysis finemapped, send an email to the servicedesk (finngen-servicedesk@helsinki.fi), with the following information:
Request to finemap your endpoint
endpoint name
finngen release
URL to the endpoint in user results Pheweb
For example:
to:finngen-servicedesk@helsinki.fi
subject: FinnGen Custom GWAS Finemap request
Dear service desk,
Could you fine-map the following endpoint:
Phenotype: ENDPOINT X
Release: 12
GWAS results available at FinnGen User Results browser: https://userresults.finngen.fi/pheno/ENDPOINTX
Best,
Eager Finemapper
Note that, for now, only release 12 endpoints are available for finemapping.
Data availability
The finemapping results are available in two places: In the userresults pheweb browser, as well as in the green library.
Finemapped endpoints are automatically loaded to the pheweb browser, where you can find the finemap data when examining a single genome-wide significant region. Unfortunately, fine-mapped results are not yet listed in the phenotype view.
You can get to individual regions by first going to your endpoint in userresults Pheweb, and then clicking either on a GWAS peak in the Manhattan plot or on the 'locus' link in the table, as in the image below.
In the region view, the credible set data should show as both a listing of how many signals were found on both SuSiE and FINEMAP, as well as a locuszoom plot. These have been highlighted with red in the image below.
You can find the finemapping data in the green library under green_library/finngen_R12/sandbox_custom_gwas/PHENOTYPE/finemap, given release 12 and phenotype PHENOTYPE. Note that if this phenotype has not been finemapped, the finemap subfolder does not exist.
Available files
All of the finemapping results are in a bucket /green_library/finngen_R12/sandbox_custom_gwas/PHENOTYPE/finemap. Some of the files are on this top-level directory, while some are in nested directories. The folder contains region selection outputs, FINEMAP and SuSiE outputs.
Here is a table describing each of those files or directories:
Filename | Description |
---|---|
had_results | This file tells if there were any regions to finemap in your endpoint. It will contain the text "True" if there were regions that were sent to finemapping, and "False" if there were no regions to finemap. Having regions to finemap in this context means the endpoint had genome-wide significant (GWS) variants. |
PHENOTYPE.region_status | This tab-separated file (TSV) shows a brief summary of the regions identified in region selection. |
too_many_regions | This file contains the word "True" if your endpoint contained too many regions to finemap (currently the limit is set to 300 regions). |
finemap/ | This folder contains the finemapped results of FINEMAP |
susie/ | This folder contains the finemapped results of SuSiE |
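As a rough illustration of how the marker files can be used, the sketch below checks the state of a results folder before any other files are read. It assumes the folder has been copied or mounted to a local path; the function names and the returned strings are hypothetical and not part of the pipeline.

```python
from pathlib import Path


def read_flag(path: Path) -> bool:
    """Return True only if the file exists and contains the text 'True'."""
    return path.is_file() and path.read_text().strip() == "True"


def finemap_state(finemap_dir: Path) -> str:
    """Summarise the top-level marker files of a finemap results folder."""
    if not finemap_dir.is_dir():
        # The finemap subfolder does not exist for non-finemapped phenotypes.
        return "not finemapped"
    if read_flag(finemap_dir / "too_many_regions"):
        return "too many regions"
    if not read_flag(finemap_dir / "had_results"):
        return "no regions to finemap"
    return "results available"
```

Checking `too_many_regions` before `had_results` is a design choice of this sketch, so that an endpoint that hit the region limit is reported as such.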
Next, the contents of the region status file, as well as finemap and susie folders are described.
Region status file
The region status file is a tab-separated file that tells which regions were sent to finemapping and if there were any problems that prevented finemapping. It has the following columns:
Column name | Description |
---|---|
region | The span of the region, specified in chromosomal coordinates |
status | Status of the region, either "OK" if the region was passed on to finemapping, or "Failure" if the region was not successfully formed. |
windowsize | The window size used when determining a region. Region selection works by extending a window (in base pairs) around each genome-wide significant variant. If windows overlap each other, those windows get merged. These possibly merged windows are the resulting regions that are finemapped. In case a region is larger than the maximum allowed region size (currently 6 megabases), that region is retried with a smaller window. The final window size that was tried is the one shown here. |
failure | Empty if the region was successful. If the region was not successful, the reason is given here. Most likely the region was too long, and it could not be formed even when lowering the window size to its minimum value. |
For example, it might be that there is a genome-wide significant region that is 10 Mbp or even 20 Mbp long. In those cases, it is likely that the region selection algorithm will not be able to narrow down the region into one that can be finemapped.
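A minimal sketch for listing the regions that could not be finemapped is shown below. It assumes the file has a header row with the column names from the table above and has been copied to a local path; the function name is hypothetical.

```python
import csv
from typing import Dict, List


def failed_regions(path: str) -> List[Dict[str, str]]:
    """Return the rows of a region status file whose status is not 'OK'."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row for row in reader if row["status"] != "OK"]
```

Each returned row is a dictionary, so the reason for a failure can be read from `row["failure"]` and the last window size tried from `row["windowsize"]`.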
finemap folder
The finemap folder contains the following files and folders:
Filename | Description |
---|---|
PHENOTYPE.FINEMAP.config.bgz | A bgzipped, tab-separated file containing the posterior summaries for each causal configuration, one per line |
PHENOTYPE.FINEMAP.region.bgz | A bgzipped, tab-separated file containing each region and the probabilities of the predicted causal variant configurations |
PHENOTYPE.FINEMAP.snp.bgz | A bgzipped, tab-separated file containing the credible set status for each of the snps in the finemapped regions. |
PHENOTYPE.FINEMAP.snp.bgz.tbi | A tabix index file for the snp file |
cred_regions/ | A folder containing the individual credible set predictions, with one file per model with amount of k causal SNPs. For example, a file ending with .cred3 has the predictions for the scenario that there are 3 independent causal variants in the region, and therefore 3 credible sets in the region |
The files are described in more detail below.
PHENOTYPE.FINEMAP.config.bgz
This file contains posterior summaries for all of the causal configurations, one per line. The columns are described in the following table:
Column name | Description |
---|---|
rank | ranking of this configuration |
config | the SNP identifiers |
prob | posterior probability of the configuration being the causal configuration |
log10bf | log10 Bayes factor of the configuration. The Bayes factor quantifies the evidence for the causal configuration over the null configuration (no causal variants) |
odds | Odds of the causal configuration |
k | number of SNPs in the causal configuration |
prob_norm_k | posterior probability of this configuration being the causal configuration, normalized over the set of configurations with the same number of causal variants |
h2 | heritability contribution of SNPs |
h2_0.95CI | 95% credible interval of heritability contribution of SNPs |
mean | mean of joint posterior effect size |
sd | standard deviation of joint posterior effect size |
More information can be found in http://www.christianbenner.com/
PHENOTYPE.FINEMAP.region.bgz
This bgzipped, tab-separated value file contains all of the finemapped regions for the endpoint, one region per line.
Column name | Description |
---|---|
trait | phenotype in question |
region | finemapped region |
h2g | Model-averaged heritability |
h2g_sd | Model-averaged heritability, standard deviation |
h2g_lower95 | lower bound of the heritability 95% credible interval |
h2g_upper95 | upper bound of the heritability 95% credible interval |
log10bf | log10 Bayes factor for the region |
prob_1..LSNP | Posterior probability for the number of causal SNPs (= number of credible sets) from 1 to L, where L is the maximum number of causal SNPs considered |
expectedvalue | Expected number of causal SNPs in the genomic region |
More information can be found in http://www.christianbenner.com/
PHENOTYPE.FINEMAP.snp.bgz
This tabixed, bgzipped file contains finemapping information for each of the snps that were finemapped. The file is a tab-separated value (TSV) file with one variant per line. The columns of the file are described in the table below:
Column name | Description |
---|---|
trait | phenotype in question |
region | finemapped region |
v | variant identifier, in form chromosome:position:ref:alt |
index | index |
rsid | variant identifier in the form 'chr'chromosome_position_ref_alt |
chromosome | chromosome of the variant, prefixed with 'chr' |
position | chromosomal position of the variant |
allele1 | reference allele of the variant |
allele2 | alternate allele of the variant |
maf | minor allele frequency of the variant |
beta | effect size of the variant in the GWAS summary statistic |
se | standard error for the variant in GWAS summary statistic |
z | z-score for the variant |
prob | Posterior Inclusion Probability for this variant, i.e. the probability that this variant is causal |
log10bf | log10 of Bayes factor. The Bayes factor quantifies the evidence that the variant is causal. |
mean | This column contains the marginalized shrinkage estimates of the posterior effect size mean for the alternate allele. The marginalized shrinkage estimate for a SNP is computed by averaging the posterior effect size means of this SNP from all causal configurations in the PHENOTYPE.FINEMAP.config.bgz file, assuming that the effect size of the SNP is zero if the SNP is not in the causal configuration. |
sd | This column contains the marginalized shrinkage estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the marginalized shrinkage estimates of the posterior effect size mean. |
mean_incl | This column contains the conditional estimates of the posterior effect size mean for the alternate allele. The conditional estimate for a SNP is computed by averaging the posterior effect size means of this SNP from causal configurations in PHENOTYPE.FINEMAP.config.bgz file in which it is included. |
sd_incl | This column contains the conditional estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the conditional estimates of the posterior effect size mean |
p | p-value of the association in the summary statistic |
csx | credible set index for given number of causal variants x |
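Because bgzip output is a valid multi-block gzip file, the snp file can be read with Python's standard `gzip` module without any extra tooling. The sketch below lists variants whose posterior inclusion probability reaches a cutoff. The function name and the 0.5 cutoff are arbitrary choices for illustration, and the file is assumed to have a header row with the column names from the table above.

```python
import csv
import gzip
from typing import List, Tuple


def high_pip_variants(path: str, min_prob: float = 0.5) -> List[Tuple[str, str, float]]:
    """Return (region, variant, PIP) for variants with prob >= min_prob."""
    hits: List[Tuple[str, str, float]] = []
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                prob = float(row["prob"])
            except ValueError:
                continue  # skip rows where prob is not a number
            if prob >= min_prob:
                hits.append((row["region"], row["v"], prob))
    return hits
```

This reads the whole file sequentially. For lookups by genomic position in large files, the accompanying tabix index is the better route.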
PHENOTYPE.FINEMAP.snp.bgz.tbi
The tabix index file for PHENOTYPE.FINEMAP.snp.bgz. You do not open it directly, but it needs to be in the same folder as the snp file if tabix is used to search the file.
susie folder
The susie folder contains the following files:
Filename | Description |
---|---|
PHENOTYPE.SUSIE.cred.bgz | A bgzipped TSV, containing SUSIE per-credible set output. |
PHENOTYPE.SUSIE.cred.summary.tsv | A summary of the SUSIE credible set output, in TSV form. |
PHENOTYPE.SUSIE.cred_99.bgz | A bgzipped TSV, containing SUSIE per-credible set output for 99% credible sets. |
PHENOTYPE.SUSIE.snp.bgz | SUSIE output for every variant in the regions inspected |
PHENOTYPE.SUSIE.snp.bgz.tbi | A tabix index file for PHENOTYPE.SUSIE.snp.bgz |
PHENOTYPE.SUSIE.snp.filter.tsv | Filtered SUSIE SNP output for 95% credible sets |
PHENOTYPE.SUSIE_99.cred.summary.tsv | A summary of the SUSIE 99% credible set output, in TSV form. |
PHENOTYPE.SUSIE_99.snp.filter.tsv | Filtered SUSIE SNP output for 99% credible sets, in TSV form. |
PHENOTYPE.SUSIE_EXTEND.cred.summary.tsv | A summary of the SUSIE 95% credible sets extended with 99% variants, in TSV form. |
PHENOTYPE.SUSIE_EXTEND.snp.filter.tsv | Filtered SUSIE SNP output for 95% credible sets, extended with 99% CS variants, in TSV form. |
The files are described in more detail below:
PHENOTYPE.SUSIE.cred.bgz
This file contains all of the credible sets for this phenotype. The credible sets are the 95% credible sets, i.e. under the model they have a 95% probability of containing the causal variant. The file is a bgzipped tab-separated values file, with one credible set per line. The columns are described in the following table:
Column name | Description |
---|---|
trait | phenotype in question |
region | region which was finemapped, formatted as |
cs | credible set index. The credible set index can be used to match credible sets with their variants. |
cs_log10bf | Log10 of the credible set's Bayes factor. This quantifies the evidence for the model. For example, a value of 3 means that the model with this specific credible set had a 10^3 = 1000 times larger likelihood than the null model without that credible set. |
cs_avg_r2 | Average r2 between credible set's variants |
cs_min_r2 | Minimum r2 between credible set's variants |
low_purity | This will be 'TRUE' if minimum r2 between all variants in credible set was less than a given threshold, currently 0.25, and 'FALSE' otherwise. |
cs_size | Size of the credible set, in amount of variants included. |
PHENOTYPE.SUSIE.cred_99.bgz
This file contains the 99% credible sets for this phenotype. A 99% credible set is one that, under the model, has a 99% probability of containing the causal variant. The columns are otherwise the same as in PHENOTYPE.SUSIE.cred.bgz.
PHENOTYPE.SUSIE.cred.summary.tsv
This file contains a summary of the credible sets for this phenotype. The credible sets are the 95% credible sets, i.e. under the model they have a 95% probability of containing the causal variant. The file is a tab-separated values file, with one credible set per line. The columns are described in the following table:
Column name | Description |
---|---|
trait | phenotype in question |
region | region which was finemapped, formatted as |
cs | credible set index. The credible set index can be used to match credible sets with their variants. |
cs_log10bf | Log10 of the credible set's Bayes factor. This quantifies the evidence for the model. For example, a value of 3 means that the model with this specific credible set had a 10^3 = 1000 times larger likelihood than the null model without that credible set. |
cs_avg_r2 | Average r2 between credible set's variants |
cs_min_r2 | Minimum r2 between credible set's variants |
low_purity | This will be 'True' if minimum r2 between all variants in credible set was less than a given threshold, currently 0.25, and 'False' otherwise. |
cs_size | Size of the credible set, in amount of variants included. |
good_cs | This column is currently the inverse of low_purity column, and indicates whether the credible set consists of variants that are in reasonably strong LD together. |
cs_id | Unique identifier of the credible set, consisting of the credible set region and credible set index in the following format: |
v | Credible set lead variant (largest PIP in the credible set). In format |
rsid | Credible set lead variant in format |
p | lead variant p-value |
beta | lead variant effect size |
sd | lead variant effect standard error |
prob | lead variant PIP in the region |
cs_specific_prob | lead variant PIP in this specific credible set. This and the prob column are almost always equal or very close to each other. |
most_severe | most severe predicted effect of this variant. |
gene_most_severe | Gene in which the most severe predicted effect of this variant is. |
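One way to get a quick overview of an endpoint is to keep only the credible sets flagged as good and order them by their Bayes factor. The sketch below does this under a few assumptions: the file has a header row with the column names from the table above, `good_cs` holds the text "True" or "False" in some capitalisation, and the function name is hypothetical.

```python
import csv
from typing import Dict, List


def good_credible_sets(path: str) -> List[Dict[str, str]]:
    """Return rows with good_cs set, strongest cs_log10bf first."""
    with open(path, newline="") as handle:
        rows = [
            row
            for row in csv.DictReader(handle, delimiter="\t")
            # Compare case-insensitively, since the exact casing is an assumption.
            if row["good_cs"].strip().lower() == "true"
        ]
    return sorted(rows, key=lambda row: float(row["cs_log10bf"]), reverse=True)
```

The lead variant and its annotation for each remaining credible set are then available as `row["v"]`, `row["most_severe"]` and `row["gene_most_severe"]`.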
PHENOTYPE.SUSIE_99.cred.summary.tsv
This file contains a summary of the credible sets for this phenotype. The credible sets are the 99% credible sets, i.e. under the model they have a 99% probability of containing the causal variant. The columns are otherwise the same as in PHENOTYPE.SUSIE.cred.summary.tsv.
PHENOTYPE.SUSIE_EXTEND.cred.summary.tsv
This file contains a summary of the credible sets for this phenotype. The credible sets are the 95% credible sets, i.e. under the model they have a 95% probability of containing the causal variant, but they have been extended with the 99% credible set variants where possible. The columns are otherwise the same as in PHENOTYPE.SUSIE.cred.summary.tsv.
PHENOTYPE.SUSIE.snp.bgz
This file contains SuSiE data for all of the variants in all of the regions. The file is a bgzipped, tabix-indexed tab-separated value file, with one variant per line. The columns are described in the table below:
Column name | Description |
---|---|
trait | phenotype in question |
region | region which was finemapped, formatted as |
v | variant identifier in format |
rsid | variant identifier in format |
chromosome | chromosome of variant, in format |
position | chromosomal position of the variant |
allele1 | variant reference allele |
allele2 | variant alternate allele |
maf | variant alternate allele frequency |
beta | variant effect size in summary statistic |
se | variant effect standard error in summary statistic |
p | variant p-value in summary statistic |
mean | posterior mean beta after fine-mapping |
sd | posterior standard deviation after fine-mapping |
prob | posterior inclusion probability (PIP) in this region |
cs | credible set index, can be used to reference credible set in this region. |
cs_specific_prob | posterior inclusion probability (PIP) for this variant in its credible set. Almost always almost equal to the |
low_purity | Whether this credible set had r2 between variants that was lower than 0.25. |
lead_r2 | Pearson correlation to the credible set lead variant |
mean_99 | posterior mean beta after finemapping, for 99% credible set |
sd_99 | posterior standard deviation after fine-mapping, for 99% credible set |
prob_99 | posterior inclusion probability (PIP) in this region, for 99% credible set |
cs_99 | 99% credible set index |
cs_specific_prob_99 | posterior inclusion probability (PIP) for this variant in its 99% credible set. Almost always almost equal to the |
low_purity_99 | Whether this 99% credible set had r2 between variants that was lower than 0.25. |
lead_r2_99 | Pearson correlation to the 99% credible set lead variant |
alpha1..L | posterior inclusion probability for the x-th single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10) |
mean1..L | posterior mean beta for the xth single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10) |
sd1..L | posterior standard deviation for the xth single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10) |
lbf_variable1..L | Log-Bayes factor for each variant and effect, conditional on all other signals |
PHENOTYPE.SUSIE.snp.bgz.tbi
Tabix index file for the SUSIE snp file.
PHENOTYPE.SUSIE.snp.filter.tsv
This file contains the filtered SNPs for the 95% credible sets. Variants not included in the 95% credible sets are not included. Neither are those that were part of low_purity credible sets. Variants are listed one per line. The file is in tab-separated value form. The columns are described in the table below:
Column name | Description |
---|---|
trait | phenotype in question |
region | region which was finemapped, formatted as |
v | variant identifier in format |
cs | credible set index, can be used to reference credible set in this region. |
cs_specific_prob | posterior inclusion probability (PIP) for this variant in its credible set. |
chromosome | chromosome of the variant |
position | chromosomal position of the variant |
allele1 | reference allele of the variant |
allele2 | alternate allele of the variant |
maf | alternate allele frequency for the variant |
beta | variant effect size in summary statistic |
p | variant p-value in summary statistic |
se | variant standard error in summary statistic |
most_severe | most severe predicted consequence for this variant |
gene_most_severe | gene in which this consequence is shown |
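Since the `cs` index is only meaningful within a region, variants can be grouped into credible sets by the pair (`region`, `cs`). The sketch below is one way to do that, assuming a header row with the column names from the table above; the function name is hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict, List, Tuple


def variants_by_credible_set(path: str) -> Dict[Tuple[str, str], List[Tuple[str, float]]]:
    """Map (region, cs) to its variants as (v, cs_specific_prob), highest PIP first."""
    groups: Dict[Tuple[str, str], List[Tuple[str, float]]] = defaultdict(list)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            key = (row["region"], row["cs"])
            groups[key].append((row["v"], float(row["cs_specific_prob"])))
    for members in groups.values():
        members.sort(key=lambda member: member[1], reverse=True)
    return dict(groups)
```

With this ordering, the first entry of each list is the variant with the highest credible-set-specific PIP.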
PHENOTYPE.SUSIE_99.snp.filter.tsv
This file contains the filtered SNPs for the 99% credible sets. Variants not included in the 99% credible sets are omitted, as are variants that were part of low_purity credible sets. Variants are listed one per line. The file is in tab-separated value form and contains the same columns as the PHENOTYPE.SUSIE.snp.filter.tsv file.
PHENOTYPE.SUSIE_EXTEND.snp.filter.tsv
This tab-separated values file contains the filtered SNPs for the 95% credible sets, extended with 99% credible set variants where applicable. Variants not included in these credible sets are omitted, as are variants that were part of low_purity credible sets. Variants are listed one per line. This file contains the same columns as the PHENOTYPE.SUSIE.snp.filter.tsv file.