Finemapping of Custom GWAS analyses
This page explains the following:
How is finemapping performed for Custom GWAS analyses?
How to get your endpoint finemapped?
How to access the data?
What data is available and how it is structured?
Finemapping process
The finemapping process consists of two steps: Region selection and actual fine-mapping of the selected regions.
Region selection algorithm
In short, region selection picks the regions that contain genome-wide significant variants for finemapping. Sometimes regions are too large to finemap, in which case they are marked as not possible to finemap.
In more detail, the region selection algorithm works as follows. Taking the summary statistics as input, it expands a window around each genome-wide significant variant, using a window size of 3 Mbp and a significance threshold of 5e-8. Any overlapping windows are then merged, and in the ideal case that is the end of region selection. For practical reasons, however, we cannot finemap arbitrarily large regions, so region width is capped at 6 Mbp, which merged regions sometimes exceed. Regions that are too large are re-formed using a window size 10% smaller than in the previous attempt, and this is repeated until the window size reaches 1 Mbp. In most cases the oversized region splits into smaller, manageable regions. If the lower limit of 1 Mbp is reached without forming finemappable regions, we give up on that region and mark it as not possible to finemap in the outputs.
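The window-and-merge procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: it handles a single chromosome, it assumes the window size is the total width centred on the variant, and the function names (merge_windows, select_regions) are made up for this example.

```python
from typing import List, Tuple

# One merged region: (start, end, positions of the significant variants inside it).
Merged = Tuple[int, int, List[int]]


def merge_windows(positions: List[int], window: int) -> List[Merged]:
    """Centre a window of total width `window` on each position and merge overlaps."""
    half = window // 2  # assumption: the window is split evenly around the variant
    merged: List[Merged] = []
    for pos in sorted(positions):
        start, end = max(0, pos - half), pos + half
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, members = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), members + [pos])
        else:
            merged.append((start, end, [pos]))
    return merged


def select_regions(positions, window=3_000_000, max_width=6_000_000,
                   min_window=1_000_000):
    """Return (ok, failed); each entry is (start, end, window size last tried)."""
    ok, failed = [], []
    for start, end, members in merge_windows(positions, window):
        if end - start <= max_width:
            ok.append((start, end, window))
            continue
        smaller = window * 9 // 10  # 10% smaller window, integer arithmetic
        if smaller < min_window:
            # Lower limit reached: the region cannot be finemapped.
            failed.append((start, end, window))
        else:
            # Retry only the variants of the oversized region with the smaller window.
            sub_ok, sub_failed = select_regions(members, smaller, max_width, min_window)
            ok.extend(sub_ok)
            failed.extend(sub_failed)
    return ok, failed
```

For example, four significant variants spaced 2 Mbp apart first merge into one region that is too wide, and only split into four separate regions once the window has shrunk below 2 Mbp. Significant variants every 0.5 Mbp across 10 Mbp never split, so that region ends up in the failed list.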
Fine-mapping of regions
These regions are then finemapped using both FINEMAP and SuSiE. More information about the methods can be found in the release finemapping documentation in the release data bucket (green_library/finngen_R12/finngen_analysis_documentation/finngen_R12_finemap.md), as well as in the finemapping pipeline repository.
What variants are included in the finemapping process?
Finemapping is performed on variants inside a region that meet the following criteria:
They are included in the GWAS summary statistic for that endpoint
Their INFO score for the data release was greater than 0.6
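As a rough illustration of the two criteria above, the filter could look like the sketch below. The function name, the variant identifier format (chromosome:position:ref:alt) and the dictionary of INFO scores are assumptions made for this example; the actual pipeline may represent these differently.

```python
def eligible_variants(sumstat_variants, info_scores, chrom, start, end, min_info=0.6):
    """Return variants inside chrom:start-end that are present in the summary
    statistics and whose INFO score is strictly greater than min_info."""
    kept = []
    for variant in sumstat_variants:
        v_chrom, v_pos = variant.split(":")[:2]
        in_region = v_chrom == chrom and start <= int(v_pos) <= end
        # Variants without a known INFO score are treated as failing the filter.
        if in_region and info_scores.get(variant, 0.0) > min_info:
            kept.append(variant)
    return kept
```

Note that a variant with an INFO score of exactly 0.6 is excluded here, since the criterion is "greater than 0.6".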
How to get your endpoint finemapped?
Custom GWAS analyses are not finemapped by default at this point. To get your GWAS analysis finemapped, send an email to the service desk (finngen-servicedesk@helsinki.fi) with the following information:
Request to finemap your endpoint
endpoint name
finngen release
URL to the endpoint in the user results Pheweb
For example:
to:finngen-servicedesk@helsinki.fi
subject: FinnGen Custom GWAS Finemap request
Dear service desk,
Could you fine-map the following endpoint:
Phenotype: ENDPOINT X
Release: 12
GWAS results available at FinnGen User Results browser: https://userresults.finngen.fi/pheno/ENDPOINTX
Best,
Eager Finemapper
Note that, for now, only release 12 endpoints are available for finemapping.
Data availability
The finemapping results are available in two places: In the userresults pheweb browser, as well as in the green library.
Finemapped endpoints are automatically loaded to the pheweb browser. In the pheweb browser, you can find the finemap data when examining a single genome-wide significant region. Unfortunately, fine-mapped results are not yet listed in the phenotype view.
You can get to individual regions by first going to your endpoint in userresults Pheweb, and then clicking either on a GWAS peak in the Manhattan plot or on the 'locus' link in the table, as shown in the image below.
In the region view, the credible set data is shown both as a listing of how many signals were found by SuSiE and by FINEMAP, and as a locuszoom plot. These have been highlighted in red in the image below.
You can find the finemapping data in the green library under green_library/finngen_R12/sandbox_custom_gwas/PHENOTYPE/finemap, given release 12 and phenotype PHENOTYPE. Note that if this phenotype has not been finemapped, the finemap subfolder does not exist.
Available files
All of the finemapping results are in the bucket folder /green_library/finngen_R12/sandbox_custom_gwas/PHENOTYPE/finemap. Some of the files are in this top-level directory, while others are in nested directories. The folder contains the region selection outputs as well as the FINEMAP and SuSiE outputs.
Here is a table describing each of those files or directories:
had_results
This file tells if there were any regions to finemap in your endpoint. It will contain the text "True" if there were regions that were sent to finemapping, and "False" if there were no regions to finemap. Having regions to finemap in this context means the endpoint had genome-wide significant (GWS) variants.
PHENOTYPE.region_status
This tab-separated file (TSV) shows a brief summary of the regions identified in region selection.
too_many_regions
This file contains the word "True" if your endpoint contained too many regions to finemap (currently the limit is set to 300 regions).
finemap/
This folder contains the finemapped results of FINEMAP
susie/
This folder contains the finemapped results of SuSiE
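The two flag files above can be checked before looking for any other outputs. The following is a minimal sketch, assuming the files contain just the word "True" or "False" and that you have a local copy of (or a mounted path to) the finemap folder. The helper names and the folder argument are hypothetical, and a missing flag file is treated as "False", which is also an assumption.

```python
from pathlib import Path


def parse_flag(text: str) -> bool:
    """Interpret the content of a flag file; anything other than "True" counts as False."""
    return text.strip() == "True"


def describe_status(had_results: bool, too_many_regions: bool) -> str:
    """Turn the two flags into a short human-readable status."""
    if not had_results:
        return "no genome-wide significant regions to finemap"
    if too_many_regions:
        return "too many regions to finemap"
    return "regions were sent to finemapping"


def read_status(folder: str) -> str:
    """Read both flag files from `folder` (hypothetical path to the finemap folder)."""
    base = Path(folder)
    had = base / "had_results"
    too_many = base / "too_many_regions"
    return describe_status(
        parse_flag(had.read_text()) if had.exists() else False,
        parse_flag(too_many.read_text()) if too_many.exists() else False,
    )
```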
Next, the contents of the region status file, as well as finemap and susie folders are described.
Region status file
The region status file is a tab-separated file that tells which regions were sent to finemapping and if there were any problems that prevented finemapping. It has the following columns:
region
The span of the region, specified in chromosomal coordinates chromosome.start-end
status
Status of the region, either "OK" if the region was passed on to finemapping, or "Failure" if the region was not successfully formed.
windowsize
The window size used when determining a region. Region selection works by extending a window (in base pairs) around each genome-wide significant variant. If windows overlap each other, those windows get merged. These possibly merged windows are the resulting regions that are finemapped. In case a region is larger than the maximum allowed region size (currently 6 megabases), that region is retried with a smaller window. The final window size that was tried is the one shown here.
failure
Empty if the region was successful. In case the region was not successful, the reason is given here. Most likely the region was too long, and it could not be formed even when lowering the window size to its minimum value.
For example, it might be that there is a genome-wide significant region that is 10 Mbp or even 20 Mbp long. In those cases, it is likely that the region selection algorithm will not be able to narrow down the region into one that can be finemapped.
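To see at a glance which regions failed, the region status file can be read with Python's csv module. This is a sketch under assumptions: the column names are taken from the table above, the region string is assumed to look like chromosome.start-end (a colon separator and an optional 'chr' prefix are also accepted here, just in case), and the function name is made up for this example.

```python
import csv
import re

# Accepts e.g. "6.25000000-45000000" or "chr6:25000000-45000000" (assumed formats).
REGION_RE = re.compile(r"^(?:chr)?(\w+)[.:](\d+)-(\d+)$")


def read_region_status(lines):
    """Parse the lines of PHENOTYPE.region_status into a list of dictionaries."""
    rows = []
    for row in csv.DictReader(lines, delimiter="\t"):
        match = REGION_RE.match(row["region"])
        if match is None:
            raise ValueError(f"unexpected region format: {row['region']!r}")
        start, end = int(match.group(2)), int(match.group(3))
        rows.append({
            "chrom": match.group(1),
            "start": start,
            "end": end,
            "width": end - start,
            "ok": row["status"] == "OK",
            "windowsize": int(float(row["windowsize"])),
            "failure": row["failure"],
        })
    return rows
```

Any iterable of text lines works as input, for example an open file handle: with open("PHENOTYPE.region_status", newline="") as handle, call read_region_status(handle) and then filter on the "ok" key to list the regions that could not be finemapped.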
finemap folder
The finemap folder contains the following files and folders:
PHENOTYPE.FINEMAP.config.bgz
A bgzipped, tab-separated file containing the posterior summaries for each causal configuration, one per line
PHENOTYPE.FINEMAP.region.bgz
A bgzipped, tab-separated file containing each region and the probabilities of the predicted causal variant configurations
PHENOTYPE.FINEMAP.snp.bgz
A bgzipped, tab-separated file containing the credible set status for each of the snps in the finemapped regions.
PHENOTYPE.FINEMAP.snp.bgz.tbi
A tabix index file for the snp file
cred_regions/
A folder containing the individual credible set predictions, with one file per model with k causal SNPs. For example, a file ending with .cred3 has the predictions for the scenario in which there are 3 independent causal variants in the region, and therefore 3 credible sets in the region
The files are described in more detail below.
PHENOTYPE.FINEMAP.config.bgz
This file contains posterior summaries for all of the causal configurations, one per line. The columns are described in the following table:
rank
ranking of this configuration
config
the SNP identifiers
prob
posterior probability of the configuration being the causal configuration
log10bf
log10 Bayes factor of the configuration. The Bayes factor quantifies the evidence for the causal configuration over the null configuration (no causal variants)
odds
Odds of the causal configuration
k
number of SNPs in the causal configuration
prob_norm_k
posterior probability of this configuration being the causal configuration, normalized over the set of configurations with the same number of causal variants
h2
heritability contribution of SNPs
h2_0.95CI
95% credible interval of heritability contribution of SNPs
mean
mean of joint posterior effect size
sd
standard deviation of joint posterior effect size
More information can be found in http://www.christianbenner.com/
PHENOTYPE.FINEMAP.region.bgz
This bgzipped, tab-separated value file contains all of the finemapped regions for the endpoint, one region per line.
trait
phenotype in question
region
finemapped region
h2g
Model-averaged heritability
h2g_sd
Model-averaged heritability, standard deviation
h2g_lower95
lower bound of the heritability 95% credible interval
h2g_upper95
upper bound of the heritability 95% credible interval
log10bf
log10 Bayes factor for the region
prob_1..LSNP
Posterior probability for the number of causal SNPs (= number of credible sets) from 1 to L, where L is the maximum number of causal SNPs considered
expectedvalue
Expected number of causal SNPs in the genomic region
More information can be found in http://www.christianbenner.com/
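The relationship between the per-count probabilities and the expected number of causal SNPs is a plain weighted sum, which the sketch below recomputes from one row of the region file. It assumes the probability columns are named prob_1SNP, prob_2SNP and so on (the table above only gives the pattern prob_1..LSNP), so check the header of your own file first. The function name is made up for this example.

```python
import re


def expected_causal_snps(row):
    """Return sum over k of k * P(k causal SNPs), using the prob_kSNP columns of a row.

    `row` is a dictionary mapping column names to string values, e.g. one row
    produced by csv.DictReader(..., delimiter="\t").
    """
    total = 0.0
    for name, value in row.items():
        match = re.fullmatch(r"prob_(\d+)SNP", name)  # assumed column naming
        if match and value not in ("", "NA"):
            total += int(match.group(1)) * float(value)
    return total
```

If the assumption about column names holds, the result should be close to the value in the expectedvalue column of the same row.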
PHENOTYPE.FINEMAP.snp.bgz
This tabixed, bgzipped file contains finemapping information for each of the snps that were finemapped. The file is a tab-separated value (TSV) file with one variant per line. The columns of the file are described in the table below:
trait
phenotype in question
region
finemapped region
v
variant identifier, in form chromosome:position:ref:alt
index
index
rsid
variant identifier in the form 'chr'chromosome_position_ref_alt
chromosome
chromosome of the variant, prefixed with 'chr'
position
chromosomal position of the variant
allele1
reference allele of the variant
allele2
alternate allele of the variant
maf
minor allele frequency of the variant
beta
effect size of the variant in the GWAS summary statistic
se
standard error for the variant in GWAS summary statistic
z
z-score for the variant
prob
Posterior Inclusion Probability for this variant, i.e. the probability that this variant is causal
log10bf
log10 of Bayes factor. The Bayes factor quantifies the evidence that the variant is causal.
mean
This column contains the marginalized shrinkage estimates of the posterior effect size mean for the alternate allele. The marginalized shrinkage estimate for a SNP is computed by averaging the posterior effect size means of this SNP from all causal configurations in the PHENOTYPE.FINEMAP.config.bgz file, assuming that the effect size of the SNP is zero if the SNP is not in the causal configuration.
sd
This column contains the marginalized shrinkage estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the marginalized shrinkage estimates of the posterior effect size mean.
mean_incl
This column contains the conditional estimates of the posterior effect size mean for the alternate allele. The conditional estimate for a SNP is computed by averaging the posterior effect size means of this SNP from causal configurations in PHENOTYPE.FINEMAP.config.bgz file in which it is included.
sd_incl
This column contains the conditional estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the conditional estimates of the posterior effect size mean
p
p-value of the association in the summary statistic
csx
credible set index for given number of causal variants x
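Because bgzipped files are also valid gzip files, the snp file can be read with Python's standard gzip module without any extra tools. The sketch below lists the variants with the highest posterior inclusion probability, using the v, region and prob columns described above. The function names are made up for this example, and the path you pass in is whatever local path you have for the file.

```python
import csv
import gzip


def top_pip_variants(lines, region=None, n=5):
    """Return the n variants with the highest PIP as (variant, pip) pairs.

    `lines` is any iterable of tab-separated text lines including the header.
    If `region` is given, only variants from that region are considered.
    """
    rows = [
        row for row in csv.DictReader(lines, delimiter="\t")
        if region is None or row["region"] == region
    ]
    rows.sort(key=lambda row: float(row["prob"]), reverse=True)
    return [(row["v"], float(row["prob"])) for row in rows[:n]]


def top_pip_from_file(path, region=None, n=5):
    """Same as top_pip_variants, reading directly from a bgzipped (gzip-compatible) file."""
    with gzip.open(path, "rt", newline="") as handle:
        return top_pip_variants(handle, region, n)
```

This reads the whole file, which is fine for a quick look. For repeated lookups of a single region in a large file, a tabix query that uses the .tbi index is the more efficient route.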
PHENOTYPE.FINEMAP.snp.bgz.tbi
The tabix index file for PHENOTYPE.FINEMAP.snp.bgz. It is not directly used, but needs to be in same folder as the snp file in case tabix is used to search the file.
susie folder
The susie folder contains the following files:
PHENOTYPE.SUSIE.cred.bgz
A bgzipped TSV, containing SUSIE per-credible set output.
PHENOTYPE.SUSIE.cred.summary.tsv
A summary of the SUSIE credible set output, in TSV form.
PHENOTYPE.SUSIE.cred_99.bgz
A bgzipped TSV, containing SUSIE per-credible set output for 99% credible sets.
PHENOTYPE.SUSIE.snp.bgz
SUSIE output for every variant in the regions inspected
PHENOTYPE.SUSIE.snp.bgz.tbi
A tabix index file for PHENOTYPE.SUSIE.snp.bgz
PHENOTYPE.SUSIE.snp.filter.tsv
Filtered SUSIE SNP output for 95% credible sets
PHENOTYPE.SUSIE_99.cred.summary.tsv
A summary of the SUSIE 99% credible set output, in TSV form.
PHENOTYPE.SUSIE_99.snp.filter.tsv
Filtered SUSIE SNP output for 99% credible sets, in TSV form.
PHENOTYPE.SUSIE_EXTEND.cred.summary.tsv
A summary of the SUSIE 95% credible sets extended with 99% variants, in TSV form.
PHENOTYPE.SUSIE_EXTEND.snp.filter.tsv
Filtered SUSIE SNP output for 95% credible sets, extended with 99% CS variants, in TSV form.
The files are described in more detail below:
PHENOTYPE.SUSIE.cred.bgz
This file contains all of the credible sets for this phenotype. The credible sets are the 95% credible sets, i.e. under the model they have a 95% probability of containing the causal variant. The file is a bgzipped tab-separated values file, with one credible set per line. The columns are described in the following table:
trait
phenotype in question
region
region which was finemapped, formatted as chrCHROMOSOME:START-END
cs
credible set index. The credible set index can be used to match credible sets with their variants.
cs_log10bf
Log10 of the credible set's Bayes factor. This quantifies the evidence for the model. For example, a value of 3 means that the model with this specific credible set had a 10^3 = 1000 times larger likelihood than the null model without that credible set.
cs_avg_r2
Average r2 between credible set's variants
cs_min_r2
Minimum r2 between credible set's variants
low_purity
This will be 'TRUE' if minimum r2 between all variants in credible set was less than a given threshold, currently 0.25, and 'FALSE' otherwise.
cs_size
Size of the credible set, in amount of variants included.
PHENOTYPE.SUSIE.cred_99.bgz
This file contains the 99% credible sets for this phenotype. A 99% credible set is one which, under the model, has a 99% probability of containing the causal variant. The columns are otherwise the same as in PHENOTYPE.SUSIE.cred.bgz.
PHENOTYPE.SUSIE.cred.summary.tsv
This file contains a summary of the credible sets for this phenotype. The credible sets are the 95% credible sets, i.e. under the model they have a 95% probability of containing the causal variant. The file is a tab-separated values file, with one credible set per line. The columns are described in the following table:
trait
phenotype in question
region
region which was finemapped, formatted as chrCHROMOSOME:START-END
cs
credible set index. The credible set index can be used to match credible sets with their variants.
cs_log10bf
Log10 of the credible set's Bayes factor. This quantifies the evidence for the model. For example, a value of 3 means that the model with this specific credible set had a 10^3 = 1000 times larger likelihood than the null model without that credible set.
cs_avg_r2
Average r2 between credible set's variants
cs_min_r2
Minimum r2 between credible set's variants
low_purity
This will be 'True' if minimum r2 between all variants in credible set was less than a given threshold, currently 0.25, and 'False' otherwise.
cs_size
Size of the credible set, in amount of variants included.
good_cs
This column is currently the inverse of low_purity column, and indicates whether the credible set consists of variants that are in reasonably strong LD together.
cs_id
Unique identifier of the credible set, consisting of the credible set region and credible set index in the following format: REGION_CS_INDEX
v
Credible set lead variant (largest PIP in the credible set). In format CHROMOSOME:POSITION:REF:ALT
rsid
Credible set lead variant in format chrCHROMOSOME_POSITION_REF_ALT
p
lead variant p-value
beta
lead variant effect size
sd
lead variant effect standard error
prob
lead variant PIP in the region
cs_specific_prob
lead variant PIP in this specific credible set. This and the prob column are almost always equal or very close to each other.
most_severe
most severe predicted effect of this variant.
gene_most_severe
Gene in which the most severe predicted effect of this variant is.
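A common first step with the summary file is to list the lead variants of the credible sets that pass the purity check. The sketch below does that with the standard csv module, using the column names from the table above. The function name is made up for this example, and the good_cs value is compared case-insensitively because the exact capitalisation may vary between files.

```python
import csv


def good_lead_variants(lines):
    """Return lead variants of credible sets with good_cs == True, highest PIP first.

    `lines` is any iterable of tab-separated text lines including the header,
    for example an open handle to PHENOTYPE.SUSIE.cred.summary.tsv.
    """
    leads = []
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["good_cs"].strip().lower() != "true":
            continue  # skip low-purity credible sets
        leads.append({
            "cs_id": row["cs_id"],
            "lead": row["v"],
            "pip": float(row["prob"]),
            # cs_log10bf is on the log10 scale; convert back to a Bayes factor.
            "bayes_factor": 10 ** float(row["cs_log10bf"]),
            "size": int(row["cs_size"]),
            "gene": row["gene_most_severe"],
        })
    return sorted(leads, key=lambda entry: entry["pip"], reverse=True)
```

The same function should work for the SUSIE_99 and SUSIE_EXTEND summary files, provided they share these columns as described.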
PHENOTYPE.SUSIE_99.cred.summary.tsv
This file contains a summary of the credible sets for this phenotype. The credible sets are the 99% credible sets, i.e. under the model they have a 99% probability of containing the causal variant. The columns are otherwise the same as in PHENOTYPE.SUSIE.cred.summary.tsv.
PHENOTYPE.SUSIE_EXTEND.cred.summary.tsv
This file contains a summary of the credible sets for this phenotype. The credible sets are the 95% credible sets, i.e. under the model they have a 95% probability of containing the causal variant, but they have been extended with the 99% credible set variants where possible. The columns are otherwise the same as in PHENOTYPE.SUSIE.cred.summary.tsv.
PHENOTYPE.SUSIE.snp.bgz
This file contains SuSiE data for all of the variants in all of the regions. The file is in bgzipped, tabixed tab-separated value form. One line contains one variant. The columns are described in the table below:
trait
phenotype in question
region
region which was finemapped, formatted as chrCHROMOSOME:START-END
v
variant identifier in format CHROMOSOME:POSITION:REF:ALT
rsid
variant identifier in format chrCHROMOSOME_POSITION_REF_ALT
chromosome
chromosome of variant, in format chrCHROMOSOME
position
chromosomal position of the variant
allele1
variant reference allele
allele2
variant alternate allele
maf
variant alternate allele frequency
beta
variant effect size in summary statistic
se
variant effect standard error in summary statistic
p
variant p-value in summary statistic
mean
posterior mean beta after fine-mapping
sd
posterior standard deviation after fine-mapping
prob
posterior inclusion probability (PIP) in this region
cs
credible set index, can be used to reference credible set in this region.
cs_specific_prob
posterior inclusion probability (PIP) for this variant in its credible set. Almost always nearly equal to the prob column.
low_purity
Whether this credible set had r2 between variants that was lower than 0.25.
lead_r2
Pearson correlation (r) to the credible set lead variant
mean_99
posterior mean beta after finemapping, for 99% credible set
sd_99
posterior standard deviation after fine-mapping, for 99% credible set
prob_99
posterior inclusion probability (PIP) in this region, for 99% credible set
cs_99
99% credible set index
cs_specific_prob_99
posterior inclusion probability (PIP) for this variant in its 99% credible set. Almost always nearly equal to the prob_99 column.
low_purity_99
Whether this 99% credible set had r2 between variants that was lower than 0.25.
lead_r2_99
Pearson correlation (r) to the 99% credible set lead variant
alpha1..L
posterior inclusion probability for the x-th single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10)
mean1..L
posterior mean beta for the xth single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10)
sd1..L
posterior standard deviation for the xth single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10)
lbf_variable1..L
Log-Bayes factor for each variant and effect, conditional on all other signals
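The per-effect alpha columns and the overall PIP are related: in the SuSiE model, a variant's PIP is the probability that it is selected by at least one of the L single effects, i.e. 1 minus the product of (1 - alpha) over the effects. The sketch below recomputes this from one row of the snp file. It assumes the columns are named alpha1, alpha2, ... up to alpha10, and the function name is made up for this example. SuSiE itself may leave out effects with a negligible estimated prior variance, so the result is not guaranteed to match the prob column exactly.

```python
def pip_from_alphas(row, n_effects=10):
    """Combine per-effect inclusion probabilities into a single PIP.

    `row` maps column names to values (strings or floats), e.g. one row from
    csv.DictReader. Missing or "NA" alpha columns are skipped.
    """
    prob_not_selected = 1.0
    for effect in range(1, n_effects + 1):
        value = row.get(f"alpha{effect}")  # assumed column naming
        if value in (None, "", "NA"):
            continue
        prob_not_selected *= 1.0 - float(value)
    return 1.0 - prob_not_selected
```

For instance, a variant with alpha 0.5 in two different effects has a combined PIP of 1 - 0.5 * 0.5 = 0.75.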
PHENOTYPE.SUSIE.snp.bgz.tbi
Tabix index file for the SUSIE snp file.
PHENOTYPE.SUSIE.snp.filter.tsv
This file contains the filtered SNPs for the 95% credible sets. Variants not included in the 95% credible sets are not included. Neither are those that were part of low_purity credible sets. Variants are listed one per line. The file is in tab-separated value form. The columns are described in the table below:
trait
phenotype in question
region
region which was finemapped, formatted as chrCHROMOSOME:START-END
v
variant identifier in format CHROMOSOME:POSITION:REF:ALT
cs
credible set index, can be used to reference credible set in this region.
cs_specific_prob
posterior inclusion probability (PIP) for this variant in its credible set.
chromosome
chromosome of the variant
position
chromosomal position of the variant
allele1
reference allele of the variant
allele2
alternate allele of the variant
maf
alternate allele frequency for the variant
beta
variant effect size in summary statistic
p
variant p-value in summary statistic
se
variant standard error in summary statistic
most_severe
most severe predicted consequence for this variant
gene_most_severe
gene in which this consequence is shown
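Since the filter file lists one variant per line, it is often convenient to group the variants by credible set. The sketch below does this using the region, cs, v and cs_specific_prob columns described above. The function names are made up for this example.

```python
import csv
from collections import defaultdict


def group_by_credible_set(lines):
    """Group variants by (region, cs) and sort each group by PIP, highest first.

    `lines` is any iterable of tab-separated text lines including the header,
    for example an open handle to PHENOTYPE.SUSIE.snp.filter.tsv.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(lines, delimiter="\t"):
        key = (row["region"], row["cs"])
        groups[key].append((row["v"], float(row["cs_specific_prob"])))
    return {
        key: sorted(members, key=lambda member: member[1], reverse=True)
        for key, members in groups.items()
    }


def total_pip(members):
    """Sum of the credible-set-specific PIPs of the variants in one credible set."""
    return sum(pip for _, pip in members)
```

The first entry of each group is the variant with the largest PIP in that credible set. The total PIP of a group gives a rough idea of how much probability mass the credible set covers.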
PHENOTYPE.SUSIE_99.snp.filter.tsv
This file contains the filtered SNPs for the 99% credible sets. Variants not included in the 99% credible sets are not included. Neither are those that were part of low_purity credible sets. Variants are listed one per line. The file is in tab-separated value form. This file contains the same columns as the PHENOTYPE.SUSIE.snp.filter.tsv file.
PHENOTYPE.SUSIE_extend.snp.filter.tsv
This tab-separated values file contains the filtered SNPs for the 95% credible sets, extended with 99% credible set variants where applicable. Variants not included in the 95%/99% credible sets are not included. Neither are those that were part of low_purity credible sets. Variants are listed one per line. This file contains the same columns as the PHENOTYPE.SUSIE.snp.filter.tsv file.