Data description
File naming pattern and file structure
Summary association statistics
GWAS summary statistics (tab-delimited, bgzipped, genome build 38, tabix index files included) are named {endpoint}.gz. For example, endpoint I9_CHD has I9_CHD.gz and I9_CHD.gz.tbi.
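As a minimal sketch of the naming pattern, the helper below builds the two file names for an endpoint and streams rows from a tab-delimited file. It relies on the fact that bgzip output is valid gzip, so Python's standard gzip module can read it sequentially (random access by region would need a tabix-aware tool instead). The function names and the toy columns col_a and col_b are hypothetical and are not the real column names.

```python
import csv
import gzip
import os
import tempfile


def sumstats_files(endpoint):
    """Return (data file name, tabix index file name) for an endpoint."""
    data = f"{endpoint}.gz"
    return data, f"{data}.tbi"


def iter_rows(path):
    """Yield each data row of a tab-delimited gzip/bgzip file as a dict."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


# Toy demonstration: write a tiny stand-in file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    data_name, index_name = sumstats_files("I9_CHD")
    path = os.path.join(tmp, data_name)
    with gzip.open(path, "wt", newline="") as out:
        out.write("col_a\tcol_b\n1\t2\n")  # made-up content, not real data
    rows = list(iter_rows(path))
```

All values come back as strings, so numeric columns need an explicit float() or int() conversion before use.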
To learn more about the methods used, see section GWAS.
The {endpoint}.gz files have the following structure:
Column name | Description |
| chromosome on build GRCh38 ( |
| position in base pairs on build GRCh38 |
| reference allele |
| alternative allele (effect allele) |
| variant identifier |
| nearest gene(s) (comma separated) from variant |
| p-value from regenie |
| -log10(p-value) |
| effect size (log(OR) scale) estimated with regenie for the alternative allele |
| standard error of effect size estimated with regenie |
| alternative (effect) allele frequency |
| alternative (effect) allele frequency among cases |
| alternative (effect) allele frequency among controls |
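Because the effect size is reported on the log(OR) scale, it can be converted to an odds ratio by exponentiation. The sketch below also derives a Wald-type 95% confidence interval from the standard error and recomputes -log10(p-value). The interval formula is a common convention assumed here, not something stated in this description, and the function names are hypothetical.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Convert a log(OR) effect size and its standard error to (OR, lower, upper).

    Assumes a Wald interval: exp(beta +/- z * se), with z = 1.96 for 95%.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


def minus_log10_p(p_value):
    """Return -log10(p-value)."""
    return -math.log10(p_value)


# Toy values, not taken from any real result
odds_ratio, lower, upper = odds_ratio_ci(beta=0.1, se=0.02)
```

The odds ratio refers to the alternative (effect) allele, so a positive effect size corresponds to an odds ratio above 1 for carriers of that allele.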
Fine-mapping results
Two fine-mapping methods were used: SuSiE and FINEMAP.
Fine-mapping results are tab-delimited and bgzipped.
SuSiE results have the following filename pattern:
{endpoint}.SUSIE.cred.bgz
{endpoint}.SUSIE.cred_99.bgz
{endpoint}.SUSIE.snp.bgz
FINEMAP results have the following filename pattern:
{endpoint}.FINEMAP.config.bgz
{endpoint}.FINEMAP.region.bgz
{endpoint}.FINEMAP.snp.bgz
To learn more about the methods used, see section Fine-mapping.
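The filename patterns above can be expanded for a given endpoint with a small helper. This is only an illustrative sketch: the function and constant names are hypothetical, and it does not check whether a file actually exists for a particular endpoint.

```python
# File kinds per fine-mapping method, as listed in the filename patterns
FINEMAPPING_KINDS = {
    "SUSIE": ("cred", "cred_99", "snp"),
    "FINEMAP": ("config", "region", "snp"),
}


def finemapping_files(endpoint):
    """Return the six fine-mapping file names expected for an endpoint."""
    return [
        f"{endpoint}.{method}.{kind}.bgz"
        for method, kinds in FINEMAPPING_KINDS.items()
        for kind in kinds
    ]


names = finemapping_files("I9_CHD")
```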
{endpoint}.SUSIE.cred.bgz contains credible set summaries from SuSiE fine-mapping for all genome-wide significant regions, using the default 95% credible sets. {endpoint}.SUSIE.cred_99.bgz contains the corresponding 99% credible set summaries. Both files have the following structure:
Column name | Description |
---|---|
trait | phenotype |
region | region for which the fine-mapping was run |
cs | running number for independent credible sets in a region |
cs_log10bf | log10 Bayes factor comparing the solution with cs independent credible sets to the solution with cs - 1 credible sets |
cs_avg_r2 | average r2 between variants in the credible set |
cs_min_r2 | minimum r2 between variants in the credible set |
low_purity | |
cs_size | number of SNPs in the credible set |
{endpoint}.SUSIE.snp.bgz contains variant summaries with credible set information and has the following structure:
Column name | Description |
---|---|
trait | endpoint name |
region | chr:start-end |
v | variant identifier |
rsid | rs variant identifier |
chromosome | chromosome on build GRCh38 ( |
position | position in base pairs on build GRCh38 |
allele1 | reference allele |
allele2 | alternative allele (effect allele) |
maf | minor allele frequency |
beta | effect size GWAS |
se | standard error GWAS |
p | p-value GWAS |
mean | posterior expectation of true effect size |
sd | posterior standard deviation of true effect size |
prob | posterior probability of association |
cs | identifier of 95% credible set (-1 = variant is not part of credible set) |
lead_r2 | r2 value to a lead variant (the one with maximum PIP) in a credible set |
alphax | posterior inclusion probability for the x-th single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10) |
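As a sketch of how the cs and prob columns fit together, the function below picks the lead variant (the one with the maximum posterior probability) for each credible set and skips variants with cs = -1. It assumes rows have already been parsed into dictionaries keyed by the column names in the table above, with values still stored as strings. The function name and the toy rows are hypothetical.

```python
def lead_variants(rows):
    """Map (region, cs) to the row with the highest prob in that credible set."""
    leads = {}
    for row in rows:
        cs = int(row["cs"])
        if cs == -1:  # variant is not part of any 95% credible set
            continue
        key = (row["region"], cs)
        best = leads.get(key)
        if best is None or float(row["prob"]) > float(best["prob"]):
            leads[key] = row
    return leads


# Toy rows with made-up identifiers and probabilities
toy_rows = [
    {"region": "chr1:100-200", "v": "variant_a", "prob": "0.70", "cs": "1"},
    {"region": "chr1:100-200", "v": "variant_b", "prob": "0.25", "cs": "1"},
    {"region": "chr1:100-200", "v": "variant_c", "prob": "0.01", "cs": "-1"},
    {"region": "chr1:100-200", "v": "variant_d", "prob": "0.90", "cs": "2"},
]
leads = lead_variants(toy_rows)
```

Keying on both region and cs matters because cs is a running number within a region, so the same value can recur across regions.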
{endpoint}.FINEMAP.config.bgz contains a summary of the causal variant configurations from FINEMAP and has the following structure:
Column name | Description |
---|---|
trait | phenotype |
region | region for which the fine-mapping was run |
rank | rank of this configuration within a region |
config | causal variants in this configuration |
prob | probability across all n independent signal configurations |
log10bf | log10 Bayes factor for this configuration |
odds | odds of this configuration |
k | number of independent signals in this configuration |
prob_norm_k | probability of this configuration within the solution with k independent signals |
h2 | SNP heritability of this solution |
h2_0.95CI | 95% confidence interval limits of the SNP heritability of this solution |
mean | marginalized shrinkage estimates of the posterior effect size mean |
sd | marginalized shrinkage estimates of the posterior effect standard deviation |
{endpoint}.FINEMAP.region.bgz contains summary statistics on the number of independent signals in each region and has the following structure:
Column name | Description |
---|---|
trait | phenotype |
region | region for which the fine-mapping was run |
h2g | heritability of this region |
h2g_sd | standard deviation of the SNP heritability of this region |
h2g_lower95 | lower limit of the 95% CI for SNP heritability |
h2g_upper95 | upper limit of the 95% CI for SNP heritability |
log10bf | log10 Bayes factor compared against the null (no signals in the region) |
prob_xSNP | columns for probabilities of different number of independent signals |
expectedvalue | expectation (average) of the number of signals |
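The expectation of the number of signals is the probability-weighted average over the prob_xSNP columns. The sketch below recomputes it from a parsed row, assuming the x in each column name is the integer number of signals (for example prob_1SNP, prob_2SNP). That naming detail and the function name are assumptions and should be checked against the header of a real file.

```python
import re

# Assumed column naming: prob_<number of signals>SNP
_PROB_COLUMN = re.compile(r"^prob_(\d+)SNP$")


def expected_signals(row):
    """Return sum over x of x * P(x signals) from the prob_xSNP columns."""
    total = 0.0
    for name, value in row.items():
        match = _PROB_COLUMN.match(name)
        if match:
            total += int(match.group(1)) * float(value)
    return total


# Toy row with made-up probabilities
toy_row = {
    "region": "chr1:100-200",
    "prob_1SNP": "0.25",
    "prob_2SNP": "0.75",
}
expected = expected_signals(toy_row)
```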
{endpoint}.FINEMAP.snp.bgz contains per-variant summary statistics and the credible set each variant may belong to. It has the following structure:
Column name | Description |
---|---|
trait | phenotype |
region | region for which the fine-mapping was run |
v | variant |
index | running index |
rsid | rs variant identifier |
chromosome | chromosome |
position | position |
allele1 | reference allele |
allele2 | alternative allele |
maf | alternative allele frequency |
beta | original marginal effect size |
se | original standard error |
z | original z-score |
prob | posterior inclusion probability |
log10bf | log10 bayes factor |
mean | marginalized shrinkage estimates of the posterior effect size mean |
sd | marginalized shrinkage estimates of the posterior effect standard deviation |
mean_incl | conditional estimates of the posterior effect size mean |
sd_incl | conditional estimates of the posterior effect size standard deviation |
p | original p-value |
csx | credible set index for given number of causal variants x |
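The beta, se, z and p columns are related in the usual way: a z-score is the effect size divided by its standard error. The sketch below computes that ratio and a two-sided p-value under a standard normal approximation. The normal approximation is an assumption made for illustration. The reported p-values come from the GWAS itself and need not match this calculation exactly.

```python
import math


def z_score(beta, se):
    """Return the z-score as effect size divided by its standard error."""
    return beta / se


def two_sided_p(z):
    """Two-sided p-value for z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Toy values, not taken from any real result
z = z_score(beta=0.2, se=0.1)
p = two_sided_p(z)
```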
Variant annotation
The variant annotation has measures (HWE, INFO, ...) listed per batch.
Gene-based burden test results of LoF variants
Loss-of-function (LoF) variants were identified from VCF files with VEP (https://github.com/Ensembl/ensembl-vep). A variant is treated as LoF if its consequence is in the list [frameshift_variant,splice_donor_variant,stop_gained,splice_acceptor_variant]. A maximum MAF filter (max_maf, 0.01) and a minimum INFO score filter (0.8) are also applied. The chromosome-level VCFs are then filtered and merged into a single BGEN file, so that the whole analysis can be run on one data set. This BGEN file is passed to step 2 of regenie in burden mode, which reuses the null models from the standard GWAS runs.
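The variant selection described above can be sketched as a simple predicate. This is an illustration only, not the pipeline's actual code: the function and constant names are hypothetical, and treating both thresholds as inclusive is an assumption, since the text does not say how boundary values are handled.

```python
# Consequences that define a LoF variant, as listed in the text
LOF_CONSEQUENCES = frozenset(
    {
        "frameshift_variant",
        "splice_donor_variant",
        "stop_gained",
        "splice_acceptor_variant",
    }
)
MAX_MAF = 0.01  # maximum minor allele frequency
MIN_INFO = 0.8  # minimum imputation INFO score


def passes_lof_filters(consequence, maf, info):
    """Return True if a variant has a LoF consequence and passes both filters.

    Assumption: both thresholds are inclusive.
    """
    return consequence in LOF_CONSEQUENCES and maf <= MAX_MAF and info >= MIN_INFO


# Toy examples with made-up values
kept = passes_lof_filters("stop_gained", maf=0.004, info=0.95)
dropped = passes_lof_filters("missense_variant", maf=0.004, info=0.95)
```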
## File structure
### Data
| File | Description |
|---|---|
|finngen_R8_lof_txt.gz | Merged results, sorted by mlogp. |
|finngen_R8_lof_variants.txt | A tsv file with variant/geno/lof data used in the run. |
|finngen_R8_lof_sig_hits.txt | A summary of the results including only hits with mlogp > 3, sorted by the difference between the hit's mlogp and the max(mlogp) of its variants.|
### Documentation
| File | Description |
|---|---|
|finngen_R8_lof.log| Merged logs of all runs.|