In addition to the biobanks mentioned in previous releases, the following biobanks and cohorts are part of the R6 release:
Please use the following description when referring to our project:
The FinnGen study is a large-scale genomics initiative that has analyzed over 500,000 Finnish biobank samples and correlated genetic variation with health data to understand disease mechanisms and predispositions. The project is a collaboration between research organisations and biobanks within Finland and international industry partners.
When using these results in publications, please remember to:
Acknowledge the FinnGen study. You can use the following text:
“We want to acknowledge the participants and investigators of the FinnGen study”
Cite our latest publication:
Kurki, M.I., Karjalainen, J., Palta, P. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613, 508–518 (2023). https://doi.org/10.1038/s41586-022-05473-8
Furthermore, if possible, include "FinnGen" as a keyword for your publication.
If you want to cite this website, use the following citation:
FinnGen individuals were genotyped with Illumina and Affymetrix chip arrays (Illumina Inc., San Diego, and Thermo Fisher Scientific, Santa Clara, CA, USA).
Chip genotype data were imputed using the population-specific SISu v3 imputation reference panel of 3,775 whole genomes.
Merged imputed genotype data is composed of 75 data sets that include samples from multiple cohorts.
Total number of individuals: 271,341
Total number of variants (merged set): 16,962,023
Reference assembly: GRCh38/hg38
FinnGen research project is a public-private partnership combining genotype data from Finnish biobanks and digital health record data from Finnish health registries. FinnGen provides a unique opportunity to study genetic variation in relation to disease trajectories in an isolated population.
FinnGen is a growing project, aiming at 500,000 individuals by the end of 2023.
FinnGen results are subject to a one-year embargo and, after that, are available to the larger scientific community via the PheWeb browser or through data download.
Timeline for releases:
| Release | Date release to partners | Date release to public | Total sample size [1] |
| --- | --- | --- | --- |
| R2 | Q4 2018 (Nov) | Q1 2020 | 96,499 |
| R3 | Q2 2019 (May) | Q2 2020 | 135,638 |
| R4 | Q4 2019 (Oct) | Q4 2020 | 176,899 |
| R5 | Q2 2020 (March) | Q2 2021 | 218,792 |
| R6 | Q3 2020 | Q1 2022 | 260,405 |
| R7 | Q1 2021 | ~Q2 2022 | ~321,000 |
| R8 | Q3 2021 | ~Q3 2022 | ~340,000 |
| R9 | Q1 2022 | ~Q1 2023 | ~375,000 |
| R10 | Q3 2022 | ~Q3 2023 | ~410,000 |
| R11 | Q1 2023 | ~Q1 2024 | ~445,000 |
| R12 | Q3 2023 | ~Q3 2024 | ~480,000 |
| R13 | Q1 2024 | ~Q1 2025 | ~500,000 |
[1] samples used for PheWAS.
To download FinnGen summary statistics, you will need to fill in the online form at this link. You will then receive an email containing detailed instructions for downloading the data.
Release 6 contains
LD estimation from SISu v3
Please remember to acknowledge the FinnGen study when using these results in publications.
You can use the following text:
We want to acknowledge the participants and investigators of FinnGen study.
The manifest file with the link to all the downloadable summary stats is available at:
Chip genotype data processing and QC
Samples were genotyped with Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix arrays (Thermo Fisher Scientific, Santa Clara, CA, USA).
Genotype calls were made with GenCall and zCall algorithms for Illumina and AxiomGT1 algorithm for Affymetrix data.
Chip genotyping data produced with previous chip platforms and reference genome builds were lifted over to build version 38 (GRCh38/hg38) following the protocol described here: dx.doi.org/10.17504/protocols.io.nqtddwn.
In sample-wise quality control, individuals with ambiguous gender, high genotype missingness (>5%), excess heterozygosity (±4 SD) and non-Finnish ancestry were excluded. In variant-wise quality control, variants with high missingness (>2%), low HWE p-value (<1e-6) and low minor allele count (MAC < 3) were excluded.
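The variant-wise thresholds above can be sketched as a simple filter. This is only an illustration of the three stated cut-offs; the `VariantStats` class and its field names are hypothetical and are not part of the FinnGen pipeline.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Per-variant QC metrics (hypothetical container for illustration)."""
    missingness: float  # fraction of samples with a missing genotype call
    hwe_p: float        # Hardy-Weinberg equilibrium test p-value
    mac: int            # minor allele count


def passes_variant_qc(v: VariantStats) -> bool:
    """Return True if the variant survives all three variant-wise filters."""
    return (
        v.missingness <= 0.02  # exclude missingness > 2%
        and v.hwe_p >= 1e-6    # exclude HWE p-value < 1e-6
        and v.mac >= 3         # exclude MAC < 3
    )
```

For example, `passes_variant_qc(VariantStats(missingness=0.01, hwe_p=0.5, mac=10))` keeps the variant, while a variant with `mac=2` is dropped.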
Prior to imputation, chip-genotyped samples were pre-phased with Eagle 2.3.5 with the default parameters, except that the number of conditioning haplotypes was set to 20,000.
Genotype imputation was done with the population-specific SISu v3 imputation reference panel.
Variant call set was produced with GATK HaplotypeCaller algorithm by following GATK best-practices for variant calling.
Genotype-, sample- and variant-wise QC was applied iteratively, and the resulting high-quality WGS data for 3,775 individuals were phased with Eagle 2.3.5 as described in the previous section.
Genotype imputation was carried out using the population-specific SISu v3 imputation reference panel with Beagle 4.1 (version 08Jun17.d8b) as described in the following protocol: .
Post-imputation quality-control involved checking expected conformity of the imputation INFO-value distribution, MAF differences between the target dataset and the imputation reference panel and checking chromosomal continuity of the imputed genotype calls.
Cromwell-42
Wdltool-0.14
Plink 1.9 and 2.0
BCFtools 1.7 and 1.9
Eagle 2.3.5
Beagle 4.1 (version 08Jun17.d8b)
R 3.4.1 (packages: data.table 1.10.4, sm 2.2-5.4)
File naming pattern and file structure
GWAS summary statistics (tab-delimited, bgzipped, genome build 38, tabix index files included) are named {endpoint}.gz. For example, endpoint I9_CHD has I9_CHD.gz and I9_CHD.gz.tbi.
To learn more about the methods used, see section GWAS.
The {endpoint}.gz files have the following structure:
| Column name | Description |
| --- | --- |
| #chrom | chromosome on build GRCh38 (1-23) |
| pos | position in base pairs on build GRCh38 |
| ref | reference allele |
| alt | alternative allele (effect allele) |
| rsids | variant identifier |
| nearest_genes | nearest gene name from variant |
| pval | p-value |
| mlogp | -log10(p-value) |
| beta | effect size for the alternative allele |
| sebeta | standard error of the effect size |
| af_alt | alternative (effect) allele frequency |
| af_alt_cases | alternative (effect) allele frequency among cases |
| af_alt_controls | alternative (effect) allele frequency among controls |
| n_hom_cases | number of homozygous cases* |
| n_hom_ref_cases | number of homozygous reference cases* |
| n_het_cases | number of heterozygous cases* |
| n_hom_controls | number of homozygous controls* |
| n_hom_ref_controls | number of homozygous reference controls* |
| n_het_controls | number of heterozygous controls* |
*) Note that these counts are based on imputed genotype dosages and were produced using SAIGE, which is why they are not presented as integers and may contain decimals.
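As a rough illustration of the file layout, the sketch below streams a summary statistics file and keeps rows below a p-value threshold. It assumes only the column names listed above; the function name and the default threshold of 5e-8 are choices made for this example. Python's gzip module can read bgzipped files because they are valid multi-member gzip streams.

```python
import csv
import gzip


def significant_variants(path, max_pval=5e-8):
    """Yield rows (as dicts keyed by column name) with pval below max_pval."""
    with gzip.open(path, "rt", newline="") as handle:
        # The header line starts with "#chrom", so that is the key for chromosome.
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            if float(row["pval"]) < max_pval:
                yield row
```

Usage could look like `for row in significant_variants("I9_CHD.gz"): print(row["#chrom"], row["pos"], row["beta"])`. For region queries, the accompanying tabix index is usually the faster route.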
Two fine-mapping methods were used: SuSiE and FINEMAP.
Fine-mapping results are tab-delimited and bgzipped.
SuSiE results have the following filename pattern:
{endpoint}.SUSIE.cred.bgz
{endpoint}.SUSIE.cred_99.bgz
{endpoint}.SUSIE.snp.bgz
FINEMAP results have the following filename pattern:
{endpoint}.FINEMAP.config.bgz
{endpoint}.FINEMAP.region.bgz
{endpoint}.FINEMAP.snp.bgz
To learn more about the methods used, see section Fine-mapping.
{endpoint}.SUSIE.cred.bgz files contain credible set summaries from SuSiE fine-mapping for all genome-wide significant regions. {endpoint}.SUSIE.cred_99.bgz files contain the 99% credible set summaries, while the default is 95%. They have the following structure:

| Column name | Description |
| --- | --- |
| trait | phenotype |
| region | region for which the fine-mapping was run |
| cs | running number for independent credible sets in a region |
| cs_log10bf | log10 Bayes factor comparing the solution of this model (cs independent credible sets) to cs - 1 credible sets |
| cs_avg_r2 | average correlation r2 between variants in the credible set |
| cs_min_r2 | minimum r2 between variants in the credible set |
| low_purity | |
| cs_size | number of SNPs in this credible set |
{endpoint}.SUSIE.snp.bgz files contain variant summaries with credible set information and have the following structure:

| Column name | Description |
| --- | --- |
| trait | endpoint name |
| region | chr:start-end |
| v | variant identifier |
| rsid | rs variant identifier |
| chromosome | chromosome on build GRCh38 (1-22, X) |
| position | position in base pairs on build GRCh38 |
| allele1 | reference allele |
| allele2 | alternative allele (effect allele) |
| maf | minor allele frequency |
| beta | effect size GWAS |
| se | standard error GWAS |
| p | p-value GWAS |
| mean | posterior expectation of true effect size |
| sd | posterior standard deviation of true effect size |
| prob | posterior probability of association |
| cs | identifier of 95% credible set (-1 = variant is not part of credible set) |
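To show how the `cs` and `prob` columns fit together, here is a minimal sketch that picks the highest-probability variant of each credible set. It assumes rows have already been parsed into dicts keyed by the column names above; the function name is made up for this example.

```python
def top_pip_per_credible_set(rows):
    """Map (region, cs) to the row with the highest posterior probability.

    Rows with cs == -1 are skipped because they belong to no credible set.
    """
    best = {}
    for row in rows:
        cs = int(row["cs"])
        if cs == -1:
            continue
        key = (row["region"], cs)
        if key not in best or float(row["prob"]) > float(best[key]["prob"]):
            best[key] = row
    return best
```

The key combines region and credible set number because `cs` is a running number within a region, not a globally unique identifier.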
{endpoint}.FINEMAP.config.bgz files contain summaries of the fine-mapping variant configurations from the FINEMAP method and have the following structure:

| Column name | Description |
| --- | --- |
| trait | phenotype |
| region | region for which the fine-mapping was run |
| rank | rank of this configuration within a region |
| config | causal variants in this configuration |
| prob | probability across all n independent signal configurations |
| log10bf | log10 Bayes factor for this configuration |
| odds | odds of this configuration |
| k | number of independent signals in this configuration |
| prob_norm_k | probability of this configuration within the k independent signals solution |
| h2 | SNP heritability of this solution |
| h2_0.95CI | 95% confidence interval limits of the SNP heritability of this solution |
| mean | marginalized shrinkage estimates of the posterior effect size mean |
| sd | marginalized shrinkage estimates of the posterior effect standard deviation |
{endpoint}.FINEMAP.region.bgz files contain summary statistics on the number of independent signals in each region and have the following structure:

| Column name | Description |
| --- | --- |
| trait | phenotype |
| region | region for which the fine-mapping was run |
| h2g | heritability of this region |
| h2g_sd | standard deviation of SNP heritability of this region |
| h2g_lower95 | lower limit of 95% CI for SNP heritability |
| h2g_upper95 | upper limit of 95% CI for SNP heritability |
| log10bf | log Bayes factor compared against null (no signals in the region) |
| prob_xSNP | columns for probabilities of different numbers of independent signals |
| expectedvalue | expectation (average) of the number of signals |
{endpoint}.FINEMAP.snp.bgz files contain summary statistics of variants and the credible set each variant may belong to. They have the following structure:

| Column name | Description |
| --- | --- |
| trait | phenotype |
| region | region for which the fine-mapping was run |
| v | variant |
| index | running index |
| rsid | rs variant identifier |
| chromosome | chromosome |
| position | position |
| allele1 | reference allele |
| allele2 | alternative allele |
| maf | alternative allele frequency |
| beta | original marginal effect size |
| se | original standard error |
| z | original z-score |
| prob | posterior inclusion probability |
| log10bf | log10 Bayes factor |
| mean | marginalized shrinkage estimates of the posterior effect size mean |
| sd | marginalized shrinkage estimates of the posterior effect standard deviation |
| mean_incl | conditional estimates of the posterior effect size mean |
| sd_incl | conditional estimates of the posterior effect size standard deviation |
| p | original p-value |
| csx | credible set index for given number of causal variants x |
Linkage disequilibrium (LD) was estimated from SISu v3 for each chromosome. Use LDstore (v1.1) to work with the bcor files.
ldstore --bcor FG_LD_chr1.bcor --incl-range 20000000-50000000 --table output_file_name.table
To learn more about the methods used, see section LD estimation.
The variant annotation has measures (HWE, INFO, ...) listed per batch.
We used the SAIGE software for running the R6 GWAS, as we did in previous releases. SAIGE is an R/C++ package for mixed-model logistic regression. We used version 0.39.1 of the code, with two modifications to the codebase (neither modification affects the method):
Null model .rda objects are trimmed to reduce RAM consumption
Ref hom, het, and alt hom counts in cases and controls are included in the output. The counts are obtained by summing the probabilities of each genotype over individuals, unlike the SAIGE 0.39.1 implementation, in which the counts are sums of the most probable genotypes over individuals
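The difference between the two counting schemes in the second modification can be illustrated with a toy example. This is a sketch, not SAIGE code: the function names are hypothetical, and each individual is represented as a tuple of genotype probabilities (ref hom, het, alt hom).

```python
def expected_counts(genotype_probs):
    """Sum the probability of each genotype over individuals."""
    n_hom_ref = sum(p[0] for p in genotype_probs)
    n_het = sum(p[1] for p in genotype_probs)
    n_hom_alt = sum(p[2] for p in genotype_probs)
    return n_hom_ref, n_het, n_hom_alt


def hard_call_counts(genotype_probs):
    """Count each individual once, under their most probable genotype."""
    counts = [0, 0, 0]
    for p in genotype_probs:
        most_probable = max(range(3), key=lambda g: p[g])
        counts[most_probable] += 1
    return tuple(counts)
```

With three individuals whose probabilities are (0.9, 0.1, 0.0), (0.2, 0.7, 0.1) and (0.0, 0.4, 0.6), the hard-call counts are one of each genotype, whereas the summed probabilities are non-integer, which is why the count columns in the summary statistics can contain decimals.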
We analyzed:
2,861 endpoints
260,405 samples
16,962,023 variants
We included the following covariates in the model: sex, age, 10 PCs, genotyping batch.
The disease endpoints were defined using nationwide registries:
We harmonized over the International Classification of Diseases (ICD) revisions 8, 9 and 10, cancer-specific ICD-O-3, NOMESCO procedure codes, Finnish-specific Social Insurance Institution (KELA) drug reimbursement codes and ATC codes.
These registries spanning decades were electronically linked to the cohort baseline data using the unique national personal identification numbers assigned to all Finnish citizens and residents.
A full list of FinnGen endpoints is available for release 6.
Endpoints with fewer than 80 cases and developmental “helper” endpoints were excluded from the final PheWAS (“OMIT” tag in the endpoint definition file).
Endpoints with fewer than 150 cases are not released by THL (Finnish Institute for Health and Welfare).
p-value from SAIGE
effect size estimated with SAIGE for the alternative allele
standard deviation of effect size estimated with SAIGE
Risteys (Finnish for “intersection”) allows browsing of the FinnGen data at the phenotype level, including endpoint definitions, statistics on the number of individuals, gender distribution, and longitudinal relationships.
The BCOR files were created using LDstore from the Finnish SISu v3 panel.
The panel has been divided per chromosome. For example, to use the LD information for chromosome 1, FG_LD_chr1.bcor would be the file to use.
number of samples: 3775
window size: 1500 kb
accuracy: low
number of threads: 96
LD threshold to include correlations: 0.05
LDstore v1.1 can be downloaded via:
An example that extracts the variant range 20 Mb - 50 Mb from chromosome 7 is as follows:
We do not recommend using these LD estimate files for fine-mapping, for example, since many fine-mapping methods (e.g. SuSiE) require in-sample LD information to give good results.
SISu v3 consists of 3,775 WGS of Finnish individuals from six research cohorts:
METSIM (PIs Markku Laakso and Mike Boehnke)
FINRISK (PI Pekka Jousilahti)
Health2000 (PI Seppo Koskinen)
Finnish Migraine Family Study (PI Aarno Palotie)
Merck/Tienari samples (PI Pentti Tienari)
MESTA samples (PI Jaana Suvisaari)
High-coverage (25-30x) WGS data used to develop the SISu v3 reference panel were generated at the Broad Institute of MIT and Harvard and at the McDonnell Genome Institute at Washington University; and jointly processed at the Broad Institute.
This is a description of the quality control procedures applied before running the GWAS.
The PCA for population structure has been run in the following way:
The imputation panel is pruned iteratively, until a target number of SNPs is reached:
8,580,565 starting variants: only variants with a minimum info score of 0.9 in all batches are kept.
The script starts with the plink parameters [500.0, 50.0, 0.9] (window, step, r2). It then iteratively decreases r2 by 0.05, pruning the imputation panel each time, until the SNP count falls below the threshold of 200,000. At that point, the pruning closest to the threshold is returned.
If the pruning at the higher r2 is closer, 200,000 SNPs are randomly selected from it; otherwise the last pruned set of SNPs is returned.
Plink flags used: --snps-only --chr 1-22 --max-alleles 2 --maf 0.01
For this run the final ld params are --indep-pairwise 500.0 50.0 0.2 and 200,000 snps are returned.
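The control flow of the iterative pruning described above can be sketched as follows. This is an interpretation of the text, not the actual script: `prune` is a hypothetical callback that returns the list of SNPs kept at a given r2 (for instance by wrapping a plink `--indep-pairwise` call), and the fixed random seed is an assumption made for reproducibility.

```python
import random


def prune_to_target(prune, target=200_000, r2=0.9, step=0.05, seed=0):
    """Lower r2 step by step until the pruned SNP count falls below target."""
    previous = prune(r2)
    while len(previous) >= target and r2 - step > 0:
        r2 = round(r2 - step, 2)
        current = prune(r2)
        if len(current) < target:
            # Return whichever pruning is closer to the target count.
            if len(previous) - target < target - len(current):
                # The higher-r2 set is closer: randomly subsample it to target.
                return random.Random(seed).sample(previous, target)
            return current
        previous = current
    return previous
```

The callback keeps the sketch independent of how pruning is actually executed, so the selection logic can be checked with a fake `prune` function.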
Then, FinnGen data was merged with the 1000 Genomes Project (1kgp) data, using the variants mentioned above. A round of PCA was performed and a Bayesian algorithm was used to detect outliers. This step removed 5,995 FinnGen samples. The figure below shows scatter plots for the first 3 PCs. Outliers, in green, are separated from the red FinnGen cluster.
While the method automatically flagged the 1kgp samples with non-European and southern European ancestries as outliers, it did not exclude some samples of Western European origin. The signal from these samples was too small for a second round to be run without detecting substructure within the Finnish population, so another approach was used. The FinnGen samples that survived the first round were used to compute another PCA, and the EUR and FIN 1kgp samples were projected onto the space spanned by its first 3 PCs. The centroid of each cluster was then calculated and used to compute the squared Mahalanobis distance from each FinnGen sample to each centroid. Because the Mahalanobis distance rescales each component to unit variance, the squared distance can be treated as a sum of 3 independent squared variables, which allows it to be mapped to a probability (chi-squared with 3 degrees of freedom). In this way, a probability of belonging to each cluster was computed for every sample. FinnGen samples whose relative probability of belonging to the Finnish cluster was below a threshold of 0.95 were excluded. This method produced another 290 outliers. The figure below shows the first three principal components.
FIN 1kgp samples are in purple, while EUR 1kgp samples are in blue. Samples in green are FinnGen samples who are flagged as being non Finnish, while red ones are considered Finnish.
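The distance-to-probability mapping described above can be sketched in plain Python. Several details are assumptions made for illustration: the cluster covariance is estimated from the cluster's own points, and the "relative chance" is taken to be the Finnish probability divided by the sum of both probabilities. None of the function names come from the FinnGen code.

```python
import math


def centroid_and_cov(cluster):
    """Return the mean vector and sample covariance matrix of a point cluster."""
    n, dim = len(cluster), len(cluster[0])
    mean = [sum(p[j] for p in cluster) / n for j in range(dim)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in cluster) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    return mean, cov


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def squared_mahalanobis(point, mean, cov):
    """Compute diff^T cov^-1 diff without forming the inverse explicitly."""
    diff = [p - m for p, m in zip(point, mean)]
    return sum(d * y for d, y in zip(diff, solve(cov, diff)))


def chi2_sf_3dof(x):
    """P(X > x) for a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def finnish_relative_probability(point, fin_cluster, eur_cluster):
    """Relative probability that a sample (first 3 PCs) is in the FIN cluster."""
    probs = []
    for cluster in (fin_cluster, eur_cluster):
        mean, cov = centroid_and_cov(cluster)
        probs.append(chi2_sf_3dof(squared_mahalanobis(point, mean, cov)))
    total = probs[0] + probs[1]
    return probs[0] / total if total > 0 else 0.0
```

Under these assumptions, a sample would be flagged as non-Finnish when `finnish_relative_probability(...)` falls below 0.95.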
Then all pairs of FinnGen samples up to second degree were returned. The figure below shows the distribution of kinship values.
Then, the previously defined “non Finnish” samples were excluded and 2 algorithms were used to return a unique subset of unrelated samples:
one, called greedy, repeatedly removes the highest-degree node from the network of relations until no links are left in the network.
one, called native, is based on a native implementation in Python's networkx package and is run on each subgraph of the network.
The larger independent set returned by the two algorithms was kept, and the remaining samples were flagged as “outliers” for the final PCA.
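A minimal sketch of the greedy variant is given below, assuming relatedness is supplied as a list of sample pairs. The function name and the tie-breaking rule (first sample in insertion order) are choices made for this example; the networkx-based variant is not shown.

```python
def greedy_unrelated(samples, related_pairs):
    """Repeatedly drop the sample with the most relatives until no links remain.

    Returns (kept, removed) as two sets of sample identifiers.
    """
    neighbours = {s: set() for s in samples}
    for a, b in related_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    removed = set()
    while neighbours:
        # Highest-degree node; ties go to the earliest remaining sample.
        worst = max(neighbours, key=lambda s: len(neighbours[s]))
        if not neighbours[worst]:
            break  # no links left in the network
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.add(worst)
    return set(neighbours), removed
```

For samples A to E with related pairs A-B, A-C, A-D and D-E, the sketch removes A and then D, keeping B, C and E. The greedy strategy is a heuristic and does not guarantee the largest possible unrelated set.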
Then, the subset of outliers who also belong to the set of duplicates/twins was identified.
For the final step, the FinnGen samples were separated into three groups:
182,616 inliers: unrelated samples with Finnish ancestry.
79,182 outliers: non-duplicate samples with Finnish ancestry that are related to the inliers.
9,543 rejected samples: samples that are either of non-Finnish ancestry or are twins/duplicates of other samples.
Finally, the PCA was calculated for the inliers, and the outliers were then projected onto the same PC space, which allowed covariates to be calculated for a total of 261,798 samples.
Of the 261,798 non-duplicate population inlier samples from PCA, we excluded 1,390 samples from analysis because of missing minimum phenotype data, and 3 samples because of a mismatch between imputed sex and sex in registry data. A total of 260,405 samples was used for core analysis. There are 147,061 females and 113,344 males among these samples.
Documentation from the original developers of the algorithm can be found here: http://www.well.ox.ac.uk/~spencer/Aberrant/aberrant-manu.
We used two state-of-the-art methods, FINEMAP and SuSiE, to fine-map genome-wide significant loci in FinnGen endpoints.
Briefly, there are three main steps:
1. For each genome-wide significant locus (default configuration: P < 5e-8), we define a fine-mapping region by taking a 3 Mb window around the lead variant (merging regions if they overlap). We preprocess the input GWAS summary statistics into separate files per region for the following steps.
2. We compute in-sample dosage LD for each fine-mapping region.
3. Using the summary statistics and in-sample LD from steps 1-2, we conduct fine-mapping with FINEMAP and SuSiE, with the maximum number of causal variants in a locus set to L = 10.
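The region definition in the first step can be sketched as an interval merge. The sketch assumes the 3 Mb window is centred on the lead variant (1.5 Mb on each side) and that lead positions come from a single chromosome; the function name is hypothetical.

```python
def finemapping_regions(lead_positions, window=3_000_000):
    """Build windows around lead variants and merge the ones that overlap."""
    half = window // 2
    regions = []
    for pos in sorted(lead_positions):
        start, end = max(0, pos - half), pos + half
        if regions and start <= regions[-1][1]:
            # Overlaps the previous region: extend it instead of adding a new one.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]
```

Lead variants at 5 Mb and 6 Mb would therefore share one region spanning 3.5 Mb to 7.5 Mb, while a lead variant at 20 Mb gets its own region.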
The "Credible Sets"-table on a phenotype page in the browser shows the SuSiE-fine-mapped credible sets of that phenotype. The variant shown per credible set is the maximum PIP (posterior inclusion probability) variant of that credible set. In addition to the causal variants, variants that were in sufficient LD (pearsonr^2 > 0.05), had a small enough p-value (pval < 0.01), and were close enough to the lead variant (distance to lead variant < 1.5 megabases) were clumped together with the credible set. Variants have been compared against GWAS Catalog and annotated. The LD grouping, annotation and GWAS Catalog comparison were done using the autoreporting pipeline.
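The three clumping criteria quoted above amount to a simple filter, sketched here. The dictionary keys (`r2`, `pval`, `pos`) and the function name are assumptions for illustration and do not reflect the autoreporting pipeline's actual data model.

```python
def ld_partners(lead_pos, variants, min_r2=0.05, max_pval=0.01, max_dist=1_500_000):
    """Select variants that would be grouped with a credible set's lead variant.

    Each variant is a dict with r2 (to the lead variant), pval and pos.
    """
    return [
        v for v in variants
        if v["r2"] > min_r2
        and v["pval"] < max_pval
        and abs(v["pos"] - lead_pos) < max_dist
    ]
```

All three conditions must hold at once, so a variant in strong LD but with a p-value of 0.5, for example, is not grouped with the credible set.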
The columns of the table are explained below:
We included 2,861 endpoints in the analysis. Endpoints with fewer than 80 cases among the 260,405 samples were excluded, as well as endpoints labeled with an OMIT tag in the endpoint definition file.
For null model computation for each endpoint, we used age, sex, 10 PCs and genotyping batch as covariates. Each genotyping batch was included as a covariate for an endpoint if there were at least 10 cases and 10 controls in that batch, to avoid convergence issues. One genotyping batch needs to be excluded from the covariates so that they are not saturated. We excluded Thermo Fisher batch 16, as it was not enriched for any particular endpoints.
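The batch inclusion rule can be sketched as follows. The input format (a dict from batch name to case and control counts for one endpoint) and the function name are hypothetical; only the thresholds and the idea of leaving out one reference batch come from the text.

```python
def batch_covariates(batch_counts, reference_batch, min_cases=10, min_controls=10):
    """Return the batches to include as covariates for one endpoint.

    batch_counts maps batch name -> (n_cases, n_controls).
    The reference batch is always left out so the covariates are not saturated.
    """
    return sorted(
        batch
        for batch, (cases, controls) in batch_counts.items()
        if batch != reference_batch
        and cases >= min_cases
        and controls >= min_controls
    )
```

Because the rule depends on per-endpoint case counts, the set of batch covariates can differ from one endpoint to the next.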
For calculating the genetic relationship matrix (GRM), only variants imputed with an INFO score > 0.95 in all batches were used. Variants with > 3% missing genotypes were excluded, as were variants with MAF < 1%. The remaining variants were LD pruned with a 1 Mb window and an r2 threshold of 0.1. This resulted in a set of 59,037 well-imputed, non-rare variants for GRM calculation.
SAIGE options for the null computation:
LOCO = false
numMarkers = 30
traceCVcutoff = 0.0025
ratioCVcutoff = 0.001
We ran association tests against each of the 2,861 endpoints with SAIGE for each variant with a minimum allele count of 5 from the imputation pipeline (SAIGE option minMAC = 5). We filtered the results to include variants with an imputation INFO > 0.6.
For matters related to this documentation, send us an email at finngen-info@helsinki.fi.
For the latest updates on the project, as well as additional background information, please consider visiting the study website or following FinnGen on Twitter.
If you want to host FinnGen summary statistics on your website, please get in contact with us at: humgen-servicedesk@helsinki.fi.
| Column name | Explanation |
| --- | --- |
| top PIP variant | Variant with the largest PIP in the credible set. Click the arrow to the left of the variant to show the credible set variants. |
| CS quality | Shows whether the credible set is well-formed. A 'true' value means that the credible set is likely trustworthy, and a 'false' value means that it is likely not trustworthy. |
| chromosome | The chromosome in which the credible set lies. |
| p-value | p-value of the top PIP variant. |
| effect size (beta) | Effect size of the top PIP variant. |
| Finnish Enrichment | Finnish enrichment of the top PIP variant. |
| Alternate allele frequency | Alternate allele frequency of the top PIP variant. |
| Lead Variant Gene | A probable gene of the top PIP variant. |
| # coding in cs | Number of coding variants in the credible set. Hover over the number to see the variant, the consequence, and the correlation (pearsonr squared) to the lead variant. |
| # credible variants | Number of variants in the credible set. |
| Credible set bayes factor (log10) | The Bayes factor related to the credible set. |
| CS matching Traits | Number of matches found in GWAS Catalog for the credible set variants. Hover over the number to see the trait, as well as the associated variant's LD (pearsonr squared) to the lead variant. |
| LD Partner Traits | Number of matches found in GWAS Catalog for the group of credible variants and variants in LD with the top PIP variant. Hover over the number to see the trait, as well as the associated variant's LD (pearsonr squared) to the lead variant. |
| UKBB | Matching Pan-UKBB trait association. |