R3

Introduction

FinnGen is a public-private partnership project combining genotype data from Finnish biobanks and digital health record data from Finnish health registries. FinnGen provides a unique opportunity to study genetic variation in relation to disease trajectories in an isolated population.

FinnGen is a growing project, aiming at 500,000 individuals in 2023.

FinnGen results are subject to a one-year embargo and, after that, are available to the larger scientific community via the or through .

Data download

To download FinnGen summary statistics, you will need to fill in the online form at this link. You will then receive an email containing detailed instructions for downloading the data.

Release 3 contains

Using FinnGen data for publications

Please remember to acknowledge the FinnGen study when using these results in publications.

You can use the following text:

We want to acknowledge the participants and investigators of the FinnGen study.

Manifest

The Manifest file with the link to all the downloadable data is available at:

Data description

File naming pattern and file structure

Summary association statistics

GWAS summary statistics (tab-delimited, bgzipped, genome build 38, tabix index files included) are named as {endpoint}.gz. For example, endpoint I9_CHD has I9_CHD.gz and I9_CHD.gz.tbi.
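Because bgzipped files are gzip-compatible, a summary statistics file can be inspected with the Python standard library alone. The following is a minimal sketch, not an official FinnGen tool: the function name `read_summary_rows` is hypothetical, and no column names are assumed because the header is read from the file itself.

```python
import csv
import gzip


def read_summary_rows(path, limit=5):
    """Return the header and the first `limit` data rows of a
    bgzipped, tab-delimited summary statistics file."""
    # bgzip output is a series of gzip members, which gzip.open reads transparently.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = [row for _, row in zip(range(limit), reader)]
    return header, rows


# Example call (assumes I9_CHD.gz has been downloaded to the working directory):
# header, rows = read_summary_rows("I9_CHD.gz")
```

For region queries, the accompanying .tbi index is meant to be used with tabix rather than a sequential read like this one.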

To learn more about the methods used, see section .

The {endpoint}.gz have the following structure:

Fine-mapping results

Two fine-mapping methods were used:

Fine-mapping results are tab-delimited and bgzipped.

SuSiE results have the following filename pattern:

{endpoint}.SUSIE.cred.bgz
{endpoint}.SUSIE.snp.bgz

FINEMAP results have the following filename pattern:

{endpoint}.FINEMAP.region.bgz
{endpoint}.FINEMAP.snp.bgz
{endpoint}.FINEMAP.cred.bgz
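As a small illustration of the naming patterns above, the expected fine-mapping file names for one endpoint can be generated as follows. This is only a sketch; the helper name `finemapping_filenames` is hypothetical and the patterns are copied from the lists above.

```python
def finemapping_filenames(endpoint):
    """Return the fine-mapping file names expected for one endpoint."""
    patterns = {
        "SUSIE": ["cred", "snp"],
        "FINEMAP": ["region", "snp", "cred"],
    }
    return {
        method: [f"{endpoint}.{method}.{kind}.bgz" for kind in kinds]
        for method, kinds in patterns.items()
    }


names = finemapping_filenames("I9_CHD")
print(names["SUSIE"])
```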

To learn more about the methods used, see section .

SuSiE output files {endpoint}.SUSIE.snp.bgz have the following structure:

LD estimation

Linkage disequilibrium (LD) was estimated from the SISu v3 panel for each chromosome. Use the LDstore tool to work with the bcor files:

ldstore --bcor FG_LD_chr1.bcor --incl-range 20000000-50000000 --table output_file_name.table

To learn more about the methods used, see section .

Data releases

Timeline for releases:

Release | Date release to partners | Date release to public | Total sample size [1]
 | Q4 2018 (Nov) | Q1 2020 | 96,499
 | Q2 2019 (May) | |

[1] samples used for PheWAS.

How to cite

Please use the following description when referring to our project:

The FinnGen study is a large-scale genomics initiative that has analyzed over 500,000 Finnish biobank samples and correlated genetic variation with health data to understand disease mechanisms and predispositions. The project is a collaboration between research organisations and biobanks within Finland and international industry partners.

When using these results in publications, please remember to:

Acknowledge the FinnGen study. You can use the following text:

“We want to acknowledge the participants and investigators of the FinnGen study”

Cite our latest publication:

Kurki, M.I., Karjalainen, J., Palta, P. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613, 508–518 (2023). https://doi.org/10.1038/s41586-022-05473-8

Furthermore, if possible, include "FinnGen" as a keyword for your publication.

If you want to cite this website, use the following citation:

Methods

Participating biobanks/cohorts

In addition to the biobanks mentioned in the previous releases, the following biobanks and cohorts are part of the R3 release:

Genotypes

FinnGen individuals were genotyped with Illumina and Affymetrix chip arrays (Illumina Inc., San Diego, CA, USA, and Thermo Fisher Scientific, Santa Clara, CA, USA).

Chip genotype data were imputed using the population-specific SISu v3 imputation reference panel of 3,775 whole genomes.

Merged imputed genotype data is composed of 42 data sets that include samples from multiple cohorts.

Total number of individuals: 146,630
Total number of variants (merged set): 16,962,023
Reference assembly: GRCh38/hg38

Genotype data

Chip genotype data processing and QC

Samples were genotyped with Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix arrays (Thermo Fisher Scientific, Santa Clara, CA, USA).

Genotype calls were made with GenCall and zCall algorithms for Illumina and AxiomGT1 algorithm for Affymetrix data.

Chip genotyping data produced with previous chip platforms and reference genome builds were lifted over to build version 38 (GRCh38/hg38) following the protocol described here: .

Quality control

Genotype imputation

Genotype imputation was done with the population-specific .

Variant call set was produced with GATK HaplotypeCaller algorithm by following GATK best-practices for variant calling.

Genotype-, sample- and variant-wise QC was applied in an iterative manner by using the and the resulting high-quality WGS data for 3,775 individuals were phased with Eagle 2.3.5 as described in the previous section.

Genotype imputation was carried out by using the population-specific SISu v3 imputation reference panel with (version 08Jun17.d8b) as described in the following protocol: .

Post-imputation quality control involved checking the expected conformity of the imputation INFO-value distribution, the MAF differences between the target dataset and the imputation reference panel, and the chromosomal continuity of the imputed genotype calls.

SISu reference panel

SISu v3 consists of 3,775 high coverage (30x) WGS Finnish individuals from six cohorts:

METSIM (PIs Markku Laakso and Mike Boehnke)
FINRISK (PI Pekka Jousilahti)
Health2000 (PI Seppo Koskinen)
Finnish Migraine Family Study (PI Aarno Palotie)
Merck/Tienari samples (PI Pentti Tienari)
MESTA samples (PI Jaana Suvisaari)

High-coverage (25-30x) WGS data used to develop the SISu v3 reference panel were generated at the Broad Institute of MIT and Harvard and at the McDonnell Genome Institute at Washington University; and jointly processed at the Broad Institute.

Software used

Cromwell-29 and 31
Wdltool-0.14
Plink 1.9 and 2.0

LD estimation

The BCOR files were created using LDstore from the Finnish SISu v3 panel.

The panel has been divided per chromosome. For example, to use the LD information in the first chromosome, FG_LD_chr1.bcor would be the file to use.

Settings used

number of samples: 3775
window size: 1500 kb
accuracy: low
number of threads: 96
LD threshold to include correlations: 0.05

Example usage

LDstore can be downloaded via:

And an example to extract variant range 20 Mb - 50 Mb from chromosome 7 is as follows:
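A minimal sketch of such a call, assembled in Python from the flags of the ldstore command shown under Data description (`--bcor`, `--incl-range`, `--table`). The helper name `ldstore_extract_command` is hypothetical, and the chromosome 7 file name is assumed to follow the FG_LD_chr1.bcor pattern.

```python
import shlex


def ldstore_extract_command(chromosome, start, end, output_table):
    """Build an ldstore command line that extracts a base-pair range
    from the per-chromosome BCOR file into a table."""
    args = [
        "ldstore",
        "--bcor", f"FG_LD_chr{chromosome}.bcor",  # assumed per-chromosome naming
        "--incl-range", f"{start}-{end}",
        "--table", output_table,
    ]
    return shlex.join(args)


# Range 20 Mb to 50 Mb on chromosome 7
print(ldstore_extract_command(7, 20_000_000, 50_000_000, "output_file_name.table"))
```

The printed string can be pasted into a shell, or the argument list can be passed to `subprocess.run` once LDstore is installed.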

Endpoints

Registries

The disease endpoints were defined using nationwide registries:

Drug purchase and Drug Reimbursement

We harmonized codes across the International Classification of Diseases (ICD) revisions 8, 9 and 10, cancer-specific ICD-O-3, Nordic Medico-Statistical Committee (NOMESCO) procedure codes, Finnish-specific Social Insurance Institution (KELA) drug reimbursement codes and ATC codes.

These registries spanning decades were electronically linked to the cohort baseline data using the unique national personal identification numbers assigned to all Finnish citizens and residents.

A full list of FinnGen endpoints is available for release 3.

Excluded endpoints

Endpoints with fewer than 100 cases, near-duplicate endpoints, and developmental “helper” endpoints were excluded from the final PheWAS (column “OMIT”).

Endpoints with N<150 are not released by THL (Finnish Institute for Health and Welfare).

Risteys

Risteys (Finnish for “intersection”) allows browsing of the FinnGen data at the phenotype level, including endpoint definitions, statistics about the number of individuals, gender distribution, and longitudinal relationships.

GWAS

We used the SAIGE (r3 release) software for running the R3 GWAS.

SAIGE is a mixed model logistic regression R/C++ package, able to account for related samples.

We analyzed:

1,801 endpoints
135,638 samples
16,962,023 variants

We included the following covariates in the model: sex, age, 10 PCs, genotyping batch.

Sample QC and PCA

This is a description of the quality control procedures applied before running the GWAS.

In summary, we removed 10,992 samples who were either of non-Finnish ancestry or twins/duplicates. Finnish ancestry was assessed with a combination of PCA and a Bayesian method for outlier detection.

PCA

The PCA for population structure has been run in the following way:

Variant filtering and LD pruning

The following filters were applied:

Exclusion of chromosome 23
Exclusion of variants with info score < 0.95
Exclusion of variants with missingness > 0.01 (based on the GP, see conversion)

This filtering step produced 42,805 variants, which were used for the rest of the analysis.

PCA outlier detection

Then, FinnGen data was merged with the 1000 Genomes Project (1kgp) data, using the variants mentioned above. A round of PCA was performed and a Bayesian algorithm was used to spot outliers. This process removed 4,208 outliers, of which 1,820 were FinnGen samples.

The figure below shows the scatter plots for the first 3 PCs. Outliers, in red, are separated from the FinnGen samples (blue cluster). While the method automatically flagged the 1kgp samples with non-European and southern European ancestries as outliers, it did not manage to exclude 12 samples with Western European origins.

Since the signal from these samples would have been too small to allow a second round to be performed without detecting substructures of the Finnish population, another approach was used. The FinnGen samples that survived the first round were used to compute another PCA. The EUR and FIN 1kgp samples were then projected onto the space generated by the first 3 PCs. Then, the centroid of each cluster was calculated and used to compute the squared Mahalanobis distance of each FinnGen sample to each of the centroids. Because the squared distance is a sum of squared variables (with unit variance, due to the Mahalanobis scaling), it can be treated as a sum of 3 independent squared variables. This allowed the squared distance to be mapped into a probability (chi-squared with 3 degrees of freedom). Therefore, for each cluster, a probability of being part of it was computed.

Next, a threshold of 0.95 was used to exclude FinnGen samples whose relative chance of being part of the Finnish cluster was below that level. This method produced another 359 outliers.
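The distance-to-probability step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: each cluster is summarised by a centroid and per-PC variances (a diagonal covariance, which simplifies the full Mahalanobis distance), the probability is taken as the chi-squared upper tail with 3 degrees of freedom, and the "relative chance" is assumed to be the Finnish probability divided by the sum of both cluster probabilities. All function names are hypothetical.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def squared_mahalanobis_diag(point, centroid, variances):
    """Squared Mahalanobis distance assuming a diagonal covariance matrix."""
    return sum((p - c) ** 2 / v for p, c, v in zip(point, centroid, variances))


def relative_finnish_probability(sample, fin_cluster, eur_cluster):
    """Relative chance of belonging to the FIN cluster rather than the EUR cluster.

    Each cluster is a (centroid, variances) pair over the first 3 PCs.
    """
    p_fin = chi2_sf_3df(squared_mahalanobis_diag(sample, *fin_cluster))
    p_eur = chi2_sf_3df(squared_mahalanobis_diag(sample, *eur_cluster))
    total = p_fin + p_eur
    return p_fin / total if total > 0 else 0.0


# Toy clusters: samples with a relative probability below 0.95 would be flagged.
fin = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
eur = ((10.0, 10.0, 10.0), (1.0, 1.0, 1.0))
print(relative_finnish_probability((0.5, -0.2, 0.1), fin, eur) >= 0.95)
```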

FIN 1kgp samples are in purple, while EUR 1kgp samples are in blue. Samples in green are FinnGen samples that are flagged as being non-Finnish, while red ones are.

Kinship

In the next step, all pairs of FinnGen samples related up to the second degree were returned. The figure shows the distribution of kinship values.

Then, the previously defined “non-Finnish” samples were excluded and 2 algorithms were used to return a unique subset of unrelated samples:

one called greedy, which repeatedly removes the highest-degree node from the network of relations until no links are left in the network.
one called native, based on a native implementation in Python’s networkx package, performed on each subgraph of the network.

The largest independent set returned by either algorithm was kept, while the other samples were flagged as “outliers” for the final PCA.

Then, the subset of outliers who also belong to the set of duplicates/twins was identified.
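The greedy strategy described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline's implementation: the function name `greedy_unrelated_subset` is hypothetical, related pairs are assumed to be given as tuples of sample IDs, and ties between equally connected samples are broken by sorted ID, which the source does not specify.

```python
from collections import defaultdict


def greedy_unrelated_subset(related_pairs):
    """Repeatedly drop the sample with the most remaining relatives
    until no related pair is left; return (kept, removed)."""
    neighbours = defaultdict(set)
    for a, b in related_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    removed = set()
    while True:
        # Highest-degree node; sorting makes tie-breaking deterministic.
        node = max(sorted(neighbours), key=lambda n: len(neighbours[n]), default=None)
        if node is None or not neighbours[node]:
            break  # no links left in the network
        for other in neighbours.pop(node):
            neighbours[other].discard(node)
        removed.add(node)

    return set(neighbours), removed


kept, removed = greedy_unrelated_subset([("A", "B"), ("A", "C"), ("A", "D"), ("E", "F")])
print(sorted(kept), sorted(removed))
```

The kept samples form an independent set of the relatedness network, though a greedy pass is not guaranteed to find the largest one, which is presumably why two algorithms were compared.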

Final PCA

To compute the final step, the FinnGen samples were ultimately separated into three groups:

109,184 inliers: unrelated samples with Finnish ancestry.
33,302 outliers: non-duplicate samples with Finnish ancestry that are related to the inliers.
4,144 rejected samples: samples that are either of non-Finnish ancestry or are twins/duplicates with relations to other samples.

Finally, the PCA for the inliers was calculated, and the outliers were then projected onto the same space, allowing covariates to be calculated for a total of 142,486 samples.

Sample filtering based on phenotype data

Of the 142,486 non-duplicate population inlier samples from PCA, 5,846 were excluded from analysis because of missing minimum phenotype data. Finally, 1,002 samples of age less than 18 were excluded. A total of 135,638 samples was used for core analysis.
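The sample counts quoted in this section are mutually consistent, which can be checked with a few lines of arithmetic. The variable names below are illustrative; every number is taken from the text above.

```python
total_genotyped = 146_630      # total number of individuals in the release
inliers = 109_184              # unrelated, Finnish ancestry
related_outliers = 33_302      # Finnish ancestry, related to inliers
rejected = 4_144               # non-Finnish ancestry or twins/duplicates

# The three PCA groups partition the genotyped samples.
assert inliers + related_outliers + rejected == total_genotyped

pca_samples = inliers + related_outliers
missing_phenotype = 5_846
under_18 = 1_002

analysed = pca_samples - missing_phenotype - under_18
print(pca_samples)                 # 142486
print(analysed)                    # 135638
print(total_genotyped - analysed)  # 10992
```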

Further info

Bayesian outlier detection

Documentation from the original developers of the algorithm can be found here: .

Association tests

Endpoint

We included 1,801 endpoints from the phenotype/registry teams’ pipeline in the analysis. Endpoints with OMIT in the endpoint definition file were excluded, as well as endpoints with fewer than 100 cases among the 135,638 samples. “Smoking: yes” and “Smoking: current or former” were created based on the respective smoking data in the phenotype data file.

Null models

For the null model calculation for each endpoint, we used age, sex, 10 PCs and genotyping batch as covariates.

For calculating the genetic relationship matrix, we used the genotype dataset where genotypes with GP < 0.95 have been set missing. Only variants imputed with an INFO score > 0.95 in all batches were used. Variants with > 3 % missing genotypes were excluded as well as variants with MAF < 5 %. The remaining variants were LD pruned with a 1Mb window and r2 threshold of 0.1. This resulted in a set of 35,557 common, well-imputed variants for GRM calculation.
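The per-variant thresholds above can be expressed as a simple predicate. This is a sketch only: the function name and argument layout are hypothetical, and the LD pruning step (1 Mb window, r2 threshold of 0.1) is not covered here.

```python
def passes_grm_filters(info_by_batch, missing_rate, maf):
    """Return True if a variant meets the GRM thresholds described above.

    info_by_batch: imputation INFO scores, one per genotyping batch
    missing_rate:  fraction of missing genotypes after setting GP < 0.95 to missing
    maf:           minor allele frequency
    """
    well_imputed = all(info > 0.95 for info in info_by_batch)
    low_missingness = missing_rate <= 0.03   # variants with > 3 % missing are excluded
    common = maf >= 0.05                     # variants with MAF < 5 % are excluded
    return well_imputed and low_missingness and common


print(passes_grm_filters([0.99, 0.97], missing_rate=0.01, maf=0.20))
```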

Options for the null computation:

LOCO = false
numMarkers = 30
traceCVcutoff = 0.0025

Association tests

We ran association tests against each of the 1,801 endpoints with SAIGE for each variant with a minimum allele count of 10 from the imputation pipeline (SAIGE option minMAC = 10). The alternative allele is always the effect allele.

Fine-mapping

To identify potential causal variants in GWAS signals, we fine-mapped each genome-wide significant (p < 5e-8) region from the 1,801 GWAS endpoints. Each region was fine-mapped with SuSiE and FINEMAP. We used in-sample LD for fine-mapping.

We used a 3-megabase window (±1.5 Mb) around each lead variant, merged overlapping regions into one, and used these merged regions for fine-mapping.
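The window construction and merging can be sketched as follows for the lead variants of a single chromosome. The function name `merge_finemapping_regions` is hypothetical, positions are assumed to be 1-based base-pair coordinates, and windows that merely touch are merged as well, a detail the source does not specify.

```python
def merge_finemapping_regions(lead_positions, half_window=1_500_000):
    """Build a +/- half_window region around each lead variant on one
    chromosome and merge overlapping regions into one."""
    regions = []
    for pos in sorted(lead_positions):
        start, end = max(1, pos - half_window), pos + half_window
        if regions and start <= regions[-1][1]:
            # Overlaps the previous region: extend it instead of adding a new one.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


# Two nearby lead variants collapse into one region; the third stays separate.
print(merge_finemapping_regions([10_000_000, 11_000_000, 20_000_000]))
```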

Loss of function burden

We estimated the loss of function (LoF) burden of each gene on every endpoint.

First, we calculated per individual and gene whether any loss of function variant(s) was present, yielding a $n \times p$ matrix with 0 and 1 values ( $n$ being the number of individuals and $p$ the number of genes).

Then we used the new summarised variables as input in the SAIGE GWAS, replacing the genotype matrix that was used in the regular GWAS.
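A minimal sketch of how such an indicator matrix could be built, with rows for individuals and columns for genes. All names and input structures here are hypothetical: `lof_variant_gene` maps each loss of function variant to its gene, and `carried_variants` maps each individual to the variants they carry.

```python
def lof_burden_matrix(individuals, genes, lof_variant_gene, carried_variants):
    """Return an n x p list-of-lists with 1 where an individual carries
    at least one loss of function variant in a gene, else 0."""
    gene_index = {gene: j for j, gene in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in individuals]
    for i, individual in enumerate(individuals):
        for variant in carried_variants.get(individual, ()):
            gene = lof_variant_gene.get(variant)
            if gene in gene_index:
                matrix[i][gene_index[gene]] = 1
    return matrix


matrix = lof_burden_matrix(
    individuals=["ind1", "ind2"],
    genes=["GENE_A", "GENE_B"],
    lof_variant_gene={"var1": "GENE_A", "var2": "GENE_A", "var3": "GENE_B"},
    carried_variants={"ind1": {"var1", "var2"}, "ind2": set()},
)
print(matrix)  # [[1, 0], [0, 0]]
```

Carrying several loss of function variants in the same gene still yields a single 1, matching the "any variant present" definition above.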

Contact

For matters related to this documentation, click Edit on GitHub or send an email to finngen-info@helsinki.fi.

For the latest updates on the project as well as additional background information, please consider visiting the study website https://www.finngen.fi/en or following FinnGen on Twitter @FinnGen_FI.

If you want to host FinnGen summary statistics on your website, please get in contact with us at: humgen-servicedesk@helsinki.fi.
