R10

Introduction

The FinnGen research project is a public-private partnership combining genotype data from Finnish biobanks and digital health record data from Finnish health registries. FinnGen provides a unique opportunity to study genetic variation in relation to disease trajectories in an isolated population.

FinnGen is a growing project, aiming to reach 500,000 individuals by the end of 2023.

FinnGen results are subject to a one-year embargo, after which they are available to the wider scientific community via the PheWeb browser or through data download.

Data download

To download FinnGen summary statistics, you will need to fill in the online form at this link. You will then receive an email containing detailed instructions for downloading the data.

Release 10 contains

GWAS summary association statistics
Fine-mapping results

Using FinnGen data for publications

When using these results in publications, please remember to:

1) Acknowledge the FinnGen study. You can use the following text:

“We want to acknowledge the participants and investigators of the FinnGen study”

2) Cite our latest publication:

Kurki M.I., et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 2023 Jan;613(7944):508-518. doi: 10.1038/s41586-022-05473-8. Epub 2023 Jan 18.

Furthermore, if possible, include "FinnGen" as a keyword for your publication.

If you want to cite this website, use the following citation:

Manifest

The manifest file with the link to all the downloadable summary stats is available at:

Data description

File naming pattern and file structure

Summary association statistics

GWAS summary statistics (tab-delimited, bgzipped, genome build 38, tabix index files included) are named as {endpoint}.gz. For example, endpoint I9_CHD has I9_CHD.gz and I9_CHD.gz.tbi.
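As a minimal sketch of how such a file could be inspected, the Python snippet below reads the header and the first few rows of a tab-delimited, bgzipped file using only the standard library. It relies on bgzip output being readable as ordinary gzip; the function name read_sumstats_head is hypothetical, and no column names are assumed because the column table is not reproduced here.

```python
import csv
import gzip
from itertools import islice


def read_sumstats_head(path, n_rows=5):
    """Return (header, first n_rows data rows) of a tab-delimited, bgzipped file.

    Hypothetical helper: bgzip writes a series of gzip members, which the
    gzip module reads transparently, so no tabix library is needed here.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)                 # first line holds the column names
        rows = list(islice(reader, n_rows))   # only read as many rows as requested
    return header, rows
```

For example, read_sumstats_head("I9_CHD.gz") would return the column names and five rows of the I9_CHD file once it has been downloaded. Region queries through the .tbi index need a tabix-aware tool and are not covered by this sketch.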

To learn more about the methods used, see section .

The {endpoint}.gz have the following structure:

Fine-mapping results

Two fine-mapping methods were used: SuSiE and FINEMAP.

Fine-mapping results are tab-delimited and bgzipped.

SuSiE results have the following filename pattern:

{endpoint}.SUSIE.cred.bgz
{endpoint}.SUSIE.cred_99.bgz
{endpoint}.SUSIE.snp.bgz

FINEMAP results have the following filename pattern:

{endpoint}.FINEMAP.config.bgz
{endpoint}.FINEMAP.region.bgz
{endpoint}.FINEMAP.snp.bgz
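The naming patterns above can be expanded programmatically. The following is a small sketch that lists the six fine-mapping file names expected for one endpoint; the function and constant names are hypothetical, and only the patterns listed above are used.

```python
# Suffixes taken from the filename patterns listed above.
SUSIE_SUFFIXES = ("cred", "cred_99", "snp")
FINEMAP_SUFFIXES = ("config", "region", "snp")


def finemapping_filenames(endpoint):
    """Return the SuSiE and FINEMAP file names for one endpoint (hypothetical helper)."""
    names = [f"{endpoint}.SUSIE.{suffix}.bgz" for suffix in SUSIE_SUFFIXES]
    names += [f"{endpoint}.FINEMAP.{suffix}.bgz" for suffix in FINEMAP_SUFFIXES]
    return names
```

Calling finemapping_filenames("I9_CHD") gives names such as I9_CHD.SUSIE.cred.bgz and I9_CHD.FINEMAP.snp.bgz. Whether every endpoint has all six files is not stated here, so a downloader should tolerate missing files.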

To learn more about the methods used, see section .

{endpoint}.SUSIE.cred.bgz contains credible set summaries from SuSiE fine-mapping for all genome-wide significant regions. {endpoint}.SUSIE.cred_99.bgz contains the 99% credible set summaries, whereas the default is 95%. Both files have the following structure:

Column name

Description

{endpoint}.SUSIE.snp.bgz contains variant summaries with credible set information and has the following structure:

{endpoint}.FINEMAP.config.bgz contains a summary of the fine-mapping variant configurations from the FINEMAP method and has the following structure:

Column name

Description

{endpoint}.FINEMAP.region.bgz contains summary statistics on the number of independent signals in each region and has the following structure:

Column name

Description

{endpoint}.FINEMAP.snp.bgz contains variant summary statistics and the credible set each variant may belong to. It has the following structure:

Column name

Description

pQTL summary statistics

pQTL summary statistics (tab-delimited, bgzipped, genome build 38, index files included) are named as {probeName}.gz. For example, probe seq.9928.125 has seq.9928.125.gz and seq.9928.125.gz.tbi.

To learn more about the methods used, see section

The {probeName}.gz have the following structure:

Field

Description

LD estimation

Linkage disequilibrium (LD) was estimated from the SISu v4.2 panel for each chromosome. Use LDstore to work with the bcor files.

ldstore --bcor FG_LD_chr1.bcor --incl-range 20000000-50000000 --table output_file_name.table

To learn more about the methods used, see section .

Variant annotation

The variant annotation has measures (HWE, INFO, ...) listed per batch.

Data releases

Timeline for releases:

Release

Date release to partners

Date release to public

Total sample size [1]

Q4 2018 (Nov)

Q1 2020

96,499

Q2 2019 (May)

[1] samples used for PheWAS.

How to cite

Please use the following description when referring to our project:

The FinnGen study is a large-scale genomics initiative that has analyzed over 500,000 Finnish biobank samples and correlated genetic variation with health data to understand disease mechanisms and predispositions. The project is a collaboration between research organisations and biobanks within Finland and international industry partners.

When using these results in publications, please remember to:

Acknowledge the FinnGen study. You can use the following text:

“We want to acknowledge the participants and investigators of the FinnGen study”

Cite our latest publication:

Kurki, M.I., Karjalainen, J., Palta, P. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613, 508–518 (2023). https://doi.org/10.1038/s41586-022-05473-8

Furthermore, if possible, include "FinnGen" as a keyword for your publication.

If you want to cite this website, use the following citation:

Methods

Participating biobanks/cohorts

Genotypes

FinnGen individuals were genotyped with Illumina and Affymetrix chip arrays (Illumina Inc., San Diego, and Thermo Fisher Scientific, Santa Clara, CA, USA).

Chip genotype data were imputed using the population-specific SISu v4.2 imputation reference panel of 8,554 whole genomes.

Merged imputed genotype data is composed of 116 data sets that include samples from multiple cohorts.

Total number of individuals: 430,897
Total number of variants (merged set): 21,311,942
Reference assembly: GRCh38/hg38

Genotype data

Chip genotype data processing and QC

Samples were genotyped with Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix arrays (Thermo Fisher Scientific, Santa Clara, CA, USA).

Genotype calls were made with GenCall and zCall algorithms for Illumina and AxiomGT1 algorithm for Affymetrix data.

Chip genotyping data produced with previous chip platforms and reference genome builds were lifted over to build version 38 (GRCh38/hg38) following the protocol described here:

Quality control

Genotype imputation

Genotype imputation was done with the population-specific SISu v4.2 imputation reference panel.

The reference panel variant call set was produced with the GATK HaplotypeCaller algorithm by following GATK best practices for variant calling.

Genotype-, sample- and variant-wise QC was carried out iteratively by using the and the resulting high-quality WGS data for 8,554 individuals were phased with as described in the previous section.

Genotype imputation was carried out by using the population-specific SISu v4.2 imputation reference panel with (version 27Jan18.7e1) as described in the following protocol: .

Post-imputation quality control involved checking the expected conformity of the imputation INFO-value distribution, MAF differences between the target dataset and the imputation reference panel and checking chromosomal continuity of the imputed genotype calls.

SISu reference panel

SISu v4.2 consists of whole-genome sequences (WGS) of 8,554 Finnish individuals from 5 research cohorts:

METSIM (PIs Markku Laakso and Mike Boehnke)
FINRISK (PI Pekka Jousilahti)
Corogene (PI Juha Sinisalo)
Biobank of Eastern Finland (PI Arto Mannermaa)
Finnish EUFAM Dyslipidemia Study (PIs Marja-Riitta Taskinen and Samuli Ripatti)

High-coverage (25x) WGS data used to develop the SISu v4.2 reference panel were generated at the McDonnell Genome Institute at Washington University (PIs Ira Hall and Nathan Stitziel).

Software used

Hail v0.2
Cromwell-42
Wdltool-0.14

LD estimation

The BCOR files were created using LDstore from the Finnish SISu panel v4.2.

The panel has been divided per chromosome. For example, to use the LD information in the first chromosome, FG_LD_chr1.bcor would be the file to use.

Settings used

number of samples: 3775
window size: 1500 kb
accuracy: low
number of threads: 96
LD threshold to include correlations: 0.05

Example usage

LDstore can be downloaded via:

And an example to extract variant range 20 Mb - 50 Mb from chromosome 7 is as follows:
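A small sketch of how such a command could be assembled for any chromosome and range is given below. It reuses only the flags that appear in the LDstore command shown under Data description (--bcor, --incl-range, --table) and the FG_LD_chr{N}.bcor naming described above; the helper name ldstore_table_command is hypothetical, and the snippet builds the argument list without running LDstore.

```python
def ldstore_table_command(chromosome, start_bp, end_bp, output_table):
    """Build the LDstore argument list for extracting a variant range.

    Hypothetical helper: flags are copied from the command shown in the
    Data description; positions are base pairs.
    """
    return [
        "ldstore",
        "--bcor", f"FG_LD_chr{chromosome}.bcor",    # per-chromosome BCOR file
        "--incl-range", f"{start_bp}-{end_bp}",     # range to extract
        "--table", output_table,                    # output table file
    ]


# 20 Mb to 50 Mb on chromosome 7
command = ldstore_table_command(7, 20_000_000, 50_000_000, "output_file_name.table")
print(" ".join(command))
```

The list form can be passed directly to subprocess.run once LDstore is installed and the BCOR file is present in the working directory.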

Note

We do not recommend using these LD estimate files for fine-mapping, because many fine-mapping methods (e.g. SuSiE) require in-sample LD information for good results.

Endpoints

Registries

The disease endpoints were defined using nationwide registries:

Sample QC and PCA

This is a description of the quality control procedures applied before running the GWAS.

PCA

The PCA for population structure has been run in the following way:

Variant filtering and LD pruning

The SISu v4.2 imputation panel is pruned iteratively until a target number of SNPs is reached:

9,641,808 starting variants: only variants with a minimum INFO score of 0.9 in all batches are kept.

The script starts with the plink parameters [500.0, 50.0, 0.9] (window, step, r2). It then lowers r2 by 0.05 at each iteration, pruning the imputation panel until the SNP count falls below the 200,000 threshold. At that point, whichever of the last two prunings is closest to the target is returned.

If the pruning at the higher r2 is closer, 200,000 SNPs are randomly selected from it; otherwise the last pruned SNPs are returned.

Plink flags used: --snps-only --chr 1-22 --max-alleles 2 --maf 0.01.

For this run 180,037 SNPs are returned.
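The selection logic described above can be sketched as follows. This is an illustration under stated assumptions, not the project's script: the prune callback stands in for a plink pruning run at a given r2 and is hypothetical, as are the function name, the tie handling and the fixed random seed.

```python
import random


def choose_pruned_set(prune, target=200_000, r2_start=0.9, r2_step=0.05, seed=0):
    """Lower r2 step by step until the pruned set drops below the target size.

    prune(r2) is a hypothetical callback returning the list of variant IDs
    kept at that r2 threshold (for example by wrapping a plink call).
    """
    previous = None
    k = 0
    while True:
        r2 = round(r2_start - k * r2_step, 2)
        if r2 <= 0:
            # Never dropped below the target: return the last set (assumption).
            return previous
        current = prune(r2)
        if len(current) < target:
            # Compare how far each of the last two prunings is from the target.
            if previous is not None and len(previous) - target < target - len(current):
                # The higher-r2 set is closer: draw exactly `target` variants from it.
                return random.Random(seed).sample(previous, target)
            return current
        previous = current
        k += 1
```

With the numbers reported for this run, the last pruning (180,037 SNPs) was the one returned, which corresponds to the final return current branch.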

PCA outlier detection

FinnGen data were then merged with the 1000 Genomes Project (1kgp) data, using the variants mentioned above. A round of PCA was performed and a Bayesian algorithm was used to detect outliers. This process removed 14,547 FinnGen samples. The figure below shows the scatter plots for the first 3 PCs. Outliers, in green, are separated from the red FinnGen cluster.

The method automatically flagged the 1kgp samples with non-European and southern European ancestries as outliers, but it did not exclude some samples with Western European origins. Because the signal from these samples was too small for a second round to work without picking up substructure within the Finnish population, another approach was used. The FinnGen samples that survived the first round were used to compute another PCA, and the EUR and FIN 1kgp samples were projected onto the space spanned by its first 3 PCs. The centroid of each cluster was then calculated and used to compute the squared Mahalanobis distance of each FinnGen sample to each centroid. Because the squared distance is a sum of squared variables (with unit variance, due to the Mahalanobis scaling), it can be treated as a sum of 3 independent squared variables. This allowed us to map the squared distance to a probability (chi-squared with 3 degrees of freedom). For each cluster, a probability of belonging to it was therefore computed. A threshold of 0.95 was then used to exclude FinnGen samples whose relative chance of belonging to the Finnish cluster was below that level. This method produced another 43 outliers. The figure below shows the first three principal components.

FIN 1kgp samples are in purple, while EUR 1kgp samples are in blue. Samples in green are FinnGen samples flagged as non-Finnish, while red ones are considered Finnish.
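The mapping from squared Mahalanobis distance to a cluster probability can be illustrated with the standard library alone. The sketch below makes several assumptions that are not confirmed by the text above: the probability is taken as the upper tail of a chi-squared distribution with 3 degrees of freedom, the "relative chance" is the Finnish probability divided by the sum of the Finnish and EUR probabilities, and the distance helper uses per-axis variances (a diagonal covariance) for brevity. All function names are hypothetical.

```python
import math


def chi2_sf_3dof(d2):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    # Closed form for 3 dof: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    return math.erfc(math.sqrt(d2 / 2)) + math.sqrt(2 * d2 / math.pi) * math.exp(-d2 / 2)


def squared_mahalanobis_diag(point, centroid, variances):
    """Squared Mahalanobis distance assuming uncorrelated PCs (simplification)."""
    return sum((p - c) ** 2 / v for p, c, v in zip(point, centroid, variances))


def relative_fin_probability(d2_fin, d2_eur):
    """Relative chance of belonging to the Finnish cluster (assumed definition)."""
    p_fin = chi2_sf_3dof(d2_fin)
    p_eur = chi2_sf_3dof(d2_eur)
    total = p_fin + p_eur
    # Both tails can underflow to zero for samples far from both centroids.
    return p_fin / total if total > 0 else 0.0


def is_finnish(d2_fin, d2_eur, threshold=0.95):
    """Keep a sample only if its relative Finnish probability reaches the threshold."""
    return relative_fin_probability(d2_fin, d2_eur) >= threshold
```

A sample that sits close to the Finnish centroid and far from the EUR centroid gets a relative probability near 1, while a sample equidistant from both gets 0.5 and is excluded at the 0.95 threshold.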

Kinship

All pairs of FinnGen samples related up to the second degree were then identified. The figure below shows the distribution of kinship values.

The previously defined “non Finnish” samples were then excluded, and 2 algorithms were used to return a unique subset of unrelated samples:

greedy: repeatedly removes the highest-degree node from the network of relations until no links are left in the network.
native: based on a native implementation in Python’s networkx package, run on each subgraph of the network.

The largest independent set found by either algorithm was used to select the samples to keep, while the remaining samples were flagged as “outliers” for the final PCA.

Then, the subset of outliers who also belong to the set of duplicates/twins was identified.
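The greedy strategy described above can be sketched in plain Python. This is an illustration only: the function name and the tie-breaking rule (first node encountered) are assumptions, and the networkx-based "native" variant is not shown.

```python
def greedy_unrelated(pairs):
    """Greedy selection of unrelated samples from a list of related pairs.

    pairs: iterable of (sample_a, sample_b) tuples, one per kinship link.
    Returns (kept, removed) as sets of sample IDs. Samples without any
    relatives never appear in `pairs` and are unrelated by construction.
    """
    neighbours = {}
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    removed = set()
    while neighbours:
        # Pick the sample with the most remaining relatives.
        node = max(neighbours, key=lambda n: len(neighbours[n]))
        if not neighbours[node]:
            break  # no links left in the network
        for other in neighbours.pop(node):
            neighbours[other].discard(node)
        removed.add(node)
    return set(neighbours), removed
```

For a sample related to three others that are unrelated to each other, the sketch removes that one sample and keeps the other three, which is the behaviour that makes removing the highest-degree node first attractive.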

Final PCA

For the final step, the FinnGen samples were separated into three groups:

259,801 inliers: unrelated samples with Finnish ancestry.
153,927 outliers: non-duplicate samples with Finnish ancestry that are related to the inliers.
17,169 rejected samples: samples that are either of non-Finnish ancestry or twins/duplicates with relations to other samples.

Finally, the PCA was calculated for the inliers, and the outliers were then projected onto the same PC space, which made it possible to calculate covariates for a total of 413,728 samples.

Sample filtering based on phenotype data

Of the 413,728 non-duplicate population inlier samples from PCA, we excluded 1,543 samples from analysis because of missing minimum phenotype data, and 5 samples because they failed the sex check with F thresholds of 0.4 and 0.7. A total of 412,181 samples were used for the core analysis, of which 230,310 are female and 181,871 are male.

Further info

Bayesian outlier detection

Documentation from the original developers of the algorithm can be found here: .

Association tests

Endpoint

We included 2,408 endpoints in the analysis, consisting of 2,405 binary endpoints and 3 quantitative endpoints (HEIGHT_IRN, WEIGHT_IRN, BMI_IRN). Endpoints with fewer than 50 cases among the 412,181 samples were excluded, as were endpoints labeled with an OMIT tag in the endpoint definition file.
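The exclusion rule for binary endpoints can be expressed as a simple filter. In the sketch below, the record keys name, n_cases and omit are hypothetical stand-ins for whatever the endpoint definition file and case counts actually provide.

```python
def select_binary_endpoints(endpoints, min_cases=50):
    """Keep binary endpoints with at least `min_cases` cases and no OMIT tag.

    endpoints: iterable of dicts with hypothetical keys
    'name' (str), 'n_cases' (int) and 'omit' (bool).
    """
    return [
        e["name"]
        for e in endpoints
        if e["n_cases"] >= min_cases and not e["omit"]
    ]
```

An endpoint with exactly 50 cases is kept, because only endpoints with fewer than 50 cases are excluded.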

The quantitative endpoints HEIGHT and WEIGHT were acquired from minimum phenotype data. BMI was then derived from them, and all three were inverse normal transformed.
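As an illustration of these two steps, the sketch below derives BMI and applies a rank-based inverse normal transformation. Several details are assumptions rather than facts from this page: weight in kilograms and height in centimetres, average ranks for ties, and the rank offset of 0.5 (other offsets, such as 3/8, are also in common use).

```python
from statistics import NormalDist


def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2 (units are an assumption)."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def inverse_normal_transform(values, offset=0.5):
    """Rank-based inverse normal transformation with average ranks for ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    normal = NormalDist()
    # Map each rank to a quantile strictly inside (0, 1), then to a z-score.
    return [normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]
```

The transformed values keep the ordering of the input, and their distribution is approximately standard normal regardless of the shape of the original measurements.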

Colocalization

Colocalizations in FinnGen

Our colocalization approach uses the probabilistic model for integrating GWAS and eQTL data presented in eCAVIAR (Hormozdiari et al. 2016). Unlike eCAVIAR, we use SuSiE (Wang et al. 2019) to fine-map our inputs, and we provide an additional colocalization metric (CLPA).

Our goal is to extract a list of genomic regions that show colocalization between two phenotypes p1 and p2. Further, we assume that the summary statistics of p1 and p2 have been fine-mapped. The fine-mapping output for each phenotype contains three columns: the variant identifier (VAR), posterior inclusion probability (PIP), and the credible set (CS) identifier.

CLPP

The Causal Posterior Probability (CLPP) is computed between two credible sets cs1 and cs2, with cs1 coming from a given phenotype p1 and cs2 coming from phenotype p2. CLPP is defined as follows: For vectors x and y, containing the PIP for variants in cs1 and cs2, respectively, CLPP is calculated by

This CLPP calculation is similar to equation 8 in Hormozdiari et al. 2016.

CLPP is dependent on the credible set size. By definition, any credible set size > 1 will yield a CLPP < 1.

CLPA

We derived another colocalization metric called causal posterior agreement (CLPA) that is independent of credible set size.

The picture below shows how colocalizations are defined.

Example Comparison

This rough example shows why we mostly use CLPA: it is independent of credible set size.
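The formulas for the two metrics are not reproduced in this text, so the following sketch rests on explicit assumptions: CLPP is taken as the sum over shared variants of the product of the two PIPs (in the spirit of equation 8 in Hormozdiari et al. 2016), and CLPA as the sum over shared variants of the smaller of the two PIPs. The variant identifiers and PIP values are made up for illustration.

```python
def clpp(pip1, pip2):
    """Assumed CLPP: sum of PIP products over variants present in both credible sets."""
    shared = pip1.keys() & pip2.keys()
    return sum(pip1[v] * pip2[v] for v in shared)


def clpa(pip1, pip2):
    """Assumed CLPA: sum of the smaller PIP over variants present in both credible sets."""
    shared = pip1.keys() & pip2.keys()
    return sum(min(pip1[v], pip2[v]) for v in shared)


# Two phenotypes with identical credible sets (made-up variants and PIPs).
single_variant = {"varA": 1.0}
two_variants = {"varA": 0.5, "varB": 0.5}

print(clpp(single_variant, single_variant), clpa(single_variant, single_variant))
print(clpp(two_variants, two_variants), clpa(two_variants, two_variants))
```

Under these assumed definitions, perfectly matching credible sets of size 1 and size 2 both reach a CLPA of 1, whereas CLPP drops from 1 to 0.5 simply because the posterior mass is split across two variants.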

Data

The colocalization is performed between FinnGen endpoints as well as between FinnGen endpoints and various QTL resources, as shown in the image below.

These resources are listed below:

FinnGen resources

The SuSiE finemapping results for the release were used as the FinnGen data.

Expression QTL datasets

GTEx v8: SuSiE fine-mapping, 49 tissues, donors of mixed ancestry, Aguet et al. (2019, BioRxiv) (49 tissues only involve tissues with a sample size of n >= 50). Fine-mapping performed by Hilary Finucane, Jacob Ulirsch, Masahiro Kanai from the . Effect size interpretation: change in normalised gene expression (sd units) per alternate allele. Normalization = inverse normal transformation.
EMBL-EBI (European Bioinformatics Institute) . eQTL data from 24 tissues/cell types, 16 RNAseq sources, 6 Microarray, SuSiE fine-mapping, donors of 88% European ancestry, Kerimov et al. (2020, BioRxiv). For RNAseq data, four quantification methods (gene expression, exon expression, transcript usage, txrevise event usage). Fine-mapping was performed by . Effect size interpretation: change in normalised gene expression (sd units) per alternate allele. Normalization = inverse normal transformation.

Metabolon QTL datasets

GeneRISK: 186 lipid species QTLs, SuSiE fine-mapping of Widen et al. (2020), 7632 Finnish samples. Effect size interpretation: change in standard deviation of the lipid species per alternate allele.

Biomarkers

UK Biobank: 36 continuous endpoints, 57 biomarkers from UKBB prepared by , SuSiE fine-mapping. Effect size interpretation for quantitative traits: change in standard deviation of the normalized outcome per alternate allele. Effect size interpretation for binary traits: increase in log(odds ratio) per alternate allele.

Post-colocalization QC

Only unique source1-source2-pheno1-pheno2-tissue2-quant2-locus_id1-locus_id2 combinations were included in the results. FinnGen endpoints with _COMORB-definition were left out of the results.
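A minimal sketch of this filtering step is shown below. The key columns are the ones named above; treating any phenotype whose name contains "_COMORB" as a comorbidity endpoint, and checking both pheno1 and pheno2, are assumptions made for illustration.

```python
KEY_COLUMNS = (
    "source1", "source2", "pheno1", "pheno2",
    "tissue2", "quant2", "locus_id1", "locus_id2",
)


def post_colocalization_qc(rows):
    """Drop _COMORB endpoints and keep the first row for each unique key.

    rows: iterable of dicts that contain at least the KEY_COLUMNS.
    """
    seen = set()
    kept = []
    for row in rows:
        # Assumed convention: comorbidity endpoints carry "_COMORB" in their name.
        if "_COMORB" in row["pheno1"] or "_COMORB" in row["pheno2"]:
            continue
        key = tuple(row[column] for column in KEY_COLUMNS)
        if key in seen:
            continue  # duplicate combination, already kept once
        seen.add(key)
        kept.append(row)
    return kept
```

Keeping the first occurrence of each combination is a design choice of this sketch; the text does not say which duplicate is retained.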

Acknowledgements

We thank the following people for helping us assemble the QTL resources:

Kaur Alasoo and Nurlan Kerimov provided us with the fine-mapped EMBL-EBI eQTL catalogue datasets.
Hilary Finucane, Jacob Ulirsch, Masahiro Kanai gave us access to their fine-mapped GTEx data.
