
R6

Introduction

The FinnGen research project is a public-private partnership combining genotype data from Finnish biobanks and digital health record data from Finnish health registries. FinnGen provides a unique opportunity to study genetic variation in relation to disease trajectories in an isolated population.

FinnGen is a growing project, aiming to include 500,000 individuals by the end of 2023.

FinnGen results are subject to a one-year embargo, after which they are made available to the wider scientific community via the PheWeb browser or through data download.

Data download

To download FinnGen summary statistics, you will need to fill in the online form at this link. You will then receive an email containing detailed instructions for downloading the data.

Release 6 contains

GWAS summary association statistics
Fine-mapping results
from

Using FinnGen data for publications

Please remember to acknowledge the FinnGen study when using these results in publications.

You can use the following text:

We want to acknowledge the participants and investigators of FinnGen study.

Manifest

The manifest file with the link to all the downloadable summary stats is available at:

Data description

File naming pattern and file structure

Summary association statistics

GWAS summary statistics (tab-delimited, bgzipped, genome build 38, tabix index files included) are named as {endpoint}.gz. For example, endpoint I9_CHD has I9_CHD.gz and I9_CHD.gz.tbi.

To learn more about the methods used, see the Association tests section.

The {endpoint}.gz have the following structure:

*) Note that the results are based on imputed genotype dosages and were produced using SAIGE, which is why some values are not integers and may contain decimals.
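Because bgzip output is gzip-compatible, a summary statistics file can be inspected with Python's standard library alone. The following is a minimal sketch, assuming the first line of the file is a tab-delimited header (possibly prefixed with '#'); the function names are hypothetical and no particular column names are assumed.

```python
import gzip


def read_header(path):
    """Return the column names from the first line of a bgzipped TSV file."""
    # bgzip writes a series of gzip members, which the gzip module reads transparently.
    with gzip.open(path, "rt") as handle:
        first_line = handle.readline().rstrip("\n")
    # Assumption: a leading '#' on the header line, if present, is not part of a column name.
    return first_line.lstrip("#").split("\t")


def iter_rows(path):
    """Yield each data row as a dict keyed by column name (all values kept as strings)."""
    with gzip.open(path, "rt") as handle:
        columns = handle.readline().rstrip("\n").lstrip("#").split("\t")
        for line in handle:
            yield dict(zip(columns, line.rstrip("\n").split("\t")))
```

For example, read_header("I9_CHD.gz") lists the columns of the I9_CHD file once it has been downloaded. For region queries, the accompanying .tbi index can be used with tabix instead of scanning the whole file.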

Fine-mapping results

Two fine-mapping methods were used: SuSiE and FINEMAP.

Fine-mapping results are tab-delimited and bgzipped.

SuSiE results have the following filename pattern:

{endpoint}.SUSIE.cred.bgz
{endpoint}.SUSIE.cred_99.bgz
{endpoint}.SUSIE.snp.bgz

FINEMAP results have the following filename pattern:

{endpoint}.FINEMAP.config.bgz
{endpoint}.FINEMAP.region.bgz
{endpoint}.FINEMAP.snp.bgz
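The filename patterns above can be expanded for a given endpoint with a small helper. This is only an illustrative sketch; the function name is hypothetical, and the suffixes are copied from the patterns listed above.

```python
# Suffixes copied from the filename patterns listed in this section.
SUSIE_SUFFIXES = ("SUSIE.cred.bgz", "SUSIE.cred_99.bgz", "SUSIE.snp.bgz")
FINEMAP_SUFFIXES = ("FINEMAP.config.bgz", "FINEMAP.region.bgz", "FINEMAP.snp.bgz")


def fine_mapping_files(endpoint):
    """Return the expected fine-mapping filenames for one endpoint, grouped by method."""
    return {
        "SuSiE": [f"{endpoint}.{suffix}" for suffix in SUSIE_SUFFIXES],
        "FINEMAP": [f"{endpoint}.{suffix}" for suffix in FINEMAP_SUFFIXES],
    }
```

For the endpoint I9_CHD, fine_mapping_files("I9_CHD")["SuSiE"] starts with I9_CHD.SUSIE.cred.bgz.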

To learn more about the methods used, see the Fine-mapping section.

{endpoint}.SUSIE.cred.bgz contains credible set summaries from SuSiE fine-mapping for all genome-wide significant regions. {endpoint}.SUSIE.cred_99.bgz contains the 99% credible set summaries, whereas the default file uses 95% credible sets. Both have the following structure:

Column name

Description

{endpoint}.SUSIE.snp.bgz contains variant summaries with credible set information and has the following structure:

{endpoint}.FINEMAP.config.bgz contains summaries of the fine-mapped variant configurations from the FINEMAP method and has the following structure:

Column name

Description

{endpoint}.FINEMAP.region.bgz contains summary statistics on the number of independent signals in each region and has the following structure:

Column name

Description

{endpoint}.FINEMAP.snp.bgz contains variant-level summary statistics and the credible set each variant may belong to. It has the following structure:

Column name

Description

LD estimation

Linkage disequilibrium (LD) was estimated from the SISu v3 panel for each chromosome. Use LDstore to work with the bcor files, for example:

ldstore --bcor FG_LD_chr1.bcor --incl-range 20000000-50000000 --table output_file_name.table

To learn more about the methods used, see the LD estimation section.

Variant annotation

The variant annotation has measures (HWE, INFO, ...) listed per batch.

Data releases

Timeline for releases:

Release

Date release to partners

Date release to public

Total sample size [1]

Q4 2018 (Nov)

[1] samples used for PheWAS.

How to cite

Please use the following description when referring to our project:

The FinnGen study is a large-scale genomics initiative that has analyzed over 500,000 Finnish biobank samples and correlated genetic variation with health data to understand disease mechanisms and predispositions. The project is a collaboration between research organisations and biobanks within Finland and international industry partners.

When using these results in publications, please remember to:

Acknowledge the FinnGen study. You can use the following text:

“We want to acknowledge the participants and investigators of the FinnGen study”

Cite our latest publication:

Kurki, M.I., Karjalainen, J., Palta, P. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613, 508–518 (2023). https://doi.org/10.1038/s41586-022-05473-8

Furthermore, if possible, include "FinnGen" as a keyword for your publication.

If you want to cite this website, use the following citation:

Methods

Participating biobanks/cohorts

In addition to the biobanks mentioned in previous releases, the following biobanks and cohorts are part of the R6 release:

Genotypes

FinnGen individuals were genotyped with Illumina and Affymetrix chip arrays (Illumina Inc., San Diego, CA, USA, and Thermo Fisher Scientific, Santa Clara, CA, USA).

Chip genotype data were imputed using the population-specific SISu v3 imputation reference panel of 3,775 whole genomes.

Merged imputed genotype data is composed of 75 data sets that include samples from multiple cohorts.

Total number of individuals: 271,341
Total number of variants (merged set): 16,962,023
Reference assembly: GRCh38/hg38

Genotype data

Chip genotype data processing and QC

Samples were genotyped with Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix arrays (Thermo Fisher Scientific, Santa Clara, CA, USA).

Genotype calls were made with GenCall and zCall algorithms for Illumina and AxiomGT1 algorithm for Affymetrix data.

Chip genotyping data produced with previous chip platforms and reference genome builds were lifted over to build version 38 (GRCh38/hg38) following the protocol described here: dx.doi.org/10.17504/protocols.io.nqtddwn.

Quality control

In sample-wise quality control, individuals with ambiguous gender, high genotype missingness (>5%), excess heterozygosity (±4 SD) or non-Finnish ancestry were excluded. In variant-wise quality control, variants with high missingness (>2%), a low HWE P-value (<1e-6) or a low minor allele count (MAC < 3) were excluded.
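The thresholds above can be written as two simple predicates. This is a minimal sketch rather than the pipeline's actual code: the function and argument names are hypothetical, heterozygosity is assumed to be expressed as a z-score (standard deviations from the mean), and the handling of values exactly at a threshold is an assumption.

```python
def sample_passes_qc(sex_is_ambiguous, missingness, het_z_score, is_finnish_ancestry):
    """Apply the sample-wise exclusion criteria described in the text."""
    if sex_is_ambiguous:
        return False
    if missingness > 0.05:  # genotype missingness above 5%
        return False
    if abs(het_z_score) > 4:  # heterozygosity more than 4 SD from the mean (assumed z-score)
        return False
    return is_finnish_ancestry


def variant_passes_qc(missingness, hwe_p, mac):
    """Apply the variant-wise exclusion criteria described in the text."""
    return missingness <= 0.02 and hwe_p >= 1e-6 and mac >= 3
```

A variant with 1% missingness, an HWE P-value of 0.2 and a MAC of 10 passes, while the same variant with a MAC of 2 does not.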

Pre-phasing

Prior to imputation, chip-genotyped samples were pre-phased with Eagle 2.3.5 using the default parameters, except that the number of conditioning haplotypes was set to 20,000.

Genotype imputation

Genotype imputation was done with the population-specific SISu v3 reference panel.

The variant call set was produced with the GATK HaplotypeCaller algorithm, following GATK best practices for variant calling.

Genotype-, sample- and variant-wise QC was applied in an iterative manner by using the Hail framework v0.1 and the resulting high-quality WGS data for 3,775 individuals were phased with Eagle 2.3.5 as described in the previous section.

Genotype imputation was carried out by using the population-specific SISu v3 imputation reference panel with Beagle 4.1 (version 08Jun17.d8b) as described in the following protocol: dx.doi.org/10.17504/protocols.io.nmndc5e.

Post-imputation quality control involved checking that the imputation INFO-value distribution conformed to expectations, checking MAF differences between the target dataset and the imputation reference panel, and checking the chromosomal continuity of the imputed genotype calls.

SISu reference panel

SISu v3 consists of 3,775 WGS of Finnish individuals from six research cohorts:

METSIM (PIs Markku Laakso and Mike Boehnke)
FINRISK (PI Pekka Jousilahti)
Health2000 (PI Seppo Koskinen)
Finnish Migraine Family Study (PI Aarno Palotie)
Merck/Tienari samples (PI Pentti Tienari)
MESTA samples (PI Jaana Suvisaari)

High-coverage (25-30x) WGS data used to develop the SISu v3 reference panel were generated at the Broad Institute of MIT and Harvard and at the McDonnell Genome Institute at Washington University; and jointly processed at the Broad Institute.

Software used

Cromwell-42
Wdltool-0.14
Plink 1.9 and 2.0
BCFtools 1.7 and 1.9
Eagle 2.3.5
Beagle 4.1 (version 08Jun17.d8b)
R 3.4.1 (packages: data.table 1.10.4, sm 2.2-5.4)

LD estimation

The BCOR files were created using LDstore from the Finnish SISu v3 panel.

The panel has been divided per chromosome. For example, to use the LD information in the first chromosome, FG_LD_chr1.bcor would be the file to use.

Settings used

number of samples: 3775
window size: 1500 kb
accuracy: low
number of threads: 96
LD threshold to include correlations: 0.05

Example usage

LDstore can be downloaded via:

And an example to extract variant range 20 Mb - 50 Mb from chromosome 7 is as follows:
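The command for this example is not reproduced in this text. As a hedged sketch, the equivalent call can be assembled by following the chromosome 1 command shown in the Data description section and the FG_LD_chr{N}.bcor naming described above. Only the flags from that command (--bcor, --incl-range, --table) are used; the helper name and the output file name are hypothetical.

```python
import shlex


def ldstore_table_command(chromosome, start_bp, end_bp, output_path):
    """Build an ldstore command line that extracts a base-pair range from a per-chromosome bcor file."""
    args = [
        "ldstore",
        "--bcor", f"FG_LD_chr{chromosome}.bcor",  # per-chromosome file naming from the text
        "--incl-range", f"{start_bp}-{end_bp}",
        "--table", output_path,
    ]
    # shlex.join quotes arguments where needed so the string can be pasted into a shell.
    return shlex.join(args)


print(ldstore_table_command(7, 20_000_000, 50_000_000, "output_file_name.table"))
```

The printed string can be run in a shell where LDstore is installed and the bcor file is in the working directory.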

Note

We do not recommend using these LD estimate files for fine-mapping and similar analyses, since many fine-mapping methods (e.g. SuSiE) require in-sample LD information to give good results.

Endpoints

Registries

The disease endpoints were defined using nationwide registries:

Drug purchase and Drug Reimbursement

We harmonized over the International Classification of Diseases (ICD) revisions 8, 9 and 10, cancer-specific ICD-O-3, (NOMESCO) procedure codes, Finnish-specific Social Insurance Institute (KELA) drug reimbursement codes and ATC-codes.

These registries spanning decades were electronically linked to the cohort baseline data using the unique national personal identification numbers assigned to all Finnish citizens and residents.

A full list of FinnGen endpoints is available for release 6.

Excluded endpoints

Endpoints with fewer than 80 cases and developmental “helper” endpoints were excluded from the final PheWAS (“OMIT” tag in the endpoint definition file).

Endpoints with fewer than 150 cases are not released by THL (Finnish Institute for Health and Welfare).

Risteys

Risteys (Finnish for "intersection") allows browsing of the FinnGen data at the phenotype level, including endpoint definitions, statistics on the number of individuals, gender distribution, and longitudinal relationships.

Sample QC and PCA

This is a description of the quality control procedures applied before running the GWAS.

PCA

The PCA for population structure has been run in the following way:

Association tests

Endpoint

We included 2,861 endpoints in the analysis. Endpoints with fewer than 80 cases among the 260,405 samples were excluded, as were endpoints labeled with an OMIT tag in the endpoint definition file.

Null models

For the null model computation for each endpoint, we used age, sex, 10 PCs and genotyping batch as covariates. To avoid convergence issues, a genotyping batch was included as a covariate for an endpoint only if it had at least 10 cases and 10 controls for that endpoint. One genotyping batch needs to be excluded from the covariates so that they are not saturated. We excluded Thermo Fisher batch 16, as it was not enriched for any particular endpoint.
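The batch covariate rule can be sketched as follows. This is an illustration of the rule as described, not the actual pipeline code; the function name, the input layout and the batch labels in the usage note are hypothetical.

```python
def batch_covariates(counts_by_batch, reference_batch, min_count=10):
    """Select genotyping batches to use as covariates for one endpoint.

    counts_by_batch maps a batch name to a (n_cases, n_controls) tuple for the endpoint.
    reference_batch is the batch left out so that the batch covariates are not saturated.
    """
    selected = []
    for batch, (n_cases, n_controls) in sorted(counts_by_batch.items()):
        if batch == reference_batch:
            continue  # always dropped, regardless of its counts
        if n_cases >= min_count and n_controls >= min_count:
            selected.append(batch)
    return selected
```

For instance, with counts {"batch_a": (25, 900), "batch_b": (4, 800), "batch_ref": (50, 1000)} and "batch_ref" as the reference, only "batch_a" is returned: "batch_b" has too few cases and "batch_ref" is the omitted reference.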

For calculating the genetic relationship matrix (GRM), only variants imputed with an INFO score > 0.95 in all batches were used. Variants with > 3% missing genotypes were excluded, as were variants with MAF < 1%. The remaining variants were LD pruned with a 1 Mb window and an r² threshold of 0.1. This resulted in a set of 59,037 well-imputed, non-rare variants for GRM calculation.

SAIGE options for the null model computation:

LOCO = false
numMarkers = 30
traceCVcutoff = 0.0025

Association tests

We ran association tests with SAIGE against each of the 2,861 endpoints for each variant with a minimum allele count of 5 from the imputation pipeline (SAIGE option minMAC = 5). We filtered the results to include variants with an imputation INFO > 0.6.
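To make the two filters concrete, the sketch below computes a minor allele count from imputed dosages and applies both thresholds. It is an illustration only: the function names are hypothetical, dosages are assumed to count the alternate allele on a 0 to 2 scale, and treating minMAC as an inclusive lower bound is an assumption rather than a statement about SAIGE internals.

```python
def minor_allele_count(dosages):
    """Minor allele count from per-sample alternate-allele dosages (each between 0 and 2)."""
    alt_count = sum(dosages)
    total_alleles = 2 * len(dosages)
    # The minor allele is whichever of the two alleles is rarer in the sample.
    return min(alt_count, total_alleles - alt_count)


def keep_variant(dosages, info, min_mac=5, min_info=0.6):
    """Return True if the variant passes both the MAC and the INFO thresholds from the text."""
    return minor_allele_count(dosages) >= min_mac and info > min_info
```

Because dosages are imputed, the resulting count does not have to be an integer.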

PheWeb

The PheWeb portal can be used to browse results from FinnGen's predetermined endpoints (or 'phenotypes') a.k.a. core analysis results. FinnGen PheWeb tutorial is available here.

These clinician-curated endpoints were analysed for genetic associations using SAIGE, which allows for disproportionate case-control numbers and corrects for relatedness between samples with a sparse genetic relatedness matrix.

The results from each association run are uploaded onto the PheWeb portal, which can be accessed by clicking this link:

https://r6.finngen.fi/

Home Page

The figure below shows a table of the first few endpoints ('phenotypes') in FinnGen with the highest numbers of GWAS significant loci, along with a summary of the case-control analyses and the number of hits.

You can reorder the table by clicking on the appropriate header value (in the figure above, we clicked on GWAS significant loci to order the table based on the number of GWAS loci).

From the PheWeb home page, you can also go directly to the coding variant browser by clicking the icon in the top right corner.

Endpoint Page

Upon clicking an endpoint ('phenotype'), you will then be directed to the endpoint's page which will contain information such as case-control numbers and results from the association scan of the endpoint. In the following screenshot, we show the endpoint results for “Type 2 diabetes, wide definition”.

On the endpoint page, you will find a similar Manhattan plot from the association scan which summarizes the association results for your endpoint.

Scrolling further, you will also be able to see the Manhattan plot in a tabular format, distinguished by either the traditional GWAS hits or based on a.

Variant Page

You can also browse based on a variant of your choice and see a PheWAS plot:

The variant page shows the information on the gene that the variant is in, the most severe consequence annotation of the variant (from ), its allele frequency, whether the variant was imputed or not (INFO score), and links to external sites to obtain further information on the variant such as , the , and the

The Manhattan plot shown in the figure above also shows p-values from the association scans for FinnGen endpoints. Scrolling down, you will again be able to see the association scan results for the FinnGen endpoints in this variant in a tabular format.

To see the corresponding , you can click show lavaa plot on top of the manhattan plot.

All results (endpoint and variant-wise) can be downloaded in a tabular format by clicking Download table.

Gene Page

Gene pQTL and disease colocalizations

The gene page of the FinnGen PheWeb browser can be found from by specifying the gene symbol of interest. The bottom section of the page contains gene pQTL and disease colocalization data available for the FinnGen imputed SNPs. The main table contains gathered from finemapping results and combined across Olink and Somascan proteomics QTL platforms (FinnGen and UK Biobank Pharma Proteomics Project). The main table includes the following columns:

source - pQTL platform source (i.e. FinnGen Olink, FinnGen Somascan, UKB-PPP)
region - region for which the fine-mapping was run
CS - running number for independent credible sets in a region

The nested sub-table for a single gene pQTL contains a list of disease colocalizations between the FinnGen endpoints and the pQTL in question colocalizing with the lead variant of the pQTL (read more about ). The sub-table includes the following columns:

phenotype - FinnGen endpoint (by clicking to the phenotype you will be navigated to the PheWeb region page corresponding to the phenotype in question)
description - FinnGen endpoint description
clpp - causal posterior probability calculated for a colocalization

All results can be downloaded in a tabular format by clicking Download table.

Note: PheWeb is continuously being developed, and some features available in newer DFs may not be available in PheWeb versions for earlier DFs.

Colocalization

Colocalizations in FinnGen

Our colocalization approach uses the probabilistic model for integrating GWAS and eQTL data presented in eCAVIAR (Hormozdiari et al. 2016). Compared to eCAVIAR, we are using SuSiE (Wang et al. 2019) to fine-map our inputs and provide an additional colocalization metric (CLPA).

Our goal is to extract a list of genomic regions that show colocalization between two phenotypes p1 and p2. Further, we assume that the summary statistics of p1 and p2 have been fine-mapped. The fine-mapping output for each phenotype contains three columns: the variant identifier (VAR), posterior inclusion probability (PIP), and the credible set (CS) identifier.

CLPP

The Causal Posterior Probability (CLPP) is computed between two credible sets cs1 and cs2, with cs1 coming from a given phenotype p1 and cs2 coming from phenotype p2. CLPP is defined as follows: For vectors x and y, containing the PIP for variants in cs1 and cs2, respectively, CLPP is calculated by

This CLPP calculation is similar to equation 8 in Hormozdiari et al. 2016.

CLPP is dependent on the credible set size. By definition, any credible set size > 1 will yield a CLPP < 1.
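The formula itself is not reproduced in this text. As a hedged sketch, the code below assumes that CLPP is the sum, over variants present in both credible sets, of the product of the two PIPs, in the spirit of the per-variant product used in eCAVIAR. The function name and the dictionary layout (variant identifier mapped to PIP) are illustrative assumptions.

```python
def clpp(pips_1, pips_2):
    """Assumed CLPP: sum of PIP products over variants shared by two credible sets.

    pips_1 and pips_2 map variant identifiers to posterior inclusion probabilities.
    """
    shared_variants = pips_1.keys() & pips_2.keys()
    return sum(pips_1[variant] * pips_2[variant] for variant in shared_variants)
```

Under this assumption, two single-variant credible sets that share a variant with PIP 1 give a CLPP of 1, while two identical two-variant credible sets with PIPs of 0.5 each give 0.5, which matches the dependence on credible set size described above.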

CLPA

We derived another colocalization metric called causal posterior agreement (CLPA) that is independent of credible set size.

The picture below shows how colocalizations are defined.

Example Comparison

This rough example shows why we mostly use CLPA: it is independent of credible set size.
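A small numeric comparison can illustrate the point. Neither formula is given in this text, so both definitions below are assumptions: CLPP is taken as the sum of PIP products over shared variants, and CLPA as the sum of the element-wise minimum of the two PIPs over shared variants. The variant identifiers and PIP values are made up for illustration.

```python
def clpp(pips_1, pips_2):
    """Assumed CLPP: sum of PIP products over shared variants."""
    return sum(pips_1[v] * pips_2[v] for v in pips_1.keys() & pips_2.keys())


def clpa(pips_1, pips_2):
    """Assumed CLPA: sum of the smaller of the two PIPs over shared variants."""
    return sum(min(pips_1[v], pips_2[v]) for v in pips_1.keys() & pips_2.keys())


# Two pairs of identical credible sets that differ only in size (hypothetical values).
small = {"v1": 0.5, "v2": 0.5}
large = {"v1": 0.25, "v2": 0.25, "v3": 0.25, "v4": 0.25}

for name, cs in (("2 variants", small), ("4 variants", large)):
    print(name, "CLPP =", clpp(cs, cs), "CLPA =", clpa(cs, cs))
```

With these definitions, the perfectly overlapping pair of credible sets has a CLPP that drops from 0.5 to 0.25 as the set grows from two to four variants, whereas CLPA stays at 1.0 in both cases.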

Data

The colocalization is performed between FinnGen endpoints as well as between FinnGen endpoints and various QTL resources, as shown in the image below.

These resources are listed below:

FinnGen resources

The SuSiE finemapping results for the release were used as the FinnGen data.

Expression QTL datasets

GTEx v8: SuSiE fine-mapping, 49 tissues, donors of mixed ancestry, Aguet et al. (2019, BioRxiv) (49 tissues only involve tissues with a sample size of n >= 50). Fine-mapping performed by Hilary Finucane, Jacob Ulirsch, Masahiro Kanai from the . Effect size interpretation: change in normalised gene expression (sd units) per alternate allele. Normalization = inverse normal transformation.
EMBL-EBI (European Bioinformatics Institute) . eQTL data from 24 tissues/cell types, 16 RNAseq sources, 6 Microarray, SuSiE fine-mapping, donors of 88% European ancestry, Kerimov et al. (2020, BioRxiv). For RNAseq data, four quantification methods (gene expression, exon expression, transcript usage, txrevise event usage). Fine-mapping was performed by . Effect size interpretation: change in normalised gene expression (sd units) per alternate allele. Normalization = inverse normal transformation.

Metabolon QTL datasets

GeneRISK: 186 lipid species QTLs, SuSiE fine-mapping of Widen et al. (2020), 7632 Finnish samples. Effect size interpretation: change in standard deviation of the lipid species per alternate allele.

Biomarkers

UK Biobank: 36 continuous endpoints, 57 biomarkers from UKBB prepared by , SuSiE fine-mapping. Effect size interpretation for quantitative traits: change in standard deviation of the normalized outcome per alternate allele. Effect size interpretation for binary traits: increase in log(odds ratio) per alternate allele.

Post-colocalization QC

Only unique source1-source2-pheno1-pheno2-tissue2-quant2-locus_id1-locus_id2 combinations were included in the results. FinnGen endpoints with _COMORB-definition were left out of the results.

Acknowledgements

We thank the following people for helping us assemble the QTL resources:

Kaur Alasoo and Nurlan Kerimov provided us with the fine-mapped EMBL-EBI eQTL catalogue datasets.
Hilary Finucane, Jacob Ulirsch, Masahiro Kanai gave us access to their fine-mapped GTEx data.

Fine-mapping

We used two state-of-the-art methods, FINEMAP (Benner, C. et al., 2016; Benner, C. et al., 2018) and SuSiE (Wang, G. et al., 2020) to fine-map genome-wide significant loci in FinnGen endpoints.

Briefly, there are three main steps:

1. Preprocessing

For each genome-wide significant locus (default configuration: P < 5e-8), we define a fine-mapping region by taking a 3 Mb window around the lead variant, merging regions if they overlap. We then split the input GWAS summary statistics into separate files per region for the following steps.
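The region definition can be sketched as an interval-merging step. This is a minimal illustration, not the pipeline's implementation: the function name and input layout are hypothetical, and the 3 Mb window is assumed to be centred on the lead variant (1.5 Mb on each side).

```python
def fine_mapping_regions(lead_variants, window_bp=3_000_000):
    """Build fine-mapping regions around lead variants and merge overlapping ones.

    lead_variants is an iterable of (chromosome, position) tuples.
    Returns a list of (chromosome, start, end) tuples.
    """
    half_window = window_bp // 2  # assumption: window centred on the lead variant
    intervals = sorted(
        (chrom, max(0, pos - half_window), pos + half_window)
        for chrom, pos in lead_variants
    )
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous region on the same chromosome: extend it.
            previous = merged[-1]
            merged[-1] = (chrom, previous[1], max(previous[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Two lead variants at 10 Mb and 11 Mb on the same chromosome produce overlapping windows, so they end up in a single region spanning 8.5 Mb to 12.5 Mb.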

2. LD computation

We compute in-sample dosage LD using for each fine-mapping region.

3. Fine-mapping

With the summary statistics and in-sample LD from steps 1-2 as inputs, we conduct fine-mapping using FINEMAP and SuSiE, with the maximum number of causal variants in a locus set to L = 10.

Integration to PheWeb

The "Credible Sets" table on a phenotype page in the browser shows the SuSiE fine-mapped credible sets of that phenotype. The variant shown for each credible set is the variant with the maximum PIP (posterior inclusion probability) in that credible set. In addition to the credible set variants, variants that were in sufficient LD with the lead variant (Pearson r^2 > 0.05), had a small enough p-value (pval < 0.01), and were close enough to the lead variant (distance to lead variant < 1.5 megabases) were clumped together with the credible set. Variants have been compared against the GWAS Catalog and annotated. The LD grouping, annotation and GWAS Catalog comparison were done using the autoreporting pipeline.
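The three clumping criteria can be expressed as a single predicate. This sketch only restates the thresholds from the paragraph above; the function and argument names are hypothetical, and it is not taken from the autoreporting pipeline.

```python
def clumps_with_credible_set(r2, pval, distance_bp,
                             min_r2=0.05, max_pval=0.01, max_distance_bp=1_500_000):
    """Return True if a variant meets all three criteria for being grouped with a credible set.

    r2 is the squared Pearson correlation with the lead variant, pval the association
    p-value of the variant, and distance_bp its distance to the lead variant in base pairs.
    """
    return r2 > min_r2 and pval < max_pval and distance_bp < max_distance_bp
```

A variant with r^2 = 0.3, pval = 1e-4 and a distance of 200 kb is grouped with the credible set, while the same variant at a distance of 2 Mb is not.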

The columns of the table are explained below:

Contact

For matters related to this documentation, click Edit on GitHub or send us an email at finngen-info@helsinki.fi.

For the latest updates on the project, as well as additional background information, please visit the study website https://www.finngen.fi/en or follow FinnGen on Twitter @FinnGen_FI.

If you want to host FinnGen summary statistics on your website, please get in contact with us at: humgen-servicedesk@helsinki.fi.
