R2

Introduction

FinnGen is a public-private partnership project combining genotype data from Finnish biobanks and digital health record data from Finnish health registries. FinnGen provides a unique opportunity to study genetic variation in relation to disease trajectories in an isolated population.

FinnGen is a growing project, aiming at 500,000 individuals in 2023.

FinnGen results are subject to a one-year embargo and, after that, are made available to the larger scientific community via the Pheweb browser or through data download.

Data download

To download FinnGen summary statistics, you will need to fill in the online form at this link. You will then receive an email containing detailed instructions for downloading the data.

Using FinnGen data for publications

Please remember to acknowledge the FinnGen study when using these results in publications.

You can use the following text:

We want to acknowledge the participants and investigators of the FinnGen study.

Manifest

The Manifest file with the link to all the downloadable summary statistics is available at:

Description

GWAS summary stats (tab-delimited, bgzipped, genome build 38, filtered to INFO > 0.6, index files included) are named as {endpoint}.gz. For example, endpoint I9_CHD has I9_CHD.gz and I9_CHD.gz.tbi.
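As a rough illustration of the naming scheme above, the following Python sketch derives the two file names for an endpoint and reads the header line of a downloaded file. The helper names `summary_stat_files` and `read_header` are hypothetical, and no column names are assumed. Because bgzip output is a valid multi-member gzip stream, the standard `gzip` module can read it sequentially; region queries through the `.tbi` index would need a tabix-aware tool instead.

```python
import csv
import gzip
from pathlib import Path
from typing import List, Tuple


def summary_stat_files(endpoint: str) -> Tuple[str, str]:
    """Return the summary-statistics file name and its tabix index name."""
    data_name = f"{endpoint}.gz"
    return data_name, f"{data_name}.tbi"


def read_header(path: Path) -> List[str]:
    """Read the first (header) line of a tab-delimited, (b)gzipped file."""
    with gzip.open(path, "rt", newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"))
```

For example, `summary_stat_files("I9_CHD")` gives the pair of names mentioned above, and `read_header(Path("I9_CHD.gz"))` lists whatever columns the downloaded file actually contains.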

To learn more about the methods used, see the Methods section.

The {endpoint}.gz files have the following structure:

Data releases

Timeline for releases:

Release | Date release to partners | Date release to public | Total sample size
 | Q4 2018 (27th Nov) | Q1 2020 | 96,499
 | Q2 2019 (13th May) | |

How to cite

Please use the following description when referring to our project:

The FinnGen study is a large-scale genomics initiative that has analyzed over 500,000 Finnish biobank samples and correlated genetic variation with health data to understand disease mechanisms and predispositions. The project is a collaboration between research organisations and biobanks within Finland and international industry partners.

When using these results in publications, please remember to:

Pheweb Browser

Getting started

The web browser contains all FinnGen GWAS results from release 2 and provides you with three options:

Search for the GWAS result of a variant, gene, or phenotype.
Explore the loss-of-function (LoF) burden for gene-phenotype combinations.

Phenotype list

Contains all endpoints/phenotypes for which a GWAS was run (if more than 100 cases).

Column | Description
phenotype | description
category | 13 phenotype categories
genome-wide significant loci | Variant(s) with within a +/- 500kb window.

GWAS overview

Clicking on any phenotype will show you an overview of the results:

Detailed info about phenotype definition
Manhattan plot

Locus zoom

FinnGen association locus zoom plot
Annotation with GWAS catalog variants + UK Biobank hits

Variant view

The variant view has the following URL: http://r2.finngen.fi/variant/CHR-POS-REF-ALT, e.g. http://r2.finngen.fi/variant/13-80757865-T-TA

CHR: chromosome on hg38 (1-22, X or 23)
POS: position on hg38
REF: reference allele
ALT: alternative allele
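A small Python sketch of building such a URL is shown here. It assumes the fields are joined in the order of the field list above (CHR, POS, REF, ALT), which reproduces the example URL if T is taken as the reference allele and TA as the alternative allele. The function name `variant_url` is hypothetical.

```python
BASE_URL = "http://r2.finngen.fi/variant"


def variant_url(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build a variant-view URL from hg38 coordinates and alleles.

    Assumes the order CHR-POS-REF-ALT, joined with hyphens.
    """
    if pos < 1:
        raise ValueError("pos must be a positive hg38 position")
    return f"{BASE_URL}/{chrom}-{pos}-{ref}-{alt}"


print(variant_url("13", 80757865, "T", "TA"))
```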

Gene view / LoF burden

Clicking on any gene will bring you to the gene view with association results for that gene region, the loss-of-function analysis results (for methods see LoF burden) and an annotated list of all loss of function and missense variants.

LoF burden results

Column

Methods

Participating biobanks/cohorts

The following biobanks and cohorts are part of the R2 release:

Genotypes

FinnGen individuals were genotyped with Illumina and Affymetrix chip arrays (Illumina Inc., San Diego, CA, USA, and Thermo Fisher Scientific, Santa Clara, CA, USA).

Chip genotype data were imputed using the population-specific SISu v3 imputation reference panel of 3,775 whole genomes.

Post-imputation QC involved excluding variants with imputation INFO < 0.7.

Total number of individuals: 102,739

Genotype data

Chip genotype data processing and QC

Samples were genotyped with Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix arrays (Thermo Fisher Scientific, Santa Clara, CA, USA).

Genotype calls were made with the GenCall and zCall algorithms for Illumina data and the AxiomGT1 algorithm for Affymetrix data.

Chip genotyping data produced with previous chip platforms and reference genome builds were lifted over to build version 38 (GRCh38/hg38) following the protocol described here: .

Quality control

Genotype imputation

Genotype imputation was done with the population-specific SISu v3 imputation reference panel.

The variant call set was produced with the GATK HaplotypeCaller algorithm, following GATK best practices for variant calling.

Genotype-, sample- and variant-wise QC was applied in an iterative manner, and the resulting high-quality WGS data for 3,775 individuals were phased with Eagle 2.3.5 as described in the previous section.

Genotype imputation was carried out using the population-specific SISu v3 imputation reference panel with Beagle 4.1 (version 08Jun17.d8b) as described in the following protocol: .

Post-imputation quality-control involved checking expected conformity of the imputation INFO-value distribution, MAF differences between the target dataset and the imputation reference panel and checking chromosomal continuity of the imputed genotype calls.

SISu reference panel

SISu v3 consists of 3,775 high coverage (30x) WGS Finnish individuals from six cohorts:

METSIM (PIs Markku Laakso and Mike Boehnke)
FINRISK (PI Pekka Jousilahti)
Health2000 (PI Seppo Koskinen)
Finnish Migraine Family Study (PI Aarno Palotie)
Merck/Tienari samples (PI Pentti Tienari)
MESTA samples (PI Jaana Suvisaari)

High-coverage (25-30x) WGS data used to develop the SISu v3 reference panel were generated at the Broad Institute of MIT and Harvard and at the McDonnell Genome Institute at Washington University, and were jointly processed at the Broad Institute.

Software used

Cromwell-29 and 31
Wdltool-0.14
Plink 1.9 and 2.0

Endpoints

The disease endpoints were defined using nationwide registries:

We harmonized endpoint definitions across the International Classification of Diseases (ICD) revisions 8, 9 and 10, cancer-specific ICD-O-3, Nordic Medico-Statistical Committee (NOMESCO) procedure codes, Finnish-specific Social Insurance Institute (KELA) drug reimbursement codes and ATC codes.

These registries spanning decades were electronically linked to the cohort baseline data using the unique national personal identification numbers assigned to all Finnish citizens and residents.

A full list of FinnGen endpoints is available for release 2.

Endpoints with fewer than 100 cases, near-duplicate endpoints, and developmental "helper" endpoints were excluded from the final PheWAS (column "OMIT").

Endpoints with N<150 are not released by THL (Finnish Institute for Health and Welfare).
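The exclusion rule for the final PheWAS can be sketched in Python as follows. This is only an illustration: the record keys `name`, `n_cases` and `omit`, the function name, and the example endpoint names are all hypothetical and do not reflect the actual layout of the endpoint list.

```python
from typing import Dict, List


def endpoints_for_phewas(endpoints: List[Dict], min_cases: int = 100) -> List[str]:
    """Keep endpoints with at least `min_cases` cases that are not flagged to omit.

    Endpoints with fewer than `min_cases` cases are dropped, as are endpoints
    whose (hypothetical) `omit` flag is set, e.g. near-duplicates and helpers.
    """
    return [
        e["name"]
        for e in endpoints
        if e["n_cases"] >= min_cases and not e["omit"]
    ]


example = [
    {"name": "ENDPOINT_A", "n_cases": 5000, "omit": False},
    {"name": "ENDPOINT_B", "n_cases": 42, "omit": False},    # too few cases
    {"name": "ENDPOINT_C", "n_cases": 900, "omit": True},    # flagged helper
]
print(endpoints_for_phewas(example))
```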

GWAS

We used the SAIGE software for running the R2 GWAS.

SAIGE is a mixed model logistic regression R/C++ package, able to account for related samples.

We analyzed:

1,122 endpoints

Quality control

This is a description of the quality control procedures applied before running the GWAS.

In summary, we removed 4,095 samples who were either of non-Finnish ancestry or twins/duplicates. Finnish ancestry was assessed with a combination of PCA and a Bayesian method for outlier detection.

Sample QC

Our data set initially consisted of 102,739 samples, of which we kept 100,355 after removing duplicates. Next, we proceeded to exclude samples of non-Finnish ancestry using a PCA approach.

Association tests

Null models

For the null model calculation for each endpoint, we used age, sex, 10 PCs and genotyping batch as covariates.

For calculating the genetic relationship matrix, we used 49,811 independent, common, well-imputed variants with a posterior genotyping probability >0.95 and missingness <0.05 (LD r2 < 0.1, MAF > 0.05, INFO > 0.95).
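The per-variant thresholds above can be expressed as a simple filter. The Python sketch below is an illustration only: the `Variant` record and its field names are hypothetical, and it covers the MAF, INFO and missingness cut-offs but not the LD pruning (r2 < 0.1) or the posterior genotype probability criterion, which need genotype-level data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    variant_id: str
    maf: float          # minor allele frequency
    info: float         # imputation INFO score
    missingness: float  # fraction of samples with a missing genotype


def passes_grm_filters(v: Variant, min_maf: float = 0.05,
                       min_info: float = 0.95,
                       max_missingness: float = 0.05) -> bool:
    """True if the variant is common, well imputed and rarely missing."""
    return v.maf > min_maf and v.info > min_info and v.missingness < max_missingness


def select_grm_variants(variants: List[Variant]) -> List[str]:
    """Return IDs of variants passing the thresholds (before LD pruning)."""
    return [v.variant_id for v in variants if passes_grm_filters(v)]
```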

SAIGE options for the null computation:

LOCO = false
numMarkers = 30
traceCVcutoff = 0.0025
ratioCVcutoff = 0.001
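For readers scripting their own runs, the options above can be kept in one place and rendered as command-line flags. The Python sketch below assumes a `--name=value` flag syntax with R-style `TRUE`/`FALSE` booleans, modelled on SAIGE's step 1 wrapper script; this is an assumption, so check the flag names and syntax against the SAIGE version you use. The helper `to_cli_flags` is hypothetical.

```python
from typing import Dict, List, Union

# Null-model options as listed above.
NULL_MODEL_OPTIONS: Dict[str, Union[bool, int, float]] = {
    "LOCO": False,
    "numMarkers": 30,
    "traceCVcutoff": 0.0025,
    "ratioCVcutoff": 0.001,
}


def to_cli_flags(options: Dict[str, Union[bool, int, float]]) -> List[str]:
    """Render options as --name=value flags, with booleans as TRUE/FALSE."""
    flags = []
    for name, value in options.items():
        if isinstance(value, bool):
            value = "TRUE" if value else "FALSE"
        flags.append(f"--{name}={value}")
    return flags


print(" ".join(to_cli_flags(NULL_MODEL_OPTIONS)))
```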

Association tests

We ran association tests against each of the 1,122 endpoints with SAIGE for each variant with a minor allele count of at least 10 from the imputation pipeline (SAIGE option minMAC = 10). The alternative allele is always the effect allele.
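To make the allele-count threshold concrete, the Python sketch below computes a minor allele count from per-sample alternative-allele dosages and applies a cut-off of 10. It is a simplified illustration under stated assumptions, not SAIGE's internal code: dosages are taken to lie between 0 and 2 for diploid samples, and the function names are hypothetical.

```python
from typing import Sequence


def minor_allele_count(alt_dosages: Sequence[float]) -> float:
    """Minor allele count from alternative-allele dosages (0 to 2 per sample).

    The minor allele is whichever of the two alleles is rarer, so the count
    is the smaller of the alternative and reference allele totals.
    """
    alt_count = sum(alt_dosages)
    total_alleles = 2 * len(alt_dosages)
    return min(alt_count, total_alleles - alt_count)


def passes_min_mac(alt_dosages: Sequence[float], min_mac: float = 10) -> bool:
    """True if the variant reaches the minimum minor allele count."""
    return minor_allele_count(alt_dosages) >= min_mac
```

With this definition a variant is kept whether the rare allele is the alternative or the reference allele, as long as at least 10 copies of the rarer allele are present.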

Software

The code we used is available in . The original SAIGE codebase is available in .

Workflows

We ran the analysis in Google Cloud using WDL and Cromwell. The WDL workflow metadata, including SAIGE commands and their inputs, are available at:

gs://finngen-production-library-green/R2/workflows

Loss of function burden

We estimated the loss of function (LoF) burden of each gene on every endpoint.

First, we calculated, per individual and gene, whether any loss of function variant was present, yielding an N x M matrix with 0 and 1 values (N being the number of individuals and M the number of genes).

Then we used the new summarised variables as input in the SAIGE GWAS, replacing the genotype matrix that was used in the regular GWAS.
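The collapsing step can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the input is taken to be a per-individual list of alternative-allele counts for LoF variants only, with a parallel list mapping each variant to a gene index. The function and parameter names are hypothetical.

```python
from typing import List, Sequence


def lof_burden_matrix(genotypes: Sequence[Sequence[int]],
                      variant_gene: Sequence[int],
                      n_genes: int) -> List[List[int]]:
    """Collapse LoF variant genotypes into an individuals-by-genes 0/1 matrix.

    genotypes[i][j] is the allele count of LoF variant j in individual i;
    variant_gene[j] is the index of the gene that variant j belongs to.
    An entry is 1 if the individual carries at least one LoF allele in the gene.
    """
    burden = [[0] * n_genes for _ in genotypes]
    for i, row in enumerate(genotypes):
        for j, allele_count in enumerate(row):
            if allele_count > 0:
                burden[i][variant_gene[j]] = 1
    return burden


# Three individuals, three LoF variants; variants 0 and 1 are in gene 0.
genotypes = [[0, 1, 0], [0, 0, 0], [2, 0, 1]]
print(lof_burden_matrix(genotypes, variant_gene=[0, 0, 1], n_genes=2))
```

Each column of the resulting matrix then plays the role that a single variant's genotype vector plays in the regular GWAS.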

Contact

For matters related to this documentation, click Edit on GitHub or send us an email at finngen-info@helsinki.fi.

Please consider visiting the study website: and follow FinnGen on twitter:

If you want to host FinnGen summary statistics on your website, please get in contact with us at: humgen-servicedesk@helsinki.fi.
