R2

Introduction

FinnGen is a public-private partnership project combining genotype data from Finnish biobanks and digital health record data from Finnish health registries. FinnGen provides a unique opportunity to study genetic variation in relation to disease trajectories in an isolated population.

FinnGen is a growing project, aiming at 500,000 individuals in 2023.

FinnGen results are subject to a one-year embargo and, after that, are available to the larger scientific community via the Pheweb browser or through data download.

Data download

To download FinnGen summary statistics you will need to fill in the online form at this link. You will then receive an email containing the detailed instructions for downloading the data.

Using FinnGen data for publications

Please remember to acknowledge the FinnGen study when using these results in publications.

You can use the following text:

We want to acknowledge the participants and investigators of the FinnGen study.

Manifest

The Manifest file with the link to all the downloadable summary statistics is available at: https://storage.googleapis.com/finngen-public-data-r2/summary_stats/r2_manifest.tsv

Description

GWAS summary stats (tab-delimited, bgzipped, genome build 38, filtered to INFO > 0.6, tabix index files included) are named as {endpoint}.gz. For example, endpoint I9_CHD has I9_CHD.gz and I9_CHD.gz.tbi.

To learn more about the methods used, see section GWAS.

The {endpoint}.gz files have the following structure:

| Column name | Description |
| --- | --- |
| chrom | chromosome on build GRCh38 (1-22, X) |
| pos | position in base pairs on build GRCh38 |
| ref | reference allele |
| alt | alternative allele (effect allele) |
| rsids | variant identifier |
| nearest_genes | nearest gene name from variant |
| pval | p-value from SAIGE |
| beta | effect size estimated with SAIGE for the alternative allele |
| sebeta | standard error of effect size estimated with SAIGE |
| maf | minor allele frequency |
| maf_cases | minor allele frequency among cases |
| maf_controls | minor allele frequency among controls |

How to cite

Please use the following description when referring to our project:

The FinnGen study is a large-scale genomics initiative that has analyzed over 500,000 Finnish biobank samples and correlated genetic variation with health data to understand disease mechanisms and predispositions. The project is a collaboration between research organisations and biobanks within Finland and international industry partners.

When using these results in publications, please remember to:

  1. Acknowledge the FinnGen study. You can use the following text:

“We want to acknowledge the participants and investigators of the FinnGen study”

  2. Cite our latest publication:

Kurki, M.I., Karjalainen, J., Palta, P. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613, 508–518 (2023). https://doi.org/10.1038/s41586-022-05473-8

Furthermore, if possible, include "FinnGen" as a keyword for your publication.

If you want to cite this website, use the following citation:

@online{finngen,
  author = {FinnGen},
  title = {{FinnGen} Documentation of R2 release},
  year = 2020,
  url = {https://finngen.gitbook.io/documentation},
  urldate = {YYYY-MM-DD}
}

Phenotype list

Genome-wide significant loci = ??

Contains all endpoints/phenotypes for which a GWAS was run (if more than 100 cases).

| Column | Description |
| --- | --- |
| phenotype | Endpoint description |
| category | 13 phenotype categories |
| genome-wide significant loci | Variant(s) with $P \leq 5 \cdot 10^{-8}$ within a +/- 500kb window. |
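The "genome-wide significant loci" column can be approximated from a summary statistics file. The sketch below is only one plausible reading of the definition above: it takes the most significant variant first and treats any other significant variant within 500 kb on the same chromosome as part of the same locus. The function name `count_significant_loci` and the greedy merging strategy are assumptions made for illustration, not the procedure used to build the phenotype list.

```python
def count_significant_loci(variants, p_threshold=5e-8, window=500_000):
    """Greedy locus counting (illustrative assumption, not the FinnGen pipeline).

    variants: iterable of (chrom, pos, pval) tuples.
    Returns the list of lead variants, one per locus.
    """
    # Keep only genome-wide significant variants, most significant first.
    hits = sorted((v for v in variants if v[2] <= p_threshold), key=lambda v: v[2])
    leads = []
    for chrom, pos, pval in hits:
        # A variant opens a new locus only if no existing lead variant
        # lies within +/- window base pairs on the same chromosome.
        if all(c != chrom or abs(pos - p) > window for c, p, _ in leads):
            leads.append((chrom, pos, pval))
    return leads


example = [
    ("1", 1_000_000, 1e-10),
    ("1", 1_200_000, 1e-9),   # within 500 kb of the first hit: same locus
    ("1", 2_000_000, 3e-8),   # more than 500 kb away: separate locus
    ("2", 1_000_000, 1e-7),   # not genome-wide significant
]
loci = count_significant_loci(example)
```

Other merging rules (for example, fixed non-overlapping windows) would give different counts near locus boundaries.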

Getting started

The web browser (r2.finngen.fi) contains all FinnGen GWAS results from release 2 and provides you with three options:

  1. Search for the GWAS result of a variant, phenotype or gene.

  2. Explore the loss-of-function (LoF) burden for gene-phenotype combinations.

  3. Find a particular phenotype/endpoint.

Variant view

The variant view has the following URL: http://r2.finngen.fi/variant/CHR-POS-REF-ALT, e.g. http://r2.finngen.fi/variant/13-80757865-T-TA

  • CHR: chromosome on hg38 (1-22, X or 23)

  • POS: position on hg38

  • REF: reference allele

  • ALT: alternative allele
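As a small illustration of the URL scheme, the helper below assembles a variant view link from its four components. The function name `variant_url` and the validation rules are hypothetical. The allele order (reference before alternative) follows the documented example, in which T is the reference allele and TA the alternative.

```python
# Chromosome labels accepted by the browser according to the list above.
VALID_CHROMS = {str(c) for c in range(1, 24)} | {"X"}


def variant_url(chrom, pos, ref, alt, base="http://r2.finngen.fi"):
    """Build a variant view URL (hypothetical helper, not part of PheWeb)."""
    chrom = str(chrom)
    if chrom not in VALID_CHROMS:
        raise ValueError(f"unsupported chromosome: {chrom}")
    pos = int(pos)
    if pos < 1:
        raise ValueError("position must be a positive integer")
    return f"{base}/variant/{chrom}-{pos}-{ref}-{alt}"


url = variant_url(13, 80757865, "T", "TA")
```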

Gene view / LoF burden

Clicking on any gene will bring you to the gene view with association results for that gene region, the loss-of-function analysis results (for methods see LoF burden) and an annotated list of all loss-of-function and missense variants.

LoF burden results

| Column | Description |
| --- | --- |
| p-value, beta | P-value and beta from association test. |
| variants | All LoF variants within that gene. |

GWAS overview

Clicking on any phenotype will show you an overview of the results:

  • Detailed info about phenotype definition

  • Manhattan plot

  • List of top hits

  • Q-Q-plot

Manhattan plot

Clicking on any point will lead you to the locus zoom view.

Top hits

| Column | Description |
| --- | --- |
| Gene | Clicking on a gene brings you to the LoF burden analysis. |
| FIN enrichment | $\textrm{AF}_{FIN}/\textrm{AF}_{NFE}$ (NFE = non-Finnish European) |
| p-value, OR | From association test (alternative allele = effect allele) |
| UKBB | P-value in UKBB (if available) |

Locus zoom

  • FinnGen association locus zoom plot

  • Annotation with GWAS catalog variants + UK Biobank hits

Details

  • URL locus zoom: http://r2.finngen.fi/region/endpoint/CHR:START-END, e.g. http://r2.finngen.fi/region/J10_ASTHMA_EXMORE/5:132261855-132661855 (CHR: chromosome on hg38, START/END: window start and end position on hg38)

  • For chromosome X, use either X or 23.

Participating biobanks/cohorts

The following biobanks and cohorts are part of the R2 release:

  • Auria Biobank

  • Blood Service Biobank

  • Borealis Biobank

  • Botnia study

  • Eastern Finland Biobank

  • FinHealth

  • FINRISK

  • GENERISK

  • Health 2000/2011

  • Helsinki Biobank

  • Migraine Family Study

  • THL Diabetes

  • SUPER

Genotypes

FinnGen individuals were genotyped with Illumina and Affymetrix chip arrays (Illumina Inc., San Diego, and Thermo Fisher Scientific, Santa Clara, CA, USA).

Chip genotype data were imputed using the population-specific SISu v3 imputation reference panel of 3,775 whole genomes.

Post-imputation QC involved excluding variants with imputation INFO < 0.7.

  • Total number of individuals: 102,739

  • Total number of variants (merged set): 17,054,975

  • Reference assembly: GRCh38/hg38

    Variant view: displaying on the x-axis all phenotypes and phenotype categories and on the y-axis the p-values.

    Genotype data

Chip genotype data processing and QC

Samples were genotyped with Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix arrays (Thermo Fisher Scientific, Santa Clara, CA, USA).

Genotype calls were made with GenCall and zCall algorithms for Illumina and AxiomGT1 algorithm for Affymetrix data.

Chip genotyping data produced with previous chip platforms and reference genome builds were lifted over to build version 38 (GRCh38/hg38) following the protocol described here: dx.doi.org/10.17504/protocols.io.nqtddwn.

Quality control

In sample-wise quality control, individuals with ambiguous gender, high genotype missingness (>5%), excess heterozygosity (±4 SD) and non-Finnish ancestry were excluded. In variant-wise quality control, variants with high missingness (>2%), low HWE P-value (<1e-6) and low minor allele count (MAC < 3) were excluded.

Pre-phasing

Prior to imputation, chip-genotyped samples were pre-phased with Eagle 2.3.5 (https://data.broadinstitute.org/alkesgroup/Eagle/) with the default parameters, except the number of conditioning haplotypes was set to 20,000.

Genotype imputation

Genotype imputation was done with the population-specific SISu v3 reference panel.

Variant call set was produced with GATK HaplotypeCaller algorithm by following GATK best-practices for variant calling.

Genotype-, sample- and variant-wise QC was applied in an iterative manner by using the Hail framework v0.1, and the resulting high-quality WGS data for 3,775 individuals were phased with Eagle 2.3.5 as described in the previous section.

Genotype imputation was carried out by using the population-specific SISu v3 imputation reference panel with Beagle 4.1 (version 08Jun17.d8b) as described in the following protocol: dx.doi.org/10.17504/protocols.io.nmndc5e.

Post-imputation quality control involved checking expected conformity of the imputation INFO-value distribution, MAF differences between the target dataset and the imputation reference panel and checking chromosomal continuity of the imputed genotype calls.

Optional: Post-imputation quality control also involved excluding variants imputed with imputation INFO<0.7.
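To make the chip-data thresholds quoted under Quality control concrete, the following sketch applies them to per-sample and per-variant summary metrics. It is a toy illustration under stated assumptions: the record layout (`missing_rate`, `het_rate`, `hwe_p`, `mac`) and the function names are hypothetical, the heterozygosity cut-off is interpreted as mean ± 4 SD over the samples supplied, and the gender and ancestry checks are not covered.

```python
from statistics import mean, stdev


def sample_qc(samples, max_missing=0.05, het_sd=4.0):
    """Return IDs of samples passing missingness and heterozygosity filters."""
    het = [s["het_rate"] for s in samples]
    mu, sd = mean(het), stdev(het)
    return [
        s["id"]
        for s in samples
        if s["missing_rate"] <= max_missing          # exclude missingness > 5%
        and abs(s["het_rate"] - mu) <= het_sd * sd   # exclude beyond +/- 4 SD
    ]


def variant_qc(variants, max_missing=0.02, min_hwe_p=1e-6, min_mac=3):
    """Return IDs of variants passing missingness, HWE and MAC filters."""
    return [
        v["id"]
        for v in variants
        if v["missing_rate"] <= max_missing  # exclude missingness > 2%
        and v["hwe_p"] >= min_hwe_p          # exclude HWE P-value < 1e-6
        and v["mac"] >= min_mac              # exclude MAC < 3
    ]


# Toy data: 29 typical samples plus one heterozygosity outlier.
samples = [{"id": f"s{i}", "missing_rate": 0.01, "het_rate": 0.30} for i in range(29)]
samples.append({"id": "outlier", "missing_rate": 0.01, "het_rate": 0.90})
samples[0]["missing_rate"] = 0.06  # fails the missingness filter

variants = [
    {"id": "v1", "missing_rate": 0.00, "hwe_p": 0.5, "mac": 120},
    {"id": "v2", "missing_rate": 0.03, "hwe_p": 0.5, "mac": 120},
    {"id": "v3", "missing_rate": 0.00, "hwe_p": 1e-9, "mac": 120},
    {"id": "v4", "missing_rate": 0.00, "hwe_p": 0.5, "mac": 2},
]

kept_samples = sample_qc(samples)
kept_variants = variant_qc(variants)
```

In practice these filters would be run with dedicated tools such as Plink on the full genotype data. The sketch only shows how the thresholds combine.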

    Association tests

    Null models

    For the null model calculation for each endpoint, we used age, sex, 10 PCs and genotyping batch as covariates.

    For calculating the genetic relationship matrix, we used 49,811 independent, common, well-imputed variants with a posterior genotyping probability >0.95 and missingness <0.05 (LD r2 < 0.1, MAF > 0.05, INFO > 0.95).

SAIGE options for the null computation:

    • LOCO = false

    • numMarkers = 30

    • traceCVcutoff = 0.0025

    • ratioCVcutoff = 0.001

    Association tests

We ran association tests against each of the 1,122 endpoints with SAIGE for each variant with a minimum allele count of 10 from the imputation pipeline (SAIGE option minMAC = 10). The alternative allele is always the effect allele.
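Because the effect allele is always the alternative allele, the `beta` and `sebeta` columns of the summary statistics can be turned into an odds ratio for the alternative allele. The sketch below assumes that `beta` is a log odds ratio from the logistic mixed model, that `sebeta` is its standard error, and that a normal approximation (z = 1.96) is acceptable for a 95% confidence interval. The function name `odds_ratio` is hypothetical.

```python
import math


def odds_ratio(beta, sebeta, z=1.96):
    """Return (OR, lower, upper) for the alternative allele.

    Assumes beta is a log odds ratio and sebeta its standard error.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * sebeta),
        math.exp(beta + z * sebeta),
    )


# Example values chosen for illustration only.
or_, lower, upper = odds_ratio(beta=0.2, sebeta=0.05)
```

A positive beta therefore corresponds to an odds ratio above 1, meaning the alternative allele is associated with higher odds of the endpoint.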

Software

The code we used is available at github.com/FINNGEN/SAIGE-IT/tree/master/SAIGE. The original SAIGE codebase is available at https://github.com/weizhouUMICH/SAIGE/.

    Workflows

We ran the analysis in Google Cloud using WDL and Cromwell. The WDL workflow metadata including SAIGE commands and their inputs are available at:

gs://finngen-production-library-green/R2/workflows

    SISu reference panel

SISu v3 consists of 3,775 high-coverage (30x) WGS Finnish individuals from six cohorts:

    1. METSIM (PIs Markku Laakso and Mike Boehnke)

    2. FINRISK (PI Pekka Jousilahti)

    3. Health2000 (PI Seppo Koskinen)

    4. Finnish Migraine Family Study (PI Aarno Palotie)

    5. Merck/Tienari samples (PI Pentti Tienari)

    6. MESTA samples (PI Jaana Suvisaari)

    High-coverage (25-30x) WGS data used to develop the SISu v3 reference panel were generated at the Broad Institute of MIT and Harvard and at the McDonnell Genome Institute at Washington University; and jointly processed at the Broad Institute.

    Data releases

    Timeline for releases:

| Release | Date release to partners | Date release to public | Total sample size |
| --- | --- | --- | --- |
| R2 | Q4 2018 (27th Nov) | Q1 2020 | 96,499 |
| R3 | Q2 2019 (13th May) | Q2 2020 | 135,638 |
| R4 | Q4 2019 (1st Oct) | Q4 2020 | 176,899 |
| R5 | Q2 2020 | Q2 2021 | |

Software used

  • Cromwell-29 and 31

  • Wdltool-0.14

  • Plink 1.9 and 2.0

  • BCFtools 1.5 and 1.7

  • Eagle 2.3.5

  • Beagle 4.1 (version 08Jun17.d8b)

  • R 3.4.1 (packages: data.table 1.10.4, sm 2.2-5.4)

Loss of function burden

We estimated the loss of function (LoF) burden of each gene on every endpoint.

First, we calculated per individual and gene whether any loss of function variant(s) was present, yielding an $n \times p$ matrix with 0 and 1 values ($n$ being the number of individuals and $p$ the number of genes).

Then we used the new summarised variables as input in the SAIGE GWAS, replacing the genotype matrix that was used in the regular GWAS.
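The construction of the $n \times p$ indicator matrix can be sketched as follows. This is a minimal illustration, not the FinnGen implementation: the input layout (one dictionary of alternative allele counts per individual, plus a variant-to-gene map restricted to LoF variants) and the function name `lof_burden_matrix` are assumptions.

```python
def lof_burden_matrix(genotypes, variant_gene, genes):
    """Build an n x p 0/1 matrix of LoF carrier status.

    genotypes:    list with one dict per individual, mapping
                  variant ID -> alternative allele count (0, 1 or 2).
    variant_gene: dict mapping each LoF variant ID -> gene name.
    genes:        list of gene names defining the column order.
    """
    gene_index = {gene: j for j, gene in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in genotypes]
    for i, calls in enumerate(genotypes):
        for variant, count in calls.items():
            gene = variant_gene.get(variant)
            # Mark the gene as hit if the individual carries at least
            # one alternative allele of any LoF variant in that gene.
            if gene in gene_index and count > 0:
                matrix[i][gene_index[gene]] = 1
    return matrix


# Hypothetical toy input: 3 individuals, 3 LoF variants, 2 genes.
genotypes = [{"v1": 1, "v2": 0}, {"v2": 2, "v3": 0}, {}]
variant_gene = {"v1": "GENE_A", "v2": "GENE_B", "v3": "GENE_B"}
burden = lof_burden_matrix(genotypes, variant_gene, ["GENE_A", "GENE_B"])
```

Each column of the resulting matrix then plays the role that a single variant's genotype vector plays in the regular GWAS.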

    Endpoints

The disease endpoints were defined using nationwide registries:

  • Drug purchase and Drug Reimbursement

  • Digital and Population Data Services Agency

  • Statistics Finland

  • Register of primary health care visits: AVOHILMO

  • Care Register for Health Care: HILMO

  • Finnish cancer registry

We harmonized over the International Classification of Diseases (ICD) revisions 8, 9 and 10, cancer-specific ICD-O-3, NOMESCO procedure codes, Finnish-specific Social Insurance Institute (KELA) drug reimbursement codes and ATC-codes.

These registries spanning decades were electronically linked to the cohort baseline data using the unique national personal identification numbers assigned to all Finnish citizens and residents.

A full list of FinnGen endpoints is available online for release 2.

The endpoints with fewer than 100 cases, near-duplicate endpoints, and developmental “helper” endpoints were excluded from the final PheWas (column “OMIT”).

Endpoints with N<150 are not released by THL (Finnish Institute for Health and Welfare).

Contact

For matters related to this documentation, click Edit on GitHub or send us an email to finngen-info@helsinki.fi.

Please consider visiting the study website: https://www.finngen.fi/en and follow FinnGen on twitter: @FinnGen_FI

If you want to host FinnGen summary statistics on your website, please get in contact with us at: humgen-servicedesk@helsinki.fi.

GWAS

We used the SAIGE software for running the R2 GWAS.

SAIGE is a mixed model logistic regression R/C++ package, able to account for related samples.

We analyzed:

  • 1,122 endpoints

  • 96,499 samples

  • 17,054,975 variants

We included the following covariates in the model: sex, age, 10 PCs, genotyping batch.

Quality control

This is a description of the quality control procedures applied before running the GWAS.

In summary, we removed 4,095 samples who were either of non-Finnish ancestry or twins/duplicates. Finnish ancestry was assessed with a combination of PCA and a Bayesian method for outlier detection.

Sample QC

Our data set initially consisted of 102,739 samples, of which we kept 100,355 after removing duplicates. Next, we proceeded to exclude samples of non-Finnish ancestry using a PCA approach.
    PCA

After filtering for high-quality (HQ) variants (36,073 variants), we merged the data set with the thousand genomes data (EUR individuals only). We then performed a PCA on the merged data set and used a Bayesian approach to determine outliers (see below). This process allowed us to identify samples from outside the Central/Northern European region (1,023 samples).

Western European and British samples are still present at this stage, but are too few to drive a signal in the PCA. Thus we used a different approach: we ran a PCA on the 99,333 samples left and projected the 98 Finnish (FIN) and 89 non-Finnish European (EUR) samples from the thousand genomes project who survived round one onto the same space. Then, for each FinnGen sample, we calculated its Mahalanobis distance to the FIN and EUR centroids. The distance is mapped to a probability with a $\chi^2$ distribution with 3 degrees of freedom. We define as Finns those samples for whom the relative probability of being Finnish vs European is > 95%. This left us with 98,644 samples.
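The centroid-based step can be sketched in a few lines of standard-library Python. Several details below are assumptions made for illustration rather than facts from the FinnGen pipeline: each reference group's covariance is estimated from its own projected reference samples, the squared Mahalanobis distance is mapped to a probability with the $\chi^2$ survival function (3 degrees of freedom, matching the first three principal components), and "relative probability" is read as $p_{FIN} / (p_{FIN} + p_{EUR})$. All function names and the toy coordinates are hypothetical.

```python
import math


def chi2_sf_3df(x):
    """Survival function of a chi-square distribution with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def mean_and_cov(points):
    """Centroid and sample covariance matrix of a list of equal-length tuples."""
    n, k = len(points), len(points[0])
    mu = [sum(p[j] for p in points) / n for j in range(k)]
    cov = [[sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in points) / (n - 1)
            for b in range(k)] for a in range(k)]
    return mu, cov


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance of x from a centroid mu with covariance cov."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return sum(d * s for d, s in zip(diff, solve(cov, diff)))


def prob_finnish(sample, fin_ref, eur_ref):
    """Relative probability of FIN vs EUR membership (assumed definition)."""
    p_fin = chi2_sf_3df(mahalanobis_sq(sample, *mean_and_cov(fin_ref)))
    p_eur = chi2_sf_3df(mahalanobis_sq(sample, *mean_and_cov(eur_ref)))
    total = p_fin + p_eur
    return p_fin / total if total > 0 else float("nan")


# Toy reference clouds in the space of the first three principal components.
fin_ref = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
eur_ref = [(x + 5, y + 5, z + 5) for x, y, z in fin_ref]

is_finnish = prob_finnish((0.1, 0.0, 0.0), fin_ref, eur_ref) > 0.95
```

With real data the reference clouds would be the projected thousand genomes FIN and EUR samples, and the same 95% threshold would be applied to every FinnGen sample.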

Missing Data

Of the 98,644 non-duplicate PCA inliers, we removed 2,145 individuals who didn’t have phenotype or age data. Thus the final number of analyzed individuals was 96,499.

Further info

Bayesian outlier detection

Code for the method can be found here: github.com/FINNGEN/pca_outlier_detection.

Documentation from the original developers of the algorithm can be found here: http://www.well.ox.ac.uk/~spencer/Aberrant/aberrant-manual.pdf.

    Centroid based outlier detection

    The Figure below shows how the centroid based outlier detection works by plotting the distribution of the first 3 components of the PCA. We can see that the FinnGen samples labelled as Western European (in blue) are extremely close to the Western European centroid in the first two components.

    Principal components 1-3, with FinnGen's Finnish individuals shown in red, FinnGen outliers in blue, and thousand genomes Finnish samples labelled in purple, Western European in green.

Purple and green dots represent Finnish and Western European (EUR) samples, respectively, from the thousand genomes data set. The blue dots are FinnGen samples that were found to be more likely to belong to the EUR group than to the Finnish one. Red dots, on the other hand, are labelled as belonging to the Finnish centroid.