R5

Data download

To download FinnGen summary statistics, fill in the online form at this link. You will then receive an email with detailed instructions for downloading the data.

Release 5 contains

  • GWAS summary association statistics

  • Fine-mapping results

  • LD estimation from SISu v3

  • Variant annotation

Using FinnGen data for publications

Please remember to acknowledge the FinnGen study when using these results in publications.

You can use the following text:

We want to acknowledge the participants and investigators of the FinnGen study.

Manifest

The manifest file with links to all the downloadable summary statistics is available at:

https://storage.googleapis.com/finngen-public-data-r5/summary_stats/R5_manifest.tsv

Introduction

FinnGen is a public-private partnership project combining genotype data from Finnish biobanks and digital health record data from Finnish health registries. FinnGen provides a unique opportunity to study genetic variation in relation to disease trajectories in an isolated population.

FinnGen is a growing project, aiming at 500,000 individuals in 2023.

FinnGen results are subject to a one-year embargo and, after that, are available to the wider scientific community via the PheWeb browser or through data download.

Participating biobanks/cohorts

In addition to the biobanks mentioned in the previous releases, the following biobanks and cohorts are part of the R5 release:

  • Auria Biobank

  • Biobank Borealis of Northern Finland

  • Biobank of Eastern Finland

  • Central Finland Biobank

  • Finnish Red Cross Blood Service Biobank

  • Finnish Clinical Biobank Tampere

  • Helsinki Biobank

  • Terveystalo Biobank

  • THL Biobank

How to cite

Please use the following description when referring to our project:

The FinnGen study is a large-scale genomics initiative that has analyzed over 500,000 Finnish biobank samples and correlated genetic variation with health data to understand disease mechanisms and predispositions. The project is a collaboration between research organisations and biobanks within Finland and international industry partners.

When using these results in publications, please remember to:

  1. Acknowledge the FinnGen study. You can use the following text:

“We want to acknowledge the participants and investigators of the FinnGen study”

  2. Cite our latest publication:

Kurki, M.I., Karjalainen, J., Palta, P. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613, 508–518 (2023). https://doi.org/10.1038/s41586-022-05473-8

Furthermore, if possible, include "FinnGen" as a keyword for your publication.

If you want to cite this website, use the following citation:

@online{finngen,
  author = {FinnGen},
  title = {{FinnGen} Documentation of R5 release},
  year = 2021,
  url = {https://finngen.gitbook.io/documentation/},
  urldate = {YYYY-MM-DD}
}

Data description

File naming pattern and file structure

Summary association statistics

GWAS summary statistics (tab-delimited, bgzipped, genome build 38, tabix index files included) are named {endpoint}.gz. For example, endpoint I9_CHD has I9_CHD.gz and I9_CHD.gz.tbi.

To learn more about the methods used, see section GWAS.

The {endpoint}.gz files have the following structure:

Column name       Description
#chrom            chromosome on build GRCh38 (1-22, X)
pos               position in base pairs on build GRCh38
ref               reference allele
alt               alternative allele (effect allele)
rsids             variant identifier
nearest_genes     nearest gene name from variant
pval              p-value from SAIGE
beta              effect size estimated with SAIGE for the alternative allele
sebeta            standard error of the effect size estimated with SAIGE
maf               alternative (effect) allele frequency
maf_cases         alternative (effect) allele frequency among cases
maf_controls      alternative (effect) allele frequency among controls
n_hom_cases       number of homozygous cases*
n_het_cases       number of heterozygous cases*
n_hom_controls    number of homozygous controls*
n_het_controls    number of heterozygous controls*

* The results are based on imputed genotype dosages and were produced using SAIGE, so these counts are not integers and may contain decimals.

Fine-mapping results

Two fine-mapping methods were used:

  • SuSiE

  • FINEMAP

Fine-mapping results are tab-delimited and bgzipped.

SuSiE results have the following filename pattern:

  • {endpoint}.SUSIE.cred.bgz

  • {endpoint}.SUSIE.snp.bgz

FINEMAP results have the following filename pattern:

  • {endpoint}.FINEMAP.region.bgz

  • {endpoint}.FINEMAP.snp.bgz

  • {endpoint}.FINEMAP.config.bgz

To learn more about the methods used, see section Fine-mapping.

SuSiE output files {endpoint}.SUSIE.snp.bgz have the following structure:

Column name    Description
trait          endpoint name
region         chr:start-end
v              variant identifier
rsid           rs variant identifier
chromosome     chromosome on build GRCh38 (1-22, X)
position       position in base pairs on build GRCh38
allele1        reference allele
allele2        alternative allele (effect allele)
maf            minor allele frequency
beta           effect size GWAS
se             standard error GWAS
p              p-value GWAS
mean           posterior expectation of true effect size
sd             posterior standard deviation of true effect size
prob           posterior probability of association
cs             identifier of 95% credible set (-1 = variant is not part of credible set)

LD estimation

Linkage disequilibrium (LD) was estimated from SISu v3 for each chromosome. Use LDstore (v1.1) to work with the bcor files, for example:

ldstore --bcor FG_LD_chr1.bcor --incl-range 20000000-50000000 --table output_file_name.table

To learn more about the methods used, see section LD estimation.

Variant annotation

The variant annotation has measures (HWE, INFO, ...) listed per batch.
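As a minimal sketch of how the summary statistics layout described in this section could be read, the Python snippet below parses a tab-delimited file using the column names listed for {endpoint}.gz and keeps rows below a p-value threshold. The function names and the 5e-8 threshold are illustrative assumptions rather than part of the FinnGen pipeline, and the example rows are made up.

```python
import csv
import gzip
import io


def iter_significant(handle, p_threshold=5e-8):
    """Yield (chrom, pos, ref, alt, pval, beta) for rows with pval below the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        pval = float(row["pval"])
        if pval < p_threshold:
            yield (row["#chrom"], int(row["pos"]), row["ref"], row["alt"],
                   pval, float(row["beta"]))


def read_significant(path, p_threshold=5e-8):
    """Read a bgzipped summary statistics file from disk."""
    # bgzip output is gzip-compatible, so the standard gzip module can read it.
    with gzip.open(path, "rt", newline="") as handle:
        return list(iter_significant(handle, p_threshold))


# Tiny in-memory example with made-up values (not real FinnGen results).
example = (
    "#chrom\tpos\tref\talt\trsids\tnearest_genes\tpval\tbeta\tsebeta\tmaf\n"
    "1\t1000\tA\tG\trs1\tGENE1\t1e-9\t0.25\t0.04\t0.12\n"
    "1\t2000\tC\tT\trs2\tGENE2\t0.3\t-0.01\t0.05\t0.40\n"
)
hits = list(iter_significant(io.StringIO(example)))
print(hits)
```

For region queries on the full files, the tabix index shipped alongside each {endpoint}.gz is likely a better fit than a linear scan like this one.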

Data releases

Timeline for releases:

Release    Date release to partners    Date release to public    Total sample size [1]
R2         Q4 2018 (Nov)               Q1 2020                   96,499
R3         Q2 2019 (May)               Q2 2020                   135,638
R4         Q4 2019 (Oct)               Q4 2020                   176,899
R5         Q2 2020 (March)             Q2 2021                   218,792
R6         Q3 2020                     ~Q3 2021                  ~260,000
R7         Q1 2021                     ~Q1 2022                  ~300,000
R8         Q3 2021                     ~Q3 2022                  ~340,000
R9         Q1 2022                     ~Q1 2023                  ~375,000
R10        Q3 2022                     ~Q3 2023                  ~410,000
R11        Q1 2023                     ~Q1 2024                  ~445,000
R12        Q3 2023                     ~Q3 2024                  ~480,000
R13        Q1 2024                     ~Q1 2025                  ~500,000

[1] samples used for PheWAS.


SISu reference panel

SISu v3 consists of 3,775 high-coverage (30x) WGS Finnish individuals from six cohorts:

  1. METSIM (PIs Markku Laakso and Mike Boehnke)

  2. FINRISK (PI Pekka Jousilahti)

  3. Health2000 (PI Seppo Koskinen)

  4. Finnish Migraine Family Study (PI Aarno Palotie)

  5. Merck/Tienari samples (PI Pentti Tienari)

  6. MESTA samples (PI Jaana Suvisaari)

High-coverage (25-30x) WGS data used to develop the SISu v3 reference panel were generated at the Broad Institute of MIT and Harvard and at the McDonnell Genome Institute at Washington University, and were jointly processed at the Broad Institute.

Software used

  • Cromwell-29 and 31

  • Wdltool-0.14

  • Plink 1.9 and 2.0

  • BCFtools 1.7 and 1.9

  • Eagle 2.3.5

  • Beagle 4.1 (version 08Jun17.d8b)

  • R 3.4.1 (packages: data.table 1.10.4, sm 2.2-5.4)

Genotypes

FinnGen individuals were genotyped with Illumina and Affymetrix chip arrays (Illumina Inc., San Diego, and Thermo Fisher Scientific, Santa Clara, CA, USA).

Chip genotype data were imputed using the population-specific SISu v3 imputation reference panel of 3,775 whole genomes.

Merged imputed genotype data is composed of 63 data sets that include samples from multiple cohorts.

  • Total number of individuals: 224,737

  • Total number of variants (merged set): 16,962,023

  • Reference assembly: GRCh38/hg38

LD estimation

The BCOR files were created using LDstore from the Finnish SISu panel v3.

The panel has been divided per chromosome. For example, to use the LD information in the first chromosome, FG_LD_chr1.bcor would be the file to use.

Settings used

  • number of samples: 3775

  • window size: 1500 kb

  • accuracy: low

  • number of threads: 96

  • LD threshold to include correlations: 0.05

Example usage

LDstore v1.1 can be downloaded via:

wget http://www.christianbenner.com/ldstore_v1.1_x86_64.tgz

An example that extracts the variant range 20 Mb - 50 Mb from chromosome 7 is as follows:

ldstore --bcor FG_LD_chr7.bcor --incl-range 20000000-50000000 --table output_file_name.table

Note

These LD estimate files are not recommended for e.g. fine-mapping, since many fine-mapping methods (e.g. SuSiE) require in-sample LD information for good results.

Endpoints

Registries

The disease endpoints were defined using nationwide registries:

  • Drug purchase and Drug Reimbursement

  • Digital and Population Data Services Agency

  • Statistics Finland

  • Register of primary health care visits: AVOHILMO

  • Care Register for Health Care: HILMO

  • Finnish cancer registry

We harmonized over the International Classification of Diseases (ICD) revisions 8, 9 and 10, cancer-specific ICD-O-3, (NOMESCO) procedure codes, Finnish-specific Social Insurance Institute (KELA) drug reimbursement codes and ATC codes.

These registries, spanning decades, were electronically linked to the cohort baseline data using the unique national personal identification numbers assigned to all Finnish citizens and residents.

A full list of FinnGen endpoints for release 5 is available online.

Excluded endpoints

Endpoints with fewer than 80 cases and developmental “helper” endpoints were excluded from the final PheWAS (“OMIT” tag in the endpoint definition file).

Endpoints with fewer than 150 cases are not released by THL (Finnish Institute for Health and Welfare).

Risteys

Risteys (risteys.finngen.fi; "risteys" means "intersection" in Finnish) allows browsing of the FinnGen data at the phenotype level, including endpoint definitions, statistics about the number of individuals, gender distribution, and longitudinal relationships.

    Genotype imputation

    Genotype imputation was done with the population-specific SISu v3 reference panel.

    The variant call set was produced with the GATK HaplotypeCaller algorithm, following GATK best practices for variant calling.

    Genotype-, sample- and variant-wise QC was applied iteratively using the Hail framework v0.1, and the resulting high-quality WGS data for 3,775 individuals were phased with Eagle 2.3.5 as described in the previous section.

    Genotype imputation was carried out using the population-specific SISu v3 imputation reference panel with Beagle 4.1 (version 08Jun17.d8b), as described in the following protocol: dx.doi.org/10.17504/protocols.io.nmndc5e.

    Post-imputation quality control involved checking the expected conformity of the imputation INFO-value distribution, checking MAF differences between the target dataset and the imputation reference panel, and checking chromosomal continuity of the imputed genotype calls.

    Contact

    For matters related to this documentation, send us an email at finngen-info@helsinki.fi.

    For the latest updates on the project, as well as additional background information, please consider visiting the study website (https://www.finngen.fi/en) or following FinnGen on Twitter (@FinnGen_FI).

    If you want to host FinnGen summary statistics on your website, please get in contact with us at: humgen-servicedesk@helsinki.fi.

    Genotype data

    Chip genotype data processing and QC

    Samples were genotyped with Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix arrays (Thermo Fisher Scientific, Santa Clara, CA, USA).

    Genotype calls were made with the GenCall and zCall algorithms for Illumina data and the AxiomGT1 algorithm for Affymetrix data.

    Chip genotyping data produced with previous chip platforms and reference genome builds were lifted over to build version 38 (GRCh38/hg38) following the protocol described here: dx.doi.org/10.17504/protocols.io.nqtddwn.

    Quality control

    In sample-wise quality control, individuals with ambiguous gender, high genotype missingness (>5%), excess heterozygosity (±4 SD) and non-Finnish ancestry were excluded. In variant-wise quality control, variants with high missingness (>2%), low HWE P-value (<1e-6) or low minor allele count (MAC < 3) were excluded.
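The variant-wise thresholds above can be sketched as a simple filter. This is an illustrative Python sketch only: the function name, argument names and example values are assumptions, and the actual QC was not necessarily implemented this way.

```python
def passes_variant_qc(missingness, hwe_p, mac,
                      max_missingness=0.02, min_hwe_p=1e-6, min_mac=3):
    """Return True if a variant survives the three variant-wise filters.

    missingness: fraction of missing genotype calls (0..1)
    hwe_p:       Hardy-Weinberg equilibrium test p-value
    mac:         minor allele count
    """
    if missingness > max_missingness:   # high missingness (>2%)
        return False
    if hwe_p < min_hwe_p:               # low HWE p-value (<1e-6)
        return False
    if mac < min_mac:                   # minor allele count below 3
        return False
    return True


# Made-up variants for illustration: (name, missingness, HWE p-value, MAC).
variants = [
    ("var_ok", 0.001, 0.5, 120),
    ("var_missing", 0.05, 0.5, 120),
    ("var_hwe", 0.001, 1e-9, 120),
    ("var_rare", 0.001, 0.5, 2),
]
kept = [name for name, miss, hwe, mac in variants if passes_variant_qc(miss, hwe, mac)]
print(kept)
```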

    Pre-phasing

    Prior to imputation, chip-genotyped samples were pre-phased with Eagle 2.3.5 using the default parameters, except that the number of conditioning haplotypes was set to 20,000.

    Association tests

    Endpoint

    We included 2,803 endpoints from the phenotype/registry teams’ pipeline in the analysis. Endpoints with fewer than 100 cases among the 218,792 samples were excluded.

    Null models

    For the null model computation for each endpoint, we used age, sex, 10 PCs and genotyping batch as covariates. Each genotyping batch was included as a covariate for an endpoint only if there were at least 10 cases and 10 controls in that batch, to avoid convergence issues. One genotyping batch needs to be excluded from the covariates so that they are not saturated. We excluded Thermo Fisher batch 16, as it was not enriched for any particular endpoints.
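The batch-inclusion rule above can be illustrated with a short Python sketch. The function name, the input format (one `(batch, is_case)` pair per sample) and the batch labels are assumptions made for the example; only the "at least 10 cases and 10 controls" rule and the dropped reference batch come from the text.

```python
from collections import Counter


def batch_covariates(samples, reference_batch, min_count=10):
    """Return the sorted list of batches to include as covariates for one endpoint.

    samples:         iterable of (batch, is_case) pairs, one per sample
    reference_batch: batch left out so the batch indicators are not saturated
    min_count:       minimum number of cases and of controls required per batch
    """
    cases = Counter()
    controls = Counter()
    for batch, is_case in samples:
        if is_case:
            cases[batch] += 1
        else:
            controls[batch] += 1
    # Counter returns 0 for batches it has not seen, so missing cases/controls fail the check.
    return sorted(
        b for b in set(cases) | set(controls)
        if b != reference_batch and cases[b] >= min_count and controls[b] >= min_count
    )


# Made-up sample counts: b1 has enough cases and controls, b2 has too few cases.
samples = (
    [("b1", True)] * 12 + [("b1", False)] * 20
    + [("b2", True)] * 3 + [("b2", False)] * 50
    + [("ref", True)] * 30 + [("ref", False)] * 30
)
print(batch_covariates(samples, reference_batch="ref"))
```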

    For calculating the genetic relationship matrix (GRM), we used the genotype dataset where genotypes with GP < 0.95 were set to missing. Only variants imputed with an INFO score > 0.95 in all batches were used. Variants with > 3% missing genotypes were excluded, as were variants with MAF < 1%. The remaining variants were LD-pruned with a 1 Mb window and an r2 threshold of 0.1. This resulted in a set of 58,702 well-imputed, non-rare variants for GRM calculation.

    SAIGE options for the null computation:

    • LOCO = false

    • numMarkers = 30

    • traceCVcutoff = 0.0025

    • ratioCVcutoff = 0.001

    Association tests

    We ran association tests against each of the 2,803 endpoints with SAIGE for each variant with a minimum allele count of 5 from the imputation pipeline (SAIGE option minMAC = 5). We filtered the results to include variants with an imputation INFO > 0.6.

    GWAS

    We used the SAIGE software for running the R5 GWAS, as we did in previous releases. SAIGE is an R/C++ package for mixed-model logistic regression. We used code version 0.36.3.2: https://github.com/weizhouUMICH/SAIGE/tree/finngen_r5_jk

    We made two modifications to the SAIGE 0.36.3.2 codebase (neither modification affects the method):

    • Null model .rda objects were trimmed to reduce RAM consumption

    • Het and alt hom counts in cases and controls are included in the output

    We analyzed:

    • 2,803 endpoints

    • 218,792 samples

    • 16,962,023 variants

    We included the following covariates in the model: sex, age, 10 PCs, genotyping batch.

    Fine-mapping

    We fine-mapped each region from the GWASs where a variant reached p < 1e-6. Each region was fine-mapped with SuSiE 0.8.1.0545 and FINEMAP v1.4_0510.

    We used a 3-megabase window (±1.5 Mb) around each lead variant and merged overlapping regions into one. After merging, a handful of regions became too large to handle computationally with SuSiE. For such regions, we only merged two overlapping pieces when the LD between the two lead variants was r2 > 0.2. When the LD was lower than that, we made the two overlapping regions non-overlapping by splitting the overlap in half.
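The basic window-and-merge step can be sketched as follows. This is a simplified Python illustration under stated assumptions: it handles a single chromosome, clamps window starts at 0, and does not implement the LD-based splitting used for oversized regions. The function name is hypothetical; the actual implementation is in the fine-mapping pipeline repository.

```python
def build_regions(lead_positions, half_window=1_500_000):
    """Merge +-half_window windows around lead variants on one chromosome.

    lead_positions: base-pair positions of lead variants
    Returns a list of (start, end) tuples, sorted and non-overlapping.
    """
    regions = []
    for pos in sorted(lead_positions):
        start, end = max(0, pos - half_window), pos + half_window
        if regions and start <= regions[-1][1]:
            # Window overlaps the previous region: extend that region.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]


# Made-up lead variant positions: the first two windows overlap and are merged.
print(build_regions([2_000_000, 3_000_000, 10_000_000]))
```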

    The codebase, workflow, and inputs we used for R5 fine-mapping are here: https://github.com/FINNGEN/finemapping-pipeline/releases/tag/r5

    Integration to PheWeb

    The "Credible Sets" table on a phenotype page in the PheWeb browser shows the SuSiE fine-mapped credible sets of that phenotype. The variant shown per credible set is the variant with the maximum PIP (posterior inclusion probability) in that credible set. In addition to the causal variants, variants that were in sufficient LD (Pearson r^2 > 0.05), had a small enough p-value (pval < 0.01), and were close enough to the lead variant (distance to lead variant < 1.5 megabases) were clumped together with the credible set. Variants have been compared against the GWAS Catalog and annotated. The LD grouping, annotation and GWAS Catalog comparison were done using the autoreporting pipeline.
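The three clumping criteria above can be written as a single predicate. The Python sketch below is illustrative only: the function and parameter names are assumptions, and the autoreporting pipeline may apply the criteria differently.

```python
def in_ld_group(r2, pval, distance_bp,
                min_r2=0.05, max_pval=0.01, max_distance_bp=1_500_000):
    """Return True if a variant would be grouped with a credible set's lead variant.

    r2:          squared Pearson correlation with the lead variant
    pval:        association p-value of the variant
    distance_bp: signed or unsigned distance to the lead variant in base pairs
    """
    return r2 > min_r2 and pval < max_pval and abs(distance_bp) < max_distance_bp


# Made-up candidates: (name, r2, pval, distance to lead variant in bp).
candidates = [
    ("near_ld", 0.40, 1e-5, 200_000),
    ("weak_ld", 0.01, 1e-5, 200_000),
    ("not_assoc", 0.40, 0.20, 200_000),
    ("too_far", 0.40, 1e-5, -2_000_000),
]
grouped = [name for name, r2, p, dist in candidates if in_ld_group(r2, p, dist)]
print(grouped)
```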

    The columns of the table are explained below:

    Column name                          Explanation
    top PIP variant                      variant with the largest PIP in the credible set. Click the arrow to the left of the variant to show the credible set variants
    CS quality                           This column shows whether the credible set is well-formed. A 'true' value means that the credible set is likely trustworthy, and a 'false' value means that the credible set is likely not trustworthy. Undefined for R5
    chromosome                           The chromosome in which the credible set lies
    p-value                              p-value of the top PIP variant
    effect size (beta)                   effect size of the top PIP variant
    Finnish Enrichment                   Finnish enrichment of the top PIP variant
    Alternate allele frequency           alternate allele frequency of the top PIP variant
    Lead Variant Gene                    A probable gene of the top PIP variant
    # coding in cs                       number of coding variants in the credible set. Hover over the number to see the variant, the consequence, and the correlation (Pearson r squared) to the lead variant
    # credible variants                  number of variants in the credible set
    Credible set bayes factor (log10)    The Bayes factor related to the credible set
    CS matching Traits                   Number of matches found in GWAS Catalog for the credible set variants. Hover over the number to see the trait, as well as the associated variant's LD (Pearson r squared) to the lead variant
    LD Partner Traits                    Number of matches found in GWAS Catalog for the group of credible variants and variants in LD with the top PIP variant. Hover over the number to see the trait, as well as the associated variant's LD (Pearson r squared) to the lead variant
    UKBB                                 Matching Pan-UKBB trait association

    Sample QC and PCA

    This is a description of the quality control procedures applied before running the GWAS.

    PCA

    The PCA for population structure has been run in the following way:

    Variant filtering and LD pruning

    The following filters were applied:

    • Exclusion of chromosome 23

    • Exclusion of variants with info score < 0.95

    • Exclusion of variants with missingness > 0.01 (based on the GP; see conversion)

    • Exclusion of variants with MAF < 0.05

    • LD pruning with window 500kb, step 50kb, r^2 filter of 0.1

    This filtering step produced 41,678 variants, which were used for the rest of the analysis.

    PCA outlier detection

    FinnGen data was then merged with the 1000 Genomes Project (1kgp) data, using the variants mentioned above. A round of PCA was performed and a Bayesian algorithm was used to spot outliers. This process removed 5,520 outliers, of which 3,138 were FinnGen samples. The figure below shows scatter plots of the first 3 PCs. Outliers, in brown, are separated from the yellow FinnGen cluster.

    While the method automatically flagged the 1kgp samples with non-European and southern European ancestries as outliers, it did not exclude some samples of Western European origin. Because the signal from these samples was too small for a second round to be performed without detecting substructures of the Finnish population, another approach was used. The FinnGen samples that survived the first round were used to compute another PCA. The EUR and FIN 1kgp samples were then projected onto the space spanned by the first 3 PCs. Then, the centroid of each cluster was calculated and used to compute the squared Mahalanobis distance of each FinnGen sample to each centroid. Because the squared Mahalanobis distance is a sum of squared variables with unit variance, it can be treated as a sum of 3 independent squared variables. This allowed the squared distance to be mapped to a probability (chi-squared with 3 degrees of freedom). Therefore, for each cluster, a probability of membership was computed. A threshold of 0.95 was then used to exclude FinnGen samples whose relative chance of belonging to the Finnish cluster was below that level. This method produced another 538 outliers. The figure below shows the first three principal components.

    FIN 1kgp samples are in purple, while EUR 1kgp samples are in blue. Samples in green are FinnGen samples flagged as non-Finnish, while red ones are considered Finnish.
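The mapping from squared Mahalanobis distance to a probability can be sketched in Python using the closed-form chi-squared distribution with 3 degrees of freedom. Two points here are assumptions rather than statements from the text: the per-cluster probability is taken as the upper-tail probability of the squared distance, and the "relative chance" is taken as the Finnish probability divided by the sum over both clusters. The function names are hypothetical.

```python
import math


def chi2_sf_3dof(x):
    """Upper-tail probability P(X > x) for a chi-squared variable with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    # Closed-form CDF for 3 degrees of freedom.
    cdf = math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    return 1.0 - cdf


def relative_finnish_probability(d2_fin, d2_eur):
    """Relative chance of belonging to the FIN cluster (assumed normalisation).

    d2_fin, d2_eur: squared Mahalanobis distances to the FIN and EUR centroids.
    """
    p_fin = chi2_sf_3dof(d2_fin)
    p_eur = chi2_sf_3dof(d2_eur)
    total = p_fin + p_eur
    return p_fin / total if total > 0 else 0.0


# Made-up distances: a sample close to the FIN centroid and far from the EUR centroid.
rel = relative_finnish_probability(1.0, 30.0)
print(rel, rel >= 0.95)
```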

    Kinship

    In the next step, all pairs of FinnGen samples related up to the second degree were returned. The figure shows the distribution of kinship values.

    Then, the previously defined “non Finnish” samples were excluded and 2 algorithms were used to return a unique subset of unrelated samples:

    • one called greedy, which repeatedly removes the highest-degree node from the network of relations until no links are left in the network.

    • one called native, based on a native implementation in Python’s networkx package, run on each subgraph of the network.

    The larger of the two independent sets was kept, and the remaining samples were flagged as “outliers” for the final PCA.
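The greedy strategy described above can be sketched in plain Python. This is a minimal illustration, not the FinnGen implementation: the function name, the input format and the alphabetical tie-breaking rule are assumptions.

```python
def greedy_unrelated(samples, related_pairs):
    """Repeatedly drop the sample with the most relatives until no related pairs remain.

    samples:       iterable of sample identifiers
    related_pairs: iterable of (a, b) pairs of related samples
    Returns (kept, removed): the sorted unrelated subset and the removal order.
    """
    neighbours = {s: set() for s in samples}
    for a, b in related_pairs:
        if a == b:
            continue  # ignore self-pairs
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    removed = []
    while neighbours:
        # Highest-degree node; ties are broken by identifier to keep the result deterministic.
        worst = max(neighbours, key=lambda s: (len(neighbours[s]), s))
        if not neighbours[worst]:
            break  # no related pairs remain
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.append(worst)
    return sorted(neighbours), removed


# Made-up relatedness network: A is related to B, C and D; D is also related to E.
kept, removed = greedy_unrelated(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")],
)
print(kept, removed)
```

A greedy removal like this gives an independent set but not necessarily the largest one, which is consistent with the text comparing it against a second algorithm and keeping the larger result.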

    Then, the subset of outliers who also belong to the set of duplicates/twins was identified.

    Final PCA

    For the final step, the FinnGen samples were separated into three groups:

    • 156,977 inliers: unrelated samples with Finnish ancestry.

    • 61,980 outliers: non-duplicate samples with Finnish ancestry that are related to the inliers.

    • 5,780 rejected samples: samples that are either of non-Finnish ancestry or twins/duplicates of other samples.

    Finally, the PCA was calculated for the inliers, and the outliers were then projected onto the same space, which allowed covariates to be calculated for a total of 218,957 samples.

    Sample filtering based on phenotype data

    Of the 218,957 non-duplicate population inlier samples from PCA, we excluded 154 samples from analysis because of missing minimum phenotype data, and 11 samples because of a mismatch between imputed sex and sex in registry data. A total of 218,792 samples were used for the core analysis.

    Further info

    Bayesian outlier detection

    Documentation from the original developers of the algorithm can be found here: http://www.well.ox.ac.uk/~spencer/Aberrant/aberrant-manu