
R11

Colocalization

Colocalizations in FinnGen

Our colocalization approach uses the probabilistic model for integrating GWAS and eQTL data presented in eCAVIAR (Hormozdiari et al. 2016). Compared to eCAVIAR, we use SuSiE (Wang et al. 2019) to fine-map our inputs and provide an additional colocalization metric (CLPA).

Our goal is to extract a list of genomic regions that show colocalization between two phenotypes p1 and p2. Further, we assume that the summary statistics of p1 and p2 have been fine-mapped. The fine-mapping output for each phenotype contains three columns: the variant identifier (VAR), posterior inclusion probability (PIP), and the credible set (CS) identifier.

CLPP

The Causal Posterior Probability (CLPP) is computed between two credible sets cs1 and cs2, with cs1 coming from a given phenotype p1 and cs2 coming from phenotype p2. CLPP is defined as follows: For vectors x and y, containing the PIP for variants in cs1 and cs2, respectively, CLPP is calculated by

This CLPP calculation is similar to equation 8 in Hormozdiari et al. 2016.

CLPP is dependent on the credible set size. By definition, any credible set size > 1 will yield a CLPP < 1.
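As a minimal sketch of the calculation, assume that CLPP is the sum, over variants present in both credible sets, of the product of the two PIPs. This is our reading of the eCAVIAR-style definition referenced above, not a confirmed formula. The function name, the variant identifiers and the PIP values are hypothetical.

```python
def clpp(cs1: dict[str, float], cs2: dict[str, float]) -> float:
    """Assumed CLPP: sum of PIP products over variants shared by both credible sets."""
    shared = cs1.keys() & cs2.keys()
    return sum(cs1[v] * cs2[v] for v in shared)


# Hypothetical credible sets mapping variant identifier (VAR) to PIP.
cs1 = {"chr1_1000_A_G": 0.6, "chr1_2000_C_T": 0.4}
cs2 = {"chr1_1000_A_G": 0.5, "chr1_2000_C_T": 0.3, "chr1_3000_G_A": 0.2}

# 0.6 * 0.5 + 0.4 * 0.3 = 0.42
print(round(clpp(cs1, cs2), 3))
```

Under this assumption, variants that appear in only one credible set contribute nothing, so spreading the PIP over more variants lowers the score.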

CLPA

We derived another colocalization metric called causal posterior agreement (CLPA) that is independent of credible set size.

The picture below shows how colocalizations are defined.

Example Comparison

This rough example shows why we mostly use CLPA, since it is independent of credible set size.
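The effect of credible set size can be illustrated with a small sketch. The text does not spell out either formula, so two assumptions are made here: CLPP is the sum of PIP products over shared variants, and CLPA is the sum of the per-variant minimum PIP over shared variants. The helper names and the uniform PIP values are hypothetical.

```python
def clpp(cs1: dict[str, float], cs2: dict[str, float]) -> float:
    """Assumed CLPP: sum of PIP products over shared variants."""
    return sum(cs1[v] * cs2[v] for v in cs1.keys() & cs2.keys())


def clpa(cs1: dict[str, float], cs2: dict[str, float]) -> float:
    """Assumed CLPA: sum of the smaller of the two PIPs over shared variants."""
    return sum(min(cs1[v], cs2[v]) for v in cs1.keys() & cs2.keys())


def uniform_cs(n: int) -> dict[str, float]:
    """Hypothetical credible set of n variants that share the PIP equally."""
    return {f"v{i}": 1.0 / n for i in range(n)}


# Two identical credible sets of growing size: CLPP shrinks, CLPA stays at 1.
for n in (1, 2, 10):
    cs = uniform_cs(n)
    print(n, round(clpp(cs, cs), 3), round(clpa(cs, cs), 3))
```

With identical credible sets the assumed CLPP falls as 1/n while the assumed CLPA stays at 1, which is the behaviour the comparison is meant to convey.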

Data

The colocalization is performed between FinnGen endpoints as well as between FinnGen endpoints and various QTL resources, as shown in the image below.

These resources are listed below:

FinnGen resources

The SuSiE finemapping results for the release were used as the FinnGen data.

Expression QTL datasets

  • GTEx v8: SuSiE fine-mapping, 49 tissues, donors of mixed ancestry, Aguet et al. (2019, BioRxiv). Only tissues with a sample size of n >= 50 are included, which gives the 49 tissues. Fine-mapping was performed by Hilary Finucane, Jacob Ulirsch and Masahiro Kanai from the Finucane Lab. Effect size interpretation: change in normalised gene expression (sd units) per alternate allele. Normalization = inverse normal transformation.

  • EMBL-EBI (European Bioinformatics Institute) eQTL catalogue datasets. eQTL data from 24 tissues/cell types, 16 RNAseq sources, 6 Microarray, SuSiE fine-mapping, donors of 88% European ancestry, Kerimov et al. (2020, BioRxiv). For RNAseq data, four quantification methods are available (gene expression, exon expression, transcript usage, txrevise event usage). Fine-mapping was performed by Kaur Alasoo and Nurlan Kerimov. Effect size interpretation: change in normalised gene expression (sd units) per alternate allele. Normalization = inverse normal transformation.

Metabolon QTL datasets

  • GeneRISK: 186 lipid species QTLs, SuSiE fine-mapping of Widen et al. (2020), 7632 Finnish samples. Effect size interpretation: change in standard deviation of the lipid species per alternate allele.

Biomarkers

  • UK Biobank: 36 continuous endpoints, 57 biomarkers from UKBB prepared by the Finucane lab (361'194 White British samples), SuSiE fine-mapping. Effect size interpretation for quantitative traits: change in standard deviation of the normalized outcome per alternate allele. Effect size interpretation for binary traits: increase in log(odds ratio) per alternate allele.

Post-colocalization QC

Only unique source1-source2-pheno1-pheno2-tissue2-quant2-locus_id1-locus_id2 combinations were included in the results. FinnGen endpoints with _COMORB-definition were left out of the results.
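The two post-colocalization rules can be sketched as a simple filter. This is an illustration only: the row layout (one dictionary per colocalization with the eight key fields as keys) and the reading of "_COMORB-definition" as "endpoint name contains `_COMORB`" are assumptions, and `post_coloc_qc` is a hypothetical helper name.

```python
KEY_FIELDS = (
    "source1", "source2", "pheno1", "pheno2",
    "tissue2", "quant2", "locus_id1", "locus_id2",
)


def post_coloc_qc(rows: list[dict]) -> list[dict]:
    """Keep the first row per key combination and drop _COMORB endpoints."""
    seen = set()
    kept = []
    for row in rows:
        # Assumption: a _COMORB endpoint is recognised by its name.
        if "_COMORB" in row["pheno1"] or "_COMORB" in row["pheno2"]:
            continue
        key = tuple(row[field] for field in KEY_FIELDS)
        if key in seen:
            continue  # duplicate combination, already kept once
        seen.add(key)
        kept.append(row)
    return kept
```

Keeping the first occurrence is one possible choice; the text only states that unique combinations were retained.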

Acknowledgements

We thank the following people for helping us assemble the QTL resources:

  • Kaur Alasoo and Nurlan Kerimov provided us with the fine-mapped EMBL-EBI eQTL catalogue datasets.

  • Hilary Finucane, Jacob Ulirsch and Masahiro Kanai gave us access to their fine-mapped GTEx data.

  • FUSION study (RNAseq), muscle and adipose tissue.

  • Kolberg: mega-analysis of immune cells from the microarray datasets.

Linked resources: Finucane Lab; eQTL catalogue datasets; Kaur Alasoo and Nurlan Kerimov; Finucane lab, 361'194 White British samples.

    Data releases

    Timeline for releases:

    Release | Date release to partners | Date release to public | Total sample size [1]
    R2 | Q4 2018 (Nov) | Q1 2020 | 96,499
    R3 | Q2 2019 (May) | Q2 2020 | 135,638
    R4 | Q4 2019 (Oct) | Q4 2020 | 176,899
    R5 | Q2 2020 (March) | Q2 2021 | 218,792
    R6 | Q3 2020 | Q1 2022 | 260,405
    R7 | Q2 2021 | Q2 2022 | 309,154
    R8 | Q3 2021 | Q4 2022 | 342,499
    R9 | Q1 2022 | Q2 2023 | 377,277
    R10 | Q3 2022 | Q4 2023 | 412,181
    R11 | Q1 2023 | Q2 2024 | 453,733
    R12 | Q3 2023 | ~Q3 2024 | ~500,000

    [1] samples used for PheWAS.

    Data download

    To download FinnGen summary statistics, you will need to fill in the online form at this link. You will then receive an email containing detailed instructions for downloading the data.

    Release 11 contains

    • GWAS summary association statistics

    • Fine-mapping results

    • Variant annotation

    • HLA region analysis

    • LoF variant burden test results

    Using FinnGen data for publications

    When using these results in publications, please remember to:

    1) Acknowledge the FinnGen study. You can use the following text:

    “We want to acknowledge the participants and investigators of the FinnGen study”

    2) Cite our latest publication:

    Kurki M.I., et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 2023 Jan;613(7944):508-518. doi: 10.1038/s41586-022-05473-8. Epub 2023 Jan 18.

    Furthermore, if possible, include "FinnGen" as a keyword for your publication.

    If you want to cite this website, use the following citation:

    @online{finngen,
      author = {FinnGen},
      title = {{FinnGen} Documentation of R11 release},
      year = 2024,
      url = {https://finngen.gitbook.io/documentation/},
      urldate = {YYYY-MM-DD}
    }

    Manifest

    The manifest file with the link to all the downloadable summary stats is available at:

    Genotype data

    Chip genotype data processing and QC

    Samples were genotyped with Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix arrays (Thermo Fisher Scientific, Santa Clara, CA, USA).

    Genotype calls were made with the GenCall and zCall algorithms for Illumina data and the AxiomGT1 algorithm for Affymetrix data.

    Chip genotyping data produced with previous chip platforms and reference genome builds were lifted over to build version 38 (GRCh38/hg38) following the protocol described here: dx.doi.org/10.17504/protocols.io.xbhfij6

    Quality control

    In sample-wise quality control steps, individuals with ambiguous gender, high genotype missingness (>5%), excess heterozygosity (±4 SD) or non-Finnish ancestry were excluded. In variant-wise quality control steps, variants with high missingness (>2%), low HWE P-value (<1e-6) or low minor allele count (MAC<3) were excluded.
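The thresholds above can be written down as two predicates. This is only a sketch under stated assumptions: the record layout and field names are hypothetical, heterozygosity is assumed to be available as a z-score in SD units, and the boundary values themselves are treated as passing.

```python
from dataclasses import dataclass


@dataclass
class SampleStats:
    sex_ambiguous: bool      # ambiguous gender call
    missingness: float       # fraction of missing genotypes
    het_z: float             # heterozygosity in SD units from the mean
    finnish_ancestry: bool


@dataclass
class VariantStats:
    missingness: float       # fraction of samples with a missing call
    hwe_p: float             # Hardy-Weinberg equilibrium p-value
    mac: int                 # minor allele count


def sample_passes(s: SampleStats) -> bool:
    """Sample-wise QC: missingness <= 5%, |het| <= 4 SD, Finnish, unambiguous sex."""
    return (not s.sex_ambiguous and s.missingness <= 0.05
            and abs(s.het_z) <= 4.0 and s.finnish_ancestry)


def variant_passes(v: VariantStats) -> bool:
    """Variant-wise QC: missingness <= 2%, HWE p >= 1e-6, MAC >= 3."""
    return v.missingness <= 0.02 and v.hwe_p >= 1e-6 and v.mac >= 3
```

For example, `variant_passes(VariantStats(0.01, 0.2, 2))` is false because the minor allele count is below 3.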

    Pre-phasing

    Before imputation, chip-genotyped samples were pre-phased with Eagle 2.3.5 using the default parameters, except the number of conditioning haplotypes, which was set to 20,000.

    Participating biobanks/cohorts

    • Arctic Biobank

    • Auria Biobank

    • Biobank Borealis of Northern Finland

    • Biobank of Eastern Finland

    • Central Finland Biobank

    • Finnish Red Cross Blood Service Biobank

    • Finnish Clinical Biobank Tampere

    • Helsinki Biobank

    • Terveystalo Biobank

    • THL Biobank

    Genotypes

    FinnGen individuals were genotyped with Illumina and Affymetrix chip arrays (Illumina Inc., San Diego, and Thermo Fisher Scientific, Santa Clara, CA, USA).

    Chip genotype data were imputed using the population-specific SISu v4.2 imputation reference panel of 8,554 whole genomes.

    Merged imputed genotype data is composed of 116 data sets that include samples from multiple cohorts.

    • Total number of individuals: 473,681

    • Total number of variants (merged set): 21,311,942

    • Reference assembly: GRCh38/hg38

    Genotype imputation

    Genotype imputation was done with the population-specific SISu v4.2 reference panel.

    The reference panel variant call set was produced with the GATK HaplotypeCaller algorithm by following GATK best practices for variant calling.

    Genotype-, sample- and variant-wise QC was carried out iteratively by using the Hail framework v0.2, and the resulting high-quality WGS data for 8,554 individuals were phased with Eagle 2.3.5 as described in the previous section.

    Genotype imputation was carried out by using the population-specific SISu v4.2 imputation reference panel with Beagle 4.1 (version 27Jan18.7e1) as described in the following protocol: dx.doi.org/10.17504/protocols.io.xbgfijw

    Post-imputation quality control involved checking the expected conformity of the imputation INFO-value distribution, checking MAF differences between the target dataset and the imputation reference panel, and checking chromosomal continuity of the imputed genotype calls.

    SISu reference panel

    SISu v4.2 consists of 8,554 WGS of Finnish individuals from 5 research cohorts:

    1. METSIM (PIs Markku Laakso and Mike Boehnke)

    2. FINRISK (PI Pekka Jousilahti)

    3. Corogene (PI Juha Sinisalo)

    4. Biobank of Eastern Finland (PI Arto Mannermaa)

    5. Finnish EUFAM Dyslipidemia Study (PIs Marja-Riitta Taskinen and Samuli Ripatti)

    High-coverage (25x) WGS data used to develop the SISu v4.2 reference panel were generated at the McDonnell Genome Institute at Washington University (PIs Ira Hall and Nathan Stitziel).

    Software used

    • Hail v0.2

    • Cromwell-42

    • Wdltool-0.14

    • Plink 1.9 and 2.0

    • BCFtools 1.7 and 1.9

    • Eagle 2.3.5

    • Beagle 4.1 (version 27Jan18.7e1)

    • R 3.4.1 (packages: data.table 1.10.4, sm 2.2-5.4)

    Data description

    File naming pattern and file structure

    Summary association statistics

    GWAS summary statistics (tab-delimited, bgzipped, genome build 38, tabix index files included) are named as {endpoint}.gz. For example, endpoint I9_CHD has I9_CHD.gz and I9_CHD.gz.tbi.

    To learn more about the methods used, see section GWAS.
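Because bgzip output is gzip-compatible, a summary statistics file can be streamed with the Python standard library alone. The sketch below is illustrative: `read_hits` is a hypothetical helper, the `mlogp` column name is taken from the column descriptions for this file, and the default threshold of 7.3 (roughly p < 5e-8) is a conventional choice rather than a value stated in this documentation.

```python
import csv
import gzip
from typing import Iterator


def read_hits(path: str, min_mlogp: float = 7.3) -> Iterator[dict]:
    """Yield rows of an {endpoint}.gz file whose -log10(p) is at least min_mlogp."""
    with gzip.open(path, "rt", newline="") as handle:
        # The first line is the tab-separated header, starting with "#chrom".
        for row in csv.DictReader(handle, delimiter="\t"):
            if float(row["mlogp"]) >= min_mlogp:
                yield row


# Example usage (assumes I9_CHD.gz has been downloaded to the working directory):
# for hit in read_hits("I9_CHD.gz"):
#     print(hit["#chrom"], hit["pos"], hit["ref"], hit["alt"], hit["mlogp"])
```

For region queries, the accompanying tabix index is the better tool; this sketch scans the whole file.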

    The {endpoint}.gz files have the following structure:

    Fine-mapping results

    Two fine-mapping methods were used: SuSiE and FINEMAP.

    Fine-mapping results are tab-delimited and bgzipped.

    SuSiE results have the following filename pattern:

    • {endpoint}.SUSIE.cred.bgz

    • {endpoint}.SUSIE.cred_99.bgz

    • {endpoint}.SUSIE.snp.bgz

    FINEMAP results have the following filename pattern:

    • {endpoint}.FINEMAP.config.bgz

    • {endpoint}.FINEMAP.region.bgz

    • {endpoint}.FINEMAP.snp.bgz

    To learn more about the methods used, see section Fine-mapping.

    {endpoint}.SUSIE.cred.bgz contains credible set summaries from SuSiE fine-mapping for all genome-wide significant regions. {endpoint}.SUSIE.cred_99.bgz contains the 99% credible set summaries, whereas the default is 95%. They have the following structure:

    {endpoint}.SUSIE.snp.bgz contains variant summaries with credible set information and has the following structure:
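As an illustration of how the variant-level file might be consumed, the sketch below groups variants into credible sets. It assumes rows have already been parsed into dictionaries, and it relies on the documented column names `region`, `v`, `prob` and `cs` (with `cs` equal to -1 for variants outside any credible set). The helper name `credible_sets` is hypothetical.

```python
from collections import defaultdict


def credible_sets(rows: list[dict]) -> dict[tuple[str, int], dict[str, float]]:
    """Group variant PIPs by (region, cs), skipping variants outside any credible set."""
    sets: dict[tuple[str, int], dict[str, float]] = defaultdict(dict)
    for row in rows:
        cs_id = int(row["cs"])
        if cs_id == -1:
            continue  # variant is not part of a credible set
        sets[(row["region"], cs_id)][row["v"]] = float(row["prob"])
    return dict(sets)


# Hypothetical rows, as they could come out of csv.DictReader.
rows = [
    {"region": "chr1:1-1000", "v": "1:100:A:G", "prob": "0.7", "cs": "1"},
    {"region": "chr1:1-1000", "v": "1:200:C:T", "prob": "0.3", "cs": "1"},
    {"region": "chr1:1-1000", "v": "1:300:G:A", "prob": "0.001", "cs": "-1"},
]
print(credible_sets(rows))
```

The resulting mapping from variant to PIP per credible set is the shape of input that a colocalization metric over two credible sets would need.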

    {endpoint}.FINEMAP.config.bgz contains summaries of the fine-mapped variant configurations from the FINEMAP method and has the following structure:

    {endpoint}.FINEMAP.region.bgz contains summary statistics on the number of independent signals in each region and has the following structure:

    {endpoint}.FINEMAP.snp.bgz contains summary statistics of variants and the credible set they may belong to. Columns:

    LD estimation

    Linkage disequilibrium (LD) was estimated from SISu v4.2 for each chromosome. Use the LDstore (v1.1) tool to work with the bcor files.

    ldstore --bcor FG_LD_chr1.bcor --incl-range 20000000-50000000 --table output_file_name.table

    To learn more about the methods used, see section LD estimation.

    Variant annotation

    The variant annotation has measures (HWE, INFO, ...) listed per batch.

    LD estimation

    The BCOR files were created using LDstore from the Finnish SISu panel v4.2.

    The panel has been divided per chromosome. For example, to use the LD information in the first chromosome, FG_LD_chr1.bcor would be the file to use.

    Settings used

    • number of samples: 3775

    • window size: 1500 kb

    • accuracy: low

    • number of threads: 96

    • LD threshold to include correlations: 0.05

    Example usage

    LDstore v1.1 can be downloaded via:

    wget http://www.christianbenner.com/ldstore_v1.1_x86_64.tgz

    An example to extract the variant range 20 Mb - 50 Mb from chromosome 7 is as follows:

    ldstore --bcor FG_LD_chr7.bcor --incl-range 20000000-50000000 --table output_file_name.table

    Note

    These LD estimate files are not recommended for e.g. fine-mapping, since many fine-mapping methods (e.g. SuSiE) require in-sample LD information for good results.

    Introduction

    The FinnGen research project is a public-private partnership combining genotype data from Finnish biobanks and digital health record data from Finnish health registries. FinnGen provides a unique opportunity to study genetic variation in relation to disease trajectories in an isolated population.

    FinnGen is a growing project, aiming at 500,000 individuals by the end of 2023.

    FinnGen results are subject to a one-year embargo and, after that, are available to the wider scientific community via the PheWeb browser or through data download.

    PheWeb

    The PheWeb portal can be used to browse results from FinnGen's predetermined endpoints (or 'phenotypes'), a.k.a. core analysis results. A FinnGen PheWeb tutorial is also available.

    These clinician curated endpoints were analysed for genetic associations with a method that allows for disproportionate case-control numbers and corrects for relatedness between samples with a sparse genetic relatedness matrix.

    The results from each association run are uploaded onto the PheWeb portal, which can be accessed at https://r11.finngen.fi/

    Home Page

    The figure below shows a table of the first few endpoints ('phenotypes') in FinnGen with the highest numbers of GWAS significant loci, along with the summary of case-control analyses and the number of hits.

    {endpoint}.gz (GWAS summary statistics)

    Column name | Description
    #chrom | chromosome on build GRCh38 (1-23)
    pos | position in base pairs on build GRCh38
    ref | reference allele
    alt | alternative allele (effect allele)
    rsids | variant identifier
    nearest_genes | nearest gene(s) (comma separated) from variant
    pval | p-value from the GWAS association test
    mlogp | -log10(p-value)
    beta | effect size (log(OR) scale) for the alternative allele, estimated by the GWAS association test
    sebeta | standard error of the effect size
    af_alt | alternative (effect) allele frequency
    af_alt_cases | alternative (effect) allele frequency among cases
    af_alt_controls | alternative (effect) allele frequency among controls

    {endpoint}.SUSIE.cred.bgz and {endpoint}.SUSIE.cred_99.bgz

    Column name | Description
    trait | phenotype
    region | region for which the fine-mapping was run
    cs | running number for independent credible sets in a region
    cs_log10bf | log10 Bayes factor comparing the solution of this model (cs independent credible sets) to cs - 1 credible sets
    cs_avg_r2 | average r2 between variants in the credible set
    cs_min_r2 | minimum r2 between variants in the credible set
    low_purity |
    cs_size | number of SNPs in this credible set

    {endpoint}.SUSIE.snp.bgz

    Column name | Description
    trait | endpoint name
    region | chr:start-end
    v | variant identifier
    rsid | rs variant identifier
    chromosome | chromosome on build GRCh38 (1-22, X)
    position | position in base pairs on build GRCh38
    allele1 | reference allele
    allele2 | alternative allele (effect allele)
    maf | minor allele frequency
    beta | effect size GWAS
    se | standard error GWAS
    p | p-value GWAS
    mean | posterior expectation of true effect size
    sd | posterior standard deviation of true effect size
    prob | posterior probability of association
    cs | identifier of 95% credible set (-1 = variant is not part of credible set)
    lead_r2 | r2 value to a lead variant (the one with maximum PIP) in a credible set
    alphax | posterior inclusion probability for the x-th single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10)

    {endpoint}.FINEMAP.config.bgz

    Column name | Description
    trait | phenotype
    region | region for which the fine-mapping was run
    rank | rank of this configuration within a region
    config | causal variants in this configuration
    prob | probability across all n independent signal configurations
    log10bf | log10 Bayes factor for this configuration
    odds | odds of this configuration
    k | number of independent signals in this configuration
    prob_norm_k | probability of this configuration within the k independent signals solution
    h2 | SNP heritability of this solution
    h2_0.95CI | 95% confidence interval limits of SNP heritability of this solution
    mean | marginalized shrinkage estimates of the posterior effect size mean
    sd | marginalized shrinkage estimates of the posterior effect standard deviation

    {endpoint}.FINEMAP.region.bgz

    Column name | Description
    trait | phenotype
    region | region for which the fine-mapping was run
    h2g | heritability of this region
    h2g_sd | standard deviation of SNP heritability of this region
    h2g_lower95 | lower limit of 95% CI for SNP heritability
    h2g_upper95 | upper limit of 95% CI for SNP heritability
    log10bf | log10 Bayes factor compared against null (no signals in the region)
    prob_xSNP | columns for probabilities of different numbers of independent signals
    expectedvalue | expectation (average) of the number of signals

    {endpoint}.FINEMAP.snp.bgz

    Column name | Description
    trait | phenotype
    region | region for which the fine-mapping was run
    v | variant
    index | running index
    rsid | rs variant identifier
    chromosome | chromosome
    position | position
    allele1 | reference allele
    allele2 | alternative allele
    maf | alternative allele frequency
    beta | original marginal effect size
    se | original standard error
    z | original z-score
    prob | posterior inclusion probability
    log10bf | log10 Bayes factor
    mean | marginalized shrinkage estimates of the posterior effect size mean
    sd | marginalized shrinkage estimates of the posterior effect standard deviation
    mean_incl | conditional estimates of the posterior effect size mean
    sd_incl | conditional estimates of the posterior effect size standard deviation
    p | original p-value
    csx | credible set index for given number of causal variants x

    You can reorder the table by clicking on the appropriate header value (in the figure above, we clicked on GWAS significant loci to order the table based on the number of GWAS loci).

    From the home page in PheWeb, you can also go directly to the coding variant browser by clicking the 'Coding' icon in the top right corner.

    Endpoint Page

    Upon clicking an endpoint ('phenotype'), you will then be directed to the endpoint's page which will contain information such as case-control numbers and results from the association scan of the endpoint. In the following screenshot, we show the endpoint results for “Type 2 diabetes, wide definition”.

    On the endpoint page, you will find a similar Manhattan plot from the association scan which summarizes the association results for your endpoint.

    Scrolling further, you will also be able to see the Manhattan plot in a tabular format, distinguished by either the traditional GWAS hits or based on a credible set.

    Variant Page

    You can also browse based on a variant of your choice and see a PheWAS plot:

    The variant page shows the information on the gene that the variant is in, the most severe consequence annotation of the variant (from VEP), its allele frequency, whether the variant was imputed or not (INFO score), and links to external sites to obtain further information on the variant such as gnomAD, the UCSC genome browser, and the GWAS catalog.

    The Manhattan plot shown in the figure above also shows p-values from the association scans for FinnGen endpoints. Scrolling down, you will again be able to see the association scan results for the FinnGen endpoints in this variant in a tabular format.

    To see the corresponding LAVAA plot, you can click show lavaa plot on top of the Manhattan plot.

    All results (endpoint and variant-wise) can be downloaded in a tabular format by clicking Download table.

    Gene Page

    Gene pQTL and disease colocalizations

    The gene page of the FinnGen PheWeb browser can be found at https://r12.finngen.fi/gene/<gene> by specifying the gene symbol of interest. The bottom section of the page contains gene pQTL and disease colocalization data available for the FinnGen imputed SNPs. The main table contains a summary of credible sets gathered from SuSiE fine-mapping results and combined across Olink and Somascan proteomics QTL platforms (FinnGen and UK Biobank Pharma Proteomics Project). The main table includes the following columns:

    • source - pQTL platform source (i.e. FinnGen Olink, FinnGen Somascan, UKB-PPP)

    • region - region for which the fine-mapping was run

    • CS - running number for independent credible sets in a region

    • variant - top variant associated with the credible set

    • CS bayes factor (log10)

    • CS min r2 - minimum R2 correlation between variants in the credible set

    • beta - top variant effect size

    • p-value - top variant p-value

    • CS PIP - overall Posterior Inclusion Probability (PIP) of the variant

    • consequence - most severe consequence of the variant

    • gene most severe - gene corresponding to most severe consequence of the variant

    The nested sub-table for a single gene pQTL contains a list of disease colocalizations between the FinnGen endpoints and the pQTL in question colocalizing with the lead variant of the pQTL (read more about colocalizations in FinnGen). The sub-table includes the following columns:

    • phenotype - FinnGen endpoint (by clicking to the phenotype you will be navigated to the PheWeb region page corresponding to the phenotype in question)

    • description - FinnGen endpoint description

    • clpp - causal posterior probability calculated for a colocalization

    • clpa - causal posterior agreement calculated for a colocalization

    • len intersect - CS intersect

    • len cs1 - FinnGen endpoint credible set size

    • len cs2 - pQTL credible set size

    All results can be downloaded in a tabular format by clicking Download table.

    PheWeb for previous data releases

    The PheWeb pages for previous data releases are available at

    DF10: https://r10.finngen.fi/

    DF9: https://r9.finngen.fi/

    DF8: https://r8.finngen.fi/

    DF7: https://r7.finngen.fi/

    DF6: https://r6.finngen.fi/

    Note: PheWeb is continuously being developed, and some features available in newer DFs may not be available in PheWeb versions for earlier DFs.

    The PheWeb page for the current release (R11) is https://r11.finngen.fi/

    Endpoints

    Registries

    The disease endpoints were defined using nationwide registries:

    • Drug purchase and Drug Reimbursement

    We harmonized over the International Classification of Diseases (ICD) revisions 8, 9 and 10, cancer-specific ICD-O-3, (NOMESCO) procedure codes, Finnish-specific Social Insurance Institute (KELA) drug reimbursement codes and ATC-codes.

    These registries spanning decades were electronically linked to the cohort baseline data using the unique national personal identification numbers assigned to all Finnish citizens and residents.

    A full list of FinnGen endpoints is available for release 11.

    Excluded endpoints

    Endpoints with fewer than 50 cases and developmental “helper” endpoints were excluded from the final PheWAS (“OMIT” tag in the endpoint definition file).

    Risteys

    Risteys (Finnish for "intersection") allows browsing of the FinnGen data at the phenotype level, including endpoint definitions, statistics about the number of individuals, gender distribution, and longitudinal relationships. Please also note the R11-specific page.

    Sample QC and PCA

    This is a description of the quality control procedures applied before running the GWAS.

    PCA

    The PCA for population structure was run as follows.

    Variant filtering and LD pruning

    The SISu v4.2 imputation panel is pruned iteratively until a target number of SNPs is reached:

    9,641,808 starting variants: only variants with a minimum INFO score of 0.9 in all batches are kept.

    The script starts with PLINK pruning parameters [500.0, 50.0, 0.9] (window, step, r2). It then lowers r2 by 0.05 at each iteration, re-pruning the imputation panel, until the SNP count falls below 200,000. The pruning closest to 200,000 SNPs is then returned.

    If the pruning at the higher r2 is closer to the target, 200,000 SNPs are randomly selected from it; otherwise the last pruned set is returned.

    PLINK flags used: --snps-only --chr 1-22 --max-alleles 2 --maf 0.01.

    For this run, 180,032 SNPs are returned.
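The stopping rule described above can be sketched in a few lines. This is not the project's script: `prune` is a hypothetical callable standing in for a PLINK pruning run at a given r2 (with window 500 and step 50), and the tie-breaking, the use of "at or below" the target and the fixed random seed are assumptions.

```python
import random
from typing import Callable


def prune_to_target(prune: Callable[[float], list], target: int = 200_000,
                    r2: float = 0.9, step: float = 0.05, seed: int = 0) -> list:
    """Lower r2 until the pruned set drops to the target, then return the closest set."""
    previous = prune(r2)
    if len(previous) <= target:
        return previous
    while r2 - step > 0:
        r2 = round(r2 - step, 10)  # avoid floating point drift in the r2 schedule
        current = prune(r2)
        if len(current) <= target:
            over = len(previous) - target   # excess at the higher r2
            under = target - len(current)   # shortfall at the lower r2
            if over < under:
                # Higher r2 is closer: randomly down-sample it to the target.
                return random.Random(seed).sample(previous, target)
            return current
        previous = current
    return previous
```

In the run described above, the last pruned set was returned as-is, which is consistent with the reported 180,032 SNPs being below the 200,000 target.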

    PCA outlier detection

    FinnGen data was then merged with the 1000 Genomes Project (1kgp) data, using the variants mentioned above. A round of PCA was performed and a Bayesian algorithm was used to spot outliers. This process removed 17,133 FinnGen samples. The figure below shows the scatter plots for the first 3 PCs. Outliers, in green, are separated from the FinnGen red cluster.

    While the method automatically flagged the 1kgp samples with non-European and southern European ancestries as outliers, it did not exclude some samples with Western European origins. The signal from these samples would have been too small for a second round to work without also detecting substructure within the Finnish population, so another approach was used. The FinnGen samples that survived the first round were used to compute another PCA. The EUR and FIN 1kgp samples were then projected onto the space generated by the first 3 PCs. Next, the centroid of each cluster was calculated and used to compute the squared Mahalanobis distance of each FinnGen sample to each of the centroids. Because the squared distance is a sum of squared variables (each with unit variance, owing to the Mahalanobis scaling), it can be treated as a sum of 3 independent squared variables. This allowed us to map the squared distance to a probability (chi-squared with 3 degrees of freedom). Therefore, for each cluster, a probability of membership was computed. A threshold of 0.95 was then used to exclude FinnGen samples whose relative chance of being part of the Finnish cluster was below that level. This method produced another 22 outliers. The figure below shows the first three principal components.
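The distance-to-probability step can be sketched with the standard library only. Several simplifications are assumed here and are not taken from the pipeline: the PCs are treated as uncorrelated within a cluster (diagonal covariance), the "relative chance" is read as the Finnish probability divided by the sum of the Finnish and EUR probabilities, and all centroids, standard deviations and function names are hypothetical.

```python
import math

Cluster = tuple[tuple[float, float, float], tuple[float, float, float]]  # (centroid, sd)


def chi2_sf_3df(x: float) -> float:
    """Survival function of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def squared_distance(point, centroid, sd) -> float:
    """Squared Mahalanobis distance, assuming uncorrelated PCs (diagonal covariance)."""
    return sum(((p - c) / s) ** 2 for p, c, s in zip(point, centroid, sd))


def finnish_share(point, fin: Cluster, eur: Cluster) -> float:
    """Assumed 'relative chance' of the Finnish cluster versus the EUR cluster."""
    p_fin = chi2_sf_3df(squared_distance(point, *fin))
    p_eur = chi2_sf_3df(squared_distance(point, *eur))
    total = p_fin + p_eur
    return p_fin / total if total > 0 else 0.0


# Hypothetical clusters in the space of the first 3 PCs.
fin = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
eur = ((5.0, 0.0, 0.0), (1.0, 1.0, 1.0))
sample = (0.5, 0.0, 0.0)
keep = finnish_share(sample, fin, eur) >= 0.95
print(keep)
```

A full implementation would use the cluster covariance matrix rather than per-axis standard deviations; the closed-form survival function is specific to 3 degrees of freedom.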

    FIN 1kgp samples are in purple, while EUR 1kgp samples are in blue. FinnGen samples flagged as non-Finnish are in green, while those considered Finnish are in red.

    Kinship

    Next, all pairs of FinnGen samples related up to the second degree were identified. The figure below shows the distribution of kinship values.

    Then, the previously defined “non-Finnish” samples were excluded and two algorithms were used to return a unique subset of unrelated samples:

    • one, called greedy, repeatedly removes the highest-degree node from the network of relations until no links are left in the network.

    • one, called native, is based on a native implementation from Python’s networkx package and is run on each subgraph of the network.

    The larger of the independent sets returned by the two algorithms was kept, while the remaining samples were flagged as “outliers” for the final PCA.
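The greedy strategy can be sketched in plain Python as follows. This illustrates the idea only and is not the pipeline's code: the input format (a list of related sample pairs), the function name and the alphabetical tie-break between nodes of equal degree are assumptions. The networkx-based variant is not shown.

```python
def greedy_unrelated(pairs: list[tuple[str, str]]) -> tuple[set[str], set[str]]:
    """Return (kept, removed) after greedily dropping the highest-degree sample."""
    neighbours: dict[str, set[str]] = {}
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    removed: set[str] = set()
    # Continue while at least one relation (edge) is left in the network.
    while any(neighbours.values()):
        worst = max(neighbours, key=lambda n: (len(neighbours[n]), n))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.add(worst)
    return set(neighbours), removed


# Hypothetical relations: A is related to B, C and D; E is related to F.
kept, removed = greedy_unrelated([("A", "B"), ("A", "C"), ("A", "D"), ("E", "F")])
print(sorted(kept), sorted(removed))
```

Removing one highly connected sample (A) frees three others, which is why the greedy choice tends to keep more samples than removing relatives at random.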

    Then, the subset of outliers who also belong to the set of duplicates/twins was identified.

    Final PCA

    For the final step, the FinnGen samples were separated into three groups:

    • 277,053 inliers: unrelated samples with Finnish ancestry.

    • 176,844 outliers: non-duplicate samples with Finnish ancestry that are related to the inliers.

    • 19,784 rejected samples: samples that are either of non-Finnish ancestry or twins/duplicates of other samples.

    Finally, the PCA was calculated for the inliers, and the outliers were then projected onto the same PC space, which allowed covariates to be calculated for a total of 453,897 samples.

    Sample filtering based on phenotype data

    Of the 453,897 non-duplicate population inlier samples from PCA, we excluded 136 samples from analysis because of missing minimum phenotype data, and 28 samples because they failed the sex check with F thresholds of 0.4 and 0.7. A total of 453,733 samples were used for core analysis. There are 254,618 females and 199,115 males among these samples.
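The sample counts quoted in this section and in the genotype summary are mutually consistent, which can be checked with a few lines of arithmetic. The variable names are ours; the numbers are those given in the text.

```python
inliers = 277_053    # unrelated samples with Finnish ancestry
outliers = 176_844   # related, non-duplicate samples with Finnish ancestry
rejected = 19_784    # non-Finnish ancestry or twins/duplicates

pca_samples = inliers + outliers
assert pca_samples == 453_897                 # samples with PCA covariates
assert pca_samples + rejected == 473_681      # total number of genotyped individuals

analysed = pca_samples - 136 - 28             # missing phenotype data, failed sex check
assert analysed == 453_733                    # samples used for core analysis
assert 254_618 + 199_115 == analysed          # females + males

print(analysed)
```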

    Further info

    Bayesian outlier detection

    Documentation from the original developers of the algorithm can be found here: .

    How to cite

    Please use the following description when referring to our project:

    The FinnGen study is a large-scale genomics initiative that has analyzed over 500,000 Finnish biobank samples and correlated genetic variation with health data to understand disease mechanisms and predispositions. The project is a collaboration between research organisations and biobanks within Finland and international industry partners.

    When using these results in publications, please remember to:

    1. Acknowledge the FinnGen study. You can use the following text:

    “We want to acknowledge the participants and investigators of the FinnGen study”

    2. Cite our latest publication:

    Kurki, M.I., Karjalainen, J., Palta, P. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613, 508–518 (2023). https://doi.org/10.1038/s41586-022-05473-8

    Furthermore, if possible, include "FinnGen" as a keyword for your publication.

    If you want to cite this website, use the following citation:

    HLA region analysis

    hashtag
    HLA imputation

    The HLA data was imputed from R11 genotype data, using HIBAG models created by Jarmo Ritari from the Finnish Blood Bank. More information can be found in the repository:

    https://github.com/FRCBS/HLA-imputation

    as well as in the publication:

    Ritari J, Hyvärinen K, Clancy J, FinnGen, Partanen J, Koskela S. Increasing accuracy of HLA imputation by a population-specific reference panel in a FinnGen biobank cohort. NAR Genomics and Bioinformatics, Volume 2, Issue 2, June 2020, lqaa030, https://doi.org/10.1093/nargab/lqaa030

    Genotype data was constructed from the dosage data using PLINK 2.

    hashtag
    Variant summary

    A snp-stats report was generated with qctool.

    hashtag
    Association testing

    Association testing was performed using regenie 2.2.4, with the same settings as in the core GWAS analysis. See the Association tests page for more information.

    hashtag
    Association summary

    A summary was created from the regenie summary statistic outputs. This summary contains the most significant variant (by p-value) for each phenotype. Pheweb links to phenotype and gene pages have been added as additional columns.
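As an illustration of the summary step, the sketch below picks the most significant variant per phenotype from a list of association results. The record fields (`phenotype`, `variant`, `pval`) and the example values are hypothetical and do not reflect the actual regenie output columns.

```python
def top_variant_per_phenotype(rows):
    """Return, for each phenotype, the row with the smallest p-value."""
    best = {}
    for row in rows:
        pheno = row["phenotype"]
        if pheno not in best or row["pval"] < best[pheno]["pval"]:
            best[pheno] = row
    return best


# Hypothetical association results
results = [
    {"phenotype": "PHENO_A", "variant": "var1", "pval": 1e-8},
    {"phenotype": "PHENO_A", "variant": "var2", "pval": 1e-12},
    {"phenotype": "PHENO_B", "variant": "var3", "pval": 1e-9},
]
summary = top_variant_per_phenotype(results)
for pheno, row in sorted(summary.items()):
    print(pheno, row["variant"], row["pval"])
```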

    LoF variant burden

    Gene-based burden test results of loss of function variants (LoFs).

    hashtag
    Variant Selection

    Loss of function (LoF) variants were generated from VCF files with VEP (https://github.com/Ensembl/ensembl-vep). LoF variants are defined as having a consequence in the list [frameshift_variant,splice_donor_variant,stop_gained,splice_acceptor_variant]. In addition, a maximum MAF filter (max_maf = 0.01) and a minimum INFO score filter (0.8) are applied. This leaves 3,737 genes that can be used for the association tests.
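The selection rule above can be sketched as a simple filter. The record layout and the inclusive handling of the thresholds are assumptions made for illustration; only the consequence list and the two cut-offs come from the text.

```python
LOF_CONSEQUENCES = {
    "frameshift_variant",
    "splice_donor_variant",
    "stop_gained",
    "splice_acceptor_variant",
}


def is_lof_candidate(variant, max_maf=0.01, min_info=0.8):
    """Check consequence, minor allele frequency and INFO score.

    Assumption: thresholds are inclusive (MAF <= max_maf, INFO >= min_info).
    """
    maf = min(variant["af"], 1.0 - variant["af"])
    return (
        variant["consequence"] in LOF_CONSEQUENCES
        and maf <= max_maf
        and variant["info"] >= min_info
    )


# Hypothetical annotated variants
variants = [
    {"id": "v1", "gene": "GENE1", "consequence": "stop_gained", "af": 0.002, "info": 0.95},
    {"id": "v2", "gene": "GENE1", "consequence": "missense_variant", "af": 0.001, "info": 0.99},
    {"id": "v3", "gene": "GENE2", "consequence": "frameshift_variant", "af": 0.05, "info": 0.97},
    {"id": "v4", "gene": "GENE3", "consequence": "splice_donor_variant", "af": 0.004, "info": 0.6},
]
selected = [v["id"] for v in variants if is_lof_candidate(v)]
testable_genes = sorted({v["gene"] for v in variants if is_lof_candidate(v)})
print(selected, testable_genes)
```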

    hashtag
    Endpoint

    We used all 2,444 core binary phenotypes in the analyses.

    hashtag
    Null Models

    We used as inputs the null models already calculated for the GWAS.

    hashtag
    Association tests

    Tests are performed with regenie step 2 (--step 2) in burden mode using a max mask, i.e. each sample is assigned the maximum number of ALT alleles across the sites in the mask.
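To make the max mask concrete, the sketch below collapses a small genotype matrix into one burden value per sample. It is an illustration of the idea only, not regenie's implementation, and the toy genotypes are hypothetical.

```python
def max_mask(genotypes_by_variant):
    """Collapse per-variant ALT allele counts into one value per sample.

    genotypes_by_variant: list of lists, one inner list per variant holding
    the ALT allele count (0, 1 or 2) of every sample in the same order.
    """
    return [max(sample_counts) for sample_counts in zip(*genotypes_by_variant)]


# Three LoF variants in one gene, four samples (hypothetical)
gene_genotypes = [
    [0, 1, 0, 2],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]
burden = max_mask(gene_genotypes)
print(burden)  # [0, 1, 1, 2]
```

With a max mask, a sample carrying one ALT allele at two different sites still gets a burden of 1, unlike a sum-based mask.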

    Fine-mapping

    We used two state-of-the-art methods, FINEMAP (Benner, C. et al., 2016; Benner, C. et al., 2018) and SuSiE (Wang, G. et al., 2020), to fine-map genome-wide significant loci in FinnGen endpoints.

    Briefly, there are three main steps:

    hashtag
    1. Preprocessing

    For each genome-wide significant locus (default configuration: P < 5e-8), we define a fine-mapping region by taking a 3 Mb window around the lead variant, merging regions that overlap. If a merged window exceeds 10 Mb, we iteratively shrink the window by 10% until the merged window fits into 10 Mb or splits into merged windows that each fit into 10 Mb. We then split the input GWAS summary statistics into separate files per region for the following steps.
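The region definition can be sketched as follows for a single chromosome. This is a simplified illustration under stated assumptions: the window is centred on the lead variant, and shrinking is applied to all windows on the chromosome at once rather than only to the oversized merged region. Function names are hypothetical.

```python
def merge_windows(windows):
    """Merge overlapping (start, end) intervals."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def define_regions(lead_positions, window=3_000_000, max_size=10_000_000, shrink=0.9):
    """Build merged fine-mapping regions, shrinking windows by 10% if needed."""
    half = window // 2
    while True:
        windows = [(max(0, int(p - half)), int(p + half)) for p in lead_positions]
        regions = merge_windows(windows)
        if all(end - start <= max_size for start, end in regions):
            return regions
        half *= shrink  # shrink the window by 10% and try again


# Two nearby lead variants are merged; a distant one gets its own region.
regions = define_regions([10_000_000, 11_000_000, 50_000_000])
print(regions)  # [(8500000, 12500000), (48500000, 51500000)]

# Seven leads spaced 2.5 Mb apart would merge into an 18 Mb region,
# so the window shrinks until the regions separate.
dense = define_regions([10_000_000 + i * 2_500_000 for i in range(7)])
print(len(dense), max(end - start for start, end in dense))
```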

    hashtag
    2. LD computation

    We compute in-sample dosage LD using LDstore2 for each fine-mapping region.

    hashtag
    3. Fine-mapping

    With the summary statistics and in-sample LD from steps 1-2 as inputs, we conduct fine-mapping using FINEMAP and SuSiE with the maximum number of causal variants in a locus L = 10.

    hashtag
    Integration to PheWeb

    The "Credible Sets" table on a phenotype page in the R11 PheWeb browser shows the SuSiE-fine-mapped credible sets of that phenotype. The variant shown per credible set is the variant with the maximum PIP (posterior inclusion probability) in that credible set. In addition to the variants in the credible set, variants that were in sufficient LD with the lead variant (Pearson r^2 > 0.05), had a small enough p-value (pval < 0.01), and were close enough to the lead variant (distance to lead variant < 1.5 megabases) were clumped together with the credible set. Variants have been compared against GWAS Catalog and annotated. The LD grouping, annotation and GWAS Catalog comparison were done using the autoreporting pipeline.
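The three clumping criteria can be expressed as a single filter. The sketch below is illustrative only; the field names are hypothetical and the autoreporting pipeline's actual implementation is not shown here.

```python
def ld_partners(lead, candidates, min_r2=0.05, max_pval=0.01, max_distance=1_500_000):
    """Select variants to group with a credible set around its lead variant."""
    return [
        v for v in candidates
        if v["chrom"] == lead["chrom"]
        and v["r2_to_lead"] > min_r2
        and v["pval"] < max_pval
        and abs(v["pos"] - lead["pos"]) < max_distance
    ]


# Hypothetical lead variant and candidates
lead = {"id": "lead", "chrom": "1", "pos": 5_000_000}
candidates = [
    {"id": "a", "chrom": "1", "pos": 5_200_000, "r2_to_lead": 0.40, "pval": 1e-6},
    {"id": "b", "chrom": "1", "pos": 5_100_000, "r2_to_lead": 0.01, "pval": 1e-6},  # LD too low
    {"id": "c", "chrom": "1", "pos": 5_300_000, "r2_to_lead": 0.30, "pval": 0.2},   # p-value too large
    {"id": "d", "chrom": "1", "pos": 7_000_000, "r2_to_lead": 0.30, "pval": 1e-4},  # too far away
]
grouped = [v["id"] for v in ld_partners(lead, candidates)]
print(grouped)  # ['a']
```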

    The columns of the table are explained below:

    Contact

    For matters related to this documentation, send us an email to finngen-info@helsinki.fi.

    For the latest updates on the project and additional background information, please visit the study website (https://www.finngen.fi/en) or follow FinnGen on Twitter (@FinnGen_FI).

    If you want to host FinnGen summary statistics on your website, please get in contact with us at: humgen-servicedesk@helsinki.fi.

    Association tests

    hashtag
    Endpoint

    We included 2,447 endpoints in the analysis: 2,444 binary endpoints and 3 quantitative endpoints (HEIGHT_IRN, WEIGHT_IRN, BMI_IRN). Endpoints with fewer than 50 cases among the 453,733 samples were excluded, as were endpoints labeled with an OMIT tag in the endpoint definition file.

    The quantitative endpoints HEIGHT and WEIGHT were acquired from the minimum phenotype data. BMI was then derived from them, and all three were inverse-normal transformed.
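A rank-based inverse normal transformation can be sketched as below. The text does not specify the exact variant used, so the rank offset `c = 0.5`, the average-rank handling of ties, and the units (metres and kilograms) are assumptions made for illustration.

```python
from statistics import NormalDist


def inverse_normal_transform(values, c=0.5):
    """Rank-based inverse normal transform; ties receive their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


# Hypothetical toy data: height in metres, weight in kilograms
height = [1.60, 1.70, 1.80]
weight = [60.0, 70.0, 90.0]
bmi = [w / h ** 2 for h, w in zip(height, weight)]
bmi_irn = inverse_normal_transform(bmi)
print([round(z, 3) for z in bmi_irn])
```

The transformed values depend only on the ranks of the inputs, so the result is symmetric around zero regardless of the original scale.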

    Seven endpoints did not progress past step 1 of the regenie pipeline due to convergence issues and were discarded. The endpoints are:

    GWAS

    We used regenie for release 11. Regenie's main advantages are fast leave-one-chromosome-out relatedness calculation which avoids proximal contamination, and use of an approximate Firth test which gives more reliable effect size estimates for rare variants.

    We used regenie version 2.2.4.

    Links:

    Digital and Population Data Services Agency
    Statistics Finland
    Register of primary health care visits: AVOHILMO
    Care Register for Health Care: HILMO
    Finnish cancer registry
    available online
    risteys.finngen.fi
    https://r11.risteys.finregistry.fi/
    https://doi.org/10.1093/nargab/lqaa030
    qctool
    the Association tests page
    GWAS
    regenie
    https://www.finngen.fi/en
    @FinnGen_FI

  • regenie GitHub repository

  • FinnGen regenie GitHub repository

  • FinnGen regenie pipeline GitHub repository

  • We analyzed:

    • 2,447 endpoints

      • 2,444 binary endpoints

      • 3 quantitative endpoints (HEIGHT_IRN, WEIGHT_IRN, BMI_IRN)

    • 453,733 samples

      • 254,618 females

      • 199,115 males

    • 21,311,942 variants

    We included the following covariates in the model: sex, age, 10 PCs, FinnGen chip version 1 or 2, and legacy genotyping batch.

    regenie preprint
    http://www.well.ox.ac.uk/~spencer/Aberrant/aberrant-manu

    Column name | Explanation
    top PIP variant | Variant with the largest PIP in the credible set. Click the arrow to the left of the variant to show the credible set variants.
    CS quality | Shows whether the credible set is well-formed. A 'true' value means that the credible set is likely trustworthy, and a 'false' value means that the credible set is likely not trustworthy.
    chromosome | The chromosome in which the credible set lies.
    position | The position of the lead variant.
    p-value | p-value of the top PIP variant.
    -log10(p) | -log10(p-value)
    effect size (beta) | Effect size of the top PIP variant.
    Finnish Enrichment | Finnish enrichment of the top PIP variant.
    Alternate allele frequency | Alternate allele frequency of the top PIP variant.
    Lead Variant Gene | A probable gene of the top PIP variant.
    # coding in cs | Number of coding variants in the credible set. Hover over the number to see the variant, the consequence, and the correlation (Pearson r squared) to the lead variant.
    # credible variants | Number of variants in the credible set.
    Credible set bayes factor (log10) | The Bayes factor related to the credible set.
    CS matching Traits | Number of matches found in GWAS Catalog for the credible set variants. Hover over the number to see the trait, as well as the associated variant's LD (Pearson r squared) to the lead variant.
    LD Partner Traits | Number of matches found in GWAS Catalog for the group of credible variants and variants in LD with the top PIP variant. Hover over the number to see the trait, as well as the associated variant's LD (Pearson r squared) to the lead variant.

    LDstore2
    FINEMAP
    SuSiE
    R11 PheWeb browser

    @online{finngen,
      author = {FinnGen},
      title = {{FinnGen} Documentation of R11 release},
      year = 2024,
      url = {https://finngen.gitbook.io/documentation/},
      urldate = {YYYY-MM-DD}
    }
    hashtag
    Null models

    For the regenie step 1 LOCO prediction computation for each endpoint, we used age, sex, 10 PCs, and FinnGen chip version 1 or 2 or legacy genotyping batch as covariates. For sex-specific phenotypes, sample sex was left out of the covariates. We excluded covariates that had fewer than 10 cases.

    For calculating genetic relatedness in regenie step 1, we included variants that 1) were imputed with an INFO score > 0.95 in all batches, 2) had > 97% non-missing genotypes, and 3) had MAF > 1%. The remaining variants were LD-pruned with a 1.5 Mb window and an r2 threshold of 0.2. This resulted in a set of 215,152 well-imputed, non-rare variants for relatedness calculation.
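The three inclusion criteria applied before LD pruning can be sketched as a filter. The record fields (`info_by_batch`, `call_rate`, `maf`) and example values are hypothetical, and LD pruning itself is not shown.

```python
def keep_for_relatedness(variant, min_info=0.95, min_call_rate=0.97, min_maf=0.01):
    """Apply the INFO, non-missingness and MAF criteria (strict inequalities)."""
    return (
        all(info > min_info for info in variant["info_by_batch"])
        and variant["call_rate"] > min_call_rate
        and variant["maf"] > min_maf
    )


# Hypothetical variants
variants = [
    {"id": "v1", "info_by_batch": [0.99, 0.98], "call_rate": 0.999, "maf": 0.12},
    {"id": "v2", "info_by_batch": [0.99, 0.90], "call_rate": 0.999, "maf": 0.12},   # one batch below INFO cut-off
    {"id": "v3", "info_by_batch": [0.99, 0.98], "call_rate": 0.95, "maf": 0.12},    # too many missing genotypes
    {"id": "v4", "info_by_batch": [0.99, 0.98], "call_rate": 0.999, "maf": 0.005},  # rare
]
kept = [v["id"] for v in variants if keep_for_relatedness(v)]
print(kept)  # ['v1']
```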

    We used a genotype block size of 1,000 in regenie step 1.

    hashtag
    Association tests

    We ran association tests with regenie for each of the 2,440 endpoints for each variant with a minimum allele count of 5 among each phenotype’s cases and controls. We used the approximate Firth test for variants with an initial p-value of less than 0.01 and computed the standard error based on effect size and likelihood ratio test p-value (regenie options --firth --approx --pThresh 0.01 --firth-se).
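Computing a standard error from an effect size and a likelihood ratio test p-value can be sketched as below. The sketch assumes the usual back-calculation in which the p-value is converted to a 1-degree-of-freedom test statistic and SE = |beta| / |z|; this illustrates the idea behind `--firth-se` and is not taken from regenie's source code.

```python
from statistics import NormalDist


def se_from_beta_and_pvalue(beta, pval):
    """Back-calculate a standard error from an effect size and a two-sided p-value.

    Assumption: the p-value corresponds to a 1-df chi-square statistic z**2,
    so that SE = |beta| / |z|.
    """
    if not 0.0 < pval < 1.0:
        raise ValueError("pval must be strictly between 0 and 1")
    z = abs(NormalDist().inv_cdf(pval / 2.0))
    return abs(beta) / z


se = se_from_beta_and_pvalue(beta=0.5, pval=0.05)
print(round(se, 4))
```

For example, an effect size of 0.5 with a p-value of 0.05 corresponds to |z| of about 1.96 and therefore a standard error of about 0.255.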

    D3_HEREDHAEMOLYTICANAEMIAOTHER
    D3_QUALIPATELETDEF
    E4_CYSTFIBRO_NAS
    E4_SPHIGLOLIPNAS
    G6_HEREMOSEN
    G6_OTHINMUSC
    Q17_BALANC_REARR_STRUCTURAL_MARKERS_NOT_ELSEW_CLASSIFIED
    https://storage.googleapis.com/finngen-public-data-r11/summary_stats/finngen_R11_manifest.tsv