SISu v4 reference panel

As of August 2021, the SISu v4 reference panel is used as the imputation panel in FinnGen.

Summary:

| Descriptors | |
| --- | --- |
| Total number of samples | 8,557 |
| Total number of variant alleles | 20,175,454 |
| Chromosomes | chrs 1-22, chrX |
| Variants included | SNV and InDel variants |
| Panel file format | BREF, phased VCF |
| Reference genome build | GRCh38 |
| Docs last updated | 17.08.2021 |

The SISu v4 reference panel contains samples from the following cohorts and sample collections: FINRISK, METSIM, Corogene, Finnish Dyslipidemia Study and Eastern Finland biobank samples. Genotype-, sample- and variant-wise quality control (QC) filtering procedures were applied in an iterative manner^ to the high-coverage WGS (hcWGS) data using the Hail framework v0.1 (unless mentioned otherwise).

1. Genotype-wise QC

All autosomal chromosomes:

Genotypes were marked as ‘missing’ (./.) if:

  • Sequencing read depth (DP) > 200; or

  • PHRED-scaled genotype quality (GQ) < 20; or

  • the proportion of informative reads (total allele depth [AD] / depth [DP]) < 0.9; or

  • For homozygous reference calls (0/0): the proportion of informative reference reads (reference AD / DP) < 0.9;

  • For heterozygous variant calls (0/1): the proportion of informative alternate reads (alternative AD / DP) was not within the interval 0.2-0.8 or the normalized PHRED-scaled probability of the reference genotype (pl[0]) < 20;

  • For homozygous variant calls (1/1): the proportion of informative alternate reads (alternative AD / DP) < 0.9 or pl[0] < 20.
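The autosomal rules above can be read as a single predicate per genotype. The following is a minimal sketch in plain Python, not the Hail v0.1 code used for the panel. The function name, the argument layout (AD as a (ref, alt) pair, PL as a 3-tuple) and the inclusive reading of the 0.2-0.8 interval are assumptions made for illustration.

```python
def autosomal_genotype_fails_qc(gt, dp, gq, ad, pl):
    """Return True if a bi-allelic autosomal genotype would be set to missing (./.).

    gt: "0/0", "0/1" or "1/1"; dp: read depth; gq: genotype quality;
    ad: (ref_reads, alt_reads); pl: (PL[0], PL[1], PL[2]).
    """
    ref_ad, alt_ad = ad
    if dp <= 0:  # assumption: a genotype without reads cannot be evaluated
        return True
    if dp > 200 or gq < 20:
        return True
    if (ref_ad + alt_ad) / dp < 0.9:  # proportion of informative reads
        return True
    if gt == "0/0":
        return ref_ad / dp < 0.9
    if gt == "0/1":
        alt_fraction = alt_ad / dp
        # assumption: the interval 0.2-0.8 includes both endpoints
        return not (0.2 <= alt_fraction <= 0.8) or pl[0] < 20
    if gt == "1/1":
        return alt_ad / dp < 0.9 or pl[0] < 20
    return False
```

For instance, a heterozygous call with 27 reference reads and 3 alternate reads at DP 30 has an alternate fraction of 0.1 and would be set to missing.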

Chromosome X:

Genotypes were marked as ‘missing’ (./.) if:

  • DP > 200; or

  • GQ < 10 for male individuals; or

  • GQ < 20 for female individuals; or

  • the proportion of informative reads (total allele depth [AD] / depth [DP]) < 0.9; or

  • For homozygous reference calls (0/0): the proportion of informative reads (reference AD / DP) < 0.9;

  • For heterozygous variant calls (0/1): the proportion of informative reads (alternative AD / DP) was not within the interval 0.2-0.8 or the normalized PHRED-scaled probability of the reference genotype (pl[0]) < 20 or the individual was male;

  • For homozygous variant calls (1/1): the proportion of informative reads (alternative AD / DP) < 0.9 or pl[0] < 20
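The chromosome X rules differ from the autosomal ones in two places: the GQ threshold depends on sex, and heterozygous calls in males are always set to missing. A minimal sketch of that logic in plain Python follows. The function name and argument layout are hypothetical, and it assumes male genotypes are stored as diploid calls (0/0 or 1/1).

```python
def chrx_genotype_fails_qc(gt, dp, gq, ad, pl, is_male):
    """Return True if a bi-allelic chrX genotype would be set to missing (./.)."""
    ref_ad, alt_ad = ad
    min_gq = 10 if is_male else 20  # sex-specific GQ threshold
    if dp <= 0 or dp > 200 or gq < min_gq:
        return True
    if (ref_ad + alt_ad) / dp < 0.9:  # proportion of informative reads
        return True
    if gt == "0/0":
        return ref_ad / dp < 0.9
    if gt == "0/1":
        if is_male:  # a male cannot be heterozygous outside the PAR
            return True
        # assumption: the interval 0.2-0.8 includes both endpoints
        return not (0.2 <= alt_ad / dp <= 0.8) or pl[0] < 20
    if gt == "1/1":
        return alt_ad / dp < 0.9 or pl[0] < 20
    return False
```

With these rules, a homozygous reference call with GQ 15 is kept for a male but set to missing for a female.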

2. Sample-wise QC/filtering

To include only high-quality samples, the following sample-wise QC criteria were first applied to bi-allelic variants on autosomes (excluding variants in low-complexity regions).

Variants were preserved (for calculating sample-wise metrics) if:

  • variant filter was PASS; and

  • quality by depth (QD) for SNPs was > 2 and for indels > 3; and

  • allele count (AC) was ≥ 3; and

  • Variant-wise call-rate (CR) > 90%; and

  • Hardy-Weinberg Equilibrium p-value (pHWE) > 1e-9.
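As an illustration, the criteria above combine into one conjunction per variant. This sketch assumes the per-variant values (filter status, QD, AC, call rate, pHWE) have already been computed elsewhere; the function and parameter names are hypothetical.

```python
def keep_variant_for_sample_metrics(variant_filter, is_snp, qd, ac, call_rate, p_hwe):
    """Return True if a variant is kept for calculating sample-wise QC metrics."""
    min_qd = 2 if is_snp else 3  # QD > 2 for SNPs, > 3 for indels
    return (
        variant_filter == "PASS"
        and qd > min_qd
        and ac >= 3
        and call_rate > 0.90
        and p_hwe > 1e-9
    )
```

Because all conditions are joined by "and", failing any single one drops the variant from the metric calculation.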

Outlier samples, defined as those with CR <= 95% or deviating by more than ±3SD from the mean of any sample-wise QC metric (nSNP, rHetHomVar, rTiTv, rInsertionDeletion, dpStDev), were identified and excluded (see Table 1 for metric definitions). In addition, a number of individuals with low indel quality, visible as outliers with a markedly lower rInsertionDeletion, were removed to correct observed batch effects.

To identify closely related samples, the data were first restricted to bi-allelic variants, excluding variants in high-LD regions and variants with pHWE <= 1e-9, minor allele frequency (MAF) < 0.05 or variant-wise CR <= 90%, and then LD-pruned with a window size of 1M and r2 = 0.2. Closely related individuals (kinship coefficient > 0.177) were identified from the LD-pruned data with the KING method in Plink v2.0 and excluded.

Then, the top 20 principal components (PCs) were computed, and outlier samples were identified from these 20 PCs and excluded. Next, individuals with ambiguous sex (samples whose imputed sex conflicted with the biobank-reported sex or whose imputed sex was 'ambiguous'; see 'Sample sex inference' in section 5) were identified and excluded. Finally, 53 further samples were removed as obvious QC outliers on the following sample-wise metrics (rInsertionDeletion < 0.985, nSingleton > 50000).

Outlier samples listed in the above steps were excluded from the data before applying variant-wise QC/filtering.
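A minimal sketch of the outlier rule, assuming the sample-wise metrics are already available as plain dictionaries. The function name and data layout are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the source does not say which estimator was used.

```python
from statistics import mean, stdev


def flag_outlier_samples(metrics, call_rates, n_sd=3.0, min_call_rate=0.95):
    """Return the set of sample IDs failing the call-rate or the ±n_sd rule.

    metrics: {metric_name: {sample_id: value}}
    call_rates: {sample_id: call_rate}
    """
    flagged = {s for s, cr in call_rates.items() if cr <= min_call_rate}
    for values in metrics.values():
        mu = mean(values.values())
        sd = stdev(values.values())  # assumption: sample SD
        for sample, value in values.items():
            if sd > 0 and abs(value - mu) > n_sd * sd:
                flagged.add(sample)
    return flagged


# Toy data: 20 typical samples and one with an extreme rTiTv value.
metrics = {"rTiTv": {f"s{i}": 2.0 for i in range(20)}}
metrics["rTiTv"]["bad"] = 10.0
call_rates = {s: 0.99 for s in metrics["rTiTv"]}
call_rates["s0"] = 0.95  # CR <= 95% is also an exclusion criterion
outliers = flag_outlier_samples(metrics, call_rates)
```

Calling the function again with n_sd=4.0 gives the looser threshold mentioned in the footnote on the iterative procedure.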
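To illustrate the relatedness step, the sketch below takes pairwise kinship estimates (for example parsed from a KING-style table) and picks samples to drop so that no remaining pair exceeds the 0.177 cutoff, the conventional KING boundary for first-degree relatives. The source does not state which member of a related pair was removed. The greedy rule used here (repeatedly drop the sample with the most remaining relatives) is an assumption for illustration only, and the function name is hypothetical.

```python
from collections import defaultdict


def samples_to_exclude(pairs, cutoff=0.177):
    """pairs: iterable of (sample_a, sample_b, kinship). Returns a set of sample IDs."""
    related = defaultdict(set)
    for a, b, kinship in pairs:
        if kinship > cutoff:
            related[a].add(b)
            related[b].add(a)
    excluded = set()
    while True:
        # Count, for each remaining sample, how many relatives are still included.
        degrees = {
            s: len(neighbours - excluded)
            for s, neighbours in related.items()
            if s not in excluded
        }
        degrees = {s: d for s, d in degrees.items() if d > 0}
        if not degrees:
            break
        # Assumption: greedy choice, ties broken alphabetically for determinism.
        worst = max(sorted(degrees), key=degrees.get)
        excluded.add(worst)
    return excluded
```

For the pairs A-B (0.25), A-C (0.26) and D-E (0.05), only A is dropped: it is related to both B and C, while D and E fall below the cutoff.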

Table 1. Definition of terms

| Name | Type | Description |
| --- | --- | --- |
| callRate | Double | Fraction of genotypes called |
| nHomRef | Int | Number of homozygous reference genotypes |
| nHet | Int | Number of heterozygous genotypes |
| nHomVar | Int | Number of homozygous alternate genotypes |
| nCalled | Int | Sum of nHomRef + nHet + nHomVar |
| nNotCalled | Int | Number of uncalled genotypes |
| nSNP | Int | Number of SNP alternate alleles |
| nInsertion | Int | Number of insertion alternate alleles |
| nDeletion | Int | Number of deletion alternate alleles |
| nSingleton | Int | Number of private alleles |
| nTransition | Int | Number of transition (A-G, C-T) alternate alleles |
| nTransversion | Int | Number of transversion alternate alleles |
| nNonRef | Int | Sum of nHet and nHomVar |
| rTiTv | Double | Transition/Transversion ratio |
| rHetHomVar | Double | Het/HomVar genotype ratio |
| rInsertionDeletion | Double | Insertion/Deletion ratio |
| dpMean | Double | Depth mean across all genotypes |
| dpStDev | Double | Depth standard deviation across all genotypes |
| gqMean | Double | The average genotype quality across all genotypes |
| gqStDev | Double | Genotype quality standard deviation across all genotypes |

3. Variant-wise QC/filtering

Mitochondrial and chromosome Y variants, variants within pseudo-autosomal regions on chromosome X (X PAR region) and variants in low-complexity regions (LCR) on any chromosome were excluded. Multiallelic sites were decomposed into bi-allelic format so that they could be preserved.

Autosomal chromosomes:

Genotypes were set as missing based on the above-mentioned (step 1 Genotype-wise QC/filtering) thresholds.

Next, variants were preserved, if:

  • variant filter was PASS; and

  • QD for SNPs > 2 and for indels > 4; and

  • AC > 0; and

  • Variant-wise CR > 95%; and

  • pHWE > 1e-9.
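Because the call rate and allele count are evaluated after low-quality genotypes have been set to missing, the two QC layers interact. The sketch below derives AC, AN and call rate from a list of already-masked genotype strings and then applies the thresholds listed above. The function names are hypothetical, QD and pHWE are assumed to be precomputed, and genotypes are assumed to be unphased bi-allelic strings.

```python
def variant_summary(genotypes):
    """Return (AC, AN, call_rate) for bi-allelic genotype strings such as '0/1' or './.'."""
    called = [g for g in genotypes if g != "./."]
    ac = sum(g.split("/").count("1") for g in called)  # alternate allele count
    an = 2 * len(called)                               # called alleles
    call_rate = len(called) / len(genotypes) if genotypes else 0.0
    return ac, an, call_rate


def keep_variant(variant_filter, is_snp, qd, genotypes, p_hwe):
    """Apply the variant-wise thresholds to genotypes that have passed genotype-wise QC."""
    ac, _, call_rate = variant_summary(genotypes)
    min_qd = 2 if is_snp else 4  # QD > 2 for SNPs, > 4 for indels
    return (
        variant_filter == "PASS"
        and qd > min_qd
        and ac > 0
        and call_rate > 0.95
        and p_hwe > 1e-9
    )
```

A variant whose only alternate alleles sat in genotypes that were set to missing ends up with AC = 0 and is therefore removed at this stage.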

Chromosome X:

Genotypes were marked as ‘missing’ (./.) if:

  • DP > 200; or

  • GQ < 10 for male individuals; or

  • GQ < 20 for female individuals; or

  • the proportion of informative reads (total allele depth [AD] / depth [DP]) < 0.9; or

  • For homozygous reference calls (0/0): the proportion of informative reads (reference AD / DP) < 0.9;

  • For heterozygous variant calls (0/1): the proportion of informative reads (alternative AD / DP) was not within the interval 0.2-0.8 or the normalized PHRED-scaled probability of the reference genotype (pl[0]) < 20 or the individual was male;

  • For homozygous variant calls (1/1): the proportion of informative reads (alternative AD / DP) < 0.9 or pl[0] < 20;

Next, variants were preserved, if:

  • variant filter was PASS; and

  • QD for SNPs > 2 and for indels > 4; and

  • AC > 0; and

  • Variant-wise CR > 95%; and

  • pHWE > 1e-9 (pHWE was calculated by using only female individuals).

4. Imputation reference panel generation

To generate a high-quality hcWGS reference panel, the QC'ed data were further filtered by excluding variants with AC < 3 (symmetrically*). Haplotype phasing was carried out with Eagle 2.4.1 using the default parameters, except that the number of conditioning haplotypes was set to 20,000. Beagle-specific reference panel files (.bref) were created as instructed by the Beagle authors.
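The symmetric allele-count filter can be expressed as a small predicate on the INFO/AC and INFO/AN values of each variant. This is a sketch of the same logic as the bcftools expression quoted in the footnote, written in plain Python for illustration; the function name is hypothetical.

```python
def passes_symmetric_ac_filter(ac, an, min_count=3):
    """Keep a variant only if both the alternate and the reference allele
    are observed at least min_count times (AC >= 3 and AN - AC >= 3)."""
    return ac >= min_count and (an - ac) >= min_count


# With all 8,557 samples called at a diploid site, AN = 17,114.
an = 2 * 8557
kept = [ac for ac in (1, 2, 3, an - 3, an - 2) if passes_symmetric_ac_filter(ac, an)]
```

The second condition is what makes the filter symmetric: a variant where the reference allele is seen only once or twice is removed just like a singleton or doubleton alternate allele.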

5. Additional information

Sample sex inference

Bi-allelic variants on chromosome X, excluding the ones on low-complexity or pseudo-autosomal regions, were utilised to identify samples with ambiguous sex for exclusion.

Genotypes were set as missing, if:

  • Sequencing read depth (DP) > 200; or

  • PHRED-scaled genotype quality (GQ) < 20; or

  • For homozygous reference calls (0/0): the proportion of informative reads (reference AD / DP) < 0.9;

  • For heterozygous variant calls (0/1): the proportion of informative reads (total allele depth [AD] / depth [DP]) < 0.9 or the proportion of informative reads (alternative AD / DP) < 0.25 or the normalized PHRED-scaled probability of the reference genotype (pl[0]) < 20 or p-value for pulling the given allelic depth from a binomial distribution with mean 0.5 (pAB) < 1e-9;

  • For homozygous variant calls (1/1): the proportion of informative reads (alternative AD / DP) < 0.9 or pl[0] < 20;
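The pAB criterion for heterozygous calls is a binomial test of allele balance. As a rough illustration, the sketch below computes a two-sided exact binomial p-value for the observed alternate read count under an expected fraction of 0.5. It is written in plain Python and is not the Hail implementation; the function name, the choice of ref + alt reads as the number of trials and the two-sided definition (summing all outcomes no more likely than the observed one) are assumptions.

```python
from math import comb


def allele_balance_p(ref_ad, alt_ad, expected=0.5):
    """Two-sided exact binomial p-value for seeing alt_ad alternate reads
    out of ref_ad + alt_ad reads when the true alternate fraction is `expected`."""
    n = ref_ad + alt_ad
    probs = [
        comb(n, i) * expected**i * (1 - expected) ** (n - i)
        for i in range(n + 1)
    ]
    observed = probs[alt_ad]
    # Sum every outcome that is at most as likely as the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(q for q in probs if q <= observed * (1 + 1e-7))
    return min(1.0, p_value)
```

Under this definition, a heterozygous call supported by 100 reference reads and no alternate reads has a p-value far below 1e-9 and would be set to missing, while a balanced 15/15 call has a p-value of 1.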

Variants were preserved, if:

  • variant filter was PASS; and

  • QD for SNPs > 2 and for indels > 3; and

  • AC >= 1; and

  • Variant-wise CR > 80%

Genetic sex was imputed with Plink v1.90b6.18 from the X-chromosome inbreeding coefficient (F), using thresholds of 0.2 and 0.8 for females and males, respectively.

Samples with ambiguous sex information (opposite imputed-sex compared with biobank sex report or imputed as 'ambiguous') were identified and excluded.
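The sex check reduces to two small decisions: classify each sample from its X-chromosome F estimate, then compare the result with the reported sex. In Plink 1.9, F estimates below the first threshold give a female call and estimates above the second give a male call; anything in between is left ambiguous. A sketch in plain Python, with hypothetical function names and a hypothetical string encoding for sex:

```python
def impute_sex(f_stat, female_max=0.2, male_min=0.8):
    """Classify a sample from its X-chromosome inbreeding coefficient (F)."""
    if f_stat < female_max:
        return "female"
    if f_stat > male_min:
        return "male"
    return "ambiguous"


def exclude_for_sex(f_stat, reported_sex):
    """True if the sample would be excluded: imputed sex is ambiguous
    or conflicts with the biobank-reported sex."""
    imputed = impute_sex(f_stat)
    return imputed == "ambiguous" or imputed != reported_sex
```

A sample reported as female with F = 0.95 is imputed as male and excluded, as is any sample with F between 0.2 and 0.8.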

Summary of SISu v4.0 reference panel as pdf.

See also Genotype Imputation section for general information about imputation.

Footnotes:

^Iterative manner: each step was repeated (apply filter, calculate and plot metrics, check the result) until the results were satisfactory, and only then did we move on to the next step. For example, in the sample-wise QC step we started by removing outliers deviating by more than ±4SD from the sample-wise QC metrics (nSNP, rHetHomVar, rTiTv, rInsertionDeletion, dpStDev), then recalculated and replotted these metrics. As the result was not satisfactory, we tightened the threshold to ±3SD.

*Symmetrically: variants with a minor allele count (MAC) < 3 were removed at both the lower and upper ends of the allele frequency spectrum. In practice, with bcftools, this is applied as: bcftools view -e 'INFO/AC<3 | INFO/AN-INFO/AC<3'. This removes ‘singleton’ and ‘doubleton’ variants (from both ends of the allele frequency spectrum), which cannot be phased accurately.
