SISu v4 reference panel
As of August 2021, the SISu v4 reference panel is used as the imputation panel in FinnGen.
Summary:
Total number of samples: 8,554
Total number of variant alleles: 20,175,454
Chromosomes: chrs 1-22, chrX
Variants included: SNV and InDel variants
Panel file format: BREF, phased VCF
Reference genome build: GRCh38
Docs last updated: 17.08.2021
The SISu v4 reference panel contains samples from the following cohorts and sample collections: FINRISK, METSIM, Corogene, Finnish Dyslipidemia Study and Eastern Finland biobank samples. Genotype-, sample- and variant-wise quality control (QC) filtering procedures were applied in an iterative manner^ to the high-coverage WGS (hcWGS) data using the Hail framework v0.1 (unless mentioned otherwise).
1. Genotype-wise QC
All autosomal chromosomes:
Genotypes were marked as ‘missing’ (./.) if:
Sequencing read depth (DP) > 200; or
PHRED-scaled genotype quality (GQ) < 20; or
the proportion of informative reads (total allele depth [AD] / depth [DP]) < 0.9; or
For homozygous reference calls (0/0): the proportion of informative reference reads (reference AD / DP) < 0.9;
For heterozygous variant calls (0/1): the proportion of informative alternate reads (alternative AD / DP) was not within the interval 0.2-0.8 or the normalized PHRED-scaled probability of the reference genotype (pl[0]) < 20;
For homozygous variant calls (1/1): the proportion of informative alternate reads (alternative AD / DP) < 0.9 or pl[0] < 20.
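The autosomal rules above can be sketched as a single predicate. This is a minimal illustration in plain Python, not the Hail v0.1 expression used for the panel. The `Genotype` class and its field names are hypothetical, the 0.2-0.8 interval is assumed to include its endpoints, and a call with zero depth is assumed to be treated as missing.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Genotype:
    """One diploid bi-allelic call; field names are illustrative, not Hail's."""
    gt: Tuple[int, int]   # (0, 0), (0, 1) or (1, 1)
    dp: int               # sequencing read depth (DP)
    gq: int               # PHRED-scaled genotype quality (GQ)
    ad: Tuple[int, int]   # allele depths: (reference AD, alternate AD)
    pl0: int              # normalized PHRED-scaled likelihood of 0/0 (pl[0])


def is_missing_autosomal(g: Genotype) -> bool:
    """Return True if the call would be set to ./. under the autosomal rules."""
    if g.dp > 200 or g.gq < 20:
        return True
    if g.dp == 0:                      # assumption: no reads means no usable call
        return True
    if sum(g.ad) / g.dp < 0.9:         # total AD / DP
        return True
    ref_frac = g.ad[0] / g.dp
    alt_frac = g.ad[1] / g.dp
    if g.gt == (0, 0):
        return ref_frac < 0.9
    if g.gt == (0, 1):
        # assumption: the 0.2-0.8 interval includes both endpoints
        return not (0.2 <= alt_frac <= 0.8) or g.pl0 < 20
    if g.gt == (1, 1):
        return alt_frac < 0.9 or g.pl0 < 20
    return False
```

For example, a heterozygous call with DP 30, GQ 60 and allele depths (15, 15) is kept, while the same call with allele depths (27, 3) is set to missing because only 10% of the reads support the alternate allele.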
Chromosome X:
Genotypes were marked as ‘missing’ (./.) if:
DP > 200; or
GQ < 10 for male individuals; or
GQ < 20 for female individuals; or
the proportion of informative reads (total allele depth [AD] / depth [DP]) < 0.9; or
For homozygous reference calls (0/0): the proportion of informative reads (reference AD / DP) < 0.9;
For heterozygous variant calls (0/1): the proportion of informative reads (alternative AD / DP) was not within the interval 0.2-0.8 or the normalized PHRED-scaled probability of the reference genotype (pl[0]) < 20 or the individual was male;
For homozygous variant calls (1/1): the proportion of informative reads (alternative AD / DP) < 0.9 or pl[0] < 20
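The chromosome X rules differ from the autosomal ones in two places: the GQ threshold depends on sex, and heterozygous calls in males are always set to missing. A minimal sketch in plain Python follows. The function name and argument layout are hypothetical, and the inclusive 0.2-0.8 interval and the zero-depth guard are assumptions.

```python
def is_missing_chrx(gt, dp, gq, ad, pl0, is_male):
    """Return True if a chromosome X call would be set to ./. .

    gt is a tuple such as (0, 1); ad is (reference AD, alternate AD).
    """
    min_gq = 10 if is_male else 20            # sex-specific GQ threshold
    if dp > 200 or dp == 0 or gq < min_gq:    # dp == 0 is an added guard
        return True
    if sum(ad) / dp < 0.9:                    # total AD / DP
        return True
    alt_frac = ad[1] / dp
    if gt == (0, 0):
        return ad[0] / dp < 0.9
    if gt == (0, 1):
        # heterozygous calls are never kept for male individuals
        return is_male or not (0.2 <= alt_frac <= 0.8) or pl0 < 20
    if gt == (1, 1):
        return alt_frac < 0.9 or pl0 < 20
    return False
```

With these rules, a well-supported heterozygous call is kept for a female individual but set to missing for a male one, and a homozygous variant call with GQ 15 is kept for a male individual but not for a female one.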
2. Sample-wise QC/filtering
To include only high-quality samples, the following sample-wise QC criteria were first applied to bi-allelic variants on autosomes (excluding variants in low-complexity regions).
Variants were preserved (for calculating sample-wise metrics) if:
variant filter was PASS; and
quality by depth (QD) for SNPs was > 2 and for indels > 3; and
allele count (AC) was ≥ 3; and
Variant-wise call-rate (CR) > 90%; and
Hardy-Weinberg Equilibrium p-value (pHWE) > 1e-9.
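As an illustration, the variant-level criteria above combine into one boolean check. The sketch below is plain Python with hypothetical argument names. It assumes QD, AC, call rate and pHWE have already been computed per variant, and it exposes the thresholds as parameters because the later variant-wise QC step uses different values.

```python
def keep_variant(filters, is_snp, qd, ac, call_rate, p_hwe,
                 min_qd_snp=2.0, min_qd_indel=3.0, min_ac=3,
                 min_call_rate=0.90, min_p_hwe=1e-9):
    """Return True if a variant is used for calculating sample-wise metrics.

    Defaults follow the sample-wise QC step: PASS, QD > 2 (SNPs) or > 3
    (indels), AC >= 3, call rate > 90% and pHWE > 1e-9.
    """
    min_qd = min_qd_snp if is_snp else min_qd_indel
    return (
        filters == "PASS"
        and qd > min_qd
        and ac >= min_ac
        and call_rate > min_call_rate
        and p_hwe > min_p_hwe
    )
```

The same function can express the stricter variant-wise QC thresholds of section 3 by passing `min_qd_indel=4.0`, `min_ac=1` and `min_call_rate=0.95`.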
Samples with CR <= 95%, or deviating more than ±3SD on the sample-wise QC metrics (nSNP, rHetHomVar, rTiTv, rInsertionDeletion, dpStDev; see Table 1), were identified as outliers and excluded. In addition, a number of individuals with low indel quality (outliers with an obviously lower rInsertionDeletion value) were removed from the data to correct observed batch effects.
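A single pass of this outlier rule can be sketched as follows. It is an illustration only: the data layout (a dict of per-sample metric dicts) is hypothetical, and it is an assumption that the mean and SD of each metric are computed over all samples, including those that fail the call-rate threshold.

```python
from statistics import mean, stdev

# Metrics named in the text; the keys follow Table 1.
METRICS = ("nSNP", "rHetHomVar", "rTiTv", "rInsertionDeletion", "dpStDev")


def find_outliers(samples, n_sd=3.0, min_call_rate=0.95):
    """Return the IDs of samples failing the call-rate or ±n_sd SD criteria.

    samples maps a sample ID to a dict holding "callRate" and each metric.
    """
    outliers = {sid for sid, m in samples.items()
                if m["callRate"] <= min_call_rate}
    for metric in METRICS:
        values = [m[metric] for m in samples.values()]
        mu, sd = mean(values), stdev(values)
        for sid, m in samples.items():
            if abs(m[metric] - mu) > n_sd * sd:
                outliers.add(sid)
    return outliers
```

Calling `find_outliers(samples, n_sd=4.0)` first and then re-running with `n_sd=3.0` on the remaining samples mirrors the iterative tightening described in the footnote.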
In order to identify closely related samples, the data was first restricted to bi-allelic variants, excluding variants in high-LD regions and variants with pHWE <= 1e-9, minor allele frequency (MAF) < 0.05 or variant-wise CR <= 90%, and then LD-pruned with window size 1M and r2 = 0.2. With the Plink v2.0 KING method, closely related individuals (kinship coefficient > 0.177) were identified from the LD-pruned data and excluded.
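To make the kinship step concrete, the sketch below treats pairs with a KING kinship coefficient above the 0.177 cutoff as closely related and picks samples to exclude until no such pair remains. The greedy strategy (drop the sample with the most close relatives first) is an assumption made for illustration. It is not a description of what Plink does internally, and the input layout is hypothetical.

```python
from collections import defaultdict


def related_to_exclude(pairs, cutoff=0.177):
    """Pick samples to drop so that no pair above the kinship cutoff remains.

    pairs is an iterable of (id1, id2, kinship) tuples.
    """
    neighbours = defaultdict(set)
    for a, b, kinship in pairs:
        if kinship > cutoff:
            neighbours[a].add(b)
            neighbours[b].add(a)

    excluded = set()
    while any(neighbours.values()):
        # Drop the sample with the most remaining close relatives;
        # ties are broken by the smallest ID so the result is deterministic.
        worst = max(sorted(neighbours), key=lambda s: len(neighbours[s]))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        excluded.add(worst)
    return excluded
```

For a trio in which sample A is closely related to both B and C, only A is dropped, which keeps more samples than removing every member of every related pair.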
Then, the top 20 principal components (PCs) were computed, and outlier samples to be excluded were identified based on these 20 PCs. Next, individuals with ambiguous sex (samples with imputed sex conflicting with biobank reported sex or for which imputed sex is 'ambiguous'; see Sample sex inference in section 5) were identified and excluded. Furthermore, 53 additional samples were removed as obvious QC outliers based on the following sample-wise metrics (rInsertionDeletion < 0.985, nSingleton > 50000).
Outlier samples listed in the above steps were excluded from the data before applying variant-wise QC/filtering.
Table 1. Definition of terms
Name | Type | Description
callRate | Double | Fraction of genotypes called
nHomRef | Int | Number of homozygous reference genotypes
nHet | Int | Number of heterozygous genotypes
nHomVar | Int | Number of homozygous alternate genotypes
nCalled | Int | Sum of nHomRef + nHet + nHomVar
nNotCalled | Int | Number of uncalled genotypes
nSNP | Int | Number of SNP alternate alleles
nInsertion | Int | Number of insertion alternate alleles
nDeletion | Int | Number of deletion alternate alleles
nSingleton | Int | Number of private alleles
nTransition | Int | Number of transition (A-G, C-T) alternate alleles
nTransversion | Int | Number of transversion alternate alleles
nNonRef | Int | Sum of nHet and nHomVar
rTiTv | Double | Transition/Transversion ratio
rHetHomVar | Double | Het/HomVar genotype ratio
rInsertionDeletion | Double | Insertion/Deletion ratio
dpMean | Double | Depth mean across all genotypes
dpStDev | Double | Depth standard deviation across all genotypes
gqMean | Double | The average genotype quality across all genotypes
gqStDev | Double | Genotype quality standard deviation across all genotypes
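The derived fields in Table 1 follow directly from the per-sample counts. The sketch below shows the arithmetic in plain Python, assuming the counts are available in a dict keyed by the Table 1 names. Returning None for a zero denominator is an assumption of this sketch.

```python
def derived_metrics(counts):
    """Compute the sums and ratios of Table 1 from per-sample counts."""
    def ratio(numerator, denominator):
        # assumption: an undefined ratio is reported as None
        return numerator / denominator if denominator else None

    n_called = counts["nHomRef"] + counts["nHet"] + counts["nHomVar"]
    return {
        "nCalled": n_called,
        "nNonRef": counts["nHet"] + counts["nHomVar"],
        "callRate": ratio(n_called, n_called + counts["nNotCalled"]),
        "rTiTv": ratio(counts["nTransition"], counts["nTransversion"]),
        "rHetHomVar": ratio(counts["nHet"], counts["nHomVar"]),
        "rInsertionDeletion": ratio(counts["nInsertion"], counts["nDeletion"]),
    }
```

For instance, a sample with 9 insertion and 10 deletion alternate alleles has rInsertionDeletion = 0.9.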
3. Variant-wise QC/filtering
Mitochondrial and chromosome Y variants, variants within pseudo-autosomal regions on chromosome X (X PAR region) and variants in low-complexity regions (LCR) on any chromosome were excluded. In order to preserve multiallelic sites, they were decomposed into bi-allelic format.
Autosomal chromosomes:
Genotypes were set as missing based on the above-mentioned (step 1 Genotype-wise QC/filtering) thresholds.
Next, variants were preserved, if:
variant filter was PASS; and
QD for SNPs > 2 and for indels > 4; and
AC > 0; and
Variant-wise CR > 95%; and
pHWE > 1e-9.
Chromosome X:
Genotypes were marked as ‘missing’ (./.) if:
DP > 200; or
GQ < 10 for male individuals; or
GQ < 20 for female individuals; or
the proportion of informative reads (total allele depth [AD] / depth [DP]) < 0.9; or
For homozygous reference calls (0/0): the proportion of informative reads (reference AD / DP) < 0.9;
For heterozygous variant calls (0/1): the proportion of informative reads (alternative AD / DP) was not within the interval 0.2-0.8 or the normalized PHRED-scaled probability of the reference genotype (pl[0]) < 20 or the individual was male;
For homozygous variant calls (1/1): the proportion of informative reads (alternative AD / DP) < 0.9 or pl[0] < 20;
Next, variants were preserved, if:
variant filter was PASS; and
QD for SNPs > 2 and for indels > 4; and
AC > 0; and
Variant-wise CR > 95%; and
pHWE > 1e-9 (pHWE was calculated by using only female individuals).
4. Imputation reference panel generation
To generate a high-quality hcWGS reference panel, the QC'ed data was further filtered and variants with AC < 3 (symmetrically*) were excluded. Haplotype phasing was carried out with Eagle 2.4.1 software with the default parameters, except that the number of conditioning haplotypes was set to 20,000. Beagle-specific reference panel files (.bref) were created as instructed by the Beagle authors.
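The symmetric allele-count filter can be written as a one-line check: a variant is dropped when either the alternate allele count (AC) or the remaining allele count (AN - AC) is below 3. The sketch below is plain Python with a hypothetical function name. The panel itself was filtered with the bcftools expression given in the footnote.

```python
def keep_for_panel(ac, an, min_count=3):
    """Return True if neither allele of a bi-allelic variant is rarer than min_count.

    Mirrors excluding variants where INFO/AC < 3 or INFO/AN - INFO/AC < 3.
    """
    return ac >= min_count and (an - ac) >= min_count
```

With AN = 17108 (for illustration, 8,554 diploid samples with no missing genotypes), variants with AC of 1 or 2 are removed, and so are variants with AC of 17106 or 17107, where it is the reference allele that is a singleton or doubleton.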
5. Additional information
Sample sex inference
Bi-allelic variants on chromosome X, excluding the ones on low-complexity or pseudo-autosomal regions, were utilised to identify samples with ambiguous sex for exclusion.
Genotypes were set as missing, if:
Sequencing read depth (DP) > 200; or
PHRED-scaled genotype quality (GQ) < 20; or
For homozygous reference calls (0/0): the proportion of informative reads (reference AD / DP) < 0.9;
For heterozygous variant calls (0/1): the proportion of informative reads (total allele depth [AD] / depth [DP]) < 0.9 or the proportion of informative reads (alternative AD / DP) < 0.25 or the normalized PHRED-scaled probability of the reference genotype (pl[0]) < 20 or p-value for pulling the given allelic depth from a binomial distribution with mean 0.5 (pAB) < 1e-9;
For homozygous variant calls (1/1): the proportion of informative reads (alternative AD / DP) < 0.9 or pl[0] < 20;
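The allele-balance p-value (pAB) used for heterozygous calls can be illustrated with a two-sided exact binomial test against an expected alternate-read fraction of 0.5. This is a sketch under that assumption. The exact test implemented in Hail v0.1 may differ in detail, and the function name is hypothetical.

```python
from math import comb


def p_ab(ref_reads, alt_reads):
    """Two-sided exact binomial p-value for the observed allelic depths, p = 0.5."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    # Probability of every possible alternate-read count under Binomial(n, 0.5).
    probs = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    observed = probs[alt_reads]
    # Sum the outcomes that are at most as likely as the observed one;
    # the small tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-7)))
```

A perfectly balanced call such as 15 reference and 15 alternate reads gives a p-value of 1, whereas 40 reads that all support one allele give a p-value of about 1.8e-12, which is below the 1e-9 threshold.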
Variants were preserved, if:
variant filter was PASS; and
QD for SNPs > 2 and for indels > 3; and
AC >= 1; and
Variant-wise CR > 80%
Genetic sex was imputed with Plink v1.90b6.18 using thresholds of 0.2 and 0.8 on the X-chromosome inbreeding coefficient (F) for females and males, respectively.
Samples with ambiguous sex information (opposite imputed-sex compared with biobank sex report or imputed as 'ambiguous') were identified and excluded.
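The classification and exclusion rule can be sketched as below. Plink calls a sample female when its X-chromosome F estimate is below the lower threshold and male when it is above the upper threshold. The function names and the string labels are hypothetical choices for this illustration.

```python
def impute_sex(f_stat, female_max=0.2, male_min=0.8):
    """Classify a sample from its X-chromosome inbreeding coefficient F."""
    if f_stat < female_max:
        return "female"
    if f_stat > male_min:
        return "male"
    return "ambiguous"


def has_ambiguous_sex(f_stat, reported_sex):
    """Return True if the sample would be excluded for its sex information.

    A sample is flagged when the imputed sex is 'ambiguous' or when it
    conflicts with the biobank-reported sex.
    """
    imputed = impute_sex(f_stat)
    return imputed == "ambiguous" or imputed != reported_sex
```

For example, a sample reported as female with F = 0.95 is flagged because it is imputed as male, and any sample with F = 0.5 is flagged regardless of its reported sex.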
Summary of SISu v4.0 reference panel as pdf.
See also Genotype Imputation section for general information about imputation.
Footer:
^Iterative manner: For a given step, for example the sample-wise QC step, we started by removing outliers deviating more than ±4SD on the sample-wise QC metrics (nSNP, rHetHomVar, rTiTv, rInsertionDeletion, dpStDev). After removing them, we recalculated and replotted these metrics. We were not satisfied with the result, so we adjusted our threshold to ±3SD. ‘Applied in an iterative manner’ therefore means repeating the same process (apply filter, calculate/plot metrics, check result) until the results are satisfactory, and only then moving on to the next step.
*Symmetrically: variants with MAC < 3 are removed at both the lower and the upper end of the frequency spectrum. In practice, with bcftools, this is applied as: bcftools view -e 'INFO/AC<3| INFO/AN-INFO/AC<3'. Doing this removes ‘singleton’ and ‘doubleton’ variants (from both ends of the allele frequency spectrum), which can’t be phased accurately.