SISu v3 reference panel

As of FinnGen R2, the SISu v3 reference panel is used as the imputation panel in FinnGen.

High-coverage (25-30x) WGS data for 4,083 Finns were generated at the Broad Institute of MIT and Harvard and at the McDonnell Genome Institute at Washington University, and were jointly processed at the Broad Institute. The variant call set was produced with the GATK HaplotypeCaller algorithm, following GATK best practices for variant calling. Genotype-, variant- and sample-wise QC was applied iteratively using Hail v0.1. To generate the imputation reference panel, the quality-controlled data for 3,775 high-quality samples were further filtered by excluding variants with allele count (AC) < 3. Haplotype phasing was carried out with Eagle 2.3.5 using the default parameters, except that the number of conditioning haplotypes was set to 20,000.

As of July 2021, FinnGen genotype imputation is carried out according to the Genotype imputation workflow v3.0 V.2 protocol. See also Genotype imputation for details on the imputation process.

Detailed description:

The SISu v3 reference panel contains samples from the following cohorts/studies:

  • METSIM* (PIs: Markku Laakso and Mike Boehnke)

  • FINRISK**,#¤ (PI: Pekka Jousilahti)

  • Health2000**,#% (PI: Seppo Koskinen)

  • Finnish Migraine Family Study**,#& (PI: Aarno Palotie)

  • Merck/Tienari samples** (PI: Pentti Tienari)

  • MESTA samples** (PI: Jaana Suvisaari)

*sequenced at the McDonnell Genome Institute at Washington University

** sequenced at the Broad Institute of MIT and Harvard

# part of the SISu v3-THL reference panel available through the National Institute for Health and Welfare (THL)

¤ FINRISK, N=1160, (National Institute for Health and Welfare/THL Biobank). Representative, cross-sectional population surveys from six areas of Finland. Baseline years 1992, 1997, 2002 and 2007.

% Health 2000 and 2011 Surveys, N=205, (National Institute for Health and Welfare/THL Biobank). A nationally representative sample of individuals living in Finland in 2000-2001.

& The Finnish Migraine Family Study, N=403, (University of Helsinki/THL Biobank). The sample collection includes persons who have been diagnosed with migraine and their family members. The samples have been collected since 1992. First-degree relatives in the Migraine Study have been removed from the reference panel data.

High-coverage WGS data processing

Genotype-, sample- and variant-wise quality control (QC) filtering procedures were applied in an iterative manner to the high-coverage WGS (hcWGS) data using the Hail framework v0.1 (unless mentioned otherwise).

1. Genotype-wise QC

Genotypes were set as missing if:

  • sequencing read depth (DP) was > 200; or

  • PHRED-scaled genotype quality (GQ) value was < 20; or

  • the proportion of informative reads (total allele depth [AD] / depth [DP]) was < 0.9; or

for homozygous reference calls:

  • the proportion of informative reads (reference AD / DP) was < 0.9;

for heterozygous variant calls (autosomal chromosomes):

  • the proportion of informative reads (alternative AD / DP) was not within the interval 0.2-0.8; or

  • the normalized PHRED-scaled probability of the reference genotype (pl[0]) was < 20;

for heterozygous variant calls (chromosome X):

  • the proportion of informative reads (total AD / DP) was < 0.9; or

  • the proportion of informative reads (alternative AD / DP) was < 0.25; or

  • pl[0] was < 20; or

  • p-value for pulling the given allelic depth from a binomial distribution with mean 0.5 (pAB) was < 1e-9; or

  • the gender was male

for homozygous variant calls:

  • the proportion of informative reads (alternative AD / DP) was < 0.9; or

  • pl[0] was < 20.
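The genotype-wise rules above can be sketched as a single predicate. This is only an illustration in plain Python, not the Hail v0.1 expressions used in the actual pipeline: the function names, the string labels for genotype classes, the bi-allelic `(ref, alt)` layout of AD and the exact two-sided binomial test used for pAB are all assumptions made for the example.

```python
from math import comb
from typing import Optional, Sequence


def binomial_pab(ref_ad: int, alt_ad: int) -> float:
    """Two-sided exact binomial p-value for allele balance under p = 0.5.

    Assumption: this mirrors the pAB definition quoted in the text; the
    pipeline's own implementation may differ in detail.
    """
    n = ref_ad + alt_ad
    k = min(ref_ad, alt_ad)
    # With p = 0.5 the distribution is symmetric, so the two-sided p-value
    # is twice the lower tail, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def genotype_fails_qc(
    gt: str,                      # hypothetical labels: "hom_ref", "het", "hom_var"
    dp: int,                      # sequencing read depth
    gq: int,                      # PHRED-scaled genotype quality
    ad: Sequence[int],            # (reference AD, alternative AD), bi-allelic
    pl: Sequence[int],            # normalized PHRED-scaled likelihoods
    chrom: str = "1",
    is_male: bool = False,
    pab: Optional[float] = None,  # computed from AD when not supplied
) -> bool:
    """Return True if the genotype would be set to missing."""
    if dp <= 0:
        return True  # assumption: a call without reads is unusable
    ref_frac = ad[0] / dp
    alt_frac = ad[1] / dp
    total_frac = (ad[0] + ad[1]) / dp

    # Thresholds applied to every call (this also covers the total AD / DP
    # rule repeated for chromosome X heterozygotes).
    if dp > 200 or gq < 20 or total_frac < 0.9:
        return True

    if gt == "hom_ref":
        return ref_frac < 0.9
    if gt == "het":
        if chrom in ("X", "chrX"):
            if pab is None:
                pab = binomial_pab(ad[0], ad[1])
            return is_male or alt_frac < 0.25 or pl[0] < 20 or pab < 1e-9
        return not (0.2 <= alt_frac <= 0.8) or pl[0] < 20
    if gt == "hom_var":
        return alt_frac < 0.9 or pl[0] < 20
    raise ValueError(f"unknown genotype class: {gt!r}")
```

For example, a balanced autosomal heterozygote with DP 30 and AD (15, 15) passes, whereas the same site with AD (27, 3) fails because the alternative allele fraction of 0.1 lies outside the 0.2-0.8 interval.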

2. Sample-wise QC/filtering

To identify the first list of excludable samples, relatively stringent QC thresholds were first applied to bi-allelic variants on autosomal chromosomes, excluding those in low-complexity regions.

Variants were preserved (for calculating sample-wise metrics) if:

  • variant quality score recalibration (VQSR) filter was PASS; and

  • quality by depth (QD) for SNPs was ≥ 2 and for indels ≥ 3; and

  • allele count (AC) was ≥ 3; and

  • call-rate (CR) ≥ 90%; and

  • Hardy-Weinberg Equilibrium p-value (pHWE) ≥ 1e-9.

Outlier samples deviating by more than ±3 SD were identified from sample-wise QC metrics (nSNP, rHetHom, rInsertionDeletion, rTiTv). The data were further LD-pruned with window size 1M and r2 = 0.2, after first excluding variants in high-LD regions (lifted over from GRCh37 by NCBI Remap), variants with minor allele frequency (MAF) < 0.05 and variants with CR < 90%. With the KING method in Plink v2.0, related individuals (kinship coefficient > 0.177) were identified from the LD-pruned data and excluded. Then, the top 20 principal component (PC) scores were computed, and outlier samples to be excluded were identified based on the first 10 PCs.
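The ±3 SD outlier rule for a single sample-wise metric can be illustrated as follows. This is a minimal sketch: it assumes that deviations are measured from the cohort mean using the sample standard deviation, which the text does not specify, and `flag_outliers` is a hypothetical helper name.

```python
from statistics import mean, stdev
from typing import Dict, Set


def flag_outliers(metric: Dict[str, float], n_sd: float = 3.0) -> Set[str]:
    """Return the sample IDs whose value lies more than n_sd SDs from the mean.

    Assumption: mean and sample SD are computed over all samples, including
    the potential outliers themselves.
    """
    values = list(metric.values())
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        return set()  # all samples identical, nothing to flag
    return {sample for sample, v in metric.items() if abs(v - mu) > n_sd * sd}
```

Running the function once per metric (nSNP, rHetHom, rInsertionDeletion, rTiTv) and taking the union of the returned sets would give one possible list of sample-wise QC outliers.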

Outlier samples (sample-wise QC outliers, related individuals, PCA outliers, individuals with ambiguous gender) were excluded from the data before applying variant-wise QC/filtering.

3. Variant-wise QC/filtering

Mitochondrial and chromosome Y variants, variants on alternative haplotypes, and variants within pseudo-autosomal regions of chromosome X or low-complexity regions of any chromosome were excluded. Multiallelic sites were preserved by decomposing them into bi-allelic records.

Genotypes were set as missing based on the above-mentioned (step 1) thresholds.

Next, variants were preserved, if:

  • variant quality score recalibration (VQSR) filter was PASS; and

  • quality by depth (QD) for SNPs was ≥ 2 and for indels ≥ 6; and

  • AC was > 0; and

  • CR was ≥ 90%; and

  • pHWE for autosomal chromosomes and for females on chromosome X was ≥ 1e-9.
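The two sets of variant thresholds (the stricter set used for computing sample-wise metrics in step 2 and the set used here in step 3) can be written as one predicate with two presets. The class, constant and function names are illustrative assumptions; the inputs are assumed to be precomputed per variant, with pHWE already calculated on the appropriate samples (females only for chromosome X).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantThresholds:
    min_qd_snp: float
    min_qd_indel: float
    min_ac: int
    min_call_rate: float = 0.90
    min_p_hwe: float = 1e-9


# Step 2: variants kept for calculating sample-wise metrics (AC >= 3).
SAMPLE_QC = VariantThresholds(min_qd_snp=2, min_qd_indel=3, min_ac=3)
# Step 3: variant-wise filtering (AC > 0 is expressed as AC >= 1).
VARIANT_QC = VariantThresholds(min_qd_snp=2, min_qd_indel=6, min_ac=1)


def variant_passes(
    vqsr_filter: str,
    qd: float,
    is_snp: bool,
    ac: int,
    call_rate: float,
    p_hwe: float,
    t: VariantThresholds = VARIANT_QC,
) -> bool:
    """Return True if the variant satisfies every threshold in t."""
    min_qd = t.min_qd_snp if is_snp else t.min_qd_indel
    return (
        vqsr_filter == "PASS"
        and qd >= min_qd
        and ac >= t.min_ac
        and call_rate >= t.min_call_rate
        and p_hwe >= t.min_p_hwe
    )
```

With these presets, an indel with QD 5 and AC 3 is kept for the sample-wise metrics but removed by the variant-wise filter, while a singleton SNP with QD 5 behaves the other way round.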

Notable batch effects between samples sequenced at different sites (Washington University and Broad) and with PCR+ or PCR- protocols were observed in sample-wise QC metrics, mainly in chrX coverage and for indels on autosomal chromosomes. Thus, for autosomal chromosomes, variant-wise CRs were calculated over samples grouped by the sequencing site. SNPs showing ≥ 5% and indels showing ≥ 3% difference in CRs between the sequencing sites were excluded. Remaining variants that still demonstrated some differences in relevant QC metrics were further listed in variant blacklists.
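The site-wise call-rate comparison can be sketched as below. The input layout (called and total genotype counts per sequencing site) and the function name are assumptions for illustration; exact fractions are used so that a difference of exactly 5% or 3% is handled without floating-point rounding.

```python
from fractions import Fraction
from typing import Dict, Tuple

SNP_MAX_DIFF = Fraction(5, 100)    # SNPs excluded at >= 5% difference
INDEL_MAX_DIFF = Fraction(3, 100)  # indels excluded at >= 3% difference


def fails_site_check(counts_by_site: Dict[str, Tuple[int, int]], is_snp: bool) -> bool:
    """Return True if call rates differ too much between sequencing sites.

    counts_by_site maps a site label to (non-missing genotypes, samples).
    """
    rates = [Fraction(called, total) for called, total in counts_by_site.values()]
    diff = max(rates) - min(rates)
    limit = SNP_MAX_DIFF if is_snp else INDEL_MAX_DIFF
    return diff >= limit
```

For instance, a variant called in 960 of 1,000 samples at one site and in all 1,000 at the other shows a 4% difference, so it would be excluded if it is an indel but kept if it is a SNP.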

4. Imputation reference panel generation

To generate a high-quality hcWGS reference panel, the QC'ed data were further filtered by excluding variants with AC < 3. Haplotype phasing was carried out with Eagle 2.3.5 using the default parameters, except that the number of conditioning haplotypes was set to 20,000. Beagle-specific reference panel files (.bref) were created as instructed in the Beagle documentation.
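As a small illustration of the allele-count filter, the sketch below counts alternative alleles from diploid dosages and keeps variants with AC ≥ 3. The dosage encoding (0, 1, 2, with None for a missing genotype) and the function names are assumptions for the example; hemizygous calls on chromosome X would need separate handling.

```python
from typing import Dict, List, Optional


def allele_count(dosages: List[Optional[int]]) -> int:
    """Sum alternative-allele dosages, ignoring missing genotypes (None)."""
    return sum(d for d in dosages if d is not None)


def keep_for_panel(
    variants: Dict[str, List[Optional[int]]], min_ac: int = 3
) -> List[str]:
    """Return the IDs of variants whose allele count is at least min_ac."""
    return [vid for vid, dosages in variants.items() if allele_count(dosages) >= min_ac]
```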

Summary of SISu v3.0 reference panel as pdf.

See also Genotype Imputation section for general information about imputation.
