Genotype Data Processing Flow

Genotypes enter the FinnGen project from two sources, both described in detail in the Genotype Arrays Used section:

1) FinnGen-specific Affymetrix arrays, which contain many rare variants specific to Finland.

2) Legacy cohorts / batches, obtained from other Finnish studies.

The documentation section on the genotype browser how-to gives a rough estimate of how the legacy cohorts contribute to each data freeze and lists some of the chips used for them. Samples are then grouped into batches of 5000 samples.

Chip Genotype Data Processing and QC

Samples were genotyped with Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix arrays (Thermo Fisher Scientific, Santa Clara, CA, USA).

Genotype calls were made with the GenCall and zCall algorithms for Illumina data and the AxiomGT1 algorithm for Affymetrix data. Chip genotyping data produced with previous chip platforms and reference genome builds were lifted over to build version 38 (GRCh38/hg38) following the protocol described here.

Basic QC immediately after receiving new dataset from FinnGen-specific Affymetrix arrays

  • Dataset integrity check: compare received MD5 sums to calculated ones

  • Sample missingness >95%

  • Exclude samples without sex information or with incorrect sex

  • List samples with different subject IDs but the same genome (within a dataset) -> the genotyping team verifies whether the subjects are identical twins or sample mixups. Sample mixups are added to the exclusion list.

  • Select the best duplicate on the basis of call rate (within a dataset, among known control duplicates).

  • Variant-wise QC metrics are reported.
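The integrity check in the first item above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline's actual tooling; the function names `md5sum` and `mismatched_files` and the manifest layout (a mapping from file path to expected checksum) are assumptions made for the example.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(manifest):
    """Return the paths whose calculated MD5 differs from the received one.

    `manifest` is a hypothetical {path: received_md5} mapping.
    """
    return [
        path
        for path, received in manifest.items()
        if md5sum(path) != received.strip().lower()
    ]
```

An empty list from `mismatched_files` would mean every received checksum matches the calculated one.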

Sample mixups reported by the biobanks and the FinnGen DNA team are removed from further processing.

Genotype QC for imputation and chip data releases

Sample QC:

  • Remove samples where genetic sex does not match provided sex from registries (female: f < 0.4, male: f >0.7)

  • Remove samples with variant missingness >0.02

  • Remove samples with high heterozygosity in common variants (allele frequency > 0.05), defined as > 3 standard deviations from the mean, per batch

  • Remove samples with excess relatedness to other samples ($\hat{\pi} > 0.1$) in 2 rounds:

    • First, remove samples related to a large number of other samples ($n > 500$).

    • Second, rerun the check and remove samples with $n > 50$. The heterozygosity step handles excess relatedness in FinnGen chip data, but legacy chips have a few remaining outliers, and this step removes those problematic samples.
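The first three sample-level checks listed above can be sketched as a single pass over per-sample statistics. This is an illustrative sketch only: the record keys (`id`, `batch`, `registry_sex`, `f_stat`, `missingness`, `het_rate`) are hypothetical, samples with an F statistic between 0.4 and 0.7 are treated as mismatches by assumption, and only heterozygosity above the batch mean is flagged, since the text describes "high" heterozygosity.

```python
from statistics import mean, pstdev


def sample_qc_failures(samples):
    """Return {sample_id: [reasons]} for samples failing the sketched checks."""
    failures = {}

    def flag(sample_id, reason):
        failures.setdefault(sample_id, []).append(reason)

    for s in samples:
        # Genetic sex from the F statistic: female if F < 0.4, male if F > 0.7.
        if s["f_stat"] < 0.4:
            genetic_sex = "female"
        elif s["f_stat"] > 0.7:
            genetic_sex = "male"
        else:
            genetic_sex = None  # ambiguous; counted as a mismatch (assumption)
        if genetic_sex != s["registry_sex"]:
            flag(s["id"], "sex_mismatch")
        # Per-sample variant missingness threshold.
        if s["missingness"] > 0.02:
            flag(s["id"], "missingness")

    # Heterozygosity outliers are evaluated separately within each batch.
    by_batch = {}
    for s in samples:
        by_batch.setdefault(s["batch"], []).append(s)
    for batch_samples in by_batch.values():
        rates = [s["het_rate"] for s in batch_samples]
        mu, sd = mean(rates), pstdev(rates)
        for s in batch_samples:
            if sd > 0 and s["het_rate"] - mu > 3 * sd:
                flag(s["id"], "high_heterozygosity")

    return failures
```

The heterozygosity rate is assumed to have been computed beforehand on common variants (allele frequency > 0.05) only.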

Variant QC:

  • Map variants to the reference genome, left-align and minimize the allele representation. Annotate the variant ID as chr:pos:ref:alt

    • Remove variants with alleles containing characters other than A, T, C, or G

  • Compare variants against imputation panel:

    • Remove if not in panel (for imputation only, these remain in the chip dataset)

    • Remove if allele frequency in panel < 0.001 (for imputation only, these remain in the chip dataset)

    • Remove variants where allele frequency differs significantly from the panel ($p < 5 \times 10^{-8}$), adjusted for the first 10 principal components

  • Remove variant from all batches (FinnGen chip data and legacy data processed separately) if:

    • HWE p-value < $10^{-10}$ across all batches (exception: variants below a given frequency that have a deficiency in homozygotes escape this exclusion)

    • more than 15% of the batches have missingness > 0.04

    • is non-PASS in more than 30% of the batches

  • Remove variants within a batch if:

    • $P_{HWE} < 10^{-6}$

    • missingness > 0.02 (0.05 for Y chromosome)

    • variant is non-PASS
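The two levels of batch-based variant removal listed above can be sketched as a pair of predicates. This is a simplified illustration under stated assumptions: the per-batch record keys (`hwe_p`, `missingness`, `is_pass`) and the chromosome labels are hypothetical, and the low-frequency homozygote-deficiency exception to the cross-batch HWE rule is deliberately left out because its frequency threshold is not given here.

```python
def fails_within_batch(stats, chrom):
    """True if a variant should be removed from one batch.

    `stats` holds that batch's values for the variant (hypothetical keys).
    """
    # The Y chromosome uses a more lenient missingness threshold.
    max_missing = 0.05 if chrom in ("Y", "chrY") else 0.02
    return (
        stats["hwe_p"] < 1e-6
        or stats["missingness"] > max_missing
        or not stats["is_pass"]
    )


def fails_across_batches(per_batch, hwe_p_all_batches):
    """True if a variant should be removed from all batches.

    `per_batch` is a list with one stats record per batch;
    `hwe_p_all_batches` is the HWE p-value computed across all batches.
    The homozygote-deficiency exception is NOT implemented in this sketch.
    """
    n_batches = len(per_batch)
    high_missing = sum(s["missingness"] > 0.04 for s in per_batch)
    non_pass = sum(not s["is_pass"] for s in per_batch)
    return (
        hwe_p_all_batches < 1e-10
        or high_missing / n_batches > 0.15
        or non_pass / n_batches > 0.30
    )
```

Because FinnGen chip data and legacy data are processed separately, `fails_across_batches` would be evaluated once per group of batches rather than over both groups together.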

A single variant (chr12_71584145_G_T) was force imputed (removed from all batches in imputation QC).

Chip genotyped samples were then pre-phased with Eagle 2.3.5 with the default parameters, except the number of conditioning haplotypes was set to 20,000.
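As a rough illustration of the non-default setting mentioned above, the sketch below assembles an Eagle 2 command line in which the number of conditioning haplotypes is set through the `--Kpbwt` option. The function name, the executable name `eagle`, and all file paths are hypothetical, and the source does not state whether phasing used a reference panel, so the plain `--vcf` input mode shown here is an assumption rather than the pipeline's actual invocation.

```python
def eagle_prephase_command(vcf_path, genetic_map_path, out_prefix,
                           n_conditioning_haplotypes=20000):
    """Build an argument list for a pre-phasing run (nothing is executed)."""
    return [
        "eagle",                                   # assumed executable name
        f"--vcf={vcf_path}",                       # hypothetical input VCF
        f"--geneticMapFile={genetic_map_path}",    # hypothetical genetic map
        f"--outPrefix={out_prefix}",               # hypothetical output prefix
        f"--Kpbwt={n_conditioning_haplotypes}",    # non-default: 20,000
    ]


# Example with placeholder paths; pass the list to subprocess.run to execute.
command = eagle_prephase_command(
    "batch1.chr1.vcf.gz", "genetic_map_hg38.txt.gz", "batch1.chr1.phased"
)
```

All other Eagle parameters are left at their defaults, as the text describes.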

The Markdown file documenting all DF10 genotype QC for both chip and imputation data is located in Sandbox at: /finngen/library-red/finngen_R10/R10_genotype_qc.md

Genotype imputation with the population-specific reference panel

The SISu v4.2 reference panel used for imputation (Palta et al., manuscript in preparation) was developed from high-coverage (25x) WGS data generated at the McDonnell Genome Institute at Washington University. Details can be found in the Imputation Panel section.

Genotype imputation was performed using Beagle 4.XX (check)

Click here to read more about the SISu reference panel.
