Genotype Imputation

Genotype imputation is a computational method for statistically inferring untyped genotypes in a sample of partially genotyped individuals. The imputation process uses linkage disequilibrium (LD) and haplotype sharing/similarity to infer the genotypes of a lower-density (e.g. chip-genotyped) target dataset from a reference dataset of more densely genotyped individuals, e.g. from an imputation reference panel (IRP) based on whole-genome sequencing data.

The figure below shows diagrammatically how genotype imputation works in unrelated individuals (adapted from the Abecasis group's review paper found here).

Panel A

The study samples (Panel A) have sparse genotypes; the gaps between the genotyped sites are then "filled in" (imputed) (Panel B).

Panel B

Depending on the data, imputed stretches can exceed 100 kb in length, and variants with a minor allele frequency down to 5-10% can be imputed. In FinnGen, all data are imputed with a Finnish population-specific imputation panel, which allows imputation at much higher resolution.

Panel C

Panel C shows the final product of genotype imputation: the series of genotypes that were unobserved in the study sample has been filled in.
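The idea in Panels A to C can be illustrated with a toy sketch. This is not how Beagle or any production tool works internally (those use probabilistic models over many reference haplotypes); it only shows the intuition of copying untyped alleles from the most similar reference haplotype. It assumes phased haplotypes coded as 0/1, with None marking untyped sites. The function name and the example data are hypothetical.

```python
from typing import List, Optional


def impute_by_best_match(study: List[Optional[int]],
                         reference: List[List[int]]) -> List[int]:
    """Fill untyped sites (None) by copying from the reference haplotype
    that agrees with the study haplotype at the most typed sites."""
    # Positions where the study haplotype was actually genotyped.
    typed = [i for i, allele in enumerate(study) if allele is not None]

    def matches(ref_hap: List[int]) -> int:
        # Number of typed sites where this reference haplotype agrees.
        return sum(ref_hap[i] == study[i] for i in typed)

    best = max(reference, key=matches)
    # Keep observed alleles; copy the rest from the best-matching haplotype.
    return [best[i] if allele is None else allele
            for i, allele in enumerate(study)]


# Hypothetical densely genotyped reference haplotypes (8 sites each).
reference_panel = [
    [0, 0, 1, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1],
]
# Hypothetical study haplotype typed at only three of the eight sites.
study_haplotype = [1, None, None, 1, None, None, 1, None]

imputed = impute_by_best_match(study_haplotype, reference_panel)
print(imputed)
```

In this example the second reference haplotype matches the study haplotype at all three typed sites, so its alleles are copied into the untyped positions.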

With genotype imputation, geneticists can study variants that have not been directly genotyped in the study samples, which increases the power and resolution of genome-wide association studies (GWAS) and meta-analyses, especially when combining association results across studies that use different genotyping arrays. Genotype imputation can also improve fine-mapping, because considering all genetic variants in a region helps to localise association signals.

With the increased availability of genotype imputation tools and whole-genome sequencing (WGS) based reference datasets, this practice has become widespread, as it offers a cost-effective alternative to purely WGS-based study designs, especially for analyses of very large sample sets.

In FinnGen, we use Beagle to perform genotype imputation. Before the FinnGen target datasets (comprising ‘legacy’ datasets genotyped with various chip arrays and samples genotyped with ThermoFisher FinnGen Affymetrix genotyping chips) are imputed with Beagle, high-quality WGS-based variant alleles are phased with Eagle to develop the IRPs.
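As a rough illustration of what an imputation run looks like, the sketch below assembles a Beagle command line in Python. It assumes a Beagle release that accepts the gt=, ref=, map=, out= and nthreads= arguments; the Beagle version and exact options used in FinnGen are not stated here. The helper name, the jar path and all file names are hypothetical.

```python
from typing import List


def beagle_command(target_vcf: str, reference_vcf: str,
                   genetic_map: str, out_prefix: str,
                   jar: str = "beagle.jar", threads: int = 4) -> List[str]:
    """Build (but do not run) a Beagle imputation command."""
    return [
        "java", "-jar", jar,
        f"gt={target_vcf}",        # chip-genotyped target dataset
        f"ref={reference_vcf}",    # phased imputation reference panel
        f"map={genetic_map}",      # genetic map in PLINK format
        f"out={out_prefix}",       # prefix for the output files
        f"nthreads={threads}",
    ]


# Hypothetical per-chromosome file names, for illustration only.
cmd = beagle_command("target_chr20.vcf.gz", "irp_chr20.vcf.gz",
                     "plink.chr20.map", "imputed_chr20")
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) on a machine where Java and the Beagle jar are available.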

Single Nucleotide Polymorphism (SNP) and Insertion-Deletion (indel) Imputation

Both SNPs (sometimes known as single nucleotide variants, or SNVs) and indels in FinnGen are currently imputed from the SISu v3 IRP, which comprises 3,775 whole-genome sequenced (~30x coverage) Finns and contains 16.9M variants.

We are currently in the process of shifting over to the SISu v4 IRP, which comprises 8,554 whole-genome sequenced Finns.

Note that imputation with an ancestry-specific reference panel, e.g. the SISu IRP, has been shown to improve genetic association analyses, especially for rare variants (as noted here).

Short Tandem Repeat (STR) or Simple Sequence Length Polymorphism (SSLP) Imputation

STRs, also called SSLPs or microsatellites, are one of the most plentiful sources of variation in the human genome. These variants are runs of consecutively repeated sequence motifs of approximately 2-6 nucleotides, present in a few to hundreds of copies (often written (CA)n, (GTT)n, (GATA)n, etc.), in which the length (number of repeats) is polymorphic in the population.
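To make the (CA)n notation concrete, the following minimal sketch counts the longest run of back-to-back copies of a motif in a sequence. It is a naive string scan for illustration only, not an STR caller; the function name and the two example alleles are hypothetical.

```python
def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of consecutive copies of `motif`
    found anywhere in `sequence`."""
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        count = 0
        pos = start
        # Walk forward one motif at a time while the motif keeps repeating.
        while sequence.startswith(motif, pos):
            count += 1
            pos += step
        best = max(best, count)
    return best


# Two hypothetical alleles of the same (CA)n locus with different lengths.
allele_a = "GGT" + "CA" * 12 + "TTG"
allele_b = "GGT" + "CA" * 15 + "TTG"

print(longest_repeat_run(allele_a, "CA"), longest_repeat_run(allele_b, "CA"))
```

An individual carrying these two alleles would be described by the repeat counts 12 and 15, and other individuals in the population may carry yet other counts at the same locus.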

Repeat length expansion is widely studied as the genetic cause of many neurodegenerative conditions (usually caused by expansions within the coding regions of genes). However, STRs are largely unexplored in GWAS, since they are not called well by default short-read sequencing analysis pipelines and are not imputed well. Like HLA types, they are single variants where each individual has two alleles drawn from a population set of more than two possible alleles, rather than ‘binary’ variants like SNPs and single indels. Hence, as with HLA, the calling of these variants in the reference data, and the subsequent imputation and association analysis, are slightly different.

In FinnGen, we will use an SSLP/STR reference panel developed from the same WGS data (for ~8,000 Finns) that was used to develop the SISu v4 SNV/indel reference panel. High-quality SSLP/STR calls are merged with biallelic SNV calls for the same individuals (from the SISu v4 SNV/indel reference panel) and phased. The resulting reference panel (now also containing SSLP/STR variants) will be used to carry out SSLP/STR imputation for all FinnGen chip datasets. We expect these data to be available in autumn 2021.

More information and diagrams:

The presentation “Genotype Imputation” is available in the Documentation folder.

Click here to read more about the SISu imputation panel used in FinnGen.
