Genotype Data Processing and Quality Control (QC)
Genotyping is a relatively cheap and scalable way to characterize genetic variants in DNA samples. It relies on prior knowledge of the genome, because the technology queries only particular genomic coordinates where known variation exists. It differs from sequencing, where the full sequential order of bases is characterized.
In comparison, sequencing a human genome means reading approx. 3 billion bases, whereas a genotyping chip or array typically covers half a million to a million coordinates distributed across the genome.
Genotyping doesn’t cover the whole catalog of variation in a person’s genome, but the variants chosen for the array typically provide enough information to statistically infer much of the remaining variation by imputation, thanks to linkage disequilibrium, i.e. the tendency of nearby variants to be inherited together.
Genotyping starts in the genotyping lab, where extracted DNA is fragmented and washed over a chip containing engineered complementary base-pair fragments that flank the immediate region around each known variant location (Figure: Target prep). The DNA fragments hybridize with these complementary sequences (Figure: Hybridization), and the allele present at each site is then read out with varying technological solutions, e.g. fluorescence intensity measurements.
Finally, after signal amplification (Figure: Signal amplification), the captured data can be processed by genotype-calling software, and the output is the set of genotypes called from the DNA sample at each variant on the chip.
The researcher typically receives from the lab a single file combining the genotype data of all processed samples for all variants on the chip.
Before analysis, it is important to pass the data through a number of quality control (QC) steps so that the results are not confounded by e.g. technical artifacts or biases in the sample population. QC is typically split into sample level and variant level QC. A great overview of the process, with example commands, is given in PMID 21085122.
Variant QC
aims to remove specific variants that have quality issues. One of the most important steps is the removal of variants that are missing (i.e. not called) in a larger proportion of samples than a set threshold, e.g. "missingness" in 2% or more of samples. Low call rates can have several causes, e.g. the hybridization did not work as intended, or the automated genotype-calling software was not able to deduce the genotypes accurately.
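As an illustrative sketch (not any specific pipeline's implementation), the variant missingness filter can be expressed in a few lines of Python, assuming genotypes are stored as a samples × variants matrix of alternate-allele dosages (0/1/2) with NaN marking missing calls; the 2% threshold is the example value from the text:

```python
import numpy as np

def filter_variants_by_missingness(genotypes: np.ndarray, max_missing: float = 0.02) -> np.ndarray:
    """Boolean mask of variants (columns) whose missingness is below the threshold."""
    # Fraction of samples with a missing call, per variant.
    missing_rate = np.isnan(genotypes).mean(axis=0)
    return missing_rate < max_missing

# Toy example: 4 samples x 3 variants; the last variant is missing in half the samples.
g = np.array([[0, 1, np.nan],
              [2, 1, np.nan],
              [1, 0, 2],
              [0, 2, 1]], dtype=float)
print(filter_variants_by_missingness(g))  # [ True  True False]
```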
Sample level QC
aims to remove specific samples that for one reason or another have poor quality data or where the genetically inferred information does not correspond to information known prior to genotyping.
A typical data-related filter is to remove samples for which more than a set percentage of variants are missing, i.e. a genotype was not called at many locations. A typical threshold is e.g. 2% or more of a sample's genotype calls missing.
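The same idea applies per sample rather than per variant; a minimal sketch under the same assumed 0/1/2-with-NaN encoding:

```python
import numpy as np

def filter_samples_by_missingness(genotypes: np.ndarray, max_missing: float = 0.02) -> np.ndarray:
    """Boolean mask of samples (rows) whose fraction of missing genotype calls is below the threshold."""
    missing_rate = np.isnan(genotypes).mean(axis=1)  # per-sample missingness
    return missing_rate < max_missing
```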
Another important step is to infer the sex of each sample from the genetic data using the rate of homozygosity/heterozygosity on the X chromosome. If the genetically inferred sex is discordant with the sex reported in the phenotype information that typically accompanies the sample (e.g. reported by the clinician referring a patient to the study), this can imply e.g. a sample swap or a mistake in the accompanying phenotype information. Neither of these is uncommon, especially in larger studies.
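A hypothetical sketch of the idea: classify each sample from its heterozygosity rate on (non-pseudoautosomal) X-chromosome variants and flag discordance with the reported sex. The cutoff below is illustrative; production tools such as PLINK's --check-sex use an inbreeding-coefficient statistic built on the same logic.

```python
import numpy as np

def infer_sex_from_x_het(x_genotypes: np.ndarray, het_cutoff: float = 0.1) -> list[str]:
    """Infer 'XY' (low X heterozygosity) vs 'XX' per sample (row); 0/1/2 coding, NaN = missing."""
    called = ~np.isnan(x_genotypes)
    het_rate = ((x_genotypes == 1) & called).sum(axis=1) / called.sum(axis=1)
    # XY samples carry one X, so true heterozygous calls should be (near) absent.
    return ["XY" if r < het_cutoff else "XX" for r in het_rate]

# Flag samples whose inferred sex disagrees with the reported sex (toy data).
x = np.array([[0, 0, 2, 2, np.nan],   # hemizygous-looking: no heterozygous calls
              [0, 1, 1, 2, 1]])       # heterozygous calls present
reported = ["XX", "XX"]
inferred = infer_sex_from_x_het(x)
discordant = [i for i, (r, g) in enumerate(zip(reported, inferred)) if r != g]
print(inferred, discordant)  # ['XY', 'XX'] [0]
```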
Another important sample level check is a duplicate sample / genetic relatedness check. Unintended duplicates (or monozygotic twins) are easy to detect, as their genetic variants are (nearly) identical. Duplicates can arise e.g. during sample preparation in the lab, or if the same person is enrolled in the same study twice (e.g. at two different clinics). Typically only one sample of a duplicate pair is kept in the data. In the same way that duplicates are inferred, the degree of relatedness between non-duplicated samples can be estimated from the genetic variants using basic Mendelian inheritance expectations: e.g. parent-child pairs share exactly 50% of their genome and sibling pairs 50% on average, which is reflected in the genotype data. Genetic relatedness is checked for all sample pairs. Previously, genome-wide association studies were routinely carried out using only samples that were not closely related to each other, but more recently statistical methods (e.g. mixed models) that control for genetic relatedness have become standard, so fewer samples need to be removed from the data.
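A toy sketch of the duplicate/relatedness screen via mean identity-by-state (IBS) allele sharing over variants called in both samples. Real pipelines use more robust estimators (e.g. PLINK's --genome or KING kinship coefficients), and the duplicate threshold below is illustrative:

```python
import itertools
import numpy as np

def pairwise_ibs(genotypes: np.ndarray) -> dict[tuple[int, int], float]:
    """Mean IBS sharing (0..1) for every sample pair; 0/1/2 coding, NaN = missing."""
    shares = {}
    for i, j in itertools.combinations(range(genotypes.shape[0]), 2):
        both = ~np.isnan(genotypes[i]) & ~np.isnan(genotypes[j])
        # Per variant: identical genotypes share 1.0, opposite homozygotes 0.0.
        shares[(i, j)] = float(np.mean(1 - np.abs(genotypes[i, both] - genotypes[j, both]) / 2))
    return shares

g = np.array([[0, 1, 2, 1, 0],
              [0, 1, 2, 1, 0],    # identical to sample 0 -> likely duplicate / MZ twin
              [2, 1, 0, 0, 2]], dtype=float)
duplicates = {pair: s for pair, s in pairwise_ibs(g).items() if s > 0.99}
print(duplicates)  # {(0, 1): 1.0}
```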
Another typical check has been the removal of samples that have an unusually high or low proportion of heterozygous genotypes (e.g. ±3 standard deviations from the mean). Low heterozygosity can imply autozygosity, and high heterozygosity can imply admixture. Even when this method is not used for sample filtering, it is a good check for understanding the genetic landscape of the study population.
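A minimal sketch of this check, using the ±3 SD cutoff from the text (same assumed 0/1/2-with-NaN encoding as above):

```python
import numpy as np

def heterozygosity_outliers(genotypes: np.ndarray, n_sd: float = 3.0) -> np.ndarray:
    """Boolean mask of samples whose heterozygosity rate is > n_sd SDs from the cohort mean."""
    called = ~np.isnan(genotypes)
    het_rate = ((genotypes == 1) & called).sum(axis=1) / called.sum(axis=1)
    z = (het_rate - het_rate.mean()) / het_rate.std()
    # High outliers may indicate admixture; low outliers may indicate autozygosity.
    return np.abs(z) > n_sd
```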
Finally, an important aspect to consider (which relates to previous topics like relatedness) is the genetic ancestry of the study population. Samples from the same ancestral population are more similar to each other, and allele frequencies of variants can differ substantially between populations. For example, if cases from one population are compared to controls from another (or even to groups from different geographical locations within the same ancestral population), spurious associations can arise that have no relation to case/control status but instead mark differential genetic ancestry.
Figure 1. Spurious associations can arise if no control for differential genetic ancestry is applied. A. In this example, it would appear that the cases (each circle is an individual sample) carry the allele more frequently than the controls do.
B. Upon closer inspection, separating the cases and controls by the geographical location of the recruiting hospital shows that cases and controls carry the allele at equal frequencies within each location. The allele is simply less common in the geographically more Northern population. Because the majority of the controls were from this population, the overall allele frequency in the control group was lower than in the case group, which mainly represented the more Southern genetic ancestry.
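The confounding in Figure 1 is easy to reproduce with made-up numbers. In the hypothetical counts below, the allele frequency is identical in cases and controls within each site, yet the pooled comparison suggests a large case/control difference:

```python
# (allele count, total alleles) per site and group -- hypothetical numbers.
north = {"cases": (10, 100), "controls": (90, 900)}   # freq 0.10 in both groups
south = {"cases": (270, 900), "controls": (30, 100)}  # freq 0.30 in both groups

def pooled_freq(group: str) -> float:
    ac = north[group][0] + south[group][0]
    an = north[group][1] + south[group][1]
    return ac / an

print(pooled_freq("cases"))     # 0.28 -- looks higher in cases...
print(pooled_freq("controls"))  # 0.12 -- ...purely because most controls are Northern
```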
Reference populations such as 1000 Genomes or TOPMed contain genome data from carefully selected samples from many different geographical locations, which can be used as a reference “map” for the genetic signatures of different populations across the globe. We can then compare our own samples’ data to these reference samples and place our samples onto the “map”, using a method called projection principal component analysis (PCA).
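A simplified sketch of projection PCA, assuming the reference and study genotype matrices share the same variants in the same order (0/1/2 coding, no missing calls). Real pipelines additionally handle allele/strand alignment, LD pruning, and missing data:

```python
import numpy as np

def projection_pca(reference: np.ndarray, study: np.ndarray, n_pcs: int = 10):
    """Learn PC axes from reference genotypes and project study samples onto them."""
    # Standardize variants using *reference* means/SDs, then apply the
    # same transformation to the study samples.
    mu = reference.mean(axis=0)
    sd = reference.std(axis=0)
    sd[sd == 0] = 1.0
    ref_std = (reference - mu) / sd
    study_std = (study - mu) / sd
    # Principal axes of the reference via SVD.
    _, _, vt = np.linalg.svd(ref_std, full_matrices=False)
    axes = vt[:n_pcs].T  # variants x n_pcs
    # PC coordinates of reference samples and of study samples on the same "map".
    return ref_std @ axes, study_std @ axes
```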
You can read more about PCA from Matti Pirinen's notes. Typically, in genome-wide association studies, polygenic risk score analyses, or genetic epidemiological analyses, we would include at least the first 10 principal components as covariates in the model.
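A hedged sketch of what "PCs as covariates" means in practice, testing a single variant with a logistic model via statsmodels (all data below are simulated placeholders; `pcs` would come from a projection PCA as above):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
dosage = rng.integers(0, 3, n).astype(float)  # genotype dosage for one variant
pcs = rng.normal(size=(n, 10))                # first 10 principal components
y = rng.integers(0, 2, n)                     # case/control status

# Design matrix: intercept + variant dosage + ancestry PCs.
X = sm.add_constant(np.column_stack([dosage, pcs]))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.params[1])  # variant effect, adjusted for genetic ancestry
```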
Read more about genotype data processing in FinnGen