Cluster Plots
A cluster plot is a scatter plot of the raw variant intensities from the chips and enables us to broadly assess how accurate the genotype array calls are for a variant.
This is especially important for rare variants, which may have no independent verification from imputation. Be sure to check your rare variants' cluster plots before doing an association analysis with them.
In the Genotype Browser we provide three ways to look at cluster plots. These all show the same points but with different colorings explained by the key on the plot.
You can also access the cluster plots from the green library.
The files with individual intensities for each point are only available in the red-level library.
View | What the coloring shows |
---|---|
All | Raw calls from the ThermoFisher software |
Missing Imputation | Imputed genotype calls (confusing name meaning that missing data has been imputed) |
Sex-separated | Not calls, just indication of male/female status |
In a cluster plot, data points form clusters which represent genotype calls
These "clouds" of intensity occur because each variant is represented on the chip hundreds of times and different DNA preparations may anneal differently to the chip. This is why the clusters are less compact than one might expect. Here are some diagrams from ThermoFisher showing the area on the chip for one variant.
In the legend of the plots, different genotypes are referred to with the following indices:
0 : Homozygous reference allele : AA (always on the X axis)
1 : Heterozygous : Aa
2 : Homozygous alternative allele : aa (always on the Y axis)
X : No call : NN / 1
Here is an example of a FinnGen cluster plot with high quality calls:
Things to check for are:
Are the clusters clearly separated?
Is the call rate sufficiently high? (not too many Xs, indicating missing calls)
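The call-rate check can also be made concrete with a small calculation. The following is a minimal sketch, assuming the calls for one variant are available as a list of the legend indices ("0", "1", "2", "X"); the function name and the example counts are hypothetical, and no pass/fail threshold is implied.

```python
from collections import Counter

MISSING = "X"  # legend index used for no-calls in the plots


def call_rate(calls):
    """Return the fraction of samples with a non-missing genotype call."""
    if not calls:
        raise ValueError("no calls supplied")
    counts = Counter(calls)
    return 1.0 - counts[MISSING] / len(calls)


# Hypothetical example: 100 samples, 3 of them without a call.
example_calls = ["0"] * 90 + ["1"] * 6 + ["2"] * 1 + [MISSING] * 3
print(f"call rate: {call_rate(example_calls):.2f}")
```

Note that a call rate computed from the counts shown on a plot will look lower than the true one, because the plots are subsampled while all missing points are displayed.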
Here is an example where these QC steps fail:
Problems:
No clear separation between clusters
Low call rate (many missing - usually hand-in-hand with poor separation of clusters)
Variants such as this one often fail QC of the chip data. In that case, none of the calls based on intensity information are used and the entire variant is imputed as if it had not been genotyped. If a variant passes QC, imputation does not change genotypes that have been called by the cluster analysis; however, it can still improve the raw calls by filling in missing ones. Imputation does not look at the intensity data. It uses only the linkage disequilibrium patterns in the genomic region, so it provides an independent confirmation and disambiguation of the genotype calls. Examining both pre- and post-imputation cluster plots (where only the genotypes, not the underlying data points, can change) can help increase confidence that the intensities and the imputed calls for missing data make sense. Also remember that the gnomAD frequencies, derived from independent exome and genome sequencing, should be consistent with the FinnGen frequencies and can provide an independent confirmation of the variant.
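The relationship between raw and imputed calls described above can be sketched in a few lines. This is only an illustration of the rule as stated here, not the FinnGen pipeline itself; the function name, the list-based representation, and the example values are assumptions.

```python
MISSING = "X"  # legend index used for no-calls


def final_genotypes(raw_calls, imputed_calls, passes_qc):
    """Combine raw chip calls with imputed calls for one variant.

    Both inputs hold one legend index ("0", "1", "2" or "X") per sample,
    in the same sample order.
    """
    if len(raw_calls) != len(imputed_calls):
        raise ValueError("raw and imputed calls must cover the same samples")
    if not passes_qc:
        # Variant failed chip QC: intensity-based calls are discarded and
        # the variant is treated as if it had not been genotyped.
        return list(imputed_calls)
    # Variant passed QC: keep every raw call, fill only the missing ones.
    return [imp if raw == MISSING else raw
            for raw, imp in zip(raw_calls, imputed_calls)]


raw = ["0", "1", "X", "0"]
imputed = ["0", "0", "1", "0"]
print(final_genotypes(raw, imputed, passes_qc=True))
print(final_genotypes(raw, imputed, passes_qc=False))
```

In the first call the second sample keeps its raw heterozygous call even though imputation disagrees; only the missing third sample is filled in.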
FinnGen genotypes are generated from germline DNA extracted from a non-clonal population of cells. On rare occasions, however, we observe somatic mutations among our chip genotypes. These are caused by clonal hematopoiesis associated with aging and hematological cancers. The phenomenon is characterized by overrepresentation of blood cells derived from a single clone. In the cluster plots, somatic mutations due to clonal hematopoiesis show up as a continuum of calls that deviate from the reference homozygotes (see the example below of the somatic missense mutation p.Val617Phe in JAK2). The calls do not form clear clusters. Sometimes these are hard to distinguish from true germline calls and may require further manual inspection (e.g. frequency comparison to imputed genotypes).
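A frequency comparison of this kind starts from the alternative allele frequency implied by the genotype counts. The sketch below uses the standard formula (heterozygotes contribute one alternative allele, alternative homozygotes two); the function name and all counts are hypothetical, and how large a difference is worth investigating is left to the analyst.

```python
def alt_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Alternative allele frequency from genotype counts (missing calls excluded)."""
    n_called = n_hom_ref + n_het + n_hom_alt
    if n_called == 0:
        raise ValueError("no called genotypes")
    return (n_het + 2 * n_hom_alt) / (2 * n_called)


# Hypothetical counts for the same variant from two sources.
chip_af = alt_allele_frequency(n_hom_ref=9900, n_het=98, n_hom_alt=2)
imputed_af = alt_allele_frequency(n_hom_ref=9960, n_het=40, n_hom_alt=0)
print(f"chip AF: {chip_af:.4f}, imputed AF: {imputed_af:.4f}")
```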
For X chromosome variants you can check the sex-specific calling. Due to the males being hemizygous for X variants, you actually get 5 groups. Note that since males only have one copy, their intensities will always be on the lower end (since they have half as much material to hybridize on the chip). These five groups are most visible when looking at a common variant such as this one:
Here is an example of how a rare variant on the X chromosome might appear on the plot. Only the females show up as heterozygotes (the middle group). Because this variant is rare, there are no TT females, so no fifth cluster appears:
Designing a probe for a chip is not an exact science. The genomic context of a DNA variant determines how well probes can be designed to fit both alleles and can be affected by many factors such as:
Unusual sequence patterns in the region (e.g. run of As or Ts)
Exceptionally GC-rich or AT-rich regions for which it can be difficult to develop effective probes
The probe has sequence similarity elsewhere in the genome, causing background hybridization to both alleles
For more in-depth information, please see the next topics.
Like most experiments, array-based genotyping can have variability from a variety of sources. Some samples may have lower DNA quality or higher concentration that will make them land outside the average clusters more often. There can also be defects or artifacts in chip synthesis that would affect certain spots on the array and cause a subset of sites to perform aberrantly in certain individuals. Fortunately, as seen in a number of examples above, the raw, cluster-based genotype calling leaves these outliers as missing. Here imputation gives you the best information and can make a high-confidence assignment to a particular genotype since it is calculating the most likely genotype from the patterns at other variants using the deeply sequenced reference panel.
Here you’ll see that the raw genotyping has done its best to flag the calls that are not clear from the intensity clusters, and it does a good job of that with the information it has. Imputation is then able to fill in those calls from the genetic data (with no use of the intensity data). Also, keep in mind that imputation inherently provides genotype probabilities (see the discussion above and links for further details on info scores), so plots that color each point by its most likely genotype will not be perfect: the most likely genotype will not always be the correct one.
If you would like to read about cluster plots in greater detail, you can refer to some of the detailed Affymetrix documentation.
Cluster plots are not updated with each data release: since the chip remains the same, the distribution of the data is stable from release to release. FinnGen batches are combined on these cluster plots, but because of the enormous size, the data is subsampled for viewing. From each batch, up to 100 individuals of each genotype are included in the plot, so for low-frequency variants most or all of the heterozygotes and minor allele homozygotes will be shown. All missing data points are also displayed. Keep in mind that this somewhat magnifies the number of missing data points, since they are all displayed and listed in the count above the plot. (If the common WT homozygotes, listed as “0”, show a value of 5100 on the left, this means 51 batches were included in this plot from DF6. Of course there are many more than 5100 individuals in the genotype data; this is just how many are drawn on the plot! Usually for Aa, aa and missing calls you will see the full number represented here.)
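The arithmetic behind the displayed counts can be sketched as follows. This is a minimal illustration of the per-batch cap described above, not the actual plotting code; the function name and the per-batch counts are hypothetical.

```python
def displayed_count(per_batch_counts, cap=100):
    """Points drawn for one genotype when each batch contributes at most `cap`."""
    return sum(min(n, cap) for n in per_batch_counts)


# Hypothetical per-batch counts for 51 batches.
ref_hom_per_batch = [4000] * 51        # common genotype: capped in every batch
het_per_batch = [3, 0, 7] + [0] * 48   # rare genotype: never reaches the cap
missing_per_batch = [2] * 51           # missing calls are not capped

print(displayed_count(ref_hom_per_batch))  # 51 batches x 100 points each
print(displayed_count(het_per_batch))      # all heterozygotes are drawn
print(sum(missing_per_batch))              # every missing point is drawn
```

With these numbers the common homozygotes show 5100 points even though 204,000 individuals carry that genotype, while the heterozygous and missing counts are shown in full.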
Cluster plots are available in the FinnGen green library at gs://finngen-production-library-green/finngen_R8/cluster_plots and can be downloaded directly using gsutil or Google Cloud desktop. They are available in three types: raw (shows ThermoFisher/Affy calls; for rare variants this is all that is available), imputed (AA, AB, BB calls from imputation), and sex (showing the M/F status).
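As a rough sketch of the gsutil route, the helper functions below build the `gsutil ls` and `gsutil cp` commands for this bucket path. The function names are hypothetical, the layout and file names under `cluster_plots` are not documented here, and running the commands assumes gsutil is installed and you have access to the green library.

```python
import shlex

CLUSTER_PLOT_ROOT = "gs://finngen-production-library-green/finngen_R8/cluster_plots"


def gsutil_ls_command(prefix=CLUSTER_PLOT_ROOT):
    """Command that lists the objects under a bucket prefix."""
    return ["gsutil", "ls", prefix]


def gsutil_cp_command(object_path, destination="."):
    """Command that copies one object to a local destination.

    `object_path` must be a full gs:// path taken from the listing;
    the file naming scheme is not assumed here.
    """
    return ["gsutil", "cp", object_path, destination]


# Print the listing command; to execute it, pass the list to
# subprocess.run(cmd, check=True) on a machine with gsutil configured.
print(shlex.join(gsutil_ls_command()))
```

Listing first and then copying individual plots avoids pulling the whole directory, which is likely to be large.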
When looking at rare variants that are not imputable you may note that you could do a better job selecting the individuals by hand. Good news - we have a tool "V3C" for this! You can read about how to download and use it in the next section.