Genotype Browser how to

Genotype Browser is a graphical user interface for examining variant-level information within FinnGen Sandbox.

Use cases for the Genotype Browser

To quality control (QC) a SNP of interest interactively
- Is the allele frequency and imputation quality score (imp INFO) the same by sex, age, geographical region and biobank/legacy project?
- Is the variant on the FinnGen chip - how many samples have been chip genotyped and how many imputed?
- View indicators of genotype quality via cluster plots
To download to your IVM a list of the individuals carrying that variant (or multiple variants in a gene)
- This list can then be imported to the R libraries so you can look at detailed phenotypic information
- Output files from Genotype Browser can be uploaded, visualized, and analyzed using the Cohort Operations tool
- Output files from Genotype Browser can be uploaded and visualized for trajectories using the Trajectory Visualization tool
- Output files from Genotype Browser can be combined to phenotype information using the Cohort Operations tool and visualized further for trajectories using the Trajectory Visualization tool
To look at the geographical distribution of a variant
To examine the full set of genotyped or imputed coding variants available for a specific gene

How to launch the Genotype Browser

In the Sandbox, the Genotype Browser can be launched from Applications > FinnGen > Genotype browser:

Imputed and Raw Data

One of the first choices you need to make is between Raw chip data (directly genotyped calls) and Imputed data, so the differences are explained briefly here. More detailed information is available in the background reading section of our documentation.

Raw chip data - genotype calls directly from the Affymetrix chip’s ThermoFisher calling software. The array consists of 736,145 probes for 655,973 markers. In addition to the core GWAS markers (about 500,000), it contains 116,402 coding variants enriched in Finland, 10,800 specific markers for the HLA/KIR region, 14,900 ClinVar variants, 4,600 pharmacogenomic variants and 57,000 selected markers that were of special interest to partners. The rare Finnish coding variants are often available only in the chip data.

Imputed data - using linkage disequilibrium (basics here and here) we can fill in the expected values of variants that have not been genotyped directly on the chip. This allows us to extend the 665k raw variants on the Affymetrix chip to a full set of ~17M genomic variants. To do this we use a specially developed Finnish imputation reference panel (Sisu v3 and soon Sisu v4). Note - if a variant exists on the chip but fails QC or contains missing values, these will also be filled in by the imputation process.

Rare variant - a rare variant is usually defined as one whose minor allele frequency is < 0.1%. As noted above, Affymetrix chips designed for FinnGen contain Finnish rare variants drawn from exome sequencing (frequencies from these sequencing studies can be browsed in the gnomAD resource).

Many rare variants below 0.1% are too rare to be imputed because they are not observed in a sufficient number of copies in the imputation reference panel to assess their linkage disequilibrium patterns. They will only be available in the raw chip data.

Several combinations of factors determine which genotype is available in the plots and downloads. This chart attempts to summarize these possibilities:

NB: Before investigating the phenotypes of rare variant heterozygotes and homozygotes it is important to inspect the cluster plot and make sure the calls are reliable (See details and examples later in this document). Due to the sparsity of data for rare variants, it can be difficult for the Affymetrix/ThermoFisher software to call very rare genotype categories accurately.

Selecting variant(s) - single variant or multiple from a gene

The next choice is whether to search for a particular variant or for a gene; either can be typed into the entry box. (Gene names and rs numbers can be entered in lower case as well.)

NB - When searching by location (vs. rs ID or gene), the fields must be separated with “-” (dash) and the base-pair change must be listed as well, e.g. 9-95923269-AT-A (chr-position(hg38)-ref-alt).
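As a minimal sketch of the location format described above, the following Python snippet splits a chr-position-ref-alt string into its four fields and rejects inputs that lack the base-pair change. The function and class names are hypothetical and are not part of the Genotype Browser; the checks only illustrate the format and may be stricter or looser than the browser's own validation.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int   # hg38 position
    ref: str
    alt: str


def parse_variant_id(text: str) -> Variant:
    """Split a 'chr-position-ref-alt' string into its four fields."""
    parts = text.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected chr-position-ref-alt, got {text!r}")
    chrom, pos, ref, alt = parts
    if not pos.isdigit():
        raise ValueError(f"position must be a whole number, got {pos!r}")
    if not ref or not alt:
        raise ValueError("both the ref and alt alleles must be given")
    return Variant(chrom, int(pos), ref.upper(), alt.upper())


print(parse_variant_id("9-95923269-AT-A"))
```

A string such as 9-95923269 (no alleles) raises a ValueError here, mirroring the requirement that the base-pair change is listed.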

If you choose a gene, all the variants available for that gene are displayed. Note that in the right-hand column you can filter by the consequence of the variant (missense, coding); this has been done in the view below. (Details of the other variables are listed lower in the documentation.) You can select up to a maximum of 10 variants for output and interactive viewing.

If you select multiple variants (as shown below) they will then be combined in the next steps where you can look at them interactively and also download them.

Once you have selected the variant(s), to proceed to the next step, click

Variant annotation information

GT source - the data source you searched, either Imputed or Raw Chip. NB: when you change the source between imputed and chip, you need to press Search for the page to update. This field is a good indicator of which data the current page is based on.

AF - the FinnGen allele frequency of the alternative allele (usually the minor allele). The greater the AF, the more prevalent the alternative allele is in the data.

Info - measure of the imputation info score - a calibrated average confidence metric self-reported by the imputation algorithm (more in depth info here). Most well imputed variants have info scores between 0.95 and 1 - associations to variants with values below 0.8 should be more carefully examined. Because many different genotyping arrays were used in legacy batches, variants with info scores less than 0.95 may still have a subset of highly confident genotypes - you can use the Imputed genotype probability filter (discussed below) to restrict summaries to higher confidence genotypes.

Fin enr gnomad2 genomes/exomes -

These measures indicate the Finnish enrichment of a particular allele - how much more common it is in Finland than in other populations (variants with a similar frequency in Finland and Europe will be close to 1, Finnish-enriched variants will have a value > 1). These are calculated from the gnomAD 2 resource, which provides allele frequencies in world populations.

We use what is referred to as “NFSEE” - Non-Finnish Non-Swedish Non-Estonian Europeans to calculate Finnish enrichment. (Swedish and Estonian populations share some genetic heritage with Finland so are removed from the comparison). Finnish enrichment is calculated as FIN AF / NFSEE AF from the gnomAD data - to do this manually you would look at the breakdown of European counts and exclude the Swedish and Estonian ones.

From gnomAD for NFSEE we would calculate (2+1)/(42176+30954+2670+11496) = 3.4 x 10^-5. Finnish enrichment is therefore 0.001848/(3.4 x 10^-5) = 53.8 (using the unrounded NFSEE frequency). If the variant is a coding region variant, you can use the “exomes” value, which is based on a larger sample size; otherwise, refer to the “genomes” value.
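The arithmetic above can be reproduced with a short Python sketch. The function name is hypothetical; the inputs are the counts quoted in the example (alternate-allele counts 2 and 1, and the four NFSEE allele numbers), and the Finnish allele frequency 0.001848.

```python
from typing import Sequence


def finnish_enrichment(
    fin_af: float,
    nfsee_alt_counts: Sequence[int],
    nfsee_allele_numbers: Sequence[int],
) -> float:
    """Return FIN AF / NFSEE AF, pooling the NFSEE populations first."""
    total_alleles = sum(nfsee_allele_numbers)
    if total_alleles == 0:
        raise ValueError("no NFSEE alleles observed")
    nfsee_af = sum(nfsee_alt_counts) / total_alleles
    if nfsee_af == 0:
        raise ValueError("allele not seen in NFSEE; enrichment is undefined")
    return fin_af / nfsee_af


# Counts taken from the worked example in the text.
enrichment = finnish_enrichment(0.001848, [2, 1], [42176, 30954, 2670, 11496])
print(round(enrichment, 1))  # 53.8
```

Pooling the counts before dividing (rather than averaging per-population frequencies) matches the way the example sums the numerators and denominators.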

Interactive Filters

The Genotype Browser gives you filters to view the genotype data interactively. For instance, if you look at the known leukemia variant rs768081343 by allele frequency and by region of birth, you can see that the variant is more enriched in certain northeast regions of Finland. Biobanks are able to recall individuals and can map FinnGen IDs back to personal identifiers, so you could use this view to see whether a biobank has individuals of a particular genotype.

Figures in the Genotype Browser are interactive. Hovering the mouse over the figures will show detailed information.

You can download the figures produced by the Genotype Browser for your publications and presentations. Hovering the mouse over the upper left corner reveals a menu for editing the figure and saving the final figure as a png. After the figure is saved, you may request data download for your figure(s).

From Sandbox v10.3 onwards the Cluster Plots viewer V3C tool is also provided within Genotype Browser's Interactive Filters view.

Some of the other variables in the interactive filters you may consider:

Legacy data

Over the history of the FinnGen project, samples from clinical or epidemiologic cohorts (e.g. Finrisk, Botnia, SETTI, Twin study) have been included in FinnGen. The distribution of these legacy samples over different data freezes is roughly like this:

These legacy samples may have been genotyped on a different chip than the usual FinnGen Axiom chips 1 & 2. When you download your genotypes you will see the chip listed where the variant was run. Genotypes from these arrays are not included in cluster plots but are of course included in the imputed data and downloads from that.

Here are the chips used for some of the legacy data sets:

Most probable genotype/gp threshold

The imputed genotype probability box lets you set the imputation genotype probability (gp) threshold. For instance, if you were doing a recall study you might want to use the stringent 0.95 but for looking at phenotypes you might set a lower threshold. The table below shows the most probable genotype and how different settings of 0.8, 0.9 and 0.95 as the gp threshold will affect the output.

(Note - this is only relevant to (and therefore only available during) the examination of imputed results.)
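To illustrate how a gp threshold might interact with the most probable genotype, here is a minimal Python sketch. It assumes that the call is the genotype with the highest probability and that a call is treated as missing when that probability falls below the threshold; the exact rule and comparison used by the Genotype Browser are not stated here, so treat both as assumptions. The function name and labels are hypothetical.

```python
from typing import Optional, Sequence

# Order follows p00, p01/p10, p11.
LABELS = ("hom_ref", "het", "hom_alt")


def call_genotype(probs: Sequence[float], threshold: float) -> Optional[str]:
    """Return the most probable genotype, or None if it is below threshold."""
    if len(probs) != 3:
        raise ValueError("expected three genotype probabilities")
    best = max(range(3), key=lambda i: probs[i])
    # Assumption: a probability equal to the threshold still passes.
    return LABELS[best] if probs[best] >= threshold else None


probs = (0.02, 0.93, 0.05)
for gp in (0.8, 0.9, 0.95):
    print(gp, call_genotype(probs, gp))
```

With these example probabilities the heterozygous call survives thresholds of 0.8 and 0.9 but is dropped at 0.95, which is why a stringent setting reduces the number of reported carriers.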

Output/download

The next option allows you to download information about the individuals with a particular genotype. For a single variant, you will get an option like this:

If you are looking at more than one variant you will get this variation:

If you are looking at, for example, two variants and check the “count individuals heterozygous for more than one variant as homozygous” checkbox, individuals who are homozygous for either of the variants or heterozygous for both will be considered “homozygous”.
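The checkbox rule can be sketched in Python as follows. The function name and the per-variant alternate-allele counts (0, 1 or 2) are hypothetical inputs chosen for illustration. The behaviour with the checkbox unticked (multiple heterozygous variants stay "heterozygous") is an assumption, since only the ticked case is described above.

```python
from typing import Sequence


def carrier_class(alt_counts: Sequence[int], multi_het_as_hom: bool) -> str:
    """Classify one individual across several selected variants."""
    if any(c not in (0, 1, 2) for c in alt_counts):
        raise ValueError("alt-allele counts must be 0, 1 or 2")
    n_het = sum(1 for c in alt_counts if c == 1)
    if any(c == 2 for c in alt_counts):
        return "homozygous"            # homozygous for at least one variant
    if multi_het_as_hom and n_het > 1:
        return "homozygous"            # heterozygous for more than one variant
    if n_het > 0:
        return "heterozygous"          # assumption for the unticked case
    return "non-carrier"


print(carrier_class([1, 1], multi_het_as_hom=True))   # homozygous
print(carrier_class([1, 1], multi_het_as_hom=False))  # heterozygous
```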

You can use the Excel-like spreadsheet tool LibreOffice Calc to view your data, or you can save it to a file.

Here are mock-data output files and a description of each column:

Heterozygotes raw chip data

Homozygotes raw chip data

death: 1 if the individual has died, 0 if alive at the last Register refresh, NA if unknown

batch: genotyping batch - these can be used as covariates in core and custom analysis (this could be used to check for an exceptional bias where only one batch or array had homozygotes; this is very unlikely and you will probably not need this field).

chip: which genotyping chip was used; the endings such as .r2 and .r1_3 relate to the annotation database Affymetrix uses in calling the data.

array: 1 if the sample was genotyped on one of the FinnGen Axiom arrays, otherwise 0.

three_gt_probs: the probability of each genotype - p00 (homozygous WT), p01/p10 (heterozygous), p11 (homozygous for the alternate allele)

gt: the genotype called based on the threshold set in the interface

variant: which variant (in case there are multiple variants being output)

Understanding genotype output

In your genotype output you may see:

- 1|1: Imputed homozygote
- 0|1 or 1|0: Imputed and phased heterozygote
- 0|0: Imputed WT homozygote
- 1/1: Raw chip homozygote
- 0/1: Raw chip heterozygote
- 0/0: Raw chip WT homozygote
- .|. or ./.: Missing data
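The genotype codes listed above can be interpreted programmatically. The following Python sketch is one possible way to do it, assuming biallelic codes only (alleles 0 and 1) and inferring the data source from the separator: "|" for imputed and "/" for raw chip. The function name is hypothetical.

```python
from typing import Tuple


def describe_genotype(gt: str) -> Tuple[str, str]:
    """Return (source, call) for a genotype string such as '0|1' or '1/1'."""
    if "|" in gt:
        sep, source = "|", "imputed"
    elif "/" in gt:
        sep, source = "/", "raw chip"
    else:
        raise ValueError(f"unrecognised genotype {gt!r}")
    alleles = gt.split(sep)
    if len(alleles) != 2:
        raise ValueError(f"expected two alleles in {gt!r}")
    if "." in alleles:
        return source, "missing"
    if any(a not in ("0", "1") for a in alleles):
        raise ValueError(f"only alleles 0 and 1 are handled, got {gt!r}")
    n_alt = sum(int(a) for a in alleles)
    calls = {0: "WT homozygote", 1: "heterozygote", 2: "homozygote"}
    return source, calls[n_alt]


for code in ("1|1", "1|0", "0/1", "./."):
    print(code, describe_genotype(code))
```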

Downloading genotypes directly from the VCF files

If you want to download genotypes from a different version of the data than the one in the Genotype Browser, you can use tabix commands to do this quickly. Here are some example files you can work from:

/finngen/shared/sample_genotype_lookups_from_vcfs_using_tabix_/20210601_165912/
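As a rough illustration of what a tabix lookup involves (this is not taken from the example files above), the Python sketch below builds a tabix command line for a single position. The VCF path is a placeholder, and whether chromosomes are named "chr9" or "9" depends on the VCF being queried, so both are assumptions to check against your own files. The helper name is hypothetical.

```python
import shlex
from typing import List


def tabix_command(vcf_path: str, chrom: str, pos: int) -> List[str]:
    """Build a tabix call that prints the header plus one position."""
    region = f"{chrom}:{pos}-{pos}"
    # -h asks tabix to include the VCF header, which carries the sample IDs.
    return ["tabix", "-h", vcf_path, region]


# Placeholder path: replace with a bgzipped, tabix-indexed VCF.
cmd = tabix_command("/path/to/chr9.vcf.gz", "chr9", 95923269)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, capture_output=True, text=True, check=True) to execute the lookup from Python, or the printed line can be pasted into a terminal.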