Genetic Ancestry

This page was last updated for R11.

Sandbox directory

/finngen/library-red/finngen_R12/genetic_ancestry_1.0

Description

In the PCA pipeline we identify samples whose genetic background does not match the Finnish one. This pipeline instead provides additional PC data that gives more information about the ancestry of those samples. We perform PCA on 1kG/HGDP data and project the FinnGen outliers onto the same space to characterize their ancestral background.

PCs become distorted when different batches are merged (imputed data shifts samples away from the reference populations), so only the largest batches with enough common genotyped variants were merged and analyzed. As a result, 15778 outliers (93%) are included in the analysis and 3108 variants are used for the PCA. The PCA is computed on the HGDP/1kG samples with the following plink arguments:

--pca 3 approx biallelic-var-wts

N.B. The main goal of this pipeline is not to provide an exact label for each sample, but rather to provide information about their ancestral background through the PCs. The labels we provide are a rough estimate of that information. We encourage users not to rely on these values out of the box, but to take them as a starting point for further analysis.
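
The idea of fitting PCs on a reference panel and then projecting other samples onto the same space can be sketched as follows. This is only a toy illustration and not the pipeline's code. It assumes NumPy, simulated dosages for two made-up populations, and an exact SVD rather than plink's approximate algorithm. The function names `fit_reference_pca` and `project` are hypothetical.

```python
import numpy as np


def fit_reference_pca(ref_dosages, n_pcs=3):
    """Fit PCA on a reference panel (samples x variants, dosages 0/1/2)."""
    freqs = ref_dosages.mean(axis=0) / 2.0
    keep = (freqs > 0.0) & (freqs < 1.0)           # drop monomorphic variants
    freqs = freqs[keep]
    scale = np.sqrt(2.0 * freqs * (1.0 - freqs))   # per-variant binomial SD
    z = (ref_dosages[:, keep] - 2.0 * freqs) / scale
    # Rows of vt are per-variant loadings, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return {"keep": keep, "freqs": freqs, "scale": scale, "loadings": vt[:n_pcs].T}


def project(dosages, model):
    """Project samples using the reference frequencies and loadings."""
    z = (dosages[:, model["keep"]] - 2.0 * model["freqs"]) / model["scale"]
    return z @ model["loadings"]


# Simulated data: two reference populations with different allele frequencies.
rng = np.random.default_rng(0)
n_variants = 200
freq_a = rng.uniform(0.1, 0.9, n_variants)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_variants), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(50, n_variants)).astype(float)
pop_b = rng.binomial(2, freq_b, size=(50, n_variants)).astype(float)
reference = np.vstack([pop_a, pop_b])

model = fit_reference_pca(reference)

# Reference projected back onto its own PC space, so that it goes through
# exactly the same transformation as the projected samples.
ref_scores = project(reference, model)

# New samples (drawn here from population B) projected onto the same space.
new_samples = rng.binomial(2, freq_b, size=(10, n_variants)).astype(float)
new_scores = project(new_samples, model)

centroid_a = ref_scores[:50].mean(axis=0)
centroid_b = ref_scores[50:].mean(axis=0)
```

In this sketch both the reference and the new samples are standardized with the reference allele frequencies and multiplied by the same loadings, which is what makes their coordinates directly comparable.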

Data files

| File | Description |
|---|---|
| [PREFIX_BATCHES]_proj.eigenvec | HGDP PCA eigenvectors |
| [PREFIX_BATCHES]_proj.eigenvec.var | HGDP PCA eigenvector loadings |
| [PREFIX_BATCHES]_proj.eigenval | HGDP PCA eigenvalues |
| [PREFIX_BATCHES]_proj_proj.sscore | FG outliers projected onto the HGDP space |
| [PREFIX_BATCHES]_proj_ref.sscore | HGDP projected back onto its own eigenvector space, to harmonize it with the FG data |

Probabilities

| File | Description |
|---|---|
| [PREFIX_BATCHES]_proj_probs.txt | Raw table with the probabilities assigned to each FG outlier for each population label |
| [PREFIX_BATCHES]_proj_samples_most_likely_region.txt | Most likely population based on the probs file (argmax) |
| [PREFIX_BATCHES]_proj_samples_most_likely_region_[PROB_CUTOFF].txt | Most likely population based on the probs file (argmax), above a given probability cutoff |
| [PREFIX_BATCHES]_proj_finless_samples_most_likely_region.txt | Most likely population based on the probs file (argmax), with FIN probabilities removed |
| [PREFIX_BATCHES]_proj_finles_samples_most_likely_region_[PROB_CUTOFF].txt | Most likely population based on the probs file (argmax), above a given probability cutoff, with FIN probabilities removed |
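
The relationship between the four "most likely region" files and the raw probabilities can be illustrated with a minimal sketch. It assumes each sample's probabilities are available as a label-to-probability mapping. The labels other than FIN, the sample IDs, and the helper `most_likely_region` are made up for illustration. Whether the pipeline renormalizes the probabilities after removing FIN is not stated on this page, so this sketch leaves them unchanged.

```python
def most_likely_region(probs, cutoff=0.0, drop=()):
    """Return (label, probability) of the argmax over the kept labels.

    The label is None when the best probability is below `cutoff`.
    Probabilities are NOT renormalized after dropping labels (assumption).
    """
    candidates = {label: p for label, p in probs.items() if label not in drop}
    best_label = max(candidates, key=candidates.get)
    best_prob = candidates[best_label]
    if best_prob < cutoff:
        return None, best_prob
    return best_label, best_prob


# Illustrative values only; real labels and columns come from the probs file.
probs_by_sample = {
    "sample_1": {"FIN": 0.55, "EUR": 0.40, "EAS": 0.05},
    "sample_2": {"FIN": 0.10, "EUR": 0.35, "EAS": 0.55},
}

plain = {s: most_likely_region(p) for s, p in probs_by_sample.items()}
with_cutoff = {s: most_likely_region(p, cutoff=0.8) for s, p in probs_by_sample.items()}
finless = {s: most_likely_region(p, drop=("FIN",)) for s, p in probs_by_sample.items()}
```

With a cutoff, a sample whose best probability is too low gets no label, which is a reminder that the argmax alone can hide substantial uncertainty.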

Documentation

| File | Description |
|---|---|
| [PREFIX_BATCHES]_proj_scatter_all.png/pdf | Pairwise PC plot of FG vs 1kG data |
| [PREFIX_BATCHES]_proj_scatter_tags.png/pdf | Pairwise PC plot of FG vs 1kG data, with the different 1kG populations labelled separately |
| [PREFIX_BATCHES]_proj_tags_pc_density.png/pdf | Density plots for each PC of FG/1kG data, grouped by population |

Notes

These are typical PCA outputs when a mixed set of samples across batches is used, which mixes chip and imputed data semi-randomly. In the figure above, FG data is projected onto the 1kG space, but the FG components appear shrunk/shifted. The same happens when the two datasets are merged into one.
