We included 1,801 endpoints from the phenotype/registry teams’ pipeline in the analysis. Endpoints marked OMIT in the endpoint definition file were excluded, as were endpoints with fewer than 100 cases among the 135,638 samples. “Smoking: yes” and “Smoking: current or former” were created based on the respective smoking data in the phenotype data file.
For the null model calculation for each endpoint, we used age, sex, 10 PCs and genotyping batch as covariates.
For calculating the genetic relationship matrix (GRM), we used the genotype dataset in which genotypes with GP < 0.95 have been set to missing. Only variants imputed with an INFO score > 0.95 in all batches were used. Variants with > 3 % missing genotypes were excluded, as were variants with MAF < 5 %. The remaining variants were LD pruned with a 1 Mb window and an r^2 threshold of 0.1. This resulted in a set of 35,557 common, well-imputed variants for GRM calculation.
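The per-variant filters described above can be illustrated with a minimal sketch. The `Variant` record, its field names and the toy values are hypothetical and exist only for this example; they are not part of the actual pipeline. LD pruning is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    # Hypothetical record; field names are assumptions made for this sketch.
    variant_id: str
    info_by_batch: List[float]  # imputation INFO score in each genotyping batch
    missing_rate: float         # fraction of genotypes set to missing (GP < 0.95)
    maf: float                  # minor allele frequency


def passes_grm_filters(v: Variant) -> bool:
    """Apply the three per-variant thresholds stated in the text."""
    well_imputed = all(info > 0.95 for info in v.info_by_batch)  # INFO > 0.95 in all batches
    low_missing = v.missing_rate <= 0.03                         # drop > 3 % missing
    common = v.maf >= 0.05                                       # drop MAF < 5 %
    return well_imputed and low_missing and common


# Toy data, not real variants.
variants = [
    Variant("v1", [0.99, 0.98], 0.01, 0.20),  # passes every filter
    Variant("v2", [0.99, 0.90], 0.01, 0.20),  # INFO too low in one batch
    Variant("v3", [0.99, 0.99], 0.05, 0.20),  # too many missing genotypes
    Variant("v4", [0.99, 0.99], 0.01, 0.01),  # too rare
]
kept = [v.variant_id for v in variants if passes_grm_filters(v)]
print(kept)
```

Only variants that pass all three checks would go on to the LD pruning step.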
SAIGE options for the null computation:
LOCO = false
numMarkers = 30
traceCVcutoff = 0.0025
ratioCVcutoff = 0.001
We ran association tests against each of the 1,801 endpoints with SAIGE for each variant with a minimum allele count of 10 from the imputation pipeline (SAIGE option minMAC = 10). The alternative allele is always the effect allele.
This is a description of the quality control procedures applied before running the GWAS.
In summary, we removed 10,992 samples who were either of non-Finnish ancestry or twins/duplicates. Finnish ancestry was assessed with a combination of PCA and a Bayesian method for outlier detection.
The PCA for population structure has been run in the following way:
The following filters were applied:
Exclusion of chromosome 23
Exclusion of variants with info score < 0.95
Exclusion of variants with missingness > 0.01 (based on the GP, see conversion)
Exclusion of variants with MAF < 0.05
LD pruning with window 500kb, step 50kb, r^2 filter of 0.1
This filtering step produced 42,805 variants, which were used for the rest of the analysis.
Then, FinnGen data was merged with the 1000 Genomes Project (1kgp) data, using the variants mentioned above. A round of PCA was performed and a Bayesian algorithm was used to spot outliers. This process removed 4,208 outliers, of which 1,820 are FinnGen samples.
The figure below shows the scatter plots for the first 3 PCs. Outliers, in red, are separated from the FinnGen samples (blue cluster). While the method automatically flagged the 1kgp samples with non-European and Southern European ancestries as outliers, it did not exclude 12 samples of Western European origin.
Since the signal from these samples would have been too small to allow a second round to be performed without detecting substructures of the Finnish population, another approach was used. The FinnGen samples that survived the first round were used to compute another PCA. The EUR and FIN 1kgp samples were then projected onto the space generated by the first 3 PCs. Then, the centroid of each cluster was calculated and used to compute the squared Mahalanobis distance of each FinnGen sample to each of the centroids. Because the Mahalanobis distance scales each component to unit variance, the squared distance can be treated as a sum of 3 independent squared variables. This allowed the squared distance to be mapped to a probability (chi-squared distribution with 3 degrees of freedom). Therefore, for each cluster, a probability of belonging to it was computed.
Next, a threshold of 0.95 was used to exclude FinnGen samples whose relative probability of belonging to the Finnish cluster was below that level. This method identified another 359 outliers.
FIN 1kgp samples are in purple, while EUR 1kgp samples are in blue. Samples in green are FinnGen samples who are flagged as being non-Finnish, while red ones are.
In the next step, all pairs of FinnGen samples related up to the second degree were identified. The figure shows the distribution of kinship values.
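The distance-to-probability mapping and the 0.95 threshold can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: it assumes each cluster's own sample covariance is used for the Mahalanobis distance, and that the "relative" probability is the FIN probability divided by the sum of the FIN and EUR probabilities. All function names and the toy clusters are hypothetical. The chi-squared tail probability for 3 degrees of freedom is written in closed form so that only the standard library is needed.

```python
import math


def mean_vector(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]


def covariance(points, mean):
    n = len(points)
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
             for j in range(3)] for i in range(3)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def squared_mahalanobis(x, mean, cov):
    d = [x[i] - mean[i] for i in range(3)]
    y = solve(cov, d)  # y = cov^-1 d
    return sum(d[i] * y[i] for i in range(3))


def chi2_sf_3df(x):
    """P(X > x) for a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def cluster_probability(sample, cluster):
    mu = mean_vector(cluster)
    return chi2_sf_3df(squared_mahalanobis(sample, mu, covariance(cluster, mu)))


def relative_fin_probability(sample, fin_cluster, eur_cluster):
    # Assumption: "relative" means normalised over the two reference clusters.
    p_fin = cluster_probability(sample, fin_cluster)
    p_eur = cluster_probability(sample, eur_cluster)
    total = p_fin + p_eur
    return p_fin / total if total > 0 else 0.0


# Toy reference clusters in PC1-PC3 space (made-up coordinates).
fin = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
eur = [(x + 5, y, z) for x, y, z in fin]

for sample in [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]:
    rel = relative_fin_probability(sample, fin, eur)
    print(sample, "kept" if rel >= 0.95 else "flagged as non-Finnish")
```

In this toy setup, a sample at the FIN centroid is kept and a sample at the EUR centroid is flagged.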
Then, the previously defined “non Finnish” samples were excluded and 2 algorithms were used to return a unique subset of unrelated samples:
one called greedy, which repeatedly removes the highest-degree node from the network of relations until no links are left in the network.
one called native, based on a native implementation in Python’s networkx package, applied to each subgraph of the network.
The larger of the independent sets returned by the two algorithms was used as the set of samples to keep, while the others were flagged as “outliers” for the final PCA.
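A minimal sketch of the greedy strategy described above, written in plain Python rather than with networkx. The function name, the input format (a list of related sample-ID pairs) and the tie-breaking rule are assumptions made for this example.

```python
from collections import defaultdict


def greedy_unrelated(pairs):
    """Repeatedly drop the sample with the most remaining relatives.

    `pairs` is an iterable of (sample_a, sample_b) related pairs.
    Returns (kept, removed); samples that appear in no pair are not
    listed here because they would always be kept.
    """
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    removed = set()
    while True:
        # Highest degree first; ties are broken by sample ID (an arbitrary
        # choice for this sketch) so that the result is deterministic.
        node = max(neighbours, key=lambda n: (len(neighbours[n]), n), default=None)
        if node is None or not neighbours[node]:
            break  # no links left in the network
        for other in neighbours.pop(node):
            neighbours[other].discard(node)
        removed.add(node)

    return set(neighbours), removed


# Toy relatedness network: A is related to B, C and D; D is related to E.
kept, removed = greedy_unrelated([("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")])
print(sorted(kept), sorted(removed))
```

In this toy network, removing A and then one member of the remaining D-E pair leaves three mutually unrelated samples. A greedy pass like this does not guarantee the largest possible unrelated set, which is one reason to compare it against a second algorithm.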
Then, the subset of outliers who also belong to the set of duplicates/twins was identified.
For the final step, the FinnGen samples were separated into three groups:
109,184 inliers: unrelated samples with Finnish ancestry.
33,302 outliers: non-duplicate samples with Finnish ancestry who are related to the inliers.
4,144 rejected samples: samples that are either of non-Finnish ancestry or twins/duplicates with relations to other samples.
Finally, the PCA for the inliers was calculated, and the outliers were then projected onto the same space, which allowed covariates to be calculated for a total of 142,486 samples.
Of the 142,486 non-duplicate population inlier samples from PCA, 5,846 were excluded from analysis because of missing minimum phenotype data. Finally, 1,002 samples of age less than 18 were excluded. A total of 135,638 samples was used for core analysis.
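As a quick consistency check, the sample counts quoted in this section can be reproduced with simple arithmetic. The variable names are illustrative only.

```python
# Counts taken from the text above.
inliers = 109_184           # unrelated samples with Finnish ancestry
related_outliers = 33_302   # Finnish ancestry, related to the inliers
with_pcs = inliers + related_outliers

missing_phenotype = 5_846   # excluded: missing minimum phenotype data
under_18 = 1_002            # excluded: age less than 18
analysed = with_pcs - missing_phenotype - under_18

print(with_pcs, analysed)
```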
Documentation from the original developers of the algorithm can be found here: http://www.well.ox.ac.uk/~spencer/Aberrant/aberrant-manu.
We used the SAIGE (r3 release) software for running the R3 GWAS.
SAIGE is a mixed model logistic regression R/C++ package, able to account for related samples.
We analyzed:
1,801 endpoints
135,638 samples
16,962,023 variants
We included the following covariates in the model: sex, age, 10 PCs, genotyping batch.