We used regenie for release 9. Regenie's main advantages are fast leave-one-chromosome-out relatedness calculation which avoids proximal contamination, and use of an approximate Firth test which gives more reliable effect size estimates for rare variants.
We used regenie version 2.2.4.
Links:
The Docker image used in the analysis is available in Docker Hub.
We analyzed:
2,272 endpoints
2,269 binary endpoints
3 quantitative endpoints (HEIGHT_IRN, WEIGHT_IRN, BMI_IRN)
377,277 samples
210,870 females
166,407 males
20,175,454 variants
We included the following covariates in the model: sex, age, 10 PCs, genotyping batch.
This is a description of the quality control procedures applied before running the GWAS.
The PCA for population structure has been run in the following way:
The sisu version 4 imputation panel is pruned iteratively, until a target number of SNPS is reached:
9,385,753 starting variants: only variants with a minimum info score of 0.9 in all batches are kept.
The script starts with [500.0, 50.0, 0.9] params in plink (window,step,r2). It then decreases 0.05 in r2 iteratively pruning the imputation panel until the threshold of 200,000 snps is reached. Once the SNP count falls under 200,000 the closest pruning is returned.
If the higher r2 is closer, 200,000 snps are randomly selected, else the last pruned snps are returned.
Plink flags used: --snps-only --chr 1-22 --max-alleles 2 --maf 0.01 .
For this run the final ld params are --indep-pairwise 500.0 50.0 0.15 and 187,068 snps are returned.
Then, FinnGen data was merged with the 1k genome project (1kgp) data, using the variants mentioned above. This reduced the number of variants from 187,068 to 185,517. A round of PCA was performed and a bayesian algorithm was used to spot outliers. This process got rid of 12,639 FinnGen samples. The figure below shows the scatter plots for the first 3 PCs. Outliers, in green, are separated from the FinnGen red cluster.
While the method automatically detected as being outliers the 1kg samples with non European and southern European ancestries, it did not manage to exclude some samples with Western European origins. Since the signal from these samples would have been too small to allow a second round to be performed without detecting substructures of the Finnish population, another approach was used. The FinnGen samples that survived the first round were used to compute another PCA. The EUR and FIN 1kg samples were then projected onto the space generated by the first 3 PCs. Then, the centroid of each cluster was calculated and used to calculate the squared mahalanobis distance of each FinnGen sample to each of the centroids. Being the squared distance a sum of squared variables (with unitary variance, due to the mahalanobis distance), we could see it as a sum of 3 independent squared variables. This allowed to map the squared distance into a probability (chi squared with 3 degrees of freedom). Therefore, for each cluster, a probability of being part of it was computed. Then, a threshold of 0.95 was used to exclude FinnGen samples whose relative chance of being part of the Finnish cluster was below the level. This method produced another 64 outliers. The figure below shows the first three principal components.
FIN 1kgp samples are in purple, while EUR 1kgp samples are in Blue. Samples in green are FinnGen samples who are flagged as being non Finnish, while red ones are considered Finnish.
Then all pairs of FinnGen samples up to second degree were returned. The figure below shows the distribution of kinship values.
Then, the previously defined “non Finnish” samples were excluded and 2 algorithms were used to return a unique subset of unrelated samples:
one called greedy would continuously remove the highest degree node from the network of relations, until no more links are left in the network.
one called native, based on a native implementation of python’s networkx package, performed on each subgraph of the network.
The largest independent set of either algorithm would be used to keep those sample, while flagging the others as “outliers” for the final PCA.
Then, the subset of outliers who also belong to the set of duplicates/twins was identified.
To compute the final step the Finngen samples were ultimately separated in three groups:
233,371 inliers: unrelated samples with Finnish ancestry.
144,127 outliers: non duplicate samples with Finnish ancestries, but who are also related to the inliers.
15,151 rejected samples: either of non Finnish ancestry or are twins/duplicates with relations to other samples.
Finally, the PCA for the inliers was calculated, and then outliers were projected on the same PC space, allowing to calculate covariates for a total of 377,498 samples.
Of the 377,498 non-duplicate population inlier samples from PCA, we excluded 215 samples from analysis because of missing minimum phenotype data, and 6 samples because of failing sex check with F thresholds of 0.4 and 0.7. Sex matched between genotype data and phenotype data for all individuals! A total of 377,277 samples were used for core analysis. There are 210,870 females and 166,407 males among these samples.
Documentation from the original developers of the algorithm can be found here: http://www.well.ox.ac.uk/~spencer/Aberrant/aberrant-manu.
We included 2,272 endpoints in the analysis, which consisted of 2,269 binary endpoints and 3 quantitative endpoints (HEIGHT_IRN, WEIGHT_IRN, BMI_IRN). Endpoints with less than 80 cases among the 377,277 samples were excluded, as well as endpoints labeled with an OMIT tag in the endpoint definition file.
The quantitative endpoints HEIGHT and WEIGHT were acquired from minimum phenotype data. After that, phenotype BMI was formed from them, and all of them were inverse normal transformed.
For regenie step 1 LOCO prediction computation for each endpoint, we used age, sex, 10 PCs, Finngen 1 or 2 chip or legacy genotyping batch as covariates. For sex-specific phenotypes, sample sex was left out from the covariates. We excluded covariates that had less than 10 cases.
For calculating genetic relatedness in regenie step 1, we included variants 1) imputed with an INFO score > 0.95 in all batches and 2) > 97 % non-missing genotypes and 3) MAF > 1 %. The remaining variants were LD pruned with a 1Mb window and r2 threshold of 0.1. This resulted in a set of 60,896 well-imputed not rare variants for relatedness calculation.
We used a genotype block size of 1,000 in regenie step 1.
We ran association tests with regenie for each of the 2,272 endpoints for each variant with a minimum allele count of 5 among each phenotype’s cases and controls. We used the approximate Firth test for variants with an initial p-value of less than 0.01 and computed the standard error based on effect size and likelihood ratio test p-value (regenie options --firth --approx --pThresh 0.01 --firth-se).