GWAS

We used the SAIGE software for running the R2 GWAS.

SAIGE is a mixed model logistic regression R/C++ package, able to account for related samples.

We analyzed:

1,122 endpoints
96,499 samples
17,054,975 variants

We included the following covariates in the model: sex, age, 10 PCs, genotyping batch.

Quality control

This is a description of the quality control procedures applied before running the GWAS.

In summary, we removed 4,095 samples who were either of non-Finnish ancestry or twins/duplicates. Finnish ancestry was assessed with a combination of PCA and a Bayesian method for outlier detection.

Sample QC

Our data set initially consisted of 102,739 samples, of which we kept 100,355 after removing duplicates. Next, we excluded samples of non-Finnish ancestry using a PCA approach.

PCA

After filtering for high-quality variants (36,073 variants), we merged the data set with the thousand genomes project data (EUR individuals only). We then performed a PCA on the merged data set and used a Bayesian approach to determine outliers (see below). This step identified 1,023 samples from outside the Central/Northern European region.

Western European and British samples were still present after this step, but they were too few to drive a signal in the PCA. We therefore used a different approach for the second round: we ran a PCA on the 99,333 remaining samples and projected onto the same space the 98 Finnish (FIN) and 89 non-Finnish European (EUR) samples from the thousand genomes project that survived round one. For each FinnGen sample, we calculated its Mahalanobis distance to the FIN and EUR centroids. Each distance was mapped to a probability using a chi-squared distribution with 3 degrees of freedom. We classified as Finnish those samples for which the relative probability of being Finnish versus European was > 95%. This left us with 98,644 samples.
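The centroid-based classification described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis. It assumes that the distribution with 3 degrees of freedom is chi-squared, that the squared Mahalanobis distance is the quantity mapped to a probability, and that the "relative probability" is the FIN tail probability divided by the sum of the FIN and EUR tail probabilities. All function names, centroids, and covariance matrices are hypothetical.

```python
import math

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def mahalanobis_squared(point, centroid, cov):
    """Squared Mahalanobis distance of a point (first 3 PCs) from a centroid."""
    diff = [p - c for p, c in zip(point, centroid)]
    solved = solve_linear_system(cov, diff)  # cov^-1 @ diff
    return sum(d * s for d, s in zip(diff, solved))

def chi2_upper_tail_3df(x):
    """Upper-tail probability of a chi-squared distribution with 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def relative_fin_probability(point, fin_centroid, fin_cov, eur_centroid, eur_cov):
    """Assumed definition: p_FIN / (p_FIN + p_EUR)."""
    p_fin = chi2_upper_tail_3df(mahalanobis_squared(point, fin_centroid, fin_cov))
    p_eur = chi2_upper_tail_3df(mahalanobis_squared(point, eur_centroid, eur_cov))
    total = p_fin + p_eur
    return p_fin / total if total > 0 else float("nan")

# Toy values for illustration only, not estimates from real data.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fin_centroid = [0.0, 0.0, 0.0]
eur_centroid = [5.0, 0.0, 0.0]

near_fin = relative_fin_probability([0.0, 0.0, 0.0], fin_centroid, identity, eur_centroid, identity)
midway = relative_fin_probability([2.5, 0.0, 0.0], fin_centroid, identity, eur_centroid, identity)

print(near_fin > 0.95)  # True: classified as Finnish
print(midway > 0.95)    # False: equidistant from both centroids
```

In practice the centroids and covariance matrices would presumably be estimated from the projected FIN and EUR reference samples; that estimation step is not shown here.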

Missing Data

Of the 98,644 non-duplicate PCA inliers, we removed 2,145 individuals that didn’t have phenotype or age data. Thus the final number of analyzed individuals was 96,499.
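The sample counts reported in this section can be cross-checked with simple arithmetic. The sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Sample counts as stated in the text.
initial = 102_739
after_duplicate_removal = 100_355
after_ancestry_filter = 98_644
missing_phenotype_or_age = 2_145

duplicates_removed = initial - after_duplicate_removal
ancestry_outliers_removed = after_duplicate_removal - after_ancestry_filter
total_removed_in_qc = duplicates_removed + ancestry_outliers_removed
final_analyzed = after_ancestry_filter - missing_phenotype_or_age

print(duplicates_removed)         # 2384
print(ancestry_outliers_removed)  # 1711
print(total_removed_in_qc)        # 4095, the total given in the QC summary
print(final_analyzed)             # 96499, the number of analyzed samples
```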

Further info

Bayesian outlier detection

Code for the method can be found here:.

Documentation from the original developers of the algorithm can be found here: .

Centroid based outlier detection

The figure below shows how the centroid-based outlier detection works by plotting the distribution of the first 3 components of the PCA. We can see that the FinnGen samples labelled as Western European (in blue) are extremely close to the Western European centroid in the first two components.

Purple and green dots represent Finnish and Western European (EUR) samples, respectively, from the thousand genomes data set. Blue dots are FinnGen samples that were found to be more likely to belong to the EUR group than to the Finnish one. Red dots are FinnGen samples labelled as belonging to the Finnish centroid.

Association tests

Null models

For the null model calculation for each endpoint, we used age, sex, 10 PCs and genotyping batch as covariates.

For calculating the genetic relationship matrix, we used 49,811 independent, common, well-imputed variants with a posterior genotyping probability >0.95 and missingness <0.05 (LD r2 < 0.1, MAF > 0.05, INFO > 0.95).
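As an illustration of the per-variant thresholds listed above, the following sketch filters variants on MAF, INFO, and missingness. It is not the pipeline code: the `VariantStats` fields and the example values are hypothetical, and the LD pruning step (r2 < 0.1) is omitted because it requires genotype data and is normally done with a dedicated tool.

```python
from dataclasses import dataclass

@dataclass
class VariantStats:
    """Hypothetical per-variant summary statistics."""
    variant_id: str
    maf: float          # minor allele frequency
    info: float         # imputation INFO score
    missingness: float  # fraction of samples with a missing genotype call

# Thresholds as stated in the text.
MAF_MIN = 0.05
INFO_MIN = 0.95
MISSINGNESS_MAX = 0.05

def passes_grm_filters(v):
    """True if the variant is common, well imputed, and rarely missing."""
    return v.maf > MAF_MIN and v.info > INFO_MIN and v.missingness < MISSINGNESS_MAX

# Made-up example variants.
variants = [
    VariantStats("var_a", maf=0.20, info=0.99, missingness=0.01),  # passes
    VariantStats("var_b", maf=0.01, info=0.99, missingness=0.01),  # too rare
    VariantStats("var_c", maf=0.20, info=0.90, missingness=0.01),  # poorly imputed
    VariantStats("var_d", maf=0.20, info=0.99, missingness=0.10),  # too much missing data
]

kept = [v.variant_id for v in variants if passes_grm_filters(v)]
print(kept)  # ['var_a']
```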

SAIGE options for the null computation:

LOCO = false
numMarkers = 30
traceCVcutoff = 0.0025
ratioCVcutoff = 0.001
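For orientation, the options above could be passed to SAIGE's step 1 script roughly as follows. This is a sketch under assumptions, not the command we ran: the file paths, the endpoint name, and the covariate column names are hypothetical, and `traitType=binary` is assumed because the endpoints are analyzed with logistic regression. The actual commands are recorded in the workflow metadata.

```python
def build_null_model_command(endpoint, plink_prefix, pheno_file, out_prefix):
    """Assemble an argument list for SAIGE's step1_fitNULLGLMM.R (illustrative)."""
    # Hypothetical column names for sex, age, 10 PCs, and genotyping batch.
    covariates = ["SEX", "AGE"] + [f"PC{i}" for i in range(1, 11)] + ["BATCH"]
    return [
        "Rscript", "step1_fitNULLGLMM.R",
        f"--plinkFile={plink_prefix}",
        f"--phenoFile={pheno_file}",
        f"--phenoCol={endpoint}",
        f"--covarColList={','.join(covariates)}",
        "--traitType=binary",
        f"--outputPrefix={out_prefix}",
        # Options as listed in the text.
        "--LOCO=FALSE",
        "--numMarkers=30",
        "--traceCVcutoff=0.0025",
        "--ratioCVcutoff=0.001",
    ]

# Placeholder arguments; none of these names come from the real pipeline.
command = build_null_model_command(
    endpoint="EXAMPLE_ENDPOINT",
    plink_prefix="grm_variants",
    pheno_file="phenotypes.txt",
    out_prefix="null_EXAMPLE_ENDPOINT",
)
print(" ".join(command))
```

Building the command as a list rather than a single string makes it easy to hand to `subprocess.run` without shell quoting issues.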

Association tests

We ran association tests for each of the 1,122 endpoints on every variant from the imputation pipeline with a minimum allele count of 10 (SAIGE option minMAC = 10). The alternative allele is always the effect allele.
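To make the allele count threshold concrete, the sketch below computes a minor allele count from per-sample alternative-allele counts and applies a cutoff of 10. It is an illustration only: the function names are hypothetical, genotypes are simplified to hard calls (0, 1, or 2), and treating the cutoff as inclusive (count >= 10) is an assumption about how SAIGE applies minMAC.

```python
MIN_MAC = 10  # threshold stated in the text

def minor_allele_count(alt_allele_counts):
    """Count of the less frequent allele, given per-sample ALT counts (0, 1, or 2)."""
    alt_count = sum(alt_allele_counts)
    total_alleles = 2 * len(alt_allele_counts)
    return min(alt_count, total_alleles - alt_count)

def passes_min_mac(alt_allele_counts, min_mac=MIN_MAC):
    """Assumed inclusive cutoff: keep the variant if its MAC is at least min_mac."""
    return minor_allele_count(alt_allele_counts) >= min_mac

# Toy genotype vectors for 100 samples each.
twelve_alt_alleles = [1] * 12 + [0] * 88           # ALT is the minor allele, MAC = 12
five_alt_alleles = [1] * 5 + [0] * 95              # ALT is the minor allele, MAC = 5
five_ref_alleles = [2] * 97 + [1] * 1 + [0] * 2    # REF is the minor allele, MAC = 5

print(passes_min_mac(twelve_alt_alleles))  # True
print(passes_min_mac(five_alt_alleles))    # False
print(passes_min_mac(five_ref_alleles))    # False
```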

Software

The code we used is available in . The original SAIGE codebase is available in .

Workflows

We ran the analysis in Google Cloud using WDL and . The WDL workflow metadata including SAIGE commands and their inputs are available at:

gs://finngen-production-library-green/R2/workflows
