FinnGen individuals were genotyped with Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix (Thermo Fisher Scientific, Santa Clara, CA, USA) chip arrays.
Chip genotype data were imputed using the population-specific SISu v3 imputation reference panel of 3,775 whole genomes.
Post-imputation QC involved excluding variants with imputation INFO < 0.7.
Total number of individuals: 102,739
Total number of variants (merged set): 17,054,975
Reference assembly: GRCh38/hg38
The SISu v3 panel consists of 3,775 Finnish individuals with high-coverage (30x) whole-genome sequencing (WGS) data, drawn from six cohorts:
METSIM (PIs Markku Laakso and Mike Boehnke)
FINRISK (PI Pekka Jousilahti)
Health2000 (PI Seppo Koskinen)
Finnish Migraine Family Study (PI Aarno Palotie)
Merck/Tienari samples (PI Pentti Tienari)
MESTA samples (PI Jaana Suvisaari)
High-coverage (25-30x) WGS data used to develop the SISu v3 reference panel were generated at the Broad Institute of MIT and Harvard and at the McDonnell Genome Institute at Washington University, and were jointly processed at the Broad Institute.
Chip genotype data processing and QC
Samples were genotyped with Illumina (Illumina Inc., San Diego, CA, USA) and Affymetrix arrays (Thermo Fisher Scientific, Santa Clara, CA, USA).
Genotype calls were made with the GenCall and zCall algorithms for Illumina data and the AxiomGT1 algorithm for Affymetrix data.
Chip genotyping data produced with previous chip platforms and reference genome builds were lifted over to build version 38 (GRCh38/hg38) following the protocol described here: dx.doi.org/10.17504/protocols.io.nqtddwn.
In sample-wise quality control, individuals with ambiguous gender, high genotype missingness (>5%), excess heterozygosity (±4 SD) or non-Finnish ancestry were excluded. In variant-wise quality control, variants with high missingness (>2%), low HWE P-value (<1e-6) or low minor allele count (MAC < 3) were excluded.
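The numeric thresholds above can be sketched as simple filters. This is a minimal illustration, not the pipeline's actual code: the input layout (lists of dicts with the keys `missingness`, `het_rate`, `hwe_p` and `mac`) and the function names are assumptions, the statistics are taken as pre-computed, and the sex and ancestry checks are left out.

```python
from statistics import mean, stdev

# Thresholds taken from the text above.
MAX_SAMPLE_MISSINGNESS = 0.05   # >5% missing genotypes
HET_SD_LIMIT = 4.0              # more than 4 SD from the mean heterozygosity
MAX_VARIANT_MISSINGNESS = 0.02  # >2% missing calls
MIN_HWE_P = 1e-6                # HWE P-value below this fails
MIN_MAC = 3                     # minor allele count below this fails


def failing_samples(samples):
    """Return IDs of samples failing the missingness or heterozygosity filter.

    `samples` is a list of dicts with the hypothetical keys
    'id', 'missingness' and 'het_rate'.
    """
    rates = [s["het_rate"] for s in samples]
    mu, sd = mean(rates), stdev(rates)
    failed = []
    for s in samples:
        too_missing = s["missingness"] > MAX_SAMPLE_MISSINGNESS
        het_outlier = abs(s["het_rate"] - mu) > HET_SD_LIMIT * sd
        if too_missing or het_outlier:
            failed.append(s["id"])
    return failed


def failing_variants(variants):
    """Return IDs of variants failing the missingness, HWE or MAC filter.

    `variants` is a list of dicts with the hypothetical keys
    'id', 'missingness', 'hwe_p' and 'mac'.
    """
    failed = []
    for v in variants:
        if (
            v["missingness"] > MAX_VARIANT_MISSINGNESS
            or v["hwe_p"] < MIN_HWE_P
            or v["mac"] < MIN_MAC
        ):
            failed.append(v["id"])
    return failed
```

In practice these statistics would come from a tool such as Plink, which appears in the software list; the sketch only shows how the cut-offs combine.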
Prior to imputation, chip-genotyped samples were pre-phased with Eagle 2.3.5 (https://data.broadinstitute.org/alkesgroup/Eagle/) using the default parameters, except that the number of conditioning haplotypes was set to 20,000.
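As an illustration of the one non-default setting, the sketch below assembles an Eagle command line in which the number of conditioning haplotypes is passed through Eagle's `--Kpbwt` option. The function name, file paths, thread count and the choice of VCF input are assumptions made for the example; the exact invocation used in the pipeline is not given here.

```python
import shlex


def eagle_prephase_command(vcf_path, genetic_map, out_prefix, chrom, threads=4):
    """Build an argument list for pre-phasing one chromosome with Eagle.

    All paths and the thread count are placeholders (assumptions).
    Only --Kpbwt=20000 reflects a setting stated in the text.
    """
    return [
        "eagle",
        f"--vcf={vcf_path}",
        f"--geneticMapFile={genetic_map}",
        f"--chrom={chrom}",
        f"--outPrefix={out_prefix}",
        f"--numThreads={threads}",
        "--Kpbwt=20000",  # number of conditioning haplotypes
    ]


if __name__ == "__main__":
    cmd = eagle_prephase_command(
        "chip_chr1.vcf.gz", "genetic_map_hg38.txt.gz", "chip_chr1.phased", 1
    )
    # Print the command for inspection instead of running it.
    print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run` once the placeholder paths are replaced with real files.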
Cromwell-29 and 31
Wdltool-0.14
Plink 1.9 and 2.0
BCFtools 1.5 and 1.7
Eagle 2.3.5
Beagle 4.1 (version 08Jun17.d8b)
R 3.4.1 (packages: data.table 1.10.4, sm 2.2-5.4)
Genotype imputation was done with the population-specific SISu v3 reference panel.
The variant call set was produced with the GATK HaplotypeCaller algorithm, following GATK best practices for variant calling.
Genotype-, sample- and variant-wise QC was applied iteratively using the Hail framework v0.1, and the resulting high-quality WGS data for 3,775 individuals were phased with Eagle 2.3.5 as described in the previous section.
Genotype imputation was carried out by using the population-specific SISu v3 imputation reference panel with Beagle 4.1 (version 08Jun17.d8b) as described in the following protocol: dx.doi.org/10.17504/protocols.io.nmndc5e.
Post-imputation quality control involved checking that the imputation INFO-value distribution conformed to expectations, comparing MAF between the target dataset and the imputation reference panel, and checking the chromosomal continuity of the imputed genotype calls.
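Two of these checks can be sketched in a few lines. The example below is only an illustration: the function names, the allele-frequency dictionaries keyed by variant ID and the 0.1 MAF-difference cut-off are assumptions, since the text does not state how the comparison was thresholded. Continuity is approximated here by the largest gap between consecutive imputed positions on a chromosome.

```python
def maf(allele_freq):
    """Convert an allele frequency into a minor allele frequency."""
    return min(allele_freq, 1.0 - allele_freq)


def maf_discordant(target_af, panel_af, max_abs_diff=0.1):
    """Return sorted IDs of shared variants whose MAF differs by more than
    `max_abs_diff` between the target data and the reference panel.

    The 0.1 default is an illustrative assumption, not a documented value.
    """
    shared = target_af.keys() & panel_af.keys()
    return sorted(
        v for v in shared
        if abs(maf(target_af[v]) - maf(panel_af[v])) > max_abs_diff
    )


def largest_gap(positions):
    """Return the largest distance between consecutive variant positions.

    An unusually large value can point to a stretch of a chromosome with
    no imputed calls.
    """
    ordered = sorted(positions)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0)
```

A variant flagged by `maf_discordant` is a candidate for closer inspection (for example, a strand or allele-coding mismatch) rather than an automatic exclusion.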
Optional: Post-imputation quality control also involved excluding variants with imputation INFO < 0.7.
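A minimal sketch of this filter, assuming the INFO values have already been computed and are available as `(variant_id, info)` pairs (the record layout and function name are hypothetical):

```python
INFO_THRESHOLD = 0.7  # variants with INFO below this value are excluded


def filter_by_info(records, threshold=INFO_THRESHOLD):
    """Split (variant_id, info) pairs into kept and excluded variant IDs."""
    kept, excluded = [], []
    for variant_id, info in records:
        if info < threshold:
            excluded.append(variant_id)
        else:
            kept.append(variant_id)
    return kept, excluded
```

Returning the excluded IDs alongside the kept ones makes it easy to report how many variants the optional filter removes.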