Register data pre-processing

The FinnGen register team receives raw register data from the registries and pre-processes it before creating phenotype files and releasing data files to the Sandbox.

Raw register data include a PIC (personal identification number) for each individual. The register team has created a FINNGENID for each PIC, and these FINNGENIDs are used for both genotype and phenotype data.

Pre-processing actions of the register data

  • Replace PIC with FINNGENID

  • Create EVENT AGE using the birth date from the PIC and the event date (e.g. the date of arrival at the hospital, or the date when the drug was purchased)

  • Create SEX using the PIC (if the 10th character of the PIC is even, the individual is female; if it is odd, male)

  • Harmonize variables across the different years of the registry (variable names have changed over the years)

  • Combine register data from different years into a single data file

  • Convert date variables to yyyy-mm-dd format

  • Create ICDVER based on the year of the diagnosis (ICD8: 1967-1986; ICD9: 1987-1995; ICD10: since 1996; ICD-O-3: cancer registry)

  • Separate inpatient and outpatient data based on the PALA (service type) variable (HILMO)

  • Create other register-specific variables, e.g. PARITY, NRO CHILD and NRO FETUSES in the reproductive history register, or kidney variables in the kidney register

  • *Create HOSPDAYS variable (hospital arrival date subtracted from hospital departure date; HILMO)

  • *Create APPROX EVENT DATE by blurring/masking the exact event date (see the link in this line for more information about this process)

  • *Remove denials (individuals who have asked to have their data removed from FinnGen)

*Done later in data processing
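
Several of the derived variables listed above follow directly from the PIC and the event dates. The Python sketch below illustrates how they could be computed; it is not the register team's actual pipeline. All function names are hypothetical. It assumes the standard Finnish PIC layout DDMMYYCZZZQ, handles only the century markers "+", "-" and "A", expresses EVENT AGE in years with two decimals, and returns the ICD version as a string; each of these choices is an assumption made for illustration. The PIC in the usage example is a dummy value.

```python
from datetime import date
from typing import Optional

# Century markers in the 7th character of the PIC.
# Assumption: only these three are handled; other markers are not covered here.
CENTURY = {"+": 1800, "-": 1900, "A": 2000}


def birth_date_from_pic(pic: str) -> date:
    """Read the birth date from the DDMMYY part and the century marker."""
    day, month, yy = int(pic[0:2]), int(pic[2:4]), int(pic[4:6])
    return date(CENTURY[pic[6]] + yy, month, day)


def sex_from_pic(pic: str) -> str:
    """The 10th character (index 9) is even for females, odd for males."""
    return "female" if int(pic[9]) % 2 == 0 else "male"


def event_age(pic: str, event_date: date) -> float:
    """Age at the event in years (assumption: days / 365.25, two decimals)."""
    days = (event_date - birth_date_from_pic(pic)).days
    return round(days / 365.25, 2)


def icd_version(year: int, cancer_registry: bool = False) -> Optional[str]:
    """Map the diagnosis year to an ICD version, following the ranges listed above."""
    if cancer_registry:
        return "O3"  # ICD-O-3; the label "O3" is an assumption
    if 1967 <= year <= 1986:
        return "8"
    if 1987 <= year <= 1995:
        return "9"
    if year >= 1996:
        return "10"
    return None  # years before 1967 are not covered by the list above


def hospital_days(arrival: date, departure: date) -> int:
    """Length of stay: departure date minus arrival date, in days."""
    return (departure - arrival).days


if __name__ == "__main__":
    pic = "131052-308T"  # dummy PIC used for illustration only
    print(sex_from_pic(pic))
    print(birth_date_from_pic(pic))
    print(event_age(pic, date(2000, 10, 13)))
    print(icd_version(1990))
    print(hospital_days(date(2020, 1, 1), date(2020, 1, 4)))
```

In this sketch the PIC is only read, never stored in the output; in the actual data the PIC is replaced with the FINNGENID before release.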
