Extracting minimum phenotype data per biobank

Auria Biobank

30.3.2022

Data source: TYKS Datalake (EHR)

Longitudinal data reported: no

Gender: male/female, extracted from the personal identity code

Age: years, at the time of sample collection; calculated using the date of birth (extracted from the personal identity code) and the date of sample collection

Height: cm, extracted from structured EHR data (“hoitotaulukko”)

Date(height): date of the height measurement

Weight: kg, extracted from structured EHR data (“hoitotaulukko”)

Date(weight): date of the weight measurement

Smoking status: current smoker / former smoker / never smoker / NA, extracted from medical reports by text mining algorithms

Date(smoking): date of the smoking status

Note: The text mining method for smoking status was developed by the data analysis team of Auria Biobank and is based on classification rules. Its accuracy has been shown to be about 80-90% among patients whose medical reports contain smoking information. About 35% of patients have no such information in their reports and receive the result NA.
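
As a rough illustration of the gender and age derivations listed above, the sketch below parses a Finnish personal identity code of the form DDMMYYCZZZQ and computes the age at sample collection. It is not the biobank's actual implementation. The function names are hypothetical, the example code is made up, the control character is not validated, only the century signs +, - and A are handled, and rounding the age down to full years is an assumption.

```python
from datetime import date

# Century signs of the personal identity code (only the three classic signs).
CENTURY = {"+": 1800, "-": 1900, "A": 2000}


def parse_identity_code(code: str) -> tuple[date, str]:
    """Return (date of birth, gender) from a code shaped DDMMYYCZZZQ.

    Assumption: the control character Q is not checked here.
    """
    day, month, yy = int(code[0:2]), int(code[2:4]), int(code[4:6])
    century = CENTURY[code[6]]
    individual_number = int(code[7:10])
    # Odd individual numbers denote males, even ones females.
    gender = "male" if individual_number % 2 == 1 else "female"
    return date(century + yy, month, day), gender


def age_at(birth: date, sampled: date) -> int:
    """Full years between birth and sample collection (assumed rounding)."""
    before_birthday = (sampled.month, sampled.day) < (birth.month, birth.day)
    return sampled.year - birth.year - int(before_birthday)


# Made-up identity code and sample collection date, for illustration only.
birth, gender = parse_identity_code("150675-123A")
print(birth, gender, age_at(birth, date(2020, 6, 14)))
```

The same derivation would apply to the other biobanks that report gender and age from the personal identity code.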

Biobank of Eastern Finland

18.3.2022

Data source: KYS Datalake (EHR), CORE consent management system

Longitudinal data reported: no

Gender: male/female, extracted from the personal identity code

Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

Height: cm, extracted from structured EHR data

Date(height): date of the height measurement

Weight: kg, extracted from structured EHR data

Date(weight): date of the weight measurement

Smoking status: current smoker / former smoker / never smoker / NA, extracted from medical reports by text mining algorithms

Date(smoking): date of the smoking status

Note: -

FRC Blood Service Biobank

4.6.2022

Data source: Questionnaire in connection with the biobank consent

Longitudinal data reported: no

Gender: male/female, extracted from the personal identity code

Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

Height: cm, self-reported

Date(height): date of the biobank consent

Weight: kg, self-reported

Date(weight): date of the biobank consent

Smoking status: regular smoker (years) / irregular smoker (years) / former smoker (years) / never smoker, self-reported

Date(smoking): date of the biobank consent

Note: -

Central Finland Biobank

18.3.2022

Data source: Central Finland Health Care District EHR databases

Longitudinal data reported: no

Gender: male/female, extracted from the personal identity code

Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

Height: cm, extracted from structured EHR data

Date(height): date of the height measurement

Weight: kg, extracted from structured EHR data

Date(weight): date of the weight measurement

Smoking status: current smoker / previous smoker / non-smoker / never smoked / NA, extracted from structured EHR data

Date(smoking): date of the smoking status

Note: Height, weight, and smoking status are compiled from several health records that contain this information in structured format. Height and weight information is retrieved for about 25% of sample donors. Smoking status is retrieved for about 50% of sample donors. Data mining is not currently used.

Finnish Clinical Biobank Tampere

17.3.2022

Data source: PSHP Datalake (EHR)

Longitudinal data reported: no

Gender: male/female, extracted from structured EHR data

Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

Height: cm, extracted from structured EHR data

Date(height): date of the height measurement; the measurement nearest to the sampling time is used

Weight: kg, extracted from structured EHR data

Date(weight): date of the weight measurement; the measurement nearest to the sampling time is used

Smoking status: NA

Note: -
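
The selection of the measurement nearest to the sampling time, described above for height and weight, can be sketched as follows. This is a minimal illustration and not the biobank's actual code: the function name and the (date, value) record layout are hypothetical, the values are made up, and ties are resolved in favour of the first record in the list, which is an assumption.

```python
from datetime import date


def nearest_measurement(measurements, sampled):
    """Return the (date, value) pair whose date is closest to the sampling date.

    Returns None when no measurement is available.
    Assumption: on a tie, the first record in the list wins.
    """
    if not measurements:
        return None
    return min(measurements, key=lambda m: abs((m[0] - sampled).days))


# Made-up height records (date, cm) and sampling date, for illustration only.
heights = [
    (date(2019, 1, 10), 180.0),
    (date(2020, 3, 1), 181.0),
    (date(2021, 5, 5), 180.5),
]
chosen = nearest_measurement(heights, date(2020, 2, 1))
print(chosen)
```

The returned pair supplies both the reported value and its Date(height) or Date(weight) entry.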

Helsinki Biobank

11.3.2022

Data source: HUS Datalake (EHR)

Longitudinal data reported: no

Gender: male/female, extracted from the personal identity code

Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

Height: cm, extracted from structured EHR data (for DF11 and beyond also extracted from medical reports by text mining algorithms)

Date(height): date of the height measurement

Weight: kg, extracted from structured EHR data (for DF11 and beyond also extracted from medical reports by text mining algorithms)

Date(weight): date of the weight measurement

Smoking status: current smoker / former smoker / never smoker / NA, extracted from medical reports by text mining algorithms

Date(smoking): date of the smoking status

Note: Text mining of the smoking status was based on FinBERT (https://github.com/TurkuNLP/FinBERT), a version of Google's BERT deep transfer learning model for Finnish. The smoking status classifier was evaluated on a set of 947 patients, and both the accuracy and the F-score were 94.5%. The F-scores for the classes smoker, ex-smoker, non-smoker, and NA were 92.2%, 91.8%, 98.0%, and 73.7%, respectively. No patient in the evaluation set also appeared in the development set.
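
For readers unfamiliar with the metrics quoted above, the sketch below shows how accuracy and per-class F-scores are commonly computed from pairs of true and predicted labels. It is a generic illustration and not the Helsinki Biobank evaluation code: the function names are hypothetical and the label pairs are toy data, not the 947-patient evaluation set. How the overall F-score was averaged is not stated here; for single-label classification a micro-averaged F-score equals accuracy, which would be one possible reading.

```python
def per_class_f_scores(pairs, labels):
    """F-score per class from (true label, predicted label) pairs."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs.
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def accuracy(pairs):
    """Share of pairs where the predicted label equals the true label."""
    return sum(1 for t, p in pairs if t == p) / len(pairs)


# Toy (true, predicted) pairs, for illustration only.
toy_pairs = [
    ("smoker", "smoker"),
    ("smoker", "ex-smoker"),
    ("ex-smoker", "ex-smoker"),
    ("non-smoker", "non-smoker"),
    ("NA", "non-smoker"),
]
toy_scores = per_class_f_scores(toy_pairs, ["smoker", "ex-smoker", "non-smoker", "NA"])
print(toy_scores, accuracy(toy_pairs))
```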

Northern Finland Biobank Borealis

25.3.2022

Data source: BC Platforms, Esko Systems

Longitudinal data reported: no

Gender: male/female, extracted from the personal identity code

Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection (BC Platforms)

Height: cm, extracted manually from the Esko patient information system

Date(height): measurement closest to the blood sampling time; exact dates not recorded in the minimum data set file.

Weight: kg, extracted manually from the Esko patient information system

Date(weight): measurement closest to the blood sampling time; exact dates not recorded in the minimum data set file.

Smoking status: current smoker / former smoker / never smoker / NA, extracted manually from the Esko patient information system

Date(smoking): most recent recording in the EHR or recording closest to the blood sampling time (if available); exact dates not recorded in the minimum data set file.

Note: Data must be extracted from the Esko patient information system manually, one person at a time, which is very laborious. Currently, Borealis does not have access to patient information in a structured format that could be extracted from a database.

Terveystalo Biobank

30.3.2022

Data source: Terveystalo Datalake (EHR)

Longitudinal data reported: no

Gender: male/female, extracted from the personal identity code

Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection

Height: cm, extracted from structured EHR data

Date(height): date of the height measurement

Weight: kg, extracted from structured EHR data

Date(weight): date of the weight measurement

Smoking status: current smoker / non-smoker / NA, extracted from structured EHR data

Date(smoking): date of the smoking status

Note: -
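
The biobanks above report smoking status with differing category labels (for example "former smoker" versus "previous smoker", or a bare "non-smoker"). The sketch below shows one way such labels could be mapped to a shared vocabulary. It is purely illustrative: the mapping table, the target labels, and the function name are hypothetical and do not describe how any biobank or FinnGen actually harmonizes the data. In particular, treating "irregular smoker" as current, and keeping "non-smoker" as a separate unspecified category, are assumptions.

```python
# Hypothetical mapping from reported labels to one shared vocabulary.
COMMON_LABEL = {
    "current smoker": "current",
    "regular smoker": "current",
    "irregular smoker": "current",  # assumption, not confirmed by the source
    "former smoker": "former",
    "previous smoker": "former",
    "never smoker": "never",
    "never smoked": "never",
    # "non-smoker" does not say whether the person smoked earlier,
    # so it is kept apart instead of being merged into "never".
    "non-smoker": "non-smoker, unspecified",
}


def harmonize(label):
    """Map a reported smoking label to the shared vocabulary, or "NA"."""
    if label is None:
        return "NA"
    return COMMON_LABEL.get(label.strip().lower(), "NA")


for reported in ["Previous smoker", "non-smoker", "NA", None]:
    print(reported, "->", harmonize(reported))
```

Keeping ambiguous labels in their own category avoids silently counting former smokers as never smokers.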

THL Biobank

14.4.2022

Data source: THL Biobank phenotype database PhenoWeb (and biobank cohort data files extracted from cohort databases)

Longitudinal data reported: no

Gender: male/female. Mainly extracted from biobank cohort data; for a subset, extracted from the personal identity code.

Age: years, at the time of sample collection; either directly obtained from the biobank cohort data or calculated using the date of birth and the date of sample collection

Height: cm, extracted from biobank database or cohort data files. Transformed from m into cm if needed.

Date(height): date of the height measurement. Reported only for cohorts in which the sampling date did not match the height measurement date; otherwise the same as the sampling date.

Weight: kg, extracted from biobank database or cohort data files.

Date(weight): date of the weight measurement. Reported only for cohorts in which the sampling date did not match the weight measurement date; otherwise the same as the sampling date.

Smoking status: The smoking-related attributes have been extracted from the biobank database and/or cohort data files and released for the project. For each cohort, smoking data has been delivered to the extent that it has been transferred to THL Biobank. Different cohorts have different attributes, and some cohorts do not include smoking attributes at all. The broad smoking data available from THL Biobank has then been harmonized and used in FinnGen to cover smoking status in more detail.

Date(smoking): date of the smoking data collection. Reported only for cohorts in which the sampling date did not match the smoking data collection date; otherwise the same as the sampling date.

Note: The minimum datasets were extracted one cohort at a time, and the protocol varied slightly depending on the cohort at hand.
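
The height entry above mentions converting metres to centimetres where needed. A minimal sketch of such a conversion is given below. It is not THL Biobank's actual procedure: the function name is hypothetical, and deciding the unit from the magnitude of the value (anything below 3 is read as metres) is an assumption made for illustration. A per-cohort unit declaration would be an equally plausible approach.

```python
def height_in_cm(value, max_plausible_m=3.0):
    """Return height in cm, converting values that look like metres.

    Assumption: a height below max_plausible_m was recorded in metres.
    """
    if value is None:
        return None
    if value < max_plausible_m:
        return round(value * 100, 1)
    return float(value)


# Made-up values, for illustration only.
for raw in [1.78, 178, None]:
    print(raw, "->", height_in_cm(raw))
```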
