Extracting minimum phenotype data per biobank
Auria Biobank
30.3.2022
Data source: TYKS Datalake (EHR)
Longitudinal data reported: no
Gender: male/female, extracted from the personal identity code
Age: years, at the time of sample collection; calculated using the date of birth (extracted from the personal identity code) and the date of sample collection
Height: cm, extracted from structured EHR data (“hoitotaulukko”)
Date(height): date of the height measurement
Weight: kg, extracted from structured EHR data (“hoitotaulukko”)
Date(weight): date of the weight measurement
Smoking status: current smoker / former smoker / never smoker / NA, extracted from medical reports by text mining algorithms
Date(smoking): date of the smoking status
Note: The text mining method for smoking status was developed by the data analysis team of Auria Biobank and is based on classification rules. Its accuracy has been shown to be about 80-90% among patients whose medical reports contain smoking information. About 35% of patients have no smoking information in their reports and receive the result NA.
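The gender and age derivations listed above for Auria Biobank can be sketched as follows. This is a minimal illustration, not the biobank's actual code: the function names are hypothetical, the check character is not validated, and only the classic century markers (+, -, A) of the personal identity code are handled.

```python
from datetime import date
from typing import Tuple

# Century markers of the Finnish personal identity code (format DDMMYYCZZZQ).
# Assumption: only the three classic markers are handled in this sketch.
CENTURY = {"+": 1800, "-": 1900, "A": 2000}


def parse_identity_code(code: str) -> Tuple[date, str]:
    """Return (date of birth, gender) read from a personal identity code."""
    day, month, year = int(code[0:2]), int(code[2:4]), int(code[4:6])
    birth = date(CENTURY[code[6]] + year, month, day)
    # The three-digit individual number is odd for males and even for females.
    gender = "male" if int(code[7:10]) % 2 == 1 else "female"
    return birth, gender


def age_at(birth: date, sample: date) -> int:
    """Completed years between the date of birth and the sample collection date."""
    before_birthday = (sample.month, sample.day) < (birth.month, birth.day)
    return sample.year - birth.year - int(before_birthday)


# Example with a commonly used fictional identity code.
birth, gender = parse_identity_code("131052-308T")
print(birth, gender, age_at(birth, date(2022, 3, 30)))
```

Age is counted in completed years, so a donor sampled before their birthday in a given year is one year younger than the difference of the calendar years.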
Biobank of Eastern Finland
18.3.2022
Data source: KYS Datalake (EHR), CORE consent management system
Longitudinal data reported: no
Gender: male/female, extracted from the personal identity code
Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection
Height: cm, extracted from structured EHR data
Date(height): date of the height measurement
Weight: kg, extracted from structured EHR data
Date(weight): date of the weight measurement
Smoking status: current smoker / former smoker / never smoker / NA, extracted from medical reports by text mining algorithms
Date(smoking): date of the smoking status
Note: -
FRC Blood Service Biobank
4.6.2022
Data source: Questionnaire in connection with the biobank consent
Longitudinal data reported: no
Gender: male/female, extracted from the personal identity code
Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection
Height: cm, self-reported
Date(height): date of the biobank consent
Weight: kg, self-reported
Date(weight): date of the biobank consent
Smoking status: regular smoker (years) / irregular smoker (years) / former smoker (years) / never smoker, self-reported
Date(smoking): date of the biobank consent
Note: -
Central Finland Biobank
18.3.2022
Data source: Central Finland Health Care District EHR databases
Longitudinal data reported: no
Gender: male/female, extracted from the personal identity code
Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection
Height: cm, extracted from structured EHR data
Date(height): date of the height measurement
Weight: kg, extracted from structured EHR data
Date(weight): date of the weight measurement
Smoking status: current smoker / previous smoker / non-smoker / never smoked / NA, extracted from structured EHR data
Date(smoking): date of the smoking status
Note: Height, weight, and smoking status are constructed from several health records, which contain this information in structured format. Height and weight information are retrieved for about 25% of sample donors. Smoking status is retrieved for about 50% of sample donors. Data mining is not currently used.
Finnish Clinical Biobank Tampere
17.3.2022
Data source: PSHP Datalake (EHR)
Longitudinal data reported: no
Gender: male/female, extracted from structured EHR data
Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection
Height: cm, extracted from structured EHR data
Date(height): date of the height measurement nearest to the sampling time
Weight: kg, extracted from structured EHR data
Date(weight): date of the weight measurement nearest to the sampling time
Smoking status: NA
Note: -
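Selecting the height or weight measurement nearest to the sampling time, as described above, can be sketched as follows. The record layout and the function name are hypothetical, and the sketch assumes that distance is measured in whole days and that measurements both before and after sampling are eligible; the source does not specify either point.

```python
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical record layout: (measurement date, measured value).
Measurement = Tuple[date, float]


def nearest_measurement(
    measurements: List[Measurement], sampling: date
) -> Optional[Measurement]:
    """Return the measurement whose date is closest to the sampling date."""
    if not measurements:
        return None  # no measurement available, i.e. NA
    # On a tie, min() keeps the measurement that appears first in the list.
    return min(measurements, key=lambda m: abs((m[0] - sampling).days))


heights = [
    (date(2019, 5, 2), 171.0),
    (date(2021, 11, 20), 170.5),
    (date(2023, 1, 15), 170.0),
]
print(nearest_measurement(heights, date(2022, 3, 17)))
```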
Helsinki Biobank
11.3.2022
Data source: HUS Datalake (EHR)
Longitudinal data reported: no
Gender: male/female, extracted from the personal identity code
Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection
Height: cm, extracted from structured EHR data (for DF11 and beyond also extracted from medical reports by text mining algorithms)
Date(height): date of the height measurement
Weight: kg, extracted from structured EHR data (for DF11 and beyond also extracted from medical reports by text mining algorithms)
Date(weight): date of the weight measurement
Smoking status: current smoker / former smoker / never smoker / NA, extracted from medical reports by text mining algorithms
Date(smoking): date of the smoking status
Note: Text mining of the smoking status was based on FinBERT (https://github.com/TurkuNLP/FinBERT), a Finnish version of Google's BERT deep transfer learning model. The smoking status classifier was evaluated on a set of 947 patients; the accuracy and the overall F-score were both 94.5%. The class-specific F-scores for smoker, ex-smoker, non-smoker, and NA were 92.2%, 91.8%, 98.0%, and 73.7%, respectively. No patient in the evaluation set was also in the development set.
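To make the reported metrics concrete, the sketch below computes accuracy, class-specific F-scores, and a micro-averaged F-score from true and predicted labels. The labels are toy data, not the evaluation set of 947 patients, and the function names are hypothetical. The source does not say how the overall F-score was averaged; micro-averaging is an assumption here, chosen because it always equals accuracy for single-label classification, which would be consistent with both figures being 94.5%.

```python
from collections import Counter
from typing import Dict, List


def accuracy(true: List[str], pred: List[str]) -> float:
    """Share of patients whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def class_f_scores(true: List[str], pred: List[str]) -> Dict[str, float]:
    """F-score per class: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true, pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(true) | set(pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def micro_f_score(true: List[str], pred: List[str]) -> float:
    """Micro-averaged F-score: TP, FP and FN are pooled over all classes."""
    correct = sum(t == p for t, p in zip(true, pred))
    errors = len(true) - correct  # each error is one FP and one FN
    return 2 * correct / (2 * correct + 2 * errors)


# Toy labels for illustration only.
y_true = ["smoker", "smoker", "ex-smoker", "non-smoker", "non-smoker", "NA"]
y_pred = ["smoker", "ex-smoker", "ex-smoker", "non-smoker", "non-smoker", "NA"]
print(accuracy(y_true, y_pred), micro_f_score(y_true, y_pred))
print(class_f_scores(y_true, y_pred))
```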
Northern Finland Biobank Borealis
25.3.2022
Data source: BC Platforms, Esko Systems
Longitudinal data reported: no
Gender: male/female, extracted from the personal identity code
Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection (BC Platforms)
Height: cm, extracted manually from the Esko patient information system.
Date(height): measurement closest to the blood sampling time; exact dates not recorded in the minimum data set file.
Weight: kg, extracted manually from the Esko patient information system.
Date(weight): measurement closest to the blood sampling time; exact dates not recorded in the minimum data set file.
Smoking status: current smoker / former smoker / never smoker / NA, extracted manually from the Esko patient information system.
Date(smoking): most recent recording in the EHR or recording closest to the blood sampling time (if available); exact dates not recorded in the minimum data set file.
Note: Data must be extracted from the Esko patient information system manually, one person at a time, which is very laborious. Borealis does not currently have access to patient information in a structured format that could be extracted from a database.
Terveystalo Biobank
30.3.2022
Data source: Terveystalo Datalake (EHR)
Longitudinal data reported: no
Gender: male/female, extracted from the personal identity code
Age: years, at the time of sample collection; calculated using the date of birth and the date of sample collection
Height: cm, extracted from structured EHR data
Date(height): date of the height measurement
Weight: kg, extracted from structured EHR data
Date(weight): date of the weight measurement
Smoking status: current smoker / non-smoker / NA, extracted from structured EHR data
Date(smoking): date of the smoking status
Note: -
THL Biobank
14.4.2022
Data source: THL Biobank phenotype database PhenoWeb (and biobank cohort data files extracted from cohort databases)
Longitudinal data reported: no
Gender: male/female, mainly extracted from biobank cohort data; for a subset, extracted from the personal identity code.
Age: years, at the time of sample collection; either directly obtained from the biobank cohort data or calculated using the date of birth and the date of sample collection
Height: cm, extracted from biobank database or cohort data files. Transformed from m into cm if needed.
Date(height): date of the height measurement. Reported only for those cohorts where the sampling date did not match the height measurement date; otherwise the same as the sampling date.
Weight: kg, extracted from biobank database or cohort data files.
Date(weight): date of the weight measurement. Reported only for those cohorts where the sampling date did not match the weight measurement date; otherwise the same as the sampling date.
Smoking status: Smoking-related attributes were extracted from the biobank database and/or cohort data files and released for the project. Smoking data were delivered to the extent that they had been transferred to THL Biobank for each cohort. Attributes differ between cohorts, and some cohorts include no smoking attributes at all. The broad smoking data available from THL Biobank were then harmonized and used in FinnGen to cover smoking status in more detail.
Date(smoking): date of the smoking data collection. Reported only for those cohorts where the sampling date did not match the smoking data collection date; otherwise the same as the sampling date.
Note: The minimum datasets were extracted one cohort at a time, and the protocol varied slightly depending on the cohort at hand.
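Two of the THL Biobank rules above, the conversion of height from m to cm and the fallback of the measurement date to the sampling date, can be sketched as follows. The function names and the explicit unit argument are hypothetical; the source does not describe how the unit was determined for each cohort.

```python
from datetime import date
from typing import Optional


def height_in_cm(value: float, unit: str) -> float:
    """Return a height in cm, converting from metres when needed."""
    if unit == "cm":
        return value
    if unit == "m":
        # Round to one decimal to avoid floating-point noise such as 171.99999.
        return round(value * 100, 1)
    raise ValueError(f"unexpected height unit: {unit!r}")


def effective_date(measured: Optional[date], sampled: date) -> date:
    """Use the measurement date when reported, else the sampling date."""
    return measured if measured is not None else sampled


print(height_in_cm(1.72, "m"))
print(effective_date(None, date(2012, 4, 14)))
```

Passing the unit explicitly, rather than guessing it from the magnitude of the value, keeps the conversion auditable for each cohort.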