FAQ
Explanations for common questions about the Kanta lab values dataset
Last updated
The data covers the years 2014 to 2023. However, not all lab data providers sent their data right from 2014. Most of the data comes online starting in 2018.
The Kanta Lab data was requested for all FinnGen participants. However, the Kanta Lab data spans from 2014 to 2023 so only FinnGen participants alive during this period and having lab records in Kanta are in the dataset.
In total, there are 482k FinnGen participants with data in the Kanta Lab dataset and an average of 482 tests/individual.
Even if some rows are missing both the measurement value and the test outcome, we decided to retain these as there might still be information in knowing the test was ordered. There are a few reasons we are aware of that these can occur:
The test abbreviation refers to a panel of tests, such as a blood count/CBC panel. In this situation, the abbreviation lets you know the whole panel was run. Still, specific measurements and outcomes such as (H)igh, (L)ow or (A)bnormal would only be available for specific tests on red blood cells, white blood cells, hemoglobin, etc.
Reference range is a free text string that depends on:
age of the individual
sex of the individual
chemistry/detection method used by the particular lab
units of the test
date of the test (ranges may change over time or due to updated public health guidelines for tests such as cholesterol)
Unfortunately, only 1/3 of the events include a reference range specific to these combined values. Because so many tests lack this information, we have created an imputed range for each OMOP ID. We have computed these based on events where the MEASUREMENT_VALUE, MEASUREMENT_UNIT and TEST_OUTCOME are present for the event. Please see below for more about this.
APPROX_EVENT_DATETIME in the data?
The most common times for a test are 7:00 a.m. and 7:01 a.m. This is likely the time at which the tests were ordered, while the tests themselves may have been carried out throughout the morning. Other test times should be a more reliable indicator of when the test was actually taken.
While the date component of the APPROX_EVENT_DATETIME column is randomized for privacy blurring (stable number of days per FINNGENID), the hour:minutes component comes from the raw data and is not randomized.
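The following is a minimal sketch of how the time component could be inspected. It assumes that APPROX_EVENT_DATETIME is available as a string in the form "YYYY-MM-DD HH:MM:SS" (the actual export format may differ) and that the privacy blurring is a constant shift in days per FINNGENID, so that intervals between two events of the same individual are preserved. The function names are hypothetical.

```python
from datetime import datetime

# Assumed string format of APPROX_EVENT_DATETIME; adjust to the actual export.
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Times that are likely order times rather than sampling times.
LIKELY_ORDER_TIMES = {"07:00", "07:01"}


def time_of_day(approx_event_datetime):
    """Return the HH:MM component, which is not randomized."""
    parsed = datetime.strptime(approx_event_datetime, DATETIME_FORMAT)
    return parsed.strftime("%H:%M")


def is_likely_order_time(approx_event_datetime):
    """Flag events whose time is probably when the test was ordered."""
    return time_of_day(approx_event_datetime) in LIKELY_ORDER_TIMES


def days_between(first, second):
    """Days between two events of the SAME individual.

    Under the assumption of a constant per-individual shift, the shift
    cancels out in the difference.
    """
    first_date = datetime.strptime(first, DATETIME_FORMAT).date()
    second_date = datetime.strptime(second, DATETIME_FORMAT).date()
    return (second_date - first_date).days
```

Comparing dates across different individuals would not be meaningful under this assumption, because each individual has a different shift.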
No. The original data comes from different sources (different lab centers, different IT systems, etc.) and has undergone several data processing stages before reaching FinnGen. So, unfortunately, by the time the raw data reaches FinnGen we do not know what the time information relates to, and it is most likely inconsistent depending on the original data source.
TEST_OUTCOME and TEST_OUTCOME_IMPUTED values?
TEST_OUTCOME values may be recorded as:
(N)ormal
(L) & (LL) for low and very low
(H) & (HH) for high and very high
(A) & (AA) for abnormal and very abnormal. Note that if a test has a normal range of 10–20, (A)bnormal can be marked for results <10 as well as >20.
(NA) when the outcome is missing. In our QC of the data, most of the measured values in rows with an NA outcome were in the normal range; however, there is no guarantee of that.
TEST_OUTCOME_IMPUTED
Unfortunately, the raw data did not include defined reference ranges for all the lab tests. To approximate the ranges, we looked at tests that have the trio of measurement value, measurement unit and outcome, and approximated a low and a high value for each type of lab test. We then applied these "imputed" low and high values to score tests which did not have an outcome.
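To illustrate the idea, here is a rough sketch of one possible way to impute a range and score unlabeled events. It is not the procedure FinnGen actually used: it assumes, purely for illustration, that the low and high bounds are the minimum and maximum of the values labeled N for a given OMOP_CONCEPT_ID and unit. The function names and the dictionary-based row layout are hypothetical.

```python
from collections import defaultdict


def impute_ranges(events):
    """Approximate (low, high) per (OMOP_CONCEPT_ID, unit).

    Illustrative assumption: the bounds are the min and max of values
    whose TEST_OUTCOME is "N". Only events with value, unit and outcome
    all present are used.
    """
    normal_values = defaultdict(list)
    for event in events:
        value = event["MEASUREMENT_VALUE"]
        unit = event["MEASUREMENT_UNIT"]
        if event["TEST_OUTCOME"] == "N" and value is not None and unit:
            normal_values[(event["OMOP_CONCEPT_ID"], unit)].append(value)
    return {key: (min(vals), max(vals)) for key, vals in normal_values.items()}


def impute_outcome(event, ranges):
    """Return the recorded outcome, or score the value against the range."""
    if event["TEST_OUTCOME"] is not None:
        return event["TEST_OUTCOME"]
    key = (event["OMOP_CONCEPT_ID"], event["MEASUREMENT_UNIT"])
    value = event["MEASUREMENT_VALUE"]
    if key not in ranges or value is None:
        return None  # nothing to score against
    low, high = ranges[key]
    if value < low:
        return "L"
    if value > high:
        return "H"
    return "N"
```

A more robust variant would likely use percentiles rather than the raw minimum and maximum, since single mislabeled events can stretch the range.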
Positive and negative are not concepts that exist in the raw data. In most cases you would want to look for N (Normal) vs. A (abnormal), AA (very abnormal) or H (High), HH (Very high) as analogous to negative vs. positive.
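As a small sketch of this mapping, the function below treats N as negative and A, AA, H and HH as positive. How to handle L, LL and missing outcomes is not specified above, so this sketch returns None for them; that choice, like the function name, is an assumption.

```python
# Outcomes treated as analogous to "positive".
POSITIVE_OUTCOMES = {"A", "AA", "H", "HH"}


def outcome_to_binary(test_outcome):
    """Map a TEST_OUTCOME code to 'negative', 'positive', or None."""
    if test_outcome == "N":
        return "negative"
    if test_outcome in POSITIVE_OUTCOMES:
        return "positive"
    # L, LL and missing outcomes are left undecided in this sketch.
    return None
```

Whether low results should count as positive depends on the test in question, so it is worth reviewing this per analysis.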
Vierimittaus refers to point-of-care testing (POCT), also called near-patient testing or bedside testing. It describes how the sample was taken: at the laboratory, at the bedside, or by the patients themselves.
The mapping will be updated for the DF13 release in February 2025. However, we do not expect to add any additional lab results until DF14 in February 2026.
MEASUREMENT_VALUE or MEASUREMENT_VALUE_HARMONIZED for my analysis?
MEASUREMENT_VALUE_HARMONIZED is the best one to use. Different labs may deliver results for the same type of test with different units. When we map lab results from different labs, we harmonize the units to the OMOP standard, and we apply a conversion factor so they are all expressed in the same units, which generates MEASUREMENT_VALUE_HARMONIZED. MEASUREMENT_VALUE is provided as a safeguard: if there are any problems with the mapping or the conversion factor, you still have access to the raw value.
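The relationship between the two columns can be sketched as a raw value multiplied by a unit conversion factor. The conversion table below is a hypothetical stand-in, not the mapping FinnGen uses; the only factor it relies on is that 1 g/dL equals 10 g/L.

```python
# Hypothetical conversion table: (source unit, target unit) -> factor.
CONVERSION_FACTORS = {
    ("g/dl", "g/l"): 10.0,  # 1 g/dL = 10 g/L
    ("g/l", "g/l"): 1.0,    # already in the target unit
}


def harmonize(value, source_unit, target_unit):
    """Convert a raw value to the target unit, or return None if unknown."""
    factor = CONVERSION_FACTORS.get((source_unit.lower(), target_unit.lower()))
    if value is None or factor is None:
        # No known conversion: fall back to the raw MEASUREMENT_VALUE.
        return None
    return value * factor
```

For example, a hemoglobin result of 13.5 g/dL and one of 135 g/L would both end up as 135 g/L, which is what makes values from different labs comparable.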
No. Unfortunately we do not know which test center performed the test in the data that FinnGen received, so this information is also not currently available for FinnGen users.
FINNGENID, APPROX_EVENT_DATETIME, OMOP_CONCEPT_ID and MEASUREMENT_VALUE?
We have preprocessed the Kanta lab data to make it as clean and usable as possible, but some oddities remain. We identified that ~0.7% of the rows in the Kanta lab dataset are duplicates by FINNGENID, APPROX_EVENT_DATETIME, OMOP_CONCEPT_ID and MEASUREMENT_VALUE.
We found that one reason for this duplication is that in the raw data the same record appears once with test ID referring to a local lab test code system, and then once more with its test ID referring to the national lab test code system.
We chose to keep these rows as it is not clear which of the duplicate rows to keep, and we have not yet done a systematic investigation on the origins of the row duplication.
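If these duplicates matter for your analysis, they can be flagged or collapsed on the four columns named above. The sketch below assumes rows are available as dictionaries keyed by column name; the function names are hypothetical, and keeping the first occurrence is an arbitrary choice.

```python
from collections import Counter

KEY_COLUMNS = (
    "FINNGENID",
    "APPROX_EVENT_DATETIME",
    "OMOP_CONCEPT_ID",
    "MEASUREMENT_VALUE",
)


def row_key(row):
    """Build the tuple of values that defines a duplicate."""
    return tuple(row[column] for column in KEY_COLUMNS)


def flag_duplicates(rows):
    """Return one boolean per row: True if its key occurs more than once."""
    counts = Counter(row_key(row) for row in rows)
    return [counts[row_key(row)] > 1 for row in rows]


def keep_first_occurrence(rows):
    """Collapse duplicates, keeping the first row seen for each key."""
    seen = set()
    kept = []
    for row in rows:
        key = row_key(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```

Flagging rather than dropping keeps the option of inspecting the other columns of the duplicated rows, such as the test ID, before deciding which one to keep.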
A common unit you will see is e9. In blood cell counts, "e9/L" means "10^9 per liter" or "billion per liter". For example, a white blood cell count of 5.2 e9/L means 5.2 billion cells per liter. This unit is part of the International System of Units (SI) and is widely used in many countries for standardizing laboratory results. It is equivalent to the older unit "G/L" (giga per liter) or "10^9/L". Similarly, e12/L would mean "10^12 per liter" or "trillion per liter".
The full list of units can be found here - https://github.com/FINNGEN/kanta_lab_harmonisation_public/blob/main/MAPPING_TABLES/UNITSfi.usagi.csv
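As a short illustration of the exponent notation, the sketch below reads the power of ten from a unit string such as "e9/L". It only handles units of the exact form "e<digits>/<denominator>", which is an assumption; other units in the list would need their own handling. The function name is hypothetical.

```python
import re

# Matches units such as "e9/L" or "e12/l": exponent, then denominator.
EXPONENT_UNIT = re.compile(r"^e(\d+)/(\w+)$")


def unit_multiplier(unit):
    """Return the power-of-ten multiplier encoded in an 'eN/...' unit."""
    match = EXPONENT_UNIT.match(unit.strip())
    if match is None:
        raise ValueError(f"not an exponent-style unit: {unit!r}")
    return 10 ** int(match.group(1))


# A white blood cell count of 5.2 e9/L as an absolute count per liter.
cells_per_liter = 5.2 * unit_multiplier("e9/L")
```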
Actually, there are a number of lab tests that are expected to have negative values! Checking for negative values is something that our OHDSI Achilles QC testing suite examines and any negative values you see in the data have been checked and should be realistic. Here are some tests that are expected to have negative values:
Calcium Balance: calcium balance studies can show negative values when there's more calcium excretion than intake.
Acid-Base Balance: Tests like Base Excess (BE) can be negative, indicating metabolic acidosis.
Iron Balance Studies: Like calcium, iron balance can be negative if more iron is lost than absorbed.
Anion Gap: While usually positive, it can be negative in rare cases like bromide intoxication.
The data is originally extracted from Kanta and sent to THL for pseudonymization. THL then sends the pseudonymized data to FinnGen. At this stage, FinnGen does a preprocessing step that includes row deduplication, column subsetting, data cleaning, harmonization with OMOP, and more. Extensive documentation about the preprocessing of the Kanta lab data by FinnGen can be found here: https://github.com/FINNGEN/kanta_lab_preprocessing/tree/master.