How-to guides

How to use Athena to find a test by name?

  1. Go to OHDSI Athena using this link: Athena with lab test filters.

  2. Make sure to have lab test filters enabled:

    1. Domain: Measurement

    2. Concept: Standard

    3. Class: Lab test

    4. Vocabulary: LOINC

  3. Use the search box at the top of the page to look up a test name, for example: "testosterone"

  4. Find the most relevant OMOP Concept IDs from the search results. The OMOP Concept IDs are in the left-most column, ID.

  5. You can now look up these OMOP Concept IDs in the FinnGen Kanta lab values data.

How to load the data file using R?

  1. Start your Sandbox IVM with the machine type that has 128 GB of memory.

  2. Copy the Parquet release file to your home folder.

  3. In the terminal, load R with lmod: module load tidyverse/4.3.1

  4. Start R and then use the pre-installed arrow package to load the data file:

library(arrow)
read_parquet("finngen_R12_kanta_lab_1.0.parquet")

How to load the full dataset using Python pandas?

  1. Start your Sandbox IVM with the machine type that has 128 GB of memory.

  2. Copy the Parquet release file to your home folder.

  3. In the terminal, load Python with lmod: module load fg-python/3.11.7

  4. Start python and then use pandas to load the data file:

import pandas as pd
df = pd.read_parquet("finngen_R12_kanta_lab_1.0.parquet")
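Once the file is loaded, a quick look at its size and first rows is a useful sanity check. This is a minimal sketch using only generic pandas calls; the exact columns you see depend on the release schema:

# Inspect the loaded data: dimensions, column names, and a preview
print(df.shape)
print(df.columns.tolist())
print(df.head())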

How to load the dataset for only one OMOP Concept ID using Python pandas?

The Parquet file format allows for efficient data filtering, and Python pandas can take advantage of this:

  1. Start your Sandbox IVM with a machine type that has at least 16 GB of memory.

  2. Copy the Parquet release file to your home folder.

  3. In the terminal, load Python with lmod: module load fg-python/3.11.7

  4. Start python and then use pandas to load the data file:

import pandas as pd
df_omop_id = pd.read_parquet(
    "finngen_R12_kanta_lab_1.0.parquet",
    filters=[("OMOP_CONCEPT_ID", "==", "<your-omop-concept-id>")],
)
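If you need several related tests at once, for example all the OMOP Concept IDs you found in Athena for a test, the same filters argument also accepts an in condition. A minimal sketch, with placeholder concept IDs:

import pandas as pd

# Filter on several OMOP Concept IDs in a single read; replace the
# placeholders with the concept IDs you found in Athena.
omop_ids = ["<omop-concept-id-1>", "<omop-concept-id-2>"]
df_omop_ids = pd.read_parquet(
    "finngen_R12_kanta_lab_1.0.parquet",
    filters=[("OMOP_CONCEPT_ID", "in", omop_ids)],
)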

How to access the data using BigQuery SQL?

The Kanta lab values data is available in BigQuery in the table finngen-production-library.sandbox_tools_r12.kanta_r12_v1.

Here is how to access it using Python, though it is also possible using R.

  1. Start your Sandbox IVM.

  2. In the terminal, load Python with lmod: module load fg-python/3.11.7

  3. Start python and use the following code to print the first 10 lines of the data:

from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT FINNGENID, APPROX_EVENT_DATETIME, TEST_NAME, OMOP_CONCEPT_ID
    FROM `finngen-production-library.sandbox_tools_r12.kanta_r12_v1`
    LIMIT 10;
"""

for row in client.query(query):
    print(row)

You can change the above query to select different columns and add different conditions to extract the data.
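For example, to restrict the query to a single OMOP Concept ID and work with the result as a pandas DataFrame, something along these lines should work. The concept ID is a placeholder, and whether OMOP_CONCEPT_ID is stored as a string or a number may differ from what is assumed here:

# Restrict the query to one OMOP Concept ID and load the result into pandas.
query = """
    SELECT FINNGENID, APPROX_EVENT_DATETIME, TEST_NAME, TEST_ID
    FROM `finngen-production-library.sandbox_tools_r12.kanta_r12_v1`
    WHERE OMOP_CONCEPT_ID = '<your-omop-concept-id>'
"""
df = client.query(query).to_dataframe()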

How to look up what data is available for a test without using the Sandbox?

FinnGen provides a public tool to easily look up a test without requiring Sandbox access. Results with a low individual count are not shown; for those you need to query the data directly in the Sandbox.

  1. At the top of the page, select the tab "Mapping status per test+unit combination".

  2. Use the filters to look up your test:

    1. If you have the short code name for the test, use the text field for the "TestCode [unit]" column.

    2. If you have a descriptive name, use the text field for the "Omop Name" column.

  3. Each row in the search results corresponds to the mapping of a TestCode and a unit to an OMOP Concept ID. Each row also provides statistics such as the number of events, test outcome percentages, measured value percentiles, and the percentage of missing values.

How to look up a test in the data from the Sandbox?

As an example, let's take the test ID 23133 from HUSLAB, which stands for the B -PGx-D test.

We will use BigQuery with Python in this guide, but the same can be done using BigQuery with R, the Parquet file, or the gzipped TSV file.

Looking up the test by TEST_ID

  1. Start your Sandbox IVM.

  2. In the terminal, load Python with lmod: module load fg-python/3.11.7

  3. Start python and set up the BigQuery connection:

from google.cloud import bigquery
client = bigquery.Client()

  4. Then run this query to see how many rows and people there are in the data, split by the different TEST_NAMEs associated with our TEST_ID of interest, 23133:

client.query(
    """
    SELECT
        TEST_ID,
        TEST_ID_IS_NATIONAL,
        TEST_NAME,
        OMOP_CONCEPT_ID, 
        COUNT(*) NRows,
        COUNT(DISTINCT(FINNGENID)) NPeople
    FROM `finngen-production-library.sandbox_tools_r12.kanta_r12_v1`
    WHERE TEST_ID = '23133'
    GROUP BY ALL
    """
).to_dataframe()

This prints a table where we see that there are ~1.2k rows for ~1.2k people with the test name b-pgx-d and the test ID 23133.

Extending the lookup using the TEST_NAME from the result

  1. Start your Sandbox IVM.

  2. In the terminal, load Python with lmod: module load fg-python/3.11.7

  3. Start python and set up the BigQuery connection:

from google.cloud import bigquery
client = bigquery.Client()

  4. Then make this query to see how many rows and people there are in the data for the TEST_NAMEs containing b-pgx-d:

client.query(
    """
    SELECT
        TEST_NAME,
        TEST_ID,
        TEST_ID_IS_NATIONAL,
        OMOP_CONCEPT_ID,
        COUNT(*) NRows,
        COUNT(DISTINCT(FINNGENID)) NPeople
    FROM `finngen-production-library.sandbox_tools_r12.kanta_r12_v1`
    WHERE CONTAINS_SUBSTR(TEST_NAME, 'b-pgx-d')
    GROUP BY ALL
    """
).to_dataframe()

This prints a table showing that there are ~1.2k rows with TEST_NAME = 'b-pgx-d' and TEST_ID = '23133', and ~200 rows with TEST_NAME = 'b-pgx-d' and TEST_ID = '54976'.
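From these counts you can go on to pull the underlying measurement rows for both TEST_IDs with the same connection. This is a minimal sketch; the selected columns are only the ones used earlier in this guide, not the full schema:

# Fetch the individual measurement rows for both TEST_IDs found above.
df_pgx = client.query(
    """
    SELECT FINNGENID, APPROX_EVENT_DATETIME, TEST_NAME, TEST_ID, OMOP_CONCEPT_ID
    FROM `finngen-production-library.sandbox_tools_r12.kanta_r12_v1`
    WHERE TEST_ID IN ('23133', '54976')
    """
).to_dataframe()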
