BigQuery Python - Case Study - Patient Timeline - Scatter plot

In this we detail a scenario for how you can plot comorbidities of a FinnGen endpoint.

Location of the script

/finngen/library-green/scripts/code_snippets/codeSnippet_patientTimeline.py

You can copy paste from below explanation or take the code directly from the file itself.

Patient timeline in FinnGen from starting event recorded to the last/latest event.

The query is very simple as to extract whole information of a patient from the detailed longitudinal table. However, the detailed longitudinal table contains ICD, drug, reimbursement and other codes but no information what that code means. To get the proper mapping of what the overall theme of a each event means, we developed a code mapper SQL script with details here. The same SQL script is also exposed as JAVASCRIPT function codeProcess within Sandbox. Also, the selection of a FINNGENID is randomized just for this case study purpose.

# 
import os, sys
from google.cloud import bigquery
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Connect to client
client = bigquery.Client()

# Query
query = """  WITH temp AS (
                     SELECT ssdl.*,
                            # The function sandbox.codeProcess returns processed CODE1, CODE2 and CODE3 along with vocabulary_id
                            sandbox.codeProcess(TO_JSON_STRING(ssdl), 'CODE1_CODE2', 'MORPO_BEH_TOPO', 'REIMB', 'ATC', 5, 5, 5, 7, 5).*
                     FROM `finngen-production-library.sandbox_tools_r10.finngen_r10_service_sector_detailed_longitudinal_v1` AS ssdl
                     # Here we are randomly selecting a FinnGen patient FINNGENID. You can replace this with a particular FINNGENID
                     WHERE FINNGENID = (
                                        SELECT DISTINCT FINNGENID
                                        FROM `finngen-production-library.sandbox_tools_r10.finngen_r10_service_sector_detailed_longitudinal_v1` 
                                        WHERE RAND() < 10/(SELECT COUNT(*) FROM `finngen-production-library.sandbox_tools_r10.finngen_r10_service_sector_detailed_longitudinal_v1`) 
                                        LIMIT 1
                                       )
                     # The results of patient are ordered by the date
                     ORDER BY ssdl.APPROX_EVENT_DAY, ssdl.SOURCE
              )
              # The processed codes are mapped to English names
              SELECT t.*,fgc.concept_class_id,fgc.name_en,fgc.name_fi
              FROM temp AS t
              LEFT JOIN `finngen-production-library.medical_codes.fg_codes_info_v1` as fgc
              ON t.vocabulary_id = fgc.vocabulary_id AND
                 t.FG_CODE1 IS NOT DISTINCT FROM fgc.FG_CODE1 AND
                 t.FG_CODE2 IS NOT DISTINCT FROM fgc.FG_CODE2 AND
                 t.FG_CODE3 IS NOT DISTINCT FROM fgc.FG_CODE3
	 """

# Job configuration
job_config = bigquery.QueryJobConfig()

# Run the queries
query_result = client.query(query,job_config=job_config)

Save the query result to dataframe

columnNames = list(list(query_result)[0].keys())
temporaryList = []
for row in query_result:
      temporaryList.append(list(row))
# Convert the temporaryList to dataframe using pandas package
query_result_dataframe = pd. DataFrame(temporaryList, columns = columnNames)
# Change the data type of date column to datetime format
query_result_dataframe['APPROX_EVENT_DAY'] = pd.to_datetime(query_result_dataframe['APPROX_EVENT_DAY'])

Use the patient information stored in dataframe to plot a patient timeline

g1 = sns.scatterplot(x="APPROX_EVENT_DAY",y="CODE1",data=query_result_dataframe,hue="vocabulary_id")
g1.set(yticklabels=[])
g1.set(ylabel=None)
g1.tick_params(left=False)
g1.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.savefig('/home/ivm/patientTimeline.png')

Last updated