BigQuery Python - Downstream analysis - Active Ingredient - Bar plot

We introduced details on connection to BigQuery using python packages here. After the connection there are many different ways you can use the query results. In this we detail a use case scenario for how you can query for data from BigQuery tables and use it in your analysis.

Location of the script

/finngen/library-green/scripts/code_snippets/codeSnippet_activeIngredient.py

You can copy paste from below explanation or take the code directly from the file itself.

As example, you want to know what are the active ingredients of top drug purchases in the year 1995 and do a simple bar plot to showcase the results. The following logic is as follows

  1. We want to know what each individual drug purchase was in year 1995 which is found in table finngen_r10_service_sector_detailed_longitudinal in the dataset sandbox_tools_r10 in the project finngen-production-library.

  2. The drug information can be found in the column CODE3 which is the VNR code.

  3. We can get VNR code information from table finngen-vnr in the dataset medical_codes in the project finngen-production-library.

  4. We can extract active ingredient and see the how many purchases were made for each active ingredient which is the column Substance.

  5. We will plot the top 10 active ingredients

You can extract the information using the below query

# Import packages
import os, sys
from google.cloud import bigquery
from matplotlib import pyplot as plt

# Connect to client
client = bigquery.Client()

# Run the query
query = """ SELECT FVNR.Substance AS ActiveIngredient,
			 COUNT(*) AS N_PURCHASES
            FROM `finngen-production-library.sandbox_tools_r10.finngen_r10_service_sector_detailed_longitudinal_v1` AS FDL
            JOIN `finngen-production-library.medical_codes.finngen_vnr_v1` as FVNR
            ON FDL.CODE3 = FVNR.VNR
            WHERE FDL.SOURCE='PURCH' AND 
                  EXTRACT(YEAR from FDL.APPROX_EVENT_DAY) = 1995 AND 
                  FVNR.Substance IS NOT NULL AND 
                  FDL.CODE3 IS NOT NULL
            GROUP BY ActiveIngredient
            ORDER BY N_PURCHASES DESC              
            LIMIT 10
	"""
# Job configuration
job_config = bigquery.QueryJobConfig()
# Run the query
query_result = client.query(query,job_config=job_config)

You can print the results before plotting them

# Print the results
#  Print the column names
print('{}'.format(list(list(query_result)[0].keys())))
#  Print the rows
for row in query_result:
	print(list(row))

You can copy paste the following code to plot the top 10 active ingredients

# Import pyplot from matplotlib for bar plot
from matplotlib import pyplot as plt

# Save top 10 active ingredient names into list
# Save purchase count of each active ingredient into list
ingredientNames = []
purchases = []
for row in query_result:
	ingredientNames.append(row['ActiveIngredient'])
	purchases.append(row['N_PURCHASES'])
# Initiate the figure
plt.figure(figsize=(16,10)) # Figure size
plt.bar(range(len(ingredientNames)),purchases)
# Lables of xticks which will be the ingredient names
plt.xticks(range(len(ingredientNames)),labels = ingredientNames, rotation = -40)
# You can also save the plot
plt.savefig('/home/ivm/drugPurchases_1995.png')

You can also run the query and do bar plot of top 10 active ingredients in the anaconda python environment with much more ease because of pandas_gbq package.

# Pandas gbq package to convert the query results into pandas dataframe
import pandas_gbq 

# Another approach of directly loading results of a query to dataframe
query_result_dataframe = query_result.to_dataframe()
# Bar plot
query_result_dataframe.plot.bar(x='ActiveIngredient',y='N_PURCHASES',rot=0)

Last updated