BigQuery (relational database)

Introduction

BigQuery is a database technology for storing and analyzing large and complex datasets, such as the longitudinal data in FinnGen. It lets users query a subset of the data with SQL, which is much faster than reading in the entire dataset and filtering it afterwards. Query results can be fed directly into downstream analyses.

Data

The finngen-production-library project/database contains phenotype data for the different releases of FinnGen data. The data in this project/database comes from the LIBRARY_RED and LIBRARY_GREEN folders within the Sandbox IVM.

| BigQuery Project/Database | BigQuery Dataset/Schema | BigQuery Table Name | BigQuery Table Details | Sandbox IVM folder | Release Version |
|---|---|---|---|---|---|
| finngen-production-library | sandbox_tools_r12 | finngen_r12_service_sector_detailed_longitudinal_v1 |  | LIBRARY_RED | DF12 |
|  |  | endpoint_cohorts_r12_v1 | Endpoint cohorts |  |  |
|  |  | code_counts_r12_v1 | Code Counts |  |  |
|  |  | minimum_extended_r12_v1 |  |  |  |
|  | medical_codes | finngen_vnr_v1 |  | LIBRARY_GREEN | INDEPENDENT |
|  |  | fg_codes_info_v5 |  |  |  |
|  | finngen_omop_r12 |  |  |  | DF11 |
|  | finngen_results_r12 |  | Achilles tables |  | DF11 |

Superset

Superset is a web tool for exploring the FinnGen data stored in the BigQuery database. Users can run SQL queries on the data and view the results as visualizations. Superset is accessed from the IVM Applications menu (Applications > FinnGen > Superset).

In the Explore tab, users can build different visualizations of the data through the graphical interface. In the SQL Lab tab, SQL queries can be run directly against the data.

Take a look at how to export FINNGENIDs from Atlas using Superset.

How to save results

It is possible to save results into a Sandbox-specific BQ database named “Sandbox” (see the available databases in the Superset Database tab). Users can import small custom tables into the “Sandbox” BQ database and export result tables to the Sandbox IVM as CSV files.
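As a minimal sketch of the CSV export step, the snippet below writes rows to a CSV file using only the Python standard library. The rows and file name here are illustrative stand-ins; in practice the rows would come from a BigQuery query result.

```python
import csv

def save_rows_as_csv(rows, path, header=("FINNGENID",)):
    # Write a header row followed by the data rows.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Illustrative example: two fake IDs saved to a CSV file.
save_rows_as_csv([("FG00000001",), ("FG00000002",)], "cohort_ids.csv")
```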

Superset demo

See the Superset tutorial video from the FinnGen Users' Meeting of 22 September 2020 (at 50 min).

More information:

Apache Superset

bq command-line tool

It is also possible to access the BigQuery tables with the bq command-line tool, a Python-based command-line tool that can be used to run queries.

Before running a query, users can check how much it would cost by performing a dry run with the bq command-line tool:

```shell
bq query --dry_run 'SELECT ...'
```

The dry run does not execute the query; instead it reports how much data the query would process. In general, queries are billed on-demand at an estimated $6.25 per tebibyte (TiB) processed. More details on the pricing structure can be found here.
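The bytes-processed figure from a dry run can be turned into a cost estimate with a little arithmetic. The helper below is a hypothetical sketch, using the $6.25 / TiB on-demand rate quoted above (1 TiB = 2^40 bytes):

```python
def estimate_cost_usd(bytes_processed, usd_per_tib=6.25):
    # Convert bytes to tebibytes, then apply the on-demand rate.
    tib = bytes_processed / 2**40
    return tib * usd_per_tib

# Example: a query scanning 50 GiB.
print(round(estimate_cost_usd(50 * 2**30), 4))  # → 0.3052
```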

Google Cloud BigQuery Python and R drivers

It is also possible to use the Google Cloud BigQuery Python and R drivers to access the data directly from the IVM.

See a tutorial video on how to query BigQuery using Python and R scripts in the Users' Meeting recordings (at 30 min 2 s).

Use cases

How to connect to BigQuery in R and Python, along with some use cases for downstream analysis:

Example Python script:

```python
from google.cloud import bigquery

def examplequery():
    # The client runs queries in the user's own Sandbox project;
    # by default it matches your SB environment.
    # See the Sandbox number in the Sandbox IVM desktop buckets.txt file.
    client = bigquery.Client()

    # Tables must be referenced in the format project.dataset.table
    # (backquoted, since the project name contains hyphens).
    # Note that cohort 5 must first be generated in Atlas.
    query_job = client.query(
        """
        SELECT person_source_value
        FROM `finngen-production-library.finngen_omop_result.cohort`
        LEFT JOIN `finngen-production-library.finngen_omop.person`
          ON cohort.subject_id = person.person_id
        WHERE cohort_definition_id = 5
        ORDER BY person_source_value
        LIMIT 20
        """
    )
    results = query_job.result()
    print("20 first finngen ids from atlas cohort 5")
    for row in results:
        print("{}".format(row.person_source_value))

if __name__ == "__main__":
    examplequery()
```

Example R script:

```r
#!/usr/bin/Rscript
library(bigrquery)

# Define the scope for queries. Currently only read-only access is enabled.
bq_auth(scopes = "https://www.googleapis.com/auth/bigquery.readonly")

# Define the project that runs the queries == the user's own Sandbox project
projectid <- "fg-production-sandbox-6"

# Note that cohort 5 must first be generated in Atlas.
sql <- "SELECT person_source_value
  FROM `finngen-production-library.finngen_omop_result.cohort`
  LEFT JOIN `finngen-production-library.finngen_omop.person`
    ON cohort.subject_id = person.person_id
  WHERE cohort_definition_id = 5
  ORDER BY person_source_value
  LIMIT 20"
tb <- bq_project_query(projectid, sql)
df <- bq_table_download(tb)
print("20 first finngen ids from atlas cohort 5")
print(df[["person_source_value"]])
```
