BigQuery (relational database)
Introduction
BigQuery is a database technology useful in storing and analyzing large and complex datasets like longitudinal data in FinnGen. This makes it easier for user to query, using SQL, subset of data much faster than reading entire dataset and then filter out. Queried data can be directly inserted into downstream analysis.
Data
finngen-production-library
project/database contains phenotype data for different release of FINNGEN data. The data in this project/database comes from LIBRARY_RED and LIBRARY_GREEN folders within Sandbox IVM folders.
BigQuery Project/Database | BigQuery Dataset/Schema | BigQuery Table Name | BigQuery Table Details | Sandbox IVM folder | Release Version |
---|---|---|---|---|---|
finngen-production-library | sandbox_tools_r12 | finngen_r12_service_sector_detailed_longitudinal_v1 | LIBRARY_RED | DF12 | |
endpoint_cohorts_r12_v1 | Endpoint cohorts | LIBRARY_RED | DF12 | ||
code_counts_r12_v1 | Code Counts | LIBRARY_RED | DF12 | ||
minimum_extended_r12_v1 | LIBRARY_RED | DF12 | |||
birth_mother_r12_v1 | LIBRARY_RED | DF12 | |||
vision_r12_v1 | LIBRARY_RED | DF12 | |||
kidney_r12_v1 | LIBRARY_RED | DF12 | |||
kanta_r12_v1 | LIBRARY_RED | DF12 | |||
medical_codes | finngen_vnr_v1 | LIBRARY_GREEN | INDEPENDENT | ||
fg_codes_info_v6 | Code translation info table | LIBRARY_GREEN | |||
finngen_omop_r12 | LIBRARY_RED | DF12 | |||
finngen_results_r12 | Achilles tables | LIBRARY_RED | DF12 |
Superset
Superset is a web tool that can be used to explore FinnGen data stored in BiqQuery database. Users can do SQL queries to the data and see the results as a data visualization. Superset is accessed from IVM Applications menu (Applications>FinnGen>Superset).
In Explore tab the user can do different visualizations to the data using the graphical interface. In the SQL Lab tab direct SQL queries can be done to the data.
Take a look how to export FINNGENID from Atlas using SuperSet.
How to save results
It is possible to save results into Sandbox specific BQ database named “Sandbox” (see available databases from Superset Database tab). Users can import small custom tables BQ database “Sandbox” and export result tables to Sandbox IVM as a csv file.
Superset demo
See a Superset tutorial video from FinnGen Users' Meeting 22th Sebtember 2020 (at 50min).
More information:
bq command-line tool
It is also possible to access the BigQuery tables using bq command-line tool which is a Python-based command-line tool. bq command-line tool can be used to run queries.
Before running a query, users can check how much does it cost of running query using bq command-line tool
The dry run will not execute the query but rather gives out how much data it consumes. In general, queries are priced using on-demand with estimate of $6.25 per Tebibyte (TiB). More details on pricing structure can be found here.
Google Cloud BigQuery python and R drivers
It is also possible to use google cloud BigQuery python and R drivers to access data directly from IVM.
See a tutorial video about how to conduct BigQuery using Python and R scripts from Users' meeting recordings (at 30min 2sec).
Use cases
Links on how to connect to BigQuery in R and Python along with some use cases for downstream analysis
Example Python script:
Example R script:
Last updated