Case study – defining simple cohorts using medical codes for running case-control GWAS

In this example, we recreate the Risteys endpoint M13_ARTHROSIS_KNEE (https://risteys.finregistry.fi/endpoints/M13_ARTHROSIS_KNEE) to show how cohorts can be defined using BigQuery and R.

This example uses dplyr to perform the database query, which is an alternative way to retrieve data from the database without needing to know SQL; instead, some knowledge of the dplyr package is required.

Defining the case cohort:

Step 1 (define registers and codes):

To get started, we need to select which registers and medical codes we want to use to define our cases. Atlas’ search functionality can be useful here, as can browsing existing Risteys endpoints for ideas. In this example of knee arthrosis, we use the INPAT and OUTPAT registers with the following codes:

ICD-10: M17* (all ICD-10 codes starting with M17)
ICD-9: 7151F | 7152F
ICD-8: 71391

Step 2 (open RStudio, load libraries and prepare the BigQuery connection):

Once we have opened RStudio, we start by loading some libraries that will help with data processing and the BigQuery connection.

# Data manipulation related
library(dplyr)
library(data.table)

# BigQuery connection related
library(bigrquery)
library(DBI)

Next, we need to prepare a connection object to use BigQuery with dplyr. This is done with the following code. Be sure to insert your own sandbox number into the projectid variable:

# Running some authentication code:
projectid <- "fg-production-sandbox-<YOUR_SANDBOX_NO>" # NB: ADD YOUR SANDBOX NUMBER!
options(gargle_oauth_cache = FALSE)
bq_auth(scopes = "https://www.googleapis.com/auth/bigquery")

# Making a connection object that is used to connect to the tables:
con = dbConnect(
  bigrquery::bigquery(),
  project = "finngen-production-library",
  dataset = "sandbox_tools_r11",
  billing = projectid
)

Step 3 (create connections to the required database tables):

With dplyr, it is easy to handle database tables as if they were stored as local tibbles in memory. This is done by using the tbl() function to store table connections in variables. In this example, we need the service_sector_detailed_longitudinal table, the finngenid_info table and the endpoint table. For the list of available tables, see the sandbox documentation.

detailed = tbl(con, "finngen_r11_service_sector_detailed_longitudinal_v1")
endpoint = tbl(con, "endpoint_cohorts_r11_v1")
info = tbl(con, "finngenid_info_r11_v1")
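These tbl() objects are lazy references: creating them transfers no data. If you want to inspect the SQL that dplyr generates before executing anything, show_query() prints it (a sketch, assuming the detailed connection from above):

```r
# A lazy table: this only builds a query, no data is downloaded yet.
# show_query() prints the SQL that dbplyr would send to BigQuery.
detailed %>%
  select(FINNGENID, SOURCE, CODE1) %>%
  show_query()
```

This is a handy way to double-check your filters before paying for a full query.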

Step 4 (run BigQuery to retrieve your case cohort):

Using dplyr to get results from the service sector data is fairly generic, and with slight modifications the following recipe can be applied to many similar cases. The typical workflow is: select columns -> filter the rows that match your cohort codes -> summarise the results (full rows, minimum event age, just the cohort-defining IDs, etc.)

The following example shows that we can either specify codes one by one and use the %in% operator, or use the %like% operator (familiar from the data.table library) together with a wildcard (% in BigQuery) to select codes with a pattern.
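To see how the wildcard ends up in the query, note that dbplyr passes infix operators it does not recognise, such as %like%, straight through to the generated SQL, where they become a LIKE clause. A quick sketch using dbplyr's simulated connection (no BigQuery access needed):

```r
library(dbplyr)

# %like% is passed through to SQL as LIKE, so "M17%" matches any code
# starting with M17 (% is the SQL wildcard character).
translate_sql(CODE1 %like% "M17%", con = simulate_dbi())
```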

The following code first selects the columns we want to use, then filters the required registers and codes, and finally summarises the results to one row per FINNGENID. We've highlighted a few ways we can summarise the results (comments at the end of the code block):

  • distinct(): returns just one column with the defined case cohort

  • group_by -> summarise: returns a list of FINNGENIDs and one or more aggregate summaries (minimum, maximum, mean, median, etc. of columns like EVENT_AGE)

  • group_by -> filter(row_number): returns the full row for each FINNGENID with the minimum EVENT_AGE

Depending on whether you want to do further analysis (for example, on which codes produced the records) or whether defining the cohort is enough, you can pick any of the summary methods.

The code also shows how you could include codes from other registers, such as PURCH and PRIM_OUT (commented out), but here we use just the INPAT & OUTPAT part.

#icd10_koa = c("M170", "M171", "M172", "M173", "M174", "M175", "M179")
icd9_koa = c("7151F", "7152F")
icd8_koa = c("71391")

# we could add some ATC codes etc., if we wanted to include PURCH register
list_of_purchcodes = c()

Case_df <- detailed %>%
  select(FINNGENID:CODE2, ICDVER, CATEGORY) %>%  # selecting only columns we need
  filter(                                        # filtering row conditions that define our cases
      (SOURCE %in% c("INPAT", "OUTPAT") & (
         (ICDVER == "10" & (CODE1 %like% "M17%" | CODE2 %like% "M17%")) |
         (ICDVER == "9" & (CODE1 %in% icd9_koa | CODE2 %in% icd9_koa)) |
         (ICDVER == "8" & (CODE1 %in% icd8_koa | CODE2 %in% icd8_koa))
      ) #|
      #(SOURCE == "PURCH" & CODE1 %in% list_of_purchcodes) |
      #(SOURCE == "PRIM_OUT" & CATEGORY %like% "ICP%" & CODE1 %in% icpc2_codes)
      # etc. conditions for rows to be included 
   ) %>%
  group_by(FINNGENID) %>% filter(row_number(EVENT_AGE) == 1) %>%       # full rows with min event age
  # group_by(FINNGENID) %>% summarise(FIRST_EVENT = min(EVENT_AGE)) %>% # min event age
  # distinct(FINNGENID) %>%                                            # just cohort id's
  collect()

After running the code block with collect() at the end, the BigQuery is executed and the result is saved as a tibble/data frame object.
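Before moving on, it is worth sanity-checking the collected tibble (a sketch, assuming the Case_df object from above):

```r
# How many cases did we get, and where did the records come from?
nrow(Case_df)                       # case count
Case_df %>% count(SOURCE, ICDVER)   # records per register / ICD version
summary(Case_df$EVENT_AGE)          # distribution of age at first event
```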

Define and retrieve your control cohort:

Now that we have our case group defined and collected, we need to define our control cohort. Sometimes it may be acceptable to define controls simply as those who are not cases. Other times there may be endpoints and/or medical codes that you want to exclude from the analysis altogether, because they are not specific enough to assign to either the case or the control cohort. End of follow-up can also often be a useful filter. In this example of knee arthrosis, we want to exclude from the controls any individuals belonging to one of the following endpoints:

M13_PYOGARTH, M13_REACTARTH, M13_POLYARTHROPATHIES, M13_ARTHROSIS & M13_OTHERJOINT

To do this, we first find all the individuals we want to exclude from the analysis, and then remove them from the list of eligible controls.

# Optional step 1: set exclusion endpoints to a variable
exclude_endpoints = c("M13_PYOGARTH", "M13_REACTARTH", "M13_POLYARTHROPATHIES",
                      "M13_ARTHROSIS", "M13_OTHERJOINT")
                      
# gets id's that should be excluded using endpoint table:
# "lazy table", which gets evaluated in the next part when assigning Control cohort.
exclude_ids = endpoint %>%
  filter(ENDPOINT %in% exclude_endpoints & CONTROL_CASE_EXCL == 1) %>%
  distinct(FINNGENID)
  
# Step 2: Execute Bigquery to get control cohort:
Control = info %>%
  distinct(FINNGENID) %>%
  anti_join(exclude_ids) %>%      # Filters out the excluded IDs from step 1.
  collect() %>%                   # Executes BigQuery, collects results to R.
  anti_join(Case_df)              # In R, remove any overlapping cases from control cohort,
                                  # if there are some left from previous filtering step.
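A quick check that the two cohorts really are disjoint can save trouble later (a sketch, assuming the Case_df and Control objects from above):

```r
# The case and control cohorts must not share any individuals:
stopifnot(length(intersect(Case_df$FINNGENID, Control$FINNGENID)) == 0)

# Cohort sizes:
nrow(Case_df)
nrow(Control)
```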
                                  

We are now done with the BigQuery step, so we can disconnect from the database. This is done by running the following command:

dbDisconnect(con)

Save cohorts and run Custom GWAS:

There are a few options for how we can save the cohorts in our IVM to run GWAS with the Custom GWAS CLI tool:

1) The easiest way to save our case and control cohorts is to save them as separate text files

# step 1: create a table with just a column of FINNGENIDs
Case = Case_df %>%
  select(FINNGENID)

# Control cohort already in right format

# step 2: save files
write.table(Case, "/path_in_ivm/my_cases.txt", col.names = T,
            row.names = F, sep = "\t", quote = F)
write.table(Control, "/path_in_ivm/my_controls.txt", col.names = T,
            row.names = F, sep = "\t", quote = F)

# step 3: use Custom GWAS-cli to start your GWAS run (run in terminal):
# testPhenotype -> select a name for your phenotype
# test_run     -> add a short description of the run
# analysistype -> additive/recessive/dominant
finngen-cli request-gwas \
  --phenotype-name testPhenotype \
  --analysis-description test_run \
  --binary true \
  --analysistype additive \
  --casefile /path_in_ivm/my_cases.txt \
  --controlfile /path_in_ivm/my_controls.txt \
  --notification-email my.email@email.com \
  --release finngen_R11

2) The second option is to prepare a single so-called phenofile, which contains both the case and control cohorts. It has an additional phenotype column, with 1s marking cases and 0s marking controls.

# step 1: add phenotype column to case and control cohorts
Case = Case_df %>%
  select(FINNGENID) %>%
  mutate(MY_PHENONAME = 1) # add phenotype column

Control = Control %>%
  mutate(MY_PHENONAME = 0)

# step 2: combine cohorts rowwise
Phenofile = rbind(Case, Control)

# step 3: save files
write.table(Phenofile, "/path_in_ivm/my_phenofile.txt", col.names = T,
            row.names = F, sep = "\t", quote = F)
          
# step 4: use Custom GWAS-cli to start your GWAS run (run in terminal):
# NB: the phenotype name must match the phenotype column name in the phenofile!
# analysistype -> additive/recessive/dominant
finngen-cli request-gwas \
  --phenotype-name MY_PHENONAME \
  --analysis-description test_run \
  --analysistype additive \
  --phenofile /path_in_ivm/my_phenofile.txt \
  --notification-email my.email@email.com \
  --release finngen_R11
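To make the expected phenofile layout concrete, here is a toy illustration with hypothetical FINNGENIDs (base R only, nothing downloaded):

```r
# Hypothetical IDs, just to show the shape of the phenofile:
case <- data.frame(FINNGENID = c("FG00000001", "FG00000002"),
                   MY_PHENONAME = 1)
ctrl <- data.frame(FINNGENID = c("FG00000003", "FG00000004"),
                   MY_PHENONAME = 0)
pheno <- rbind(case, ctrl)

pheno  # one row per individual: 1 = case, 0 = control
```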

3) The third option is to read the data into Cohort Operations (CO) for further analysis of the cohorts. With CO, we can for example analyse code enrichment in the cohorts and run Custom GWAS directly.

In this example we create a minimum-requirement text file to import the cohorts into Cohort Operations, as we haven't collected any additional data for the control cohort. For this, we need to make a tab-separated file with three columns: FINNGENID, COHORT_SOURCE & COHORT_NAME. Note that you can add other columns as well, such as GENDER etc., if you have collected them. See the CO documentation for the possible columns.

# step 1: add COHORT and SOURCE columns to case and control cohorts
Case = Case_df %>%
  select(FINNGENID) %>%
  mutate(COHORT_SOURCE = "text file",
         COHORT_NAME = "pheno_case")    # name your case cohort

Control = Control %>%
  mutate(COHORT_SOURCE = "text file",
         COHORT_NAME = "pheno_control") # name your control cohort

my_CO_file = rbind(Case, Control)

# step 2: save CO import file:
write.table(my_CO_file, "/path_in_ivm/my_casecontrol_cohorts.txt",
            col.names = T, row.names = F, sep = "\t", quote = F)

After the cohorts have been analysed further, Custom GWAS can be started directly from CO.

Running GWAS with Pipelines (Regenie)

If we want to select which covariates to use in the GWAS run and/or run multiple phenotypes in the same job, we can use the Pipeline tool. For instructions, see the handbook sections on How to run GWAS using REGENIE and Adding new covariates in GWAS using REGENIE and SAIGE.
