Why does my IVM freeze while loading data into R/RStudio?

A common issue for Sandbox users is that the Interactive Virtual Machine (IVM) freezes when loading FinnGen data (such as phenotypes, summary statistics, or genotypes) into R/RStudio. This happens because FinnGen data sets are large and can easily exhaust the memory available to the IVM, especially on the Sandbox "Basic machine" with 1 CPU and 3.75 GB of memory. FinnGen data is usually stored in a compressed format, so it expands significantly when loaded into R/RStudio and consumes far more memory than the file size on disk suggests.
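To get a feel for how much a compressed file expands, its uncompressed size can be measured from the shell before opening it in R. The following is a minimal sketch, not an official Sandbox tool: it builds a small demo file in a temporary directory (the file name and its contents are made up for illustration), so replace `infile` with the path of a real file. The result is only a rough indication, because the memory R needs for the same data depends on column types and can differ from the size of the plain text.

```shell
#!/bin/sh
# Hypothetical demo file; point "infile" at a real compressed file instead.
workdir=$(mktemp -d)
infile="$workdir/example.tsv.gz"

# Build a small, repetitive demo table so the sketch runs on its own.
awk 'BEGIN { for (i = 1; i <= 20000; i++) print "FG" i "\tI9_CHD\t2001-01-01" }' \
  | gzip > "$infile"

# Size of the compressed file on disk.
compressed_bytes=$(wc -c < "$infile" | tr -d ' ')

# Stream-decompress and count bytes; nothing is written to disk and the
# data passes through the pipe in small chunks.
uncompressed_bytes=$(gzip -dc "$infile" | wc -c | tr -d ' ')

echo "compressed:   $compressed_bytes bytes"
echo "uncompressed: $uncompressed_bytes bytes"
```

If the uncompressed size is already close to the memory of your IVM, loading the whole file into R is likely to fail, and subsetting first is the safer route.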

Solution: To avoid this issue, subset the FinnGen data before analysis; users rarely need all diagnosis codes or all samples at the same time. The following strategies help:

  1. Choose an Appropriate VM Size: Select a VM configuration that suits the size of your data. Three different VM configurations are available for selection when you log in to Sandbox (link).

  2. Subset the Data Before Loading: Use shell scripting to subset the data before loading it into R/RStudio. Shell tools process the file one line at a time instead of holding it all in memory, so only the rows and columns you need ever reach R (link).

  3. Use BigQuery to Load Data: Start R/RStudio and load only a subset of data directly from the BigQuery database, rather than loading the entire file from the library (link).
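The shell-based subsetting in strategy 2 can be sketched as follows. This is an illustrative example under stated assumptions, not the exact procedure from the linked guide: the demo file, the column names (`FINNGENID`, `ENDPOINT`, `EVENT_AGE`) and the code `I9_CHD` are placeholders chosen for the example, so adapt the path, column number and value to your own data.

```shell
#!/bin/sh
# Hypothetical demo input; replace "infile" with the path of a real file.
workdir=$(mktemp -d)
infile="$workdir/example.tsv.gz"
outfile="$workdir/subset.tsv"

# Tiny tab-separated demo table with placeholder column names.
printf 'FINNGENID\tENDPOINT\tEVENT_AGE\nFG0001\tI9_CHD\t54.2\nFG0002\tT2D\t61.0\nFG0003\tI9_CHD\t47.9\n' \
  | gzip > "$infile"

# Decompress as a stream and keep the header (NR == 1) plus the rows whose
# second column equals the wanted code. awk reads one line at a time, so
# memory use stays small regardless of the file size.
gzip -dc "$infile" \
  | awk -F'\t' -v code="I9_CHD" 'NR == 1 || $2 == code' > "$outfile"

cat "$outfile"
```

The resulting `subset.tsv` contains only the header and the matching rows, and it is this much smaller file that you then read into R/RStudio.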

Monitoring: It is also important to monitor memory usage in R/RStudio during your analysis. Be aware that duplicating a data frame can double the memory it uses (link). [Link to memory monitoring guide]
