Managing memory in Sandbox and data filtering tips

Choosing the optimal machine size for the task at hand saves costs, as Sandbox billing is also based on the size of the machine used.

The 'Basic Machine' (1 vCPU, 3.75 GB) is good for standard use such as navigating the Sandbox, building cohorts in Atlas, and starting pipelines. Loading phenotype data into R needs a lot of memory and therefore the 'Rather Big Machine' (16 vCPUs, 104 GB).

Saving data

Saving data to your home disk (/home/ivm/) in Sandbox consumes home disk space, which is not dependent on the IVM size. See the sections below on checking free space in the home disk and resizing the home disk.

It is possible to consume more memory than is available in your IVM. When memory runs out, the IVM becomes very slow or gets stuck. If your IVM is unresponsive, you can force it to shut down, after which you can continue working normally by creating a new IVM (from the ‘Start machine’ button, see figure above).

To force the IVM to shut down, check whether the start button in the left sidebar is available and click it. If the start button is not available, contact humgen-servicedesk@helsinki.fi; an admin at the service desk can force your IVM to shut down. After the IVM is terminated, you can continue working normally by creating a new IVM. Note! Forcing the IVM to shut down will cause the loss of all unsaved data. In the worst case, forcing the IVM to shut down may corrupt your persistent disk, causing the loss of all data in your /home/ivm folder.

To plan memory usage, you can check how much memory is available in your IVM. Open the Terminal Emulator and type

free -m
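
The command reports total, used, and free memory in mebibytes; the 'available' column is the best estimate of how much memory new processes can still use. Purely illustrative output (the numbers on your own IVM will differ):

              total        used        free      shared  buff/cache   available
Mem:           3785        1200         900          10        1685        2300
Swap:             0           0           0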

To check memory and CPU usage per process, type top in the Terminal and press q to exit.
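
On most Linux systems, top can also be started sorted by memory use, which makes it easier to spot the processes consuming the most memory. This is optional; the plain top command above is enough:

top -o %MEM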

Memory management in RStudio

Reading big files, such as phenotype, genotype, or longitudinal data, into RStudio consumes a lot of memory and requires the ‘Rather Big Machine’ (16 vCPUs, 104 GB). You can check how much memory the RStudio session is currently using, and how much you have left, from the memory usage widget in the RStudio Environment pane. Here, for example, the RStudio session is currently using 232 MiB. For a detailed report of memory usage, click the small triangle to open a drop-down menu and select "Memory Usage Report". Here the current session is using 43% of the memory while 57% of the memory is free.

Filtering in Terminal

Filtering data with Unix commands consumes considerably less memory than filtering data with R. For example, filtering in RStudio requires loading e.g. the detailed longitudinal data into RStudio and consequently the ‘Rather Big Machine’. In contrast, the same filtering can be done on the ‘Basic Machine’ using the Terminal. After the data has been pre-filtered in the Terminal, it can be loaded into R/RStudio for further analyses, possibly on the Basic Machine.

For example, to filter for J45 (the ICD-10 code for asthma) with a Linux command in the Terminal

zcat path/to/finngen_R8_detailed_longitudinal.txt.gz | grep J45 > my_result_file.txt
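
Before writing the results to a file, it can be useful to first count how many rows match; grep -c prints only the number of matching lines:

zcat path/to/finngen_R8_detailed_longitudinal.txt.gz | grep -c J45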

The filtered file containing all rows with the text “J45” will appear in your /home/ivm directory. The result file can then be loaded into R/RStudio and the analysis continued there. To load the pre-filtered table in R/RStudio

library(data.table)   # fread() comes from the data.table package

# Read the pre-filtered table into a plain data frame
my_result_file = fread("/home/ivm/my_result_file.txt", data.table = FALSE)

NB! If you filter at the command line, be careful to check the code set in R. For example, F29 means psychosis in ICD-10 but eye discomfort in ICPC2, so filtering simply like this at the command line will return both sets, and you will need to check in R that the code set is correct.
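
If you want to restrict the match already at the command line, a column-aware filter with awk is one option. This is only a sketch: the column numbers below (diagnosis code in column 5, code vocabulary/ICD version in column 9) are assumptions for illustration, so check the actual column order of your file (for example with the head command shown a little further below) before using it:

zcat path/to/finngen_R8_detailed_longitudinal.txt.gz | awk -F'\t' '$5 ~ /^J45/ && $9 == "10"' > my_result_file.txt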

We may not need all the columns in the file to perform our analyses. Subsetting a file from 10 columns to 5 will cut its size roughly in half.

To view the column headers (the first line of the file)
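
A minimal sketch using the same example file as above; head -n 1 prints only the first (header) line, and tr puts each column name on its own line for readability:

zcat path/to/finngen_R8_detailed_longitudinal.txt.gz | head -n 1 | tr '\t' '\n'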

To select columns
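
One way to do this is with cut. This is a sketch: the column numbers (1, 3, and 5) and the output file name are only placeholders, so pick the columns you actually need based on the header, and note that -f assumes the file is tab-separated:

zcat path/to/finngen_R8_detailed_longitudinal.txt.gz | cut -f 1,3,5 > my_subset_file.txt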

Check free space in Home Disk

The home disk is the user's private disk (the /home/ivm/ folder in Sandbox) where users can save their own files. No other user besides the account owner has access to the private home disk. By default, the size of the home disk is 10 GB. The amount of space on the home disk is not dependent on the IVM size (Basic, Advanced, or Rather Big Machine).

To check the size of your home disk and the amount of free space on it, type in the Terminal

df -h /home

The output gives the size of the home disk, the used space, the available space, the percentage of space used, and the folder the disk is mounted on.
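
Purely illustrative output (the device name and numbers are hypothetical; your own IVM will show different values):

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        9.8G  1.2G  8.1G  13% /home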

If the space on the home disk is running out, it is recommended to free space by removing unneeded files and folders, e.g. with the rm command in the Terminal. Note that the rm command is irreversible: removed files and folders cannot be restored. Using the -i flag will prompt before each removal.

To remove a file

rm -i my_file.txt

To remove a folder and all of its content

rm -ri my_folder
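
To see which files and folders are taking up the most home disk space, du together with sort is a handy combination; -s summarises each item, -h prints human-readable sizes, and sort -h orders them from smallest to largest:

du -sh /home/ivm/* | sort -h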

It is also possible to resize the home disk to enable more space for the user's files and folders.

The trash bin may hold a lot of files that consume home disk space. Make sure to clear the trash bin from time to time.
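
On most Linux desktops the trash bin lives under ~/.local/share/Trash; if the Sandbox desktop follows this convention, it can be emptied from the Terminal as follows (this removes the trashed files permanently, so double-check before running it):

rm -rf ~/.local/share/Trash/*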

Resizing Home Disk

Note that the change is permanent! Once you have upgraded your home disk size, you can’t reduce it.

By default, the size of the home disk is 10 GB. Open Terminal and type

df -h | grep home

Then close the IVM, resize the home disk to 20 GB, and start the smallest IVM again. Note that once done, you cannot revert this action: your home disk will permanently be 20 GB instead of 10 GB and it will cost accordingly.

After the home disk has been resized, repeating the command df -h | grep home shows that the home disk now has 20 G of space in total, of which 95 M is used and 19 G is free.

To see how many CPUs the IVM has, type lscpu | grep 'CPU(s):' in the Terminal. Note that the number of CPUs has increased from 1 to 2.

Before you resize your home disk, please consider that the change cannot be reverted and it will affect your IVM costs.
