Sandbox download requests – rules and examples for minimum N

Background: For data privacy reasons, only aggregate-level data derived from at least 5 individuals may be exported from Sandbox. All subgroups used and visible in the analysis results must therefore contain >=5 individuals for the download to be approved. No individual data points or IDs may be shown.

Most common download request types

1) Case/control analysis results (for instance GWAS results run with SAIGE or REGENIE)

These can be exported if the case group and the control group each have >=5 individuals. Columns that show genotype counts for each variant among cases and controls can be kept even if a count is <5.

2) Histograms

Each bar should be based on data from >=5 individuals. If a bar carries further identifiers (such as a red area for females and a blue area for males), each of these should also refer to a sample group of >=5 individuals.
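The per-bar rule can be checked before a histogram is drawn. The following is a minimal sketch, assuming fixed-width bins and a plain list of values; the function names and the `ages` data are made up for illustration and are not part of any Sandbox tool.

```python
from collections import Counter

MIN_N = 5  # minimum number of individuals per exported group


def bin_counts(values, bin_width):
    """Count individuals per fixed-width bin, keyed by the bin's left edge."""
    return Counter((v // bin_width) * bin_width for v in values)


def small_bins(values, bin_width, min_n=MIN_N):
    """Return {left_edge: count} for every bin holding fewer than min_n individuals."""
    counts = bin_counts(values, bin_width)
    return {left: n for left, n in sorted(counts.items()) if n < min_n}


# Hypothetical example data: one individual falls alone into the 70-79 bin.
ages = [21, 22, 23, 24, 25, 26, 31, 33, 35, 36, 38, 39, 71]
print(small_bins(ages, bin_width=10))
```

Any bin reported here would need to be merged with a neighbour or dropped before the plot is requested for download. For stacked or colored bars, the same count would be needed per bin and subgroup rather than per bin only.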

3) Curves

Each curve should be drawn from >=5 individuals. In survival curves, for instance, the N will at some point fall below 5, but this is OK as long as the entire group from which the curve is drawn has enough individuals. The curve should not carry any other identifiers (such as colored areas) unless these also refer to >=5 individuals. Please note that curves should not contain vertical bars or any other markers that show individual events; a staircase-like curve is allowed, however, as illustrated in the example below.

A slightly modified real-life example of an approved curve (endpoint name, event details, SNP, and genotype counts changed)

4) Scatter plots

Scatter plots are allowed if each dot is an average of at least 5 individuals. Sometimes another plot type is a better choice than a scatter plot; for instance, PCA plots can be drawn as density plots instead.

Slightly modified real-life example (variable name changed) of an approved PCA density plot
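The averaging rule for scatter plots can be sketched as follows. This is one possible approach, not a prescribed one: points are sorted by x, cut into runs of at least 5 individuals, and each run is replaced by its mean. The function name and the grouping strategy are assumptions made for illustration.

```python
from statistics import mean

MIN_N = 5  # minimum number of individuals behind each plotted dot


def averaged_points(points, min_n=MIN_N):
    """Replace (x, y) points by per-group means, each group holding >= min_n points."""
    if len(points) < min_n:
        return []  # too few individuals to show any dot
    ordered = sorted(points)
    chunks = [ordered[i:i + min_n] for i in range(0, len(ordered), min_n)]
    if len(chunks[-1]) < min_n:
        # Fold a short trailing chunk into the previous one so no dot has < min_n.
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [
        (mean([x for x, _ in chunk]), mean([y for _, y in chunk]))
        for chunk in chunks
    ]


# Hypothetical data: 12 individuals become 2 dots (5 and 7 individuals).
dots = averaged_points([(i, 2 * i) for i in range(12)])
```

Grouping by sorted x is only one choice; grouping by a meaningful covariate may suit an analysis better, as long as every dot still averages at least 5 individuals.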

5) Pie charts

Each sector should derive from the data of >=5 individuals. If there are other identifiers, such as a sector being further divided, the subparts should also derive from >=5 individuals.

6) Code

Code can be exported as long as it contains only "pure" commands and no table or header views of the data analyzed with it. If summary statistics, counts or similar are shown in the code or in comments, the minimum N should be reported. The code must not contain any FinnGen IDs; remove them before export.

Some exceptions:

1) Allele frequencies and counts

Such statistics are allowed to be exported even if the allele is present in <5 individuals.

However, allele counts for SNPs within haplotypes should be derived from at least 5 individuals. This requirement extends to haplotype frequencies as well.

2) TBI files

Binary files are usually not allowed, since admins have no way to check them. TBI files are an exception: they can be exported if you have generated one for your summary statistics. Keep in mind, however, that they can also be generated outside Sandbox from the data you have downloaded.

3) Basic descriptive statistics

Min/max/median/quartile values, shown for instance in box plots, often point to single individuals. Currently these can still be exported, but it is recommended that some fluctuation is added to them. Values can be shown either in a boxplot or as a table of exact values. Note, however, that the boxplot should not contain any additional dots that point to single individual values (such as outlier dots around the min and max values).

A slightly modified real-life example of an approved boxplot (endpoint name and case/control counts changed)

Imaginary example of basic descriptive statistics in a table (would be approved):

Statistic       ENDPOINT_EVENT_AGE
min             2.56
1st quartile    10.44
median          34.23
mean            33.01
3rd quartile    44.99
max             100.2
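The recommendation to add some fluctuation to min/max/median/quartile values could be implemented roughly as below. This is only a sketch under stated assumptions: the noise model (uniform noise of up to 1% of the value range), the parameter `rel_noise`, and the decision to leave the mean untouched are all illustrative choices, not an official rule.

```python
import random
from statistics import mean, quantiles


def jittered_summary(values, rel_noise=0.01, seed=None):
    """Summary statistics with small random noise on values that may point to one person.

    rel_noise is the maximum noise as a fraction of the value range (assumed, not official).
    """
    rng = random.Random(seed)
    lowest, highest = min(values), max(values)
    span = highest - lowest
    q1, med, q3 = quantiles(values, n=4)  # three quartile cut points

    def jitter(x):
        return x + rng.uniform(-rel_noise, rel_noise) * span

    return {
        "min": jitter(lowest),
        "1st quartile": jitter(q1),
        "median": jitter(med),
        "mean": mean(values),  # an aggregate of all individuals, left as is
        "3rd quartile": jitter(q3),
        "max": jitter(highest),
    }


# Hypothetical event ages for 100 individuals.
summary = jittered_summary(list(range(1, 101)), seed=1)
```

With very tightly clustered data the noise could in principle reorder neighbouring statistics, so the result is worth a quick visual check before export.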

Additional things to consider:

1) Identify your subgroups correctly

For instance, you could be running a case-control analysis for different PRS bins. In that case it is not enough to consider the total number of cases and controls; the case and control groups within each bin should also be >=5.
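The PRS bin example can be checked with a simple cross-tabulation. The sketch below assumes one `(prs_bin, status)` pair per individual; the bin labels, the status labels and the function name are hypothetical.

```python
from collections import Counter

MIN_N = 5  # minimum number of individuals per subgroup


def undersized_groups(records, min_n=MIN_N):
    """Return {(prs_bin, status): count} for observed subgroups with fewer than min_n."""
    counts = Counter(records)
    return {group: n for group, n in sorted(counts.items()) if n < min_n}


# Hypothetical data: totals look fine (9 cases, 35 controls),
# but the "high" bin has only 3 cases.
records = (
    [("low", "case")] * 6
    + [("low", "control")] * 20
    + [("high", "case")] * 3
    + [("high", "control")] * 15
)
print(undersized_groups(records))
```

Here the overall case and control counts pass the threshold, yet the results for the "high" bin would not, which is exactly the situation this item warns about.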

2) Marking empty, NA or n<5 is currently not sufficient

Results derived from <5 individuals are usually rejected. Marking such results as empty, NA or n<5 is therefore currently not accepted; the results from such groups should be removed entirely. If you have, for instance, a table where such cells are difficult to avoid and you need it for publication purposes, you can always ask project management to review your request.
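Removing such results entirely, rather than masking them, might look like the following sketch. It assumes the results are a list of dictionaries with count columns named `n_cases` and `n_controls`; those column names and the example rows are hypothetical.

```python
MIN_N = 5  # minimum number of individuals per subgroup


def drop_small_rows(rows, count_keys=("n_cases", "n_controls"), min_n=MIN_N):
    """Keep only rows in which every listed count is at least min_n.

    Rows below the threshold are removed, not replaced by NA or 'n<5'.
    """
    return [row for row in rows if all(row[key] >= min_n for key in count_keys)]


# Hypothetical results table with one undersized row.
results = [
    {"group": "A", "n_cases": 12, "n_controls": 40, "or": 1.3},
    {"group": "B", "n_cases": 3, "n_controls": 38, "or": 2.1},
    {"group": "C", "n_cases": 7, "n_controls": 5, "or": 0.9},
]
exportable = drop_small_rows(results)
```

After this step, group B does not appear in the export at all, so nothing in the file reveals that a group with fewer than 5 individuals existed.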

3) File formats

Admins are able to inspect files that open in the terminal or in the most common graphical programs, such as Excel and Word. Files in other formats, such as binary files, will be rejected because we are not able to inspect them.

4) What files do you actually need

Please restrict your download request to the files that you really need. All files go through manual inspection, so keeping them to an absolute minimum helps the admins and gets the files to you faster.

5) Timing

Kindly note that we give support for download requests approximately from 08:00 to 16:00 Finnish time. There is no support on weekends or on public holidays. It will usually take a few working days to inspect your file, so last-minute requests will likely not reach you in time.
