Sandbox download requests – rules and examples for minimum N
Background: For data privacy reasons, only aggregate-level data derived from at least 5 individuals may be exported from Sandbox. Every subgroup used or visible in the analysis results must therefore contain >=5 individuals for the results to be downloadable. No individual data points or IDs may be shown.
1) Case/control analysis results (for instance GWAS results run with SAIGE or REGENIE)
These results can be exported if both the case group and the control group have >=5 individuals. Columns that show genotype counts for each variant among cases and controls can be kept even if a count is <5.
2) Histograms
Each bar shown should be based on data from >=5 individuals. If a bar carries other identifiers (such as a red area for females and a blue area for males), each of these should also refer to a sample group of >=5 individuals.
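As a rough pre-submission check, the bin counts behind a histogram can be inspected before plotting. The sketch below assumes one value per individual and fixed-width bins; the function names, the bin width, and the example ages are all made up for illustration.

```python
from collections import Counter

MIN_N = 5  # minimum number of individuals behind each visible bar


def bin_counts(values, bin_width):
    """Count values per fixed-width bin, keyed by the lower edge of the bin."""
    return Counter((v // bin_width) * bin_width for v in values)


def small_bins(counts, min_n=MIN_N):
    """Return bins that would be drawn (count > 0) but hold fewer than min_n values."""
    return {edge: n for edge, n in sorted(counts.items()) if 0 < n < min_n}


# Hypothetical ages, one value per individual
ages = [21, 22, 23, 24, 25, 26, 31, 33, 35, 36, 38, 39, 62]
print(small_bins(bin_counts(ages, bin_width=10)))  # {60: 1}
```

Any bin reported here would need to be merged with a neighbour or dropped before the plot is drawn. If bars are split by a further identifier such as sex, the same check can be run separately on each subgroup.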
3) Curves
Each curve should be drawn from >=5 individuals. In survival curves, for instance, the N will at some point fall below 5, but this is OK as long as the entire group from which the curve was drawn has enough individuals. However, the curve should not carry any other identifiers (such as colored areas) unless they also refer to >=5 individuals. Please note that curves should not contain vertical bars or any other markers that show individual events; a staircase-like curve is nevertheless allowed, as illustrated in the example below.
A slightly modified real-life example of an approved curve (endpoint name, event details, SNP, and genotype counts changed)
4) Scatter plots
Scatter plots are allowed if each dot is an average of at least 5 individuals. In some cases another plot type is a better choice: PCA plots, for instance, can be drawn as density plots instead of scatter plots.
Slightly modified real-life example (variable name changed) of an approved PCA density plot
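One possible way to make every dot an average of at least 5 individuals is to sort the points, cut them into consecutive groups of 5, and plot the group means. This is only a sketch: the grouping strategy (sorting by x and folding a short remainder into the previous group) is an assumption for illustration, and `averaged_dots` is a hypothetical helper name.

```python
from statistics import mean

MIN_N = 5  # minimum number of individuals behind each dot


def averaged_dots(points, min_n=MIN_N):
    """Return (mean_x, mean_y, n) for consecutive groups of at least min_n points."""
    if len(points) < min_n:
        return []  # too few individuals to show any dot at all
    ordered = sorted(points)  # sorts by x, then by y
    groups = [ordered[i:i + min_n] for i in range(0, len(ordered), min_n)]
    if len(groups) > 1 and len(groups[-1]) < min_n:
        remainder = groups.pop()
        groups[-1].extend(remainder)  # fold the short tail into the previous group
    return [(mean(x for x, _ in g), mean(y for _, y in g), len(g)) for g in groups]


# Hypothetical data: 12 individuals
points = [(i, 2 * i) for i in range(12)]
print(averaged_dots(points))  # [(2, 4, 5), (8, 16, 7)]
```

The third element of each tuple is the group size, which makes it easy to confirm that no dot is backed by fewer than 5 individuals.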
5) Pie charts
Each sector should be derived from the data of >=5 individuals. If there are other identifiers, such as a sector being further divided, the subparts should also be derived from >=5 individuals.
6) Code
Code can be exported as long as it contains "pure" commands only, with no table or header views of the data analyzed with it. If summary statistics, counts, or similar are shown in the code or in its comments, the minimum N should be reported. The code must not contain any FinnGen IDs, so remove them prior to export.
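A quick scan of a script for ID-like strings can help catch leftovers before a request is submitted. The sketch below is illustrative only: the regular expression is a placeholder assumption, not the confirmed FinnGen ID format, and should be replaced with whatever pattern matches the IDs in your own data. The sample ID in the example is made up.

```python
import re

# Placeholder pattern: an assumption for illustration, NOT the confirmed ID format.
ID_PATTERN = re.compile(r"\bFG[0-9A-Z]{8}\b")


def lines_with_ids(source_text, pattern=ID_PATTERN):
    """Return (line_number, line) pairs for every line that matches the pattern."""
    return [
        (number, line)
        for number, line in enumerate(source_text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Hypothetical script contents with a made-up ID left in a comment
script = 'df = read_table("pheno.tsv")\n# checked sample FG1234ABCD manually\nfit(df)\n'
print(lines_with_ids(script))  # [(2, '# checked sample FG1234ABCD manually')]
```

A scan like this only catches strings that match the chosen pattern, so it complements a manual read-through rather than replacing it.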
1) Allele frequencies and counts
Such statistics are allowed to be exported even if the allele is present in <5 individuals.
The allele counts for SNPs within haplotypes should be derived from >=5 individuals. This requirement extends to haplotype frequencies as well. Please also note that any extra information about a haplotype group must also fulfill the N>=5 rule (for instance case/control counts in a haplotype group).
2) TBI files
Binary files are usually not allowed, since admins have no way to check them. TBI files are an exception: they can be exported if you have generated one for your summary statistics. However, please keep in mind that they can also be generated outside Sandbox from the data you have downloaded.
3) Basic descriptive statistics
Min/max/median/quartile values, shown for instance in box plots, often point to single individuals. They can currently still be exported, but it is recommended that some fluctuation is added to them. If the min/max/median/quartile values are generated from a large enough group, that is, from more than 1000 individuals, no fluctuation needs to be added. Values can be shown either in a boxplot or as a table of exact values. Note, however, that the boxplot should not contain any additional dots that point to single individual values (like outlier dots around the min and max values).
A slightly modified real-life example of an approved boxplot (endpoint name and case/control counts changed)
Imaginary example of basic descriptive statistics in a table (would be approved):
Statistic       ENDPOINT_EVENT_AGE
min             2.56
1st quartile    10.44
median          34.23
mean            33.01
3rd quartile    44.99
max             100.2
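The recommendation to add some fluctuation to min/max/median/quartile values could be sketched as follows. The text above does not say how much fluctuation to add, so the ±1% relative noise here is purely an illustrative assumption, as are the function and parameter names. The mean is left unchanged, which is also an assumption based on the text listing only min/max/median/quartiles.

```python
import random
from statistics import mean, median, quantiles

NO_NOISE_ABOVE = 1000  # groups of more than 1000 individuals need no fluctuation


def summary(values, rel_noise=0.01, rng=None):
    """Summarise values; perturb the order statistics of groups of <=1000.

    rel_noise (1% by default) is an illustrative assumption, not an official value.
    """
    rng = rng or random.Random()
    q1, _, q3 = quantiles(values, n=4)
    order_stats = {
        "min": min(values),
        "1st quartile": q1,
        "median": median(values),
        "3rd quartile": q3,
        "max": max(values),
    }
    if len(values) <= NO_NOISE_ABOVE:
        # Multiply each value by a random factor in [1 - rel_noise, 1 + rel_noise].
        order_stats = {
            name: value * (1 + rng.uniform(-rel_noise, rel_noise))
            for name, value in order_stats.items()
        }
    # Assumption: the mean is reported as-is.
    return {**order_stats, "mean": mean(values)}


# Hypothetical event ages for a group of 50 individuals
ages = [float(v) for v in range(10, 60)]
for name, value in summary(ages).items():
    print(f"{name}: {value:.2f}")
```

Because the noise is random, the printed order statistics differ from run to run. Passing a seeded `random.Random` as `rng` makes the output reproducible.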
1) Identify your subgroups correctly
For instance, you could be running a case-control analysis for different PRS bins. In that case it is not enough to consider the total number of cases and controls: the case and control groups within each bin should also have >=5 individuals.
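The PRS-bin situation can be checked by counting individuals in every combination of the grouping variables rather than only in the totals. This is a minimal sketch; the record layout and the field names `prs_bin` and `status` are hypothetical.

```python
from collections import Counter

MIN_N = 5  # minimum number of individuals per subgroup


def subgroup_counts(records, keys):
    """Count individuals in every observed combination of the grouping keys."""
    return Counter(tuple(record[key] for key in keys) for record in records)


def too_small(records, keys, min_n=MIN_N):
    """Return the observed subgroups with fewer than min_n individuals."""
    counts = subgroup_counts(records, keys)
    return {group: n for group, n in counts.items() if n < min_n}


# Hypothetical cohort: totals are 9 cases and 17 controls, both >=5
records = (
    [{"prs_bin": "low", "status": "case"}] * 6
    + [{"prs_bin": "low", "status": "control"}] * 8
    + [{"prs_bin": "high", "status": "case"}] * 3
    + [{"prs_bin": "high", "status": "control"}] * 9
)

print(too_small(records, ["status"]))             # {}
print(too_small(records, ["prs_bin", "status"]))  # {('high', 'case'): 3}
```

The totals pass the check, but the cases in the high bin do not, so results for that bin could not be exported as they stand. Note that combinations with no individuals at all never appear in the counts and need to be considered separately.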
2) Marking empty, NA or n<5 is currently not sufficient
Any results derived from <5 individuals are usually rejected. Marking them as empty, NA or n<5 is therefore currently not sufficient: the results from such groups should be removed entirely. If you have, for instance, a table where such a cell is difficult to avoid and you need it for publication purposes, you can always ask project management to review your request.
3) File formats
Admins are able to inspect files that open in the terminal or in the most common graphical programs, such as Excel and Word. Files in other formats, such as binary files, will be rejected because we are not able to inspect them.
4) What files do you actually need
Please restrict your download request to the files that you really need. Every file has to go through manual inspection, so keeping the number of files to an absolute minimum helps the admins and gets the files to you faster.
5) Timing
Kindly note that we give support for download requests from approximately 08:00 to 16:00 Finnish time. There is no support on weekends or on public holidays. Inspecting your files will usually take a few working days, so the results of last-minute requests will likely not reach you in time.