Autoreporting – information on overlaps

The autoreporting pipeline filters and annotates GWAS analysis results and compares them against other datasets (e.g. GWAS Catalog and several hand-curated studies). Reports based on finemapping and GWAS analyses are generated for each release. The variants are grouped by the credible sets formed in the SuSiE finemapping pipeline and annotated with gnomAD 2.1 and FinnGen release annotations. The resulting variants are then compared against GWAS Catalog and our hand-curated results (publicly available at Betamatch data).

Filtering & Grouping

The autoreporting tool groups variants by the credible sets produced by the finemapping analysis. Within each credible set, the variant with the highest posterior inclusion probability (PIP) is designated as the lead variant, and the other variants in the same credible set are grouped with it. Variants in linkage disequilibrium (LD) with the lead variant are also added to the group. These LD partners are filtered on p-value, LD correlation, and distance from the lead variant. Together, the lead variant, credible set variants, and LD partners form a group.
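The lead-variant selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the record fields (`id`, `cs`, `pip`) and the function name are hypothetical, and the LD-partner step is left out.

```python
from collections import defaultdict


def group_by_credible_set(variants):
    """Group variants by credible set and pick the highest-PIP variant as lead.

    `variants` is a list of dicts with the hypothetical keys
    'id' (variant identifier), 'cs' (credible set identifier) and 'pip'.
    LD partners are not handled in this sketch.
    """
    by_cs = defaultdict(list)
    for v in variants:
        by_cs[v["cs"]].append(v)

    groups = {}
    for cs, members in by_cs.items():
        # The lead variant is the member with the highest PIP.
        lead = max(members, key=lambda v: v["pip"])
        groups[cs] = {
            "lead": lead["id"],
            "members": [v["id"] for v in members],
        }
    return groups


# Made-up example records, for illustration only.
example_variants = [
    {"id": "v1", "cs": "cs1", "pip": 0.10},
    {"id": "v2", "cs": "cs1", "pip": 0.85},
    {"id": "v3", "cs": "cs2", "pip": 0.60},
]
credible_set_groups = group_by_credible_set(example_variants)
```

In this made-up example, `v2` becomes the lead of `cs1` because it has the highest PIP in that credible set.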

For release reports, the summary statistics are grouped using credible set grouping.

When finemapping information is unavailable, other grouping strategies can be used instead of credible set grouping. Variants can be grouped by LD clumping or simply by range.

  • In LD clumping, the most significant variant is chosen as a group lead variant. All variants that are close enough to it (e.g. with a difference in position <= 2 Mb), significant enough (e.g. p-value < 1e-2), and in high enough LD with it (e.g. r^2 > 0.2) are clumped with it. This forms a group. Then the most significant of the remaining ungrouped variants is picked as the next group lead variant, and variants are clumped to it in the same way. This is repeated until all variants significant enough to be a group lead variant (e.g. p-value < 5e-8) are assigned to groups.

  • In range-based grouping, the most significant variant is chosen as the group lead variant. Variants that are significant enough (e.g. p-value < 1e-5) and at most N base pairs away from the lead variant are grouped with it. This is repeated with the remaining variants until all variants significant enough to be group lead variants (e.g. p-value < 5e-8) are grouped.
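Both strategies follow the same greedy loop, which the sketch below illustrates. It is a minimal example under stated assumptions rather than the pipeline's implementation: the function name, the record fields (`id`, `chrom`, `pos`, `pval`), the default thresholds, and the `r2` lookup callable are all hypothetical. Without an `r2` callable the function does range-based grouping; with one it also requires LD with the lead variant, as in LD clumping.

```python
def greedy_group(variants, lead_p=5e-8, member_p=1e-5,
                 max_dist=1_000_000, r2=None, min_r2=0.2):
    """Greedily group variants around the most significant lead variants.

    variants: list of dicts with hypothetical keys 'id', 'chrom', 'pos', 'pval'.
    r2: optional callable (lead_id, variant_id) -> r^2. If omitted, grouping
        is purely range-based.
    """
    # Most significant variants first.
    remaining = sorted(variants, key=lambda v: v["pval"])
    groups = []
    # Continue while the best ungrouped variant can still act as a lead.
    while remaining and remaining[0]["pval"] < lead_p:
        lead = remaining[0]
        members, rest = [lead], []
        for v in remaining[1:]:
            close = (v["chrom"] == lead["chrom"]
                     and abs(v["pos"] - lead["pos"]) <= max_dist)
            significant = v["pval"] < member_p
            linked = r2 is None or r2(lead["id"], v["id"]) > min_r2
            if close and significant and linked:
                members.append(v)
            else:
                rest.append(v)  # stays available for later groups
        groups.append({"lead": lead["id"],
                       "members": [m["id"] for m in members]})
        remaining = rest
    return groups


# Made-up example data, for illustration only.
example = [
    {"id": "a", "chrom": "1", "pos": 100, "pval": 1e-10},
    {"id": "b", "chrom": "1", "pos": 200, "pval": 1e-6},
    {"id": "c", "chrom": "1", "pos": 5_000_000, "pval": 1e-9},
    {"id": "d", "chrom": "1", "pos": 150, "pval": 0.5},
]
range_groups = greedy_group(example, max_dist=1000)
# With a (hypothetical) LD lookup that reports low LD for every pair,
# no variant is clumped to a lead.
ld_groups = greedy_group(example, max_dist=1000, r2=lambda lead, other: 0.1)
```

In the range-based call, `b` joins the group led by `a`, `c` forms its own group, and `d` stays ungrouped because it is neither significant enough to join a group nor to lead one.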

Annotation

Reports are annotated with population frequencies, Finnish enrichment, and the most severe consequence of each variant together with the associated gene. They also include the p-value and effect size from the previous FinnGen release for the same variant and phenotype. The annotations come from the FinnGen variant annotation file, gnomAD 2.1 data, and previous releases' summary statistics.

Comparison to other data sources

The variants are compared against associations reported in GWAS Catalog, as well as against a hand-curated group of studies. Since the GWAS Catalog data does not store allele information, it is added from dbSNP. The data from hand-curated studies can be found at Betamatch data.

Outputs

The autoreporting pipeline outputs two TSV files per phenotype. The variant report has one variant per row, with all of the information gathered for that variant in separate columns. The group report has one group of variants per row, and the information in its columns is either aggregated over the group or taken from the lead variant.
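The relationship between the two reports can be illustrated by collapsing variant rows into group rows. The sketch below uses a tiny in-memory table with hypothetical column names (`variant`, `group_id`, `lead_variant`); the actual report columns are described in the format documentation and may differ.

```python
import csv
import io


def collapse_to_groups(variant_rows, group_col, lead_col, id_col):
    """Collapse one-variant-per-row records into one-group-per-row records.

    The lead variant is taken from the first row of each group, and the
    member variants are aggregated into a count and a ';'-joined list.
    """
    groups = {}
    for row in variant_rows:
        g = groups.setdefault(row[group_col],
                              {"lead": row[lead_col], "variants": []})
        g["variants"].append(row[id_col])
    return [
        {
            "group": gid,
            "lead": g["lead"],
            "n_variants": len(g["variants"]),
            "variants": ";".join(g["variants"]),
        }
        for gid, g in groups.items()
    ]


# Stand-in for a variant report; the column names and values are made up.
example_tsv = (
    "variant\tgroup_id\tlead_variant\n"
    "v1\tg1\tv2\n"
    "v2\tg1\tv2\n"
    "v3\tg2\tv3\n"
)
rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
group_rows = collapse_to_groups(rows, "group_id", "lead_variant", "variant")
```

To apply the same idea to a real report, replace the `io.StringIO` object with an opened report file and pass the column names used in that file.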

The autoreporting results for a release can be found in the green library, in the analysis data for a given release. The reports are in TSV (tab-separated values) format.

  • Variant-based reports: /finngen/library-green/finngen_R6/finngen_R6_analysis_data/autoreporting_2.0/variant_reports/

  • Group reports: /finngen/library-green/finngen_R6/finngen_R6_analysis_data/autoreporting_2.0/group_reports/

Read more about the format of the FinnGen autoreporting results file.
