What is the difference between LD clumping and the SAIGE conditional analysis?

There are indeed similarities: both deal with variants being correlated with each other (linkage disequilibrium, LD), and they share the main goal of identifying independent hits.

In LD clumping, we first choose the most significant variant in a region (one that passes a chosen significance threshold), then choose an r2 threshold and assign all variants with r2 > threshold to that variant.

Then, if significant variants remain, we choose the top variant among the remaining ones and keep looping until no variants are left above the chosen significance threshold. The end result is a set of top variants for n approximately independent signals in the region, and for these n variants we have only marginal (unconditional) betas and p-values.
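The clumping loop described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular tool: the function name `ld_clump`, the variant names, the p-values, and the pairwise r2 values are all made up for the example, and missing r2 pairs are assumed to be 0.

```python
def ld_clump(pvals, r2, p_threshold=5e-8, r2_threshold=0.4):
    """Greedy LD clumping on toy inputs (hypothetical helper).

    pvals: dict mapping variant name -> marginal p-value
    r2:    dict mapping frozenset({variant_a, variant_b}) -> pairwise r2
    Returns a dict mapping each lead variant -> list of variants clumped to it.
    """
    remaining = dict(pvals)
    clumps = {}
    while remaining:
        # Most significant variant that has not been assigned yet.
        lead = min(remaining, key=remaining.get)
        if remaining[lead] >= p_threshold:
            break  # nothing significant is left
        # Assign every remaining variant in high LD with the lead to its clump.
        members = [
            v for v in remaining
            if v != lead and r2.get(frozenset((lead, v)), 0.0) > r2_threshold
        ]
        clumps[lead] = members
        for v in [lead, *members]:
            del remaining[v]
    return clumps


# Toy data (assumed values, for illustration only).
pvals = {"v1": 1e-20, "v2": 1e-12, "v3": 1e-9, "v4": 0.03}
r2 = {
    frozenset(("v1", "v2")): 0.8,
    frozenset(("v1", "v3")): 0.1,
    frozenset(("v3", "v4")): 0.6,
}
clumps = ld_clump(pvals, r2)
print(clumps)
```

In this toy input, v2 is absorbed into the clump led by v1, while v3 is reported as a second lead variant. Note that the only statistics available for v1 and v3 are still their marginal p-values.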

In conditional analysis we follow a similar algorithm, but in the first iteration we rerun the association analysis for every variant in the region other than top_variant, adding top_variant to the model as a covariate:

pheno ~ a * covars + b * var_x + y * top_variant

i.e. we condition on the most significant variant. The result of this iteration is a beta and p-value for each variant in the region, conditional on the genotype of top_variant.

After conditioning, we check whether any significant variants remain. If so, we add the top remaining variant to the model and repeat, each time testing whether significant variants are left (and getting their conditional betas and p-values) after conditioning on all previously selected variants.
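The stepwise procedure can be sketched on toy data as below. This is only an illustration under strong simplifying assumptions: it uses a plain linear model with no covariates and a simple n * r^2 score statistic, whereas SAIGE fits a mixed model. Conditioning is done by projecting each selected variant out of the phenotype and out of the remaining genotypes, which for a linear model is equivalent to adding it as a covariate. All names, the constructed data, and the toy threshold of 3.84 (nominal p < 0.05 at 1 df) are assumptions made for the example.

```python
def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residualize(v, on):
    """Remove from v the part explained by `on` (least-squares projection)."""
    coef = dot(v, on) / dot(on, on)
    return [a - coef * b for a, b in zip(v, on)]


def score_chisq(x, y):
    """n * r^2, a simple 1-df association statistic (not SAIGE's test)."""
    xx, yy = dot(x, x), dot(y, y)
    if xx < 1e-12 or yy < 1e-12:
        return 0.0  # nothing left to test after conditioning
    return len(y) * dot(x, y) ** 2 / (xx * yy)


def stepwise_conditional(pheno, genos, chisq_threshold):
    """Return [(variant, chi-square at selection)] in order of selection."""
    y = center(pheno)
    candidates = {name: center(g) for name, g in genos.items()}
    signals = []
    while candidates:
        stats = {name: score_chisq(x, y) for name, x in candidates.items()}
        lead = max(stats, key=stats.get)
        if stats[lead] < chisq_threshold:
            break  # no significant variant left after conditioning
        signals.append((lead, stats[lead]))
        lead_x = candidates.pop(lead)
        # Condition on the new lead: project it out of everything that remains.
        y = residualize(y, lead_x)
        candidates = {n: residualize(x, lead_x) for n, x in candidates.items()}
    return signals


# Constructed toy data: 16 samples built from mutually orthogonal patterns.
a = [1, 1, 1, 1, -1, -1, -1, -1] * 2
b = [1, 1, -1, -1, 1, 1, -1, -1] * 2
c = [1, -1, 1, -1, 1, -1, 1, -1] * 2
e = [1, -1, -1, 1, -1, 1, 1, -1] * 2  # noise pattern
genos = {
    "lead": [x + 1 for x in a],                          # strong causal variant
    "shadow": [(x + z) / 2 + 1 for x, z in zip(a, b)],   # r2 = 0.5 with "lead"
    "secondary": [x + 1 for x in c],                     # weaker independent signal
}
pheno = [3 * x + z + 0.5 * w for x, z, w in zip(a, c, e)]

signals = stepwise_conditional(pheno, genos, chisq_threshold=3.84)
print(signals)
```

In this constructed example the "shadow" variant passes the toy threshold marginally but its statistic drops to zero once "lead" is conditioned on, while "secondary" is below the threshold marginally and only becomes significant after conditioning.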

Even if you care only about whether there are significant variants, and not about the conditional test statistics, conditional analysis still has a few practical benefits.

The first stems from having to choose an r2 threshold. It would seem safe to LD clump with an r2 of, say, 0.4. But if there is a strong association in the region, even a variant with a small r2 to the lead variant can be significant purely as a "shadow" of that signal.

The relationship between r2 and the expected chi-square test statistic of such a shadow variant is simply expected_chisq = r2 * lead_chisq

A top variant with p = 5 × 10^-18 would have genome-wide significant variants in its shadow at an r2 just under 0.4, and clumping at r2 = 0.4 would label such a variant an independent signal although it is not. This gets worse and worse with huge signals: for example, a top variant with p = 10^-50 would have shadow signals at an r2 of 0.13.
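The shadow thresholds quoted above can be reproduced with a few lines of Python. This is a rough sketch assuming a 1-df chi-square test and a genome-wide significance threshold of p = 5 × 10^-8; the function names are hypothetical and only the standard library is used.

```python
from statistics import NormalDist


def chisq_from_p(p):
    """1-df chi-square statistic corresponding to a two-sided p-value."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile; the sign vanishes when squared
    return z * z


def min_shadow_r2(lead_p, significance_p=5e-8):
    """Smallest r2 to the lead at which a shadow variant is still expected
    to pass the significance threshold (expected_chisq = r2 * lead_chisq)."""
    return min(1.0, chisq_from_p(significance_p) / chisq_from_p(lead_p))


for lead_p in (5e-18, 1e-50):
    print(f"lead p = {lead_p:g}: shadow expected down to r2 ~ {min_shadow_r2(lead_p):.2f}")
```

Any clumping r2 threshold above the value returned for the lead variant in a region leaves room for shadow variants to be reported as separate signals.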

The other clear practical benefit is that conditioning on a stronger signal may uncover a secondary signal that is not significant in the unconditional analysis. Of course, with multiple signals in a region the situation is even more complex, and with some amount of correlation between them it is really difficult to arrive at an accurate number of signals with LD clumping.
