CLR Transformation creating unwanted variation?

Hi everyone!

I'm currently working on analysing a microbiome dataset (an OTU count table with relative abundances, so compositional data: each sample sums to one).
Our goal is to find OTUs related to our metabolomics dataset by using biomarker discovery tools such as DIABLO & MOFA, and to compare the variables selected by these tools with the results of our correlation analysis.
Before analysis, I performed a CLR transformation (using a pseudocount) on my microbiome dataset, since this is common practice and also appears in the tutorials of both MOFA & DIABLO.
My problem is that when looking at the variables with the top weights for my OTUs, they are all zero in the original (untransformed) OTU count table.
In other words: the CLR transformation turns all the zeros for a given OTU into different values, which causes MOFA & DIABLO to pick it up as a variable responsible for variation between my sample groups.
Does anybody have an idea what is going on here?
Or does anybody have previous experience using compositional count table data in DIABLO?

Thanks a lot!
Kind regards
Pablo

Hi @pvgeende,

I will say upfront that this is not my area of expertise, but I'll try to help out as best I can. I just have a few questions to clarify my understanding of your scenario.

My problem is that when looking at the variables with the top weights for my OTUs, they are all zero in the original (untransformed) OTU count table.

Just so I understand this properly, the OTUs which were selected by the DIABLO method are those which primarily contain 0s in your original (pre-offset and pre-CLR) data?

performed a CLR transformation (using a pseudocount)

When you say “pseudocount”, are you referring to the offset you apply to all samples to remove any cells with 0 in them?

the CLR transformation turns all the zeros for a given OTU into different values

Here, are you saying that for a given OTU, two samples which have the same value in your dataset prior to the CLR have different values after the CLR?

If the answer to this question is yes, I think I can explain. Let's look at how the CLR values are derived. For a given sample x with D OTUs, we can define x = [x1, x2, …, xD].

clr(x) = [ ln(x1 / G(x)), ln(x2 / G(x)), …, ln(xD / G(x)) ], where G(x) = (x1 · x2 · … · xD)^(1/D)

G(x) is the geometric mean of the sample, not of the OTU. Hence, two equal values belonging to the same sample (row) will have the same CLR-transformed value. In contrast, two equal values belonging to the same OTU (column) will generally have different CLR-transformed values, because their samples have different geometric means.
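
To make this concrete, here is a minimal base-R sketch of the CLR (not necessarily the exact implementation inside mixOmics or MOFA), assuming samples are in rows, OTUs are in columns, and zeros have already been offset:

```r
# Minimal CLR sketch: subtract each sample's log geometric mean from its log values.
clr <- function(x) {
  log_x <- log(x)
  sweep(log_x, 1, rowMeans(log_x), "-")
}

# Toy data: OTU1 equals 1 in both samples (same column);
# OTU1 and OTU3 both equal 1 within sample_B (same row).
toy <- rbind(sample_A = c(OTU1 = 1, OTU2 = 10, OTU3 = 100),
             sample_B = c(OTU1 = 1, OTU2 = 50, OTU3 = 1))
round(clr(toy), 2)
```

In the output, OTU1 and OTU3 of sample_B get identical CLR values (same row, so same G(x)), while OTU1 differs between sample_A and sample_B even though the raw value is 1 in both, because the two samples have different geometric means.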

If the answer to the last question was no, then I'm not really sure what is occurring.

Looking forward to your response,
Max.

Hi @MaxBladen

Thanks a lot for your response.

Just so I understand this properly, the OTUs which were selected by the DIABLO method are those which primarily contain 0s in your original (pre-offset and pre-CLR) data?

Yes, that is correct: the OTUs with high weights in both DIABLO and MOFA are the ones which are zero for all my samples pre-offset & pre-CLR.

When you say “pseudocount”, are you referring to the offset you apply to all samples to remove any cells with 0 in them?

Correct.

Here, are you saying that for a given OTU, two samples which have the same value in your dataset prior to the CLR have different values after the CLR?

The answer here is indeed yes. But why then are both DIABLO and MOFA selecting these OTUs with all zeros for inclusion in the latent variables? Maybe it is also because I worked with relative count data instead of absolute counts?

I've now tried pre-processing the absolute count values again using the steps described here. The filtering step circumvents the problem of having a lot of zeros in my dataset. At first glance, everything looks fine to me. I'll illustrate below with an example, which should make things clearer:

For my absolute data, the values of OTU5 (example name) in the first 5 samples are:
0, 0, 8629, 0, 6713
After applying an offset of +1 and a CLR transformation, this becomes (rounded to two decimals):
-0.71, -0.67, 7.60, -1.19, 7.89
So here indeed, the zero values for one OTU become different across the samples, but at least they stay in the same rank order, so they shouldn't introduce any unwanted variation between the samples?
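
In code terms, what I mean is roughly this (a sketch only; the object name otu_counts and the column name "OTU5" are placeholders, not my actual script):

```r
# Apply the +1 pseudocount and the CLR to the whole count matrix (samples x OTUs),
# then compare raw and transformed values for a single OTU across the first samples.
# 'otu_counts' is a placeholder for the absolute count matrix.
log_counts <- log(otu_counts + 1)
clr_counts <- sweep(log_counts, 1, rowMeans(log_counts), "-")

cbind(raw = otu_counts[1:5, "OTU5"],
      clr = round(clr_counts[1:5, "OTU5"], 2))
```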

Same exercise for the relative data (the one I used in DIABLO & MOFA):
Original values for the same OTU and the same five samples are:
0, 0, 3.46E-01, 3.37E-05, 2.67E-01 (weird that the value in sample four isn't zero as well)
After applying an offset of 1E-6 (I've read that I should adjust the offset to the minimum value in the data, which here is 4.47E-5) and a CLR transformation:
-0.25, -0.15, 12.4, 2.95, 12.26
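
To derive such an offset from the data itself, I could do something like the following (a sketch only; otu_rel is a placeholder for the relative abundance matrix, and taking a fraction of the smallest non-zero value is just one heuristic I have seen, not a rule):

```r
# Choose an offset smaller than the smallest non-zero relative abundance,
# then apply the CLR to the offset relative abundances (samples x OTUs).
# 'otu_rel' is a placeholder for the relative abundance matrix.
min_nonzero <- min(otu_rel[otu_rel > 0])
offset      <- min_nonzero / 10
log_rel     <- log(otu_rel + offset)
clr_rel     <- sweep(log_rel, 1, rowMeans(log_rel), "-")
```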

So the zeros do get transformed into different values, but in the end they shouldn't be selected as the most important variables, right? (Certainly not an OTU that is all zeros.) Or am I missing something here?
Also, when looking at the histograms and testing for multivariate normality, things do not really look good yet. Should we apply some sort of scaling as well after the CLR?

Thanks again for your time.

Kind regards
Pablo

weird that the value in sample four isn't zero as well

When printing values, R tends to simplify what is displayed. The 0 for that sample in the absolute data likely represents some tiny, positive value rather than a true 0.

So the zeros do get transformed into different values, but in the end they shouldn't be selected as the most important variables, right? (Certainly not an OTU that is all zeros.) Or am I missing something here?

Without any context, it's impossible for me to say whether you are missing something here. Unfortunately, there are no concrete rules in -omics analysis. If, for an OTU (say OTU 5), all the samples with non-zero values (e.g. samples 3 and 5) correspond to most or all of the samples of a specific class, then despite being mostly zeros it is a highly discriminative feature and will be selected by the model. If your model is selecting features with ALL zeros, then there is some cause for concern, but your pre-filtering should prevent this from being an issue (assuming you pick your threshold correctly).
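
For example, a pre-filter could look like the following rough sketch (otu_counts is a placeholder for a samples x OTUs count matrix, and the thresholds are arbitrary examples to tune, not recommendations):

```r
# Keep OTUs that are observed in enough samples or contribute enough total counts.
# 'otu_counts' is a placeholder for the count matrix (samples in rows, OTUs in columns).
prevalence  <- colSums(otu_counts > 0) / nrow(otu_counts)     # fraction of samples with a non-zero count
total_share <- 100 * colSums(otu_counts) / sum(otu_counts)    # each OTU's percentage of all counts

keep <- prevalence >= 0.10 | total_share >= 0.01              # example thresholds only
otu_filtered <- otu_counts[, keep, drop = FALSE]
```

An OTU that is zero in every sample has a prevalence of 0 and a 0% share of the counts, so it can never pass a filter like this.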

Also when looking at the histograms and testing for multivariate normality, things do not really look good yet, or should we apply some sort of scaling as well after CLR?

PLS-based methods do function on non-normally distributed data. However, highly skewed and/or kurtotic data will decrease the efficacy of the model. Hence, if many of your features deviate from the normal distribution to a high degree, scaling the data and remodelling using the scaled data may improve things.
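
If you do try scaling, a minimal sketch would be unit-variance scaling of the CLR-transformed block (clr_counts is a placeholder; also note that, as far as I recall, the mixOmics PLS-based functions such as block.splsda() scale each block by default via scale = TRUE, so check that you are not scaling twice):

```r
# Centre and scale each CLR-transformed feature to unit variance before modelling.
# 'clr_counts' is a placeholder for the CLR-transformed matrix (samples x OTUs).
clr_scaled <- scale(clr_counts, center = TRUE, scale = TRUE)
```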

Hope this helps a bit!


Hi Max

Thanks again for replying; it's definitely helping a lot.
I think I will play around with the pre-filtering threshold and see where I can go from there.
I might post some updates here later. Thanks again!

Greetings
Pablo
