Hi @MaxBladen
Thanks a lot for your response.
Just so I understand this properly, the OTUs which were selected by the DIABLO method are those which primarily contain 0s in your original (pre-offset and pre-CLR) data?
Yes, that is correct: the OTUs with high weights in both DIABLO and MOFA are the ones that are zero for all my samples pre-offset and pre-CLR.
When you say “pseudocount”, are you referring to the offset you apply to all samples to remove any cells with 0 in them?
Correct.
Here, are you saying that for a given OTU, two samples which have the same value in your dataset prior to the CLR have different values after the CLR?
The answer here is indeed yes. But why, then, are both DIABLO and MOFA selecting these all-zero OTUs for inclusion in the latent variables? Could it also be because I worked with relative abundance data instead of absolute counts?
I’ve now tried pre-processing the absolute count values again using the steps described here. The filtering step circumvents the problem of having a lot of zeros in my dataset, and at first glance everything looks fine to me. I’ll illustrate with an example below to make this clearer:
For my absolute data, the value of OTU5 (example name) in the first 5 samples is:
0, 0, 8629, 0, 6713
After applying an offset of +1 and a CLR transformation, this becomes (rounded to two decimals):
-0.71, -0.67, 7.60, -1.19, 7.89
So here, indeed, the zero values for one OTU become different across the samples, but at least they’re in the same rank order, so they shouldn’t introduce any unwanted variation between the samples?
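To make sure I’m applying the transformation correctly, here is a minimal sketch (in Python/numpy, with toy numbers rather than my real data) of how I understand the offset + CLR step. The exact values depend on all OTUs in a sample, so they won’t reproduce the numbers above, but it shows why two identical zero counts end up with different transformed values:

```python
import numpy as np

def clr(counts, offset=1.0):
    """Row-wise centred log-ratio transform on a samples x OTUs matrix."""
    logx = np.log(counts + offset)                   # add pseudocount, take logs
    return logx - logx.mean(axis=1, keepdims=True)   # subtract each sample's mean log (log of its geometric mean)

# toy data: 3 samples x 4 OTUs; OTU0 is zero in the first two samples
toy = np.array([
    [0.0,    120.0,  80.0,  300.0],
    [0.0,     15.0,   5.0,   10.0],
    [8629.0, 400.0, 900.0, 1200.0],
])

print(np.round(clr(toy, offset=1), 2))
# OTU0 gets a different CLR value in samples 1 and 2 even though both were 0,
# because the geometric mean differs between samples.
```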
Same exercise for the relative data (the one I used in DIABLO & MOFA):
Original values for the same OTU and the same five samples are:
0, 0, 3.46E-01, 3.37E-05, 2.67E-01 (weird that the value in sample four isn’t zero as well)
After applying an offset of 1E-6 (I’ve read that I should adapt the offset to the minimum value in the data, which here is 4.47E-5) and a CLR transformation:
-0.25, -0.15, 12.4, 2.95, 12.26
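To illustrate why the size of the offset matters here, the same kind of sketch applied to relative abundances (again toy numbers, not my real data) shows how a very small offset pushes the transformed zeros far away from the non-zero OTUs:

```python
import numpy as np

def clr(x, offset):
    logx = np.log(x + offset)
    return logx - logx.mean(axis=1, keepdims=True)

# one toy sample of relative abundances for 4 OTUs, the first being zero
rel = np.array([[0.0, 3.46e-1, 3.37e-5, 2.67e-1]])

for off in (1e-6, 1e-3):
    print(off, np.round(clr(rel, off), 2))
# the smaller the offset relative to the smallest non-zero abundance,
# the further the transformed zero sits from the other values
```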
So the zeros do get transformed into different values, but in the end they shouldn’t be selected as the most important variables, right? (Certainly not an OTU that is all zeros.) Or am I missing something here?
Also, when looking at the histograms and testing for multivariate normality, things do not really look good yet. Should we also apply some sort of scaling after the CLR?
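To make the scaling question concrete, this is the kind of per-OTU standardisation I have in mind after the CLR (just a numpy sketch of column centring and unit-variance scaling; I’m not sure whether this is the recommended approach):

```python
import numpy as np

def standardise(clr_values):
    """Centre each OTU (column) and scale it to unit variance after the CLR."""
    mu = clr_values.mean(axis=0, keepdims=True)
    sd = clr_values.std(axis=0, ddof=1, keepdims=True)
    return (clr_values - mu) / sd

# toy CLR-transformed matrix (samples x OTUs)
clr_mat = np.array([
    [-0.71,  7.60, -1.19],
    [-0.67, -1.19,  7.89],
    [ 7.60, -0.67, -0.71],
])
print(np.round(standardise(clr_mat), 2))
```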
Thanks again for your time.
Kind regards
Pablo