Hello everyone,
I have been slowly working on my understanding of the challenges involved in compositional data analysis and the available tools in that area, and I have some general questions I was hoping people here might help me with. I will also note that this is mostly within the context of microbiome data.
First is the issue of transforms. The mixOmics website and tutorials seem to mostly use the CLR transform. However, reading through the PhILR paper, the introduction mentions this:
However, the centered log-ratio transform has a crucial limitation: it yields a coordinate system featuring a singular covariance matrix and is thus unsuitable for many common statistical models (Pawlowsky-Glahn et al., 2015)
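To make sure I understand the singularity they mean: since each CLR vector sums to zero by construction, the covariance matrix of D CLR coordinates annihilates the all-ones vector and so has rank at most D-1. A minimal NumPy sketch on toy Dirichlet data (my own illustration, not mixOmics code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 50 samples x 5 taxa, strictly positive compositions
X = rng.dirichlet(np.ones(5), size=50)

# CLR: log of each part minus the mean log of that sample (row)
log_X = np.log(X)
clr = log_X - log_X.mean(axis=1, keepdims=True)

# Every CLR row sums to zero, so the 5x5 covariance matrix
# maps the all-ones vector to zero and cannot be full rank
cov = np.cov(clr, rowvar=False)
print(np.linalg.matrix_rank(cov))  # 4 (rank-deficient, i.e. singular)
```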
They go on to say that the ILR transform avoids this problem, but that meaningful partitioning can be an issue, and they cite (among others) the mixMC paper as a reference for using the ILR on microbiome data. Now, ignoring the issue of partitioning (especially since a useful phylogenetic tree can be difficult to produce), reading through the mixMC paper, the supplemental S1 Text says the following:
ILR transformed data are not easily interpretable… Therefore, Filzmoser et al. proposed to back transform the PCA results to the CLR space using the linear relationship between CLR and ILR transformations… The ILR followed by back transformation was used in mixMC with PCA.
So first, I have not been able to access the Pawlowsky-Glahn book the PhILR paper cites to get an understanding of what they mean by "unsuitable for many common statistical models". But would you agree with them, given the approach taken in the mixMC paper? And second, is that back-transformation something currently implemented or supported in mixOmics?
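For context on what I think the back-transformation is: since an ILR basis is an orthonormal basis of the hyperplane the CLR coordinates live in, mapping back is just multiplication by the basis matrix. Here is my sketch assuming the Helmert basis (one of many possible ILR bases, surely not the partition mixOmics actually uses):

```python
import numpy as np
from scipy.linalg import helmert

rng = np.random.default_rng(1)
X = rng.dirichlet(np.ones(5), size=20)   # toy compositions

log_X = np.log(X)
clr = log_X - log_X.mean(axis=1, keepdims=True)

# One possible orthonormal ILR basis: the rows of the Helmert submatrix
# span the hyperplane orthogonal to the all-ones vector
V = helmert(5).T                   # shape (5, 4), orthonormal columns

ilr = clr @ V                      # 4 non-degenerate ILR coordinates
clr_back = ilr @ V.T               # linear back-transformation to CLR

print(np.allclose(clr, clr_back))  # True
```

Because the CLR rows already lie in the hyperplane that V spans, the round trip recovers them exactly; PCA loadings and scores computed in ILR coordinates can be mapped back the same way.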
Beyond mixOmics, I have been interested in using the NBZIMM package to do differential abundance analysis on repeated measures data. Would the CLR transform be useful there, or would the application of GLMs fall under the "unsuitable" category?
Finally, and this is not necessarily related to mixOmics, but I would certainly appreciate any perspectives: the Gloor 2017 paper advocates CLR-transformed data with Euclidean distance as input to PCA as a replacement for beta diversity calculations, and I have seen that approach advocated in discussions here as well. However, the topic I have yet to see discussed is that beta diversity is more than just "how far apart are these samples". Bray-Curtis has a different meaning from Jaccard, which has a different meaning from UniFrac, the Morisita index, etc. Furthermore, Euclidean distance does not handle the double-zero problem, i.e. that a zero (or a one in offset data) could mean multiple things: not present, present below detection, or an outlier (see Kaul 2017). The "Waste not, want not" paper (McMurdie and Holmes 2014) focuses on normalization as an approach, which of course the CoDA framework rejects. But I have not seen a compositional take on replicating the meanings of the various diversity measures.
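To be clear, I follow the mechanical part of the Gloor proposal: Euclidean distance between CLR vectors is the Aitchison distance, and PCA is a rotation, so keeping all components preserves those pairwise distances exactly. A quick NumPy/SciPy sketch on toy data (my illustration, not code from the paper):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.dirichlet(np.ones(6), size=30)   # toy compositions

log_X = np.log(X)
clr = log_X - log_X.mean(axis=1, keepdims=True)

# Aitchison distance between samples = Euclidean distance between CLR rows
d_aitchison = pdist(clr)                 # metric='euclidean' is the default

# PCA via SVD of the column-centered CLR matrix; the scores are a rotation
# (plus a translation) of the data, so full-rank scores keep all distances
centered = clr - clr.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U * s

print(np.allclose(pdist(scores), d_aitchison))  # True
```

My question is about the interpretive part, not this mechanical part: the distances come out, but they carry one fixed meaning, unlike the family of diversity measures above.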
Many thanks in advance,
Shareef