SPLS-DA for two time points (repeated), plotLoadings mean vs median, CSS normalisation and scaling

Hello!
Thanks for developing the cool mixOmics package! I’m using it for 16S data with clinical variables. I have longitudinal data with 2 groups and 2 time points (pre and post).
I would like to see the DA features between the 2 groups. I understand that we can only perform sPLS-DA on the 4 combinations (2 groups * 2 time points). However, could I externally calculate the delta (timepoint 1 - timepoint 0) and give this as input to sPLS-DA, and use group as the other factor? This way I could look for DA features between the 2 groups directly. Also, could I use the same logic for sPLS?
My other question: the plotLoadings vignette recommends using “median” for count data. What difference would it make if we used “mean”? I do not clearly understand what a “tie” means in this case, and I’m afraid I’m losing data if I use “median”. Or could I use mean and justify it by using only features that hold a certain higher % of importance? Are there situations where you would recommend mean instead of median?
Thanks in advance :slight_smile:


Hi @Jan91,
thanks for your interest and your load of questions! :slight_smile:

I would like to see the DA features between the 2 groups. I understand that we can only perform sPLS-DA on the 4 combinations (2 groups * 2 time points). However, could I externally calculate the delta (timepoint 1 - timepoint 0) and give this as input to sPLS-DA, and use group as the other factor? This way I could look for DA features between the 2 groups directly.

Yes you can, and this is roughly what we called ‘indexing’ in this paper for methods other than mixOmics approaches.
In mixOmics you can consider a multilevel decomposition, as shown here, but I believe it will not apply in your case, as you seem to be more interested in discriminating the two groups than pre vs post. Multilevel decomposition applies when individual variation >> time variation and the individual variation is not of interest to us. You could look at a PCA first to see what the major source of variation is, and compare it with a multilevel PCA (use the multilevel argument in PCA and PLS-DA; for PLS, use the withinVariation() function to extract the within matrix, then input it into PLS).

Then in sPLS-DA, set Y = group.
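
For readers who want to see the delta (‘indexing’) step concretely, here is a minimal sketch in plain Python. It is not mixOmics code; the sample layout, the names (samples, delta_matrix) and the toy values are all hypothetical. It builds one post-minus-pre row per participant, together with the group label that would play the role of Y.

```python
# Hypothetical toy data: (subject, group, timepoint, feature values).
samples = [
    ("s1", "A", "pre",  [2.0, 5.0]),
    ("s1", "A", "post", [3.5, 4.0]),
    ("s2", "B", "pre",  [1.0, 6.0]),
    ("s2", "B", "post", [1.5, 9.0]),
]

def delta_matrix(samples):
    """Return (subjects, groups, deltas) with one post-minus-pre row per subject."""
    pre, post, group = {}, {}, {}
    for subject, grp, timepoint, values in samples:
        group[subject] = grp
        (pre if timepoint == "pre" else post)[subject] = values
    # Keep only subjects measured at both time points.
    subjects = sorted(s for s in pre if s in post)
    deltas = [[b - a for a, b in zip(pre[s], post[s])] for s in subjects]
    return subjects, [group[s] for s in subjects], deltas

subjects, y, x_delta = delta_matrix(samples)
print(subjects, y, x_delta)
```

Note that 4 samples become 2 rows: the delta matrix has one row per participant, and y holds one group label per row.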

Also could I use the same logic for sPLS?

Yes, if you have another data set to integrate. See the comment above.

My other question: the plotLoadings vignette recommends using “median” for count data. What difference would it make if we used “mean”? I do not clearly understand what a “tie” means in this case, and I’m afraid I’m losing data if I use “median”. Or could I use mean and justify it by using only features that hold a certain higher % of importance? Are there situations where you would recommend mean instead of median?

This is only for visualisation purposes, so you are not losing data, but you could draw the wrong conclusions. For count data that are skewed, the mean and the median will differ. This is where you should use the median (because the mean is going to be pulled by extreme values). Basically what we do here (and you can do it by hand to convince yourself, on the centered and scaled data) is, for a given selected feature, calculate the median per group and then declare the contribution as the group for which the median is maximum (you can also choose minimum). It might be worthwhile for you at this stage to have a closer look at your important variables and see whether their distribution is symmetric or not. We mentioned count data because they tend to be highly skewed.
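
The ‘do it by hand’ check described above can be sketched as follows. This is an illustration in plain Python, not the plotLoadings implementation; the function name contributing_group and the toy counts are hypothetical. Raw counts are used for readability: centering and scaling a single feature shifts and rescales all of its values in the same way, so it does not change which group has the larger median or mean.

```python
from statistics import mean, median

def contributing_group(values_by_group, summary=median):
    """Return the group whose summary statistic is largest for one feature."""
    return max(values_by_group, key=lambda g: summary(values_by_group[g]))

# Hypothetical counts for one selected feature.
feature = {
    "control":   [0, 1, 1, 2, 250],   # one extreme count inflates the mean
    "treatment": [8, 9, 10, 11, 12],
}

by_median = contributing_group(feature, median)
by_mean = contributing_group(feature, mean)
print("median:", by_median, "| mean:", by_mean)
```

Here a single extreme value in the control group is enough for the mean to attribute the feature to control, while the median attributes it to treatment.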

Hope that helps

Kim-Anh

Thank you for your clear answers!
Could you please also clarify couple of things?

  1. When using CSS+log transformed data, we must set the scale option to FALSE in the plsda/tuning step, is that correct?
    However, with scale=TRUE I see some separation between the groups (with X-variate explaining 3% and Y-variate explaining 2%), while with scale=FALSE the two groups almost overlap, but with X-1 variate explaining 28% and X-2 variate explaining 13%. So which is the correct choice, scale=FALSE or scale=TRUE?
  2. When using CSS+log transformed data, should plotLoadings still use “median” instead of “mean”? As I understand it, the previous CSS+log step already accounts for the skewness in the data, is that correct?
  1. When using CSS+log transformed data, we must set the scale option to FALSE in the plsda/tuning step, is that correct?
    However, with scale=TRUE I see some separation between the groups (with X-variate explaining 3% and Y-variate explaining 2%), while with scale=FALSE the two groups almost overlap, but with X-1 variate explaining 28% and X-2 variate explaining 13%. So which is the correct choice, scale=FALSE or scale=TRUE?

We stopped using CSS+log (our choice, it did not seem to bring additional insight, see our mixMC paper). Scaling is different from a normalisation step: it just ensures all variables have the same variance and are thus comparable. So we advise retaining scale = TRUE in the tuning (and in every method you use) if the variance of each variable is different and you are not interested in highlighting this in your model. That said, the explained variance in PLS-DA is not extremely relevant here, as we are trying to maximise the discrimination between sample groups (rather than the variance, as in PCA), so I would not make it a criterion to choose. Rather, look at the classification performance using cross-validation (see the perf() function).
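
To illustrate what ‘same variance’ means here, a minimal sketch of centering and scaling each variable to unit variance, in plain Python. It only mirrors the idea behind scale = TRUE and is not the mixOmics code; the helper name center_and_scale, the toy values and the use of the sample standard deviation (n - 1 denominator) are assumptions.

```python
from statistics import mean, stdev

def center_and_scale(column):
    """Subtract the column mean and divide by the sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]

# Hypothetical variables on very different scales.
abundant = [100.0, 300.0, 500.0]
rare = [1.0, 2.0, 3.0]

scaled_abundant = center_and_scale(abundant)
scaled_rare = center_and_scale(rare)
print(scaled_abundant, scaled_rare)
```

Before scaling, the abundant variable would dominate any variance-based criterion; after scaling, both variables carry the same weight.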

  2. When using CSS+log transformed data, should plotLoadings still use “median” instead of “mean”? As I understand it, the previous CSS+log step already accounts for the skewness in the data, is that correct?

As mentioned in my previous post, you first need to work out whether your normalised data (CSS+log) are skewed or not by looking at them. I won’t be able to advise on this: every data set is different, and ‘normalisation’ does not necessarily mean your data will have a normal distribution afterwards.
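
One simple way to ‘look at your data’ is to compute a skewness statistic per feature before and after the transformation. The sketch below is plain Python with hypothetical counts; it uses log(x + 1) as a stand-in for the log step and does not perform CSS normalisation, so treat it only as an illustration of the check, not of the full pipeline.

```python
import math
from statistics import mean

def sample_skewness(values):
    """Moment-based skewness: near 0 for symmetric data, > 0 for a right tail."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical counts for one feature, with a long right tail.
counts = [0, 1, 1, 2, 3, 5, 40, 400]
logged = [math.log(c + 1) for c in counts]

raw_skew = sample_skewness(counts)
log_skew = sample_skewness(logged)
print(f"raw: {raw_skew:.2f}, log: {log_skew:.2f}")
```

For this toy feature the log step reduces the skewness but does not remove it, which is the situation in which the median would still be the safer choice.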

Kim-Anh


Hi,
I have gone through the article you suggested (https://doi.org/10.1038/s41467-019-08794-x). You mention “indexing”, and I wanted to clarify whether “indexing” is what is implemented as “multilevel” in the sPLS/sPLS-DA functions.
Also, in the Nature Communications article you mention that this indexing is analogous to normalizing samples with respect to their baseline. If this is not the same as adding “multilevel”, can I calculate this indexing for samples from regression model (baseline & 1 yr) coefficients and the median at baseline? (I have a non-crossover design.)
The reason I’m confused about using “multilevel” is that you changed the initial question to “non-repeated”, and in my case every participant has measurements at pre and post treatment…
Thanks a lot for your time

Hi @Jan91,
I believe your case does not really apply to a multilevel case (cross-over design, hence my original change in the topic). I have updated my comment above regarding the multilevel approach. I feel the indexing (as you also propose) would be appropriate in your case, even though it reduces the number of samples you have. The multilevel approach does not, but you should test it, as described in my updated post, to see whether it is beneficial.
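
To make the sample-count difference concrete, here is a rough sketch of a one-factor within-subject decomposition in plain Python: each participant's mean profile is subtracted from that participant's rows, so every sample is kept. This is only an approximation of the idea behind the within matrix and is not the withinVariation() code; the function name within_variation and the toy data are hypothetical.

```python
def within_variation(rows, subject_ids):
    """Subtract each subject's mean profile from that subject's rows."""
    sums, counts = {}, {}
    for row, sid in zip(rows, subject_ids):
        acc = sums.setdefault(sid, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[sid] = counts.get(sid, 0) + 1
    return [
        [v - sums[sid][j] / counts[sid] for j, v in enumerate(row)]
        for row, sid in zip(rows, subject_ids)
    ]

# Hypothetical data: two subjects, each measured pre and post.
rows = [[2.0, 10.0], [4.0, 14.0], [20.0, 1.0], [22.0, 3.0]]
subject_ids = ["s1", "s1", "s2", "s2"]

x_within = within_variation(rows, subject_ids)
print(x_within)
```

The output still has 4 rows (one per sample), whereas a post-minus-pre delta on the same data would have 2 rows (one per participant).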
Hope that helps.

Kim-Anh
