sPLS explained variance and variable selection

Hi all,

I have recently discovered mixOmics and I have experimented with it, but I have some questions regarding my results.

My study design includes 2 groups of different individuals, and for each individual I have 2 time points, so I am using the multilevel design.
I have a metabolomics dataset (700x250) and a proteomics dataset (700x1400). I am applying sPLS to find correlations between metabolites and proteins, and I am using the canonical mode so that both datasets are treated symmetrically.
Based on the Q2 score, I proceeded with 2 components. During the tuning procedure, only 5 variables (the lowest I tested) were chosen from each dataset for each component (so, 10 metabolites and 10 proteins overall).
The explained variance in the metabolomics dataset is 0.22 and 0.26 for the 2 components, whereas for the proteomics it’s 0.01 and 0.02.

  1. I find this a bit weird. I have hundreds of variables, and it chooses only 10, the lowest I tested, while it captures so little variance for the proteins.
  2. Since the explained variance is so small but also so different between the datasets, does it make sense to proceed?

In the cim plot, for each protein in each component, the correlation value with all the selected 5 metabolites is (almost) identical. This results in a heatmap where the whole column (protein) has the same value.
These 5 metabolites are also almost the same regarding their biological function. So, overall, this doesn’t leave me much room for biological interpretation, but on the other hand I guess it makes sense (since they’re so similar, you expect them to be similarly correlated with the same proteins).
  3. However, I’m wondering if I can somehow work around this and make it select only 1 of these highly correlated metabolites, with the hope of uncovering more metabolite-protein correlations.

As a different analysis, I also tested the same metabolites with a microbiome dataset (700x700).
The explained variance in the microbiome dataset is similar to that of the proteomics dataset (very small). Overall, the tuning chose 1 component (which I changed to 2), 10 metabolites and 30 microbes.
In the heatmap we observe 4 major correlations and the rest are very low. The pattern is the same (the same value in the whole column). But what troubled me the most is that the selected metabolites in the first component are exactly the same as in the previous analysis, and on the second component also very similar.

  4. So this makes me wonder: Does the variable selection take place in each dataset separately? And then we test the correlations of these variables between the datasets? From what I have understood, this is not what’s happening. So, did I get so “lucky” that in both analyses the bigger correlations involved the same metabolites?

I hope all this makes sense, and sorry for the long post! :)
Thank you,

Hi @cemma,

Welcome and thanks for the thoroughness in your description. That helps.

The tuning in sPLS canonical mode is tricky, and I am not sure it really produces the optimal answer. I would treat the tuned values as a rough guide for keepX / keepY; if the selection size is too small (as here, where it picked the lowest value you tested), I would increase it, slightly or substantially as needed. At this stage you are mining your data, not making strong claims that only those 5 metabolites should be looked at.

Another point to consider is whether you really need the multilevel decomposition (see the Multilevel page on the mixOmics website), as it acts as a sort of normalisation and changes the data, perhaps not in a favourable way in your case.
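
To give a feel for how the multilevel step changes the data, here is a minimal sketch in plain Python (not mixOmics code). It assumes the within-subject part is obtained by subtracting each individual's own mean from their samples, which is the usual one-factor decomposition; the function name `within_subject_part` and the toy data are made up for illustration.

```python
from collections import defaultdict

def within_subject_part(rows, subject_ids):
    """Subtract each subject's per-variable mean from that subject's rows.

    Assumption: this mirrors a one-factor within-subject decomposition;
    it is an illustration, not the mixOmics implementation.
    """
    sums = {}
    counts = defaultdict(int)
    for row, sid in zip(rows, subject_ids):
        if sid not in sums:
            sums[sid] = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[sid][j] += value
        counts[sid] += 1
    means = {sid: [s / counts[sid] for s in sums[sid]] for sid in sums}
    return [
        [value - mean for value, mean in zip(row, means[sid])]
        for row, sid in zip(rows, subject_ids)
    ]

# Toy data: 2 individuals x 2 time points, 2 variables (hypothetical values).
X = [[10.0, 1.0], [12.0, 3.0], [100.0, 5.0], [104.0, 5.0]]
subjects = ["A", "A", "B", "B"]
Xw = within_subject_part(X, subjects)
```

With only 2 time points per individual, each individual's two rows become mirror images of each other (plus and minus half the time-point difference), and any difference in overall level between individuals is removed. Whether that helps or hurts depends on where the signal of interest lies.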

To answer your last question: the tuning is done across the datasets in combination, not in each dataset separately, so getting the same metabolites in both analyses is just ‘chance’ (and a consequence of the limitations I stated above). There also seems to be a high level of collinearity (i.e. some microorganisms or metabolites have the exact same value). Also make sure you have CLR-transformed your microbiome data (details on the website → mixMC).
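
For reference, the centered log-ratio (CLR) transform of one sample is the log of each value minus the mean of the logs in that sample. Below is a minimal plain-Python sketch (again, not mixOmics code); the `offset` pseudo-count used to avoid taking the log of zero is an assumption for illustration, and the counts are made up.

```python
import math

def clr(counts, offset=1.0):
    """Centered log-ratio transform of a single sample.

    `offset` is a pseudo-count added before taking logs so that zero
    counts are defined; its value here is an illustrative assumption.
    """
    logs = [math.log(c + offset) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# One hypothetical microbiome sample (raw counts for 4 taxa).
sample = [0, 9, 99, 999]
transformed = clr(sample)
```

The transformed values of a sample always sum to zero, so each value is read relative to the other taxa in that sample rather than as an absolute abundance.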


Thank you for taking the time to attend to this, it was helpful!