Should the number of variables selected in an sPLS-DA be similar across datasets?

In my case, I am analyzing microbiota and metabolomics data: 37 OTUs, 600 plasma metabolites and 600 stool metabolites. For example, in one model the variable selection gives me 10;10, 15;15 and 40;40, respectively. My concern is: if more variables are selected from one dataset than from another, won’t that influence the results when I correlate them? Am I not assigning more weight to one dataset?

I’m a little confused by the question. What methods are you using and what are you trying to achieve with them? Are you using multiblock (N-integration) or single-omics frameworks?

Also, what do you mean by correlating them? Are you referring to analysing the correlation of the components across multiple models?

Max, thank you very much for your response. I apologize for the vagueness of the question.

I am trying to run an sPLS-DA (N-integration) to see how the X blocks (microbiome and metabolome) relate to the two groups of rats I have (the Y variable): transgenic and wild-type. I want to find which OTUs and which metabolites discriminate the transgenic rats, and I also want to find correlations between the microbiome and metabolome variables.

I had been told that if I choose, for example, 10 OTUs from the microbiome, 15 metabolites from plasma and 30 metabolites from feces to run my model, I would be giving more weight to the dataset with the most selected variables. I want to know if this is true, because when I tune the model, the number of variables to select is never the same across datasets, and the tuned models nevertheless have a lower BER and good power to discriminate the Y variable.

My understanding would suggest that this “extra weight” is irrelevant. Let’s consider an example:

  • A model selects 2 features from your microbiome data, each of which has a very high loading value.
  • It also selects 20 metabolite features, each of which has a very low loading value.
  • If the summed weight of the microbiome features is greater than the summed weight of the metabolite features, then it could be argued that the microbiome dataset is being weighted more, despite having fewer features selected from it.
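
To put numbers on that example, here is a minimal Python sketch. The loading values are invented purely for illustration and are not output from any fitted model; `summed_weight` is a hypothetical helper, and "weight" is taken here to mean the sum of absolute loadings, which is only one possible reading.

```python
# Hypothetical loading values, made up for illustration only.
microbiome_loadings = [0.75, -0.65]        # 2 selected OTUs, large loadings
metabolite_loadings = [0.05, -0.04] * 10   # 20 selected metabolites, small loadings


def summed_weight(loadings):
    """Sum of absolute loading values for one dataset's selected features."""
    return sum(abs(value) for value in loadings)


microbiome_total = summed_weight(microbiome_loadings)
metabolite_total = summed_weight(metabolite_loadings)

print(f"microbiome:  {len(microbiome_loadings)} features, "
      f"summed |loading| = {microbiome_total:.2f}")
print(f"metabolites: {len(metabolite_loadings)} features, "
      f"summed |loading| = {metabolite_total:.2f}")
```

With these made-up values, the dataset with fewer selected features ends up with the larger summed weight, so the count of selected features alone says little about which dataset contributes more.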

In reality, I don’t see a way in which one of the datasets is “weighted more” to a detrimental effect. As you’ve identified, minimising metrics (e.g. BER) is the empirical way to judge a model.
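
For reference, the balanced error rate (BER) is the mean of the per-class error rates, so it is not dominated by the larger group. A small Python sketch of that calculation follows; the labels and predictions are hypothetical, and `balanced_error_rate` is an illustrative helper, not a function from any package.

```python
from collections import Counter


def balanced_error_rate(true_labels, predicted_labels):
    """Mean of the per-class error rates."""
    totals = Counter(true_labels)
    errors = Counter(
        truth
        for truth, predicted in zip(true_labels, predicted_labels)
        if truth != predicted
    )
    # Counter returns 0 for classes with no misclassified samples.
    return sum(errors[label] / totals[label] for label in totals) / len(totals)


# Hypothetical example: 6 transgenic (TG) and 4 wild-type (WT) rats.
true_labels = ["TG"] * 6 + ["WT"] * 4
predicted_labels = ["TG"] * 5 + ["WT"] + ["WT"] * 3 + ["TG"]

ber = balanced_error_rate(true_labels, predicted_labels)
print(f"BER = {ber:.3f}")
```

Here one of six TG samples and one of four WT samples are misclassified, so the BER is the average of 1/6 and 1/4.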

I understand. Your answer was very clear. I agree with your point; the question arose because a bioinformatician raised it with me and it made me doubt.
Thank you very much!

Always best to clarify one’s understanding! Also keeps my brain active. Let me know if you have any other questions