Should the number of variables selected in a sPLS-DA be similar across datasets?

Lorengol · September 18, 2022, 4:32am

In my case, I am analyzing microbiota and metabolomics data: 37 OTUs, 600 plasma metabolites and 600 stool metabolites. For example, in one model the variable selection gives me 10;10, 15;15 and 40;40 respectively. My concern is this: if the number of variables selected for one dataset is higher than for another, won’t that influence the result when I correlate them? Am I not assigning more weight to one dataset?

MaxBladen · September 19, 2022, 11:54pm

I’m a little confused by the question. What methods are you using and what are you trying to achieve with them? Are you using multiblock (N-integration) or single-omics frameworks?

Also, what do you mean by correlating them? Are you referring to analysing the correlation of the components across multiple models?

Lorengol · September 20, 2022, 11:36am

Max, thank you very much for your response. I apologize for the vagueness of the question.

I am trying to run a sPLS-DA (N-integration) to see how the X variables (microbiome and metabolome) relate to the two groups of rats I have (the Y variable): transgenic rats and wild-type rats. I want to find which OTUs and which metabolites discriminate the transgenic rats, and I also want to find correlations between the microbiome and metabolome variables.

Something I had been told is that if I choose, for example, 10 OTUs from the microbiome, 15 metabolites from plasma and 30 metabolites from feces to run my model, I would be giving more weight to the dataset that contributes the most variables. I want to know if this is true, because when I do the tuning, the number of variables to keep is never equal across datasets, and yet it results in models with a lower BER and good power to discriminate the Y variable.

MaxBladen · September 20, 2022, 10:06pm

My understanding would suggest that this “extra weight” is irrelevant. Let’s consider an example:

A model selects 2 features from your microbiome data, each of which has a very high loading value.
It also selects 20 metabolite features, each of which has a very low loading value.
If the summed weight of the microbiome features is greater than the summed weight of the metabolite features, then it could be argued that the microbiome dataset is being weighted more, despite having fewer features selected from it.
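
To make that example concrete, here is a minimal sketch in plain Python. The loading values are made up purely for illustration (they are not output from mixOmics or any real model), and `summarise` is a hypothetical helper, not a library function. It only shows that the count of selected features and their summed absolute loading can point in opposite directions.

```python
# Hypothetical loading values on one component (made up for illustration;
# zeros stand for features that were not selected).
microbiome_loadings = [0.9, -0.8, 0.0, 0.0, 0.0]             # 2 selected features
metabolite_loadings = [0.05, -0.04, 0.03, 0.05, -0.02] * 4   # 20 selected features


def summarise(loadings):
    """Return (number of selected features, summed absolute loading)."""
    selected = [w for w in loadings if w != 0.0]
    return len(selected), sum(abs(w) for w in selected)


n_micro, weight_micro = summarise(microbiome_loadings)
n_metab, weight_metab = summarise(metabolite_loadings)

print(f"microbiome:  {n_micro} features, summed |loading| = {weight_micro:.2f}")
print(f"metabolites: {n_metab} features, summed |loading| = {weight_metab:.2f}")
```

With these assumed numbers, the block with ten times fewer selected features still carries the larger summed weight.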

In reality, I don’t see a way in which one of the datasets is “weighted more” to a detrimental effect. As you’ve identified, minimising an error metric (e.g. BER) is the empirical way to judge a model.
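
For reference, the balanced error rate is the average of the per-class error rates, so a larger group cannot dominate the score. A minimal sketch, assuming made-up labels for the two rat groups (`balanced_error_rate` is a hypothetical helper written for this illustration, not the mixOmics implementation):

```python
from collections import defaultdict


def balanced_error_rate(y_true, y_pred):
    """Average of the per-class error rates."""
    totals = defaultdict(int)   # samples per true class
    errors = defaultdict(int)   # misclassified samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if pred != truth:
            errors[truth] += 1
    return sum(errors[c] / totals[c] for c in totals) / len(totals)


# Hypothetical labels: 6 transgenic (TG) and 4 wild-type (WT) rats.
y_true = ["TG"] * 6 + ["WT"] * 4
y_pred = ["TG", "TG", "TG", "TG", "TG", "WT", "WT", "WT", "WT", "TG"]

ber = balanced_error_rate(y_true, y_pred)
print(f"BER = {ber:.3f}")
```

Here one of six TG rats and one of four WT rats are misclassified, so the two class error rates (1/6 and 1/4) are averaged with equal weight.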

Lorengol · September 20, 2022, 10:36pm

Max,

I understand. Your answer was very clear. I agree with your point; the question arose because a bioinformatician raised it with me and made me doubt.
Thank you very much!

MaxBladen · September 20, 2022, 10:54pm

Always best to clarify one’s understanding! It also keeps my brain active. Let me know if you have any other questions.
