PLS or sPLS? What does Q2 mean?

Hi all,

I'm doing my master's thesis in metabolomics, where I want to find the relationship between metabolomics samples (normal diet and low-carb diet) and the increase in LDL.

It is known that LDL increases after following a ketogenic diet, but why it does is not known. My aim is to find the metabolites correlated with the change in LDL using PLS or sPLS, and to interpret those metabolites in terms of biological pathways.

Since I have 3515 metabolic features, do you recommend using sPLS or PLS? I think that sPLS, with its sparsity, will select the most important variables, so it would be easier to start the search in biological pathways?

As it is important to validate the models we use, I use the perf function with Mfold and 5 folds (I have 25 samples). I don't quite understand what Q2 and Q2.total mean. How is the criterion of Q2.total being under 0.0975 constructed, and how do we interpret it?

Can I use only Q2 or Q2.total, or do I also need to look at R2 and MSEP when choosing the number of components for a model?

Hi @Monica,

Regarding the choice of PLS / sPLS: sPLS will help you select a small subset of metabolites that explain LDL, so that is what I would recommend. But of course start by exploring your data (using sample plots in particular) for a first level of understanding.
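To give a feel for what "sparsity" does, here is a minimal sketch of soft-thresholding a loading vector so that only a few variables keep a non-zero weight. This is an illustration of the general idea only, not the mixOmics implementation; the function name `keep_top_loadings`, the argument `keep` (which plays a role similar to keepX) and the loading values are all made up for the example.

```python
def keep_top_loadings(loadings, keep):
    """Soft-threshold a loading vector so at most `keep` entries stay non-zero.

    Hypothetical helper for illustration; not part of any library.
    """
    if keep >= len(loadings):
        return list(loadings)
    # Threshold = largest absolute loading among those that must be dropped.
    magnitudes = sorted((abs(w) for w in loadings), reverse=True)
    lam = magnitudes[keep]
    sparse = []
    for w in loadings:
        if abs(w) > lam:
            # Shrink the surviving loading towards zero by lam, keeping its sign.
            sign = 1.0 if w > 0 else -1.0
            sparse.append(sign * (abs(w) - lam))
        else:
            sparse.append(0.0)
    return sparse


# Made-up loadings for five metabolites; keep the two strongest.
loadings = [0.9, -0.1, 0.4, 0.05, -0.6]
sparse_loadings = keep_top_loadings(loadings, keep=2)
selected = [i for i, w in enumerate(sparse_loadings) if w != 0.0]
```

Here `selected` holds the positions of the metabolites that survive, which is the short list you would then take to the pathway analysis.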

Cross-validation: with 25 samples, yes, you can use M = 5 folds, with several repeats (at least 10).
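As a rough picture of what repeated M-fold cross-validation means for 25 samples, the sketch below only builds the train/test index sets (5 folds, 10 repeats). It is not what perf does internally; the function name `repeated_mfold_indices` and the fixed seed are assumptions made for the example.

```python
import random


def repeated_mfold_indices(n_samples, folds=5, nrepeat=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated M-fold CV.

    Hypothetical helper for illustration; not part of any library.
    """
    rng = random.Random(seed)
    for r in range(nrepeat):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random partition for every repeat
        for f in range(folds):
            test = sorted(idx[f::folds])  # every folds-th shuffled sample
            held_out = set(test)
            train = [i for i in range(n_samples) if i not in held_out]
            yield r, f, train, test


splits = list(repeated_mfold_indices(25, folds=5, nrepeat=10))
```

With 25 samples and 5 folds, each model is fitted on 20 samples and tested on the remaining 5, and every sample is held out exactly once per repeat. Repeating the partition is what stabilises the estimates with so few samples.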

Regarding the Q2 / Q2.total: these measures are used to understand the quality of fit as well as choose the number of components in the PLS/sPLS model.

Roughly, the Q2 is calculated per Y variable, indexed by k, as
Q2_k = 1 - PRESS_k / RSS_k,
whereas the Q2.total is calculated from the sums across all Y variables as
Q2.total = 1 - (sum_k PRESS_k) / (sum_k RSS_k)
So you can see the Q2.total as a sort of aggregation of the performance across all Y variables, used primarily to work out how many components are needed in the PLS model, whereas the Q2 tells you the quality of prediction for each Y variable individually as you add more components. In your case you have only one Y variable (LDL), so I think the results should be very similar.
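A small worked example of the two formulas may help. The PRESS and RSS values below are invented purely for the arithmetic, and the function names are hypothetical; this is not how you would extract the values from perf.

```python
def q2_per_variable(press, rss):
    """Q2 for each Y variable k: 1 - PRESS_k / RSS_k."""
    return [1.0 - p / r for p, r in zip(press, rss)]


def q2_total(press, rss):
    """Q2.total: 1 - (sum of PRESS_k) / (sum of RSS_k)."""
    return 1.0 - sum(press) / sum(rss)


# Made-up values for two Y variables on one component.
press = [8.0, 3.0]
rss = [10.0, 4.0]
q2_each = q2_per_variable(press, rss)  # roughly 0.2 and 0.25
q2_all = q2_total(press, rss)          # 1 - 11/14, roughly 0.214

# With a single Y variable the two quantities coincide.
q2_single = q2_per_variable([8.0], [10.0])[0]
q2_single_total = q2_total([8.0], [10.0])
```

Note that Q2.total is not the average of the per-variable Q2 values: variables with a larger RSS weigh more in the sums.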

The 0.0975 rule of thumb is interpreted as follows: we continue to add components as long as Q2.total >= 0.0975.
I attach the following details in case you need them, as I realise this is too difficult to explain without formulas.
[Attached image: Screen Shot 2020-03-30 at 11.41.49]
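On where the number comes from: 0.0975 equals 1 - 0.95^2. As far as I understand the usual formulation, PRESS is taken from the model with h components and RSS from the model with h - 1 components, so asking for Q2 >= 0.0975 is the same as asking that sqrt(PRESS_h) <= 0.95 * sqrt(RSS_{h-1}). The sketch below applies the stopping rule to a made-up sequence of Q2.total values; the function name `n_components_to_keep` is hypothetical.

```python
THRESHOLD = 1 - 0.95 ** 2  # approximately 0.0975


def n_components_to_keep(q2_total_by_comp, threshold=THRESHOLD):
    """Count leading components whose Q2.total stays at or above the threshold."""
    kept = 0
    for q2 in q2_total_by_comp:
        if q2 < threshold:
            break  # stop at the first component that falls below the rule
        kept += 1
    return kept


# Made-up Q2.total values for components 1 to 4.
q2_by_comp = [0.42, 0.15, 0.03, -0.10]
n_keep = n_components_to_keep(q2_by_comp)  # components 1 and 2 pass
```

A negative Q2.total, as for the fourth component here, simply means that the cross-validated prediction error is larger than the residual error it is compared against, so that component does not help prediction.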

We provide the calculation of the R2 and MSEP per Y variable (not aggregated), and you can look at those for more insight rather than looking only at the Q2. They reflect different measures of prediction.
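To make the difference between the two measures concrete, here is a minimal sketch on invented observed and cross-validated predicted values. MSEP is the mean squared error of prediction; for R2 the sketch assumes the squared Pearson correlation between observed and predicted values, which may not match exactly how perf defines it. The function names are hypothetical.

```python
def msep(observed, predicted):
    """Mean squared error of prediction."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def r2(observed, predicted):
    """Squared Pearson correlation (an assumed definition for this example)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Made-up LDL changes and their cross-validated predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
error = msep(observed, predicted)    # 0.25
agreement = r2(observed, predicted)  # 16/20, roughly 0.8
```

MSEP is on the squared scale of Y, so lower is better and its size depends on the units of LDL, whereas this R2 only measures how well the predictions track the observations and ignores any systematic offset.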

Note: I strongly believe all those measures (R2, Q2, MSEP) are not well adapted to a case with several Y variables. They were primarily developed for a single Y variable. In your case this is fine, as you have only one Y variable.

Kim-Anh