Q2.total negative in perf.pls

Hi,

I am performing PLS on my dataset. It says that I should keep the components whose Q2.total is greater than or equal to 0.0975. However, Q2.total is negative for all my components. How reliable is it to choose variables when Q2 is negative?

Thank you.

Hello,
Are you using a sparse PLS for variable selection or not at all?
A negative Q2.total means that the prediction error sum of squares, PRESS (computed during cross-validation), is higher than the residual sum of squares, RSS.
Q2.total = 1 - (PRESS/RSS) where both PRESS and RSS are calculated for each Y variable and added.

Indeed if you start negative this is not a good sign, and could be due to either a bad fit, or too many variables (in either X or Y).

Kim-Anh

Hello!

While integrating a proteomics and a peptidomics dataset, I encounter the same issue for both PLS and sPLS. The Q2.total values are either all negative or very low. From the previous response I see this is the result of a bad fit. I am wondering, though, what is the best way to approach this?

Hope someone can provide some help. Thanks in advance.

Best regards,

Pieter

hi @Pieter,

It all depends on the number of samples (low?) and whether there seems to be some agreement between the two data sets (plotArrow, plotIndiv?). It also depends on the biological question you are trying to answer and the type of analysis (are you looking for a combination of X variables that explains a combination of Y variables? Are you trying to figure out the number of components?).

Some of those questions do not necessarily need a Q2 index and can be solved with more exploratory outputs. I’ll see in the next few weeks if we can come up with a better criterion for PLS2 to choose the number of components and variables to select.

Kim-Anh

Hello,

This is an interesting question for me as well. I am trying to explain the metabolome of patients (Y) from clinical data (X) with classic PLS. I try to use Q2 as a measure of the quality of the resulting models, but it’s always low, usually negative. It would be interesting to explain the possible relationships in an exploratory fashion, and perhaps either plotArrow or plotIndiv would be the best option. But even so, how can I conclude that a model actually has value/significance? And how can plots aid me in finding the relationships?

There are examples on the website (most notably the liver toxicity tutorial), but I’m having a hard time interpreting these score plots with the goal of finding relationships in mind. The plotVar and cim options seem more straightforward in this respect, but how can I use them if the model they came from is not significant?

Looking forward to your thoughts on this,
Nick

hi @NickBliziotis

Keep in mind that those methods are exploratory so we cannot really talk about significance (let alone statistical significance, since we are not testing anything).

The Q2 in PLS2 is based on comparing the Prediction Error Sum of Squares, PRESS (computed on the test sets defined during the CV process), with the Residual Sum of Squares, RSS (calculated directly from the fitted data).

Each is summed over all the Y variables for a given component. You would like to see:
\sqrt{PRESS} < \sqrt{RSS}, or, if you want to build in some margin, \sqrt{PRESS} <= 0.95 \sqrt{RSS}.

After squaring and rearranging the terms (PRESS/RSS <= 0.95^2 = 0.9025), you come up with
Q^2 = 1 - PRESS/RSS >= 1 - 0.95^2 = 0.0975

So if your Q2 is negative, it means that the model is not good at predicting / generalising. It could be because your number of samples is too small for the CV process (even leave-one-out, loo, may give you an insufficient estimate); or, as you say, because X does not explain Y.

If the Q2 is low, but positive, it means you are still in the right ‘bandwidth’ because PRESS < RSS.
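
As a rough illustration of the arithmetic, here is a minimal Python sketch of Q2.total as described in this thread: PRESS and RSS are each summed over the Y variables, and the component is compared with the 0.0975 rule of thumb. This is not the mixOmics implementation; the function names and all the PRESS/RSS numbers are made up for the example.

```python
# Illustrative sketch only: names and numbers are hypothetical,
# not taken from mixOmics or from any dataset in this thread.

Q2_THRESHOLD = 1 - 0.95 ** 2  # rule-of-thumb cut-off, approximately 0.0975


def q2_total(press_per_y, rss_per_y):
    """Return 1 - sum(PRESS) / sum(RSS), with both summed over the Y variables."""
    total_rss = sum(rss_per_y)
    if total_rss <= 0:
        raise ValueError("RSS must sum to a positive value")
    return 1 - sum(press_per_y) / total_rss


def keep_component(q2, threshold=Q2_THRESHOLD):
    """Apply the rule of thumb: keep the component if Q2 reaches the threshold."""
    return q2 >= threshold


# Three hypothetical components, each with two Y variables.
examples = {
    "PRESS well below RSS": ([8.0, 9.0], [10.0, 10.0]),
    "PRESS slightly below RSS": ([9.5, 9.9], [10.0, 10.0]),
    "PRESS above RSS": ([12.0, 11.0], [10.0, 10.0]),
}

for label, (press, rss) in examples.items():
    q2 = q2_total(press, rss)
    print(f"{label}: Q2 = {q2:.3f}, keep = {keep_component(q2)}")
```

The second case is the "low but positive" situation: PRESS is still smaller than RSS, but not by enough to pass the 0.0975 cut-off.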

I like to look at plotIndiv() to work out whether the sample scores from X and Y are similar (or you could extract the X and Y variates, $variates$X and $variates$Y in the fitted object, and plot one against the other for each component). Similar information can be extracted from plotArrow().
Then, only if I see that some common information is being extracted, I look at plotVar() to figure out the correlation between specific subsets of variables.
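
If you prefer a number alongside the plot, one simple option is the Pearson correlation between the X and Y scores of the same component. The sketch below assumes the scores have been exported from R as plain lists; the function name and the score values are hypothetical. A high correlation here only describes the fitted data and says nothing about how well the model predicts new samples.

```python
import math


def pearson(x_scores, y_scores):
    """Pearson correlation between paired X and Y component scores."""
    if len(x_scores) != len(y_scores) or len(x_scores) < 2:
        raise ValueError("need two score vectors of equal length (at least 2 samples)")
    n = len(x_scores)
    mean_x = sum(x_scores) / n
    mean_y = sum(y_scores) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x_scores, y_scores))
    sxx = sum((a - mean_x) ** 2 for a in x_scores)
    syy = sum((b - mean_y) ** 2 for b in y_scores)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must not be constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical component-1 scores for five samples (made-up values).
x_comp1 = [-2.1, -0.8, 0.1, 1.0, 1.8]
y_comp1 = [-1.9, -1.0, 0.3, 0.7, 1.9]

r = pearson(x_comp1, y_comp1)
print(f"component 1: r = {r:.3f}")
```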

Considering a sparse model with sPLS could also help to filter out some variables. We are currently looking at a new criterion to tune sPLS, hopefully in the next mixOmics update.

Kim-Anh

Hello,

Good to have some clarification. I wonder a bit about Q2, and specifically the reason that RSS is used instead of TSS? According to the paper by Szymanska et al (2012), it should be the latter.

I tried plotting (and correlating) the X variates against the Y variates for each component, and together with plotArrow I’m getting a better feeling for what is happening in my data. To clarify, would a perfect straight line be the desired result?

There is a small issue I come across with plotVar, in that some variables in my X are removed due to the presence of missing values (according to the warning), but no missing values are actually present. What could the reason for this be, and how can it be mended?

I really appreciate your guidance so far!

Nick

hi @NickBliziotis

We use the definitions from the SIMCA-P PLS software that uses the RSS, but I’ll take a look at the TSS. The paper you mention seems to be focused on PLS-DA, not PLS though.

Plotting the X variates against the Y variates for each dimension should indeed give you a straight line, since you are trying to maximise the covariance between those components. Note however that if your Q2 < 0, then while you can extract common information between your 2 datasets, this is not necessarily generalisable to similar or validation experiments.

Are you sure there are no missing values in X or Y? This seems odd, please send us the warning message and your command line.

Kim-Anh

Hi Kim-Anh,

Thanks for this great package! I have a paired bacterial and metabolomics dataset (n = 25 samples) and I’m interested in identifying 1) which features are driving variation among my treatments and 2) if particular metabolites correlate with certain OTUs. I think that spls (canonical mode) is best for this analysis because I expect the metabolites and bacteria to be correlated, but they could equally drive one another.

Since there is no model fitting method for canonical spls right now, I started using the ‘regression mode’ but similar to the other posts in this thread, I am having an issue with model fit. All of my q2total values are less than 0.
For the aims of my study, should I be concerned about q2/model fit? (Previously you said that not all questions need a Q2 index.)
It would be very helpful if you could elaborate on your previous post to explain how we can identify what is driving the poor model fit (e.g. how few samples is too few?) and how to go about improving it. I have already confirmed that there is agreement between my 2 datasets (plotArrow, as well as Procrustes tests, suggest a strong correlation between them).

So, overall, when is it necessary to be concerned about q2.total/poor model fit?
Is there a way to troubleshoot model fit for the canonical spls?

Thanks for your help!
Tayler

hi @ulbrichtc,

We are in the process of implementing a Q2 version for the canonical mode in PLS, but that is going to take us another 2 months or so to have everything in place, and I doubt in your case it would add much, given the results you have on a PLS regression mode.

I can’t really tell what drives a poor model fit, except that the model does not generalise well to the test data during cross-validation. In the perf function, choose M = 3 folds or fewer, given that N = 25.

What your results may suggest is that sPLS is successful at highlighting common information between your two data sets, but it is particular to this study, and as soon as you remove some samples randomly (during the cross-validation stage), then the sPLS regression is not able to predict.

Prediction in your context is not what you are interested in, and so I would just choose a small number of dimensions, informed by sample plots and correlation circle plots and adopt a more exploratory approach rather than ‘set in stone’ parameters (which are limited in this context anyway).

I’ll be able to share more material in 2 months or so about this if you ping me back.

Kim-Anh

Hi Kim-Anh,
Thanks for your response! Looking forward to the added Q2 for the canonical PLS!
In the meantime, I’ll start with a more exploratory approach as you suggested.
Thanks!