# Q2.total negative in perf.pls

Hi,

I am performing PLS on my dataset. It says that I should keep the components whose Q2.total is at least 0.0975. However, the Q2.total is negative for all my components. How reliable is it to choose variables from components whose Q2 is negative?

Thank you.

Hello,
Are you using sparse PLS for variable selection, or no variable selection at all?
A negative Q2.total means that the prediction error sum of squares, PRESS (computed during cross-validation), is higher than the residual sum of squares, RSS.
Q2.total = 1 - (PRESS/RSS) where both PRESS and RSS are calculated for each Y variable and added.
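
To make the formula concrete, here is a minimal Python sketch of the arithmetic only. The function name `q2_total` and the numbers are made up for illustration; this is not the mixOmics implementation, and the inputs are assumed to be one PRESS and one RSS value per Y variable for a single component.

```python
def q2_total(press_per_y, rss_per_y):
    """Return 1 - sum(PRESS) / sum(RSS) for one component.

    Both arguments hold one value per Y variable. These are
    hypothetical inputs, not values extracted from mixOmics.
    """
    if len(press_per_y) != len(rss_per_y):
        raise ValueError("need one PRESS and one RSS value per Y variable")
    total_press = sum(press_per_y)
    total_rss = sum(rss_per_y)
    if total_rss <= 0:
        raise ValueError("total RSS must be positive")
    return 1.0 - total_press / total_rss


# Made-up numbers for two Y variables.
q2_bad = q2_total([12.0, 8.0], [10.0, 6.0])   # total PRESS 20 > total RSS 16
q2_good = q2_total([3.0, 5.0], [10.0, 6.0])   # total PRESS 8 < total RSS 16
print(q2_bad, q2_good)
```

In the first case the cross-validated error exceeds the fitted error, so the value comes out negative; in the second it is positive.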

Indeed, if Q2.total is already negative on the first component, this is not a good sign. It could be due to either a bad fit or too many variables (in either X or Y).

Kim-Anh

Hello!

While integrating a proteomics and a peptidomics dataset, I encounter the same issue for both PLS and sPLS: the Q2.total values are either all negative or very low. From the previous response I see this is the result of a bad fit. I am wondering, though, what the best way to deal with or approach this is?

Hope someone can provide some help. Thanks in advance.

Best regards,

Pieter

hi @Pieter,

It all depends on the number of samples (low?) and whether there seems to be some agreement between the two data sets (plotArrow, plotIndiv?). It also depends on the biological question / type of analysis you are trying to carry out (are you looking for a combination of X variables that explains a combination of Y variables? are you trying to figure out the number of components?).

Some of those questions do not necessarily need a Q2 index and can be solved with more exploratory outputs. I’ll see in the next few weeks if we can come up with a better criterion for PLS2 to choose the number of components and variables to select.

Kim-Anh

Hello,

This is an interesting question for me as well. I am trying to explain the metabolome of patients (Y) from clinical data (X) with classic PLS. I try to use Q2 as a measure of the quality of the resulting models, but it’s always low, usually negative. It would be interesting to explain the possible relationships in an exploratory fashion, and perhaps either plotArrow or plotIndiv would be the best option. But even so, how can I conclude that a model actually has value/significance? And how can plots aid me in finding the relationships?

There are examples on the website (most notably the liver toxicity tutorial), but I’m having a hard time interpreting these score plots with the goal of finding relationships in mind. The plotVar and cim options seem more straightforward in this respect, but how can I use them if the model they came from is not significant?

Looking forward to your thoughts on this,
Nick

Keep in mind that those methods are exploratory so we cannot really talk about significance (let alone statistical significance, since we are not testing anything).

The Q2 in PLS2 is defined by comparing the Prediction Error Sum of Squares, PRESS (calculated on the test sets defined during the CV process), with the Residual Sum of Squares, RSS (calculated directly from the fitted data).

Each is summed over all the Y variables for a given component. You would like to see:
\sqrt{PRESS} <= \sqrt{RSS}, or, if you want to allow some slack, \sqrt{PRESS} <= 0.95 * \sqrt{RSS}.

After squaring and rearranging the terms, you end up with
Q^2 = 1 - PRESS/RSS >= 1 - 0.95^2 = 0.0975
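
Written out step by step, the algebra behind the 0.0975 cut-off looks like this (only the arithmetic is shown, with 0.95 as the slack factor mentioned above):

```latex
\begin{align*}
\sqrt{\mathrm{PRESS}} &\le 0.95\,\sqrt{\mathrm{RSS}} \\
\mathrm{PRESS} &\le 0.95^{2}\,\mathrm{RSS} = 0.9025\,\mathrm{RSS} \\
\frac{\mathrm{PRESS}}{\mathrm{RSS}} &\le 0.9025 \\
Q^{2} = 1 - \frac{\mathrm{PRESS}}{\mathrm{RSS}} &\ge 1 - 0.9025 = 0.0975
\end{align*}
```

So a component passes the rule when its Q2 is at least 0.0975, and a negative Q2 corresponds to PRESS > RSS.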

So if your Q2 is negative, it means that the model is not good at predicting / generalising. It could be because your number of samples is too small for the CV process (even leave-one-out may give you an insufficient estimate); or, as you say, because X does not explain Y.

If the Q2 is low, but positive, it means you are still in the right ‘bandwidth’ because PRESS < RSS.

I like to look at plotIndiv() to work out whether the sample scores from X and Y are similar (or you could extract the $X$variates and $Y$variates and plot one against the other for each component). Similar information can be extracted from plotArrow().
Then, only if I see that some common information seems to be extracted, I look at plotVar() to figure out the correlation between specific subsets of variables.
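
If you prefer a number to go with the plot, the agreement between the X and Y scores on one component can be summarised with a Pearson correlation. Below is a minimal Python sketch, assuming you have already exported the two score vectors for that component; the `pearson` helper and the values are hypothetical and not taken from any mixOmics output.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant vector has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five samples on component 1 (not real data).
x_scores = [-2.1, -0.8, 0.1, 1.0, 1.8]
y_scores = [-1.9, -1.0, 0.3, 0.7, 1.9]
r_comp1 = pearson(x_scores, y_scores)
print(round(r_comp1, 3))
```

Keep in mind that this correlation is computed on the fitted scores, so it describes the data at hand and is not a cross-validated measure.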

Considering a sparse model with sPLS could also help to filter out some variables. We are currently looking at a new criterion to tune sPLS, hopefully in the next mixOmics update.

Kim-Anh

Hello,

Good to have some clarification. I wonder a bit about Q2, and specifically the reason that RSS is used instead of TSS? According to the paper by Szymanska et al (2012), it should be the latter.

I tried plotting (and correlating) X$variates against Y$variates for each component, and together with plotArrow I’m getting a better feeling for what is happening in my data. To clarify, would a perfect straight line be the desired result?

There is a small issue I come across with plotVar, in that some variables in my X are removed due to the presence of missing values (according to the warning), but no missing values are actually present. What could the reason for this be, and how can it be mended?

I really appreciate your guidance so far!

Nick

We use the definitions from the SIMCA-P PLS software that uses the RSS, but I’ll take a look at the TSS. The paper you mention seems to be focused on PLS-DA, not PLS though.

Plotting X$variates against Y$variates for each dimension should indeed give you a straight line, since you are trying to maximise the covariance between those components. Note, however, that if your Q2 < 0, then although you can extract common information between your two datasets, this does not mean that it generalises to similar or validation experiments.

Are you sure there are no missing values in X or Y? This seems odd, please send us the warning message and your command line.

Kim-Anh