Q2 choose number of components + error tune

Hi mixOmics,

I have a question about how to choose the number of components based on the Q2 values. Below you see the Q2.total plot I got from the perf function (including a metabolomics and a phosphoproteomics dataset). Based on the plot I thought to include 3 components, but I also read in another topic that components with negative Q2 values might not be predictive for other studies (although when I plot component 3 against component 1, it separates two groups in my data really well). Also, component 1 is above the 0.0975 line; is that a 'bad' thing? I read (in the same topic) that this has something to do with significance, but that it is not that important for exploratory analysis, am I right?
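For reference, my code looks roughly like this (a sketch, not my exact script; X and Y stand in for my metabolomics and phosphoproteomics matrices, and the perf settings may differ from my actual run):

```r
library(mixOmics)

# X = metabolomics matrix, Y = phosphoproteomics matrix (samples in rows)
pls.res <- pls(X, Y, ncomp = 5)

# repeated cross-validation to obtain Q2 per component
perf.res <- perf(pls.res, validation = "Mfold", folds = 5, nrepeat = 10)

# Q2.total plot; the dashed horizontal line is the 0.0975 threshold
plot(perf.res, criterion = "Q2.total")
```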

I also have a question regarding the tune function for sPLS analysis. When I try to run this function I get the error:

Error in solve.default(t(Pmat) %*% Wmat) :
Lapack routine dgesv: system is exactly singular: U[2,2] = 0

I read that it could be due to too many components in the tune function or too many zeroes in the data, but I already get this error with 1 component, and when I tune in DIABLO with three datasets (including these two), it works (so I do not think it is the zeroes). I thought it could also be due to high correlation between the datasets, is this true? And can I solve this, or would you recommend setting the numbers arbitrarily?
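This is roughly the call that throws the error (again a sketch; the keepX grid values are placeholders for what I actually tried):

```r
library(mixOmics)

# grid of numbers of variables to test on X
list.keepX <- c(5, 10, 15, 20)

# the error appears even with ncomp = 1
tune.res <- tune.spls(X, Y, ncomp = 1,
                      test.keepX = list.keepX,
                      validation = "Mfold", folds = 5, nrepeat = 10)
```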

Kind regards,
Lonneke

hi @lonnekenouwen,

Regarding the Q2: your plot seems to indicate that you need at least 1 component. And yes, for component 3 it is (slightly) negative, but that depends on your number of samples (we use cross-validation, and perhaps N is too small to assess the prediction here). It is great that you are evaluating the model with graphical plots, this is really the way to go in these complex problems, and as you say, for an exploration it is not such a big deal. Also, note that the Q2 here is based on the prediction accuracy of Y.hat from X, not (I think) the sample groups you are mentioning, so potentially you are looking at two different criteria here.

The bug you mention could be because the number of variables selected is too small (Pmat and Wmat are the loading matrices associated with X), which may add too many constraints to the model. Your choice of folds in the cross-validation could also be a reason. If you have not sorted it out, send us an email with reproducible code and we can look into it.

Depending on your analysis aims, you can also choose arbitrary keepX and keepY values.
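For example, a sketch of an sPLS model with arbitrarily chosen keepX / keepY (the numbers here are illustrative, not a recommendation):

```r
library(mixOmics)

# select 25 X-variables and 10 Y-variables on each of 3 components
spls.res <- spls(X, Y, ncomp = 3,
                 keepX = c(25, 25, 25),
                 keepY = c(10, 10, 10))

# sample plot on components 1 vs 3 to check the group separation you saw
plotIndiv(spls.res, comp = c(1, 3))
```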

Hope that helps, and I appreciate you are going through our (ever-growing) list of previous questions!

Kim-Anh