Inconsistent perf output when scale=FALSE

Dear mixOmics team,

We are interested in applying multilevel PLS to our dataset, and therefore started by using mixOmics to reproduce conventional PLS models previously obtained with SIMCA for this dataset.
After cross-validation using perf, we kept obtaining inconsistent results for R2 and Q2 (very different from the equivalent SIMCA output), and noticed that the value on the first component was particularly odd.

Our dataset is Pareto-scaled, and we therefore carry out the PLS analysis with the option scale=FALSE to preserve this scaling:
comet.pls <- pls(X, Y, ncomp = 6, mode = "regression", scale = FALSE)

followed by cross-validation with the perf function:
comet.pls.val <- perf(comet.pls, validation = "Mfold", folds = 7, nrepeat = 200, progressBar = TRUE)

We also observed that when using ‘scale=TRUE’ instead, results from mixOmics PLS and SIMCA (uv-scale mode) were in that case fully consistent.

Digging a bit into the perf output, I noticed that comet.pls.val[["RSS"]][1] looked wrong for Pareto-scaled data: it is always equal to N-1 (we have just one variable in Y), whether scale is TRUE or FALSE. This value is hard-coded in the mixOmics source code and seems to be the source of the problem, since this definition of RSS_0 is only valid when the data are scaled to unit variance.
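
To spell out why, here is a short derivation. It assumes the usual definition of Q2 (component h compared with the residual sum of squares of the previous component) and a mean-centred Y; the symbols N (number of samples), q (number of Y variables), s_j (standard deviation of the j-th Y variable after whatever scaling was applied) and \sigma_j (its standard deviation before scaling) are my own notation, not taken from the mixOmics code.

$$
Q^2_h = 1 - \frac{\mathrm{PRESS}_h}{\mathrm{RSS}_{h-1}},
\qquad
\mathrm{RSS}_0 = \sum_{i=1}^{N}\sum_{j=1}^{q} \left(y_{ij} - \bar{y}_j\right)^2 = (N-1)\sum_{j=1}^{q} s_j^2
$$

With unit-variance scaling, s_j^2 = 1 for every column, so RSS_0 = q(N-1), which is N-1 for a single Y variable. With Pareto scaling, each column is divided by \sqrt{\sigma_j}, so s_j^2 = \sigma_j and RSS_0 = (N-1)\sum_j \sigma_j, which in general differs from N-1.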

In order to fix this for our Pareto-scaled dataset, I inserted the following lines in our script to manually fix comet.pls.val[["RSS"]][1], and recalculate Q2:

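# RSS_0 computed from the Y actually used in the model, instead of the hard-coded N-1
RSS0 <- sum(comet.pls$Y^2, na.rm = TRUE)
comet.pls.val[["RSS"]] <- replace(comet.pls.val[["RSS"]], 1, RSS0)

# Recalculate Q2 for each component from the corrected RSS
PRESS.inside <- Q2 <- matrix(nrow = comet.pls$ncomp, ncol = 1)
for (h in 1:comet.pls$ncomp) {
  # PRESS: squared Euclidean norm of each column of press.mat for component h
  PRESS.inside[h, ] <- apply(comet.pls.val$press.mat[[h]], 2, function(x) { norm(x, type = "2")^2 })
  # Row h of RSS holds RSS_(h-1), since row 1 is RSS_0
  Q2[h, ] <- 1 - PRESS.inside[h, ] / comet.pls.val$RSS[h, ]
}
Q2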

Using this, we obtained results fully consistent with the output of SIMCA. The same workaround also fixes the problem for unscaled (centered-only) datasets.

Note that when using Pareto-scaled datasets, one has to make sure that the proper scaling is applied to both X and Y before they are passed to the PLS (it took us a while to get the RSS_0 patch working, because our Y input was initially not properly scaled).
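
For anyone unsure what we mean by proper scaling, here is a minimal sketch of Pareto scaling as I understand it: mean-centre each column, then divide it by the square root of its standard deviation. Please treat it as an illustration only: pareto_scale is a hypothetical helper (not a mixOmics function), and X and Y are simulated here rather than taken from our data.

```r
# Hypothetical helper (not part of mixOmics): Pareto-scale the columns of a matrix.
pareto_scale <- function(M) {
  M <- as.matrix(M)
  # Mean-centre each column
  centred <- sweep(M, 2, colMeans(M, na.rm = TRUE), "-")
  # Divide each column by the square root of its standard deviation
  sweep(centred, 2, sqrt(apply(M, 2, sd, na.rm = TRUE)), "/")
}

set.seed(1)
X <- matrix(rnorm(20 * 5, sd = 3), nrow = 20)  # simulated predictors
Y <- matrix(rnorm(20, sd = 4), ncol = 1)       # simulated single response

X.par <- pareto_scale(X)
Y.par <- pareto_scale(Y)

# RSS_0 computed from the scaled Y, next to the N-1 that unit-variance scaling would give
RSS0 <- sum(Y.par^2, na.rm = TRUE)
c(RSS0 = RSS0, N_minus_1 = nrow(Y.par) - 1)
```

The scaled matrices would then be passed to pls with scale = FALSE, as in the call at the top of this post.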

We could then go on successfully with multilevel PLS, using the same workaround to calculate Q2!

I hope this helps; I have read a few posts in this forum that may be related to this issue.

Best regards,
Bénédicte Elena-Herrmann.

Thank you @belena for the contribution!
I’ll put this as an issue in our GitHub repo to fix or amend this. Note that yes, we do assume that Y variables are entered and scaled in all PLS objects, and we usually assume there is more than one variable in Y.

Kim-Anh