Inconsistent perf output when scale=FALSE

belena · July 29, 2020, 10:28am

Dear mixOmics team,

We are interested in applying multilevel PLS to our dataset, so we first tried to reproduce in mixOmics the conventional PLS models we had previously obtained with SIMCA for this dataset.
After cross-validation with perf, we kept getting R2 and Q2 values that were very different from the equivalent SIMCA output, and noticed that the value for the first component was particularly odd.

Our dataset is Pareto-scaled, so we carry out the PLS analysis with scale=FALSE to preserve this scaling:
comet.pls <- pls(X, Y, ncomp = 6, mode = "regression", scale = FALSE)

followed by cross-validation with the perf function:
comet.pls.val <- perf(comet.pls, validation = "Mfold", folds = 7, nrepeat = 200, progressBar = TRUE)

We also observed that with scale=TRUE instead, the results from mixOmics PLS and SIMCA (UV-scaling mode) were fully consistent.

Digging a bit into the perf output, I noticed that comet.pls.val[["RSS"]][1] looked wrong for Pareto-scaled data: it was equal to N-1 (we have just one variable in Y) whether scale was TRUE or FALSE. This value is hard-coded in the mixOmics source, and it seems to be the source of the problem, because this definition of RSS_0 is only valid when the data are scaled to unit variance.
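To illustrate why N-1 only matches RSS_0 under unit-variance scaling, here is a minimal base-R sketch on simulated data. It does not use mixOmics, and the variable names are made up for this example; it only compares the sum of squares of a single Y variable under the two scalings.

```r
# Simulated single Y variable (illustrative data, not from the thread)
set.seed(1)
n <- 30
y <- rnorm(n, mean = 5, sd = 4)

# Unit-variance scaling: centre, then divide by the standard deviation
y_uv <- (y - mean(y)) / sd(y)

# Pareto scaling: centre, then divide by the square root of the standard deviation
y_par <- (y - mean(y)) / sqrt(sd(y))

# RSS_0 is the sum of squares of the (scaled) Y
rss0_uv  <- sum(y_uv^2)   # equals n - 1
rss0_par <- sum(y_par^2)  # equals (n - 1) * sd(y), generally not n - 1

c(n_minus_1 = n - 1, rss0_uv = rss0_uv, rss0_par = rss0_par)
```

With unit-variance scaling the sum of squares is exactly N-1, while with Pareto scaling it is (N-1) times the original standard deviation, so a hard-coded N-1 would be off by that factor.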

In order to fix this for our Pareto-scaled dataset, I inserted the following lines in our script to manually correct comet.pls.val[["RSS"]][1] and recalculate Q2:

# RSS_0 computed from the Y stored in the model, instead of the hard-coded N-1
RSS0 <- sum(comet.pls$Y^2, na.rm = TRUE)
comet.pls.val[["RSS"]] <- replace(comet.pls.val[["RSS"]], 1, RSS0)

# Recompute PRESS and Q2 for each component (single Y variable, hence ncol = 1)
PRESS.inside <- Q2 <- matrix(nrow = comet.pls$ncomp, ncol = 1)
for (h in 1:comet.pls$ncomp) {
  PRESS.inside[h, ] <- apply(comet.pls.val$press.mat[[h]], 2, function(x) norm(x, type = "2")^2)
  Q2[h, ] <- 1 - PRESS.inside[h, ] / comet.pls.val$RSS[h, ]
}
Q2
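Written out, the snippet above computes the following for a single Y variable. This is only a restatement of the code, assuming that the entries $e_{i,h}$ of press.mat[[h]] are the cross-validated prediction residuals for component $h$, and that row $h$ of RSS holds $\mathrm{RSS}_{h-1}$ (so that the first row is $\mathrm{RSS}_0$):

```latex
\mathrm{RSS}_0 = \sum_{i=1}^{N} y_i^2,
\qquad
\mathrm{PRESS}_h = \sum_{i=1}^{N} e_{i,h}^2,
\qquad
Q^2_h = 1 - \frac{\mathrm{PRESS}_h}{\mathrm{RSS}_{h-1}},
\quad h = 1, \dots, \mathrm{ncomp}.
```

Only $Q^2_1$ depends directly on $\mathrm{RSS}_0$, which would be consistent with the first component being the one that looked odd.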

Using this, we obtained results fully consistent with the SIMCA output. The same fix also works for datasets that are not scaled (centered only).

Note that with Pareto-scaled datasets, one has to make sure that the proper scaling is applied to both X and Y before they are passed to PLS (it took us some trial and error to patch the RSS_0 problem, because our Y input was initially not properly scaled).
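For readers who want to apply the same scaling to both blocks, one possible sketch is below. The helper pareto_scale is hypothetical (it is not a mixOmics function), and the data are simulated purely for illustration; the scaled X and Y would then be passed to pls with scale = FALSE as above.

```r
# Hypothetical helper: centre each column, then divide by the square root
# of its standard deviation (Pareto scaling)
pareto_scale <- function(m) {
  m <- as.matrix(m)
  centered <- sweep(m, 2, colMeans(m, na.rm = TRUE), "-")
  sweep(centered, 2, sqrt(apply(m, 2, sd, na.rm = TRUE)), "/")
}

# Simulated raw data (illustrative only)
set.seed(42)
X_raw <- matrix(rnorm(20 * 3, mean = 10, sd = 2), nrow = 20)
Y_raw <- matrix(rnorm(20, mean = 3, sd = 5), ncol = 1)

# Apply the same scaling to both blocks before fitting
X <- pareto_scale(X_raw)
Y <- pareto_scale(Y_raw)
```

After this step each column is centred and has a standard deviation equal to the square root of its original standard deviation.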

We could then go on successfully with multilevel PLS, using the same workaround to calculate Q2!

I hope this helps. I read a few posts in this forum that may be related to this issue.

Best regards,
Bénédicte Elena-Herrmann.

kimanh.lecao · August 5, 2020, 10:50pm

Thank you @belena for the contribution!
I’ll put this as an issue in our GitHub repo to fix or amend this. Note that yes, we do assume that Y variables are entered and scaled in all PLS objects, and we usually assume there is more than one variable in Y.

Kim-Anh
