Interpretation of explained variance in mode="regression"

Hi mixOmics team and users,

I have a question about interpreting the proportion of explained variance when mode="regression". In this mode, the explained variance for Y summed over the PLS components exceeds 100%. That’s not the case when mode="classic".

Using the Linnerud dataset as an example:

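library(mixOmics)  # provides pls() and the linnerud dataset
data("linnerud")
X = scale(linnerud$physiological)
Y = scale(linnerud$exercise)
# X and Y are already scaled above, so scaling is switched off inside pls()
mod_mix_regression = mixOmics::pls(Y = Y, X = X, mode = "regression", scale = FALSE, ncomp = 3)
mod_mix_classic = mixOmics::pls(Y = Y, X = X, mode = "classic", scale = FALSE, ncomp = 3)
cumsum(mod_mix_classic$prop_expl_var$Y)
cumsum(mod_mix_regression$prop_expl_var$Y)  # goes over 100%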

My questions are:

  • I guess this comes from the way Y is deflated/normalized, but why was this choice made?
  • Is it still possible to interpret the proportion of explained variance? If so, how?

Best regards.

Hi @ggrignon,

You’re right that the summed explained variance exceeds 100% in regression mode for PLS but not in classic mode. In classic mode the components are orthogonal, so they do not explain any shared variance and the total explained variance cannot exceed 100%. In regression mode Y is not deflated across components, as each component tries to explain the original Y data. The components can therefore explain overlapping variance in the data, so the total explained variance can be >100%.

You can use the explained variance to compare the importance of individual PLS components in regression mode, but the cumulative sum of explained variance is not informative because the components can explain overlapping variance.
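
To make the overlap argument concrete, here is a minimal base R sketch. It is a toy illustration only, not how mixOmics computes `prop_expl_var`: the simulated data, the helper `expl_var()`, and the use of squared correlation as the "share of variance explained" are all assumptions made for this example.

```r
# Toy illustration in base R (not the mixOmics implementation).
set.seed(42)
n <- 200
y <- as.numeric(scale(rnorm(n)))  # centred, unit-variance response

# Hypothetical helper: share of the variance of y explained by each single
# component, measured as the squared correlation between y and that component.
expl_var <- function(y, comps) apply(comps, 2, function(comp) cor(y, comp)^2)

# Case 1: mutually orthogonal components (PCA scores are uncorrelated).
orth <- prcomp(matrix(rnorm(n * 3), n, 3))$x
sum_orth <- sum(expl_var(y, orth))

# Case 2: overlapping components, each one a noisy copy of y.
overlap <- sapply(1:3, function(i) y + 0.5 * rnorm(n))
sum_overlap <- sum(expl_var(y, overlap))

sum_orth     # cannot exceed 1: orthogonal components share no variance
sum_overlap  # exceeds 1 here: each component re-explains the same variance
```

Each overlapping component still explains at most 100% on its own, so the per-component values remain comparable; only their sum stops being meaningful.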

Hope that helps!
Eva

Hi Eva,

Thanks for the answer. What is the practical advantage of regression mode over classic mode? In which cases should I use regression mode rather than classic mode?

Thanks.

Hi @ggrignon,

The choice of mode depends on the aim of your analysis.

You should use regression mode when you want to predict an outcome from one dataset.
e.g. I have transcriptomics data from different tumour samples. Can I model tumour size based on transcriptomics data?

You should use canonical mode when you’re comparing two datasets of equal importance to find patterns they share across samples.
e.g. I have transcriptomics and proteomics data from different tumour samples. Do these datasets agree?
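
As a rough sketch of the two use cases, using the Linnerud data from earlier in the thread rather than omics data (this assumes mixOmics is installed; predicting on the training samples is only for illustration, and correlating the X and Y variates is just one possible way to look at agreement):

```r
library(mixOmics)  # assumes the mixOmics package is installed
data(linnerud)
X <- linnerud$physiological
Y <- linnerud$exercise

# Regression mode: X and Y play asymmetric roles; the aim is to predict Y from X.
fit_reg <- pls(X, Y, ncomp = 2, mode = "regression")
pred <- predict(fit_reg, newdata = X)  # training data reused purely for illustration
dim(pred$predict)  # samples x Y variables x components

# Canonical mode: X and Y play symmetric roles; the aim is to find shared patterns.
fit_can <- pls(X, Y, ncomp = 2, mode = "canonical")
diag(cor(fit_can$variates$X, fit_can$variates$Y))  # per-component agreement of the variates
```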

Cheers,
Eva