First, thank you to all who have contributed to this package and this forum - it has been an invaluable resource for me.
Second, my question. In short, I am wondering if someone might provide an explanation of the “Explained Variance” generated from a PLS or sPLS model. (I am currently working with a PLS model in regression mode.)
The manual states “This function calculates the variance explained by variates.” I am wondering how it is calculated? How ought it be interpreted? Importantly, I am wondering how it relates to Q^2? (E.g., in some models I run that have a negative Q^2, the Explained Variance is positive.)
Some example output provided, in case that is of any help. Here, let X be metabolomics1 and Y be metabolomics2. (And yes, sadly I am aware that my model is a poor fit…)
Thanks for using mixOmics and sharing your questions with us.
Briefly, the Explained Variance of a component is simply the component’s variance divided by the total variance in the data. Typically, but not always, components with very low explained variance tend to be less important.
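As a rough illustration of that ratio, here is a minimal sketch in plain Python (not R, and not the mixOmics code itself). It assumes the data are column-centered and that a component's scores are obtained by projecting the data onto a unit-norm loading vector; the names `explained_variance`, `X_columns` and `loading` are made up for this example.

```python
def center(column):
    """Subtract the column mean from every value."""
    mean = sum(column) / len(column)
    return [v - mean for v in column]


def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def explained_variance(columns, loading):
    """Variance of the component scores divided by the total variance.

    `columns` is a list of variables (one list per column of the data);
    `loading` holds one weight per variable and is assumed to have unit norm.
    """
    centered = [center(c) for c in columns]
    n_samples = len(centered[0])
    # Component scores: weighted sum of the centered variables, per sample.
    scores = [
        sum(w * col[i] for w, col in zip(loading, centered))
        for i in range(n_samples)
    ]
    total = sum(variance(c) for c in centered)
    return variance(scores) / total


# Toy data: two variables measured on four samples (illustrative only).
X_columns = [[-2.0, -1.0, 1.0, 2.0], [-1.0, 1.0, -1.0, 1.0]]
print(round(explained_variance(X_columns, [1.0, 0.0]), 3))  # 0.714
```

In this toy case the first variable carries 10/14 of the total variance, so a component aligned with it "explains" about 71% of the variance in that dataset.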
Q2 is a measure of model fit: how well the model fits the data when using cross-validation. A value of 1 means a complete fit, 0 represents a model that performs only as well as using the mean value of the training data as the prediction for the test data, and negative values indicate a rather poor model.
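To make the scale concrete, here is a minimal sketch in plain Python of a Q2-style score, assuming the form 1 - PRESS / RSS, where PRESS uses the cross-validated predictions and RSS uses the training-fold mean as the prediction. This follows the description above rather than the package's exact bookkeeping across components, and the function name `q2` and the toy values are hypothetical.

```python
def q2(observed, cv_predicted, baseline):
    """1 - PRESS / RSS for one response variable.

    `cv_predicted` holds the held-out predictions from cross-validation;
    `baseline` holds, for each sample, the mean of the training fold.
    """
    press = sum((y - p) ** 2 for y, p in zip(observed, cv_predicted))
    rss = sum((y - b) ** 2 for y, b in zip(observed, baseline))
    return 1.0 - press / rss


y = [1.0, 2.0, 3.0, 4.0]
# Leave-one-out baseline: mean of the other three samples.
loo_mean = [(sum(y) - v) / (len(y) - 1) for v in y]

print(q2(y, y, loo_mean))                               # perfect predictions: 1.0
print(q2(y, loo_mean, loo_mean))                        # no better than the mean: 0.0
print(round(q2(y, [4.0, 3.0, 2.0, 1.0], loo_mean), 2))  # worse than the mean: -1.25
```

The last case shows how Q2 can be negative: the held-out predictions are further from the truth than the training mean would have been. Nothing in this calculation involves the variance of the components, which is why a positive explained variance and a negative Q2 can coexist.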
Hope this helps. Please let us know if you need further clarification.
RE: Explained Variance. I think I can see how this would work for the X and Y components individually, but I’m unsure how the total variance would be calculated for the X-Y data in full. I am wondering if you could perhaps provide an example calculation or a citation? No worries if not!
RE: Q2. I believe I understand this. I.e., that the Q2 is obtained from cross-validation predictive performance, whereas the explained variance is “inherent” to the data and the components.
RE: Explained Variance. The pls functions report the explained variance for each component per dataset, that is, in the X and Y spaces separately. In the joint X-Y space, covariance is the relevant measure, and it is the criterion that the pls function maximises.
RE: Q2. That’s correct. Q2 is a measure of model fit, while Explained Variance indicates the amount of variability in each component for each dataset. Note that we are maximising the covariance between the X and Y components, so the amount of variance explained by a component does not directly indicate its relevance.
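A small sketch in plain Python of why variance and relevance can diverge. It uses two made-up X variables as stand-ins for candidate component directions (`x_wide` and `x_narrow` are hypothetical names) and a single response `y`; it only illustrates the variance-versus-covariance distinction and is not how the pls function searches for components.

```python
def covariance(a, b):
    """Sample covariance (n - 1 in the denominator)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


y = [-1.0, -1.0, 1.0, 1.0]
x_wide = [-3.0, 3.0, 3.0, -3.0]    # large spread, unrelated to y
x_narrow = [-1.0, -1.0, 1.0, 1.0]  # small spread, tracks y exactly

total_variance = covariance(x_wide, x_wide) + covariance(x_narrow, x_narrow)

for name, x in [("x_wide", x_wide), ("x_narrow", x_narrow)]:
    share = covariance(x, x) / total_variance  # share of X variance
    cov_with_y = covariance(x, y)              # association with Y
    print(name, round(share, 2), round(cov_with_y, 2))
```

Here `x_wide` accounts for 90% of the variance in X yet has zero covariance with `y`, while `x_narrow` accounts for only 10% but carries all of the association with `y`.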