Variance explained in PLS-DA in X and Y

I’m using mixOmics::splsda in R to make predictions. X is a 400 × 70 numerical matrix of the concentrations of 70 proteins in 400 samples, and Y is a binary label for the samples. I first tried the following:

re1 <- mixOmics::splsda(X, Y, ncomp = 70)

Some results:

  • re1$prop_expl_var$Y[1] is 1, and min(re1$prop_expl_var$Y) is 0.86. (prop_expl_var is the proportion of variance explained per component, after setting possible missing values in the data to zero.)
  • re1$prop_expl_var$X[1] is 0.019, and range(re1$prop_expl_var$X) is 0.004 to 0.14
  • range(re1$loadings$X) is -0.32 to 0.34

Can the following statements be made?

  1. X contains all the information in Y because the 1st PC already contains all the variance in Y.
  2. Variation in Y reflects the variation of all 70 proteins, rather than just a few, because
    • the minimum proportion of variance in Y explained by any of the 70 PCs is 0.86, still very large.
    • the coefficients between any protein and any PC are between -0.3 and 0.3; that is, no PC is dominated by only a few proteins.
  3. The large majority of the information in the proteins is not related to Y, because the proportion of variance in the proteins explained by each PC is small, ranging from 0.004 to 0.14.

Cross-validation suggests that only the 1st PC and two proteins should be used to build the model, but in the end the prediction accuracy is poor. Could anyone explain possible reasons for the poor performance?

Thanks!

Hi @blueskypie

  1. X contains all the information in Y because the 1st PC already contains all the variance in Y.
  2. Variation in Y reflects the variation of all 70 proteins, rather than just a few, because
  • the minimum proportion of variance in Y explained by any of the 70 PCs is 0.86, still very large.
  • the coefficients between any protein and any PC are between -0.3 and 0.3; that is, no PC is dominated by only a few proteins.

In the PLS-DA context, the variance explained in Y is not relevant; you want to look instead at what is happening in X (note: you don’t need that many components). The results in Y confirm that the Y components fit Y (so that’s not super interesting).

In X the variance explained is fairly low, meaning that the biological variation you expect to highlight based on the Y information is not very strong. Just remember that PLS-DA is not trying to maximise the variance of each component (as opposed to PCA); it tries to maximise the covariance between X and Y.
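To make that contrast concrete, here is a small sketch with simulated stand-in data (X.sim and Y.sim are made up, since I don’t have your matrix; pca() and splsda() are both from mixOmics, and in older package versions the prop_expl_var slot is named explained_variance):

```r
# Sketch with simulated stand-in data (your real X and Y would replace these)
library(mixOmics)
set.seed(1)
X.sim <- matrix(rnorm(400 * 70), nrow = 400, ncol = 70)  # 400 samples x 70 "proteins"
Y.sim <- factor(rep(c("A", "B"), each = 200))            # binary label

pca.res   <- pca(X.sim, ncomp = 1)           # component maximises variance in X
plsda.res <- splsda(X.sim, Y.sim, ncomp = 1) # component maximises covariance with Y

pca.res$prop_expl_var$X[1]    # the largest variance any single component can capture
plsda.res$prop_expl_var$X[1]  # can be much smaller: the component is steered towards Y
```

Because the PLS-DA component is pulled towards separating the Y groups rather than towards the directions of largest spread in X, a small prop_expl_var$X is entirely normal.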

  3. The large majority of the information in the proteins is not related to Y, because the proportion of variance in the proteins explained by each PC is small, ranging from 0.004 to 0.14.

Yes

Cross validation suggests only the 1st PC and two proteins should be used in building the model. But at the end the prediction accuracy is poor. Could anyone explain possible reasons for the poor performance?

As observed in point 3, the protein abundances, even in combination, are not able to discriminate your sample groups.

Kim-Anh

Thank you so much for the quick response, Kim! Wonder if I can ask more questions?

  1. Are the different components in X orthogonal to each other? And are the different components in Y orthogonal to each other? If so, how could min(re1$prop_expl_var$Y) be 0.86, i.e. very large, across all 70 components?
  2. Is my point #2 a reasonable conclusion?
  3. I still don’t understand why this example gives good prediction whereas mine does not.
> final.splsda$prop_expl_var$Y
    comp1     comp2     comp3 
0.2950213 0.3695560 0.3385408 
> final.splsda$prop_expl_var$X
     comp1      comp2      comp3 
0.05559064 0.06838866 0.05984694 
> range(final.splsda$loadings$X)
[1] -0.4458053  0.6492205
> range(final.splsda$loadings$Y)
[1] -0.8035564  0.9157799

Here final.splsda$prop_expl_var$X is also small; does it not indicate point #3 above?

Thank you again!

Hi @blueskypie

  1. Are the different components in X orthogonal to each other? And are the different components in Y orthogonal to each other? If so, how could min(re1$prop_expl_var$Y) be 0.86, i.e. very large, across all 70 components?

As I explained earlier, the explained variance in Y is not relevant in your case; I think this is why those results are confusing. The components in X are orthogonal, and the sum of their proportions of explained variance should tend to 1.
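You can check both claims directly on your own fit (a sketch, assuming re1 is the splsda object from your first post; variates is the slot holding the component scores):

```r
# Sketch, assuming re1 <- mixOmics::splsda(X, Y, ncomp = 70) from the first post
t.scores <- re1$variates$X       # X component scores, one column per component
round(crossprod(t.scores), 2)    # off-diagonal entries ~ 0: components are orthogonal
sum(re1$prop_expl_var$X)         # proportions accumulate towards 1 over the 70 components
```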

  2. Is my point #2 a reasonable conclusion?
    No, I don’t think so. See the answer to point 1 here and above.
  3. I still don’t understand why this example gives good prediction whereas mine does not.

I think you are treating explained variance and the ability to separate groups as equivalent; they give you some indication, but you need to dig deeper into the classification aspects. In that example the % of explained variance on the first sPLS-DA component is very low (10%), but the classification performance is quite high. Your explained variance in X (please focus only on X!) is very small, indicating the separation is probably non-existent. Use the perf() function to evaluate the classification performance to confirm this…

  4. Here final.splsda$prop_expl_var$X is also small; does it not indicate point #3 above?
    As above.
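A perf() call along these lines could look like the sketch below (the folds and nrepeat values are illustrative choices, not recommendations; it assumes a splsda fit on your X and Y with a small ncomp):

```r
# Sketch: cross-validated classification performance with perf()
library(mixOmics)
re1 <- splsda(X, Y, ncomp = 5)   # assumes your X and Y from the first post
set.seed(42)                     # for reproducible fold assignment
perf.re1 <- perf(re1, validation = "Mfold", folds = 5, nrepeat = 10)
perf.re1$error.rate              # overall and balanced (BER) error rates per component
plot(perf.re1)                   # error rate as components are added
```

The error rates, not the explained variance, are what tell you whether the model can actually predict the group labels.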

Kim-Anh

Yes, I’m confused. In that example, sum(final.splsda$prop_expl_var$X) is only ~18%, but the first three components gave perfect prediction. It seems the prediction accuracy depends on something else besides prop_expl_var$X; what is that something else?

hi @blueskypie,

Unfortunately I will have to be brief here, but the something else is the variance explained in X given the information from Y, which is not captured by these explained variance measures. We do not use these measures to assess the classification performance of sPLS-DA; we use the classification performance measures from the perf() function.

Kim-Anh