Help understanding high error rate using PLS-DA

Hi all,

I’m using the plsda function to do binary classification on a metabolomics dataset with 718 features and perfectly balanced sample size of n=25 per group.

library(mixOmics)
plsda.res <- plsda(data.sel, response, ncomp = 10)
perf.plsda <- perf(plsda.res, validation = "Mfold", folds = 5, progressBar = FALSE, auc = TRUE, nrepeat = 500)

Using plotIndiv, it looks like there is pretty good separation between the groups along components 1 and 2:

However, the error rates in the perf.plsda object are a lot higher than what I would expect based on the component plot. Something in the range of 40-46% error!

Is this behavior reasonable? I thought that if you could draw a line to separate groups on a projection plot, then the actual classification should be roughly similar in performance. Or am I doing something wrong?

TIA,
Fan

Hi Fan,
thank you for using mixOmics!
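What the performance output shows might be a case of overfitting. plotIndiv displays the same samples the model was fitted on, so the separation you see there is training performance. As soon as you use cross-validation, where each sample is predicted by a model that has not seen it, the PLS-DA model does not generalise well. A few tips to improve performance: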

  • consider using sparse PLS-DA to select only the most discriminant metabolites for your outcome. This means you need to tune the number of metabolites to select (we provide some examples of tuning sPLS-DA in our bookdown vignette)
  • also increase the number of repeats to at least 1000 for more accurate estimates when using perf on the splsda object.
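
As a rough sketch of those two steps, not the only way to do it: the code below tunes keepX with tune.splsda and then runs perf on the resulting splsda object. The keepX grid, ncomp = 2, the number of tuning repeats and the simulated stand-in data are all assumptions made so the example runs on its own; replace data.sel and response with your own objects and adjust the grid to your data.

```r
library(mixOmics)

# Stand-in data (pure noise, same dimensions as in the post) so the sketch
# runs on its own. Replace with your real data.sel and response.
set.seed(42)
data.sel <- matrix(rnorm(50 * 718), nrow = 50,
                   dimnames = list(paste0("s", 1:50), paste0("m", 1:718)))
response <- factor(rep(c("case", "control"), each = 25))

# Candidate numbers of metabolites to keep per component (illustrative grid).
keepX.grid <- c(5, 10, 20, 50, 100)

# Tune keepX by repeated 5-fold cross-validation on the balanced error rate.
tune.res <- tune.splsda(data.sel, response, ncomp = 2,
                        test.keepX = keepX.grid,
                        validation = "Mfold", folds = 5, nrepeat = 10,
                        dist = "max.dist", measure = "BER",
                        progressBar = FALSE)
keepX.opt <- tune.res$choice.keepX

# Fit the sparse model with the tuned keepX and assess it with more repeats.
nrep.final <- 1000  # lower this for a quick trial run
splsda.res <- splsda(data.sel, response, ncomp = 2, keepX = keepX.opt)
perf.splsda <- perf(splsda.res, validation = "Mfold", folds = 5,
                    nrepeat = nrep.final, progressBar = FALSE)

# Balanced error rate per component and prediction distance.
ber <- perf.splsda$error.rate$BER
print(ber)
```

Note that the error rate from perf after tuning on the same samples can still be optimistic, so treat it as a guide and not a final estimate.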

Let us know if that helps!

Kim-Anh

Hi Kim-Anh,

That makes sense. I didn’t think about the sample plots as showing training performance instead of cross-validation. I will try sPLS-DA!
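
In case it helps anyone who finds this thread later, here is a minimal sketch of the difference between the two. It uses simulated pure-noise data with the same dimensions as mine (an assumption for illustration, not my real data), fits plsda, and compares the error on the training samples with the cross-validated error from perf. With far more features than samples I would expect the training error to be typically much lower than the cross-validated one, even though there is no real signal.

```r
library(mixOmics)

# Simulated noise: 50 samples x 718 features, two arbitrary groups of 25.
set.seed(1)
X <- matrix(rnorm(50 * 718), nrow = 50,
            dimnames = list(paste0("s", 1:50), paste0("m", 1:718)))
Y <- factor(rep(c("A", "B"), each = 25))

fit <- plsda(X, Y, ncomp = 2)

# Training error: predict the same samples the model was fitted on.
pred.train <- predict(fit, X, dist = "max.dist")
train.err <- mean(pred.train$class$max.dist[, 2] != as.character(Y))

# Cross-validated error: each sample is predicted by a model that left it out.
cv <- perf(fit, validation = "Mfold", folds = 5, nrepeat = 10,
           progressBar = FALSE)
cv.err <- cv$error.rate$overall[2, "max.dist"]

print(c(training = train.err, cross.validated = cv.err))
```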

Best,
Fan