Help understanding high error rate using PLS-DA

Hi all,

I’m using the plsda function for binary classification on a metabolomics dataset with 718 features and a perfectly balanced design of n = 25 samples per group.

library(mixOmics)
plsda.res <- plsda(data.sel, response, ncomp = 10)
perf.plsda <- perf(plsda.res, validation = "Mfold", folds = 5,
                   progressBar = FALSE, auc = TRUE, nrepeat = 500)

Using plotIndiv, it looks like there is pretty good separation between the groups along components 1 and 2:
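(For reference, the sample plot came from a plotIndiv call along these lines; the exact plotting options are just illustrative:)

plotIndiv(plsda.res, comp = c(1, 2), group = response,
          ind.names = FALSE, ellipse = TRUE, legend = TRUE)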

However, the error rates in the perf.plsda object are a lot higher than what I would expect based on the component plot. Something in the range of 40-46% error!
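(The numbers I am looking at come from the error rate slot of the perf object:)

perf.plsda$error.rate   # overall and balanced error rates per component and distance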

Is this behavior reasonable? I thought that if you can draw a line separating the groups on a projection plot, then the cross-validated classification performance should be similarly good. Or am I doing something wrong?

TIA,
Fan

Hi Fan,
thank you for using mixOmics!
What the performance output shows might be a case of overfitting: on the training data it looks fine (plotIndiv), but as soon as you use cross-validation, the PLS-DA model does not generalise well. A few tips to improve performance:

  • consider using sparse PLS-DA to select only the most discriminant metabolites to explain your outcome. This means you need to tune the number of metabolites to select (we provide some examples in our bookdown vignette on tuning sPLS-DA; a minimal sketch is also shown after this list)
  • also increase the number of repeats to at least 1000 for a more accurate estimate when using perf on the splsda object.
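As a concrete starting point, here is a minimal sketch of that tuning workflow, reusing the object names from the code above (the candidate keepX grid and ncomp = 3 are assumptions to adjust for your own data):

# tune the number of metabolites kept per component (based on the balanced error rate)
list.keepX <- c(5, 10, 25, 50, 100)
tune.res <- tune.splsda(data.sel, response, ncomp = 3,
                        test.keepX = list.keepX,
                        validation = "Mfold", folds = 5,
                        dist = "max.dist", measure = "BER",
                        nrepeat = 50, progressBar = FALSE)

# refit with the selected keepX and re-assess with more repeats
splsda.res <- splsda(data.sel, response, ncomp = 3,
                     keepX = tune.res$choice.keepX)
perf.splsda <- perf(splsda.res, validation = "Mfold", folds = 5,
                    auc = TRUE, nrepeat = 1000, progressBar = FALSE)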

Let us know if that helps!

Kim-Anh

Hi Kim-Anh,

That makes sense. I hadn’t thought of the sample plots as showing training performance rather than cross-validated performance. I will try sPLS-DA!

Best,
Fan

I am just getting my feet wet in mixOmics - very pleased thus far. However, I am having some difficulty reconciling the plots with the performance estimates, as above.

My sPLSDA plot looks like there is decent separation between groups:
nc <- 2   # number of components to fit
splsda.res <- splsda(X, Y, keepX = rep(50, nc),
                     ncomp = nc, mode = "regression")

But the CV stinks!
perf.pls <- perf(splsda.res,
                 validation = "Mfold",
                 folds = 5, progressBar = FALSE,
                 auc = TRUE, nrepeat = 1000)

The output from this perf() function indicates that I have terrible error rates:

perf.pls$error.rate
$overall
       max.dist centroids.dist mahalanobis.dist
comp1   0.77115        0.73685          0.73685
comp2   0.67855        0.70130          0.68165

$BER
       max.dist centroids.dist mahalanobis.dist
comp1   0.77115        0.73685          0.73685
comp2   0.67855        0.70130          0.68165

I am also not seeing any Q^2 value in the object returned by perf().

The perf() output suggests that prediction is no better than random, despite the promising initial scores plot.

Any guidance/clarification is welcome, and thanks for the nice program!

hi @cbroeckl,

Yes, this can happen for the following reasons (as highlighted in the thread above):

The model is overfitting: it looks good on the whole data set, but performs poorly once its performance is estimated as unbiasedly as possible with cross-validation. Your number of samples may also be very small. Consider changing the number of folds in perf() (see the sketch below), although I doubt it will help much. Also consider tuning the number of variables (keepX) with the function tune() to see if the results improve.
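For instance, with very few samples one option is to switch the cross-validation scheme; a minimal sketch using leave-one-out validation on the splsda.res object from above (the object name perf.loo is just illustrative, and whether this helps will depend on your data):

# leave-one-out cross-validation instead of repeated 5-fold
perf.loo <- perf(splsda.res, validation = "loo",
                 progressBar = FALSE, auc = TRUE)
perf.loo$error.rate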

We do not provide a Q2 measure for PLS-DA objects, only for PLS (since Q2 is really designed for a regression framework - our point of view anyway).

If all of these fail, I would adopt an exploratory/descriptive interpretation of the analysis. You could also look at the further outputs of perf() to work out which class is being misclassified (e.g. DS/TA pregnant women) and rethink how to (re)define your sample groups; see the sketch below.
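The per-class breakdown is stored in the perf() output; a minimal sketch, assuming the perf.pls object from earlier in the thread:

# error rate per class, for each component and prediction distance
perf.pls$error.rate.class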

Kim-Anh

Dear Kim-Anh,

I also frequently see this result when applying MixOmics to my data. I am interested in what you said about adopting exploratory/descriptive interpretation. Could you give an example of that?

Thanks in advance.

Mikhael

hi @mdmanurung,

For an exploratory analysis, you would suggest new hypotheses to be investigated further if you had a larger number of samples. Hence, you would describe the different plots but veer away from any numerical (quantitative) results and keep the tone of the conclusions cautious. You would also focus on the biological interpretation of the selected features - see whether they make sense from a biological perspective in relation to the biological system you are studying.

I don't have a specific example in mind, but perhaps some of the articles citing the mixOmics paper do this.

Kim-Anh