I applied PLS-DA to my data using 10 components and then assessed performance with the perf() function, using 5-fold cross-validation repeated 10 times as a first trial. The performance plot is shown below.
My question is why the overall error rate and the BER differ so much, especially when measured with max.dist. The overall error rate seems to indicate 10 as the optimal number of components, while the BER from centroids.dist and mahalanobis.dist seems to indicate 4. Which number should I choose for my PLS-DA model? Also, can I set the distance parameter in the plsda() function?
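For reference, a minimal sketch of what I ran (X stands for my data matrix and Y for the factor of class labels):

```r
library(mixOmics)

## fit the PLS-DA model with 10 components
plsda.res <- plsda(X, Y, ncomp = 10)

## assess performance: 5-fold cross-validation, repeated 10 times
perf.res <- perf(plsda.res, validation = "Mfold", folds = 5,
                 nrepeat = 10, progressBar = FALSE)

## the performance plot above comes from
plot(perf.res)
```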
@w.zeng
The fact that BER >> overall error rate suggests you have a heavily unbalanced dataset in terms of samples per class, so we advise you to stick to the BER (the BER averages the error rate of each class, so it is not dominated by a well-predicted majority class).
Each distance gives a different prediction (see the supplemental material in https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005752). The large difference for max.dist arises because it seems to ignore the samples in the minority class for prediction.
The prediction distance is only an input to functions that predict, i.e. tune() and predict(). You do not need it as an input to plsda(), which just fits the model.
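For illustration, a small sketch of where the distance does and does not appear, reusing the hypothetical plsda.res / perf.res objects from the post above (X.test is a hypothetical held-out data matrix):

```r
## plsda() only fits the model: it takes no distance argument
plsda.res <- plsda(X, Y, ncomp = 10)

## functions that predict take dist, e.g. predict() on new samples
pred <- predict(plsda.res, newdata = X.test, dist = "centroids.dist")
pred$class$centroids.dist    # predicted class per sample, per component

## perf() predicts internally across all distances by default, so the
## resulting error rates can be compared side by side:
perf.res$error.rate$overall  # rows: components, columns: distances
perf.res$error.rate$BER
```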
Hi Kim-Anh, thanks for the kind reply. It’s so helpful! Yes, I have a heavily unbalanced dataset and agree with you that I should use the BER instead of the overall error rate.
For the distances, I read the interpretations of the different distances in the article you linked. Given that interpretation and the fact that max.dist may ignore the samples in the minority class, would the centroids distance or the Mahalanobis distance give me more accurate predictions (I think my samples can be considered in a multi-dimensional space)? And should I use one of them to determine the number of components?
The different distances allow for linear / non-linear separation problems. I don’t think max.dist would disregard minority classes per se, but by all means choose the distance that minimises the BER. And yes, you can choose ncomp based on the chosen distance as well (and reuse that distance for other prediction functions, such as tune()).
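A sketch of how that might look, reusing the hypothetical perf.res from above (the test.keepX grid is made up and only relevant if moving on to sPLS-DA with tune.splsda()):

```r
## perf() stores the optimal ncomp for each measure x distance pair
perf.res$choice.ncomp
ncomp.opt <- perf.res$choice.ncomp["BER", "centroids.dist"]

## reuse the same measure and distance when tuning downstream,
## e.g. selecting keepX for a sparse PLS-DA model
tune.res <- tune.splsda(X, Y, ncomp = ncomp.opt,
                        test.keepX = c(5, 10, 25, 50),
                        validation = "Mfold", folds = 5, nrepeat = 10,
                        dist = "centroids.dist", measure = "BER")
```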