Balanced error rate vs. overall error rate

w.zeng · November 28, 2020, 8:32pm

Hello,

I applied PLS-DA to my data using 10 components and then assessed the performance using the perf() function with 5-fold cross-validation repeated 10 times as a trial. The plot of performance is shown below.

Why do the overall error rate and the BER differ so much, especially when they are measured with max.dist? The overall error rate seems to indicate that 10 is the optimal number of components, while the BER from centroids.dist and mahalanobis.dist seems to indicate 4. Which number of components should I choose for my PLS-DA model? Also, can I set the distance parameter in the plsda() function?

Thanks!
Wenjie

kimanh.lecao · December 3, 2020, 3:26am

@w.zeng
The fact that BER >> overall error rate suggests you have a heavily unbalanced data set in terms of samples per class, so we advise you to stick to the BER.
Each distance gives a different prediction (see the supplemental material in https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005752). The large difference with max.dist arises because it seems to ignore the samples in the minority class for prediction.
The prediction distance is only an input to functions that predict, such as perf(), tune() and predict(). You do not need it as an input for plsda(), which just fits the model.
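
To make the gap between the two metrics concrete, here is a minimal sketch in Python (plain arithmetic, not mixOmics code). It assumes the usual definition of BER as the mean of the per-class error rates, and it uses a hypothetical data set of 90 samples in class "A" and 10 in class "B" with a classifier that always predicts the majority class. The function name `error_rates` and the data are illustrative assumptions, not taken from the thread.

```python
from collections import defaultdict


def error_rates(y_true, y_pred):
    """Return (overall error rate, balanced error rate) for two label lists."""
    total = defaultdict(int)  # number of samples per true class
    wrong = defaultdict(int)  # number of misclassified samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth != pred:
            wrong[truth] += 1
    # Overall error: every sample counts equally, so large classes dominate.
    overall = sum(wrong.values()) / len(y_true)
    # BER: every class counts equally, whatever its size.
    ber = sum(wrong[c] / total[c] for c in total) / len(total)
    return overall, ber


# Hypothetical, heavily unbalanced data (assumption for illustration only).
y_true = ["A"] * 90 + ["B"] * 10
y_pred = ["A"] * 100  # the minority class "B" is never predicted

overall, ber = error_rates(y_true, y_pred)
print(overall, ber)  # 0.1 0.5
```

The overall error looks low because the 90 majority samples are all correct, while the BER exposes that one of the two classes is misclassified entirely.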

Kim-Anh

w.zeng · December 3, 2020, 6:58pm

Hi Kim-Anh, thanks for the kind reply. It’s so helpful! Yes, I have a heavily unbalanced data set and agree with you that I should use the BER instead of the overall error rate.

For the distances, I read the interpretations of the different distances in the article you linked. Given those interpretations and the fact that max.dist may ignore the samples in the minority class, would the centroid distance or the Mahalanobis distance give me more accurate predictions (I think my samples can be considered to lie in a multi-dimensional space)? And should I use them to determine the number of components?

kimanh.lecao · December 15, 2020, 3:15am

Hi @w.zeng,

The different distances allow for linear / non-linear separation problems. I don't think max.dist would disregard minority classes per se, but by all means choose the distance that minimises the BER. And yes, you can choose ncomp based on the chosen distance as well (and reuse that distance for other prediction functions, such as tune()).
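
As a rough sketch of that selection rule, the Python snippet below picks the distance and number of components with the lowest BER from a table of cross-validated error rates. The BER values are made up for illustration and are not the results from this thread; the helper name `best_setting` is also hypothetical. It is a plain minimum search, not a reproduction of how mixOmics itself selects ncomp.

```python
# Hypothetical BER values (made up for illustration), one list per distance,
# where position i holds the BER obtained with i + 1 components.
ber_by_dist = {
    "max.dist": [0.50, 0.48, 0.47, 0.46, 0.46],
    "centroids.dist": [0.40, 0.33, 0.30, 0.25, 0.26],
    "mahalanobis.dist": [0.41, 0.35, 0.29, 0.27, 0.28],
}


def best_setting(table):
    """Return (distance, ncomp, BER) with the lowest BER in the table."""
    best = None
    for dist, values in table.items():
        for ncomp, value in enumerate(values, start=1):
            # Strict "<" keeps the earliest entry on ties, i.e. fewer components.
            if best is None or value < best[2]:
                best = (dist, ncomp, value)
    return best


dist, ncomp, value = best_setting(ber_by_dist)
print(dist, ncomp, value)  # centroids.dist 4 0.25
```

Whichever distance comes out of such a comparison would then be the one passed to the later prediction steps.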

Kim-Anh
