Splsda difficulties

Santi · December 13, 2020, 4:58pm

Hi,
I have 300 samples, with 100 000 characteristics and a binary outcome that I am trying to predict. I followed the splsda.srbct example. I don’t quite understand the results.

A plot after PLS-DA shows that the samples separate nicely.

After running perf, I get a large difference between the BER and the overall error rate.

I selected 2 components based on max.dist:

          max.dist  centroids.dist  mahalanobis.dist
overall   2         1               1
BER       2         1               1

After running tune.splsda, choice.keepX recommends keeping 3 variables on comp1.

As expected, the ROC curve looks bad:

Is this what is expected based on the initial PLSDA results?

Thank you

kimanh.lecao · December 15, 2020, 3:11am

hi @Santi,
What you are facing here is an overfitting issue. The model is good at discriminating the classes on the full data, but as soon as you use cross-validation or subsampling, it is not. Also note that even if the groups of samples appear quite distinct visually, the difference between them might be very small (as indicated by the amount of variance explained on component 1).

Kim-Anh

Santi · December 15, 2020, 8:48am

Hi Kim-Anh,

Thank you very much for replying. Is the over-fitting also the reason why the overall error rate after running PLSDA is quite low but the BER is high?

Is there anything that I can do to improve this? Would removing features that are only expressed in a few samples help?

I also tried to run splsda on a different dataset where I only have 25 features.
After applying tune.splsda, I have a balanced error rate of 0.49, with the recommendation of keeping 4 components.

comp1  comp2  comp3  comp4
16     11     19     12

I get an AUC of 0.8523

What I wanted to check was:

Are the features repeated given I only have 25 features?
Why do I get such a good AUC with such a high BER?

Thank you very much!

Santi

kimanh.lecao · December 21, 2020, 4:41am

hi @Santi,

> Is the over-fitting also the reason why the overall error rate after running PLSDA is quite low but the BER is high?

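No, this is a different problem. You have an unbalanced number of samples in each class, so the minority class is barely taken into account in the classical (overall) error rate estimation. The BER averages the error rate of each class, so it is not dominated by the majority class.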
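To make the gap between the two error rates concrete, here is a minimal Python sketch. The 270/30 class split and the "always predict the majority class" rule are made-up assumptions for illustration, not values from this thread, and `error_rates` is a hypothetical helper, not a mixOmics function.

```python
def error_rates(y_true, y_pred):
    """Return (overall error rate, balanced error rate) for class labels."""
    classes = sorted(set(y_true))
    # Overall error: misclassified samples over all samples.
    n_wrong = sum(t != p for t, p in zip(y_true, y_pred))
    overall = n_wrong / len(y_true)
    # BER: error rate computed within each class, then averaged over classes.
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        wrong = sum(y_pred[i] != c for i in idx)
        per_class.append(wrong / len(idx))
    ber = sum(per_class) / len(per_class)
    return overall, ber


# Hypothetical data: 270 samples in class "A", 30 in class "B" (assumed split).
y_true = ["A"] * 270 + ["B"] * 30
# A useless rule that always predicts the majority class.
y_pred = ["A"] * 300

overall, ber = error_rates(y_true, y_pred)
print(f"overall error = {overall:.2f}, BER = {ber:.2f}")
# prints "overall error = 0.10, BER = 0.50"
```

The overall error looks good only because the majority class dominates the count, while the BER shows that one class is misclassified every time.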

> I also tried to run splsda on a different dataset where I only have 25 features.
> After applying tune.splsda, I have a balanced error rate of 0.49, with the recommendation of keeping 4 components.
>
> comp1  comp2  comp3  comp4
> 16     11     19     12
>
> I get an AUC of 0.8523
>
> What I wanted to check was:
>
> Are the features repeated given I only have 25 features?

Yes, and you could check this yourself.
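A quick counting check supports this. The sketch below assumes the four numbers quoted above are the number of variables kept on each component, and that each component selects from the same 25 features. Both are assumptions about the setup, not something the thread states outright.

```python
# Values quoted in the thread; interpreting them as keepX per component is an assumption.
n_features = 25
keep_x = {"comp1": 16, "comp2": 11, "comp3": 19, "comp4": 12}

# Total number of (component, feature) selections across the four components.
total_selected = sum(keep_x.values())

# At most n_features of those selections can be distinct features, so the rest
# must be features that were already selected on another component.
min_repeats = max(0, total_selected - n_features)

print(total_selected, min_repeats)  # prints "58 33"
```

Since 58 selections are drawn from only 25 features, at least 33 of them must repeat a feature already used on another component. Comparing the variables actually selected on each component shows which ones.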

> Why do I get such a good AUC with such a high BER?

You can read the reasons why in our PLOS Computational Biology paper: AUC is not really appropriate here, because it uses a different prediction approach from the one used to compute the error rates.
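As a rough illustration of how a ranking-based measure and a fixed decision rule can disagree, consider the Python sketch below. It is not what mixOmics computes internally: the scores, the cutoff of 0.5, and the helpers `auc` and `ber_at_cutoff` are all hypothetical and chosen only to show the effect.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ber_at_cutoff(scores, labels, cutoff):
    """Balanced error rate when class 1 is predicted for scores above a fixed cutoff."""
    pred = [1 if s > cutoff else 0 for s in scores]
    err_pos = sum(p != 1 for p, l in zip(pred, labels) if l == 1) / labels.count(1)
    err_neg = sum(p != 0 for p, l in zip(pred, labels) if l == 0) / labels.count(0)
    return (err_pos + err_neg) / 2


# Made-up scores: class 1 always scores higher than class 0,
# but every score sits below the assumed fixed cutoff of 0.5.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]

auc_value = auc(scores, labels)
ber_value = ber_at_cutoff(scores, labels, cutoff=0.5)
print(auc_value, ber_value)  # prints "1.0 0.5"
```

The AUC only asks whether the two classes are ranked in the right order over all possible cutoffs, whereas the error rate depends on one specific prediction rule, so a high AUC does not guarantee a low BER.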

Kim-Anh
