Splsda difficulties

Hi,
I have 300 samples with 100,000 features and a binary outcome that I am trying to predict. I followed the splsda.srbct example, but I don’t quite understand the results.

A sample plot after running PLS-DA shows that the samples separate nicely.

After running perf, I get a large difference between the BER and the overall error rate.

I selected 2 components based on max.dist
         max.dist  centroids.dist  mahalanobis.dist
overall         2               1                 1
BER             2               1                 1

After running tune.splsda, choice.keepX recommends keeping 3 variables on component 1.

As expected, the ROC curve looks bad.

Is this what is expected based on the initial PLSDA results?

Thank you

hi @Santi,
What you are facing here is an overfitting issue. The model is good at discriminating the classes on the training data, but as soon as you use cross-validation / subsampling, it no longer is. Also note that even though the samples appear visually quite distinct, the difference between them might be very small (as indicated by the amount of variance explained on component 1).
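To see why this happens with many more features than samples, here is a minimal sketch in pure Python on simulated data (hypothetical noise data, not mixOmics or your dataset): the features carry no signal at all, yet selecting "the best" feature on the full data gives a high apparent accuracy, while cross-validation that redoes the selection inside each fold falls back to chance level.

```python
# Toy illustration (not mixOmics): with many more features than samples,
# selecting the "best" feature on the whole dataset looks excellent on the
# training data yet performs at chance level under cross-validation.
import random

random.seed(0)
n, p = 20, 2000
# Pure noise: the features carry no real signal about the labels.
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

def fit_predict(train_idx, test_idx):
    """Pick the single feature with the best training accuracy
    (nearest class mean), then predict the test samples with it."""
    best_j, best_acc, best_means = 0, -1.0, (0.0, 0.0)
    for j in range(p):
        m0 = sum(X[i][j] for i in train_idx if y[i] == 0) / sum(1 for i in train_idx if y[i] == 0)
        m1 = sum(X[i][j] for i in train_idx if y[i] == 1) / sum(1 for i in train_idx if y[i] == 1)
        acc = sum((abs(X[i][j] - m0) > abs(X[i][j] - m1)) == (y[i] == 1)
                  for i in train_idx) / len(train_idx)
        if acc > best_acc:
            best_j, best_acc, best_means = j, acc, (m0, m1)
    m0, m1 = best_means
    return [int(abs(X[i][best_j] - m0) > abs(X[i][best_j] - m1)) for i in test_idx]

idx = list(range(n))
# Apparent (training) accuracy: select and evaluate on the same samples.
train_acc = sum(pred == y[i] for pred, i in zip(fit_predict(idx, idx), idx)) / n
# Leave-one-out CV: the feature selection is redone inside each fold.
cv_acc = sum(fit_predict([i for i in idx if i != k], [k])[0] == y[k] for k in idx) / n
print(f"training accuracy: {train_acc:.2f}, LOO-CV accuracy: {cv_acc:.2f}")
```

This is the same reason your nice PLS-DA sample plot and your poor perf results can coexist: the plot is computed on the very samples the components were fitted to.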

Kim-Anh

Hi Kim-Anh,

Thank you very much for replying. Is the over-fitting also the reason why the overall error rate after running PLSDA is quite low but the BER is high?

Is there anything that I can do to improve this? Would removing features that are only expressed in a few samples help?

I also tried to run splsda on a different dataset where I only have 25 features.
After applying tune.splsda, I get a balanced error rate of 0.49, with the recommendation to keep 4 components:

comp1  comp2  comp3  comp4
   16     11     19     12

I get an AUC of 0.8523

What I wanted to check was:

  • Are the features repeated given I only have 25 features?
  • Why do I get such a good AUC with such a high BER?

Thank you very much!

Santi

hi @Santi,

Is the over-fitting also the reason why the overall error rate after running PLSDA is quite low but the BER is high?

No, this is a different problem: you have an unbalanced number of samples in each class, so the minority classes barely contribute to the classical overall error rate, whereas the BER averages the error rate of each class.
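A small numeric sketch of this difference (hypothetical confusion counts, not your data): with 90 samples in the majority class and 10 in the minority, a classifier that labels almost everything as the majority class has a low overall error rate but a BER close to 0.5.

```python
# Hypothetical counts for an unbalanced two-class problem:
# 90 majority samples (2 misclassified), 10 minority samples (9 misclassified).
n_major, n_minor = 90, 10
errors_major, errors_minor = 2, 9

# Overall error rate pools all samples, so the majority class dominates.
overall_error = (errors_major + errors_minor) / (n_major + n_minor)
# BER averages the per-class error rates, giving each class equal weight.
ber = (errors_major / n_major + errors_minor / n_minor) / 2

print(f"overall error rate: {overall_error:.3f}")  # 0.110
print(f"BER: {ber:.3f}")                           # 0.461
```

This is why perf and tune.splsda report BER alongside the overall error rate: with your class imbalance, BER is the more honest summary.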

I also tried to run splsda on a different dataset where I only have 25 features
After applying tune.splsda, I have a balanced error rate of 0.49 with the recommendation of keeping 4 components.

comp1  comp2  comp3  comp4
   16     11     19     12

I get an AUC of 0.8523

What I wanted to check was:

  • Are the features repeated given I only have 25 features?

Yes: with only 25 features and keepX values of 16, 11, 19 and 12 across four components, some features must be selected on more than one component. You could check this yourself with selectVar() on each component.

  • Why do I get such a good AUC with such a high BER?

You can read the reasons why in our PLOS CB paper: AUC is not really appropriate here, as it uses a different prediction approach (a score threshold over all possible cut-offs) than the distance-based prediction used to compute the error rates.
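To make the distinction concrete, here is a toy sketch (hypothetical scores, pure Python, unrelated to the distance-based prediction mixOmics actually uses): the scores rank every positive above every negative, so the AUC is perfect, yet with a fixed decision threshold every sample is predicted positive and the error rate is 0.5.

```python
# Hypothetical classifier scores: 5 negative and 5 positive samples.
# The scores rank the two classes perfectly, but all sit above the
# 0.5 decision threshold.
neg = [0.55, 0.60, 0.62, 0.65, 0.70]
pos = [0.71, 0.75, 0.80, 0.85, 0.90]

# AUC = fraction of (positive, negative) pairs ranked correctly.
auc = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

# Error rate when predicting "positive" for any score above 0.5.
preds = [s > 0.5 for s in neg + pos]
truth = [False] * len(neg) + [True] * len(pos)
error = sum(p != t for p, t in zip(preds, truth)) / len(truth)

print(f"AUC: {auc:.2f}, error rate: {error:.2f}")  # AUC: 1.00, error rate: 0.50
```

So a high AUC tells you the scores order the classes well over all possible cut-offs, while the BER reflects the actual class assignments; the two can legitimately disagree.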

Kim-Anh