Optimal components from the perf() and tune.splsda() functions are not optimal?

Hello,
I am building a PLS-DA model and have been experimenting with the number of components in the final model. At first I used the number of components suggested by the perf() and tune() functions (taking into account your advice in the manual and in other topics).
However, I noticed that the suggested number of components (and variables) is often not optimal.

For example, in the attached figure it seems like 1 component would be best for max.dist. But when I use 1 component, 3 of 26 samples are incorrectly classified, whereas with 6 components all 26 samples are correctly classified.

I used 7-fold cross-validation repeated 75 times, so instability should not be the problem, right? Does it have something to do with the threshold that determines whether the model improves or not?
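
For context, my tuning was run roughly like this (a sketch; X and Y stand for my data matrix and class labels):

```r
library(mixOmics)

# Fit an initial PLS-DA model with a generous number of components
plsda.model <- plsda(X, Y, ncomp = 10)

# 7-fold cross-validation, repeated 75 times
perf.res <- perf(plsda.model, validation = "Mfold", folds = 7,
                 nrepeat = 75, progressBar = FALSE)

perf.res$choice.ncomp  # components suggested per distance measure
plot(perf.res)         # error rate curves with standard deviation bars
```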

Thanks in advance!

When determining the optimal number of components, the perf() and tune() functions employ the following algorithm (a rough code sketch follows the list):

  • for x in 1:ncomp:
    • generate the component scores and loadings by matrix decomposition
    • use M-fold cross-validation to estimate the error rate of a model built on these variates; this yields a distribution of nrepeat error rates
    • if x > 1:
      • run a one-sided t-test comparing the error-rate distribution of the current optimal ncomp against that of component x
      • if component x shows a statistically significant improvement, set x as the new optimal ncomp
    • else:
      • set x as the optimal ncomp
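
As a rough illustration of that logic in R (not the actual mixOmics source; choose.ncomp and error.rates are hypothetical names, with error.rates an nrepeat × ncomp matrix of CV error rates):

```r
# Illustration only: pick the optimal number of components from a
# matrix of CV error rates (rows = repeats, columns = components).
choose.ncomp <- function(error.rates, alpha = 0.05) {
  optimal <- 1
  for (x in 2:ncol(error.rates)) {
    # One-sided t-test: is the mean error of component x lower than
    # that of the current optimum?
    test <- t.test(error.rates[, x], error.rates[, optimal],
                   alternative = "less")
    if (test$p.value < alpha) {
      optimal <- x  # significant improvement: update the optimum
    }
  }
  optimal
}
```

Because each later component must significantly beat the current optimum, overlapping error distributions keep the suggestion at the smaller ncomp.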

Despite this “calculated” optimal ncomp, at the end of the day we as users need to decide what value to use for this parameter. We always need to balance model accuracy against model complexity, which is why the mixOmics tuning functions tend to lean towards suggesting simpler models (i.e., fewer components and features).

Additionally, if we look at the error bars in your figure, there is a large degree of overlap. This suggests that the t-tests for components 2 to 10 were all non-significant. The sharp increase beyond component 1 explains why it was selected.

Thanks for the quick response!
I’m trying to get the maximum classification accuracy, so I think I’m just going to take the model that gives the best classification results for my training samples and then use it to classify my test samples.
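Something like this (a sketch; final.model and X.test are placeholders for my chosen model and test data):

```r
# Refit with the chosen number of components, then classify test samples
final.model <- plsda(X, Y, ncomp = 6)
pred <- predict(final.model, newdata = X.test, dist = "max.dist")
head(pred$class$max.dist)  # predicted class per test sample and component
```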
Thanks!