Optimal components from perf() and tune.splsda() functions are not optimal?

When determining the optimal number of components, the perf() and tune() functions employ the following algorithm (a code sketch of this logic follows the list):

  • for x in 1 : ncomp:
    • generate the component scores and loadings by decomposition
    • use repeated M-fold cross-validation to estimate the error rate of a model built on these components; this yields a distribution of nrepeat error rates
    • if x > 1:
      • run a one-sided t-test comparing the error rate distribution of component x against that of the current optimal ncomp
      • if component x gives a statistically significant improvement, set x as the new optimal ncomp
    • else:
      • set x as the optimal ncomp
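For illustration only, here is a minimal R sketch of that selection logic. The function name and the `error_rates` input are hypothetical (not the actual mixOmics internals); `error_rates` is assumed to be an nrepeat x ncomp matrix of cross-validated error rates, one column per component.

```r
## Minimal sketch of the selection logic described above (illustrative only,
## not the actual mixOmics internals).
select_ncomp <- function(error_rates, alpha = 0.05) {
  optimal <- 1                                  # component 1 starts as the optimal
  for (x in 2:ncol(error_rates)) {
    # one-sided t-test: is the error rate at component x significantly LOWER
    # than at the current optimal component?
    pval <- t.test(error_rates[, x], error_rates[, optimal],
                   alternative = "less")$p.value
    if (pval < alpha) {
      optimal <- x                              # significant improvement: update
    }
  }
  optimal
}
```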

Despite this “calculated” optimal ncomp, at the end of the day we as users need to decide what value to use for this parameter, always balancing model accuracy against model complexity. This is why the mixOmics tuning functions tend to lean towards suggesting simpler models (i.e. fewer components and features).
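For example, nothing stops you from inspecting the tuning output and then fitting the final model with whichever ncomp you judge appropriate. This is a hedged sketch of the usual workflow; X, Y and all parameter values are placeholders for your own data and settings.

```r
library(mixOmics)

## X: data matrix, Y: class factor (placeholders for your own data)
base_model <- splsda(X, Y, ncomp = 10)
perf_res   <- perf(base_model, validation = "Mfold", folds = 5, nrepeat = 50)

perf_res$choice.ncomp   # the "calculated" optimal ncomp per distance/measure
plot(perf_res)          # inspect the error-rate curves and their error bars

## After weighing accuracy against complexity, set ncomp yourself:
final_model <- splsda(X, Y, ncomp = 2)
```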

Additionally, if we look at the error bars in your figure, there is a large degree of overlap. This suggests that the t-tests for components 2 to 10 were all non-significant. The sharp increase in error rate beyond component 1 explains why it was selected.
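To make that concrete, here is a toy illustration with made-up error-rate values (not your data) of how heavily overlapping distributions give the one-sided t-test no grounds to prefer the later component:

```r
## Made-up error rates for two components whose distributions largely overlap
err_comp1 <- c(0.18, 0.22, 0.20, 0.25, 0.15)
err_comp2 <- c(0.17, 0.23, 0.19, 0.26, 0.14)

## One-sided test: is component 2's error rate significantly lower?
t.test(err_comp2, err_comp1, alternative = "less")$p.value
## The p-value is far above 0.05, so component 2 offers no significant
## improvement and component 1 would be retained.
```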