Optimal components from perf() and tune.splsda() functions are not optimal?

When determining the optimal number of components, the perf() and tune() functions employ the following algorithm (a code sketch of this logic follows the list):

  • for x in 1 : ncomp:
    • generate the component scores and loadings by decomposition
    • use repeated M-fold cross-validation to estimate the error rate of a model built on these components; this yields a distribution of nrepeat error rates
    • if x > 1:
      • run a one-sided t-test comparing the error rate distribution of component x against that of the current optimal ncomp
      • if component x gives a statistically significant improvement, set x as the new optimal ncomp
    • else:
      • set x as the optimal ncomp
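For illustration only, here is a minimal R sketch of that selection logic. The function name and the `error_rates` input are hypothetical (not the actual mixOmics internals); `error_rates` is assumed to be an nrepeat x ncomp matrix of cross-validated error rates, one column per component.

```r
## Minimal sketch of the selection logic described above (illustrative only,
## not the actual mixOmics internals).
select_ncomp <- function(error_rates, alpha = 0.05) {
  optimal <- 1                                  # component 1 starts as the optimal
  for (x in 2:ncol(error_rates)) {
    # one-sided t-test: is the error rate at component x significantly LOWER
    # than at the current optimal component?
    pval <- t.test(error_rates[, x], error_rates[, optimal],
                   alternative = "less")$p.value
    if (pval < alpha) {
      optimal <- x                              # significant improvement: update
    }
  }
  optimal
}
```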

Despite this “calculated” optimal ncomp, at the end of the day we as users need to decide what value to use for this parameter, always balancing model accuracy against model complexity. This is why the mixOmics tuning functions tend to lean towards suggesting simpler models (i.e. fewer components and features).
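For example, nothing stops you from inspecting the tuning output and then fitting the final model with whichever ncomp you judge appropriate. This is a hedged sketch of the usual workflow; X, Y and all parameter values are placeholders for your own data and settings.

```r
library(mixOmics)

## X: data matrix, Y: class factor (placeholders for your own data)
base_model <- splsda(X, Y, ncomp = 10)
perf_res   <- perf(base_model, validation = "Mfold", folds = 5, nrepeat = 50)

perf_res$choice.ncomp   # the "calculated" optimal ncomp per distance/measure
plot(perf_res)          # inspect the error-rate curves and their error bars

## After weighing accuracy against complexity, set ncomp yourself:
final_model <- splsda(X, Y, ncomp = 2)
```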

Additionally, if we look at the error bars in your figure, there is a large degree of overlap. This suggests that the t-tests for components 2 to 10 were all non-significant. The sharp increase in error rate beyond component 1 explains why it was selected.
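To make that concrete, here is a toy illustration with made-up error-rate values (not your data) of how heavily overlapping distributions give the one-sided t-test no grounds to prefer the later component:

```r
## Made-up error rates for two components whose distributions largely overlap
err_comp1 <- c(0.18, 0.22, 0.20, 0.25, 0.15)
err_comp2 <- c(0.17, 0.23, 0.19, 0.26, 0.14)

## One-sided test: is component 2's error rate significantly lower?
t.test(err_comp2, err_comp1, alternative = "less")$p.value
## The p-value is far above 0.05, so component 2 offers no significant
## improvement and component 1 would be retained.
```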