When I assess the performance of my block.splsda model with the perf function, the classification error rate with the Mahalanobis distance increases when components 2 and 3 are added, but decreases when the 4th component is added. In contrast, the error rates with the max and centroids distances decrease consistently when components 2 and 3 are added. Is this simply because the max and centroids distances perform much better in this particular case, or is it a sign that I am overfitting? I have 4 classes, 39 samples and 7 folds (n/k ≈ 5.6), and the pattern does not change when I use leave-one-out, increase nrepeat, or use fewer or more folds.
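To put the size of these error-rate changes in context, here is a minimal base-R sketch of the fold arithmetic for my setup. It assumes a plain round-robin split of the 39 samples into 7 folds, which is only an illustration and not necessarily how perf assigns samples internally; the variable names are my own.

```r
# Assumed from the post: 39 samples, 7 folds.
# The round-robin assignment below is illustrative only, not mixOmics' own splitting.
n_samples <- 39
n_folds   <- 7

# Assign samples to folds in turn and count the samples per fold
fold_id    <- rep(seq_len(n_folds), length.out = n_samples)
fold_sizes <- as.integer(table(fold_id))

# Change in the overall error rate caused by one additional misclassified sample
error_step <- 1 / n_samples

print(fold_sizes)
print(round(100 * error_step, 2))  # in percentage points
```

With so few samples, a single misclassified sample moves the overall error rate by a few percentage points within one repeat, so small differences between components may partly reflect this granularity.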
I also have some questions about how to set the parameters for tune.block.splsda. For my large proteomics dataset, I would normally test c(5:9, seq(10, 150, 5)). Lately, however, I have been using keepX = c(5:150) with nrepeat = 80, because I believe this gives a more thorough tuning. Are there any statistical or mathematical advantages or drawbacks to doing it this way, apart from the heavier computation? Computational time is not a problem, since I am running it on a large server.
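To make the comparison concrete, here is a small base-R sketch of how many candidate values each grid contains, reading the coarse grid as c(5:9, seq(10, 150, 5)). The number of blocks (n_blocks) is a hypothetical value for illustration, and the combination count assumes, as I understand it, that the tuning evaluates every combination of keepX values across blocks.

```r
# Coarse grid: every value from 5 to 9, then steps of 5 up to 150
coarse_grid <- c(5:9, seq(10, 150, by = 5))

# Fine grid: every integer from 5 to 150
fine_grid <- 5:150

n_coarse <- length(coarse_grid)
n_fine   <- length(fine_grid)

# Hypothetical number of blocks tuned with the same grid (assumption, not from my data)
n_blocks <- 2

# Combinations to evaluate per component if all cross-block combinations are tested
combos_coarse <- n_coarse^n_blocks
combos_fine   <- n_fine^n_blocks

print(c(coarse = n_coarse, fine = n_fine))
print(c(coarse = combos_coarse, fine = combos_fine))

# Every value in the coarse grid is also in the fine grid
print(all(coarse_grid %in% fine_grid))
```

The fine grid contains every value of the coarse grid, so the difference is purely one of resolution: many more candidates are compared on the same 39 samples.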