When I assess the performance of block.splsda using the perf function, the classification error rate (with the Mahalanobis distance) increases when adding components 2 and 3, but decreases when adding the 4th component. However, the classification error rate (with the max and centroids distances) decreases consistently when adding components 2 and 3. Is this simply because the max and centroids distances give a much better performance in this particular case, or is it because I am overfitting? I have 4 classes, 39 samples and 7 folds (n/k ≈ 5.6), and the pattern does not change when using leave-one-out, increasing nrepeat, or using fewer or more folds.
I also have some questions/considerations on how to set the parameters for tune.block.splsda. For my large proteomics dataset, I would normally test c(5:9, seq(10, 150, 5)). However, lately I have been using keepX = c(5:150) with nrepeat = 80, because I believe this results in a more thorough tuning. Are there any statistical/mathematical advantages or drawbacks (apart from the heavier computation) to doing it this way? Computational time is not a problem, since I am running it on a big server.
hi @christoa,
Regarding the choice of distances: have a look at the Supplemental in: https://journals.plos.org/ploscompbiol/article/related?id=10.1371/journal.pcbi.1005752 (I screenshotted the relevant text here). I don't think it is an overfitting problem.
I would favour the centroid distance in your case. Your error bars indicate high variability, so increasing the number of repeats is appropriate: it will make the estimation more reliable and ‘reproducible’ (to an extent), and this is how it should be done.
Kim-Anh
Hi @kimanh.lecao,
Thanks very much for the help. It is sincerely appreciated.
What about the fact that I test many more values (keepX = c(5:150) instead of seq(5, 150, 5)), do you have an opinion on this? I achieve a slightly better correlation between the two datasets (0.97 vs 0.95) when doing this, so I am wondering why people don’t do it this way. Is it just a matter of personal preference (benefit vs. computational time)?
hi @christoa,
it is definitely a computational issue that explains the coarse grid seq(5, 150, 5), because there are so many combinations of keepX values across all data sets to test! But if you have the compute power and need a very precise answer, then a fine grid is the way to go.
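To give a rough sense of scale, here is a minimal base R sketch that counts the keepX combinations for the coarse and the fine grid. It assumes two blocks (the thread mentions two datasets), that the same grid is used for every block, and that every combination across blocks is tested for each component. The block names are hypothetical, and the list is only laid out in the per-block format that tune.block.splsda takes for its test.keepX argument; nothing is tuned here.

```r
# Candidate keepX grids for a single block (values taken from this thread).
coarse <- seq(5, 150, 5)   # 5, 10, ..., 150
fine   <- 5:150            # every integer from 5 to 150

# Hypothetical block names: the thread does not name the two datasets.
blocks <- c("proteomics", "block2")

# Build one grid per block, as a named list (the shape expected for test.keepX).
make_grid <- function(values) {
  setNames(rep(list(values), length(blocks)), blocks)
}
test_keepX_coarse <- make_grid(coarse)
test_keepX_fine   <- make_grid(fine)

# Number of keepX combinations across blocks, assuming a full grid search.
n_combos <- function(grid) prod(lengths(grid))

n_coarse <- n_combos(test_keepX_coarse)
n_fine   <- n_combos(test_keepX_fine)

cat("coarse grid:", n_coarse, "combinations per component\n")  # 900
cat("fine grid:  ", n_fine, "combinations per component\n")    # 21316
cat("ratio fine / coarse:", round(n_fine / n_coarse, 1), "\n")
```

Under these assumptions the fine grid needs roughly 24 times as many model fits per component as the coarse one, before multiplying by the number of folds and repeats, and the gap widens quickly with each additional block.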
Kim-Anh
Again, thank you for your help. The reason I do it is that it improves the clustering coefficient in the subsequent functional network analysis.