I have been enjoying using the mixOmics package, and as I dive deeper into my analysis, more questions have come up.
According to some papers on PLSDA (e.g. https://link.springer.com/article/10.1007/s11306-011-0330-3), a double cross-validation scheme may be the least biased way to validate a PLSDA model. I noticed that I can perform cross-validation through both the perf() and the tune() functions in the mixOmics package. Is the cross-validation implemented in these two functions a double cross-validation scheme? Does the number of folds affect the error rate, and how should I determine the number of folds?
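For reference, this is the kind of single-layer CV I have been running (using the srbct data bundled with mixOmics as a stand-in for my own; the ncomp, fold, and repeat settings are just what I tried, not recommendations):

```r
library(mixOmics)

data(srbct)
X <- srbct$gene   # samples x variables
Y <- srbct$class  # class labels (factor)

## One layer of cross-validation: fit PLSDA, then estimate the
## classification error rate by 5-fold CV repeated 10 times
model <- plsda(X, Y, ncomp = 3)
cv <- perf(model, validation = "Mfold", folds = 5, nrepeat = 10)
cv$error.rate
```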
Double cross-validation is indeed the way to go, but only if you have a sufficient number of samples, which is often not the case in omics studies. One approach is to create an outer loop around either perf() or tune(); see the sketch below.
Yes, the number of folds affects how the error rate is estimated. We recommend that your test set (inner fold) include at least 5 samples. Don’t forget that we also have an nrepeat argument to repeat the CV process.
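A minimal sketch of one iteration of such an outer loop, assuming sPLS-DA on the srbct example data with an illustrative keepX grid (the test-set size, fold counts, and grid are placeholders, not recommendations):

```r
library(mixOmics)

data(srbct)
X <- srbct$gene   # samples x variables
Y <- srbct$class  # class factor

## One outer split: hold out a test set that tune() never sees
test  <- sample(nrow(X), 12)
train <- setdiff(seq_len(nrow(X)), test)

## Inner CV (with repeats) runs on the training samples only
tuned <- tune.splsda(X[train, ], Y[train], ncomp = 2,
                     test.keepX = c(5, 10, 20),
                     validation = "Mfold", folds = 3, nrepeat = 10,
                     dist = "max.dist")

## Refit on the training samples with the tuned keepX, then
## evaluate once on the untouched outer test set
fit  <- splsda(X[train, ], Y[train], ncomp = 2, keepX = tuned$choice.keepX)
pred <- predict(fit, X[test, ], dist = "max.dist")
mean(pred$class$max.dist[, 2] != Y[test])   # outer test error

## Wrapping this in a loop over outer folds gives the double CV scheme
```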
Thanks for the kind reply. If I decide to use the double cross-validation scheme, can I reduce the number of CV repeats in the inner loop without affecting the validity of the outcome?
I attempted to code the nested cross-validation. For the outer and inner loops, I used 5 and 2 folds, respectively. If I understand correctly, I tune the parameters in the inner loop and then test the best-performing model on the held-out data of the outer fold. I then average the metrics across all outer folds to estimate model performance. But then, which hyperparameters should I use for the final refit of my model?
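To make my question concrete, here is a simplified version of what I coded, again on the srbct example data (5 outer folds, 2 inner folds, an illustrative keepX grid). Note how the inner tuning can pick a different keepX in each outer fold, which is exactly what makes me unsure about the final refit:

```r
library(mixOmics)

data(srbct)
X <- srbct$gene
Y <- srbct$class

outer.folds <- 5
## Randomly assign samples to outer folds (stratifying by class
## would be preferable in practice)
fold.id <- sample(rep(1:outer.folds, length.out = nrow(X)))

outer.error  <- numeric(outer.folds)
chosen.keepX <- vector("list", outer.folds)

for (k in 1:outer.folds) {
  train <- fold.id != k

  ## Inner loop: 2-fold repeated CV to tune keepX on the training portion
  tuned <- tune.splsda(X[train, ], Y[train], ncomp = 2,
                       test.keepX = c(5, 10, 20),
                       validation = "Mfold", folds = 2, nrepeat = 10,
                       dist = "max.dist")
  chosen.keepX[[k]] <- tuned$choice.keepX

  ## Test the best-performing model on the outer fold
  fit  <- splsda(X[train, ], Y[train], ncomp = 2,
                 keepX = tuned$choice.keepX)
  pred <- predict(fit, X[!train, ], dist = "max.dist")
  outer.error[k] <- mean(pred$class$max.dist[, 2] != Y[!train])
}

mean(outer.error)   # performance averaged over the outer folds
chosen.keepX        # often differs from fold to fold
```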