I work with ruminal microbiota data and have tried to use sPLS-DA to look for associations between OTUs and a divergent line of animals. My data consist of 283 samples and 2125 OTUs. My question is about the optimal number of variables to include in my sPLS-DA analysis.
I used two tools, selbal and clr-lasso, to determine the number of variables to include. With selbal, I ran cross-validation (5 folds, 10 iterations) and found that the optimal number of variables is 15. With clr-lasso, I also ran cross-validation (5 folds, 10 iterations) and found that the optimal number of variables should be between 48 and 90. In both procedures I used AUC as the criterion for determining the optimal number of variables.
What is your experience on the subject? Could you help me decide which number of variables to use (the selbal result or the clr-lasso result)? I read the article published on both methodologies, but it is not clear to me which of the two results to use in this case.
Dear @GMB,
Thank you for your interest in using mixOmics. As we highlight in this article, it really depends on the biological question. Selbal will lead to a small signature expressed as a balance; sPLS-DA does not. Selbal is based on AUC plus cross-validation, whereas sPLS-DA is based on cross-validation (not AUC), as detailed in this vignette. We do not recommend using AUC for sPLS-DA because it already makes its own prediction call (see also the supplementary material of this article).
Thank you very much for your reply, Kim-Anh. It's clearer to me now, but since I work with a large dataset, it might be better to use clr-lasso, right?
Although, from what I understood in the article, selbal is recommended because of the problem with the geometric mean and its dependence on the eliminated OTUs.
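To make the geometric mean point concrete, here is a minimal Python sketch (not taken from the selbal or clr-lasso code). It assumes a hypothetical sample of five OTU counts and a pseudocount of 1 to avoid log(0); the function name `clr` and all values are illustrative only. It shows that the clr value of an OTU changes when another OTU is removed, while the log-ratio between two retained OTUs stays the same.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample (pseudocount is an assumption)."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean of the sample
    return [v - mean_log for v in logs]

# Hypothetical counts for five OTUs in a single sample.
sample = [120, 30, 5, 800, 45]

full = clr(sample)         # clr computed with all five OTUs
reduced = clr(sample[:4])  # clr recomputed after eliminating the last OTU

# The clr value of the first OTU depends on which OTUs were kept,
# because the geometric mean in the denominator changed.
shift = full[0] - reduced[0]

# The log-ratio between two retained OTUs is unaffected by the removal,
# since the geometric mean cancels out in the difference.
ratio_full = full[0] - full[1]
ratio_reduced = reduced[0] - reduced[1]

print("change in clr of OTU 1 after removal:", shift)
print("log-ratio OTU1/OTU2, full vs reduced:", ratio_full, ratio_reduced)
```

This is why a signature built on clr-transformed variables is tied to the full set of OTUs used in the transform, whereas a balance between groups of OTUs only depends on the OTUs it contains.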