I work with ruminal microbiota data and have been using sPLS-DA to look for associations between OTUs and a divergent line of animals. My data consist of 283 samples and 2125 OTUs. My question is about the optimal number of variables to include in the sPLS-DA analysis.
I used two tools, selbal and clr-lasso, to determine the number of variables to include. With selbal, I ran a cross-validation (5 folds, 10 iterations) and the optimal number of variables came out as 15. With clr-lasso, I also ran a cross-validation (5 folds, 10 iterations) and the optimal number of variables came out between 48 and 90. In both procedures I used AUC as a criterion for determining the optimal number of variables.
What is your experience with this? Could you help me decide which number of variables to use (the selbal result or the clr-lasso result)? I have read the article published on both methodologies, but it is still not clear to me which of the two results to use in a case like this.
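
For readers less familiar with the setup, the cross-validation scheme described above (5 folds, repeated 10 times, scored by AUC for each candidate number of variables) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the selbal or clr-lasso implementation: the selector `fit_top_n` is a hypothetical placeholder that ranks features by difference in class means, the helper names are made up for this example, and the data are simulated.

```python
import random
from statistics import mean

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney statistic); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_stratified_folds(labels, k=5, repeats=10, seed=0):
    """Yield (train_idx, test_idx) for `repeats` reshuffled stratified k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        folds = [[] for _ in range(k)]
        for cls in sorted(set(labels)):
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)
            for pos, i in enumerate(idx):
                folds[pos % k].append(i)  # deal each class evenly across folds
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, folds[f]

def fit_top_n(X, y, train, n_vars):
    """Placeholder selector (an assumption, NOT selbal or clr-lasso): keep the
    n_vars features with the largest absolute difference in class means."""
    ranked = []
    for j in range(len(X[0])):
        m1 = mean(X[i][j] for i in train if y[i] == 1)
        m0 = mean(X[i][j] for i in train if y[i] == 0)
        ranked.append((abs(m1 - m0), j, 1 if m1 > m0 else -1))
    return [(j, sign) for _, j, sign in sorted(ranked, reverse=True)[:n_vars]]

def cv_auc_by_size(X, y, sizes, k=5, repeats=10, seed=0):
    """Mean held-out AUC over all folds and repeats, per candidate size."""
    results = {n: [] for n in sizes}
    for train, test in repeated_stratified_folds(y, k, repeats, seed):
        for n in sizes:
            model = fit_top_n(X, y, train, n)  # selection uses training rows only
            scores = [sum(s * X[i][j] for j, s in model) for i in test]
            results[n].append(auc(scores, [y[i] for i in test]))
    return {n: mean(v) for n, v in results.items()}

# Simulated toy data: 60 samples, 20 features, the first 3 shifted in class 1.
rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(1.0 if (j < 3 and y[i] == 1) else 0.0, 1.0) for j in range(20)]
     for i in range(60)]

curve = cv_auc_by_size(X, y, sizes=[1, 3, 10])
for n_vars, mean_auc in curve.items():
    print(n_vars, round(mean_auc, 3))
```

The point of the sketch is only that each candidate size gets one mean AUC over the 50 held-out folds, and that variable selection is redone inside every training fold. Comparing such curves between two methods is only meaningful if both were evaluated on the same kind of splits.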