How to select the optimal number of variables for sPLS-DA and comparison with Selbal

GMB · June 29, 2020, 11:15am

I work with ruminal microbiota data and have tried to use sPLS-DA to look for associations between OTUs and a divergent line of animals. My data consist of 283 samples and 2125 OTUs. My question is about the optimal number of variables to include in my sPLS-DA analysis.

I used two tools, selbal and clr-lasso, to determine the number of variables to include. With selbal, I ran a cross-validation (5 folds and 10 iterations) and found that the optimal number of variables is 15. With clr-lasso, I also used cross-validation (5 folds and 10 iterations) and found that the optimal number of variables should be between 48 and 90. In both procedures I used AUC as the criterion for determining the optimal number of variables.
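For readers following along, the resampling scheme described here (5 folds repeated 10 times, scored by AUC) can be sketched in plain Python. This is only an illustration, not the code used by selbal or clr-lasso: the function names `repeated_kfold` and `auc` are hypothetical, the splits are not stratified by class, and the scores in the last line are made up.

```python
import random

def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # new random partition for every repeat
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f, test in enumerate(folds):
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield rep, f, train, test

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 283 samples, as in the data set described above
splits = list(repeated_kfold(283))
n_splits = len(splits)  # 5 folds x 10 repeats
toy_auc = auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])  # made-up scores, perfectly separated
```

In such a scheme, a model would be fitted on each training set for every candidate number of variables, and the AUC on the matching test set averaged over all splits.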

What is your experience on the subject? Perhaps you can help me decide which number of variables I should use (the one from selbal or the one from clr-lasso). I read the article published on both methodologies, but it is not clear to me which of the two results to use in a case like this.

Guillermo

kimanh.lecao · June 30, 2020, 1:46am

Dear @GMB,
thank you for your interest in using mixOmics. As we highlight in this article, it really depends on the biological question. Selbal yields a small signature in the form of a balance; sPLS-DA does not. Selbal is based on AUC + cross-validation, whereas sPLS-DA is based on cross-validation (not AUC), as detailed in this vignette. We do not recommend using AUC for the latter because sPLS-DA already makes its own prediction call (see also the Supplementary Material of this article).

I hope that helps,

Kim-Anh

GMB · June 30, 2020, 3:26am

Thank you very much for your reply, Kim-Anh. It’s clearer to me now, but since I work with a large dataset, it might be better to use clr-lasso, right?

Although, from what I understood in the article, selbal is recommended because of the problem with the geometric mean and its dependence on the removed OTUs.
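The geometric-mean dependence mentioned here can be illustrated with a minimal Python sketch. It is not taken from the article: the counts are made up, the helper names are hypothetical, and `balance` uses the usual definition of a balance as a scaled log-ratio of geometric means. All counts must be strictly positive (zeros would need to be replaced beforehand).

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def clr(counts):
    """Centred log-ratio: log of each part over the geometric mean of ALL parts."""
    g = geometric_mean(counts)
    return [math.log(c / g) for c in counts]

def balance(counts, numerator, denominator):
    """Balance between two groups of parts, given as lists of indices."""
    k_num, k_den = len(numerator), len(denominator)
    scale = math.sqrt(k_num * k_den / (k_num + k_den))
    g_num = geometric_mean([counts[i] for i in numerator])
    g_den = geometric_mean([counts[i] for i in denominator])
    return scale * math.log(g_num / g_den)

sample = [120.0, 30.0, 8.0, 400.0]  # made-up counts for four OTUs
reduced = sample[:3]                # same sample after removing the last OTU

clr_full = clr(sample)
clr_reduced = clr(reduced)          # CLR of the retained OTUs changes

bal_full = balance(sample, [0], [1, 2])
bal_reduced = balance(reduced, [0], [1, 2])  # balance is unchanged
```

The CLR value of the first OTU differs between `clr_full` and `clr_reduced` because the geometric mean in the denominator is taken over every OTU in the table, whereas the balance only involves the OTUs it is built from.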

Guillermo

kimanh.lecao · July 1, 2020, 4:07am

Yes, I think CLR-LASSO would be more efficient. There is also a GLM approach proposed in the paper; see the extended vignette at: https://github.com/EvaYiwenWang/Microbiome_variable_selection_tutorial

Kim-Anh

GMB · July 1, 2020, 6:49pm

Thank you very much, Kim-Anh.

Guillermo
