We know that PLSR is a good modelling method when 𝑛≪𝑝, but how do we determine the right number of variables? How do we know whether the number of variables we are using is appropriate, or whether it is affecting the model when it is very large, say 20k? Is there a reference on statistical power, or on how many variables 𝑝 we should focus on when using a feature selection method to derive a subset of features from big data such as transcriptomics or metabolomics, with variables in the thousands (up to 20k)? I would appreciate it if somebody could point me to a reference that I can use and cite properly for my analysis.
The number of variables to select can be tuned using the
tune.spls function. It uses cross-validation over a grid of candidate values (the test.keepX argument) to find the number of features to keep on each component of the model. Please refer to
?tune.spls for more details.
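As a rough illustration, here is a minimal sketch on simulated data. The data, the grid in keepX_grid, and the settings for ncomp, folds and nrepeat are all assumptions made up for the example, not recommendations. Argument names and the allowed measure values have changed between mixOmics versions, so please check ?tune.spls for the version you have installed.

```r
# Minimal sketch on simulated data; assumes the mixOmics package is installed.
library(mixOmics)

set.seed(42)
n <- 40   # samples (hypothetical)
p <- 200  # variables (hypothetical; real omics data would have thousands)
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("gene_", seq_len(p))))

# Hypothetical response driven by the first 10 variables only, plus noise
Y <- matrix(rowSums(X[, 1:10]) + rnorm(n, sd = 0.5), ncol = 1,
            dimnames = list(NULL, "response"))

# Candidate numbers of X variables to keep on each component (an assumption)
keepX_grid <- c(5, 10, 20, 50)

# Repeated M-fold cross-validation over the grid, for 2 components
tuned <- tune.spls(X, Y, ncomp = 2, test.keepX = keepX_grid,
                   validation = "Mfold", folds = 5, nrepeat = 3,
                   measure = "MSE", progressBar = FALSE)

# Number of variables selected by the tuning, one value per component
tuned$choice.keepX

# Refit the sparse PLS model with the tuned values and list the
# variables retained on the first component
final <- spls(X, Y, ncomp = 2, keepX = tuned$choice.keepX)
selectVar(final, comp = 1)$X$name
```

On real data you would usually use a wider grid and more repeats (nrepeat), at the cost of longer run time.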
Hope it helps,