I have a lipidomics dataset with more than 1800 features and 30 samples, and I want to apply PLS-DA to it. I used the following code:
library(mixOmics)

# Fit PLS-DA with up to 10 components on the unscaled data
HLGA_plsda <- plsda(Normalized_data, Groups, ncomp = 10, scale = FALSE)
# 3-fold cross-validation, repeated 100 times
perf_plsda <- perf(HLGA_plsda, validation = "Mfold",
                   folds = 3, nrepeat = 100,
                   progressBar = TRUE, auc = TRUE)
plot(perf_plsda, col = color.mixo(5:7), sd = TRUE,
     legend.position = "horizontal")
perf_plsda$choice.ncomp
and I get this plot:
According to "$choice.ncomp", a single component is optimal, but that does not seem reasonable to me and I feel I should choose 2. Also, why is the error rate so low? Does it mean the model is overfitted, or am I doing something wrong?
I chose not to scale because I get a higher classification error rate with scaling.
Also, using the RVAideMemoire package, cross-validation with 2 components gives me a lower classification error rate than with one component:
MVA.cv(Normalized_data, Groups, repet = 100, k = 3, ncomp = 1, scale = FALSE, model = "PLS-DA")
With one component this gives a mean (standard error) classification error rate (%) of 0.2 (0.08), but with 2 components I get 0 (0).
So I am really confused about how to decide on the number of components. Can anyone help? Both the one-component and the two-component models are significant in a permutation test.
I am really new to this area so any advice will help!
Thank you