PLS-DA classification error rate

Hi mixOmics team,

I am analyzing 16S amplicon microbiome data from soil samples.
My aim is to classify the samples by the sampling depth of the soil horizons.
Specifically, my dataset has unbalanced group sizes and consists of 10 samples at depth D1, 12 samples at depth D2 and 12 samples at depth D3.

  1. My OTU table was converted to percentages, giving the relative abundance of each feature per sample.

  2. The OTU table was then centered log-ratio (CLR) transformed with:

bac_otutab.t.log <- logratio.transfo(bac_otu_count.t, logratio = "CLR", offset = 1)

Then I performed the PLS-DA:

Y <- cl_depth
bac.plsX <- plsda(bac_otutab.t.log, Y, ncomp = 10)

color.per.group <- c("darkgreen", "darkorange", "darkviolet") # assign colors to Y groups

bac_pls_2d <- plotIndiv(bac.plsX, comp = 1:2, cex = 2,
                        pch = 16, ellipse = TRUE,
                        ind.names = FALSE, col = color.per.group,
                        ellipse.level = .8, star = TRUE, legend = TRUE, centroid = TRUE,
                        title = "PLS-DA: ZOTUs ~ Depth") # 2D plot of PLS-DA

[Figure: PLS-DA sample plot]

  3. I then wanted to assess the performance of the classification:

set.seed(999)
MyPerf.bac.plsX <- perf(bac.plsX, validation = "Mfold", 
                        folds = 5, 
                        progressBar = FALSE, auc = TRUE,
                        nrepeat = 50, cpus = 8) # we suggest nrepeat = 50-100
plot(MyPerf.bac.plsX, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")

[Figure: classification error rate by number of components]

My question is about the classification error rate, which seems to increase with the number of components.
I don’t understand how this is possible, and I am wondering whether it is a data problem. Is there a conceptual mistake I don’t see, or is it just a mistake in the script/workflow?

Thanks in advance for your assistance,
Cheers, Alberto.

I believe this means that you should only keep the first component.

I saw something similar in my own analysis, where the error rate was around 85%.

Thank you for your reply.
I understand what the plot is showing, but I don’t see the logic behind it.
Since this is a multivariate analysis, I don’t see how I can keep only one component.

To give you some more details: my OTU matrix has about 4000 features, so I was wondering whether this calls for a different setup, such as the number of folds or nrepeat in my perf() call.

Hi @pascale.alb,

After 1 component, the classification becomes worse (potentially because all 4,000 variables are too noisy to discriminate further). From your plot, I see that component 1 can discriminate the 3 groups, but adding a second component potentially does not contribute much more to the discrimination. Note that perf() reports results from cross-validation (i.e. train / test), whereas the sample plot shows what happens on the entire data set. The M-fold settings seem to be adequate.
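
To make the difference between the cross-validated error and the fit on the entire data set concrete, here is a minimal sketch on simulated data (not the data from this thread). The matrix `X`, the group sizes, the number of variables and the small `nrepeat` are all assumptions chosen for illustration. It compares the error obtained when predicting the same samples used for fitting with the error estimated by perf().

```r
library(mixOmics)

set.seed(999)

# Hypothetical data: 34 samples in 3 depth groups (10/12/12), 500 noisy variables
n <- 34
p <- 500
Y <- factor(rep(c("D1", "D2", "D3"), times = c(10, 12, 12)))
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(paste0("s", 1:n), paste0("otu", 1:p)))

# Assumed signal: a single depth gradient carried by the first 20 variables
X[Y == "D2", 1:20] <- X[Y == "D2", 1:20] + 1.5
X[Y == "D3", 1:20] <- X[Y == "D3", 1:20] + 3

fit <- plsda(X, Y, ncomp = 5)

# Error on the entire data set: predict the samples the model was fitted on
pred <- predict(fit, newdata = X, dist = "max.dist")
train.err <- colMeans(pred$class$max.dist != as.character(Y))

# Cross-validated error: each sample is predicted by a model that did not see it
cv <- perf(fit, validation = "Mfold", folds = 5, nrepeat = 5,
           dist = "max.dist", progressBar = FALSE)
cv.err <- cv$error.rate$overall[, "max.dist"]

# One row per number of components
print(data.frame(train = train.err, cv = cv.err))
```

The error on the fitted samples tends to look optimistic as components are added, while the cross-validated error can stall or rise once extra components mostly capture noise, which would be consistent with the curve in the original post.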

You can now try the same with a sparse PLS-DA for variable selection (the first 3 components should be enough), as shown in: http://mixomics.org/case-studies/splsda-srbct/ or https://mixomicsteam.github.io/Bookdown/plsda.html. It might help.
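
As a rough sketch of what that could look like, the snippet below tunes the number of variables kept per component with tune.splsda() and then fits splsda(). It runs on simulated data rather than the OTU table from this thread, and the `test.keepX` grid, the number of variables and the small `nrepeat` are illustrative assumptions only; on real data you would substitute the CLR-transformed matrix and a larger `nrepeat`.

```r
library(mixOmics)

set.seed(999)

# Hypothetical data: 34 samples in 3 depth groups (10/12/12), 200 variables
n <- 34
p <- 200
Y <- factor(rep(c("D1", "D2", "D3"), times = c(10, 12, 12)))
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(paste0("s", 1:n), paste0("otu", 1:p)))

# Assumed signal: a few variables separate D2 and D3 from the rest
X[Y == "D2", 1:5]  <- X[Y == "D2", 1:5] + 2
X[Y == "D3", 6:10] <- X[Y == "D3", 6:10] + 2

# Tune how many variables to keep on each of the first 3 components
tune.res <- tune.splsda(X, Y, ncomp = 3,
                        validation = "Mfold", folds = 5,
                        dist = "max.dist", measure = "BER",
                        test.keepX = c(5, 10, 25, 50),  # illustrative grid
                        nrepeat = 5, progressBar = FALSE)
keepX <- tune.res$choice.keepX

# Final sparse model using the tuned number of variables per component
fit.sparse <- splsda(X, Y, ncomp = 3, keepX = keepX)

# Names of the variables selected on component 1
selected <- selectVar(fit.sparse, comp = 1)$name
print(keepX)
print(selected)
```

Running perf() on `fit.sparse` afterwards gives an error curve that can be compared with the non-sparse model.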

Kim-Anh