PLS-DA classification error rate

pascale.alb · June 1, 2020, 11:07pm

Hi mixOmics team,

I am analyzing 16S amplicon microbiome data from soil samples.
My intention is to classify my samples based on the sampling depth of the soil horizons.
Specifically, my dataset has unbalanced group sizes and consists of 10 samples at depth D1, 12 samples at depth D2 and 12 samples at depth D3.

My OTU table was converted to percentages, giving the relative abundance of each feature per sample.
The OTU table was then log-transformed and centered (CLR) with the function

bac_otutab.t.log <- logratio.transfo(bac_otu_count.t, logratio = "CLR", offset = 1)
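For reference, a minimal base R sketch of what a centered log-ratio transform computes under its standard definition: take logs after adding the offset, then subtract each sample's mean log value. The toy count table and the name `clr_sketch` are hypothetical and only for illustration; this is not the internal code of `logratio.transfo`.

```r
# Hypothetical toy count table: 3 samples (rows) x 4 OTUs (columns)
counts <- matrix(c(10,  0, 5, 85,
                   20, 30, 0, 50,
                    1,  2, 3, 94),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("S", 1:3), paste0("OTU", 1:4)))

# Illustrative helper (not part of mixOmics)
clr_sketch <- function(x, offset = 1) {
  log_x <- log(x + offset)   # the offset avoids log(0) for absent OTUs
  log_x - rowMeans(log_x)    # subtract each sample's mean log value
}

clr_counts <- clr_sketch(counts, offset = 1)

# After centering, every sample (row) sums to zero, up to rounding
round(rowSums(clr_counts), 10)
```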

Then I performed my PLS-DA

Y= cl_depth
bac.plsX = plsda(bac_otutab.t.log, Y, ncomp = 10) 

color.per.group = c("darkgreen","darkorange","darkviolet") # assign colors to Y groups

bac_pls_2d <- plotIndiv(bac.plsX, comp = 1:2, cex = 2, 
                        pch = 16, ellipse = T, 
                        ind.names = F, col = color.per.group, 
                        ellipse.level = .8,  star = T, legend = T, centroid = T, 
                        title = "PLS-DA: ZOTUs ~ Depth") # 2D plot of PLS_DA

[Figure: plsda]

I wanted to assess the performance of the classification

set.seed(999)
MyPerf.bac.plsX <- perf(bac.plsX, validation = "Mfold", 
                        folds = 5, 
                        progressBar = FALSE, auc = TRUE,
                        nrepeat = 50, cpus = 8) # we suggest nrepeat = 50-100

plot(MyPerf.bac.plsX, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")

[Figure: Classification_error]

My question is about the classification error rate, which seems to increase with the number of components.
I don’t understand how this is possible, and I am wondering whether it is a data problem. Is there a conceptual mistake I don’t see, or is it just a mistake in the scripts/workflow?

Thanks in advance for your assistance,
Cheers, Alberto.

strkiky · June 2, 2020, 2:54am

I believe this means that you should only keep the first component.

I had something similar, where I was looking at an 85% error rate.

pascale.alb · June 3, 2020, 12:31pm

Thank you for your reply.
I understand what the plot is showing, but I don’t see the logical reason behind it.
Since this is a multivariate analysis, I don’t see how I can keep only one component.

To give you some more details: my OTU matrix has about 4000 features, so I was wondering whether this calls for a different setup, such as the number of folds or nrepeat in my perf() function.

kimanh.lecao · June 4, 2020, 12:01am

Hi @pascale.alb,

After 1 component, the classification becomes worse (potentially because the 4,000 variables taken together are too noisy to discriminate any further). From your plot, I see that component 1 can discriminate the 3 groups, but adding a second component does not appear to contribute much to the discrimination. Note that perf() shows the results obtained with cross-validation (i.e. train / test), whereas the plot shows what happens on the entire data set. The M-fold setup seems to be adequate.
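The train / test distinction can be illustrated with a small base R simulation. This is only a sketch under loud assumptions: the data are simulated pure noise (no real depth effect), the classifier is a simple nearest-centroid rule rather than PLS-DA, and the folds are not stratified. The helper `predict_centroid` is hypothetical. It compares the error measured on the data used for fitting with the error measured on held-out folds.

```r
set.seed(999)

# Simulated data (assumption): same group sizes as in this thread,
# 4000 features, and NO real difference between the groups.
y <- factor(rep(c("D1", "D2", "D3"), times = c(10, 12, 12)))
x <- matrix(rnorm(length(y) * 4000), nrow = length(y))

# Nearest-centroid rule (stand-in classifier, not PLS-DA):
# assign each test sample to the class whose training mean is closest.
predict_centroid <- function(x_train, y_train, x_test) {
  classes <- levels(y_train)
  d <- matrix(NA_real_, nrow = nrow(x_test), ncol = length(classes))
  for (j in seq_along(classes)) {
    centroid <- colMeans(x_train[y_train == classes[j], , drop = FALSE])
    d[, j] <- rowSums(sweep(x_test, 2, centroid)^2)  # squared Euclidean distance
  }
  factor(classes[apply(d, 1, which.min)], levels = classes)
}

# Error on the same samples that were used to compute the centroids
train_err <- mean(predict_centroid(x, y, x) != y)

# 5-fold cross-validated error: each sample is predicted from the other folds
folds <- sample(rep(1:5, length.out = length(y)))
cv_pred <- factor(rep(NA, length(y)), levels = levels(y))
for (f in 1:5) {
  test <- folds == f
  cv_pred[test] <- predict_centroid(x[!test, , drop = FALSE], y[!test],
                                    x[test, , drop = FALSE])
}
cv_err <- mean(cv_pred != y)

c(train = train_err, cv = cv_err)
```

With many noisy features and few samples, the error on the fitted data is optimistic, while the cross-validated error reflects how well new samples would be classified. The same logic explains why a sample plot drawn on the full data set can look cleaner than the perf() curve suggests.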

You can now try the same with a sparse PLS-DA for variable selection (the first 3 components should be enough), as shown in: http://mixomics.org/case-studies/splsda-srbct/ or https://mixomicsteam.github.io/Bookdown/plsda.html. It might help.

Kim-Anh
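A minimal sketch of the sparse PLS-DA suggestion, assuming the `srbct` example data shipped with mixOmics as a stand-in for the CLR-transformed OTU matrix and the depth factor. The `test.keepX` grid and the small `nrepeat` are illustrative choices to keep the run short, not recommendations.

```r
library(mixOmics)

# Stand-in data (assumption): replace X and Y with your own
# CLR-transformed matrix and depth factor.
data(srbct)
X <- srbct$gene
Y <- srbct$class

set.seed(999)
# Cross-validated tuning of the number of variables to keep per component
tune.res <- tune.splsda(X, Y, ncomp = 3,
                        validation = "Mfold", folds = 5, nrepeat = 5,
                        dist = "max.dist", measure = "BER",
                        test.keepX = c(5, 10, 25, 50, 100),  # illustrative grid
                        progressBar = FALSE)

keepX <- tune.res$choice.keepX   # tuned number of variables for each component

# Final sparse model using the tuned keepX values
final.splsda <- splsda(X, Y, ncomp = 3, keepX = keepX)

# Names of the variables selected on component 1
selected.comp1 <- selectVar(final.splsda, comp = 1)$name
head(selected.comp1)
```

The performance of the final sparse model can then be assessed with perf() in the same way as for the full PLS-DA above, to check whether the error rate still rises after the first component once the noisy variables are dropped.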
