High error rate even when more components are included in sPLS-DA


I am following the "Case study: HMP bodysites repeated measures" tutorial for a microbiome study with amplicon sequencing. I have 3 dietary groups (S, B and Y, where S is the control), with 15 cows in the control group and 16 cows in each of the other two groups. Each cow was sampled at 3 time points, and all cows were fed the control diet at the first time point (W2).

First, the PCA showed that inter-individual variation was stronger than the variation over time, so I applied the multilevel approach from the HMP tutorial: the outcome is the combination of dietary group and time point (FeedWeek), and cowID is the unique ID of each animal for the multilevel analysis.
I want to compare the archaeal composition of the treatment groups and the control group at the different time points, and select the discriminative ASVs with sPLS-DA. Although all samples at the first time point come from the same diet, I have labelled them with their assigned treatment group to see the differences at the beginning. This gives 9 groups in total, with 141 samples and 126 ASVs after pre-processing.
Hope this is the correct approach for our design.

I ran the sPLS-DA tuning to choose the keepX and ncomp parameters, but the BER was still very high at the 8th component, and the error rate on the first component increased with the number of selected features. The output from plot(diverse.tune.splsda):

And the error rates from perf (I increased nrepeat to 1000, as suggested in another thread, "Help understanding high error rate using PLS-DA"):

comp1 0.8183759
comp2 0.7978511
comp3 0.7729362
comp4 0.7498298
comp5 0.7258652
comp6 0.7194965
comp7 0.7190780
comp8 0.7155177

comp1 0.8215755
comp2 0.8002287
comp3 0.7753051
comp4 0.7524125
comp5 0.7283907
comp6 0.7219759
comp7 0.7214231
comp8 0.7177278

I wonder whether this approach is appropriate and what causes such a high error rate. Could it be due to the similarity between the groups, given that they did not separate much from each other on the sPLS-DA plot of components 1 and 2?

I am quite new to mixOmics and hope my questions make sense to you. Looking forward to your feedback.

Thank you very much for your time!


Dear @ozg.umu,
I agree that the performance of the sPLS-DA is not great at all. I suspect it is (also) because you are asking it to discriminate the two control groups in W2, when there should not be any difference between the groups.
I would first break the problem down into subproblems to understand why the discrimination task seems so difficult. For example, focus on time point #2 alone and discriminate the 3 groups, do the same for time point #3, and also for time point #1, but with 2 groups only.
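To put the reported error rates in context, it may help to compare them with what an uninformative classifier would score. The balanced error rate (BER) averages the per-class error rates, so with 9 classes a classifier that always predicts the same class is wrong on 8 of them and scores 8/9, about 0.89. The sketch below is plain Python rather than mixOmics code; the class labels are made up for illustration, and the class sizes assume that every cow contributes one sample per time point.

```python
from collections import defaultdict


def balanced_error_rate(y_true, y_pred):
    """Average of the per-class error rates, so every class counts equally."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if pred != truth:
            errors[truth] += 1
    return sum(errors[c] / totals[c] for c in totals) / len(totals)


# Hypothetical labels: 3 diets x 3 time points = 9 classes.
classes = [f"{diet}_T{t}" for diet in ("S", "B", "Y") for t in (1, 2, 3)]
# Assumed class sizes: 15 control cows (S), 16 cows each for B and Y.
sizes = {c: (15 if c.startswith("S") else 16) for c in classes}
y_true = [c for c in classes for _ in range(sizes[c])]

# An uninformative classifier that always predicts the first class.
y_pred = [classes[0]] * len(y_true)

chance_ber = balanced_error_rate(y_true, y_pred)
print(round(chance_ber, 4))  # 0.8889
```

Seen this way, a BER of roughly 0.72 on 8 components is better than this baseline, but only modestly, which is consistent with groups that overlap heavily.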
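As a rough illustration of the breakdown, the sketch below splits a sample sheet into one subset per time point. It is plain Python, not mixOmics code: the cow identifiers are invented, the group sizes are taken from the question, and which groups to keep within each subset is a design decision the sketch does not make.

```python
from collections import Counter, defaultdict

# Group sizes from the question: 15 control cows (S), 16 each for B and Y.
GROUP_SIZES = {"S": 15, "B": 16, "Y": 16}
TIME_POINTS = (1, 2, 3)  # time point 1 corresponds to W2 in the question


def make_samples():
    """Build hypothetical (cow_id, diet, time_point) annotations."""
    samples = []
    for diet, n_cows in GROUP_SIZES.items():
        for i in range(1, n_cows + 1):
            cow_id = f"{diet}{i:02d}"  # made-up identifier
            for t in TIME_POINTS:
                samples.append((cow_id, diet, t))
    return samples


def split_by_time_point(samples):
    """Return one list of (cow_id, diet) pairs per time point."""
    subsets = defaultdict(list)
    for cow_id, diet, t in samples:
        subsets[t].append((cow_id, diet))
    return dict(subsets)


samples = make_samples()
subproblems = split_by_time_point(samples)
for t, subset in sorted(subproblems.items()):
    counts = Counter(diet for _, diet in subset)
    print(f"time point {t}: {len(subset)} samples, groups {dict(counts)}")
```

Under these assumptions each subset holds 47 samples, with each cow appearing exactly once, so there are no repeated measures within a single time point.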

The multilevel approach might not be the best choice here if you are interested in differences between both groups and time points. If time is also of interest, you could either proceed as above (suboptimal) or look at other approaches that focus on clustering time profiles within a particular group of individuals, see:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0134540 with the lmms package
https://www.frontiersin.org/articles/10.3389/fgene.2019.00963/full a follow up on that last paper.