PLSDA on small sample size, and OPLSDA


I was hoping you would be willing to answer a few of my questions. I am currently trying to work out how to perform cross-validation on my PLS-DA model, and whether I should be using OPLS-DA instead. Our setup is as follows:

Metabolomics Analysis of 1 tissue


My question is, since I only have 5 samples in each group, is it even possible to perform a meaningful cross-validation?

My PCA looks fairly good, but the 3D PCA shows more separation between groups than the 2D PCA does. So I used the mixOmics package to perform PLS-DA to see if I could tease out this separation on a 2D plot, but now I am not sure how to validate my model, since I have so few replicates. If I wanted to publish the results, how would I justify not having performed cross-validation to address the risk of overfitting that comes with the supervised nature of PLS-DA?

Also, to perform PLS-DA I had to treat the groups as 6 independent groups, which is obviously not ideal. Someone suggested I could use OPLS-DA instead? Please note, our age groups are technically independent of each other because mice are culled to collect the tissue.

Any information would be greatly appreciated.


Thanks for describing your data with your diagram, that makes my life easier.

1 - PCA: you have 6 sample groups that you would like to discriminate. I would assume that the largest source of variation is treatment vs control, and then days within each treatment group. It is very unlikely that PLS-DA can separate them in fewer than 3 components; I think OPLS-DA sometimes can.
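If the separation only shows up on the third component, one option is to plot pairs of components rather than a single 2D view. A minimal sketch, assuming simulated placeholder data (the `X` and `Y` below are random numbers standing in for your metabolite matrix and group factor, not your data) and 3 components:

```r
library(mixOmics)

set.seed(1)
# Placeholder data (assumption): 30 samples x 200 features, 6 groups of 5
X <- matrix(rnorm(30 * 200), nrow = 30,
            dimnames = list(paste0("s", 1:30), paste0("m", 1:200)))
Y <- factor(rep(paste0("group", 1:6), each = 5))

# Unsupervised and supervised fits, both with 3 components
pca.res   <- pca(X, ncomp = 3, scale = TRUE)
plsda.res <- plsda(X, Y, ncomp = 3)

# Look at components 1 vs 2, then 1 vs 3, instead of a 3D plot
plotIndiv(pca.res, comp = c(1, 2), group = Y, legend = TRUE,
          title = "PCA, comp 1-2")
plotIndiv(plsda.res, comp = c(1, 3), legend = TRUE,
          title = "PLS-DA, comp 1-3")
```

With random placeholder data, any apparent PLS-DA separation is itself a reminder of why the model needs validating.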

2 - PLS-DA, and especially sparse PLS-DA (sPLS-DA), might be able to tell you which metabolic features discriminate the different groups. You can still do repeated cross-validation. With 30 samples, perhaps use folds = 3 with nrepeat = 50, or at the extreme, validation = 'loo' (leave-one-out). You can also explore the effect of feature selection. The tune() function might be a bit limited with so few samples but is worth a try; otherwise just set the number of variables to select to, say, 10 or 50 and see what you get. perf() would then give you the final performance of this model. (OPLS-DA cannot do feature selection, so this is an advantage of sPLS-DA.)
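The settings above could look roughly like this. It is only a sketch: `X` and `Y` are simulated placeholders (assumption), and `ncomp = 3` with `keepX = 10` per component is an illustrative choice, not a recommendation for your data.

```r
library(mixOmics)

set.seed(42)
# Placeholder data (assumption): 30 samples x 200 features, 6 groups of 5
X <- matrix(rnorm(30 * 200), nrow = 30,
            dimnames = list(paste0("s", 1:30), paste0("m", 1:200)))
Y <- factor(rep(paste0("group", 1:6), each = 5))

# Optional: let tune.splsda() pick the number of variables per component
tuned <- tune.splsda(X, Y, ncomp = 3, validation = "Mfold", folds = 3,
                     nrepeat = 50, test.keepX = c(10, 50),
                     dist = "max.dist", measure = "BER",
                     progressBar = FALSE)
tuned$choice.keepX

# Or fix the number of selected variables by hand
fit <- splsda(X, Y, ncomp = 3, keepX = c(10, 10, 10))

# Repeated 3-fold cross-validation
cv <- perf(fit, validation = "Mfold", folds = 3, nrepeat = 50,
           progressBar = FALSE)
cv$error.rate$BER    # balanced error rate per component and distance

# The extreme: leave-one-out
loo <- perf(fit, validation = "loo", progressBar = FALSE)
loo$error.rate$BER
```

On random data like this the balanced error rate should sit roughly at chance (about 5/6 for six balanced groups), which gives you a baseline to compare your real model against.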

3 - Usually we do a multilevel analysis with repeated measurements to remove the individual variation. This is not the case here, but you could potentially further explore the effect of day by looking at only one treatment at a time (e.g. the first part of the paper describes the overall difference between treatments, and then narrows down to within each treatment). I am not a specialist in OPLS-DA so I can’t really comment on it.

I hope that helps,