Multilevel PLS-DA: Avoid overfitting on small sample size experiment

Hello @mbonhomme!

I’ll address a few different points in order of your post.

My first component explains 28% of the variance and the second 21% (vs. 19% and 15% without multilevel; using the multilevel seems to help my analysis)

My first piece of advice is to be a little wary of using the variance captured as a measure of efficacy. In your case, this seems fine. However, for future reference, don’t rely on this metric as the “be-all-end-all” for measuring the method’s ability to analyse your data.

what would be the best way to perform the multilevel PLSDA in order to avoid overfitting?

This may not be what you want to hear, but overfitting is largely unavoidable with such a small sample size. This doesn’t mean your model can’t elucidate some useful information, but it will be far from generalisable.

Should I use as a training dataset? If yes, what should I use as a training dataset?

If you have the time, you could implement a simple form of Leave-One-Out CV: hold out one sample for testing, fit the model on the rest, repeat for each sample in turn, and average the results. Otherwise, I’d recommend using all samples as part of your model due to the low sample size.

With this small sample size and lots of variables, would you use the minimal code (default values for selected number of component and variable) or would you go for variable selection?

You will definitely want to use variable selection here. The error you reported (system is computationally singular) is commonly caused by a matrix with too few samples and too many features.

On the perf() function, would you recommand using folds = 5, nrepeat = 10, or more? or use Leave-One-Out (LOO) validation? I would use the same for tune.splsda.

You can’t have 5 folds as you only have 6 samples. LOO CV is the way to go here.

How should I choose the parameters on the list.keepX?

Your current selected value (c(1:10, seq(20, 300, 10))) is appropriate. If you want some more direction on this, read my response on this linked post.

I am not sure what I should use as dist and measure, any recommandations?

Generally, centroids.dist will be the most appropriate, but this is NOT a universal rule. The answer to this question is totally data dependent. You will have to experiment with the distances to see which works best for your data.
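
As a rough illustration of comparing the distances under LOO CV, here is a minimal sketch. It uses the vac18 example data that ships with mixOmics in place of your data, and the keepX values are arbitrary placeholders rather than tuned choices. Applying withinVariation() before perf() is an assumption on my part about how to handle the multilevel design here, so treat it as one possible approach, not the definitive one.

```r
library(mixOmics)

# Example repeated-measures data shipped with mixOmics;
# swap in your own X, Y and subject IDs.
data(vac18)
X <- vac18$genes
Y <- vac18$stimulation
design <- data.frame(sample = vac18$sample)

# Multilevel decomposition: keep only the within-subject variation
Xw <- withinVariation(X, design = design)

# Small sPLS-DA model; keepX values are placeholders, not tuned
model <- splsda(Xw, Y, ncomp = 2, keepX = c(20, 20))

# Leave-one-out CV, evaluating every prediction distance at once
cv <- perf(model, validation = "loo", dist = "all", progressBar = FALSE)

# Balanced error rate: one row per component, one column per distance
print(cv$error.rate$BER)
```

Whichever distance gives the lowest error rate here is a reasonable starting point, bearing in mind that with very few samples these estimates will be noisy.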

Should I use cpus?

No, the cpus parameter is deprecated. Have a look at the BPPARAM parameter instead, which uses the BiocParallel package.
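
To show roughly what passing BPPARAM looks like, here is a hedged sketch of a tune.splsda() call with LOO validation. It assumes a mixOmics version recent enough to accept BPPARAM, uses the vac18 example data as a stand-in for yours, and uses a shortened test.keepX grid purely to keep the example quick. The choice of SnowParam with 2 workers is also just an illustration; other BiocParallel backends may suit your machine better.

```r
library(mixOmics)
library(BiocParallel)

# Example repeated-measures data shipped with mixOmics
data(vac18)
X <- vac18$genes
Y <- vac18$stimulation
design <- data.frame(sample = vac18$sample)

tuned <- tune.splsda(
  X, Y,
  ncomp = 2,
  test.keepX = c(5, 10, 20, 50),    # shortened grid for illustration only
  validation = "loo",               # leave-one-out instead of M-fold
  dist = "centroids.dist",
  measure = "BER",
  multilevel = design,              # subject IDs for the multilevel design
  BPPARAM = SnowParam(workers = 2)  # replaces the old cpus argument
)

# Number of variables selected on each component
print(tuned$choice.keepX)
```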

Hope this all helps a bit!
