Multilevel PLS-DA - avoiding overfitting in a small sample size experiment

Dear mixOmics Team,

Thank you for the very nice packages!
As an R beginner, I really enjoy your website, tutorials and forum.

I am performing a metabolomics analysis on a small sample size (n = 6), with repeated measures (2 time points). I am working with around 4,000 features as variables.
I would like to perform a multivariate analysis to select metabolites of interest between the 2 time points.

I first performed a multilevel PCA to see if time could separate the samples. My first component explains 28% of the variance and the second 21% (vs. 19% and 15% without multilevel; using the multilevel approach seems to help my analysis).
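For reference, the multilevel PCA call looked roughly like this (a sketch only; X is my metabolite matrix and Y the time point, as in the code below, and subject stands for a vector of subject IDs, one per sample):

library(mixOmics)
# repeated-measures design, e.g. rep(1:6, each = 2) for 6 subjects x 2 time points
MyPCA.multilevel <- pca(X, ncomp = 2, multilevel = subject)
plotIndiv(MyPCA.multilevel, group = Y, legend = TRUE)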

I now want to perform a PLS-DA to assess which metabolites could discriminate the time condition. sPLS-DA seems appropriate for selecting my metabolites of interest.
Because of the repeated-measures design on the same subjects, I need to do a multilevel PLS-DA.
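From what I understand of the documentation, the repeated-measures design can be passed directly via the multilevel argument (a sketch, reusing the subject vector from above):

MyPLSDA.multilevel <- plsda(X, Y, ncomp = 2, multilevel = subject)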

My question is: with this small sample size and the repeated measures, what would be the best way to perform the multilevel PLS-DA in order to avoid overfitting?
Should I use a training dataset? If yes, what should I use as the training set?

With this small sample size and lots of variables, would you use the minimal code (default values for the number of components and variables) or would you go for variable selection?

If we go for variable selection: on the perf() function, would you recommend using folds = 5, nrepeat = 10, or more? Or Leave-One-Out (LOO) validation? I would use the same settings for tune.splsda().

Thanks in advance for any input on this topic,

Regards,
Maëlle

EDIT: Here is the code I tried to run before trying the multilevel 🙂

X <- datatranspo[, 4:4357]   # metabolite intensity matrix (~4,350 features)
Y <- datatranspo$Temps       # time point (2 levels)

MyPLSDA <- plsda(X, Y, ncomp = 10)

MyPerf.plsda <- perf(MyPLSDA, validation = "Mfold", folds = 3, progressBar = FALSE, nrepeat = 50)

=> returns "Error in solve.default(Sr) : system is computationally singular: reciprocal condition number = 1.58286e-16"

plot(MyPerf.plsda, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")

MyPerf.plsda$choice.ncomp

list.keepX <- c(1:10, seq(20, 300, 10))

=> How should I choose the parameters in list.keepX?

list.keepX

tunePLSDA <- tune(method = "splsda", X, Y, ncomp = ?, test.keepX = list.keepX, validation = "Mfold",
                  folds = 3, progressBar = FALSE, dist = "?", measure = "overall", nrepeat = 50, cpus = 2)

=> I am not sure what I should use as dist and measure, any recommendations?
Should I use cpus?

error <- tunePLSDA$error.rate
ncomp <- tunePLSDA$choice.ncomp$ncomp

select.keepX <- tunePLSDA$choice.keepX[1:ncomp]

MyResult.splsda.final <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX)

plotIndiv(MyResult.splsda.final, group = Y, ind.names = FALSE, legend = TRUE,
          ellipse = TRUE, title = "sPLS-DA - final result")

perf.splsda <- perf(MyResult.splsda.final,
                    folds = 5, nrepeat = 10,   # use repeated cross-validation
                    validation = "Mfold", dist = "max.dist",   # use max.dist measure
                    progressBar = FALSE)


Hello @mbonhomme!

I'll address a few different points in the order they appear in your post.

My first component explains 28% of the variance and the second 21% (vs. 19% and 15% without multilevel; using the multilevel approach seems to help my analysis)

My first piece of advice is to be a little wary of using the variance explained as a measure of efficacy. In your case, this seems fine. However, for future reference, don't rely on this metric as the be-all and end-all for measuring the method's ability to analyse your data.

what would be the best way to perform the multilevel PLS-DA in order to avoid overfitting?

This may not be what you want to hear, but overfitting is quite unavoidable with such a small sample size. This doesn't mean your model can't elucidate some useful information, but it will be far from generalisable.

Should I use a training dataset? If yes, what should I use as the training set?

If you have the time, you could implement a simple form of leave-one-out CV yourself: use one sample for testing, iterate through each sample, and average the results. Otherwise, I'd recommend using all samples in your model due to the low sample size.
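As a rough sketch of that idea (illustrative only: it ignores the multilevel structure for brevity, and ncomp = 2 / keepX = c(50, 50) are placeholder values, not recommendations):

n <- nrow(X)
pred.class <- character(n)
for (i in 1:n) {
  # train on all samples except sample i
  fit <- splsda(X[-i, ], Y[-i], ncomp = 2, keepX = c(50, 50))
  # predict the class of the held-out sample
  pred <- predict(fit, newdata = X[i, , drop = FALSE], dist = "max.dist")
  pred.class[i] <- pred$class$max.dist[, 2]  # class call using both components
}
mean(pred.class != Y)  # overall LOO error rate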

With this small sample size and lots of variables, would you use the minimal code (default values for the number of components and variables) or would you go for variable selection?

You will definitely want to use variable selection here. The error you encountered (system is computationally singular) is commonly caused by a matrix with too few samples and too many features.

On the perf() function, would you recommend using folds = 5, nrepeat = 10, or more? Or Leave-One-Out (LOO) validation? I would use the same settings for tune.splsda().

You can't have 5 folds as you only have 6 samples. LOO CV is the way to go here.
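In perf() (and tune.splsda()) this is just a matter of setting validation = "loo"; nrepeat is then unnecessary since the LOO folds are deterministic. For example:

MyPerf.plsda <- perf(MyPLSDA, validation = "loo", progressBar = FALSE)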

How should I choose the parameters in list.keepX?

Your currently selected grid (c(1:10, seq(20, 300, 10))) is appropriate. If you want some more direction on this, read my response on this linked post.
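One common refinement (illustrative numbers only) is a two-pass grid: tune on your coarse grid first, then re-tune on a finer grid around the selected value:

coarse.keepX <- c(1:10, seq(20, 300, 10))
# suppose the first pass selects keepX = 50 on component 1;
# a second pass can then zoom in around that value:
fine.keepX <- seq(40, 60, 2)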

I am not sure what I should use as dist and measure, any recommendations?

Generally, centroids.dist will be the most appropriate, but this is NOT a universal rule. The answer to this question is entirely data dependent; you will have to experiment to see which works best for your data.
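One practical way to compare them: perf() evaluates all three distances at once, so you can inspect each one's error rate before committing to one (using the MyPerf.plsda object from above):

# rows = number of components; columns = max.dist / centroids.dist / mahalanobis.dist
MyPerf.plsda$error.rate$overall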

Should I use cpus?

No, the cpus parameter is deprecated. Have a look at the BPPARAM parameter instead, which utilises the BiocParallel package.
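For example (assuming a recent mixOmics version; SnowParam with 2 workers is purely illustrative):

library(BiocParallel)
tunePLSDA <- tune.splsda(X, Y, ncomp = 2, test.keepX = list.keepX,
                         validation = "loo", progressBar = FALSE,
                         BPPARAM = SnowParam(workers = 2))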

Hope this all helps a bit!
