Dear mixOmics Team,
Thank you for the very nice packages.
As an R beginner, I really enjoy your website, tutorials and forum.
I am performing a metabolomic analysis on a small sample size (n=6), with repeated measures (2 time points). I am working with around 4,000 features as variables.
I would like to perform a multivariate analysis to select the metabolites of interest that differ between the 2 time points.
I first performed a multilevel PCA to see if time could separate the samples. My first component explains 28% of the variance and the second 21% (vs. 19% and 15% without multilevel; using the multilevel seems to help my analysis).
I now want to perform a PLS-DA to assess which metabolites could discriminate between the time points. The sPLS-DA seems appropriate for selecting my metabolites of interest.
Because of the design, with repeated measures on the same subjects, I need to do a multilevel PLS-DA.
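To make clear what I mean by "multilevel": my understanding is that the decomposition amounts to removing each subject's mean from that subject's samples before the model is fitted. Below is a minimal sketch on simulated data, not my real data. The names X_sim, subject and time are made up for illustration, and the splsda() call with multilevel = design is what I understood from the mixOmics documentation (it only runs if the package is installed).

```r
set.seed(1)

# Simulated stand-in for my design: 6 subjects, 2 time points, 50 features
n_subj  <- 6
p       <- 50
subject <- factor(rep(seq_len(n_subj), times = 2))   # subject ID of each row
time    <- factor(rep(c("T1", "T2"), each = n_subj)) # outcome of each row

# Large subject-specific offsets plus noise, and a time effect on 5 features
offsets <- matrix(rnorm(n_subj * p, sd = 3), n_subj, p)
X_sim   <- offsets[as.integer(subject), ] +
  matrix(rnorm(2 * n_subj * p), 2 * n_subj, p)
X_sim[time == "T2", 1:5] <- X_sim[time == "T2", 1:5] + 2
dimnames(X_sim) <- list(paste0("S", subject, "_", time),
                        paste0("feat", seq_len(p)))

# Within-subject part: subtract each subject's mean profile from its rows
subj_means <- rowsum(X_sim, group = subject) / as.vector(table(subject))
X_within   <- X_sim - subj_means[as.character(subject), ]

# After centering, every subject's rows sum to (numerically) zero per feature
max(abs(rowsum(X_within, group = subject)))

# Assumed mixOmics usage: pass the subject IDs through the multilevel argument
if (requireNamespace("mixOmics", quietly = TRUE)) {
  design <- data.frame(sample = subject)
  fit <- mixOmics::splsda(X_sim, time, ncomp = 2, keepX = c(10, 10),
                          multilevel = design)
}
```

If this is right, the between-subject offsets no longer contribute to the components, which would explain why the multilevel PCA explained more variance on its first two components.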
My question is: with this small sample size and the repeated measures, what would be the best way to perform the multilevel PLS-DA in order to avoid overfitting?
Should I use a training dataset? If yes, what should I use as the training dataset?
With this small sample size and lots of variables, would you use the minimal code (default values for the number of components and the number of selected variables) or would you go for variable selection?
If we go for variable selection: in the perf() function, would you recommend using folds = 5, nrepeat = 10, or more? Or would you use leave-one-out (LOO) validation? I would use the same settings for tune.splsda.
Thanks in advance for any input on this topic,
Regards,
Maëlle
EDIT: Here is the code I tried to run before trying the multilevel:
X <- datatranspo[, 4:4357]
Y <- datatranspo$Temps
MyPLSDA <- plsda(X, Y, ncomp = 10)
MyPerf.plsda <- perf(MyPLSDA, validation = "Mfold", folds = 3, progressBar = FALSE, nrepeat = 50)
=> returns "Error in solve.default(Sr) : system is computationally singular: reciprocal condition number = 1.58286e-16"
plot(MyPerf.plsda, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")
MyPerf.plsda$choice.ncomp
list.keepX <- c(1:10, seq(20, 300, 10))
=> How should I choose the values in list.keepX?
list.keepX
tunePLSDA <- tune.splsda(X, Y, ncomp = ?, test.keepX = list.keepX, validation = "Mfold", folds = 3,
                         progressBar = FALSE, dist = "?", measure = "overall", nrepeat = 50, cpus = 2)
=> I am not sure what I should use for dist and measure, any recommendations?
Should I use cpus?
error <- tunePLSDA$error.rate
ncomp <- tunePLSDA$choice.ncomp$ncomp
select.keepX <- tunePLSDA$choice.keepX[1:ncomp]
MyResult.splsda.final <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX)
plotIndiv(MyResult.splsda.final, group = Y, ind.names = FALSE, legend = TRUE,
          ellipse = TRUE, title = "sPLS-DA - final result")
perf.splsda <- perf(MyResult.splsda.final,
                    folds = 5, nrepeat = 10, # use repeated cross-validation
                    validation = "Mfold", dist = "max.dist", # use max.dist measure
                    progressBar = FALSE)