Test and training datasets

Hi there,
I’m using mixOmics - DIABLO to integrate proteomics (1625 features), metabolomics (90 features) and lipidomics (289 features) data from 62 patients (32 in group REL and 30 in group NOT.REL).
I’ve run the tune.block.splsda function with the following test.keepX on my entire dataset:

test.keepX <- list(metabolites = seq(1, 90, 6),
                   protein = seq(100, 1600, 107),
                   lipids = seq(10, 280, 19))

From this I obtained the list.keepX and built my final model.
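
To be concrete, this is roughly what my tuning and fitting step looks like (just a sketch: my three blocks are in a named list X, the group labels in a factor Y, and I use a standard DIABLO design matrix; the ncomp, folds and nrepeat values here are only illustrative):

library(mixOmics)

# design matrix linking the three blocks (0.1 off-diagonal is a common default)
design <- matrix(0.1, nrow = length(X), ncol = length(X),
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

# tune the number of variables to keep per block and per component
tune.res <- tune.block.splsda(X, Y, ncomp = 2, test.keepX = test.keepX,
                              design = design, validation = "Mfold",
                              folds = 5, nrepeat = 10)
list.keepX <- tune.res$choice.keepX

# final DIABLO model fitted on the full dataset
final.model <- block.splsda(X, Y, ncomp = 2, keepX = list.keepX, design = design)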

Now I would like to validate this model. I was looking at the “Case Study of sPLS-DA with SRBCT dataset”, in the PREDICTION section, where it says:

“In real scenarios, the training model should be tuned itself. It is crucial that when tuning the training model, it is done in the absence of the testing data. This also reduces likelihood of overfitting”.

If I understand correctly, I need to split the dataset into training and test sets at the very beginning of my analysis, then tune the model only on the training set (with the test.keepX parameters described above), and finally use the predict function on the test set…
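
Something like the following is what I have in mind (just a rough sketch; X, Y, design and test.keepX are as in my tuning code above, and the hold-out of 12 patients is arbitrary):

set.seed(123)
# hold out ~20% of the 62 patients as an external test set
test.idx <- sample(seq_len(length(Y)), size = 12)
X.train  <- lapply(X, function(m) m[-test.idx, , drop = FALSE])
X.test   <- lapply(X, function(m) m[test.idx, , drop = FALSE])
Y.train  <- Y[-test.idx]
Y.test   <- Y[test.idx]

# tune and fit using the training samples only
tune.train  <- tune.block.splsda(X.train, Y.train, ncomp = 2,
                                 test.keepX = test.keepX, design = design,
                                 validation = "Mfold", folds = 5, nrepeat = 10)
train.model <- block.splsda(X.train, Y.train, ncomp = 2,
                            keepX = tune.train$choice.keepX, design = design)

# predict the class of the held-out patients and compare to the truth
pred <- predict(train.model, newdata = X.test)
table(pred$WeightedVote$max.dist[, 2], Y.test)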

Is that correct?
The main point is that I have only one dataset of 62 patients on which I would like to both build and validate the model. Is that possible? In your experience, is the sample size sufficient?

Thanks

hi @Chiara.Anser,

In a utopian case you would have a second validation dataset available for a proper external validation, but most researchers cannot achieve this. Given that the number of samples you have from the start is quite small, your validation would instead come from cross-validation of your final model using the perf() function.
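
For example, something along these lines (just a sketch, assuming your fitted DIABLO object is called final.model; the folds and nrepeat values are illustrative and should be adapted to your sample size):

# repeated cross-validation of the final DIABLO model
perf.res <- perf(final.model, validation = "Mfold", folds = 5, nrepeat = 50)
perf.res$WeightedVote.error.rate   # classification error rates per distance
plot(perf.res)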

You could divide your original dataset into training and test sets, but your results would be highly dependent on the particular test set, so you would have to repeat this with several different splits. That would be equivalent to doing cross-validation.

Kim-Anh