I have a practical question regarding PLS-DA/sPLS-DA.
Let’s say I have built a PLS-DA model on a set of labelled samples and now want to use it to classify new, unlabelled samples. I could not find how to do this; is it possible?
One additional question about the tutorial you linked.
It is about splitting the dataset into training and test sets.
I have 184 samples in total. I did the split as instructed in the tutorial, allocating 140 samples to the training set and the remaining 44 to the test set:
train <- sample(1:nrow(X), 140)  # randomly select 140 samples for training
test <- setdiff(1:nrow(X), train)  # the remaining 44 samples form the test set
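As a quick sanity check on the split (using the same X and index vectors as above), the sizes come out as expected and the two sets do not overlap:

length(train)                    # should be 140
length(test)                     # should be 44
length(intersect(train, test))   # should be 0, no overlap between sets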
Hi @MaxBladen,
Sorry for the very late reply. I have not actually found a way around this yet.
Yes, the confusion matrix sums up to only 6 samples. nrow(X) (before splitting the data) returns 184, and nrow() of the train and test datasets returns the expected values, i.e. 140 and 44, respectively.
Yes, that is true: most entries of predict.comp2 are NaN, except for the samples that appear in the confusion matrix. The same pattern can be found in predict.splsda.h2s.
Yes, have a look at the link that Max gave earlier, in the ‘Prediction’ section. In that example we artificially created a test set from the original data; in your case it would be a genuinely new data set.
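For reference, a minimal sketch of what that prediction step looks like in mixOmics. The object names (X.train, Y.train, X.new) and the keepX values are placeholders, not taken from your code:

library(mixOmics)

## final model trained on the labelled samples
final.splsda <- splsda(X.train, Y.train, ncomp = 2, keepX = c(50, 50))

## classify the new, unlabelled samples
pred <- predict(final.splsda, newdata = X.new, dist = "max.dist")

## predicted class for each new sample, using 2 components
pred$class$max.dist[, 2]

If you fitted a plain plsda() model (no variable selection), the call is the same, just without keepX.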
The issue is how you would normalise your new unlabelled samples first, without leaking information from them into the model (i.e. without overfitting).
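A minimal sketch of one common approach: centre and scale the new samples using parameters estimated on the training data only (variable names are illustrative). Note that if the centring/scaling is done inside mixOmics (scale = TRUE in splsda), the predict step already reuses the training parameters, so this applies mainly to any preprocessing you perform yourself beforehand:

## estimate centring/scaling parameters on the training data only
train.means <- colMeans(X.train)
train.sds   <- apply(X.train, 2, sd)

## apply the training parameters to the new samples, so that
## no information from the new data leaks into the preprocessing
X.new.scaled <- scale(X.new, center = train.means, scale = train.sds)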
The problem I am facing is in the evaluation output. As mentioned earlier, I have my training and test datasets. I trained the model on the training set (140 samples) and used it to predict the test set (44 samples). However, the confusion matrix only accounts for 6 samples in total, meaning 38 samples have no prediction (empty rows).
OK, that is weird. If you want, I can have a look at your data and code if you send me your .RData and everything I need to rerun that part (i.e. the final PLS-DA model and the prediction).
I have the same issue. Strangely, I found a way around it by loading a single data file and splitting it into training and test data manually. I don’t understand why it works now, but at least I have results…
Thanks @Fabien-Filaire. In that case I think something weird is happening when the training / test data sets are divided directly in R, e.g. droplevels(Y) has not been applied, or something of that kind (i.e. some information still remains in memory and has not been ‘cut’ properly).
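As a sketch of what I mean (variable names assumed from the earlier split code):

## subset the response, then drop any factor levels
## that are no longer present in each subset
Y.train <- droplevels(Y[train])
Y.test  <- droplevels(Y[test])

## check that no empty levels remain
table(Y.train)
table(Y.test)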