PLS-DA: prediction of new upcoming samples

Hi there,

I have a practical question regarding PLS-DA/sPLS-DA.

Let's say I have created a PLS-DA model based on some samples and want to use it to classify new unlabelled samples. I could not find a way to do this; is that possible?

Thanks

Explore the predict() function. If you’d like some usage examples, look here
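
A minimal sketch of what that looks like, assuming a fitted (s)PLS-DA model `model` and a matrix `X.new` of unlabelled samples with the same variables as the training data (`model` and `X.new` are placeholder names, not objects from this thread):

```r
# predict class membership of new samples from a fitted (s)PLS-DA model (mixOmics)
pred <- predict(model, newdata = X.new, dist = "max.dist")

# predicted classes per component; the column for the final component
# is usually the prediction you want
pred$class$max.dist
```

`dist` can also be `"centroids.dist"`, `"mahalanobis.dist"`, or `"all"`.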

@MaxBladen perfect! That’s what I am looking for, thanks.

One additional question from the tutorial you have linked.
It is about splitting the dataset for training and testing.

I have 184 samples in total. I did the split as instructed in the tutorial, allocating 140 samples for training and the remaining 44 for testing:

train <- sample(1:nrow(X), 140) # randomly select 140 samples for training
test <- setdiff(1:nrow(X), train)

Store the matrices into training and test sets:

X.train <- X[train, ]
X.test <- X[test, ]
Y.train <- Y[train]
Y.test <- Y[test]
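
Two small checks worth doing around this split (not in the original code, just a sketch): fixing the seed makes the random split reproducible, and tabulating the class labels shows whether any class ended up under-represented in either set:

```r
set.seed(42)  # hypothetical seed, only so the split is reproducible
train <- sample(1:nrow(X), 140)
test <- setdiff(1:nrow(X), train)

# check class balance in both subsets
table(Y[train])
table(Y[test])
```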

After training

train.splsda.h2s <- splsda(X.train, Y.train, ncomp = optimal.ncomp, keepX = optimal.keepX)

and testing

predict.splsda.h2s <- predict(train.splsda.h2s, X.test, dist = "mahalanobis.dist")

the evaluation with the confusion matrix shows only 6 samples:

predict.comp2 <- predict.splsda.h2s$class$mahalanobis.dist[,2]
table(factor(predict.comp2, levels = levels(Y)), Y.test)

I would expect here evaluation of all 44 test samples. Why is that?

So the confusion matrix only sums to 6? Are you sure nrow(X) returns 184?

My best guess is that some of the values in predict.comp2 are NAs. If not, then I don't know what would be causing that.
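
A quick way to test that guess, using the object names from the thread (`$predict` is the array of raw predicted scores returned by mixOmics, samples × classes × components):

```r
# how many of the 44 test predictions are missing, and which ones?
sum(is.na(predict.comp2))
which(is.na(predict.comp2))

# check the raw predicted scores on component 2 for the same pattern
summary(predict.splsda.h2s$predict[, , 2])
```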

Let me know if you can't resolve it and I'll look into it.

Hi @MaxBladen,
Sorry for the very late answer. I have actually not found my way around this.

Yes, the confusion matrix sums to 6 samples. nrow(X) (before splitting the data) returns 184, and nrow() of the train and test datasets returns the expected values, i.e. 140 and 44, respectively.

Yes, that is true: most values in predict.comp2 are NaN, except for the ones shown in the confusion matrix. The same pattern appears in predict.splsda.h2s.

Hi @DeniR

Yes, have a look at the link that Max gave earlier, in the section ‘Prediction’. In that case we artificially created a test set from the original data. In your case that would be a new data set.
The issue is how you would normalise your new unlabelled samples first, without overfitting.
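
One common way to handle this (a sketch of the general idea, not a mixOmics recipe; `X.new` is a hypothetical matrix of new samples): estimate the centring/scaling parameters on the training set only, then apply those same parameters to the new samples, so no information from the unlabelled data leaks into the normalisation:

```r
# estimate scaling on the training data only
ctr <- colMeans(X.train)
scl <- apply(X.train, 2, sd)

# apply the *training* parameters to the new, unlabelled samples
X.new.scaled <- scale(X.new, center = ctr, scale = scl)
```

Note that mixOmics' predict() already applies the model's own internal scaling to newdata, so this mainly matters for any normalisation done outside the package (e.g. log transforms or batch correction).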

Kim-Anh

Hi @kimanh.lecao,

Thanks for following up on this.

The problem I am facing is in the evaluation output. As mentioned earlier, I have my training and test datasets. I trained the model with the training set (140 samples) and used it to predict the test dataset (44 samples). However, the confusion matrix recognises only 6 samples in total, meaning 38 have no prediction (empty rows).

OK, that is weird. If you want, I can have a look at your data and code if you send me your .RData and everything I need to rerun that part (i.e. the final PLS-DA model and the prediction).

Kim-Anh

Hello @kimanh.lecao ,

I have the same issue. Strangely, I found my way around it by uploading a single data file and manually dividing it into training and test data. I don't understand why it works now, but at least I have results…

Best

Fabien

@Fabien-Filaire @DeniR,

Thanks @Fabien-Filaire. Then I think there is something weird happening when you divide the training/test data sets directly in R, e.g. droplevels(Y) has not been applied, or something of that kind (i.e. some information still remains in memory and has not been 'cut' properly).
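
To illustrate the stale-levels issue (a toy sketch, not the thread's data): subsetting a factor in R keeps all of the original levels, and the empty levels then show up as zero rows/columns in table() until droplevels() removes them:

```r
Y <- factor(c("a", "a", "b", "b", "c"))
Y.sub <- Y[1:4]       # "c" is no longer present...
levels(Y.sub)         # ...but is still a level: "a" "b" "c"
table(Y.sub)          # "c" appears with count 0

Y.clean <- droplevels(Y.sub)
levels(Y.clean)       # "a" "b"
```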

Kim-Anh