MINT-PLS DA - Prediction in another study

Hello,

I was able to build a model with MINT-PLS DA and I would like to use the prediction function on a new study. However, it seems that I have to tell the model that my sample to predict belongs to one of the studies used in the model. How can I tell it that the sample comes from another study, please?

Yours faithfully,
Jérémy Tournayre

Hi Jeremy,
Since we use ‘leave-one-group-out cross-validation’, it would be pretty safe to run the perf() function on all the collated studies and focus on the prediction performance for that specific study. I am not sure we have implemented predict.mint.splsda explicitly in the package (it is certainly used in perf.mint.splsda if you want to have a look).
If you do want to go down that path, have a look at a MINT function, because there is some scaling to do beforehand.

Kim-Anh

Hi Kimanh,

Thanks for your quick reply! I am satisfied with my model in terms of performance. It’s a model built with MINT-PLS DA for a qualitative variable that can take 3 values (“A”, “B” or “D”).

In fact, I have received samples from a new study for which I don’t know the value of the qualitative variable, so they can’t be included in the model. However, I would like to predict them with the model built with MINT-PLS DA. Or does the prediction function only serve to evaluate the quality of the model, and not to predict a truly unknown sample?

Jeremy

Hi Jeremy,
Thanks for clarifying what your analysis aims are.

Here is the code to predict on an external data set. The function is quite tricky to find because predict is an S3 function in our package: the class-specific methods have longer names that you do not see when you call predict(), but they are listed in the reference manual of the package.

?predict.mint.block.splsda

Here is an example of code:
data(stemcells)
stemcells$study

# here we remove one study (study 4) in the training set
index.study4 <- which(stemcells$study == 4)
X <- stemcells$gene[-index.study4,]
dim(X)  #110 400
Y <- stemcells$celltype[-index.study4]
study <- droplevels(stemcells$study[-index.study4])

# study 4 will be used as external test set
X4 <-stemcells$gene[index.study4,]

# mint method
res <- mint.splsda(X = X, Y = Y, ncomp = 3, keepX = c(50,50,50),
                  study = study)

# in predict, study.test is used in case there are several studies
# in the test set, as we need to center study-wise
res.predict <- predict(res, X4, study.test = droplevels(stemcells$study[index.study4]))

res.predict$class

From those results, take the prediction at the last component (the number of components you tuned previously on your training set) as your final prediction.
Good luck, and we hope your results are promising!

Kim-Anh + Al

Thank you, it works perfectly!

Hi,

I wonder why the prediction result for a sample depends on the other samples included in the prediction.

For example, if I predict only two samples, 124 (sample166) and 125 (sample167) from your data, I obtain:

res.predict$predict

, , dim3

          Fibroblast       hESC     hiPS
sample166 -0.2065909 0.06891953 0.976463
sample167 -0.1899129 0.54211822 0.570175

But if I predict 123 (sample165) and 125 (sample167) from your example, I obtain:

res.predict$predict

, , dim3

          Fibroblast      hESC      hiPS
sample165 -0.2088545 0.1647048 0.8783963
sample167 -0.1903215 0.5258324 0.5797968

Why does sample167 have two different prediction results depending on the other sample?
Also, can I predict only one sample?

Yours faithfully,
Jérémy Tournayre

Hi Jeremy,
Could I direct you to the supplemental material from mixOmics that explains the prediction distances?
All in all though, even if the values differ (because the calculation of the predicted variates depends on what else is in the test set), it seems that the final class prediction would be the same? You did not show the final prediction.

Another explanation: it depends on the distance you use, which may be based on centroids (or something else).

We make the assumption that you would predict all the samples from your test set in one go.
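
To illustrate why the values can change with the rest of the test set, here is a minimal base R sketch. It assumes the test samples of a new study are centered on their own column means (the study-wise centering mentioned for study.test above); the toy matrix and the helper center_on_test_set are hypothetical and are not part of mixOmics.

```r
# Toy test set: 3 samples x 2 features (made-up numbers, not the stemcells data).
# matrix() fills column by column, so s1 = (1, 4), s2 = (2, 8), s3 = (6, 3).
X.test <- matrix(c(1, 2, 6, 4, 8, 3), nrow = 3,
                 dimnames = list(c("s1", "s2", "s3"), c("g1", "g2")))

# Hypothetical helper: center each column on the mean of the samples supplied
center_on_test_set <- function(x) sweep(x, 2, colMeans(x), "-")

# Center s3 together with s1, then together with s2
with_s1 <- center_on_test_set(X.test[c("s1", "s3"), ])
with_s2 <- center_on_test_set(X.test[c("s2", "s3"), ])

with_s1["s3", ]  # g1 = 2.5, g2 = -0.5
with_s2["s3", ]  # g1 = 2.0, g2 = -2.5
```

The same sample s3 ends up with different centered values depending on its companion, so anything computed from the centered data can shift as well.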

Kim-Anh

Hi,

Thanks for your fast reply!
“it seems that the final class prediction would be the same? You did not show the final prediction.”
-> Yes, the final prediction is still the same despite the different values.

“We make the assumption that you would predict all the samples from your test set in one go.”
-> So I can’t predict a single sample from another study, is that it? Is there a reason?

I read the “Prediction distances” part of the supplemental material of mixOmics (https://ndownloader.figshare.com/files/9754087) but I don’t understand why we cannot predict a single sample from another study.

From your example, I was able to create a model from studies 1, 2 and 3, and I tried to predict 2 samples from study 4. This gives me opposite prediction values (see below). Why?

res.predict$predict
, , dim3
     Fibroblast       hESC       hiPS
124 -0.07785964 -0.3872099  0.4342543
125  0.07785964  0.3872099 -0.4342543

(The script I used to get this:

library(mixOmics)
data(stemcells)
index.study4 <- c(124, 125)

# 2 samples of study 4 will be used as external test set
X4 <- stemcells$gene[index.study4, ]
rownames(X4) <- c("124", "125")
external_study <- c(4, 4)

# here we remove study 4 from the training set
index.study4 <- which(stemcells$study == 4)
X <- stemcells$gene[-index.study4, ]
Y <- stemcells$celltype[-index.study4]
study <- droplevels(stemcells$study[-index.study4])

# mint method
res <- mint.splsda(X = X, Y = Y, ncomp = 3, keepX = c(50, 50, 50),
                   study = study)

res.predict <- predict(res, X4, study.test = external_study)
res.predict$predict
res.predict$class
)

If I put the same sample twice, I get 0:
      Fibroblast hESC hiPS
125-1          0    0    0
125-2          0    0    0

I think this is due to the assumption that I must predict all the samples from my test set in one go. However if I have only one sample I can’t predict it. Is that right?
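
A small base R sketch of that hypothesis, assuming the test set is centered on its own column means before prediction (the helper center_on_test_set and the numbers are made up for illustration and are not taken from mixOmics):

```r
# Hypothetical helper: center each column on the mean of the samples supplied
center_on_test_set <- function(x) sweep(x, 2, colMeans(x), "-")

# Two made-up samples measured on two features
a <- c(g1 = 1, g2 = 4)
b <- c(g1 = 3, g2 = 1)

# Two different samples: after centering, each row is the negative of the other
two_different <- center_on_test_set(rbind(a, b))
two_different

# The same sample twice: after centering, every value is zero
two_identical <- center_on_test_set(rbind(b1 = b, b2 = b))
two_identical
```

If the predicted values are then a linear function of the centered data, mirrored rows give mirrored predictions and all-zero rows give all-zero predictions, which matches the two outputs above.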

Jérémy

Hi Jeremy,

I had a quick look: the predict function requires at least 2 samples to work, otherwise it throws the error
Error in matrix(0, nrow = nrow(concat.newdata[[1]]), ncol = q) :
non-numeric matrix extent
However, I have a feeling it might just be a computational issue rather than a theoretical limitation. I will ask @aljabadi to have a look at the code and we will get back to you.

Regarding your concern about the res.predict$predict output, you should look at res.predict$class instead: this is what really matters in your case, not the actual predicted values on each component.
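
As a rough check, here is a base R sketch that applies a maximum rule (the predicted class is the one with the largest predicted value) to the component 3 values for sample167 quoted earlier in this thread. It only illustrates that rule; the centroid-based distances also reported in res.predict$class are not reproduced here.

```r
# Predicted values for sample167 on component 3, copied from the two runs
# quoted earlier in this thread
run1 <- c(Fibroblast = -0.1899129, hESC = 0.54211822, hiPS = 0.570175)
run2 <- c(Fibroblast = -0.1903215, hESC = 0.5258324, hiPS = 0.5797968)

# Maximum rule: pick the class with the largest predicted value
class_run1 <- names(which.max(run1))
class_run2 <- names(which.max(run2))

class_run1  # "hiPS"
class_run2  # "hiPS"
```

The raw values differ between the two runs, but the class picked by the maximum rule is the same in both.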

Kim-Anh

Hi Jeremy,

Thanks for letting us know about this bug. It is now fixed in the devel version which you can install using:

## install devtools if not installed
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

devtools::install_github("mixOmicsTeam/mixOmics", ref="devel")

Please note that if your model (fitted on the training data) uses scale=TRUE, a single new test/prediction sample from a new study cannot be scaled (it can be if it comes from one of the studies in the model), so you should be careful about that. Therefore, either consider not scaling your data in the MINT model (if that’s appropriate), or use more than one sample in predictions, or include all studies in both train and test/prediction.
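
The scaling issue can be seen with plain base R, independently of mixOmics. This is a minimal sketch with a made-up one-row matrix: a standard deviation cannot be estimated from a single observation, so scaling a lone sample by its own study is undefined.

```r
# A single made-up sample measured on two features
one_sample <- matrix(c(2.5, 7.1), nrow = 1,
                     dimnames = list("new1", c("g1", "g2")))

# Standard deviation of a single value is undefined
sd_single <- sd(one_sample[, "g1"])
sd_single  # NA

# Centering alone works (every value becomes 0) ...
centered <- scale(one_sample, center = TRUE, scale = FALSE)

# ... but centering and scaling divides 0 by 0 and gives NaN
scaled <- scale(one_sample, center = TRUE, scale = TRUE)
scaled
```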

Hope it helps.

Al
