Regression coefficients, PRESS, RSS and Q2 in PLSDA

Hi all, I would like to know how to calculate PRESS and TSS for my PLSDA. The main reason is I would like to follow the methodology in this paper (PDF) Discriminant Q 2 (DQ 2 ) for improved discrimination in PLSDA models

My first issue is i don’t know the meaning of the predicted values (from predict function) - I have tried multiplying the reg.coefficients (from predict(plsdamodel, X)$B.hat) by the centred and scaled X matrix (plsdamodel$X), but my predicted values don’t match the predicted values I obtain from predict(plsdamodel, X), and I think I remember reading somewhere that the predict(plsdamodel,X)$predict would give me the likelihood of belonging to some class - so if this is true, it’s expectable that my values don’t match, but also I can’t really use likelihood to calculate the TSS and PRESS. I need some sort of predicted value that leads to a predicted class, depending on some threshold.

Note: I know PRESS is calculated based on cross-validation. I am just starting by the simplest example to understand where to get the data before going into the perf with CV loo, but the ultimate goal is to 1- get predicted values to be able to 2- calculate PRESS 3- maybe even calculate PRESSD if I manage to code it in R and then 4-get the DQ2 from there, otherwise 2- calculate PRESS and 3- get Q2.

Let me know if I need to clarify this question a bit, I can also create a set of fake data and do the codes if it’s easier.

Many thanks!
Joana

Hi @Joana,

I may not be the best person to ask about this but I’ll see how I go answering it. Looking at the image below, we can see how the predict() function calculates the expected score of each sample for each response class. This is due to mixOmics encoding all classification responses as dummy matrices. This may make converting to your desired output a bit tricky. Note that you are after Y.hat.new not T.pred.

image

Looking at the predict() function code, line 593 shows how Ypred is calculated. If my understanding is correct, the values from this line is what you want. You do not want the transformed version (on line 594) as this converts it to a likelihood.

The first code snippet directly below sets things up. Note that the X and Y datasets were sliced such that this was a two-class problem (as the methodology described in the linked paper cannot apply to multiclass problems):

data(breast.TCGA)
X.train <- breast.TCGA$data.train$mirna[1:75,]
Y = as.factor(as.vector(breast.TCGA$data.train$subtype[1:75]))

ncomp <- 2 

res.splsda <- plsda(X.train, Y, ncomp = 2)

X.test <- breast.TCGA$data.test$mirna[1:35, ]

This is where you would employ the predict() function. Instead, run the following lines:

D <- crossprod(scale(X.train), res.splsda$variates$X)
B <- crossprod(res.splsda$ind.mat, res.splsda$variates$X)
W <- res.splsda$loadings$X

ncomp.list <- 1:ncomp

Ypred <- list()
for (i in ncomp) {
  Ypred = lapply(1 : ncomp.list[i], function(x){scale(X.test) %*% W[, 1:x] %*% solve(t(D[, 1:x]) %*% W[, 1:x]) %*% t(B)[1:x, ]})
}

If we inspect the first few samples of Ypred on the first component:

head(Ypred[[i]])
#>           Basal       Her2
#> A54N  1.0254366 -1.0254366
#> A2NL  1.9738867 -1.9738867
#> A6VY  0.1388726 -0.1388726
#> A3XT  0.6176910 -0.6176910
#> A1F0  0.1721207 -0.1721207
#> A26I -0.1705571  0.1705571

You’ll see that each sample has a predicted value for each class, but that the magnitudes are equal (for a given sample). You can essentially take the first column and think of it as your y.hat.i from the PRESS formula. If the value is >0, then its predicted as the +1 class (“Basal” in this case) and if its <0 then its the -1 class. You can then apply the PRESSD formula, excluding those which have a magnitude greater than 1.

I hope this helps and feel free to follow up with any more questions you have

Cheers,
Max.

1 Like

Hi @MaxBladen , could I have the link to the source of the image you showed?

Thanks!