Tuning the number of components for block.spls using test datasets

Hello everyone,

I am currently working on a project in which I want to regress two continuous variables on several omics blocks. A multi-block PLS in a sparse framework seems very appropriate. However, since there is no tuning function for the number of components in block.spls, I have been looking for a way to justify my choice, and that is what my post is about.

I set aside 20 individuals before building my model, and the idea is to predict the values of these 20 individuals using the first k components. The final goal is to keep only a reasonable number of components minimizing the RMSE (Root Mean Square Error) for these 20 individuals, i.e. to reject all the components that do not reduce the RMSE. I then obtain this kind of graph for one of my two Y variables:

[Figure: test-set RMSE as a function of the number of components, for one of the two Y variables]

Looking at this graph, I would tend to select 13 components, as the following ones do not reduce the RMSE enough. I wanted to know if this type of methodology could be applied to select components for a block.spls. It would also be possible to add cross-validation to make this selection more robust. What do you think? Does this methodology seem appropriate to justify the number of components? Is 13 components too high?

Here is part of my R code:

k <- 30 # maximum number of components
Resp1 <- rep(NA, k) # RMSE for the first Y variable
Resp2 <- rep(NA, k) # RMSE for the second Y variable
MyResult.diablo <- block.spls(X, Y, keepX = list.keepX, ncomp = k, design = MyDesign,
                              mode = "regression") # build the model
Mypredict.diablo <- predict(MyResult.diablo, newdata = X.test, dist = "centroid") # predict the 20 test individuals
mypred <- Mypredict.diablo$WeightedPredict # get the weighted predictions

# RMSE for each number of components
for(i in 1:k){
  mypreddim <- mypred[, , i] # predictions using the first i components
  mypred1 <- mypreddim[, 1] * sd1 + m1 # back-transform (Y1 was scaled) for comparison
  mypred2 <- mypreddim[, 2] * sd2 + m2 # back-transform (Y2 was scaled) for comparison
  Resp1[i] <- sqrt(mean((Y.test[, 1] - mypred1)^2)) # RMSE for the 1st Y variable
  Resp2[i] <- sqrt(mean((Y.test[, 2] - mypred2)^2)) # RMSE for the 2nd Y variable
}
# plots
plot(1:k, Resp1, xlab = "Number of components", ylab = "RMSE",
     main = "Prediction: 20 individuals", pch = 16)
plot(1:k, Resp2, xlab = "Number of components", ylab = "RMSE",
     main = "Prediction: 20 individuals", pch = 16)
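The cross-validation idea mentioned above could be sketched as follows (hypothetical fold loop; `X` is assumed to be the list of omics blocks, and `Y`, `list.keepX` and `MyDesign` to exist as in the code above — this is only a sketch, not a tested implementation):

```r
library(mixOmics)

k <- 10      # maximum number of components to evaluate
nfolds <- 5  # number of cross-validation folds
n <- nrow(Y) # Y assumed to be an n x 2 matrix of the two responses

set.seed(42)
folds <- sample(rep(1:nfolds, length.out = n)) # random fold assignment
rmse1 <- matrix(NA, nrow = nfolds, ncol = k)   # RMSE for the first Y variable

for (f in 1:nfolds) {
  test  <- which(folds == f)
  model <- block.spls(lapply(X, function(x) x[-test, , drop = FALSE]),
                      Y[-test, , drop = FALSE],
                      keepX = list.keepX, ncomp = k,
                      design = MyDesign, mode = "regression")
  pred  <- predict(model, newdata = lapply(X, function(x) x[test, , drop = FALSE]))
  for (i in 1:k) {
    rmse1[f, i] <- sqrt(mean((Y[test, 1] - pred$WeightedPredict[, 1, i])^2))
  }
}

plot(1:k, colMeans(rmse1), type = "b",
     xlab = "Number of components", ylab = "Mean CV RMSE (Y1)")
```

Averaging the RMSE curve over folds should give a more stable basis for choosing the number of components than a single 20-individual hold-out set.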

Thank you in advance for your advice. I would like to take this opportunity to congratulate you on this package, which is very ergonomic.
Sincerely

hi again @gdrd,

I think 13 components is far too large a number for what you are trying to achieve with block.spls! Presumably you would not need more than 1 or 2 components to explain those two variables.

What we have proposed for a classic PLS2 model (2 blocks) is to use the Q2 criterion, which is a bit more global than looking at the RMSE. Potentially it could be extended to block.spls, but it would require a bit of implementation (spoiler: at this stage that part of the code is not easy to understand in the package!).

Here are some details about the Q2.
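For a plain two-block (s)PLS model, the Q2 mentioned above can be estimated by cross-validation with mixOmics' `perf()` function (a sketch, assuming `X1` is one omics block and `Y` the two responses; the component h is classically kept while Q2_h = 1 - PRESS_h/RSS_{h-1} stays above the usual 0.0975 threshold):

```r
library(mixOmics)

# Two-block sparse PLS in regression mode (keepX omitted here for simplicity)
res.spls <- spls(X1, Y, ncomp = 5, mode = "regression")

# Cross-validated performance measures, including the Q2 criterion
perf.spls <- perf(res.spls, validation = "Mfold", folds = 5, nrepeat = 10)

# Q2 per component; the horizontal line marks the 0.0975 rule of thumb
plot(perf.spls, criterion = "Q2.total")
```

The exact slot holding the Q2 values varies between package versions, so inspect the `perf()` output with `str()` if you need the numbers rather than the plot.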

Can you send me an email and we can have this discussion offline until we find a workable solution?

Kim-Anh

Hi @kimanh.lecao and @gdrd,

Indeed, I found this discussion very interesting and valuable. Have you since included a tuning procedure in either block.spls or wrapper.sgcca?

In a 3-block model, would it be correct/acceptable to tune the number of components in separate spls models? Or, based on your latest advances on this package, what would you suggest?
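(For what it's worth, the "separate spls models" idea could be sketched as below — whether a per-block choice transfers to the joint block model is exactly the open question here, so this is only an exploratory device, assuming `X` is a named list of blocks and `Y` the response matrix:)

```r
library(mixOmics)

# One spls per block against Y, each assessed by cross-validated Q2
for (b in names(X)) {
  res <- spls(X[[b]], Y, ncomp = 5, mode = "regression")
  p   <- perf(res, validation = "Mfold", folds = 5)
  plot(p, criterion = "Q2.total") # inspect Q2 per component, block by block
}
```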

Many thanks for your time.

Best,

Serena

Hello,

I was wondering if there is any update on the tuning for block.spls.

Thanks!
Mariana

hi @MarianaPLR

Not yet! We need funding :slight_smile:
We used block.spls recently in this paper: https://www.biorxiv.org/content/10.1101/2024.01.30.577864v1.full. We chose the number of variables to select arbitrarily, and we inspected the sample plots to choose the number of components that made sense (here, 1).

(I think this is a hard methodological question in general, and I don't think we will solve it any time soon.)