Hello everyone,
I am working on a project in which I want to predict two continuous variables from several omics datasets. A sparse multi-block PLS seems very appropriate. However, since I cannot tune the number of components, I have been looking for a way to justify my choice, and that is what this post is about.
I set aside 20 individuals before building my model, and the idea is to predict the Y values of these 20 individuals using the first k components, for k from 1 to 30. The final goal is to keep the smallest number of components beyond which the RMSE (Root Mean Square Error) on these 20 individuals no longer decreases meaningfully, and to discard the remaining components. I obtain this kind of graph for one of my two Y variables:
Looking at this graph, I would tend to select 13 components, as the following ones do not reduce the RMSE enough. I would like to know whether this type of methodology can be used to choose the number of components for a block.spls model. Cross-validation could also be implemented to make this selection more robust. What do you think? Does this methodology seem appropriate to justify the number of components? Are 13 components too many?
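A minimal sketch of how "does not reduce the RMSE enough" could be made explicit: keep the smallest number of components whose RMSE lies within a relative tolerance of the minimum. The helper `select_ncomp` and its `tol` argument are hypothetical (they are not part of mixOmics), the 5% default is an arbitrary assumption, and the RMSE values below are made up for illustration only.

```r
# Hypothetical helper (not part of mixOmics): return the smallest number of
# components whose RMSE is within a relative tolerance `tol` of the minimum.
select_ncomp <- function(rmse, tol = 0.05) {
  stopifnot(is.numeric(rmse), length(rmse) > 0, !anyNA(rmse))
  threshold <- min(rmse) * (1 + tol)
  which(rmse <= threshold)[1]
}

# Made-up RMSE values, for illustration only (not results from real data)
toy_rmse <- c(1.00, 0.80, 0.65, 0.60, 0.59, 0.61)
select_ncomp(toy_rmse)           # 4: first RMSE within 5% of the minimum
select_ncomp(toy_rmse, tol = 0)  # 5: position of the minimum itself
```

With the code from this post, the same call would be `select_ncomp(Resp1)` and `select_ncomp(Resp2)`. The tolerance is a subjective choice and would need to be reported alongside the selected number of components.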
Here is part of my R code:
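k <- 30 # maximum number of components
Resp1 <- rep(NA, k) # RMSE for the first Y variable, one value per number of components
Resp2 <- rep(NA, k) # RMSE for the second Y variable, one value per number of components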
MyResult.diablo <- block.spls(X, Y, keepX=list.keepX, ncomp=k, design = MyDesign,
mode="regression") #build a model
Mypredict.diablo <- predict(MyResult.diablo, newdata = X.test, dist = "centroid") #test with 20 ind
mypred <- Mypredict.diablo$WeightedPredict #get the weighted predictions
# compute the RMSE for each number of components
for (i in 1:k) {
  mypreddim <- mypred[, , i] # predictions using the first i components
  mypred1 <- mypreddim[, 1] * sd1 + m1 # back-transform to the original scale (Y1 was scaled beforehand)
  mypred2 <- mypreddim[, 2] * sd2 + m2 # back-transform to the original scale (Y2 was scaled beforehand)
  Resp1[i] <- sqrt(mean((Y.test[, 1] - mypred1)^2)) # RMSE for 1st Y var
  Resp2[i] <- sqrt(mean((Y.test[, 2] - mypred2)^2)) # RMSE for 2nd Y var
}
# plot the RMSE against the number of components, one plot per Y variable
plot(1:k, Resp1, xlab = "Number of components", ylab = "RMSE",
     main = "Prediction: 20 individuals (Y1)", pch = 16)
plot(1:k, Resp2, xlab = "Number of components", ylab = "RMSE",
     main = "Prediction: 20 individuals (Y2)", pch = 16)
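The cross-validation idea mentioned above could be sketched roughly as follows. This is only a sketch under assumptions: `make_folds` and `cv_rmse` are hypothetical helpers (not mixOmics functions), `X` is assumed to be a named list of blocks with samples in rows, `Y` a numeric matrix with one column per response, and the `n_folds` and `seed` arguments are arbitrary choices. It reuses only the `block.spls()` and `predict()` calls shown above, and the RMSE is returned in the units of the `Y` passed in (scaled units if Y was scaled beforehand).

```r
# Hypothetical helper: assign each of n samples to one of n_folds folds at random.
make_folds <- function(n, n_folds = 5, seed = 1) {
  set.seed(seed)
  sample(rep(seq_len(n_folds), length.out = n))
}

# Hypothetical helper: cross-validated RMSE for 1..ncomp components.
# Returns a matrix with one row per Y variable and one column per number of components.
cv_rmse <- function(X, Y, keepX, design, ncomp, n_folds = 5, seed = 1) {
  folds <- make_folds(nrow(Y), n_folds, seed)
  sse <- matrix(0, nrow = ncol(Y), ncol = ncomp) # accumulated squared errors
  for (f in seq_len(n_folds)) {
    test <- folds == f
    X_train <- lapply(X, function(b) b[!test, , drop = FALSE])
    X_test  <- lapply(X, function(b) b[test, , drop = FALSE])
    fit <- mixOmics::block.spls(X_train, Y[!test, , drop = FALSE],
                                keepX = keepX, ncomp = ncomp,
                                design = design, mode = "regression")
    pred <- predict(fit, newdata = X_test)$WeightedPredict
    for (h in seq_len(ncomp)) {
      pred_h <- matrix(pred[, , h], nrow = sum(test)) # predictions with h components
      sse[, h] <- sse[, h] + colSums((Y[test, , drop = FALSE] - pred_h)^2)
    }
  }
  sqrt(sse / nrow(Y)) # every sample is predicted exactly once
}

# Example fold assignment for 20 samples
folds <- make_folds(20, n_folds = 5)
```

A call such as `cv_rmse(X, Y, list.keepX, MyDesign, ncomp = k)` would then give one RMSE curve per Y variable. Repeating it with several seeds and averaging the curves should make the choice less dependent on one particular split than the single set of 20 held-out individuals.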
Thank you in advance for your advice. I would like to take this opportunity to congratulate you on this package, which is very ergonomic.
Sincerely