PLS and DIABLO tuning

Hi,

I will ask a couple of questions together, as they all concern tuning of the ncomp and keepX parameters in DIABLO and PLS.

  1. I am running PLS, and when I try to tune for the optimal number of components (ncomp) I get this error:
    Error in Ypred[omit, , h] <- Y.hat[, , 1] :
    number of items to replace is not a multiple of replacement length

dim(X); dim(Y)
[1] 24 1943
[1] 24 158
# First a PLS with a sufficient number of components, then we validate ncomp
MyResult.pls1 <- pls(Y, X, ncomp = 4)
View(MyResult.pls1)
set.seed(30) # for reproducibility in this vignette, otherwise increase nrepeat
perf.pls <- perf(MyResult.pls1, validation = "Mfold", folds = 4,
                 progressBar = FALSE, nrepeat = 10)

Error in Ypred[omit, , h] <- Y.hat[, , 1] :
  number of items to replace is not a multiple of replacement length

  2. I get an error when performing DIABLO, which I was not getting two days ago when I first ran the analysis:

X <- list(volatiles = data_gcms_P[c(1:19), c(7:164)],
          nonvolatilesNEG = data_neg_P[c(1:19), c(7:1949)],
          nonvolatilesPOS = data_pos_P[c(1:19), c(7:1256)])

type <- data_neg_P[c(1:24), ]
Subtype <- as.vector(type$Experiment)
Subtype <- as.factor(Subtype)
Y <- Subtype[c(1:19)]
summary(Y)
cooked    raw
    10      9
# set up arbitrarily the number of variables keepX that we wish to select in each data set and each component
list.keepX <- list(volatiles = c(15, 5), nonvolatilesNEG = c(20, 15), nonvolatilesPOS = c(10, 5))
MyResult.diablo.less <- block.splsda(X, Y, keepX = list.keepX, ncomp = 2) # default: ncomp = 2, scale = TRUE, mode = "regression"
Warning messages:
1: In cor(A[[k]], variates.A[[k]]) : the standard deviation is zero
2: In cor(A[[k]], variates.A[[k]]) : the standard deviation is zero
#various plots
plotIndiv(MyResult.diablo.less) ## sample plot
plotVar(MyResult.diablo.less) ## variable plot
Warning messages:
1: In cor(object$blocks[], object$variates[][, c(comp1, comp2)], :
the standard deviation is zero
2: In cor(object$blocks[], object$variates[][, c(comp1, comp2)], :
the standard deviation is zero
# CV
MyPerf.diablo <- perf(MyResult.diablo.less, validation = "Mfold", folds = 3,
                      nrepeat = 50,
                      dist = "centroids.dist")

Error: Unexpected error while trying to choose the optimum number of components. Please check the inputs and if the problem persists submit an issue at https://github.com/mixOmicsTeam/mixOmics/issues

  3. When tuning keepX, your case study script has:
    test.keepX = list(datasetA = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
                      datasetB = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
                      datasetC = c(5:9, seq(10, 18, 2), seq(20, 30, 5)))
    How do I choose these lists, and what is their impact? Does it mean that not all variables are checked for the model?

Thank you very much in advance for your help, and my compliments on your work!

Hi EiriniP,

I'm getting the same error ("Ypred[omit, , h] <- Y.hat[, , 1] : number of items to replace is not a multiple of replacement length") … have you already resolved this problem?

Regarding the first error, when using the perf() function on a pls object: even with data of the same dimensions and using your exact code, I am unable to replicate your error.

If you are familiar with the use of breakpoints in RStudio, I would advise placing one at line 542 of the perf() function and examining the dimensions of the Ypred and Y.hat variables there. If this does not provide a clear answer, feel free to let me know your email address; I can then reach out to you about your data and code and we can work through the issue together.
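If it helps, here is a minimal sketch of how the same inspection could be done without editing the package source, using base R debugging tools. The method name returned by getS3method() varies between mixOmics versions, so check class(MyResult.pls1) and methods("perf") first; everything below is illustrative rather than the package's documented workflow:

class(MyResult.pls1)                 # e.g. "mixo_pls"
methods("perf")                      # list the perf() methods available
f <- getS3method("perf", class(MyResult.pls1)[1])
debugonce(f)                         # opens a browser on the next call
perf(MyResult.pls1, validation = "Mfold", folds = 4, nrepeat = 10)
# step with 'n' until the Ypred[omit, , h] assignment, then compare
# dim(Ypred) and dim(Y.hat) to see where the lengths disagree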

Hi @Leandro and @EiriniP,

Regarding the issue of perf() not functioning on your pls objects, I've raised an issue on GitHub and implemented a fix for it. If you want to use this build to work around the bug you reported, install the devtools package and run the following commands:

library(devtools)
install_github("mixOmicsTeam/mixOmics", ref = github_pull("197"))

If you want to revert to the standard release, navigate to your library folder within the R installation directory, delete the mixOmics folder, and then run the following line within RStudio:

BiocManager::install("mixOmics")
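A small convenience sketch (assuming BiocManager is already installed): the same revert can also be done entirely from within R, without deleting the folder by hand:

remove.packages("mixOmics")          # removes the development build
BiocManager::install("mixOmics")     # reinstalls the Bioconductor release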

Let me know if this fixes your issue!

Cheers,
Max.

Hi Max,

First, sorry I did not reply to your previous message. It has been a long time since I posted my question, and since I hadn't managed to work around the problem at the time, I eventually used another package…

Second, thank you very much for fixing the error. I will do as you say and will definitely use mixOmics for my current analyses!

Have a great day

Eirini


So the problem was the zero-variance features… got it! Anyway, thank you for your work!

Leandro

As a suggestion for you both (and for anyone else reading this post): pre-processing your data is crucial, and arguably deserves more time than the analysis itself. You should have a strong understanding of how your features are related (and correlated), how many missing values you have and where they are, and which features have little to no variance.

There are no hard and fast rules on whether such features should be retained or removed prior to analysis, but exploring them is of paramount importance.
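For anyone wanting a concrete starting point, here is a minimal sketch of such checks in base R, run on one block of the DIABLO list X from the posts above (the block name "volatiles" comes from that code; the choices below are illustrative only):

blk <- X$volatiles

# count missing values per feature
colSums(is.na(blk))

# find features with zero variance -- these are what trigger the
# "standard deviation is zero" warnings above
zero_var <- which(apply(blk, 2, var, na.rm = TRUE) == 0)
if (length(zero_var) > 0) blk <- blk[, -zero_var, drop = FALSE]

# mixOmics also provides a nearZeroVar() helper (modelled on caret's) for
# near-zero-variance filtering; see ?mixOmics::nearZeroVar for details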

Hello, I saw that this question (which is exactly my question as well) was never answered.

The website guide also does not explain how these lists are chosen. I would like to know how I should choose them. Thanks in advance.

Hi @estefaniatn @EiriniP,

Does it mean that not all variables are checked for the model?

No, it means that only the top ones (e.g. 5) are selected during the evaluation.

How do I choose these lists?

You could be comprehensive and try something like:
datasetA = c(1:ncol(datasetA))
but of course this would take ages to run. So instead you have to be strategic and think: what is the minimum and maximum number of variables per dataset that you need for interpretation, per component? 1, 5 or 20?

You could try a few options first and then refine, in order to reduce the computational burden.

Kim-Anh
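For readers looking for a concrete example of how such a grid is passed to the tuning function, below is a hedged sketch using tune.block.splsda() with the block names from the posts above; the grid values, folds and nrepeat are illustrative only and should be adapted to your data and compute budget:

library(mixOmics)

# candidate numbers of variables to test per block and per component
test.keepX <- list(volatiles       = c(5:9, seq(10, 18, 2)),
                   nonvolatilesNEG = c(5:9, seq(10, 18, 2)),
                   nonvolatilesPOS = c(5:9, seq(10, 18, 2)))

# full design: all blocks connected to each other
design <- matrix(1, nrow = length(X), ncol = length(X),
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

tune.diablo <- tune.block.splsda(X, Y, ncomp = 2, test.keepX = test.keepX,
                                 design = design, validation = "Mfold",
                                 folds = 3, nrepeat = 10,
                                 dist = "centroids.dist")
tune.diablo$choice.keepX  # chosen number of variables per block and component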