Error when tuning sPLS parameters

Hi mixOmics Team,
I am trying to run the perf function for an sPLS analysis result:

> result.spls <- spls(X, Y, ncomp = 5, mode = "regression")
> perf.pls <- perf(result.spls, validation = 'Mfold', folds = 7, nrepeat = 50, progressBar = TRUE, cpus = 1)

But whenever I run the perf function I am getting the following error message:

> perf.pls <- perf(result.spls, validation = 'Mfold', folds = 7, nrepeat = 50, progressBar = TRUE, cpus = 1)
[========                                          ] 16%Error in X.test %*% a.cv : non-conformable arguments
> perf.pls <- perf(result.spls, validation = 'Mfold', folds = 7, nrepeat = 50, progressBar = TRUE, cpus = 1)
[=                                                 ] 2%Error in X.test %*% a.cv : non-conformable arguments
> perf.pls <- perf(result.spls, validation = 'Mfold', folds = 7, nrepeat = 50, progressBar = TRUE, cpus = 1)
[==                                                ] 4%Error in X.test %*% a.cv : non-conformable arguments

As you can see in the output, the error does not always occur at the same point. I also tried a lower nrepeat value; when I rerun the call repeatedly with that lower value, the function finishes without an error in about 1 of 5 runs.

> perf.pls <- perf(result.spls, validation = 'Mfold', folds = 7, nrepeat = 5, progressBar = TRUE, cpus = 1)
[==============================                    ] 60%Error in X.test %*% a.cv : non-conformable arguments
> perf.pls <- perf(result.spls, validation = 'Mfold', folds = 7, nrepeat = 5, progressBar = TRUE, cpus = 1)
[====================                              ] 40%Error in X.test %*% a.cv : non-conformable arguments
> perf.pls <- perf(result.spls, validation = 'Mfold', folds = 7, nrepeat = 5, progressBar = TRUE, cpus = 1)
[==============================                    ] 60%Error in X.test %*% a.cv : non-conformable arguments
> perf.pls <- perf(result.spls, validation = 'Mfold', folds = 7, nrepeat = 5, progressBar = TRUE, cpus = 1)
[==================================================] 100%

The same problem appears when using tune.spls. I assumed that it is a problem with the input data, but the dimensions are ok and there are no NZV values in the input matrices.

> dim(X)
[1] 16 30
> dim(Y)
[1] 16 23
> nearZeroVar(X)
$Position
integer(0)

$Metrics
[1] freqRatio     percentUnique
<0 rows> (or 0-length row.names)

> nearZeroVar(Y)
$Position
integer(0)

$Metrics
[1] freqRatio     percentUnique
<0 rows> (or 0-length row.names)

Additionally, when I use the same two datasets for the DIABLO block.splsda analysis perf and tune.block.splsda are working without a problem.

Please could you help me with this problem?
Thank you very much in advance for your help!

Based on the error we’re seeing, I believe the issue is occurring here. This line calculates the predicted values for the test samples (for a given fold and a given component).

The error states that the length of a.cv (the X loading values) differs from the number of columns in X.test. So we are either losing columns from the X.test dataset or losing values from the loading vector somewhere in that iteration.
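For anyone unfamiliar with the message: "non-conformable arguments" is what base R's %*% raises when the inner dimensions of the two operands disagree. Here is a minimal sketch with toy matrices. The names X.test and a.cv are borrowed from the error message, and the shapes are made up for illustration; none of this is taken from the perf source.

```r
# Toy stand-ins: 3 test samples, 30 predictors (shapes are illustrative only)
X.test <- matrix(rnorm(3 * 30), nrow = 3, ncol = 30)

# A loading vector that is one entry short, e.g. because a predictor was dropped
a.cv <- matrix(rnorm(29), ncol = 1)

# ncol(X.test) must equal nrow(a.cv) for the matrix product to be defined
conformable <- ncol(X.test) == nrow(a.cv)

# Capture the error message instead of stopping the script
msg <- tryCatch(
  {
    X.test %*% a.cv
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(conformable)
print(msg)
```

Comparing ncol() of the test matrix against the length of the loading vector is a quick way to see which side lost an entry.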

The fact that it almost always occurs when nrepeat=50 and only sometimes when nrepeat=5 suggests that this is an issue of randomness (likely the partitioning of the testing and training sets on a given repeat).
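One plausible mechanism (an assumption, not confirmed against the perf source) is that a random split occasionally leaves a predictor constant within a training set. The sketch below is pure base R and does not reproduce the mixOmics internals; count_degenerate_splits is a hypothetical helper name. It counts how many repeats contain at least one such training fold.

```r
# Count how many repeats contain a training fold in which at least one
# column is constant. Base R only; this is NOT how perf() is implemented.
count_degenerate_splits <- function(X, folds = 7, nrepeat = 50) {
  X <- as.matrix(X)
  n <- nrow(X)
  bad <- 0
  for (r in seq_len(nrepeat)) {
    # Randomly assign each sample to one of `folds` groups
    fold_id <- sample(rep(seq_len(folds), length.out = n))
    for (f in seq_len(folds)) {
      train <- X[fold_id != f, , drop = FALSE]
      # A column is constant if all of its training values are identical
      constant <- apply(train, 2, function(col) length(unique(col)) == 1)
      if (any(constant)) {
        bad <- bad + 1
        break  # one degenerate fold is enough to flag this repeat
      }
    }
  }
  bad
}

set.seed(1)
# Toy data: 16 samples, with one sparse column holding only two non-zero values
X_toy <- cbind(matrix(rnorm(16 * 4), nrow = 16), sparse = c(1, 2, rep(0, 14)))
n_bad <- count_degenerate_splits(X_toy, folds = 7, nrepeat = 50)
print(n_bad)
```

Because every repeat draws a fresh split, more repeats mean more chances to hit a degenerate fold, which would fit the pattern of failures at nrepeat=50 and occasional successes at nrepeat=5.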

While I believe this is an issue with your data, would you mind running the code below to see if the same error occurs?

library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic

result.spls <- spls(X, Y, ncomp = 5, mode = "regression")
perf.pls <- perf(result.spls, 
            validation = 'Mfold', folds = 7, 
            nrepeat = 50, progressBar = TRUE)

If you do not receive any errors running the above code, let me know and include your email address. I can then get in contact with you and potentially have a look at your particular datasets (if that is possible).

I am getting the same error with my dataset. It works fine if I use a subset of my Y dataset (31 taxa), but if I add more to it (400 taxa) the issue appears. I got rid of the nzv columns, so that when I run MyResult.spls$nzv it says NULL.

I have no idea what the problem is.

The dimensions of my data:

dim(meta); dim(filtered_gene)
[1] 184 432
[1]  184 4953

ncomptry=5
MyResult.spls <- spls(meta, filtered_gene, ncomp=ncomptry)
MyResult.spls$nzv # should say NULL

set.seed(30)
perf.pls <- perf(MyResult.spls, validation="Mfold", folds=7, progressBar=TRUE, nrepeat=50)

# Also fails if I run:
perf.pls <- perf(MyResult.spls, validation="Mfold", folds=5, progressBar=FALSE, nrepeat=50)

> Error in X.test %*% a.cv : non-conformable arguments

But I have no issues if I use a slightly different dataset with the following dimensions:

dim(filtered_meta); dim(filtered_gene)
[1] 190 31
[1]  190 4953
MyResult.spls <- spls(filtered_gene, filtered_meta, ncomp=ncomptry)

set.seed(30) 
perf.pls <- perf(MyResult.spls, validation="Mfold", folds=5, progressBar=FALSE, nrepeat=50)

# No errors, it runs and then I can check the plots

I am unfortunately facing the exact same issue :frowning:

EDIT: I fixed it (or worked around it) by making my nearZeroVar filtering a bit more aggressive. In my specific case, increasing the uniqueCut parameter to 15 did the job.

Melkschuimer

I’m sorry, but what is the uniqueCut variable and where did you incorporate it? Is it part of the spls function? I don’t see that as a variable to include.

If you apply the nearZeroVar function to your data separately before using spls/pls, you can set the uniqueCut parameter. See here

I’m still unsure how this gets implemented. So if my dataset that seems to be problematic is called meta, I ran it as such:

dim(meta)
# 190 204
nearZeroVar(meta, freqCut=95/5, uniqueCut=15, allowParallel=TRUE)
dim(meta)
# 190 204

Should I expect this to change my dataframe size? The vignette provided with the package was pretty useless, in my opinion.

Edit: I am trying it again, this time also setting near.zero.var=TRUE in the spls function, and so far perf hasn’t crashed (though it is only at 8%). Usually it crashes anywhere between 2% and 45% when running perf.

Edit 2: the perf function actually finished. But when trying to run tune.spls, it fails. I’m about to give up - I don’t understand why it fails so often and now at the tuning step. It works fine with a smaller dataset but it’s pointless if I can’t use a complete one.

Thanks for helping out @Melkschuimer.

In case anyone was curious how to subset your data after using nearZeroVar(), here’s a simple example:

# Zero containing dataframe
df

# Build nearZeroVar dataframe
nzv <- mixOmics::nearZeroVar(df, uniqueCut = 10)

# Subset using an if function (in case column length is 0)
if (length(nzv$Position) > 0) df <- df[, -nzv$Position, drop = FALSE]
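To make the dimension question from earlier in the thread concrete: calling nearZeroVar() on its own only returns a list, so dim() stays the same until the flagged columns are dropped and the result is assigned. Below is a small base R sketch. drop_flagged is a hypothetical helper, and nzv_stub is a hand-built stand-in that only mimics the $Position element shown in the nearZeroVar output above; in real use you would pass the actual nearZeroVar() result instead.

```r
# Hypothetical helper: drop the columns flagged in a nearZeroVar()-style result
drop_flagged <- function(df, nzv) {
  pos <- nzv$Position
  if (length(pos) == 0) return(df)  # nothing flagged: return the input unchanged
  df[, -pos, drop = FALSE]          # drop = FALSE keeps the two-dimensional shape
}

# Toy data frame with one all-zero column
df <- data.frame(a = c(1, 2, 3, 4), zeros = c(0, 0, 0, 0), b = c(4, 3, 2, 1))

# Stand-in for the list returned by nearZeroVar(); only $Position is used here
nzv_stub <- list(Position = 2L)

dim_before <- dim(df)
df_clean <- drop_flagged(df, nzv_stub)  # the result must be assigned to take effect
dim_after <- dim(df_clean)

print(dim_before)
print(dim_after)
```

The filtered object (df_clean here) is the one to pass on to spls, perf and tune.spls.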