Problem/error with tune.block.splsda

Hello everyone, I hope you are doing well.
I am having trouble choosing the parameters for my DIABLO analysis.

My data, all from the same 21 rats, are:

Microbiota seq 16S - 29 bacterial families.
Metabolon Feces - 575 metabolites
Plasma Metabolon - 635 metabolites.

I used these data as X in my model; as Y, I used each rat's group membership (Tg or WT).

Previously I ran the perf function, which recommended selecting 2 components.

I did not run the model with 10 folds because, when I ran the choice of components with 10 folds, I got this warning:
10: In repeat_cv_perf.diablo(nrep) :
At least one class is not represented in one fold, which may unbalance the error rate.
Consider a number of folds lower than the minimum in table(Y): 9
So I ran it with 9 folds initially, then with 5.
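As a small illustration of that warning, the check it describes can be sketched in base R. The labels below are hypothetical: the thread only says there are 21 rats and that the smallest group has 9, so the assignment of 9 to Tg and 12 to WT is an assumption made for the example.

```r
# Hypothetical labels: 21 rats, smallest class of size 9 (as reported by the warning).
# Which genotype is the smaller group is an assumption for this sketch.
Y <- factor(c(rep("Tg", 9), rep("WT", 12)))

class_sizes <- table(Y)          # number of samples per class
min_class   <- min(class_sizes)  # size of the smallest class

# The warning suggests a number of folds lower than the smallest class size,
# so that every fold can contain at least one sample of each class.
requested_folds <- 10
folds <- min(requested_folds, min_class - 1)

# A more conservative choice, as used later in the thread, is 5 folds.
folds_used <- min(5, folds)
print(c(min_class = min_class, folds = folds, folds_used = folds_used))
```

With so few samples per class, a smaller number of folds combined with a larger nrepeat is a common compromise, but that is a judgment call rather than something the warning itself prescribes.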

Now, when I run tune.block.splsda, I get errors that I do not know how to interpret, and I could not find an answer in the forum. Here are two different runs:

set.seed(123) # For reproducibility with this handbook, remove otherwise
test.keepX <- list(microbiota_clr = c(seq(2, 29, 4)),
                   metabolon_feces = c(5:10, seq(11, 575, 20)),
                   metabolon_plasma = c(5:10,seq(5, 635, 20)))


> tune.diablo.tcga <- tune.block.splsda(X, groups$GENOTYPE, ncomp = 2,
+                                       test.keepX = test.keepX, design = design,
+                                       validation = 'Mfold', folds = 5, nrepeat = 5,
+                                       dist = "centroids.dist")
Design matrix has changed to include Y; each block will be
            linked to Y.

You have provided a sequence of keepX of length:  7 for block microbiota_clr and 35 for block metabolon_feces and 38 for block metabolon_plasma.
This results in 9310 models being fitted for each component and each nrepeat, this may take some time to run, be patient!

You can look into the 'BPPARAM' argument to speed up computation time.
Error: BiocParallel errors
  1 remote errors, element index: 1
  4 unevaluated and other errors
  first remote error: Lapack routine dgesv: system is exactly singular: U[2,2] = 0

> tune.diablo.tcga <- tune.block.splsda(X, Y, ncomp = 2,
+                                       test.keepX = test.keepX, design = design,
+                                       validation = 'Mfold', folds = 5, nrepeat = 1,
+                                       dist = "centroids.dist")
Design matrix has changed to include Y; each block will be
            linked to Y.

You have provided a sequence of keepX of length:  7 for block microbiota_clr and 35 for block metabolon_feces and 38 for block metabolon_plasma.
This results in 9310 models being fitted for each component and each nrepeat, this may take some time to run, be patient!

You can look into the 'BPPARAM' argument to speed up computation time.
Error: BiocParallel errors
  1 remote errors, element index: 1
  0 unevaluated and other errors
  first remote error: Lapack routine dgesv: system is exactly singular: U[2,2] = 0
In addition: There were 13 warnings (use warnings() to see them)
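For reference, the figure of 9310 models in the messages above is just the product of the grid lengths, which can be checked in base R without mixOmics. The sketch below reuses the test.keepX definition from the post; the variable names n_per_block, n_models and dup are only for illustration.

```r
# Same grid as in the post above
test.keepX <- list(microbiota_clr   = c(seq(2, 29, 4)),
                   metabolon_feces  = c(5:10, seq(11, 575, 20)),
                   metabolon_plasma = c(5:10, seq(5, 635, 20)))

# Number of candidate keepX values per block
n_per_block <- lengths(test.keepX)
print(n_per_block)

# Every combination across blocks is fitted, so the grid size is the product
n_models <- prod(n_per_block)
print(n_models)

# Values that appear more than once within a block add models without adding information
dup <- lapply(test.keepX, function(v) v[duplicated(v)])
print(dup)
```

Note that the plasma grid contains the value 5 twice (once from 5:10 and once from seq(5, 635, 20)), so wrapping each vector in unique() would shrink the grid slightly at no cost.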

If you have any suggestions, it would be very helpful. Thank you very much!

Thanks for the post @Lorengol, I’ll look into this and see what I can find

Max,
Thank you very much. For what it’s worth, I ran this way and it worked for me:

set.seed(123) # For reproducibility with this handbook, remove otherwise
test.keepX <- list(microbiota_clr = c(5,10,15,20,25,29),
                   metabolon_feces = c(5,10,25,50,100,150,200,350,500,575),
                   metabolon_plasma = c(5,10,25,50,100,150,200,350,500,575,635))

BPPARAM <- BiocParallel::SnowParam(workers = parallel::detectCores()-1)


tune.diablo.tcga <- tune.block.splsda(X, Y, ncomp = 2,
                                       test.keepX = test.keepX, design = design,
                                       validation = 'Mfold', folds = 5, nrepeat = 5,
                                       dist = "centroids.dist",max.iter = 200)

I can only use 5 folds, since one group has n = 9 and so I cannot use 10. Even with 5 folds, I kept getting the same error. What finally worked was testing fewer values: before, I had passed long sequences of candidate values for each block, and I replaced them with a shorter list. Honestly, I don’t know why passing more values in test.keepX gives me the message that the system is exactly singular.

Thank you for your time!

@MaxBladen Hello, how are you? Were you able to detect the problem? Thank you very much!

G’day @Lorengol

Unfortunately I haven’t had much time to spend on this one. Due to the size of the mixOmics team, I’ve had to prioritise some other issues. Also, I’ve just come back from a short break.

I have done some exploration and have an idea of where the problem might stem from: it relates to the size of the input data frame. With too many features (columns), the predictor matrix becomes “singular”, meaning it cannot be inverted. Matrix inversion is a necessary step in model building, so with too many features this step fails and the error is raised. This occurs while the computation is running in parallel, hence BiocParallel is what reports the error.
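To make the idea of a singular matrix concrete, here is a minimal base R sketch. It is not the mixOmics code path, only an illustration of the same kind of failure: solve() calls the Lapack routine dgesv, and it refuses to invert a matrix whose columns are linearly dependent. The matrices A and X below are made up for the example.

```r
# A 2 x 2 matrix whose second column is twice the first: rank 1, so not invertible
A <- matrix(c(1, 2, 2, 4), nrow = 2)

# Capture the error message instead of stopping the script
msg <- tryCatch({
  solve(A)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# With more columns (features) than rows (samples), the cross-product
# matrix t(X) %*% X cannot have full rank, so it is singular as well.
set.seed(1)
X   <- matrix(rnorm(3 * 5), nrow = 3)  # 3 samples, 5 features
XtX <- crossprod(X)                    # 5 x 5 matrix
rank_XtX <- qr(XtX)$rank
print(c(rank = rank_XtX, size = ncol(XtX)))
```

In a 5-fold split of 21 samples, each training set has only about 17 rows, so any step that inverts a matrix built from more selected features than that could plausibly hit this condition. Whether that is what actually happens inside tune.block.splsda is still the hypothesis described above, not something this sketch confirms.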

I’ll be sure to reach out once I’ve had some time to spend on this issue. Thanks again for raising the issue - user feedback is the holy grail of development.

Max,

Sounds logical. I understand that you are all busy. Thank you very much for taking the time to get back to me and keep me in the loop.
It makes sense what you say about this issue.
Thanks and we’ll be in touch!