Excessive runtime using tune.block.splsda()

Hi,

In advance, thanks for this amazing package.
I am using DIABLO, with four blocks, and the optimum number of components is three.
The algorithm is running in a cluster, with 32 cores, I set the progress bar equal T.

Though with multicore, the algorithm says that is not possible, the progress is shown, thus so far so good.

However, when it comes to the second component, the progress is not shown anymore. It keeps thinking one more day, and it seems to be tucked.

Moreover, the number of models is 243 for every 10 repeats and 3 folds, so it should be returning any output, in minutesā€¦ but it does not

The sequence of the variables to be selected is the following:
list.keppX ā† list(clinical_data=seq(1,16,5),IP=seq(1,5,2),metabolome=seq(1,37,10),microbiome=seq(1,50,10))

Any help ?

Howdy @eddgeag,

Without any context, its hard to provide much help. In my own experience, using 16 cores, 243 models (and then multiplied by 30 (10 repeats x 3 folds)) takes quite a bit of time to run. How long are we talking in terms of run time?

What is the BPPARAM function that youā€™re using? Iā€™ve noticed that sometimes Iā€™ve written the wrong function call but no errors were raised, meaning that only one core was being used. Can you see the progress bar moving along or is it static? As if you can see it moving, then your BPPARAM function has an error as while the progress bar will appear, it is static when multiple cores are being utilised.

Also, if you are on windows ensure that you use the SnowParam() function and not MulticoreParam().

If you want to post your code below I can more adequately help. Also, knowing at least the dimensions of your 4 blocks of data will aid in this process.

Cheers,
Max.

Thank you for your response, sorry for not repliy earlier I was on vacation. My code is the following.

library(mixOmics)
load("datos.RData")
variables_in_bacteria = salvar$variables
datos = salvar$datos
res1.pls.met_micro <- mixOmics::pls(datos$metabolome,datos$microbiome,ncomp=1)
(rho_met_micro <- cor(res1.pls.met_micro$variates$X,res1.pls.met_micro$variates$Y))

res1.pls.met_IP <- mixOmics::pls(datos$IP,datos$metabolome,ncomp = 1)
(rho_met_ip <- cor(res1.pls.met_IP$variates$X,res1.pls.met_IP$variates$Y))

res1.pls.met_clin <- mixOmics::pls(datos$metabolome,datos$clinical_data,ncomp=1)
(rho_pls_met_clin <- cor(res1.pls.met_clin$variates$X,res1.pls.met_clin$variates$Y))

res1.pls_micro_clin <- mixOmics::pls(datos$microbiome,datos$clinical_data,ncomp=1)
(rho_pls_micro_clin <- cor(res1.pls_micro_clin$variates$X,res1.pls_micro_clin$variates$Y))

res1.pls_micro_IP <- mixOmics::pls(datos$microbiome,datos$IP,ncomp = 1)
(rho_pls_micro_IP <- cor(res1.pls_micro_IP$variates$X,res1.pls_micro_IP$variates$Y))

res1.pls_IP_clin <- mixOmics::pls(datos$IP,datos$clinical_data,ncomp = 1)
(rho_pls_IP_clin <- cor(res1.pls_IP_clin$variates$X,res1.pls_IP_clin$variates$Y))

design =matrix(c(0,rho_pls_IP_clin,rho_pls_met_clin,rho_pls_micro_clin,
                 rho_pls_IP_clin,0,rho_met_ip,rho_pls_micro_IP,
                 rho_pls_met_clin,rho_met_ip,0,rho_met_micro,
                 rho_pls_micro_clin,rho_pls_micro_IP,rho_met_micro,0),ncol = 4,dimnames = list(names(datos),names(datos)))
interaccion <-with(variables_in_bacteria,interaction(GROUP,OBESE))
names(interaccion)<-variables_in_bacteria$Paciente
sgccda.res_total_1 = block.splsda(X = datos, Y=interaccion, ncomp = 5,
                                  design = design)

set.seed(123) # for reproducibility, only when the `cpus' argument is not used
# this code takes a couple of min to run
perf.diablo_toltal_1 = mixOmics::perf(sgccda.res_total_1, validation = 'Mfold', folds = 3, nrepeat = 50)


perf.diablo_toltal_1$choice.ncomp$WeightedVote
ncomp = perf.diablo_toltal_1$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]
list.keppX <- list(clinical_data=seq(1,ncol(datos$clinical_data),5),IP=c(1:ncol(datos$IP)),metabolome=seq(14,ncol(datos$metabolome),2),microbiome=seq(1,50,5))
BPPARAM <- BiocParallel::MulticoreParam(workers = parallel::detectCores()-1)
print(BPPARAM)
print(paste("numero de componentes",ncomp))
print("aqui")
tune.splsda.srbct <- tune.block.splsda(X=datos, Y=interaccion, ncomp = ncomp, # we suggest to push ncomp a bit more, e.g. 4
                                 test.keepX=list.keppX,
                                design=design,
                                nrepeat=50,
                                BPPARAM=BPPARAM,
                                fold=3,progressBar =T)   # we suggest nrepeat = 50

plot(tune.splsda.srbct)
print("aqui2")
# Now tuning a new component given previous tuned keepX

Just sending your code without any context doesnā€™t help sorry.

How large are your datasets? How long is the run time? What is your operating system? How many cores does your CPU have? Have you tried testing the runtime at lower nrepeats or a subset of your data?

Once I have some context for your issue I can help

Sorry, Max for not putting it into context,

I have 4 blocks: The clinical data set, comprises 16 dimensions, the Intestinal Permeability markers, which comprise 5 variables, the metabolome block that has 37 metabolites, and finally the metagenome data sets that have 87 genera. The outcome is 6 subgroups, obese nonobese, by sex, and by the syndrome, I am studying, the PCOS. The highest number of subjects in the subgroup is 9 and the lowest is 7. I have 46 subjects.

I have been running on a cluster with 37 cores, the operating system is Linux, and no, I have 50 repeated. I did not subset my data, because of the low number of patients.

What I want is a biological signature of the components that best represent each group, i.e: Males Obese and Nonobese, Females Obese and Nonobese, PCOS females obese, and PCOS females non-obese. Furthermore, I am viewing how well the DIABLO method can discriminate between these subjects, so I can tell if the method works well or not.

The greeds of the tuning component I am not so sure how can I put in them. On the docs, it says that sequences of like seq(1,50,5) should work, but on other docs, it says like (10:50,seq(51,60,2)) so I am not so sure what I can do.

Furthermoe, I am running the algorithm on a cluster, with the same parameters on my personal computer with 4 nodes, and it seems it is working con mi laptop, but not on the cluster. The progress bar is shown on my personal computer but not on the cluster.

That is odd. Theyā€™re all relatively small datasets so shouldnā€™t be taking too long. If you run the same code on your laptop and the cluster, and it only works on the former, it may be an issue with your Linux cluster. If the progress bar is not showing at all (and youā€™re sure youā€™ve set progressBar = TRUE), then this is likely an indicator that your issue is machine specific.

Unfortunately, if youā€™re unable to reproduce the issue with your second machine then I am very unlikely to do the same. Hence, Iā€™m not sure what I can do for you.

What I want is a biological signature of the components that best represent each group, i.e: Males Obese and Nonobese, Females Obese and Nonobese, PCOS females obese, and PCOS females non-obese.

The plotLoadings() function will aid in visualising this. It can colour the bars by which class is associated with the maximum/minimum mean/median (set by the contrib and method parameters respectively). Look here for more info

The greeds of the tuning component I am not so sure how can I put in them. On the docs, it says that sequences of like seq(1,50,5) should work, but on other docs, it says like (10:50,seq(51,60,2)) so I am not so sure what I can do.

I shall quote myself below from this post which Iā€™d recommend have a read through. Tuning keepX in an iterative manner - by decreasing the range and increasing the granularity of the grid at each step - will save time.

Ensure when you tune these models that you use an adequately high number of repeats (ie. nrepeat = 100 ). As this may take a lot of time, Iā€™d also suggest doing the tuning over multiple steps - increasing the resolution and decreasing the range of the grid at each iteration. Eg. Start with test.keepX = seq(10, 150, 10) and based on the output (lets say it selects 50), then undergo tuning again with test.keepX = seq(30, 70, 5) ; rinse and repeat.

If you have any more questions about analysis and usage of the package I can help. However, I believe your issue of excessive runtime on your Linux cluster is something that I cannot aid in diagnosing. Best of luck.

Cheers,
Max.

Hello Max, thank you in advance for your reply

Another thing that is happening is that when tuning, the results give me repeated variables selected. Then when making the tuning again with already. tested, it gives me singular matrices, for that reason I cannot follow the analysis. On the tutorials that do not happen. So, when tuning again, should I discard those variables already selected?

when making the tuning again with already. tested, it gives me singular matrices

Iā€™m not exactly sure what you mean sorry. What gives you singular matrices? Iā€™ve run the below code and the output of the two tune.block.splsda() calls have the same structure (obviously with different values). Can you please confirm which version of mixOmics you are using by reporting via the sessionInfo() function.

library(mixOmics)
data(breast.TCGA) # load in the data

# set a list of all the X dataframes
data = list(miRNA = breast.TCGA$data.train$mirna, 
            mRNA = breast.TCGA$data.train$mrna,
            proteomics = breast.TCGA$data.train$protein)

Y = breast.TCGA$data.train$subtype # set the response variable as the Y df

BPPARAM <- BiocParallel::SnowParam(workers = parallel::detectCores()-1)

design = matrix(0.1, ncol = length(data), nrow = length(data), 
                dimnames = list(names(data), names(data)))
diag(design) = 0 # set diagonal to 0s

test.keepX <- list(mrna = c(10, 30), mirna = c(15, 25), protein = c(4, 8))


tune.comp12 <- tune.block.splsda(X = data, Y = Y,
                                 ncomp = 2,
                                 test.keepX = test.keepX,
                                 design = design,
                                 nrepeat = 2,
                                 BPPARAM = BPPARAM
)
#> Design matrix has changed to include Y; each block will be
#>             linked to Y.
#> 
#> You have provided a sequence of keepX of length: 2 for block mrna and 2 for block mirna and 2 for block protein.
#> This results in 8 models being fitted for each component and each nrepeat, this may take some time to run, be patient!
#> 
tune.comp12$choice.keepX
#> $miRNA
#> [1] 10 30
#> 
#> $mRNA
#> [1] 15 25
#> 
#> $proteomics
#> [1] 4 4


# Now tuning a new component given previous tuned keepX
already.tested.X = tune.comp12$choice.keepX
tune.comp34 = tune.block.splsda(X = data, Y = Y,
                                ncomp = 4,
                                test.keepX = test.keepX,
                                design = design,
                                nrepeat = 2,
                                BPPARAM = BPPARAM,
                                already.tested.X = already.tested.X
)
#> Design matrix has changed to include Y; each block will be
#>             linked to Y.
#> 
#> You have provided a sequence of keepX of length: 2 for block mrna and 2 for block mirna and 2 for block protein.
#> This results in 8 models being fitted for each component and each nrepeat, this may take some time to run, be patient!
#> 

tune.comp34$choice.keepX
#> $miRNA
#> [1] 10 30 10 10
#> 
#> $mRNA
#> [1] 15 25 15 25
#> 
#> $proteomics
#> [1] 4 4 4 8

Created on 2022-03-09 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 Patched (2021-11-16 r81220)
#>  os       Windows 10 x64 (build 19042)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_Australia.1252
#>  ctype    English_Australia.1252
#>  tz       Australia/Sydney
#>  date     2022-03-09
#>  pandoc   2.14.2 @ C:/Users/Work/AppData/Local/Pandoc/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package      * version date (UTC) lib source
#>  backports      1.4.1   2021-12-13 [1] CRAN (R 4.1.2)
#>  BiocParallel   1.28.1  2021-11-18 [1] Bioconductor
#>  cli            3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
#>  colorspace     2.0-3   2022-02-21 [1] CRAN (R 4.1.2)
#>  corpcor        1.6.10  2021-09-16 [1] CRAN (R 4.1.1)
#>  crayon         1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
#>  digest         0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr          1.0.8   2022-02-08 [1] CRAN (R 4.1.2)
#>  ellipse        0.4.2   2020-05-27 [1] CRAN (R 4.1.2)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
#>  evaluate       0.15    2022-02-18 [1] CRAN (R 4.1.2)
#>  fansi          1.0.2   2022-01-14 [1] CRAN (R 4.1.2)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  generics       0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
#>  ggplot2      * 3.3.5   2021-06-25 [1] CRAN (R 4.1.2)
#>  ggrepel        0.9.1   2021-01-15 [1] CRAN (R 4.1.2)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
#>  gridExtra      2.3     2017-09-09 [1] CRAN (R 4.1.2)
#>  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.1.2)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.1.2)
#>  htmltools      0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
#>  igraph         1.2.11  2022-01-04 [1] CRAN (R 4.1.2)
#>  knitr          1.37    2021-12-16 [1] CRAN (R 4.1.2)
#>  lattice      * 0.20-45 2021-09-22 [2] CRAN (R 4.1.2)
#>  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
#>  magrittr       2.0.2   2022-01-26 [1] CRAN (R 4.1.2)
#>  MASS         * 7.3-54  2021-05-03 [2] CRAN (R 4.1.2)
#>  Matrix         1.3-4   2021-06-01 [2] CRAN (R 4.1.2)
#>  matrixStats    0.61.0  2021-09-17 [1] CRAN (R 4.1.2)
#>  mixOmics     * 6.18.1  2021-11-18 [1] Bioconductor (R 4.1.2)
#>  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.1.2)
#>  pillar         1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  plyr           1.8.6   2020-03-03 [1] CRAN (R 4.1.2)
#>  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.1.2)
#>  R.cache        0.15.0  2021-04-30 [1] CRAN (R 4.1.2)
#>  R.methodsS3    1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo           1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils        2.11.0  2021-09-26 [1] CRAN (R 4.1.2)
#>  R6             2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  rARPACK        0.11-0  2016-03-10 [1] CRAN (R 4.1.2)
#>  RColorBrewer   1.1-2   2014-12-07 [1] CRAN (R 4.1.1)
#>  Rcpp           1.0.8   2022-01-13 [1] CRAN (R 4.1.2)
#>  reprex         2.0.1   2021-08-05 [1] CRAN (R 4.1.2)
#>  reshape2       1.4.4   2020-04-09 [1] CRAN (R 4.1.2)
#>  rlang          1.0.1   2022-02-03 [1] CRAN (R 4.1.2)
#>  rmarkdown      2.12    2022-03-02 [1] CRAN (R 4.1.2)
#>  RSpectra       0.16-0  2019-12-01 [1] CRAN (R 4.1.2)
#>  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  scales         1.1.1   2020-05-11 [1] CRAN (R 4.1.2)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  snow           0.4-4   2021-10-27 [1] CRAN (R 4.1.1)
#>  stringi        1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.1.2)
#>  styler         1.6.2   2021-09-23 [1] CRAN (R 4.1.2)
#>  tibble         3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
#>  tidyr          1.2.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  tidyselect     1.1.2   2022-02-21 [1] CRAN (R 4.1.2)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs          0.3.8   2021-04-29 [1] CRAN (R 4.1.2)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
#>  xfun           0.30    2022-03-02 [1] CRAN (R 4.1.2)
#>  yaml           2.3.5   2022-02-21 [1] CRAN (R 4.1.2)
#> 
#>  [1] C:/Users/Work/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.2patched/library
#> 
#> ------------------------------------------------------------------------------

Hello Max,

If I add a second tune, like

tune.comp13 <- tune.comp12 <- tune.block.splsda(X = data, Y = Y,
                                                ncomp = 3,
                                                already.tested.X = tune.comp12$choice.keepX,
                                                test.keepX = test.keepX,
                                                design = design,
                                                nrepeat = 2,
                                                BPPARAM = BPPARAM
)`


It gives me the same variables, when I do the same on my code, it says singular matrix
The SessionInfo is the following:


R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] mixOmics_6.18.1 ggplot2_3.3.5   lattice_0.20-45 MASS_7.3-55    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8          RSpectra_0.16-0     plyr_1.8.6         
 [4] pillar_1.7.0        compiler_4.1.2      RColorBrewer_1.1-2 
 [7] tools_4.1.2         lifecycle_1.0.1     tibble_3.1.6       
[10] gtable_0.3.0        pkgconfig_2.0.3     rlang_1.0.1        
[13] Matrix_1.4-0        igraph_1.2.11       DBI_1.1.2          
[16] cli_3.2.0           ggrepel_0.9.1       parallel_4.1.2     
[19] gridExtra_2.3       stringr_1.4.0       withr_2.4.3        
[22] dplyr_1.0.8         generics_0.1.2      vctrs_0.3.8        
[25] grid_4.1.2          tidyselect_1.1.2    glue_1.6.2         
[28] ellipse_0.4.2       R6_2.5.1            snow_0.4-4         
[31] fansi_1.0.2         rARPACK_0.11-0      BiocParallel_1.28.3
[34] tidyr_1.2.0         reshape2_1.4.4      purrr_0.3.4        
[37] corpcor_1.6.10      magrittr_2.0.2      scales_1.1.1       
[40] ellipsis_0.3.2      matrixStats_0.61.0  assertthat_0.2.1   
[43] colorspace_2.0-3    utf8_1.2.2          stringi_1.7.6      
[46] munsell_0.5.0       crayon_1.5.0 

Can you please show me what your output looks like? It is very vague saying ā€œit says singular matrixā€

Sorry Max,

Following my code, o the line:

tune = tune.block.splsda(X = datos, Y = interaccion,nrepeat=1,
                         ncomp = ncomp1+1, test.keppX = list.keppX, design = design,
                         already.tested.X = already.tested.X,progressBar=T,fold=3)

Where the list.keppX output and already.tested.X, are the same:

>list.keppX
$clinical_data
[1]  1  4 13  4  1

$IP
[1] 5 2 1 1 3

$metabolome
[1] 19  1 37 19 19

$microbiome
[1]  1  1 81  1 21

> already.tested.X
$clinical_data
[1]  1  4 13  4  1

$IP
[1] 5 2 1 1 3

$metabolome
[1] 19  1 37 19 19

$microbiome
[1]  1  1 81  1 21

So, I thought that being the same some numbers it may lead to the following output and next error:

Design matrix has changed to include Y; each block will be
            linked to Y.

You have provided a sequence of keepX of length: 8 for block clinical_data and 3 for block IP and 6 for block metabolome and 6 for block microbiome.
This results in 864 models being fitted for each component and each nrepeat, this may take some time to run, be patient!

You can look into the 'BPPARAM' argument to speed up computation time.

tuning component 6
  |                                                             |   0%Error: BiocParallel errors
  1 remote errors, element index: 1
  0 unevaluated and other errors
  first remote error: Lapack routine dgesv: system is exactly singular: U[6,6] = 0
In addition: Warning message:
In Check.entry.wrapper.mint.block(X = X, Y = Y, indY = indY, ncomp = ncomp,  :
  Reset maximum number of variates 'ncomp[2]'
            to ncol(X[[2]])= 5.

Cheers, Edmond

Hi, I am wondering if you have come across a solution to the longer cluster run time as I am having a similar issue running ā€˜tune.block.splsdaā€™ on our HPC. I have 3 blocks (92 x 539, 92 x 316, 92 x 1200) and am running 3375 models. This finishes overnight on my 6 core windows PC however is still running after a day on our linux cluster, despite having more resources and appearing to be using them as requested.

Iā€™ve tried running with a few different combinations of cpus and tasks in the job submission (slurm script) and have tried changing the bpparam settings using:

BPPARAM ā† BiocParallel::MulticoreParam(workers = X)

Iā€™ve also tried both v6.26.0 and v6.22.0 but both seem to have the same issue. Hoping someone else has come across this!
Thanks

hi @kbow,

Perhaps other people can comment on your post (we are too short here in terms of capacity - I need to hire a new software developer!).

I dont think you should set a fine grid for your tuning (see other posts on this topic), you can reduce the number of models.

Kim-Anh

Yeah I realised I was going a bit crazy on the fine grid! I dropped it to 1000, probably still a bit high, but that finishes in ~9 hours on my PC vs >1.5 days on the HPC. I understand itā€™s likely a system specific issue, so hoping someone else has experienced something similar.
Thanks

1 Like