Tuning multilevel sPLS

Dear mixOmics Team,

First of all, thanks so much for providing such a great tool!

I want to analyse two datasets using multilevel sPLS (canonical mode). To do so, I have been following your Multilevel sPLS Liver Toxicity Case Study (and the associated R script), and I have a couple of questions on the tuning step.

  1. In the tune.splslevel() function, it is possible to include the already.tested.X and already.tested.Y parameters. Do you have any suggestions on how to choose these values?

  2. How can I select the optimal ncomp, keepX and keepY after running the tune.splslevel() function? While the tuning output of a non-multilevel sPLS (tune.spls() function) provides all these optimal parameters, the output of tune.splslevel() is cor.value. It is not clear to me how the optimal ncomp, keepX and keepY can be selected based on this. The tutorial doesn’t cover this step specifically, and I can’t find an answer in the R script either. Sorry if I am missing something.

Thanks for your help!

Emma

Unfortunately, that function is pretty old and isn’t the most intuitive to use. It doesn’t quite work the same as the other tune() functions. You’re right in that the tutorial does not cover how to use it.

The way you’d select these values is to compute the optimal values for each component individually and iteratively. This is tedious and prone to error.

What I’ve done for you (and anyone else who stumbles along here) is created a little function for you to copy to your script:

tune.splslevel.wrapper <- function(X,
                                   Y,
                                   design,
                                   ncomp,
                                   test.keepX,
                                   test.keepY) {
    ncomps <- 1:ncomp # set range of ncomp values to use
    
    final.obj <- list() # initialise returned object

    # default values are -1 for debugging
    atX <- rep(-1, length(ncomps)-1) # already.tested.X
    atY <- rep(-1, length(ncomps)-1) # already.tested.Y
    for (ncomp in ncomps) {

        # previously calculated optimal keepX/Y value 
        tmp.atX <- c(atX[1:ncomp-1])
        tmp.atY <- c(atY[1:ncomp-1])

        # need to be NULL for first iteration
        if (ncomp==1) { tmp.atX <- NULL; tmp.atY <- NULL }

        cat("=== NCOMP:", ncomp, "===\n")
        model <- suppressMessages(tune.splslevel(X, Y, multilevel=design,
                                                mode="regression", 
                                                ncomp=ncomp,
                                                test.keepX = test.keepX, 
                                                test.keepY = test.keepY,
                                                already.tested.X <- tmp.atX,
                                                already.tested.Y <- tmp.atY))

        # extract position in cor.value which corresponds to max
        opt.pos <- which(model$cor.value==max(model$cor.value), arr.ind=T)

        # set these for future iterations
        atX[ncomp] <- test.keepX[opt.pos[1]]
        atY[ncomp] <- test.keepY[opt.pos[2]]

        # add to returned object
        final.obj[["X"]][ncomp] <- test.keepX[opt.pos[1]]
        final.obj[["Y"]][ncomp] <- test.keepY[opt.pos[2]]
        final.obj[["cor"]][ncomp] <- max(model$cor.value)
    }
    # determine ncomp with optimal correlation
    cor.max <- which(final.obj[["cor"]]==max(final.obj[["cor"]]))

    final.obj[["opt.ncomp"]] <- cor.max

    return(final.obj)
}

tune.splslevel.wrapper() will allow you to use your input as you would in other tune() functions. Here’s an example of its use:

library(mixOmics)
data(vac18)

X <- vac18$genes[,1:500]
Y <- vac18$genes[,501:1000]
ML <- vac18$sample
design <- data.frame(sample = ML)

ncomp <- 5
test.keepX = seq(10,100,10)
test.keepY = seq(10,100,10)

tune.splslevel.wrapper(X, Y, design,
                       ncomp, test.keepX, test.keepY)
#> $X
#> [1] 100 100 100  80 100
#> 
#> $Y
#> [1] 100 100  50 100 100
#> 
#> $cor
#> [1] 0.9942615 0.9857412 0.9857662 0.9841769 0.9828682
#> 
#> $opt.ncomp
#> [1] 1

Created on 2022-08-08 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       Australia/Sydney
#>  date     2022-08-08
#>  pandoc   2.18 @ C:/Users/Work/AppData/Local/Pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.2.1)
#>  BiocParallel   1.30.3  2022-06-07 [1] Bioconductor
#>  cli            3.3.0   2022-04-25 [1] CRAN (R 4.2.1)
#>  codetools      0.2-18  2020-11-04 [1] CRAN (R 4.2.1)
#>  colorspace     2.0-3   2022-02-21 [1] CRAN (R 4.2.1)
#>  corpcor        1.6.10  2021-09-16 [1] CRAN (R 4.2.0)
#>  DBI            1.1.3   2022-06-18 [1] CRAN (R 4.2.1)
#>  digest         0.6.29  2021-12-01 [1] CRAN (R 4.2.1)
#>  dplyr          1.0.9   2022-04-28 [1] CRAN (R 4.2.1)
#>  ellipse        0.4.3   2022-05-31 [1] CRAN (R 4.2.1)
#>  evaluate       0.15    2022-02-18 [1] CRAN (R 4.2.1)
#>  fansi          1.0.3   2022-03-24 [1] CRAN (R 4.2.1)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.2.1)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.2.1)
#>  generics       0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
#>  ggplot2      * 3.3.6   2022-05-03 [1] CRAN (R 4.2.1)
#>  ggrepel        0.9.1   2021-01-15 [1] CRAN (R 4.2.1)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.2.1)
#>  gridExtra      2.3     2017-09-09 [1] CRAN (R 4.2.1)
#>  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.2.1)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.2.1)
#>  htmltools      0.5.3   2022-07-18 [1] CRAN (R 4.2.1)
#>  igraph         1.3.4   2022-07-19 [1] CRAN (R 4.2.1)
#>  knitr          1.39    2022-04-26 [1] CRAN (R 4.2.1)
#>  lattice      * 0.20-45 2021-09-22 [1] CRAN (R 4.2.1)
#>  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.2.1)
#>  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.2.1)
#>  MASS         * 7.3-57  2022-04-22 [1] CRAN (R 4.2.1)
#>  Matrix         1.4-1   2022-03-23 [1] CRAN (R 4.2.1)
#>  matrixStats    0.62.0  2022-04-19 [1] CRAN (R 4.2.1)
#>  mixOmics     * 6.20.0  2022-04-26 [1] Bioconductor (R 4.2.0)
#>  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.2.1)
#>  pillar         1.8.0   2022-07-18 [1] CRAN (R 4.2.1)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.2.1)
#>  plyr           1.8.7   2022-03-24 [1] CRAN (R 4.2.1)
#>  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.2.1)
#>  R.cache        0.16.0  2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3    1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.0  2022-06-28 [1] CRAN (R 4.2.1)
#>  R6             2.5.1   2021-08-19 [1] CRAN (R 4.2.1)
#>  rARPACK        0.11-0  2016-03-10 [1] CRAN (R 4.2.1)
#>  RColorBrewer   1.1-3   2022-04-03 [1] CRAN (R 4.2.0)
#>  Rcpp           1.0.9   2022-07-08 [1] CRAN (R 4.2.1)
#>  reprex         2.0.1   2021-08-05 [1] CRAN (R 4.2.1)
#>  reshape2       1.4.4   2020-04-09 [1] CRAN (R 4.2.1)
#>  rlang          1.0.4   2022-07-12 [1] CRAN (R 4.2.1)
#>  rmarkdown      2.14    2022-04-25 [1] CRAN (R 4.2.1)
#>  RSpectra       0.16-1  2022-04-24 [1] CRAN (R 4.2.1)
#>  scales         1.2.0   2022-04-13 [1] CRAN (R 4.2.1)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
#>  stringi        1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.2.1)
#>  styler         1.7.0   2022-03-13 [1] CRAN (R 4.2.1)
#>  tibble         3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyr          1.2.0   2022-02-01 [1] CRAN (R 4.2.1)
#>  tidyselect     1.1.2   2022-02-21 [1] CRAN (R 4.2.1)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.2.1)
#>  vctrs          0.4.1   2022-04-13 [1] CRAN (R 4.2.1)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.2.1)
#>  xfun           0.31    2022-05-10 [1] CRAN (R 4.2.1)
#>  yaml           2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] D:/Programs/Work Programs/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

A brief explanation of whats going (follow along with code comments if needed):

  • Over the desired range of ncomp values (ncomps), we calculate the optimal keepX and keepY value.
  • These values are stored in atX and atY. These vectors use -1 as a default. If you see a -1 in the output, something has gone wrong.
  • the opt.ncomp object in the output is determined via the maximisation of the cor output object.

This last point requires a bit of explanation if you’re wanting to properly understand it. Essentially, when calculating the optimal tuneX/Y and ncomp values, we use the maximisation of β€œcorrelation” as our criteria. What is this β€œcorrelation”?

When forming an sPLS model (using a given ncomp, keepX and keepY), the tune functions split the data into training (majority) and testing subsets. A model is trained using the training data. This model then predicts the component projection values of the testing data. The agreement between the predictions of the test samples (predicted component) and the true test sample values in the component space (actual component) is measure by Pearson correlation.

Hope all this helps a bit. Let me know if you have any follow up questions

1 Like

Hi @MaxBladen,

Thank you for sharing this wrapper function - I had the identical question and this has worked very well for me (at least it looks like it).

I was hoping I could clarify the following things with you:

  • Is this function performing a repeated CV, and if so – what is the default number of repeats? (And if not, can we specify how many permutations we want?);

  • Is the function applying a regression or canonical mode? (I presume regression but just to double check)

  • I have at times been getting: Warning: The SGCCA algorithm did not converge. Is this an issue?

Thank you very much for your time and help, I look forward to hearing from you!

Best wishes,
Marko

Is this function performing a repeated CV, and if so – what is the default number of repeats? (And if not, can we specify how many permutations we want?);

Unfortunately not to both questions. It’s an extremely underdeveloped function and does not have the same capability as the more-used tune.spls().

Is the function applying a regression or canonical mode?

It defaults to canonical

I have at times been getting: Warning: The SGCCA algorithm did not converge. Is this an issue?

This means that given the number of iterations, the threshold for stability in the variates is not reached. I’d recommend adjusting the set.seed() value until this warning does not show, then stick with that value.

Overall, a pipeline that will serve you better than the tune.splslevel.wrapper() function I wrote above is the following:

library(mixOmics)
data(vac18)

X <- vac18$genes[,1:500]
Y <- vac18$genes[,501:1000]
ML <- vac18$sample
design <- data.frame(sample = ML)

ncomp <- 5
test.keepX = seq(10,100,10)
test.keepY = seq(10,100,10)

Xw <- withinVariation(X = X, design = design)
Yw <- withinVariation(X = Y, design = design)

tune.spls(Xw, Yw, 
          ncomp=ncomp, 
          test.keepX=test.keepX, test.keepY=test.keepY,
          validation = "...",
          mode = "...",
          folds = ...,
          nrepeat = ...,
          measure = "...",
          BPPARAM = BiocParallel::SnowParam(workers="..."))

Essentially, we get the same functionality of tune.splslevel() by calling the withinVariation() function manually. Then, by passing this to tune.spls(), you can have full control over the mode (you can set it to "regression"), the nrepeat, etc

Hi @MaxBladen ,

Thank you so much for your very quick reply, this is amazing!

I will test the code you shared asap and get back to you once I get the results back.

Best wishes,
Marko