Tuning multilevel sPLS

Dear mixOmics Team,

First of all, thanks so much for providing such a great tool!

I want to analyse two datasets using multilevel sPLS (canonical mode). To do so, I have been following your Multilevel sPLS Liver Toxicity Case Study (and the associated R script), and I have a couple of questions on the tuning step.

  1. In the tune.splslevel() function, it is possible to include the already.tested.X and already.tested.Y parameters. Do you have any suggestions on how to choose these values?

  2. How can I select the optimal ncomp, keepX and keepY after running the tune.splslevel() function? While the tuning output of a non-multilevel sPLS (tune.spls() function) provides all these optimal parameters, the output of tune.splslevel() is cor.value. It is not clear to me how the optimal ncomp, keepX and keepY can be selected based on this. The tutorial doesn’t cover this step specifically, and I can’t find an answer in the R script either. Sorry if I am missing something.

Thanks for your help!

Emma

Unfortunately, that function is pretty old and isn’t the most intuitive to use. It doesn’t quite work the same as the other tune() functions. You’re right in that the tutorial does not cover how to use it.

The way you’d select these values is to compute the optimal values for each component individually and iteratively. This is tedious and prone to error.

What I’ve done for you (and anyone else who stumbles along here) is created a little function for you to copy to your script:

tune.splslevel.wrapper <- function(X,
                                   Y,
                                   design,
                                   ncomp,
                                   test.keepX,
                                   test.keepY) {
    ncomps <- 1:ncomp # set range of ncomp values to use
    
    final.obj <- list() # initialise returned object

    # default values are -1 for debugging
    atX <- rep(-1, length(ncomps)-1) # already.tested.X
    atY <- rep(-1, length(ncomps)-1) # already.tested.Y
    for (ncomp in ncomps) {

        # previously calculated optimal keepX/Y value 
        tmp.atX <- c(atX[1:ncomp-1])
        tmp.atY <- c(atY[1:ncomp-1])

        # need to be NULL for first iteration
        if (ncomp==1) { tmp.atX <- NULL; tmp.atY <- NULL }

        cat("=== NCOMP:", ncomp, "===\n")
        model <- suppressMessages(tune.splslevel(X, Y, multilevel=design,
                                                mode="regression", 
                                                ncomp=ncomp,
                                                test.keepX = test.keepX, 
                                                test.keepY = test.keepY,
                                                already.tested.X <- tmp.atX,
                                                already.tested.Y <- tmp.atY))

        # extract position in cor.value which corresponds to max
        opt.pos <- which(model$cor.value==max(model$cor.value), arr.ind=T)

        # set these for future iterations
        atX[ncomp] <- test.keepX[opt.pos[1]]
        atY[ncomp] <- test.keepY[opt.pos[2]]

        # add to returned object
        final.obj[["X"]][ncomp] <- test.keepX[opt.pos[1]]
        final.obj[["Y"]][ncomp] <- test.keepY[opt.pos[2]]
        final.obj[["cor"]][ncomp] <- max(model$cor.value)
    }
    # determine ncomp with optimal correlation
    cor.max <- which(final.obj[["cor"]]==max(final.obj[["cor"]]))

    final.obj[["opt.ncomp"]] <- cor.max

    return(final.obj)
}

tune.splslevel.wrapper() will allow you to use your input as you would in other tune() functions. Here’s an example of its use:

library(mixOmics)
data(vac18)

X <- vac18$genes[,1:500]
Y <- vac18$genes[,501:1000]
ML <- vac18$sample
design <- data.frame(sample = ML)

ncomp <- 5
test.keepX = seq(10,100,10)
test.keepY = seq(10,100,10)

tune.splslevel.wrapper(X, Y, design,
                       ncomp, test.keepX, test.keepY)
#> $X
#> [1] 100 100 100  80 100
#> 
#> $Y
#> [1] 100 100  50 100 100
#> 
#> $cor
#> [1] 0.9942615 0.9857412 0.9857662 0.9841769 0.9828682
#> 
#> $opt.ncomp
#> [1] 1

Created on 2022-08-08 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       Australia/Sydney
#>  date     2022-08-08
#>  pandoc   2.18 @ C:/Users/Work/AppData/Local/Pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.2.1)
#>  BiocParallel   1.30.3  2022-06-07 [1] Bioconductor
#>  cli            3.3.0   2022-04-25 [1] CRAN (R 4.2.1)
#>  codetools      0.2-18  2020-11-04 [1] CRAN (R 4.2.1)
#>  colorspace     2.0-3   2022-02-21 [1] CRAN (R 4.2.1)
#>  corpcor        1.6.10  2021-09-16 [1] CRAN (R 4.2.0)
#>  DBI            1.1.3   2022-06-18 [1] CRAN (R 4.2.1)
#>  digest         0.6.29  2021-12-01 [1] CRAN (R 4.2.1)
#>  dplyr          1.0.9   2022-04-28 [1] CRAN (R 4.2.1)
#>  ellipse        0.4.3   2022-05-31 [1] CRAN (R 4.2.1)
#>  evaluate       0.15    2022-02-18 [1] CRAN (R 4.2.1)
#>  fansi          1.0.3   2022-03-24 [1] CRAN (R 4.2.1)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.2.1)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.2.1)
#>  generics       0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
#>  ggplot2      * 3.3.6   2022-05-03 [1] CRAN (R 4.2.1)
#>  ggrepel        0.9.1   2021-01-15 [1] CRAN (R 4.2.1)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.2.1)
#>  gridExtra      2.3     2017-09-09 [1] CRAN (R 4.2.1)
#>  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.2.1)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.2.1)
#>  htmltools      0.5.3   2022-07-18 [1] CRAN (R 4.2.1)
#>  igraph         1.3.4   2022-07-19 [1] CRAN (R 4.2.1)
#>  knitr          1.39    2022-04-26 [1] CRAN (R 4.2.1)
#>  lattice      * 0.20-45 2021-09-22 [1] CRAN (R 4.2.1)
#>  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.2.1)
#>  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.2.1)
#>  MASS         * 7.3-57  2022-04-22 [1] CRAN (R 4.2.1)
#>  Matrix         1.4-1   2022-03-23 [1] CRAN (R 4.2.1)
#>  matrixStats    0.62.0  2022-04-19 [1] CRAN (R 4.2.1)
#>  mixOmics     * 6.20.0  2022-04-26 [1] Bioconductor (R 4.2.0)
#>  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.2.1)
#>  pillar         1.8.0   2022-07-18 [1] CRAN (R 4.2.1)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.2.1)
#>  plyr           1.8.7   2022-03-24 [1] CRAN (R 4.2.1)
#>  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.2.1)
#>  R.cache        0.16.0  2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3    1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.0  2022-06-28 [1] CRAN (R 4.2.1)
#>  R6             2.5.1   2021-08-19 [1] CRAN (R 4.2.1)
#>  rARPACK        0.11-0  2016-03-10 [1] CRAN (R 4.2.1)
#>  RColorBrewer   1.1-3   2022-04-03 [1] CRAN (R 4.2.0)
#>  Rcpp           1.0.9   2022-07-08 [1] CRAN (R 4.2.1)
#>  reprex         2.0.1   2021-08-05 [1] CRAN (R 4.2.1)
#>  reshape2       1.4.4   2020-04-09 [1] CRAN (R 4.2.1)
#>  rlang          1.0.4   2022-07-12 [1] CRAN (R 4.2.1)
#>  rmarkdown      2.14    2022-04-25 [1] CRAN (R 4.2.1)
#>  RSpectra       0.16-1  2022-04-24 [1] CRAN (R 4.2.1)
#>  scales         1.2.0   2022-04-13 [1] CRAN (R 4.2.1)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
#>  stringi        1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.2.1)
#>  styler         1.7.0   2022-03-13 [1] CRAN (R 4.2.1)
#>  tibble         3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyr          1.2.0   2022-02-01 [1] CRAN (R 4.2.1)
#>  tidyselect     1.1.2   2022-02-21 [1] CRAN (R 4.2.1)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.2.1)
#>  vctrs          0.4.1   2022-04-13 [1] CRAN (R 4.2.1)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.2.1)
#>  xfun           0.31    2022-05-10 [1] CRAN (R 4.2.1)
#>  yaml           2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] D:/Programs/Work Programs/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

A brief explanation of whats going (follow along with code comments if needed):

  • Over the desired range of ncomp values (ncomps), we calculate the optimal keepX and keepY value.
  • These values are stored in atX and atY. These vectors use -1 as a default. If you see a -1 in the output, something has gone wrong.
  • the opt.ncomp object in the output is determined via the maximisation of the cor output object.

This last point requires a bit of explanation if you’re wanting to properly understand it. Essentially, when calculating the optimal tuneX/Y and ncomp values, we use the maximisation of “correlation” as our criteria. What is this “correlation”?

When forming an sPLS model (using a given ncomp, keepX and keepY), the tune functions split the data into training (majority) and testing subsets. A model is trained using the training data. This model then predicts the component projection values of the testing data. The agreement between the predictions of the test samples (predicted component) and the true test sample values in the component space (actual component) is measure by Pearson correlation.

Hope all this helps a bit. Let me know if you have any follow up questions