Reducing number of features DIABLO

Hi everyone,

I have a small dataset, Its composed of resistant cell lines and their parental. There’s only one parental per line. And there are no more than 5 clones, in some cases just one.

Also, MUTATION block does have NA values, but I still get the same error message wherever I use It or not.

When I try to run tune.block.splsda

test.keepX = list (CNV = c(25,25), 
                   MUTATION = c(25,25),
                   RNA = c(25,25),
                   ENHANCER = c(25,25),
                   PROMOTER = c(25,25))

tune.TCGA = tune.block.splsda(X = diablo_input, Y = diablo_metadata, ncomp = 2, 
                              test.keepX = test.keepX, design = design,
                              validation = 'loo', folds = 5, nrepeat = 1,
                              dist = "centroids.dist",max.iter = 200)


Design matrix has changed to include Y; each block will be
linked to Y.

You have provided a sequence of keepX of length: 2 for block CNV and 2 for block MUTATION and 2 for block RNA and 2 for block ENHANCER and 2 for block PROMOTER.
This results in 32 models being fitted for each component and each nrepeat, this may take some time to run, be patient!

You can look into the ‘BPPARAM’ argument to speed up computation time.
Error in 1:n : NA/NaN argument

Does anybody have some idea? I leave a link to the Robject.

Also, I would like to ask If, with this many samples, this has any statistical sense at all.

sessionInfo()

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.14.4 forcats_0.5.2 stringr_1.4.1 purrr_0.3.5 readr_2.1.3 tidyverse_1.3.2 pheatmap_1.0.12
[8] tibble_3.1.8 tidyr_1.2.1 mixOmics_6.19.3 ggplot2_3.3.6 lattice_0.20-45 MASS_7.3-58 dplyr_1.0.10

loaded via a namespace (and not attached):
[1] matrixStats_0.62.0 fs_1.5.2 lubridate_1.8.0 METACLUSTER_1.0.0 bit64_4.0.5 RColorBrewer_1.1-3
[7] httr_1.4.4 rprojroot_2.0.3 tools_4.2.1 backports_1.4.1 utf8_1.2.2 R6_2.5.1
[13] DBI_1.1.3 colorspace_2.0-3 withr_2.5.0 prettyunits_1.1.1 processx_3.7.0 tidyselect_1.2.0
[19] gridExtra_2.3 curl_4.3.3 bit_4.0.4 compiler_4.2.1 cli_3.4.1 rvest_1.0.3
[25] xml2_1.3.3 labeling_0.4.2 scales_1.2.1 callr_3.7.2 digest_0.6.30 rmarkdown_2.17
[31] pkgconfig_2.0.3 htmltools_0.5.4 dbplyr_2.2.1 fastmap_1.1.0 rlang_1.0.6 readxl_1.4.1
[37] rstudioapi_0.14 farver_2.1.1 generics_0.1.3 jsonlite_1.8.3 BiocParallel_1.30.4 vroom_1.6.0
[43] googlesheets4_1.0.1 magrittr_2.0.3 Matrix_1.5-1 Rcpp_1.0.9 munsell_0.5.0 fansi_1.0.3
[49] lifecycle_1.0.3 stringi_1.7.8 yaml_2.3.6 pkgbuild_1.3.1 plyr_1.8.7 grid_4.2.1
[55] parallel_4.2.1 ggrepel_0.9.1 crayon_1.5.2 haven_2.5.1 hms_1.1.2 ps_1.7.1
[61] knitr_1.40 pillar_1.8.1 igraph_1.4.0 corpcor_1.6.10 reshape2_1.4.4 codetools_0.2-18
[67] reprex_2.0.2 glue_1.6.2 evaluate_0.17 remotes_2.4.2 BiocManager_1.30.18 modelr_0.1.9
[73] vctrs_0.5.0 tzdb_0.3.0 cellranger_1.1.0 gtable_0.3.1 assertthat_0.2.1 xfun_0.34
[79] broom_1.0.1 RSpectra_0.16-1 googledrive_2.0.0 gargle_1.2.1 rARPACK_0.11-0 ellipse_0.4.3
[85] ellipsis_0.3.2

Hi @Jose_Gracia,

My hint is because some of your groups only include one sample, when this is removed in cross-validation, it breaks the system. You would need to regroup some samples so that all classes are represented in leave-one-out cross validation (i.e at least 2 samples per class).

the MUTATION data may also posing a problem, as you noticed.

And also you are using the tune.keepX parameters incorrectly, they should specify the parameters to try for each component, e.g.

test.keepX = list(CNV = c(5,10, 25), 
                  RNA = c(5,10, 25),
                  ENHACER = c(5,10, 25),  # note here that is should also match exactly how you named your datasets
                  PROMOTER = c(5,10, 25))

I hope that helps, thanks for sharing your data.

Kim-Anh

Hi @kimanh.lecao ,

Thank you for your response.

I have another question, this time regarding the heatmap.

  1. Is there a way of getting the legend out of the figure?

  2. How can I keep my gene names from being cut?

  3. Is there a way of extracting this data so I can plot It with other libraries?

Thank you for your time!

Jose.

hi @Jose_Gracia

Assuming you are using the function cimDiablo(), you will see in the help file of that function that you can extract the M object from that function.

e.g.

data(nutrimouse)
Y = nutrimouse$diet
data = list(gene = nutrimouse$gene, lipid = nutrimouse$lipid)
design = matrix(c(0,1,1,1,0,1,1,1,0), ncol = 3, nrow = 3, byrow = TRUE)


nutrimouse.sgccda <- block.splsda(X = data,
Y = Y,
design = design,
keepX = list(gene = c(10,10), lipid = c(15,15)),
ncomp = 2,
scheme = "centroid")

output = cimDiablo(nutrimouse.sgccda)
oputput$M

I am not sure how to solve the legend / cropping issues. You could transpose, and change the name of your genes. See the help file which lists these arguments:

margins numeric vector of length two containing the margins (see par(mar)) for column and row names respectively.
legend.position position of the legend, one of “bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right” and “center”.
transpose logical indicating if the matrix should be transposed for plotting. Defaults to FALSE.
row.names, col.names logical, should the name of rows and/or columns of mat be shown? If TRUE (defaults) rownames(mat) and/or colnames(mat) are used. Possible character vectors with row and/or column labels can be used.

Kim-Anh