Error when trying to tune number of features

Hello everyone,

I’m trying to do an N-integration analysis with DIABLO to identify responders in a clinical trial. I’m tuning the model for different design values, c(0, 0.25, 0.5, 1.0), and trying to pick the one that maximizes AUC. For each design value d, I do:

design = matrix(d,
                ncol = length(data),
                nrow = length(data),
                dimnames = list(names(data),
                                names(data)))
diag(design) = 0 

basic.diablo.model = block.splsda(X = data,
                                  Y = Y,
                                  ncomp = 4,
                                  design = design)

perf.diablo = perf(basic.diablo.model,
                   validation = 'Mfold',
                   folds = 5,
                   nrepeat = 50)

optimal_ncomp = perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "max.dist"]

test.keepX = list(df_metabolic = c(seq(1,ncol(data$df_metabolic),5)),
                  df_lipids  = c(seq(1,ncol(data$df_lipids),25)),
                  df_pgs  = c(seq(1,ncol(data$df_pgs),1)),
                  df_microbiome = c(seq(1,ncol(data$df_microbiome),20))
)

tune.TCGA = tune.block.splsda(X = data,
                              Y = Y,
                              ncomp = optimal_ncomp, 
                              test.keepX = test.keepX,
                              design = design,
                              validation = 'Mfold',
                              folds = 5,
                              nrepeat = 50,
                              dist = "max.dist",
                              progressBar = TRUE,
                              BPPARAM = BiocParallel::SnowParam(workers = 20))

The optimal number of components varies between 3 and 4 across the designs. However, for every design, the tuning function fails with the following error after running for a while:

Error: BiocParallel errors
  20 remote errors, element index: 1, 2, 3, 4, 5, 6, ...
  30 unevaluated and other errors
  first remote error:

Error in get.keepA(X = X, keepX = keepX, ncomp = ncomp): each component of 'keepX[[4]]'
                must be lower or equal to ncol(X[[4]])=4.

I am clueless about where this error comes from. The dimensions of my datasets are the following:

lapply(data, dim) 

$df_metabolic
[1] 138  49

$df_lipids
[1] 138 165

$df_microbiome
[1] 138 154

$df_pgs
[1] 138   4

It would be really helpful if you could point out some solution to the problem.

Thanks a lot!

Best,
Carolina

Hi @carolinaalvarez,

Looking through the code you have shared, I can’t immediately see what the issue could be. Could you please let me know which version of mixOmics you are running?

Since you get the error for every model, but only at the tuning step, could you please run the following example code to check that tune.block.splsda() runs without problems on your end?

# load mixOmics
library(mixOmics)

# load data
data(breast.TCGA)
X <- list(mirna = breast.TCGA$data.train$mirna, 
          mrna =  breast.TCGA$data.train$mrna, 
          protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype

# basic design
design = matrix(0.1, ncol = length(X), nrow = length(X), 
                dimnames = list(names(X), names(X)))
diag(design) = 0 # set diagonal to 0s

# basic model
diablo.tcga <- block.splsda(X, Y)

# tune number of components
perf.diablo = perf(diablo.tcga, validation = 'Mfold', 
                   folds = 10, nrepeat = 10) 
ncomp = perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"] 
print(ncomp) # 1

# tune number of variables (names and order match names(X))
test.keepX = list(mirna = c(seq(20,30,5)),
                  mrna = c(seq(20,30,5)),
                  protein = c(seq(20,30,5)))
tune.TCGA = tune.block.splsda(X = X, Y = Y, ncomp = ncomp, 
                              test.keepX = test.keepX, design = design,
                              validation = 'Mfold', folds = 10, nrepeat = 1,
                              dist = "centroids.dist")
tune.TCGA$choice.keepX

If the above code runs without errors, then the issue is likely specific to your data and/or DIABLO model. I’m wondering whether the error is related to your test.keepX values. Have you tested whether your code works with a different test.keepX? For example, could you try:

test.keepX = list(df_metabolic = c(1, 2, 3),
                  df_lipids  = c(1, 2, 3),
                  df_pgs  = c(1, 2, 3),
                  df_microbiome = c(1, 2, 3)
)

Or variations of this in which you replace the test.keepX of one data block at a time with c(1, 2, 3)?
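
One further check that may be worth running: the error message compares keepX[[4]] with ncol(X[[4]]), which suggests the blocks are matched by position rather than by name. The sketch below is only an illustration in base R, not part of mixOmics. The helper check_keepX is hypothetical, and the data blocks are random placeholders built to mirror the dimensions and block order you posted from lapply(data, dim). It lists, position by position, which block of data is paired with which entry of test.keepX, and whether the largest keepX value fits within that block.

```r
# Placeholder blocks with the posted dimensions and block order
# (random values, for illustration only)
set.seed(1)
dims <- c(df_metabolic = 49, df_lipids = 165, df_microbiome = 154, df_pgs = 4)
data <- lapply(dims, function(p) matrix(rnorm(138 * p), nrow = 138, ncol = p))

# Same grid, in the same order, as in the original post
test.keepX <- list(
  df_metabolic  = seq(1, ncol(data$df_metabolic), 5),
  df_lipids     = seq(1, ncol(data$df_lipids), 25),
  df_pgs        = seq(1, ncol(data$df_pgs), 1),
  df_microbiome = seq(1, ncol(data$df_microbiome), 20)
)

# Hypothetical helper (not a mixOmics function):
# pair X and keepX by position and flag any mismatch
check_keepX <- function(X, keepX) {
  ncol_X    <- unname(vapply(X, ncol, integer(1)))
  max_keepX <- unname(vapply(keepX, function(k) as.numeric(max(k)), numeric(1)))
  data.frame(
    position    = seq_along(X),
    X_block     = names(X),
    keepX_block = names(keepX),
    ncol_X      = ncol_X,
    max_keepX   = max_keepX,
    same_block  = names(X) == names(keepX),
    ok          = max_keepX <= ncol_X
  )
}

before <- check_keepX(data, test.keepX)
print(before)

# Reorder test.keepX so that it follows names(data), then check again
test.keepX <- test.keepX[names(data)]
after <- check_keepX(data, test.keepX)
print(after)
```

With the dimensions posted above, position 4 pairs the df_microbiome grid (values up to 141) with the 4-column df_pgs block, which would be consistent with the error message. If the same table looks like this on your real data, reordering test.keepX to follow names(data) before calling tune.block.splsda() may be enough to resolve it.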

Let me know how you go with these investigations and hopefully we can identify the issue!

Cheers,
Eva