choice.keepX changes each run

Hi mixOmics team,

I went through your case study of sPLS with the Liver Toxicity dataset (sPLS Liver Toxicity Case Study | mixOmics) and applied the methods shown there to my own data. I then discovered that the tuning results, in particular the choice.keepX and choice.keepY outputs of the tune.spls result, change when I swap the order of my datasets. I reproduced this with the code from the case study and would like to demonstrate what I mean:

Original code:

data(liver.toxicity) 
X <- liver.toxicity$gene 
Y <- liver.toxicity$clinic 

spls.liver <- spls(X = X, Y = Y, ncomp = 5, mode = 'regression')
perf.spls.liver <- perf(spls.liver, validation = 'Mfold',
                        folds = 10, nrepeat = 5) 

list.keepX <- c(seq(20, 50, 5))
list.keepY <- c(3:10) 

tune.spls.liver <- tune.spls(X, Y, ncomp = 2,
                             test.keepX = list.keepX,
                             test.keepY = list.keepY,
                             nrepeat = 1, folds = 10, 
                             mode = 'regression', measure = 'cor') 

tune.spls.liver$choice.keepX 
#Output:  
# comp1 comp2 
#   20    40 

tune.spls.liver$choice.keepY
#Output:  
# comp1 comp2 
#   3     3 

Swapped X and Y dataset:

#datasets swapped
X2 <- liver.toxicity$clinic 
Y2 <- liver.toxicity$gene 

spls.liver2 <- spls(X = X2, Y = Y2, ncomp = 5, mode = 'regression')
perf.spls.liver2 <- perf(spls.liver2, validation = 'Mfold',
                         folds = 10, nrepeat = 5) 

# also swap lists as datasets are swapped
list.keepX2 <-  c(3:10) 
list.keepY2 <- c(seq(20, 50, 5))

tune.spls.liver2 <- tune.spls(X2, Y2, ncomp = 2,
                              test.keepX = list.keepX2,
                              test.keepY = list.keepY2,
                              nrepeat = 1, folds = 10, # use 10 folds
                              mode = 'regression', measure = 'cor') 

tune.spls.liver2$choice.keepX
#Output:  
# comp1 comp2 
#   10     9 

tune.spls.liver2$choice.keepY
#Output:  
# comp1 comp2 
#   30     50 

As you can see, the resulting values of keepX and keepY differ completely from the ones before, which makes me wonder why, as I would have expected the values simply to be swapped (so that keepY2 now holds the values keepX had before, and so on).

Maybe someone could explain to me why these results differ, and how I can tell which dataset I should choose as X and which as Y, since this obviously leads to different results in the plots (CIM, …) that follow the tuning.

Best regards,
Katharina


If you look at that case study, in the first code block, you can see the following line:

set.seed(5249) # for reproducibility, remove for normal use

If you’re familiar with RNG-based functions and seeds, then this should answer your question. If you aren’t, continue reading.

All of the tune functions use repeated cross-validation to determine the optimal number of features. This involves partitioning the data into as many groups as the folds parameter specifies (in your case, 10) on every repeat, and that partitioning is done randomly each time. The resulting training and testing sets are used to calculate the error rate (and hence the optimal number of features), as sketched below.
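
To give a feel for what that partitioning looks like, here is a purely illustrative sketch in base R (not the actual mixOmics internals) of how samples could be randomly assigned to 10 folds on each repeat:

# illustrative sketch only - not the mixOmics implementation
# assumes data(liver.toxicity) has been loaded as in your code above
n <- nrow(liver.toxicity$gene)  # 64 rats in this dataset
folds <- 10

# each repeat, the samples are shuffled and split into 'folds' groups
fold.id <- sample(rep(1:folds, length.out = n))
table(fold.id)  # roughly equal fold sizes

# running the same line again gives a different partition
fold.id2 <- sample(rep(1:folds, length.out = n))
identical(fold.id, fold.id2)  # almost certainly FALSE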

By using the set.seed() function, we can control this partitioning. I won't go into detail here about how this works, but read this article if you'd like more information. I'd highly recommend wrapping your head around it - it's a fundamental concept in programming.

If you don't use set.seed(), this training/testing partitioning will be different every time you run the function. Depending on how these sets are generated, the resulting choice.keepX and choice.keepY can be vastly different.
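
As a quick illustration of the seeding behaviour itself (using base R's sample(), nothing mixOmics-specific):

# without a seed, two calls give two different shuffles
sample(5)
sample(5)

# with the same seed set beforehand, the shuffles (and hence the fold
# partitions) are identical
set.seed(5249)
sample(5)
set.seed(5249)
sample(5)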

However, having said all that, in real scenarios you don't want to apply set.seed(), as it introduces bias into your results. What you want instead is a large number of repeats (via the nrepeat parameter), so that no matter the seed, the function will likely converge on the same optimal number of features. You have nrepeat = 1, which is very dangerous and will almost certainly give unreliable results.

If you're just playing around with the function, an nrepeat between 5 and 10 is appropriate. If you're running these for real analyses, then an absolute minimum of nrepeat = 50 is required, with 100 being more appropriate.
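
For your example, that would look something like the code below (same tuning grid as in your post, just with more repeats; note that runtime grows roughly in proportion to nrepeat):

# same call as before, but with enough repeats for choice.keepX/choice.keepY
# to stabilise across runs
tune.spls.liver <- tune.spls(X, Y, ncomp = 2,
                             test.keepX = list.keepX,
                             test.keepY = list.keepY,
                             nrepeat = 50, folds = 10,
                             mode = 'regression', measure = 'cor')

tune.spls.liver$choice.keepX
tune.spls.liver$choice.keepY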

Hence, the order of your blocks is not related to your issue at all. It's your low number of repeats and/or the lack of set.seed(). Hope this helps!