Number of samples, folds, nrepeats, runtime

Hello, thank you very much for all your efforts and for making this open source for the community. I am using sPLS-DA DIABLO for multi-omics integration of multiple cancer data sets that vary in the number of samples, the number of omics data sets, and the number of variables per omic. I divide each data set into two groups of samples, "Good" and "Bad", depending on the clinical information.

1. Since the number of samples varies a lot between data sets, let's say I have data sets of 30 samples where the smallest biological group has 3-5 samples. Can DIABLO produce a robust analysis with this sample size? Is there a minimum sample size and group size for running DIABLO?

2. I have run DIABLO a couple of times with a data set of ~900 samples, and in the perf and tune functions I am setting folds=5 and nrepeat=50. How can I determine the values of these arguments, and what considerations should I take into account when setting them?

3. For keepX in tune, I want to keep 10% of the variables of each omic. If the omic has a small number of variables, such as mutations, I try to keep as many as I can. Do you have any advice to improve this and to maintain the robustness of the analysis?

test.keepX.10.MAD = list("mut" = seq(10, 50, 20), "scna" = seq(40, 140, 50),
"expr" = seq(200, 800, 200), "mirna" = seq(60, 220, 80), "lnrna" = seq(80, 260, 90))

4. Finally, I would like to know how to optimize the tune function in order to decrease its runtime. I am using the BPPARAM = bpparam() argument for this. I ran the tune function with 948 samples, 8000 mRNA variables, 2500 lncRNA, 650 cnv and 70 mutation variables, and it took 104 hrs. Here is the function call.

tune.omics.10.MAD = tune.block.splsda(X = data, Y = Y_train, ncomp = 4, test.keepX = test.keepX.10.MAD, design = design,
progressBar = FALSE, measure = "overall", validation = "Mfold", folds = 5, nrepeat = 50, BPPARAM = bpparam(),
near.zero.var = FALSE, dist = "mahalanobis.dist")

Hi @Dominique_Cortes,

1. Since the number of samples varies a lot between data sets, let's say I have data sets of 30 samples where the smallest biological group has 3-5 samples. Can DIABLO produce a robust analysis with this sample size? Is there a minimum sample size and group size for running DIABLO?

Our perf() function calculates the Balanced Error Rate (BER) to take into account groups with a small number of samples. It also depends on how many groups you have. It seems that you would have something like 5 samples in one group and 25 in the other? In that case, even when using the BER for your assessment, you might end up with an analysis that is biased towards the majority group. You should be able to assess this with plots such as plotLoadings().
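To make the point about the BER concrete, here is a minimal base R sketch. It uses the standard definition of the BER as the mean of the per-class error rates; the helper `balanced_error_rate()` and the toy labels are hypothetical and are not part of mixOmics.

```r
# Balanced error rate: the mean of the per-class error rates.
# Hypothetical helper for illustration, not a mixOmics function.
balanced_error_rate <- function(truth, predicted) {
  truth <- factor(truth)
  predicted <- factor(predicted, levels = levels(truth))
  per_class_error <- sapply(levels(truth), function(cls) {
    in_class <- truth == cls
    mean(predicted[in_class] != cls)
  })
  mean(per_class_error)
}

# Hypothetical 30-sample data set: 25 "Good" and 5 "Bad" samples.
truth <- c(rep("Good", 25), rep("Bad", 5))
# A degenerate classifier that always predicts the majority group.
predicted <- rep("Good", 30)

overall_error <- mean(predicted != truth)      # 5 / 30, looks acceptable
ber <- balanced_error_rate(truth, predicted)   # (0 + 1) / 2, clearly poor
```

In this toy case the overall error rate hides the fact that every "Bad" sample is misclassified, while the BER exposes it.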

2. I have run DIABLO a couple of times with a data set of ~900 samples, and in the perf and tune functions I am setting folds=5 and nrepeat=50. How can I determine the values of these arguments, and what considerations should I take into account when setting them?

Usually, you want about 5-6 samples in the test set. In your case, n / folds = 30 / 5 = 6, so this is appropriate. For the repeats, it depends on your computational power :slight_smile: 10-50 is appropriate.
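The arithmetic behind this rule of thumb can be sketched in base R. The helper `test_fold_size()` is hypothetical, and the per-group figure assumes that samples are spread evenly across folds, which is an assumption made here for illustration only.

```r
# Approximate number of samples left out in each test fold.
# Hypothetical helper for illustration.
test_fold_size <- function(n_samples, folds) n_samples / folds

small_set <- test_fold_size(30, 5)    # 30 samples, 5 folds
large_set <- test_fold_size(948, 5)   # 948 samples, 5 folds

# Assuming an even spread across folds, a group of 5 samples
# contributes about one sample to each of the 5 test folds.
minority_per_fold <- test_fold_size(5, 5)
```

With 30 samples and 5 folds each test set holds 6 samples, which matches the 5-6 guideline. The last line shows why a very small group is still the limiting factor.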

3. For keepX in tune, I want to keep 10% of the variables of each omic. If the omic has a small number of variables, such as mutations, I try to keep as many as I can. Do you have any advice to improve this and to maintain the robustness of the analysis?

I think it is good that you are matching the tuning parameters to your expectations. To be sure, you could also try smaller values, e.g. keepX = 5, to see whether these highlight how much noise there is in your data.
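As a rough sketch of what adding a smaller value could look like, the snippet below prepends 5 to each grid from the question. The name `test.keepX.small` is hypothetical. The combination count assumes the tuner evaluates every combination of values across blocks, which is worth checking against the `tune.block.splsda()` documentation before relying on it.

```r
# Grids from the question.
test.keepX.10.MAD <- list(
  mut   = seq(10, 50, 20),
  scna  = seq(40, 140, 50),
  expr  = seq(200, 800, 200),
  mirna = seq(60, 220, 80),
  lnrna = seq(80, 260, 90)
)

# Add a small value (5) to every block's grid; the new name is hypothetical.
test.keepX.small <- lapply(test.keepX.10.MAD, function(grid) sort(unique(c(5, grid))))

# Number of keepX combinations, assuming all combinations across blocks are tested.
n_combos_original <- prod(lengths(test.keepX.10.MAD))
n_combos_small    <- prod(lengths(test.keepX.small))
```

Under that assumption, each extra value multiplies the size of the grid, so a small exploratory run with fewer repeats may be preferable to extending the full grid.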

4. Finally, I would like to know how to optimize the tune function in order to decrease its runtime. I am using the BPPARAM = bpparam() argument for this. I ran the tune function with 948 samples, 8000 mRNA variables, 2500 lncRNA, 650 cnv and 70 mutation variables, and it took 104 hrs.

You could halve the number of repeats, and also first filter the mRNA down to the 5,000 most variable ones (noting that in the end, you expect to select max 220 of those mRNA per component!). You also have only 2 groups of samples, so ncomp = 3 might be enough (you may have seen in our tutorials that we first run a non-sparse model to choose the number of components).
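The pre-filtering step could look something like the base R sketch below. It assumes samples are in rows and features in columns, and it ranks features by variance; the helper `top_variable_features()` and the simulated matrix are hypothetical, and another measure such as the MAD could be swapped in.

```r
# Keep the n_keep most variable columns (features) of a samples-by-features matrix.
# Hypothetical helper for illustration, not a mixOmics function.
top_variable_features <- function(x, n_keep = 5000) {
  feature_var <- apply(x, 2, var)
  n_keep <- min(n_keep, ncol(x))
  keep <- order(feature_var, decreasing = TRUE)[seq_len(n_keep)]
  # sort() preserves the original column order among the kept features
  x[, sort(keep), drop = FALSE]
}

# Simulated stand-in for an expression block: 20 samples, 50 features.
set.seed(1)
expr <- matrix(rnorm(20 * 50), nrow = 20,
               dimnames = list(NULL, paste0("gene", 1:50)))

expr_filtered <- top_variable_features(expr, n_keep = 10)
```

On the real data the same call with `n_keep = 5000` would reduce the 8000 mRNA variables before tuning, which shrinks the block passed to the tune function.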

Good luck!

Kim-Anh