Hello, and thank you very much for all your efforts and for making this open source for the community. I am using sPLS-DA / DIABLO for multi-omics integration analysis on multiple cancer data sets that vary in the number of samples, the number of omics blocks, and the number of variables per omic. I divide each data set into two groups of samples, "Good" and "Bad", based on its clinical information.
1. The number of samples varies greatly between data sets; for example, some data sets have 30 samples and the smallest biological group has only 3-5 samples. Can DIABLO produce a robust analysis with this sample size? Is there a minimum sample size or group size for running DIABLO?
2. I have run DIABLO a couple of times on a data set of ~900 samples, setting folds = 5 and nrepeat = 50 in the perf and tune functions. How should I determine the values of these arguments, and what considerations should I take into account when setting them?
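For context, here is the quick base R sanity check I have been using (with hypothetical class counts, not my real data): since in M-fold cross-validation each fold should ideally contain at least one sample of each class, I cap folds at the size of the smallest class.

```r
# Hypothetical class counts; with real data I would use table(Y_train).
class_counts <- c(Good = 25, Bad = 5)

# Cap the number of folds at the smallest class size so that every
# fold can contain at least one sample from each class.
max_folds <- min(class_counts)
folds <- min(5, max_folds)
folds  # 5 here; with a 3-sample class this would drop to 3
```

I am not sure whether this is the recommended heuristic for DIABLO, so I would appreciate confirmation.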
3. For test.keepX in the tune function I want to keep roughly 10% of the variables of each omic; if an omic has few variables, such as mutations, I try to keep as many as I can. Do you have any advice to improve this approach and maintain the robustness of the analysis?
test.keepX.10.MAD = list("mut" = seq(10, 50, 20), "scna" = seq(40, 140, 50),
                         "expr" = seq(200, 800, 200), "mirna" = seq(60, 220, 80),
                         "lnrna" = seq(80, 260, 90))
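To avoid hand-tuning these grids per data set, I have sketched a helper that builds them programmatically: each grid tops out near 10% of a block's variables, and very small blocks fall back to testing every possible keepX value. The block sizes below are illustrative, not my real dimensions.

```r
# Hypothetical numbers of variables per omic; with real data I would
# use sapply(data, ncol).
p <- c(mut = 70, scna = 650, expr = 8000, mirna = 2200, lnrna = 2500)

make_grid <- function(n_vars, n_points = 5, frac = 0.10) {
  upper <- ceiling(frac * n_vars)   # grid ceiling: ~10% of the block
  if (upper < n_points) {
    return(seq_len(n_vars))         # tiny blocks: test up to all variables
  }
  unique(round(seq(5, upper, length.out = n_points)))
}

test.keepX.auto <- lapply(p, make_grid)
```

Is this a reasonable way to keep the grids comparable across data sets of very different dimensions?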
4. Finally, I would like to know how to optimize the tune function in order to decrease its runtime. I am using the BPPARAM = bpparam() argument for this. I ran the tune function with 948 samples, 8000 mRNA variables, 2500 lncRNA, 650 CNV, and 70 mutation variables, and it took 104 hours. Here is the function call:
tune.omics.10.MAD = tune.block.splsda(X = data, Y = Y_train, ncomp = 4,
                                      test.keepX = test.keepX.10.MAD, design = design,
                                      progressBar = FALSE, measure = "overall",
                                      validation = "Mfold", folds = 5, nrepeat = 50,
                                      BPPARAM = bpparam(), near.zero.var = FALSE,
                                      dist = "mahalanobis.dist")
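My own rough cost estimate, in case it helps frame the question: the tuning evaluates every combination of the per-block keepX grids, for each fold, repeat, and component, so the counts multiply quickly. My current plan (which I would like you to sanity-check) is a coarse first pass with fewer repeats and sparser grids, then a refined pass with nrepeat = 50 only around the coarse optimum.

```r
# Grid sizes from my test.keepX.10.MAD above: 3, 3, 4, 3 and 3 points per block.
combos <- 3 * 3 * 4 * 3 * 3     # 324 keepX combinations per component
fits   <- combos * 5 * 50 * 4   # x folds x nrepeat x ncomp = 324000 model fits
fits

# Halving the grid points or lowering nrepeat for a coarse pass therefore
# shrinks the work multiplicatively. For the parallel side, would passing an
# explicit backend such as BiocParallel::MulticoreParam(workers = 8)
# (SnowParam on Windows) be preferable to relying on whatever bpparam()
# has registered by default? The worker count of 8 is just illustrative.
```

Does this cost model match how tune.block.splsda actually iterates over the grids?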