I’m a Bioinformatics MSc student and I’m looking to use DIABLO in my dissertation.
I’m working through the case study on the main website to try to learn more, and I’ve reached the step where the keepX parameters are entered manually; a sketch of what I mean is below.
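(A minimal sketch of that step, using the keepX values from the case study; data, Y and design are assumed to be set up as in the earlier parts of the vignette.)

```r
library(mixOmics)

# keepX values as given in the case study: number of variables to keep
# per block on component 1 and component 2
list.keepX <- list(mRNA = c(6, 14), miRNA = c(5, 18), proteomics = c(6, 7))

# final DIABLO model with the manually entered keepX
# ('data', 'Y' and 'design' prepared as in the earlier steps of the case study)
final.diablo.model <- block.splsda(X = data, Y = Y, ncomp = 2,
                                   keepX = list.keepX, design = design)
```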
I notice that I end up with fewer selected variables than if I run tune.block.splsda from the script, and the resulting plots are noticeably different from those produced with the manually entered parameters. I was wondering why these specific values were chosen, given that tune.block.splsda returns more.
Apologies if this is an easy question; I’m still getting to grips with the tools and the background knowledge!
The reason you retain more variables when using list.keepX = tune.TCGA$choice.keepX rather than entering them manually (list.keepX = list(mRNA = c(6,14), miRNA = c(5,18), proteomics = c(6,7))) is probably a matter of reproducibility: hard-coding the values means the vignette output does not change every time the tuning is rerun. Or perhaps the manually entered values in the vignette were chosen somewhat arbitrarily, just to demonstrate that keepX can also be set by hand. The downloadable R script comes with an RData object that contains the stored tuning output.
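A minimal sketch of how to load and inspect it (the file name below is a placeholder; use the RData file that comes with the download):

```r
# placeholder file name -- use the RData file shipped with the downloadable script
load("TCGA_diablo_tuning.RData")

# keepX chosen by the grid search: a named list with one vector per block,
# one value per component
tune.TCGA$choice.keepX

# for comparison, the hard-coded values used further down in the vignette
list(mRNA = c(6, 14), miRNA = c(5, 18), proteomics = c(6, 7))
```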
The two plots look very similar, and there is no issue with your data or the script. The reason you will always observe these small differences is the low value of nrepeat, i.e. the number of times the cross-validation is repeated. Reproducibility improves as nrepeat is increased, but so does the computation time (especially for DIABLO). Since the vignette is only for demonstration purposes, the perf and tune steps were run with nrepeat = 10 and nrepeat = 1, respectively. To ensure a thorough tuning I prefer to set nrepeat between 100 and 200 depending on the dataset, and never below 50. However, this can require a great deal of patience and/or computational power.
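As a rough sketch of where nrepeat enters the workflow (object names follow the case study; the folds, test.keepX grid and nrepeat values here are placeholders to adapt to your own data):

```r
# performance of the initial (non-sparse) DIABLO model; nrepeat = 10 in the vignette
perf.diablo <- perf(basic.diablo.model, validation = "Mfold",
                    folds = 10, nrepeat = 10)

# keepX tuning; the vignette uses nrepeat = 1 to keep the run short,
# but for a real analysis a higher value (e.g. 50-200) gives more stable choices
tune.TCGA <- tune.block.splsda(X = data, Y = Y, ncomp = 2,
                               test.keepX = test.keepX, design = design,
                               validation = "Mfold", folds = 10,
                               nrepeat = 100, dist = "centroids.dist")
```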
Yes, definitely. Just remember to reconsider the design matrix and the perf/tune steps so that they are appropriate for your research question and dataset.
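For reference, a minimal sketch of how a design matrix like the one in the case study is built (the block objects are placeholders for your own data, and the 0.1 off-diagonal weight is the compromise used in the vignette, not a universal default):

```r
# your own blocks: matched samples in rows, features in columns
# (X_mrna, X_mirna, X_prot are placeholders)
data <- list(mRNA = X_mrna, miRNA = X_mirna, proteomics = X_prot)

# 0.1 between every pair of blocks (leaning towards discrimination), 0 on the
# diagonal; values closer to 1 put more weight on correlation between blocks
design <- matrix(0.1, nrow = length(data), ncol = length(data),
                 dimnames = list(names(data), names(data)))
diag(design) <- 0
design
```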