DIABLO - Questions about TCGA case study

Hi there!

I’m a Bioinformatics MSc student and I’m looking to use DIABLO in my dissertation.

I’m working through the case study on the main website to learn more, but I ran into something at the step where the keepX values are entered manually:

“list.keepX = list(mRNA = c(6,14), miRNA = c(5,18), proteomics = c(6,7)) # from tuning step”

I noticed that I end up with fewer selected variables this way than if I run tune.block.splsda using the script, and the resulting plots are significantly different from those produced with the manually entered parameters. I was wondering why these specific parameters were chosen, given that tune.block.splsda returns more.

Apologies if this is an easy question, I’m still getting to grips with the tools and background knowledge!

Thank you in advance.

Hi @studentScot,

You retain more variables with list.keepX = tune.TCGA$choice.keepX than with the manually entered values (list.keepX = list(mRNA = c(6,14), miRNA = c(5,18), proteomics = c(6,7))), and this is probably a matter of reproducibility. Alternatively, the manual values in the vignette may have been set arbitrarily, just to demonstrate that they can be entered by hand too. The downloadable R script comes with an RData object, and in it the tuning output looks like this:

list.keepX
$mRNA
[1] 30 16
$miRNA
[1]  5 18
$proteomics
[1] 30  5

If your results are close to this, then you are doing it correctly (given that you copy-pasted the entire script).
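
To make the size of the gap concrete, here is a small base-R sketch that totals the variables kept per block under each setting. The two lists are copied from this thread; the helper name count_selected is made up for illustration, and the totals simply add up the per-component counts, so a variable picked on both components would be counted twice.

```r
# keepX values entered manually in the vignette
manual.keepX <- list(mRNA = c(6, 14), miRNA = c(5, 18), proteomics = c(6, 7))

# keepX values stored in the RData object that ships with the R script
rdata.keepX <- list(mRNA = c(30, 16), miRNA = c(5, 18), proteomics = c(30, 5))

# Hypothetical helper: sum the per-component keepX values within each block
count_selected <- function(keepX) sapply(keepX, sum)

# One row per setting, one column per block
comparison <- rbind(manual = count_selected(manual.keepX),
                    rdata  = count_selected(rdata.keepX))
print(comparison)
```

The miRNA block is identical in both settings, so the difference in the plots comes from the mRNA and proteomics blocks.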

  • Christopher

Hi Christopher,

Thanks for the reply! It makes sense to me, though I’ve noticed my results are a little different from yours (mostly just for proteomics data though):

$mRNA
[1] 30  7

$miRNA
[1]  9 16

$proteomics
[1] 7 5

Does this seem acceptable?

Could the issue be with the cpus argument? I noted that it was replaced with BPPARAM at some point, so I removed the cpus argument.

And, apologies, one last question; I noted the values for:

perf.diablo$choice.ncomp$WeightedVote

came out as:

            max.dist centroids.dist mahalanobis.dist
Overall.ER         3              2                3
Overall.BER        3              2                3

Whereas in the example the BER values are quite a bit higher:

##             max.dist centroids.dist mahalanobis.dist
## Overall.ER         2              3                3
## Overall.BER        5              2                4

Does this seem fine? I’m beginning to think there may be an issue with the data file provided, as even my first plot looks different from the example:

Mine

Example

Yet, I’m simply using the given script and data file with no adjustments.

Apologies again for the overload of questions.

Thanks again.

Hi @studentScot,

Yes to both, see below.

The two plots seem to be very similar, and there are no issues with the data or the script. You will always observe these minimal differences because nrepeat, i.e. the number of times the cross-validation is repeated, is low. Reproducibility improves when nrepeat is increased, but so does the computational time (especially for DIABLO). Since the vignette is only for demonstration purposes, the perf and tune steps were only repeated 10 times and 1 time, respectively. To ensure a thorough tuning, I prefer to set nrepeat between 100 and 200 depending on the dataset, and never below 50. However, this might require a great amount of patience and/or computational power.
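
As a rough sketch of what a more thorough tuning call could look like, assuming the mixOmics and BiocParallel packages are installed and that X (a named list of blocks), Y (the class factor), design and test.keepX already exist as in the case study script. The wrapper name run_tuning, the settings list and the worker count are made up for illustration, and the ncomp, folds and dist values are assumptions to adapt to your own data.

```r
# Assumed settings; only nrepeat is taken from the advice above
tune_settings <- list(
  ncomp      = 2,
  validation = "Mfold",
  folds      = 10,
  nrepeat    = 50,               # the vignette uses 1; 50 is the suggested minimum
  dist       = "centroids.dist"
)

# Hypothetical wrapper: nothing is computed until it is called
run_tuning <- function(X, Y, design, test.keepX,
                       settings = tune_settings, workers = 2) {
  mixOmics::tune.block.splsda(
    X = X, Y = Y, design = design, test.keepX = test.keepX,
    ncomp      = settings$ncomp,
    validation = settings$validation,
    folds      = settings$folds,
    nrepeat    = settings$nrepeat,
    dist       = settings$dist,
    # BPPARAM takes the place of the older cpus argument
    BPPARAM    = BiocParallel::SnowParam(workers = workers)
  )
}
```

It would then be called as tune.res <- run_tuning(X, Y, design, test.keepX), with the selected values read from tune.res$choice.keepX. Expect the run time to grow roughly in proportion to nrepeat.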

Hope it helps

  • Christopher

Hi Christopher,

Thanks again for all your help, it’s really helped clear things up for me.

I’ll be attempting my own integration of omics datasets using DIABLO. Do you think following similar steps to the TCGA case study will be fine?

Thanks once again.

Sam

Yes, definitely. Just remember to reconsider the design matrix and the perf/tune steps so that they are appropriate for your research question and dataset.
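
For the design matrix, a minimal base-R sketch, assuming three blocks with the same names as in the case study. The block names and the off-diagonal value of 0.1 are illustrative only, not a recommendation. As far as I understand it, values near 0 favour discrimination between classes while values near 1 favour correlation between blocks, so the choice depends on the research question.

```r
# Hypothetical block names; replace with the names of your own X list
block.names <- c("mRNA", "miRNA", "proteomics")

# Assumed strength of the link between each pair of blocks (between 0 and 1)
link <- 0.1

# Square matrix with one row and one column per block
design <- matrix(link,
                 nrow = length(block.names),
                 ncol = length(block.names),
                 dimnames = list(block.names, block.names))

# A block is not linked to itself
diag(design) <- 0

print(design)
```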
