DIABLO of selected variables from tuned sPLS-DA

Hi,
Thank you for a great tool! I think I got a pretty good understanding of the (s)PLS-DA rather quickly :slight_smile: DIABLO is a bit complicated though.
I have identified variables from 3 datasets that distinguish healthy from disease within each dataset: 9 variables on comp 1 in dataset1, 200 variables on comp 1 in dataset2 and 8 variables on comp 1 in dataset3.
I have extracted these variables and their values to create a list of objects for DIABLO containing only these 217 variables. Does this make sense?
Next, I want to tune the number of variables to keep from each dataset:
test.keepX = list(colon = c(3:9), plasma = c(5:10, seq(20, 100, 20)), olink = c(3:8))
design = matrix(0, ncol = length(X), nrow = length(X), dimnames = list(names(X), names(X)))
design
tune.BBM = tune.block.splsda(X = X, Y = Y, ncomp = 5, test.keepX = test.keepX, design = design, validation = "Mfold", folds = 10, nrepeat = 5, dist = "centroids.dist", cpus = 7)
But I do not understand the output, e.g. I get an error rate of 0 for all comp and
tune.BBM$choice.keepX
$colon
[1] 3 3 3 3 3
$plasma
[1] 5 10 5 5 5
$olink
[1] 3 3 3 3 3
Please, see the attached screenshots.
How do I interpret the results? Error = 0 can’t be right. What went wrong?

In addition, I ran circosPlot including all variables. I wanted to get the corMat as well, so I ran:
corMat <- circosPlot(MyResult.diablo1,cutoff=.7, ncol.legend = 2, size.legend = 1.1)
as suggested in another post here in the forum. However, I am puzzled by the output.
How does this code calculate the correlation?
Looking at the matrix, I see one metabolite from dataset1 with a value of only 0.65 against itself, while its value against a protein from a different dataset (and biological compartment) is 0.72. How can this be?

/Stef

Hi again,
I would very much appreciate your help as I feel stuck at this point :frowning_face:
/Stef

Hi @stepra

Thank you for using mixOmics.

For DIABLO, you don’t have to manually filter variables based on sPLS-DA analyses. You can keep all variables and the model will perform the variable selection. As for the BER of 0, could it be that your outcome variable Y is included in one of the datasets? Let me know if that is not the case. Also, the similarity measure is computed in the reduced space spanned by the components, so it differs from a correlation computed on the original variables.
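
To rule out that kind of leakage, a quick check along the following lines may help. It is only a sketch in base R: `find_outcome_leaks` is a hypothetical helper (not part of mixOmics), and the toy `X` and `Y` are made up for illustration. It flags columns whose values map one-to-one onto the classes of Y, so it will not catch a continuous variable that merely happens to separate the classes very well.

```r
# Hypothetical helper (not a mixOmics function): for each block, return the
# names of columns whose distinct values correspond one-to-one to the classes.
find_outcome_leaks <- function(X, Y) {
  n_classes <- length(unique(Y))
  lapply(X, function(block) {
    block <- as.matrix(block)
    leaks <- vapply(seq_len(ncol(block)), function(j) {
      values <- block[, j]
      # as many distinct values as classes, and each value seen in one class only
      length(unique(values)) == n_classes &&
        all(rowSums(table(values, Y) > 0) == 1)
    }, logical(1))
    colnames(block)[leaks]
  })
}

# Toy data (made up): block "a" contains a numeric copy of the outcome
set.seed(1)
Y <- factor(rep(c("healthy", "disease"), each = 5))
X <- list(
  a = cbind(v1 = rnorm(10), status = as.numeric(Y)),
  b = cbind(w1 = rnorm(10))
)

leaks <- find_outcome_leaks(X, Y)
print(leaks)
```

Any column name reported here would be worth removing from the block before tuning again.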

Hope it helps,

Al

Dear Al,
Thank you so much for getting back to me! I am now running DIABLO with the full three datasets. I just want to be sure I am doing this correctly. I first run
MyResult.diablo1 <- block.plsda(X, Y, ncomp = 5)
perf.diablo = perf(MyResult.diablo1, validation = "loo", cpus = 7)


Which matrix do I use to choose ncomp? Is it the overall BER, i.e. ncomp = 2 in this case? Or should I select based on max.dist or centroids.dist?
So, once I have chosen ncomp here, I proceed to
tune.BBMncomp2 = tune.block.splsda(X = X, Y = Y, ncomp = 2,test.keepX = test.keepX, design = design, validation = 'loo',dist = "centroids.dist", cpus = 35) using
design = matrix(0, ncol = length(X), nrow = length(X),dimnames = list(names(X), names(X)))
I then get this:

list.keepX = tune.BBM3$choice.keepX
list.keepX
$colon
[1] 4 1
$plasma
[1] 3 3
$olink
[1] 1 1
and thus, run MyResult.diablo <- block.splsda(X, Y, keepX=list.keepX, ncomp=2)
Is that correct? How can I extract the error rate of this model?
And could you please explain how corMat is calculated? I understand it is not the same as e.g. Spearman, but how should I interpret the values if a variable has a higher value with another variable than with itself? And what does it mean if some variables have a value of 1.0 with themselves while others reach only 0.8, for example? Is there any biological meaning in this in relation to Y, i.e. healthy vs disease?
I think I am getting out some extremely interesting relationships using DIABLO, I just want to be sure I do everything correctly before I start drafting the manuscript. So thank you very much for your support! :slight_smile:
Cheers, Stef

hi @stepra,

Choose the prediction distance that results in the lowest error rate. I do not know whether the number of samples per class is unbalanced here (it does not look like it, as BER ~ ER), so that would be 2 components with the Mahalanobis distance.
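
For reference, the two error rates can be computed by hand as sketched here in base R. The labels are made up, and `error_rates` is a hypothetical helper, not a mixOmics function. The overall error rate (ER) is the fraction of misclassified samples, while the balanced error rate (BER) averages the per-class error rates, so the two agree when classes are balanced and can diverge when they are not.

```r
# Hypothetical helper: overall (ER) and balanced (BER) error rates
error_rates <- function(truth, pred) {
  truth <- as.character(truth)
  pred  <- as.character(pred)
  wrong <- pred != truth
  per_class <- tapply(wrong, truth, mean)  # error rate within each true class
  c(ER = mean(wrong), BER = mean(per_class))
}

# Balanced toy example: 10 samples per class, 2 errors in each class
truth_bal <- rep(c("healthy", "disease"), each = 10)
pred_bal <- truth_bal
pred_bal[c(1, 2)] <- "disease"
pred_bal[c(11, 12)] <- "healthy"
balanced <- error_rates(truth_bal, pred_bal)      # ER = 0.2, BER = 0.2

# Unbalanced toy example: 18 healthy, 2 disease, everything predicted healthy
truth_unb <- rep(c("healthy", "disease"), times = c(18, 2))
pred_unb <- rep("healthy", 20)
unbalanced <- error_rates(truth_unb, pred_unb)    # ER = 0.1, BER = 0.5

print(rbind(balanced, unbalanced))
```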

Rerun tune.BBMncomp2 with those parameters; the rest is correct (as shown in our tutorials on http://mixomics.org/mixdiablo/case-study-tcga/). To extract the final error rate, rerun perf() on the final model (see tutorial).
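
A sketch of that last step, loosely following the structure of the tutorial linked above. The data from this thread are not available, so the `breast.TCGA` example data shipped with mixOmics stand in for `X` and `Y`, and the `keepX` values are arbitrary placeholders rather than tuned values. The fold and repeat settings are also assumptions, and argument names may differ slightly between mixOmics versions.

```r
library(mixOmics)

# Stand-in data (assumption): the breast.TCGA training set from mixOmics
data(breast.TCGA)
X <- list(mrna    = breast.TCGA$data.train$mrna,
          mirna   = breast.TCGA$data.train$mirna,
          protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype

# Null design as used in this thread: no correlation between blocks is
# requested. Off-diagonal values closer to 1 would ask for more correlation.
design <- matrix(0, ncol = length(X), nrow = length(X),
                 dimnames = list(names(X), names(X)))

# Placeholder keepX (assumption): in practice use the tuned choice.keepX
list.keepX <- list(mrna = c(10, 10), mirna = c(10, 10), protein = c(5, 5))

# Final sparse model with the chosen number of components
final.diablo <- block.splsda(X, Y, ncomp = 2, keepX = list.keepX,
                             design = design)

# Cross-validated performance of the final model
set.seed(123)
perf.final <- perf(final.diablo, validation = "Mfold", folds = 10,
                   nrepeat = 5, dist = "mahalanobis.dist")

# Error rates (overall and BER) per component, combining the blocks
perf.final$MajorityVote.error.rate
perf.final$WeightedVote.error.rate
```

With a small number of samples, `validation = "loo"` as used earlier in the thread is an alternative to M-fold cross-validation.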

corMat is calculated as an extension of https://biodatamining.biomedcentral.com/articles/10.1186/1756-0381-5-19 but for more than 2 data sets (mathematical results at the end of the article). It is more of an ‘association’ value between pairs of variables, with respect to each DIABLO component. You can interpret it similarly to a correlation coefficient (with some caution of course). There is no direct relation with the outcome, however, except that by applying DIABLO you explicitly ask the approach to select discriminative variables. By the way, I hope you are aware that your design does not explicitly ask for maximising the correlation between data sets, but mostly for full discrimination of the outcome. See the supplemental material of DIABLO, with the simulation study, for what this means: https://academic.oup.com/bioinformatics/article-abstract/35/17/3055/5292387
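
To make the idea of an ‘association’ value more concrete, here is a simplified base R sketch. It is an illustration under assumptions, not the mixOmics source code: the component scores and the two variables are simulated, a single shared set of components is used for both variables, and the association between two variables is taken as the inner product of their correlations with the components. Under that reading, the value of a variable with itself is the sum of its squared correlations with the components, which stays below 1 whenever the components do not fully explain the variable.

```r
set.seed(42)
n <- 50

# Simulated component scores (stand-ins for DIABLO variates, an assumption)
comps <- cbind(comp1 = rnorm(n), comp2 = rnorm(n))

# Simulated variables: the metabolite is only partly explained by comp1,
# the protein follows comp1 closely
metabolite <- 0.8 * comps[, "comp1"] + rnorm(n, sd = 0.6)
protein    <- comps[, "comp1"] + rnorm(n, sd = 0.2)
vars <- cbind(metabolite, protein)

# Correlation of each variable with each component
cord <- cor(vars, comps)

# Association matrix: inner products of those correlations
assoc <- cord %*% t(cord)
print(round(assoc, 2))

# The diagonal equals the sum of squared correlations with the components
print(round(rowSums(cord^2), 2))
```

In a setting like this, a variable that is noisy relative to the components can end up with a lower value against itself than against a second variable that tracks the same component closely.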

Note that we are a bit busy these days, alternating between lockdowns and leave, so our answers are a bit slow.

Kim-Anh