Choosing the DIABLO Design Matrix

Hi,
I am integrating three datasets (adducts, metabolites and proteins). I have no prior knowledge of how they are correlated, so I don't know how to set up my design matrix. I know I could compute pairwise correlations, but I don't know how to interpret the results to decide whether X and Y are correlated. Could you share code showing how to explain the variance of X within Y? I would be grateful for any help. Thanks

Hi,
I have a few follow-up questions on the same topic.

  1. What is the design default of block.splsda (I cannot find it)?
  2. What is the effect of changing the design on the output? For example: if I say (arbitrarily) that two datasets correlate at 0.8, but in fact the value is 0.7 or 0.9, what would be the effect (or actually the error) on the output (the number of components, variable inclusion, loadings etc.)?
  3. I think this relates to the question above. The manual states that one could estimate the values in the design matrix via a non-sparse PLS analysis. I have run PLS analyses, but from which function, plot or analysis can you derive the correlation?

Kind regards,
Lonneke

hi @Tmekh

As you intend to apply block.splsda (i.e. a supervised analysis with DIABLO), you could use the function plotDiablo() after running block.splsda() to examine, a posteriori, the correlation between the components of each data block, and see whether that helps you refine your design matrix. This is the trial-and-error approach.
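A minimal sketch of this trial-and-error loop, assuming the breast.TCGA example data that ships with mixOmics stands in for your own blocks. The starting design value of 0.5 and the keepX values are arbitrary placeholders, not recommendations:

```r
library(mixOmics)

# Example data from mixOmics; replace with your own list of blocks and outcome.
data(breast.TCGA)
X <- list(mRNA    = breast.TCGA$data.train$mrna,
          miRNA   = breast.TCGA$data.train$mirna,
          protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype

# Starting design (assumption): 0.5 between every pair of blocks, 0 on the diagonal.
design <- matrix(0.5, nrow = length(X), ncol = length(X),
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

# Number of variables to keep per block and per component (untuned placeholders).
keepX <- list(mRNA = c(10, 10), miRNA = c(10, 10), protein = c(10, 10))

diablo.res <- block.splsda(X, Y, ncomp = 2, keepX = keepX, design = design)

# Scatterplots and correlations between the first components of each block.
plotDiablo(diablo.res, ncomp = 1)
```

If the correlations shown by plotDiablo() are far from the values you put in the design, you can adjust the design and refit.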

For a more ‘informed’ approach, I would run PLS on each pair of datasets, e.g.
library(mixOmics)
data(liver.toxicity)
X = liver.toxicity$gene
Y = liver.toxicity$clinic
pls.res = pls(X, Y, ncomp = 1)
# correlation between the first X-component and the first Y-component
cor(pls.res$variates$X, pls.res$variates$Y)
This correlation tells you how much global correlation can be extracted from the two datasets.

Could you share code on how to explain variance of X within Y.
I am not sure I understood the question. We have code to calculate the explained variance within X based on the X-components, but not within Y. Can you rephrase if I haven't addressed your question above?

Kim-Anh

hi @lonnekenouwen

  1. What is the design default of block.splsda (I cannot find it)?

There is no default here; you have to set it up yourself (see the examples given in the help files or the bookdown).

  2. What is the effect of changing the design on the output? For example: if I would say (arbitrarily) that two datasets correlate 0.8, but in fact it would be 0.7 or 0.9, what would be the effect (or actually the error) on the output (the number of components, variable inclusion, loadings etc.)?

Short answer: the effect can be quite strong! It depends on the cross-correlation structure of the data and on their discriminative power, and all of the outputs you list will vary. The closer the design values are to 1, the more correlated the selected variables will be, but the less discriminative the model will be (see our simulation study and the explanations in the DIABLO paper).
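One way to see this on your own data is to fit the model under two different designs and compare the outputs. The sketch below is only an illustration: it assumes the breast.TCGA example data from mixOmics, untuned keepX values, and two hypothetical helper functions (fit_with_design and selected) that are not part of the package.

```r
library(mixOmics)

data(breast.TCGA)
X <- list(mRNA    = breast.TCGA$data.train$mrna,
          miRNA   = breast.TCGA$data.train$mirna,
          protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype
keepX <- list(mRNA = c(10, 10), miRNA = c(10, 10), protein = c(10, 10))

# Hypothetical helper: fit DIABLO with the same value w on every off-diagonal entry.
fit_with_design <- function(w) {
  design <- matrix(w, nrow = length(X), ncol = length(X),
                   dimnames = list(names(X), names(X)))
  diag(design) <- 0
  block.splsda(X, Y, ncomp = 2, keepX = keepX, design = design)
}

res.low  <- fit_with_design(0.1)  # favours discrimination
res.high <- fit_with_design(0.9)  # favours correlation between blocks

# Hypothetical helper: variables with a non-zero loading on component 1 of a block.
selected <- function(res, block) {
  l <- res$loadings[[block]][, 1]
  names(l)[l != 0]
}

# How many selected mRNA variables are shared between the two designs?
length(intersect(selected(res.low, "mRNA"), selected(res.high, "mRNA")))

# Correlation between the first mRNA and protein components under each design.
cor(res.low$variates$mRNA[, 1],  res.low$variates$protein[, 1])
cor(res.high$variates$mRNA[, 1], res.high$variates$protein[, 1])
```

Comparing classification performance under each design (for example with perf()) would show the discrimination side of the trade-off.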

  3. I think this relates to the above question. In the manual it is stated that one could estimate the values in the design matrix via the non sparse pls analysis. I have done pls analyses, but from what function, plot, analysis can you delineate the correlation?

See my answer to @Tmekh above (pairwise PLS, then the correlation between the components). Hope it helps!

Kim-Anh

Hi Kim,
Thanks so much, that was very helpful and I finally designed my matrix properly and got the desired features.
Tarana

Hi @Tmekh @lonnekenouwen @kimanh.lecao

Thanks for the great discussion.

I have a related question: Would you then use the PLS results for your design matrix (for DIABLO)?

Say you have three datasets and ran the following:

dataset A vs dataset B

pls.res = pls(A, B, ncomp = 3)
# pls() names its components X and Y, whatever the input objects are called
diag(cor(pls.res$variates$X, pls.res$variates$Y))

    comp1     comp2     comp3 
0.8762957 0.7924082 0.8575647

dataset A vs dataset C

pls.res = pls(A, C, ncomp = 3)
diag(cor(pls.res$variates$X, pls.res$variates$Y))

    comp1     comp2     comp3     
0.8440602 0.8388065 0.9273404

dataset C vs dataset B

pls.res = pls(C, B, ncomp = 3)
diag(cor(pls.res$variates$X, pls.res$variates$Y))

    comp1     comp2     comp3     
0.7094788 0.7333678 0.6867342

Would you then use the following design matrix:

design = matrix(c(0,0.85,0.85,0.85,0,0.7,0.85,0.7,0), ncol = length(X), nrow = length(X), 
                 dimnames = list(names(X), names(X)))

design

        A       B        C
A      0.00    0.85     0.85
B      0.85    0.00     0.70
C      0.85    0.70     0.00

Or would you just set the off-diagonal values to 1 since the values from the PLS correlation are relatively high?

Thanks,

Ramiro

@rramiro,
Your correlation values are pretty high, so you could stick to what you propose. PLS is unsupervised, so it gives you an idea of the amount of agreement / correlation you can extract pairwise.
Have another read of the DIABLO paper regarding the compromise between correlation and discrimination though (especially the supplemental results on the simulated data).

Kim-Anh
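For anyone who wants to automate the pairwise step discussed in this thread, here is a rough sketch. It assumes the breast.TCGA example data from mixOmics in place of your own blocks, and pairwise_pls_design is a hypothetical helper written for this illustration, not a mixOmics function. It uses only the first PLS component of each pair and takes the absolute value of the correlation so that the entries stay between 0 and 1.

```r
library(mixOmics)

# Hypothetical helper: fill the design matrix with the correlation between the
# first PLS components of every pair of blocks.
pairwise_pls_design <- function(blocks) {
  k <- length(blocks)
  design <- matrix(0, nrow = k, ncol = k,
                   dimnames = list(names(blocks), names(blocks)))
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      res <- pls(blocks[[i]], blocks[[j]], ncomp = 1)
      # pls() always stores its components under $variates$X and $variates$Y
      r <- abs(cor(res$variates$X[, 1], res$variates$Y[, 1]))
      design[i, j] <- r
      design[j, i] <- r
    }
  }
  design  # diagonal stays at 0
}

data(breast.TCGA)
X <- list(mRNA    = breast.TCGA$data.train$mrna,
          miRNA   = breast.TCGA$data.train$mirna,
          protein = breast.TCGA$data.train$protein)

design <- pairwise_pls_design(X)
round(design, 2)
```

The resulting matrix can be passed as the design argument of block.splsda(). Given the correlation versus discrimination compromise mentioned above, you may still want to try lower values than the ones PLS suggests.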