Choosing Diablo Design Matrix

Hi,
I am integrating three datasets (adducts, metabolites and proteins). I have no prior knowledge of how they are correlated, so I don't know how to design my matrix. I know I could do pairwise correlation, but I don’t know how to interpret the results in order to understand whether X and Y are correlated. Could you share code on how to explain variance of X within Y? I would be grateful for any help. Thanks

Hi,
I would like to ask questions in addition to this question.

  1. What is the design default of block.splsda (I cannot find it)?
  2. What is the effect of changing the design on the output? For example: if I say (arbitrarily) that two datasets correlate at 0.8, but in fact it is 0.7 or 0.9, what would be the effect (or actually the error) on the output (the number of components, variable inclusion, loadings etc.)?
  3. I think this relates to the above question. In the manual it is stated that one could estimate the values in the design matrix via the non sparse pls analysis. I have done pls analyses, but from what function, plot, analysis can you delineate the correlation?

Kind regards,
Lonneke

hi @Tmekh

As you intend to apply block.splsda (i.e. a supervised analysis with DIABLO), you could run plotDiablo after block.splsda() to examine, a posteriori, the correlation between the components of each data block, and see whether that guides you in refining your design matrix. This is the trial-and-error approach.

For a more ‘informed’ approach, I would run PLS on the datasets two at a time, e.g.:
library(mixOmics)
data("liver.toxicity")
X = liver.toxicity$gene
Y = liver.toxicity$clinic
pls.res = pls(X, Y, ncomp = 1)
cor(pls.res$variates$X, pls.res$variates$Y)
This correlation should give you an idea of the global correlation that can be extracted from both data sets.

Could you share code on how to explain variance of X within Y.
I am not sure I understood the question. We have code to calculate the explained variance within X based on the X-components, but not within Y. Can you rephrase if I haven't addressed your question above?

Kim-Anh
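
The pairwise PLS check suggested above can be repeated for every pair of blocks to get candidate off-diagonal values for a design matrix. The following is a minimal sketch, not an official mixOmics recipe: it assumes the `breast.TCGA` example data shipped with mixOmics stands in for your own blocks, and the names `blocks`, `block_pairs` and `design` are made up for illustration.

```r
library(mixOmics)

# Example data shipped with mixOmics (assumption: a stand-in for your own blocks).
data(breast.TCGA)
blocks <- list(
  mirna   = breast.TCGA$data.train$mirna,
  mrna    = breast.TCGA$data.train$mrna,
  protein = breast.TCGA$data.train$protein
)

# Start from zeros everywhere, including the diagonal.
design <- matrix(0, nrow = length(blocks), ncol = length(blocks),
                 dimnames = list(names(blocks), names(blocks)))

# Fit a one-component PLS for each pair of blocks and store the
# correlation between the two first components.
block_pairs <- combn(names(blocks), 2)
for (k in seq_len(ncol(block_pairs))) {
  a <- block_pairs[1, k]
  b <- block_pairs[2, k]
  pls.res <- pls(blocks[[a]], blocks[[b]], ncomp = 1)
  r <- as.numeric(cor(pls.res$variates$X, pls.res$variates$Y))
  design[a, b] <- design[b, a] <- r
}

round(design, 2)
```

These values are only a starting point. As discussed further down the thread, values close to 1 favour correlation between blocks at the expense of discrimination, so you may still want to lower them.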

hi @lonnekenouwen

  1. What is the design default of block.splsda (I cannot find it)?

There is no default here; you have to set it up yourself (see the examples given in our help files or bookdown).

  2. What is the effect of changing the design on the output? For example: if I say (arbitrarily) that two datasets correlate at 0.8, but in fact it is 0.7 or 0.9, what would be the effect (or actually the error) on the output (the number of components, variable inclusion, loadings etc.)?

Short answer: sometimes quite strong! It depends on the cross-correlation structure of the data and their discriminative power, so the number of components, the selected variables and the loadings can all vary. The closer the design values are to 1, the more correlated the selected variables will be across blocks, but the less discriminative the model will be (see our simulation study and explanations in the DIABLO paper).

  3. I think this relates to the above question. In the manual it is stated that one could estimate the values in the design matrix via the non sparse pls analysis. I have done pls analyses, but from what function, plot, analysis can you delineate the correlation?

See my answer to the other question, hope it helps!

Kim-Anh

Hi Kim,
Thanks so much, that was very helpful and I finally designed my matrix properly and got the desired features.
Tarana

Hi @Tmekh @lonnekenouwen @kimanh.lecao

Thanks for the great discussion.

I have a related question: Would you then use the PLS results for your design matrix (for DIABLO)?

Say you have three datasets and ran the following:

dataset A vs dataset B

pls.res = pls(A, B, ncomp = 3)
diag(cor(pls.res$variates$X, pls.res$variates$Y))

    comp1     comp2     comp3 
0.8762957 0.7924082 0.8575647

dataset A vs dataset C

pls.res = pls(A, C, ncomp = 3)
diag(cor(pls.res$variates$X, pls.res$variates$Y))

    comp1     comp2     comp3     
0.8440602 0.8388065 0.9273404

dataset C vs dataset B

pls.res = pls(C, B, ncomp = 3)
diag(cor(pls.res$variates$X, pls.res$variates$Y))

    comp1     comp2     comp3     
0.7094788 0.7333678 0.6867342

Would you then use the following design matrix:

design = matrix(c(0,0.85,0.85,0.85,0,0.7,0.85,0.7,0), ncol = length(X), nrow = length(X), 
                 dimnames = list(names(X), names(X)))

design

        A       B        C
A      0.00    0.85     0.85
B      0.85    0.00     0.70
C      0.85    0.70     0.00

Or would you just set the off-diagonal values to 1 since the values from the PLS correlation are relatively high?

Thanks,

Ramiro

@rramiro,
Your correlation values are pretty high, and you could stick with what you propose. PLS is unsupervised, so it gives you an idea of the amount of agreement / correlation you could extract pairwise.
Have another read of the DIABLO paper regarding the compromise between correlation and discrimination though (especially the supplemental results on the simulated data).

Kim-Anh
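
For reference, the two designs discussed above can be built by block name rather than by position in a flat vector, which makes it harder to misplace a value. This is a base-R sketch only: the pairwise values are the rounded PLS correlations quoted in the question, and the names `pairwise`, `design` and `full_design` are illustrative.

```r
# Pairwise values rounded from the PLS correlations quoted in the thread.
pairwise <- c("A-B" = 0.85, "A-C" = 0.85, "B-C" = 0.70)
block_names <- c("A", "B", "C")

# Design with zeros on the diagonal, filled symmetrically by block name.
design <- matrix(0, nrow = 3, ncol = 3,
                 dimnames = list(block_names, block_names))
for (pair in names(pairwise)) {
  ab <- strsplit(pair, "-", fixed = TRUE)[[1]]
  design[ab[1], ab[2]] <- design[ab[2], ab[1]] <- pairwise[[pair]]
}

# Alternative raised in the question: 1 everywhere off the diagonal.
full_design <- 1 - diag(length(block_names))
dimnames(full_design) <- list(block_names, block_names)

design
full_design
```

Either matrix can then be passed as the design argument of block.splsda(). Which one is preferable depends on the correlation versus discrimination trade-off mentioned above.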

Hi,
May I please ask a question following on from this very helpful discussion?
I am integrating 2 omics datasets obtained from the same samples using DIABLO. The 2 datasets are quite highly correlated:


When I check the output from running plotDiablo (having tested a range of different design matrices), the correlation structure between both datasets is generally about 0.9 also.
However, the features selected by running DIABLO with a design matrix of 0.9 between both datasets are less biologically interesting than those selected when I specify a design matrix with a lower correlation structure - for example, 0.3. Is it reasonable to justify the selection of a design matrix of 0.3 in this case on the basis that it selects more biologically interesting features from both datasets? Invariably, the correlation structure of the selected features is approx 0.9 anyway regardless of the specified design matrix.

May I also ask how to generate reproducible results using DIABLO? I use set.seed() prior to running tune.block.splsda() and block.splsda() but the features selected by DIABLO change slightly each time I run the analysis despite keeping all of the other parameters the same?

Thank you very much for your help.

Hi,

We are also working on integrating two different omics datasets and are facing the same issue of not getting reproducible feature selection. Could you please let us know how you dealt with this issue?

Thank you.

hi @r.priyanka1802,

If your data sets are highly correlated, it is quite possible that one selected feature gets swapped for another, highly correlated feature from one run to the next. This is a downside of lasso penalisation.

I would use the perf function, run it across several repeats (nrepeat), and look at the stability of selection to get a better idea of which features are really important when the data set is perturbed using cross-validation. See here for PLS-DA (the interpretation / code would be similar for DIABLO): sPLSDA SRBCT Case Study | mixOmics

Kim-Anh
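
The idea of selection stability can be illustrated without any mixOmics call. The sketch below only tallies how often each feature is selected across repeated runs. It assumes you have already collected the selected feature names from each run yourself (for instance with selectVar() on each fitted model); the feature names and the helper `selection_frequency` are hypothetical, and this is not the output format of perf.

```r
# Hypothetical selections from four repeated runs (made-up feature names).
runs <- list(
  run1 = c("gene_a", "gene_b", "gene_c"),
  run2 = c("gene_a", "gene_b", "gene_d"),
  run3 = c("gene_a", "gene_c", "gene_d"),
  run4 = c("gene_a", "gene_b", "gene_c")
)

# Proportion of runs in which each feature was selected, highest first.
selection_frequency <- function(selections) {
  counts <- table(unlist(lapply(selections, unique)))
  freq <- setNames(as.vector(counts) / length(selections), names(counts))
  sort(freq, decreasing = TRUE)
}

freq <- selection_frequency(runs)
freq
```

Features selected in nearly every run are the ones to trust. Features that appear in only some runs are likely being swapped with correlated neighbours, as described above.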