Using keep.X from separate sPLS-DA analyses for Diablo

Hi, I have three datasets I wish to integrate. I have performed separate sPLS-DA for each dataset and I have two categories: healthy (n=9) and diseased (n=6). The ideal ncomp and keep.X according to tune.splsda were as follows:
dataset  ncomp  keep.X
A        1      9
B        3      600, 460, 110
C        1      10

Analysing B further, only the first 30 variables of comp1 are significantly changed.
I would therefore like to integrate these variables in DIABLO.
I ran:

list.keepX <- list(colon = c(9,1), plasma = c(30,1), olink = c(10,1))
MyResult.diablo <- block.splsda(X, Y, keepX=list.keepX, ncomp=2)

But when I visualise the data, e.g. with circosPlot, many of the variables in the plot are not the ones I wished to select (i.e. colon = c(9,1), plasma = c(30,1), olink = c(10,1)), and many interesting ones are missing. So it seems my code is not extracting the correct variables. Did I misunderstand? How can I integrate only the variables of interest? Or would you argue against doing this at all given the small number of samples? Alternatively, how can I identify the best number of variables for each dataset for DIABLO? Is there something similar to tune.splsda that works when X contains three datasets?
The variables we identified using the separate analyses are highly significant and make sense, so I wish to identify relationships amongst them.
Thank you very much for your help.
I very much enjoy mixOmics; it makes PLS-DA very easy and gives beautiful figures :slight_smile:

Cheers,
Stef

Hi @stepra

Thank you for using mixOmics and sharing your analysis thoughts.

In fact, we do have a tune.block.splsda function, so I strongly recommend you look into that.

On a side note, please keep in mind that when you perform sPLS-DA, variables are selected from X because they explain Y. In DIABLO, however, the design matrix specifies whether you are also interested in correlations between the X datasets (full design). If you wish to perform a single-step integration with respect to Y and are only interested in selecting variables that explain (correlate with) Y, you can use a design matrix whose elements are all 0 (null design). See ?block.splsda for more.
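As a rough sketch of what that tuning could look like: the example below uses the nutrimouse data shipped with mixOmics rather than your own blocks, and the grid of keepX values, the number of folds and the design weight of 0.1 are illustrative assumptions, not recommendations.

```r
library(mixOmics)

data(nutrimouse)
X <- list(gene = nutrimouse$gene, lipid = nutrimouse$lipid)
Y <- nutrimouse$diet

# Weak link (0.1) between blocks, so selection is driven mainly by Y.
# Use 0 everywhere for a null design (assumed values, adjust to your question).
design <- matrix(0.1, nrow = length(X), ncol = length(X),
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

# Candidate numbers of variables to keep in each block (assumed grid).
test.keepX <- list(gene = c(5, 10, 20), lipid = c(5, 10))

set.seed(42)  # the cross-validation folds are drawn at random
tune.res <- tune.block.splsda(X, Y, ncomp = 2,
                              test.keepX = test.keepX,
                              design = design,
                              validation = "Mfold", folds = 4, nrepeat = 1,
                              dist = "centroids.dist")

# One vector per block, one entry per component.
list.keepX <- tune.res$choice.keepX
list.keepX

# Refit the final model with the tuned keepX.
final <- block.splsda(X, Y, ncomp = 2, keepX = list.keepX, design = design)
```

Note that keepX only sets how many variables are kept per block and component; which variables are kept is decided again by block.splsda. With as few as 15 samples, validation = "loo" may be worth trying in place of M-fold, and the tuned values should be read with caution.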

Hope it helps

Al

Hi Al,
Thank you for these very useful comments. I will try your suggestions! One question though: when I use design = matrix(0, ncol = length(X), nrow = length(X), dimnames = list(names(X), names(X))), does that mean I cannot see how the variables in each dataset that explain Y correlate with each other? I basically want to identify the variables in each X that explain the disease (which I have done by separate PLS-DA analyses) AND I want to know which of these variables correlate with each other.
Thanks again!
/Stef

Hi @stepra,

My pleasure. The selected variables are often naturally correlated, even with a ‘null’ design. You would be able to verify that using circosPlot. However, the algorithm no longer optimises for that criterion; it focuses only on the variables that are most correlated with Y. Please note that you can use any value from 0 to 1 for each element of the design matrix. For instance:

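library(mixOmics)
data(nutrimouse)
Y = nutrimouse$diet
# gene.copy is a shifted copy of the gene block, used here as a third block
X = list(gene = nutrimouse$gene, lipid = nutrimouse$lipid, gene.copy = nutrimouse$gene + 1)
# weight of 0.1 between every pair of blocks, 0 on the diagonal
design = matrix(0.1, ncol = length(X), nrow = length(X), dimnames = list(names(X), names(X)))
diag(design) <- 0
design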
          gene lipid gene.copy
gene       0.0   0.1       0.1
lipid      0.1   0.0       0.1
gene.copy  0.1   0.1       0.0
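To check afterwards how the selected variables relate across blocks, something along these lines might help. It is only a sketch: it uses two nutrimouse blocks, and the keepX values are arbitrary assumptions chosen for illustration.

```r
library(mixOmics)

data(nutrimouse)
X <- list(gene = nutrimouse$gene, lipid = nutrimouse$lipid)
Y <- nutrimouse$diet

design <- matrix(0.1, nrow = length(X), ncol = length(X),
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

# Number of variables kept per block and component (assumed values).
keepX <- list(gene = c(10, 10), lipid = c(5, 5))
res <- block.splsda(X, Y, ncomp = 2, keepX = keepX, design = design)

# Names of the variables selected on component 1 in each block.
sel.gene  <- selectVar(res, block = "gene",  comp = 1)$gene$name
sel.lipid <- selectVar(res, block = "lipid", comp = 1)$lipid$name

# Pairwise correlations between the selected genes and the selected lipids.
cross.cor <- cor(X$gene[, sel.gene], X$lipid[, sel.lipid])
round(cross.cor, 2)
```

Calling circosPlot(res, cutoff = 0.7) on the same fitted object draws the between-block links above a correlation cutoff of your choice, so you can compare the picture under a null design (all 0) with the one under a small non-zero weight.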

Hope it helps

Al