Morning @DjamilaE,
No apologies required!
While I know you can colour by groups in some of the plot options, I’m not sure if it is possible to somehow account for group membership in the analysis itself
Unfortunately, there isn’t a way to do this directly within the spls() function. However, we can explore the data in certain ways to achieve something similar. I’ll start by setting up the data:
library(mixOmics)
data("liver.toxicity")
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
group <- liver.toxicity$treatment$Dose.Group
# taken from the sPLS Case Study (mixOmics.org)
optimal.ncomp <- 2
optimal.keepX <- c(35, 45)
optimal.keepY <- c(4,4)
liver.spls <- spls(X, Y, mode = "canonical",
                   ncomp = optimal.ncomp,
                   keepX = optimal.keepX,
                   keepY = optimal.keepY)
One way I can think of to incorporate the group into your analysis would be to run an individual splsda() on each of your input dataframes (X and Y) against the group, and then to visualise the relationship between the features selected in each case.
Firstly, we can look at plotLoadings() (read more about the colouring here) for each dataset against the group.
par(mfrow = c(2, 2))
# sPLS-DA of the genes against the dose group
gene.splsda <- splsda(X, group,
                      ncomp = optimal.ncomp,
                      keepX = optimal.keepX)
plotLoadings(gene.splsda, contrib = "max", method = "median",
             title = "Figure 1a, comp1")
plotLoadings(gene.splsda, contrib = "max", method = "median", comp = 2,
             title = "Figure 1b, comp2")
par(mfrow = c(2, 2))
# sPLS-DA of the clinical variables against the dose group;
# Y is the predictor block here, so optimal.keepY goes into keepX
treatment.splsda <- splsda(Y, group,
                           ncomp = optimal.ncomp,
                           keepX = optimal.keepY)
plotLoadings(treatment.splsda, contrib = "max", method = "median",
             title = "Figure 2a, comp1")
plotLoadings(treatment.splsda, contrib = "max", method = "median", comp = 2,
             title = "Figure 2b, comp2")
The next idea I can think of would be to produce a heatmap of the correlations between the variables selected by each of these splsda() calls. I use heatmap() here so it can be shown in the forum; for your analysis, I’d recommend using cim(). You could also explore using the network() function.
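# names of variables with a non-zero loading on any component;
# unique() avoids duplicated columns when a variable is selected on both
selected.genes <- unique(rownames(which(gene.splsda$loadings$X != 0, arr.ind = TRUE)))
selected.treatments <- unique(rownames(which(treatment.splsda$loadings$X != 0, arr.ind = TRUE)))
X.s <- X[, selected.genes]
Y.s <- Y[, selected.treatments]
heatmap(cor(X.s, Y.s))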
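As an alternative to indexing the loadings matrix directly, the selected features can also be pulled out with selectVar(). The snippet below is only a sketch of that idea for the gene block: it refits the same splsda() so it can run on its own, and the names genes.comp1 and genes.comp2 are just illustrative.

```r
library(mixOmics)
data("liver.toxicity")
X <- liver.toxicity$gene
group <- liver.toxicity$treatment$Dose.Group

# same gene model as above, refitted so this chunk is self-contained
gene.splsda <- splsda(X, group, ncomp = 2, keepX = c(35, 45))

# selectVar() returns the names of the variables kept on one component
genes.comp1 <- selectVar(gene.splsda, comp = 1)$name
genes.comp2 <- selectVar(gene.splsda, comp = 2)$name

# union() drops any gene that was selected on both components
selected.genes <- union(genes.comp1, genes.comp2)
X.s <- X[, selected.genes]
dim(X.s)
```

The same pattern applies to the clinical block by swapping in treatment.splsda and Y.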
The subsetted data (X.s and Y.s) can then be fed into its own spls() call and analysed.
could you suggest a way to select the optimal number of components in this case?
You can definitely still use the tune() and perf() functions when using sPLS in canonical mode. The two code chunks below show how you could go about this.
subset.spls <- spls(X.s, Y.s, mode = "canonical", ncomp = 5)
sub.spls.perf <- perf(subset.spls, folds = 5, nrepeat = 5)
plot(sub.spls.perf, criterion = "cor.tpred") # explore different criteria
# tune on the subsetted data, matching the perf() call above
sub.spls.tune <- tune.spls(X.s, Y.s, test.keepX = 1:10, folds = 5, ncomp = 5)
plot(sub.spls.tune)
how would I best go about setting this range for both sets of data?
I would suggest an iterative approach. Start with a broad range of values at large intervals, then repeat the tuning with a narrower, finer-grained test.keepX each time, centred on the best value from the previous round. I go into it more in this post.
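To make the iterative idea concrete, here is a rough two-pass sketch on the gene block. The grid values, the single component and the seed are all arbitrary choices of mine for illustration, and I’m assuming the tuned value is returned in $choice.keepX as in recent mixOmics versions.

```r
library(mixOmics)
data("liver.toxicity")
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic

set.seed(42)  # cross-validation folds are drawn at random

# pass 1: broad range with large steps (illustrative values)
coarse.grid <- seq(10, 100, by = 10)
coarse.tune <- tune.spls(X, Y, test.keepX = coarse.grid, folds = 5, ncomp = 1)
best.coarse <- coarse.tune$choice.keepX[1]

# pass 2: finer steps centred on the best value from pass 1
fine.grid <- seq(max(1, best.coarse - 8), best.coarse + 8, by = 2)
fine.tune <- tune.spls(X, Y, test.keepX = fine.grid, folds = 5, ncomp = 1)
fine.tune$choice.keepX
```

Y only has 10 clinical variables, so on that side it should be feasible to test every value rather than narrowing down in passes (recent versions of tune.spls() take a test.keepY argument for this, as far as I know).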
Hope these answers help.
Max.