How to account for group membership in sPLS?

Hi mixOmics team,

Thank you so much for creating such a fantastic package! I am quite new to microbiome analysis so I apologise if the following questions are very straightforward:

I am interested in using sPLS (canonical mode) to find associations between various dietary factors with microbiome abundance data. I was wondering if you could suggest what to do to account for group membership as I have 4 different participant groups (controls, IBS, anxiety/depression, comorbid IBS and anxiety/depression) and I’d be very interested to see if diet-microbiota associations differ in these various groups. While I know you can colour by groups in some of the plot options, I’m not sure if it is possible to somehow account for group membership in the analysis itself. Any advice would be really appreciated!

In relation to using the canonical mode, I saw that tuning is not possible - could you suggest a way to select the optimal number of components in this case?

Another question I had is how one would decide how to set the range of test values for the no. of variables to use from each data set. Is this quite an arbitrary decision or should that be based on how many variables are in each data set? If I have 1000 ASVs and 65 dietary variables, how would I best go about setting this range for both sets of data?

Thank you in advance (and apologies if these questions have been asked before)!

1 Like

Morning @DjamilaE,

No apologies required!

While I know you can colour by groups in some of the plot options, I’m not sure if it is possible to somehow account for group membership in the analysis itself

Unfortunately, there isn’t a way to do this directly within the spls() function. However, we can explore data in certain ways to achieve something similar. I’ll start by just setting up the data and such:

library(mixOmics)

data("liver.toxicity")

X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
group <- liver.toxicity$treatment$Dose.Group

# taken from the sPLS Case Study (mixOmics.org)
optimal.ncomp <- 2
optimal.keepX <- c(35, 45)
optimal.keepY <- c(4,4)

liver.spls <- spls(X, Y, mode = "canonical",
                   ncomp = optimal.ncomp,
                   keepX = optimal.keepX,
                   keepY = optimal.keepY)

One way I can think to incorporate the group into your analysis would be to run individual splsda on each of your input dataframes (X and Y) and then to run visualise the relation between the selected features in each of these cases.

Firstly, we can look at the plotLoadings() (read more about the colouring here) for each dataset against the group.

par(mfrow=c(2,2))
gene.splsda <- splsda(X, group,
                      ncomp = optimal.ncomp,
                      keepX = optimal.keepX)
plotLoadings(gene.splsda, contrib = "max", method = "median",
             title = "Figure 1a, comp1")
plotLoadings(gene.splsda, contrib = "max", method = "median", comp = 2,
             title = "Figure 1b, comp2") #





par(mfrow=c(2,2))
treatment.splsda <- splsda(Y, group,
                           ncomp = optimal.ncomp,
                           keepX = optimal.keepY)
plotLoadings(treatment.splsda, contrib = "max", method = "median",
             title = "Figure 2a, comp1")
plotLoadings(treatment.splsda, contrib = "max", method = "median", comp = 2,
             title = "Figure 2b, comp2") #

The next idea I can think of would be to produce a heatmap of correlations between the variables selected by each of these splsda() calls. I use heatmap() here so it can be shown in the forum - for your analysis, I’d recommend using cim(). You could also explore using the network() function



selected.genes <- rownames(which(gene.splsda$loadings$X!=0, arr.ind = T))
selected.treaments <- rownames(which(treatment.splsda$loadings$X!=0, arr.ind = T))

X.s <- X[, selected.genes]
Y.s <- Y[, selected.treaments]

heatmap(cor(X.s, Y.s))

The subsetting data (X.s and Y.s) can then be fed into its own spls() call and analysed.

could you suggest a way to select the optimal number of components in this case?

You can definitely still use the tune() and perf() functions when using sPLS in canonical mode. The two below code chunks depict how you could go about this.

subset.spls <- spls(X.s, Y.s, mode = "canonical", ncomp = 5)
sub.spls.perf <- perf(subset.spls, folds = 5, nrepeat = 5)
plot(sub.spls.perf, criterion = "cor.tpred") # explore different criteria
sub.spls.tune <- tune.spls(X, Y, test.keepX = c(1:10), folds = 5, ncomp = 5)
plot(sub.spls.tune)

how would I best go about setting this range for both sets of data?

I would suggest an iterative approach. Start with a broad range of values, with large intervals and repeat the tuning, using a finer test.keepX each time. I go into it more in this post.

Hope these answers help.
Max.

Hi Max,

Thank you so much for your quick and thoughtful reply, I really appreciate your generous time! I hope you don’t mind if I ask a few follow-up questions:

I’ve gone ahead and run individual sPLS-DAs for each of the datasets (ASV, diet) and produced the plotLoadings for each of these. I’ve then used these loadings to produce the heatmap as you’ve suggested (I haven’t tried doing this using cim() but will try to do so). What I was wondering is if I can try to delineate the groupings somehow within this heatmap?

As for the next part,

The subsetting data ( X.s and Y.s ) can then be fed into its own spls() call and analysed

I wasn’t sure how you could get the grouping variable out of this again if you were to run a follow-up sPLS. Apologies if I’ve misunderstood something but the main idea was to see if I could find meaningful differences between the different groups in these microbiome-diet associations (if they exist).

Thank you so much in advance :slight_smile:

What I was wondering is if I can try to delineate the groupings somehow within this heatmap?

This whole method is far from exact, but here’s a suggestion. Have a look at the group associated with each feature (via the colouring of the plotLoadings() output). With this in mind, look at the clustering of features in the heatmap. This heatmap won’t explicitly show groups as you are not displaying any samples. However, features with high correlations may be associated (maximum contributing class, look at contrib and method parameters in plotLoadings()) with the same class. Feature pairs which are closely clustered with this pair may help in your understanding of which features define each class.

I wasn’t sure how you could get the grouping variable out of this again if you were to run a follow-up sPLS.

The intention here wasn’t to explicitly examine the grouping, but to understand the relationship between variables which define each class. Unfortunately, there is not way to directly examine classes within the sPLS framework.

I understand that you said you were interested the the sPLS methodology, though I’d highly recommend exploring the DIABLO framework. This is an extension of the sPLS algorithm and is able to incorporate multiple datasets and a classification response variable. Read more here.

Hi Max,

Thank you once again for your suggestions. I will definitely look into DIABLO.

I really appreciate your time! :smiling_face: