Transcriptomic signature with sPLS-DA


I applied the sPLS-DA to find a transcriptomic signature. Here is briefly the process I had:
I have 3 groups I want to discriminate (group A I am interested in, and groups B and C are controls) and I strictly followed your well made SRBCT study case.
X_Sig is my gene expression matrix (52 samples x 24716 genes), and Y_Sig is the response matrix (GROUP_A, GROUP_B and GROUP_C, the smallest group has 10 samples).

# Compute PLS-DA
core_sig.plsda <- plsda(X_Sig, Y_Sig, ncomp = 10)

# Compute performance 
set.seed(40) # for reproducibility, only when the 'cpus' argument is not used
perf.core_sig.plsda <- perf(core_sig.plsda, validation = "Mfold", folds = 10, 
                progressBar = TRUE, auc = TRUE, nrepeat = 10) 

opt_nb_comp <- perf.core_sig.plsda$choice.ncomp 

# sPLS-DA tuning
tune.core_sig.splsda <- tune.splsda(X_Sig, Y_Sig, ncomp = 2, validation = 'Mfold', folds = 10,
                   progressBar = TRUE, dist = 'mahalanobis.dist', measure = "BER",
                   test.keepX = c(1:10,  seq(20, 300, 10)), nrepeat = 10, cpus=3)

# Optimal number of components based on t-tests
ncomp <- tune.core_sig.splsda$choice.ncomp$ncomp


# Optimal number of variables to select
select.keepX <- tune.core_sig.splsda$choice.keepX[1:ncomp]


One component and one gene is sufficient to discriminate the 3 groups. However, I had decided to increase to 2 components and take 1 and 5 genes respectively:

# Computing the final sPLS-DA model
core_sig.splsda <- splsda(X_Sig, Y_Sig, ncomp = 2, keepX = c(1, 5))

# Performance assessment
set.seed(40) # for reproducibility, only when the `cpus' argument is not used
# takes about 1 min to run
perf.core_sig <- perf(core_sig.splsda, validation = "Mfold", folds = 10,
          dist = 'mahalanobis.dist', nrepeat = 10,
          progressBar = TRUE)

However, I decided to increase nrepeat (in every step there is nrepeat) from 10 to 1000: can you confirm it is more robust to do that instead?

By increasing from 10 to 1000 nrepeat, the final sPLS-DA model has to be computed with 2 components and 1 and 240 genes. The gene on the first component is the same than when I used nrepeat=10.

Using for the whole process nrepeat=1000, here is the plot of my final sPLS-DA model:

In red is my GROUP_A, and greys are my controls.

Here is the performance plot of this final model:

Then, I would like to define a signature that “positively” defines the GROUP_A.
Is it relevant to only extract the genes which contribute the most to this group A, and then use only them with the predict function to validate (I plan to use another transcriptomic dataset)?

# Plot loadings
ContribComp1 <- plotLoadings(core_sig.splsda, comp = 1, title = 'Loadings on comp 1', 
             contrib = 'max', method = 'mean')
ContribComp2 <- plotLoadings(core_sig.splsda, comp = 2, title = 'Loadings on comp 2', 
             contrib = 'max', method = 'mean')

# Get the genes that contribute the most to GROUP_A
selectVar(core_sig.splsda, comp=2)$name[which(ContribComp1$GroupContrib == "GROUP_A")]
selectVar(core_sig.splsda, comp=2)$name[which(ContribComp2$GroupContrib == "GROUP_A")]

Thank you,


Hi Lisa,
Thanks for sharing your code and outputs, it makes it easier for us to respond.

  • Yes, increasing the number of repeats in perf is advisable, but you don’t need to go that far. I would say 100 is pretty safe.

What your results suggest are:
There is some discrimination on component 1, and only one gene suffice to do so. However I would encourage you to tune from 5 genes onwards, otherwise the interpretation for this component is quite limited! They may not add much discrimination power, but they probably are as important, biologically, so you can check this.
You can see from your performance plot that your discrimination is quite good already (0.035 on component 1!) which means that the second component refines only slightly the classification, but not drastically (probablysome samples in Group A). Note: it is possible that you dont really need 240 genes (that sounds a lot, and maybe the performance would be similar if you included 100). From the perf() outputs you can obtain the matrix true vs predicted, and work out exactly what is the contribution of each component to discriminate each group.

From your sample plot it seems that you may need a mix of component 1 + 2 to discriminate Group A. I can’t see the loading plots but that will tell you. PlotVar will also tell you those transcripts.

As long as you select those genes and predict on another dataset it is fine (if you do it on the same dataset then you would overfit).

Overall you are on the good path, just relax a little bit the keepX to ensure some biological relevance :slight_smile:



Thanks a lot for your detailed and helpful reply. I finally did again the whole process with nrepeat=500. I also used keepX=c(20,200) for the sPLS-DA final model, as the select.keepX from the new tuning suggested 1 and 190 variables, on 2 components.

Here are the updated sPLS-DA final model and associated performance plots:

And here are the contribution plots:

Surprisingly, the 20 selected genes on the first component do not contribute to GROUP_A nor GROUP_B at all, whereas they look quite well separated on component 1 on the final model plot. Would that make sense (bioinformatically and would it be biologically relevant) to increase keepX on the first component to the point I get genes contributing the most to GROUP_A? I had tried it, and I saw some genes contributing to GROUP_A only from 200 genes selected.

I looked for getting the matrix true vs predicted from the perf() output as you suggested, but did not find how to do it (which slot).

Thanks a lot for your insight and the link with biological relevance :slight_smile:


Hi Lisa,
plotLoadings(..., method = 'mean', contrib = 'min') I suspect those genes are lowly expressed in group A.

It is hard to distinguish the gray colors in your component 2 plotLoadings but you would like to see a clear discrimination between the negative and the positive weights that correspond to specific classes (i.e. dark gray vs light gray+red)

Regarding the perf outputs, here is an example:
X <- srbct$gene
Y <- srbct$class

ncomp = 2

srbct.splsda <- splsda(X, Y, ncomp = ncomp, keepX = rep(10, ncomp))
perf.srbct <- perf(srbct.splsda, validation = "Mfold", folds = 10, nrepeat = 10, progressBar = FALSE) 

Hope that helps!

Hello Kim-Anh,

Yes, you are right, those genes on the first component are lowly expressed in group A:

plotLoadings(core_sig.splsda, comp = 1, title = 'Loadings on comp 1', contrib = 'min', method = 'mean', legend.color=c("red", "darkblue", "lightgrey"))

On component 2, yes, I have dark gray on a side (I changed it by darkblue) vs. red + light gray:

plotLoadings(..., method = 'mean', contrib = 'max', comp=2)

Thanks a lot for your help, I could find how to do it:

truevspred <- data.frame(perf.core_sig$class$mahalanobis.dist[,500,1], perf.core_sig$class$mahalanobis.dist[,500,2], Y_Sig) colnames(truevspred) <- c("Pred.comp1", "Pred.comp2", "True")

I took the last repeat which is (I think) the more reliable.

Everything is well predicted on both components, except for two samples, but I think I misclassified them at first (in my Y matrix).

I will have some more questions, but about the prediction of new samples, so I think it will fall in another topic

Thanks again a lot for your help, Kim-Anh, everything is much clearer on this side now :smiley:


Hi Lisa,
you should average across all repeats. So basically calculate truevspred for each repeat, then average.


Hello Kim-Anh,

Ok, thanks a lot for your precious advice. I built contigency tables, for each sample across all repeats (I just did not divide by the number of repeats, which was set to 500). Here is my code:

## Contigency table for each sample
# Component 1
cont.pred.comp1 <- matrix(NA, nrow = 52, ncol = 3)
rownames(cont.comp1) <- rownames(perf.core_sig$class$mahalanobis.dist[,,1])
#contigency table per 
for (i in rownames(perf.core_sig$class$mahalanobis.dist[,,1])){
    cont.comp1[i,] <- summary(factor(perf.core_sig$class$mahalanobis.dist[i,,1], levels=c("GROUP_A", "GROUP_B", "GROUP_C")))
colnames(cont.comp1) <- c("prd.cp1.GRP_A", "prd.cp1.GRP_B", "prd.cp1.GRP_C")

# Component 2
cont.pred.comp2 <- matrix(NA, nrow = 52, ncol = 3)
rownames(cont.comp2) <- rownames(perf.core_sig$class$mahalanobis.dist[,,2])

for (i in rownames(perf.core_sig$class$mahalanobis.dist[,,2])){
    cont.comp2[i,] <- summary(factor(perf.core_sig$class$mahalanobis.dist[i,,2], levels=c("GROUP_A", "GROUP_B", "GROUP_C")))

colnames(cont.comp2) <- c("prd.cp2.GRP_A", "prd.cp2.GRP_B", "prd.cp2.GRP_C")

# Build true vs pred data frame
truevspred <- data.frame(cont.comp1, cont.comp2, Y_Sig)

For all of my 52 samples but two (Sample_29 and 30), I have 500 for the same group on the two components and that matches with the true group (Y_Sig column). The samples 29 and 30 are misclassified on component 1 (GROUP_C instead of B): to have a correct classification, should I increase the number of genes to take on component 1? Could that be explained by the fact that all of the genes on component 1 contribute the most to GROUP_A? Or maybe increase the number of repeats?



Hi Lisa,
Thanks for sharing your results. What you obtained suggests the following:

  • indeed (as we discussed before) component 1 is good at discriminating group A from the other groups primarily, but also good at discriminating group C.
  • when you add component 2 then group B is better classified (as I ca see from the column prt.cp2.Grp_B).

No need to do anything, your final result is that you need to include 2 components in your sPLS-DA to get a good classification of your samples. So I think in terms of tuning and evaluation performance that step is complete.