Hello,

I applied the sPLS-DA to find a transcriptomic signature. Here is briefly the process I had:

I have 3 groups I want to discriminate (group A I am interested in, and groups B and C are controls) and I strictly followed your well made SRBCT study case.

X_Sig is my gene expression matrix (52 samples x 24716 genes), and Y_Sig is the response matrix (GROUP_A, GROUP_B and GROUP_C, the smallest group has 10 samples).

```
# Compute PLS-DA
set.seed(2543)
core_sig.plsda <- plsda(X_Sig, Y_Sig, ncomp = 10)
# Compute performance
set.seed(40) # for reproducibility, only when the 'cpus' argument is not used
perf.core_sig.plsda <- perf(core_sig.plsda, validation = "Mfold", folds = 10,
progressBar = TRUE, auc = TRUE, nrepeat = 10)
opt_nb_comp <- perf.core_sig.plsda$choice.ncomp
opt_nb_comp
# sPLS-DA tuning
set.seed(2543)
tune.core_sig.splsda <- tune.splsda(X_Sig, Y_Sig, ncomp = 2, validation = 'Mfold', folds = 10,
progressBar = TRUE, dist = 'mahalanobis.dist', measure = "BER",
test.keepX = c(1:10, seq(20, 300, 10)), nrepeat = 10, cpus=3)
# Optimal number of components based on t-tests
ncomp <- tune.core_sig.splsda$choice.ncomp$ncomp
ncomp
```

1

```
# Optimal number of variables to select
select.keepX <- tune.core_sig.splsda$choice.keepX[1:ncomp]
select.keepX
```

1

One component and one gene is sufficient to discriminate the 3 groups. However, I had decided to increase to 2 components and take 1 and 5 genes respectively:

```
# Computing the final sPLS-DA model
set.seed(2543)
core_sig.splsda <- splsda(X_Sig, Y_Sig, ncomp = 2, keepX = c(1, 5))
# Performance assessment
set.seed(40) # for reproducibility, only when the `cpus' argument is not used
# takes about 1 min to run
perf.core_sig <- perf(core_sig.splsda, validation = "Mfold", folds = 10,
dist = 'mahalanobis.dist', nrepeat = 10,
progressBar = TRUE)
```

However, I decided to increase nrepeat (in every step there is nrepeat) from 10 to 1000: can you confirm it is more robust to do that instead?

By increasing from 10 to 1000 nrepeat, the final sPLS-DA model has to be computed with 2 components and 1 and 240 genes. The gene on the first component is the same than when I used nrepeat=10.

Using for the whole process nrepeat=1000, here is the plot of my final sPLS-DA model:

In red is my GROUP_A, and greys are my controls.

Here is the performance plot of this final model:

Then, I would like to define a signature that “positively” defines the GROUP_A.

Is it relevant to only extract the genes which contribute the most to this group A, and then use only them with the `predict`

function to validate (I plan to use another transcriptomic dataset)?

```
# Plot loadings
ContribComp1 <- plotLoadings(core_sig.splsda, comp = 1, title = 'Loadings on comp 1',
contrib = 'max', method = 'mean')
ContribComp2 <- plotLoadings(core_sig.splsda, comp = 2, title = 'Loadings on comp 2',
contrib = 'max', method = 'mean')
# Get the genes that contribute the most to GROUP_A
selectVar(core_sig.splsda, comp=2)$name[which(ContribComp1$GroupContrib == "GROUP_A")]
selectVar(core_sig.splsda, comp=2)$name[which(ContribComp2$GroupContrib == "GROUP_A")]
```

Thank you,

Lisa