Hello,
I applied the sPLS-DA to find a transcriptomic signature. Here is briefly the process I had:
I have 3 groups I want to discriminate (group A I am interested in, and groups B and C are controls) and I strictly followed your well made SRBCT study case.
X_Sig is my gene expression matrix (52 samples x 24716 genes), and Y_Sig is the response matrix (GROUP_A, GROUP_B and GROUP_C, the smallest group has 10 samples).
# Compute PLS-DA
set.seed(2543)
core_sig.plsda <- plsda(X_Sig, Y_Sig, ncomp = 10)
# Compute performance
set.seed(40) # for reproducibility, only when the 'cpus' argument is not used
perf.core_sig.plsda <- perf(core_sig.plsda, validation = "Mfold", folds = 10,
progressBar = TRUE, auc = TRUE, nrepeat = 10)
opt_nb_comp <- perf.core_sig.plsda$choice.ncomp
opt_nb_comp
# sPLS-DA tuning
set.seed(2543)
tune.core_sig.splsda <- tune.splsda(X_Sig, Y_Sig, ncomp = 2, validation = 'Mfold', folds = 10,
progressBar = TRUE, dist = 'mahalanobis.dist', measure = "BER",
test.keepX = c(1:10, seq(20, 300, 10)), nrepeat = 10, cpus=3)
# Optimal number of components based on t-tests
ncomp <- tune.core_sig.splsda$choice.ncomp$ncomp
ncomp
1
# Optimal number of variables to select
select.keepX <- tune.core_sig.splsda$choice.keepX[1:ncomp]
select.keepX
1
One component and one gene is sufficient to discriminate the 3 groups. However, I had decided to increase to 2 components and take 1 and 5 genes respectively:
# Computing the final sPLS-DA model
set.seed(2543)
core_sig.splsda <- splsda(X_Sig, Y_Sig, ncomp = 2, keepX = c(1, 5))
# Performance assessment
set.seed(40) # for reproducibility, only when the `cpus' argument is not used
# takes about 1 min to run
perf.core_sig <- perf(core_sig.splsda, validation = "Mfold", folds = 10,
dist = 'mahalanobis.dist', nrepeat = 10,
progressBar = TRUE)
However, I decided to increase nrepeat (in every step there is nrepeat) from 10 to 1000: can you confirm it is more robust to do that instead?
By increasing from 10 to 1000 nrepeat, the final sPLS-DA model has to be computed with 2 components and 1 and 240 genes. The gene on the first component is the same than when I used nrepeat=10.
Using for the whole process nrepeat=1000, here is the plot of my final sPLS-DA model:
In red is my GROUP_A, and greys are my controls.
Here is the performance plot of this final model:
Then, I would like to define a signature that “positively” defines the GROUP_A.
Is it relevant to only extract the genes which contribute the most to this group A, and then use only them with the predict
function to validate (I plan to use another transcriptomic dataset)?
# Plot loadings
ContribComp1 <- plotLoadings(core_sig.splsda, comp = 1, title = 'Loadings on comp 1',
contrib = 'max', method = 'mean')
ContribComp2 <- plotLoadings(core_sig.splsda, comp = 2, title = 'Loadings on comp 2',
contrib = 'max', method = 'mean')
# Get the genes that contribute the most to GROUP_A
selectVar(core_sig.splsda, comp=2)$name[which(ContribComp1$GroupContrib == "GROUP_A")]
selectVar(core_sig.splsda, comp=2)$name[which(ContribComp2$GroupContrib == "GROUP_A")]
Thank you,
Lisa