I applied the sPLS-DA to find a transcriptomic signature. Here is briefly the process I had:
I have 3 groups I want to discriminate (group A I am interested in, and groups B and C are controls) and I strictly followed your well made SRBCT study case.
X_Sig is my gene expression matrix (52 samples x 24716 genes), and Y_Sig is the response matrix (GROUP_A, GROUP_B and GROUP_C, the smallest group has 10 samples).
# Compute PLS-DA set.seed(2543) core_sig.plsda <- plsda(X_Sig, Y_Sig, ncomp = 10) # Compute performance set.seed(40) # for reproducibility, only when the 'cpus' argument is not used perf.core_sig.plsda <- perf(core_sig.plsda, validation = "Mfold", folds = 10, progressBar = TRUE, auc = TRUE, nrepeat = 10) opt_nb_comp <- perf.core_sig.plsda$choice.ncomp opt_nb_comp # sPLS-DA tuning set.seed(2543) tune.core_sig.splsda <- tune.splsda(X_Sig, Y_Sig, ncomp = 2, validation = 'Mfold', folds = 10, progressBar = TRUE, dist = 'mahalanobis.dist', measure = "BER", test.keepX = c(1:10, seq(20, 300, 10)), nrepeat = 10, cpus=3) # Optimal number of components based on t-tests ncomp <- tune.core_sig.splsda$choice.ncomp$ncomp ncomp
# Optimal number of variables to select select.keepX <- tune.core_sig.splsda$choice.keepX[1:ncomp] select.keepX
One component and one gene is sufficient to discriminate the 3 groups. However, I had decided to increase to 2 components and take 1 and 5 genes respectively:
# Computing the final sPLS-DA model set.seed(2543) core_sig.splsda <- splsda(X_Sig, Y_Sig, ncomp = 2, keepX = c(1, 5)) # Performance assessment set.seed(40) # for reproducibility, only when the `cpus' argument is not used # takes about 1 min to run perf.core_sig <- perf(core_sig.splsda, validation = "Mfold", folds = 10, dist = 'mahalanobis.dist', nrepeat = 10, progressBar = TRUE)
However, I decided to increase nrepeat (in every step there is nrepeat) from 10 to 1000: can you confirm it is more robust to do that instead?
By increasing from 10 to 1000 nrepeat, the final sPLS-DA model has to be computed with 2 components and 1 and 240 genes. The gene on the first component is the same than when I used nrepeat=10.
Using for the whole process nrepeat=1000, here is the plot of my final sPLS-DA model:
In red is my GROUP_A, and greys are my controls.
Here is the performance plot of this final model:
Then, I would like to define a signature that “positively” defines the GROUP_A.
Is it relevant to only extract the genes which contribute the most to this group A, and then use only them with the
predict function to validate (I plan to use another transcriptomic dataset)?
# Plot loadings ContribComp1 <- plotLoadings(core_sig.splsda, comp = 1, title = 'Loadings on comp 1', contrib = 'max', method = 'mean') ContribComp2 <- plotLoadings(core_sig.splsda, comp = 2, title = 'Loadings on comp 2', contrib = 'max', method = 'mean') # Get the genes that contribute the most to GROUP_A selectVar(core_sig.splsda, comp=2)$name[which(ContribComp1$GroupContrib == "GROUP_A")] selectVar(core_sig.splsda, comp=2)$name[which(ContribComp2$GroupContrib == "GROUP_A")]