sPLS-DA with only stable features?


I’m performing sPLS-DA on RNA sequencing data (25 samples, 5000 features). I have designated 2 groups; PLS-DA gives good separation of the two groups, and sPLS-DA with 110 features gives even better separation. I’d like to perform sPLS-DA using only the features that are stable during M-fold cross-validation, and then use these results to predict the same groups in external RNA sequencing data. However, I’m uncertain how to approach this.

What I’ve come up with so far is the following:

1. Perform sPLS-DA with ncomp = 2, keepX = c(100, 10).

2. Evaluate performance with perf(), validation = "Mfold", folds = 5, nrepeat = 50.

3. Select variables with stability above 0.8. This gives a list of 26 features from comp1 and 0 from comp2.

4. Extract these 26 stable features from the original data and fit a new sPLS-DA model containing only these features.

5. Use the new "stable only" model to predict groups in the external RNA-seq data.

Is this approach valid or am I misunderstanding or over-thinking the whole thing?





The pipeline you have described will yield what you want. I was able to put together a basic version of it with the code below, which you can use as a point of comparison to verify that your methodology is correct.

I believe I understand your rationale: the most stable features are those that are consistently useful for a model developed on your data, so we can reduce the complexity of the model while retaining most of its predictive ability.

However, by discarding slightly less stable features you might actually be hindering your model. Consider this scenario: by pure chance, more than 20% of the repeats happened to use training/testing splits in which generally useful features had their discriminatory power reduced. Those features would fall below your stability threshold despite being genuinely informative.
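As a quick sanity check on the threshold itself, it can help to look at the full distribution of stability frequencies rather than committing to a hard cutoff like 0.8 up front. A minimal sketch, assuming the `res.perf` object produced by `perf()` in the code below:

```r
# Stability frequencies for component 1 (the proportion of CV repeats
# in which each feature was selected), sorted from most to least stable
stab.c1 <- res.perf$features$stable$comp1
head(sort(stab.c1, decreasing = TRUE), 20)

# How sensitive is the selected feature count to the cutoff?
sapply(c(0.7, 0.8, 0.9), function(t) sum(stab.c1 > t))
```

If the counts change sharply around 0.8, the selection is fragile to the threshold choice and the "stable" set is partly an artefact of where you drew the line.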

At the end of the day, we can look at the classification metrics to assess whether it's worth it. Does the new model, using only the stable subset of selected features, improve performance? If so, to what degree?

If it improves things only negligibly (or makes the model non-negligibly worse), then I'd ask whether the reduction in model complexity is worth it.
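That comparison can be made concrete by cross-validating both models with the same settings and comparing their overall error rates. A minimal sketch, assuming the `res.splsda` (full) and `new.plsda` (stable-only) models fitted in the code below:

```r
# Use identical CV settings for both models so the error rates are
# directly comparable; nrepeat = 50 as in your original description
perf.full   <- perf(res.splsda, validation = "Mfold", folds = 5, nrepeat = 50)
perf.stable <- perf(new.plsda,  validation = "Mfold", folds = 5, nrepeat = 50)

perf.full$error.rate$overall    # per-component overall error, full model
perf.stable$error.rate$overall  # per-component overall error, stable-only model
```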

Looking forward to your response

library(mixOmics)   # provides splsda(), perf(), predict(), get.confusion_matrix()
data(breast.TCGA)   # example dataset shipped with mixOmics

X.tr <- breast.TCGA$data.train$mirna
X.te <- breast.TCGA$data.test$mirna
Y.tr <- breast.TCGA$data.train$subtype
Y.te <- breast.TCGA$data.test$subtype

# Initial sparse model, matching your keepX
res.splsda <- splsda(X.tr, Y.tr, ncomp = 2, keepX = c(100, 10))

# Predict on the external (test) data and inspect the confusion matrix
predictions <- predict(res.splsda, newdata = X.te)
get.confusion_matrix(truth = Y.te, predicted = predictions$class$max.dist[, 2])

# Cross-validate to obtain feature stabilities
res.perf <- perf(res.splsda, validation = 'Mfold', folds = 5, nrepeat = 3)

# Features selected in more than 80% of the CV repeats, per component
stbl.feats.c1 <- names(which(res.perf$features$stable$comp1 > 0.8)) # 92 features
stbl.feats.c2 <- names(which(res.perf$features$stable$comp2 > 0.8)) #  4 features

subset.X.tr <- X.tr[, c(stbl.feats.c1, stbl.feats.c2)]
subset.X.te <- X.te[, c(stbl.feats.c1, stbl.feats.c2)]

# Refit using only the stable features (no keepX: keep all remaining features)
new.plsda <- splsda(subset.X.tr, Y.tr, ncomp = 2)

# Predict on the external data with the stable-only model
new.predictions <- predict(new.plsda, newdata = subset.X.te)
get.confusion_matrix(truth = Y.te, predicted = new.predictions$class$max.dist[, 2])

Thanks very much for your reply @MaxBladen!

Your code does exactly what I was describing. I ran it and I see what you mean: there isn't much of a difference in performance.

The reason I started thinking about this is that my features aren't nearly as stable as the ones in your example. I feel I need to present this in my manuscript: that I can extract 26 features that remain stable after cross-validation and use them to predict the status of the external data. I can't validate my predictions in the external data, but I'll be looking for other features that may or may not correlate with the predicted groups. It's good to know I'm not totally off base.

Thanks again for your help!