G’day!
The pipeline you have described will yield what you want. I was able to create a basic version of it with the code below, which you can use as a point of comparison to verify that your methodology is correct.
I believe I understand your rationale: the most stable features are those which are consistently useful for a model built on your data, so we can reduce the complexity of the model while still retaining maximum predictive ability.
However, by discarding the slightly less stable features you might actually be hindering your model. Consider this scenario: by pure chance, more than 20% of the repeats used training/testing splits in which a generally useful feature had its discriminatory power reduced. That feature would then fall below your 0.8 stability cutoff and be dropped, even though it is informative on the large majority of splits (the short sketch within the code below shows how to spot features sitting just under the cutoff).
At the end of the day, we can look at the classification metrics to assess whether it's worth it. Does the new model using the stable subset of selected features improve performance? If so, to what degree? If it improves things only negligibly (or makes the model non-negligibly worse), then I must ask whether the reduction in model complexity is worth it. A quick way to quantify that comparison is sketched at the end of the code below.
Looking forward to your response.
library(mixOmics)

## Load the example breast cancer data and set up training/testing sets
data(breast.TCGA)
X.tr <- breast.TCGA$data.train$mirna
X.te <- breast.TCGA$data.test$mirna
Y.tr <- breast.TCGA$data.train$subtype
Y.te <- breast.TCGA$data.test$subtype

## Initial sPLS-DA model, keeping 100 and 10 features on components 1 and 2
res.splsda <- splsda(X.tr, Y.tr, ncomp = 2, keepX = c(100, 10))

## Predict the test samples and check the confusion matrix (using both components)
predictions <- predict(res.splsda, newdata = X.te)
get.confusion_matrix(truth = Y.te, predicted = predictions$class$max.dist[, 2])

## Repeated cross-validation to estimate feature stability
## (nrepeat = 3 keeps this quick; use many more repeats for reliable stability estimates)
res.perf <- perf(res.splsda, validation = 'Mfold', folds = 5, nrepeat = 3)
## Keep only the features selected in more than 80% of the CV repeats on each component
stbl.feats.c1 <- names(which(res.perf$features$stable$comp1 > 0.8)) # 92 features
stbl.feats.c2 <- names(which(res.perf$features$stable$comp2 > 0.8)) # 4 features
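
## Related to the scenario described above: you can also inspect the features that
## just missed the 0.8 cutoff. This is only a sketch; the 0.6 lower bound is an
## arbitrary illustrative window, not a recommendation.
stab.c1 <- res.perf$features$stable$comp1
sort(stab.c1[stab.c1 > 0.6 & stab.c1 <= 0.8], decreasing = TRUE)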

## Subset both datasets to the stable features
## (unique() guards against a feature being stable on both components)
subset.X.tr <- breast.TCGA$data.train$mirna[, unique(c(stbl.feats.c1, stbl.feats.c2))]
subset.X.te <- breast.TCGA$data.test$mirna[, unique(c(stbl.feats.c1, stbl.feats.c2))]

## Refit on the stable subset only (no keepX, so all remaining features are used)
new.plsda <- splsda(subset.X.tr, Y.tr, ncomp = 2)
new.predictions <- predict(new.plsda, newdata = subset.X.te)
get.confusion_matrix(truth = Y.te, predicted = new.predictions$class$max.dist[, 2])
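
To put a number on the comparison I mentioned above, mixOmics also provides get.BER(), which computes the balanced error rate directly from a confusion matrix. A minimal sketch, assuming the objects created above:

## Balanced error rate: original model vs. stable-subset model (lower is better)
cm.full <- get.confusion_matrix(truth = Y.te, predicted = predictions$class$max.dist[, 2])
cm.stable <- get.confusion_matrix(truth = Y.te, predicted = new.predictions$class$max.dist[, 2])
get.BER(cm.full)
get.BER(cm.stable)

If the stable-subset model's BER is not meaningfully lower (or is higher), that would be the signal that the extra filtering step isn't buying you much.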