We recommend filtering the data for two reasons:
- first, if you plan to tune the DIABLO model, it is going to take a long time on the full data (R may not even be able to handle the memory!)
- second, these methods aim to mine the data and extract what is deemed important / relevant. Out of the 800,000 probes that you have, we would expect only a small fraction to actually be useful for explaining your biological system!
We recommend filtering based on the variance of each probe across all samples:
var.probes = apply(X, 2, var)  # X is of size (number of samples) x (number of probes)
hist(var.probes)  # gives you an idea of the distribution of the variances
If the variance is small across all samples, then those probes are not moving much and can be filtered out. We usually only keep the top 5,000 (max 10,000) most variable features. If you filter based on differential methylation instead, there is a risk of overfitting: the outcome has already been used to choose the features, which biases the analysis, so the DIABLO model might appear to do very well but the performance would be over-optimistic. It all depends on your assumptions here.
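As a rough sketch of the "keep the most variable probes" step, the snippet below ranks probes by variance and subsets the matrix. The simulated `X`, the cutoff name `n.keep`, and the toy value of 50 are illustrative assumptions, not part of any mixOmics function; on real data you would use your own `X` and a cutoff around 5,000.

```r
# Toy stand-in for a methylation matrix: samples in rows, probes in columns.
# (Simulated data, for illustration only.)
set.seed(1)
X <- matrix(rnorm(20 * 200), nrow = 20,
            dimnames = list(paste0("sample", 1:20), paste0("probe", 1:200)))

# Hypothetical cutoff for the toy data; for real data, something like 5000.
n.keep <- 50

# Per-probe variance across all samples.
var.probes <- apply(X, 2, var)

# Column indices of the n.keep probes with the largest variance.
keep <- order(var.probes, decreasing = TRUE)[seq_len(min(n.keep, ncol(X)))]

# Subset the matrix; drop = FALSE keeps it a matrix even if one probe remains.
X.filtered <- X[, keep, drop = FALSE]

dim(X.filtered)  # 20 samples x 50 probes
```

Because this filter never looks at the outcome, it does not introduce the selection bias described above. For very large matrices, `apply()` can be slow; a vectorised variance function from an add-on package may be worth considering, though that is outside what this sketch covers.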
Re annotation: it all depends on what you would like to interpret after the analysis. We removed the unannotated probes because we knew we would not be able to interpret them if they ended up being selected. But maybe those would be interesting too.
Size limit: fewer than 5,000 to 10,000 features per data set (I would err on the side of 5,000, at least for a first pass!)