I am working with long-read methylation data covering approximately 12 million CpG sites. The mixOmics guidelines recommend using at most around 10,000 features.
As a first pass, I removed the CpG sites in the bottom 5% by variance and excluded sites whose mean beta value is close to zero or one. Even after these steps, a very large number of CpG sites remain (around 7 million).
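To make the filtering above concrete, here is a minimal sketch in Python with NumPy. It is illustrative only: the function name `filter_cpgs`, the 0.05/0.95 cutoffs for the mean beta value, and the simulated matrix are assumptions for the example, not my exact pipeline or anything provided by mixOmics.

```python
import numpy as np


def filter_cpgs(beta, var_quantile=0.05, mean_low=0.05, mean_high=0.95):
    """Return a boolean mask of CpG sites to keep.

    beta: 2-D array (samples x CpG sites) of beta values in [0, 1],
          with NaN allowed for missing calls.
    var_quantile: fraction of lowest-variance sites to drop (assumed 5%).
    mean_low / mean_high: hypothetical cutoffs for "close to zero or one".
    """
    means = np.nanmean(beta, axis=0)
    variances = np.nanvar(beta, axis=0)

    # Variance value below which the bottom `var_quantile` of sites fall.
    var_cutoff = np.nanquantile(variances, var_quantile)

    # Keep sites that are variable enough and not near-constant at 0 or 1.
    return (variances > var_cutoff) & (means > mean_low) & (means < mean_high)


# Simulated example: 20 samples x 1000 CpG sites (random, not real data).
rng = np.random.default_rng(0)
beta = rng.beta(0.5, 0.5, size=(20, 1000))

keep = filter_cpgs(beta)
filtered = beta[:, keep]
print(f"kept {keep.sum()} of {beta.shape[1]} sites")
```

Because both filters only trim the extremes, they remove a fairly small fraction of sites by construction, which matches what I see on the real data.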
Running mixOmics on this many features is extremely time-consuming and gives unstable results.
Do you have any recommendations for more effective filtering strategies to reduce the number of features to a manageable size?
Thank you very much for your help.