Filtering large data to use with DIABLO

tjwhyte · April 24, 2020, 2:01am

Hi,

I want to use DIABLO with my datasets: an RNAseq dataset, genotyping data, and a DNA methylation array.

I used the Illumina MethylationEPIC array so I have ~800,000 probes that passed QC. As my cohort is n=100 (50 cases/50 controls) this seems quite large for any analysis.

Should I run DIABLO with all the probes, or can I filter them based on those that are differentially methylated?

Your mRNA preprocessing in the draft manuscript mentions removing unannotated transcripts. Should this also be done for unannotated CpG probes, i.e. those where no gene is annotated?

Is there a recommended limit to the size of input data?

Thanks!

kimanh.lecao · April 25, 2020, 1:07am

hi @tjwhyte

We recommend filtering the data for two reasons:

First, if you plan to tune the DIABLO model, it is going to take a long time (potentially R won't even be able to handle the memory!).
Second, these methods aim to mine the data and thus extract what is deemed important / relevant. So we would assume that out of the 800,000 probes that you have, not that many are actually useful to explain your biological system!

We recommend filtering based on the variance across all samples:

var.probes = apply(X, 2, var)  # X is of size number of samples x number of probes
hist(var.probes)  # gives you an idea of the distribution of the variances

If the variance is small across all samples, then those probes are not moving much and can be filtered out. We usually only keep the top 5,000 (max 10,000) most variable features. If you filter based on differential methylation, there is a risk of overfitting: the filter already uses the case/control labels, so it introduces a bias into the analysis, and the DIABLO model might appear to do very well when that performance is in fact over-optimistic. It all depends on your assumptions here.
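To make the "keep the top 5,000" step concrete, here is a minimal sketch of one way to do it in base R. It is only an illustration: the matrix X is simulated as a stand-in for the real methylation data, and n.keep and X.filtered are names chosen for this example (n.keep is set small so the example runs quickly; for real data you would use something like 5000).

```r
# Simulated stand-in for the methylation matrix: samples in rows, probes in columns.
# Replace this with your own data.
set.seed(1)
X <- matrix(rnorm(100 * 2000), nrow = 100, ncol = 2000,
            dimnames = list(NULL, paste0("probe", seq_len(2000))))

n.keep <- 500  # hypothetical value for this toy example; e.g. 5000 for real data

# Variance of each probe across all samples (unsupervised: no class labels used)
var.probes <- apply(X, 2, var)

# Column indices of the n.keep most variable probes
keep <- order(var.probes, decreasing = TRUE)[seq_len(min(n.keep, ncol(X)))]

# Filtered matrix, still samples x probes, to pass on as one DIABLO block
X.filtered <- X[, keep, drop = FALSE]
dim(X.filtered)
```

Because the filter never looks at the case/control labels, it does not introduce the bias described above. With ~800,000 probes, apply() can be slow; matrixStats::colVars(X) should give the same variances faster if that package is installed.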

Re annotation: it all depends on what you would like to interpret post analysis. We removed the unannotated transcripts because we knew we would not be able to annotate them if they ended up being selected. But maybe those would be interesting too.

Size limit: fewer than 5,000 to 10,000 features per data set (I would err on the side of 5,000, at least for a first pass!)

Kim-Anh
