Pre-filtering and binary data

Hi, thanks for these cool tools and this open community. We are running a project where the first strategy is to integrate multi-omics data. As we have worked with DIABLO, some doubts have arisen. I hope you can help us better understand and apply this magnificent method.

  1. From what I understand, DIABLO can handle large, high-dimensional omics data, and its computation includes a penalty step. In some comments on this forum I read that starting with too many predictors, even with an internal pre-filtering step, could cause it to break down. In general, do you recommend performing feature selection before analyzing the data with DIABLO, for example by pre-filtering the data with a lasso regression or by selecting the most variable genes according to the mean absolute deviation? Or would it be better to analyze all of the data (i.e. an expression matrix with low-count genes removed) without any additional filter, even though problems such as long runtimes might appear?

  2. We are interested in integrating mutations and copy number alterations along with RNA expression. But since these data types are binary, they are difficult to handle with traditional statistical approaches that assume normality, like PLS. As a contingency strategy, we pre-filtered the mutation and SCNA data with the nearZeroVar function. Do you have any other suggestions for managing this type of data?

We would appreciate your comments.

Greetings

Regarding your first question, DIABLO is a sparse method. This means you can tell it directly, via the keepX parameter, how many of the best features to use for a given component on a given block of data. You can tune this value with perf() and tune.block.splsda(). You therefore don't need any pre-filtering to reduce your model's complexity. Please read here.
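To make the keepX idea concrete, here is a minimal sketch in plain Python of how a sparse method can keep only a fixed number of features on one component by soft-thresholding a loading vector. This is an illustration only, not the mixOmics code: the function name keep_top_x and the example loadings are hypothetical, and the real method works per block and per component inside an iterative algorithm.

```python
def keep_top_x(loadings, keep_x):
    """Soft-threshold a loading vector so at most keep_x entries stay non-zero.

    Hypothetical helper for illustration; not part of mixOmics.
    """
    if keep_x <= 0:
        return [0.0] * len(loadings)
    if keep_x >= len(loadings):
        return list(loadings)
    # Use the (keep_x + 1)-th largest absolute loading as the penalty,
    # so only strictly larger loadings survive.
    ranked = sorted((abs(v) for v in loadings), reverse=True)
    threshold = ranked[keep_x]
    sparse = []
    for v in loadings:
        if abs(v) > threshold:
            sign = 1.0 if v > 0 else -1.0
            sparse.append(sign * (abs(v) - threshold))  # shrink towards zero
        else:
            sparse.append(0.0)  # feature is dropped from this component
    return sparse


# Made-up loadings for five features on one component.
example_loadings = [0.5, -0.1, 0.3, -0.7, 0.05]
sparse_loadings = keep_top_x(example_loadings, keep_x=2)
selected = [i for i, v in enumerate(sparse_loadings) if v != 0.0]
print(selected)  # indices of the features that were kept
```

With tied loadings fewer than keep_x features can survive, which is why the docstring says "at most".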

nearZeroVar() is definitely a good place to start in that context. Although your data are not microbial, the mixMC methodology may be applicable. It accounts for sparse data frames (those with many 0s). I'll have a bit more of a think about what to do with binary input data.
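For intuition about what a near-zero-variance filter does to binary mutation or SCNA calls, here is a simplified sketch in plain Python. It is not the nearZeroVar() implementation: the function name near_zero_var, the frequency-ratio and percent-unique criteria, and the default cutoffs (95/5 and 10 percent) are assumptions chosen to resemble that kind of filter, and the data are made up.

```python
from collections import Counter


def near_zero_var(columns, freq_cut=95 / 5, unique_cut=10.0):
    """Return indices of features flagged as zero or near-zero variance.

    columns: list of features, each a list of values across samples.
    Hypothetical helper; cutoffs are assumed defaults, not taken from mixOmics.
    """
    flagged = []
    for idx, values in enumerate(columns):
        counts = Counter(values).most_common()
        if len(counts) == 1:
            flagged.append(idx)  # constant feature: zero variance
            continue
        # Ratio of the most common value to the second most common value.
        freq_ratio = counts[0][1] / counts[1][1]
        # Percentage of distinct values relative to the number of samples.
        percent_unique = 100.0 * len(counts) / len(values)
        if freq_ratio > freq_cut and percent_unique <= unique_cut:
            flagged.append(idx)
    return flagged


# Made-up binary calls for 100 samples and three genes.
rare_mutation = [1] + [0] * 99          # mutated in 1 of 100 samples
common_mutation = [1] * 40 + [0] * 60   # mutated in 40 of 100 samples
never_mutated = [0] * 100
print(near_zero_var([rare_mutation, common_mutation, never_mutated]))
```

Features that are mutated in only a handful of samples are the ones such a filter removes, so it is worth checking how many binary features remain afterwards.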

Hope this helped!