Using DIABLO Output for ML Training

Hi everyone,

I’m working on a multi-omics data integration project using transcriptomics, proteomics, and methylation data from cancer samples. My main goal is to extract informative features for downstream model training. I plan to build and compare several predictive models using different algorithms.

I’m trying to decide on the best approach for selecting features after running DIABLO (sparse multiblock PLS-DA):

  • Should I use the components obtained during DIABLO training as input features for my downstream models?
  • Or would it be better to extract the omics-specific features (e.g., genes, proteins, CpGs) selected for each component, combine them into a single feature set, and use those as the input for model training?
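To make the two options concrete, here is a rough R sketch of how each feature set could be pulled out of a fitted model. It is only an illustration under stated assumptions: it uses the breast.TCGA example data bundled with mixOmics instead of my own blocks, and the ncomp, keepX and design values are arbitrary placeholders, not tuned choices.

```r
# Illustrative sketch: breast.TCGA stands in for real data, and ncomp, keepX
# and the design value are arbitrary placeholders (assumptions, not tuned).
library(mixOmics)

data(breast.TCGA)
X <- list(mRNA    = breast.TCGA$data.train$mrna,
          miRNA   = breast.TCGA$data.train$mirna,
          protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype

ncomp <- 2
keepX <- lapply(X, function(block) rep(10, ncomp))  # 10 variables per component
design <- matrix(0.1, nrow = length(X), ncol = length(X),
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

model <- block.splsda(X, Y, ncomp = ncomp, keepX = keepX, design = design)

# Option 1: latent components (sample scores) of every block, side by side
component_features <- do.call(cbind, model$variates[names(X)])
colnames(component_features) <- paste(rep(names(X), each = ncomp),
                                      colnames(component_features), sep = "_")

# Option 2: original variables with a non-zero loading on any component
selected_features <- do.call(cbind, lapply(names(X), function(b) {
  loadings_b <- model$loadings[[b]]
  keep <- rownames(loadings_b)[rowSums(loadings_b != 0) > 0]
  X[[b]][, keep, drop = FALSE]
}))

dim(component_features)  # samples x (ncomp * number of blocks)
dim(selected_features)   # samples x number of selected variables
```

Option 2 reads the selection off the sparse loadings; selectVar() gives the same variable names per block and component.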

I’d really appreciate any advice, best practices, or experiences you can share regarding this decision. Thank you!

Hi @Mehrdadameri,

When tuning a DIABLO model, we recommend tuning both 1) the number of components and 2) the number of features to keep on each component for each omics block. You can tune both with the tune() function in mixOmics; see more information here. The most efficient way is to tune the number of components first and then the number of features per component. See our DIABLO case study for more details.
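As a rough sketch of that two-step order, the code below tunes the number of components with perf() on a non-sparse model, then tunes keepX with tune.block.splsda(). It is not a definitive recipe: the bundled breast.TCGA data stands in for your own blocks, and the design value, keepX grid, fold count and repeat count are illustrative assumptions kept small so it runs quickly.

```r
# Illustrative sketch: breast.TCGA stands in for real data; the design value,
# keepX grid, folds and nrepeat are assumptions chosen for speed.
library(mixOmics)

data(breast.TCGA)
X <- list(mRNA    = breast.TCGA$data.train$mrna,
          miRNA   = breast.TCGA$data.train$mirna,
          protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype

design <- matrix(0.1, nrow = length(X), ncol = length(X),
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

set.seed(123)

# Step 1: choose the number of components by cross-validating a model
# without variable selection
initial <- block.plsda(X, Y, ncomp = 5, design = design)
perf_res <- perf(initial, validation = "Mfold", folds = 5, nrepeat = 3)
ncomp <- perf_res$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]

# Step 2: with ncomp fixed, choose how many variables to keep per block
# and per component from a small candidate grid
test_keepX <- lapply(X, function(block) c(5, 10, 20))
tuned <- tune.block.splsda(X, Y, ncomp = ncomp, test.keepX = test_keepX,
                           design = design, validation = "Mfold", folds = 5,
                           nrepeat = 1, dist = "centroids.dist")

# Final model fitted with the tuned settings
final <- block.splsda(X, Y, ncomp = ncomp, keepX = tuned$choice.keepX,
                      design = design)
tuned$choice.keepX
```

On real data you would normally use more folds and repeats and a wider keepX grid, at the cost of longer run time.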

Cheers,
Eva