Integrating multiple layers of (really) highly dimensional data

I need to solve a multiomic challenge. I have access to a precious collection of cross-sectional samples (N=160) from two body sites (stool and blood), and I want to generate as much information from them as possible. This means applying several omic tools to create six data layers from the participants' stool microbiome and blood, resulting in thousands of variables per sample.

I have underlying causality assumptions. The central one is that microbiome-host interactions drive a unique clinical phenotype. More specifically, some bacterial genes (level 1) are translated into bacterial proteins (level 2) that generate key metabolites (level 3). These proteins and metabolites interact with some host immune cells (level 4), inducing gene expression (level 5) that is translated into human proteins (level 6) and metabolites (level 7), driving the phenotype of interest.

The sample size limits the use of unsupervised approaches. I was thinking of using dimension reduction techniques at each level for feature selection, and then fitting structural equation models based on the path assumptions. I wonder whether DIABLO would be a better tool for this analytical challenge.

Any thoughts will be appreciated.

I can’t say I’ve ever used a framework similar to yours, so I’m not sure I’m the best person to ask.

DIABLO might be useful here if you want to pass all your datasets to the method at once. However, this is likely to take an extremely long time.

I might recommend exploring PLS in regression mode, but doing this for each ‘step’. E.g., take all features from level 1 (bacterial genes) as X and level 2 (bacterial proteins) as Y. Run an sPLS on this X and Y and extract all the relevant information you want (look at the loadings primarily). Then use level 2 (bacterial proteins) as X and level 3 (metabolites) as Y and repeat. Continue until you have worked through your entire interaction pathway.
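To make the step-by-step idea concrete, here is a rough Python sketch of chaining a sparse PLS across consecutive layers. It is only an illustration under several assumptions: it uses a simplified one-component sparse PLS (soft-thresholding the loading vectors of the X'Y cross-product matrix), not the mixOmics implementation; the data are synthetic, with one shared signal planted in the first three features of each layer; and all names (`soft_threshold`, `spls_first_component`, `layers`, `keep_x`, `keep_y`) are made up for this example. With a single component there is no deflation step, so the regression versus canonical distinction does not come into play here.

```python
import numpy as np


def soft_threshold(a, keep):
    """Shrink `a` so only its `keep` largest-magnitude entries stay non-zero."""
    if keep >= a.size:
        return a
    cutoff = np.sort(np.abs(a))[-(keep + 1)]  # (keep+1)-th largest magnitude
    return np.sign(a) * np.maximum(np.abs(a) - cutoff, 0.0)


def spls_first_component(X, Y, keep_x, keep_y, n_iter=100, tol=1e-8):
    """Sparse loading vectors (u for X, v for Y) of a one-component sparse PLS."""
    # Centre and scale every feature (assumes no constant columns).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    M = X.T @ Y  # cross-covariance up to a constant factor
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]  # dense starting point
    for _ in range(n_iter):
        u_new = soft_threshold(M @ v, keep_x)
        u_new = u_new / np.linalg.norm(u_new)
        v = soft_threshold(M.T @ u_new, keep_y)
        v = v / np.linalg.norm(v)
        converged = np.linalg.norm(u_new - u) < tol
        u = u_new
        if converged:
            break
    return u, v


# Synthetic stand-in for three consecutive layers (hypothetical sizes).
rng = np.random.default_rng(0)
n = 160
z = rng.normal(size=n)  # shared latent signal
sizes = {"bacterial genes": 20, "bacterial proteins": 15, "metabolites": 10}
layers = {}
for name, p in sizes.items():
    data = rng.normal(size=(n, p))
    data[:, :3] += 3.0 * z[:, None]  # plant the signal in features 0-2
    layers[name] = data

# Walk the pathway: each layer is X for the next layer as Y.
names = list(layers)
selected = {}
for x_name, y_name in zip(names, names[1:]):
    u, v = spls_first_component(layers[x_name], layers[y_name], keep_x=3, keep_y=3)
    selected[(x_name, y_name)] = (np.flatnonzero(u), np.flatnonzero(v))
    print(f"{x_name} -> {y_name}: X features {selected[(x_name, y_name)][0]}, "
          f"Y features {selected[(x_name, y_name)][1]}")
```

The non-zero entries of `u` and `v` play the role of the loadings you would inspect at each step. On real data, the number of features to keep per layer would need tuning rather than being fixed at 3, and you would probably want more than one component.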