Hi all!
I know this topic has been discussed previously, but I was just hoping to get some insights into whether my approach using DIABLO is justified WRT my data structure.
I don’t actually have “multi-omic” data per se, but rather mRNA expression data for a custom panel of ~120 genes across 4 tissues for each animal.
I’ve set up my analysis treating each tissue type as a separate data “block” for the N-integration.
I have a total of 96 animals, across 2 timepoints (postnatal day [P] 6 and P24), 2 genotypes (WT and TCR) and 2 sexes (for a total of n = 6 per group).
For the purposes of this analysis I want to examine the multi-tissue expression signature that discriminates genotypes.
Given the large variation in gene expression across development, I’ve currently partitioned my data by age and run a separate analysis on each subset, so each analysis contains only ~48 samples.
I’ve seen this analysis performed in the literature with similar sample sizes (40-60), but I want to make sure what I’m doing makes sense. As per previous related posts, I’ve set up the tuning procedure with both LOO CV and Mfold CV, tuning the number of features between 5 and 90% of the number of samples. I’ve also set up a data-driven design matrix based on pairwise PLS regression, which gave a correlation between datasets of roughly 0.58.
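To make the tuning setup concrete, here is a minimal Python sketch of the feature grid and the design matrix described above. It is only an illustration, not the mixOmics code itself. The grid step of 5, the tissue names, and the use of the single value 0.58 for every pair of blocks are assumptions; the zero diagonal follows the usual convention for DIABLO design matrices.

```python
# Illustrative only: the real analysis is run with mixOmics in R.

def keepx_grid(n_samples, n_features, lower=5, upper_frac=0.9, step=5):
    """Candidate numbers of features to keep in a block.

    `step` is an assumed grid spacing; only the bounds are given in the post.
    The upper bound is capped at the number of genes in the panel.
    """
    upper = min(int(upper_frac * n_samples), n_features)
    return list(range(lower, upper + 1, step))


def design_matrix(block_names, weight):
    """Between-block design: `weight` off the diagonal, 0 on the diagonal."""
    return {
        a: {b: (0.0 if a == b else weight) for b in block_names}
        for a in block_names
    }


# Placeholder tissue names (the post does not name the four tissues).
blocks = ["tissue_1", "tissue_2", "tissue_3", "tissue_4"]

grid = keepx_grid(n_samples=48, n_features=120)
design = design_matrix(blocks, weight=0.58)  # assumed: same weight for all pairs

print(grid)
for name in blocks:
    print(name, [design[name][other] for other in blocks])
```

With ~48 samples the upper bound works out to 43 features, so the ~120-gene panel size never becomes the limiting factor in this grid.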
My results actually look quite good for Mfold CV; I’m able to achieve a low error rate (<0.15) with a reasonable number of components (3-4) and features (5-20) chosen for each block across these components. LOO, on the other hand, produces pretty awful results with higher error rates that don’t get below 0.3 in some blocks.
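For reference, the two validation schemes split the data very differently at this sample size. The Python sketch below is a rough illustration only: it assumes a balanced 24 WT / 24 TCR split within one age and 6 folds (neither is stated above), and the stratified assignment is a choice made for the example, not a claim about how mixOmics builds its folds.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to folds so each fold has a similar class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(n_folds)]
    for lab in sorted(by_class):
        members = by_class[lab]
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin across folds.
        for pos, idx in enumerate(members):
            folds[pos % n_folds].append(idx)
    return folds


def loo_folds(labels):
    """Leave-one-out: every sample is its own test fold."""
    return [[i] for i in range(len(labels))]


# Assumed balanced design within one age group.
labels = ["WT"] * 24 + ["TCR"] * 24

mfold = stratified_folds(labels, n_folds=6)  # 6 folds is an assumption
loo = loo_folds(labels)

print("Mfold test-set sizes:", [len(f) for f in mfold])
print("LOO folds:", len(loo), "each with test-set size 1")
```

Under these assumptions each Mfold model is trained on 40 samples and tested on 8 (4 per genotype), while each LOO model is trained on 47 samples and tested on a single animal.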
Using the 3 components from the Mfold tuning and the number of features selected for each block, I ran block.splsda and seem to have achieved very good clustering of my samples by group across data blocks.
Given all the information here, is my approach reasonable? Is there some way to perform an omnibus analysis that “controls” for age, and would this be recommended instead? How should I be interpreting my current results and what additional measures can I take to check their validity?
Thanks in advance!!!