Sample size (again)

Hi all!

I know this topic has been discussed previously, but I was just hoping to get some insights into whether my approach using DIABLO is justified WRT my data structure.

I don’t actually have “multi-omic” data per se, but rather mRNA expression data for a custom panel of ~120 genes across 4 tissues for each animal.
I’ve set up my analysis treating each tissue type as a separate data “block” for the N-integration.
I have a total of 96 animals, across 2 timepoints (postnatal day [P] 6 and P24), 2 genotypes (WT and TCR) and 2 sexes (for a total of n = 6 per group).

For the purposes of this analysis I want to examine the multi-tissue expression signature that discriminates genotypes.
Given the large variation in gene expression across development, I’ve currently partitioned my data according to age and conducted separate analyses on each, thus each analysis only contains ~48 samples.
I’ve seen the analysis performed in the literature with a similar sample size (40-60), but I want to make sure what I’m doing makes sense. As per previous related posts, I’ve tried setting up the tuning procedure using both LOO CV and Mfold CV, and I am tuning the number of features between 5 and 90% of the number of samples. I’ve also set up a data-driven design matrix based on pairwise PLS regression, which gave a correlation between datasets of roughly 0.58.
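To make the tuning setup described above concrete, here is a minimal sketch in plain Python (not mixOmics code). It only illustrates the two inputs mentioned: a grid of candidate feature counts and a block-by-block design matrix. The tissue names, the grid step of 5, the reading of the range as "5 features up to 90% of the sample count", and the helper names are all assumptions made for illustration; 48 and 0.58 are the approximate values quoted in the post. The zero diagonal follows the convention usually shown in mixOmics examples.

```python
def feature_grid(n_samples, lower=5, upper_fraction=0.9, step=5):
    """Candidate numbers of features to keep, from `lower` up to a fraction of n."""
    upper = int(n_samples * upper_fraction)  # e.g. 90% of 48 samples -> 43
    return list(range(lower, upper + 1, step))


def design_matrix(blocks, off_diagonal):
    """Square block-by-block matrix: 0 on the diagonal, one shared weight elsewhere."""
    size = len(blocks)
    return [[0.0 if i == j else off_diagonal for j in range(size)]
            for i in range(size)]


# Hypothetical block names: one block per tissue.
blocks = ["tissue_A", "tissue_B", "tissue_C", "tissue_D"]

grid = feature_grid(48)               # candidate feature counts per block
design = design_matrix(blocks, 0.58)  # weight taken from the pairwise PLS correlation

print(grid)
for row in design:
    print(row)
```

In an actual analysis these two objects would correspond to what is passed to the tuning function as the candidate feature counts and the design; the sketch only shows their shape.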

My results actually look quite good for Mfold CV; I’m able to achieve a low error rate (<0.15) with a reasonable number of components (3-4) and features (5-20) chosen for each block across these components. LOO, on the other hand, produces pretty awful results with higher error rates that don’t get below 0.3 in some blocks.

Using the 3 components from the Mfold tuning and the number of features it selected, I ran the block.splsda function and seem to have achieved very good clustering of my samples by group across data blocks.

Given all the information here, is my approach reasonable? Is there some way to perform an omnibus analysis that “controls” for age, and would this be recommended instead? How should I be interpreting my current results and what additional measures can I take to check their validity?

Thanks in advance!!!

hi @mirG,

Yes, your approach regarding Mfold CV is appropriate. It does not make sense to use LOO-CV when you have more than 10 samples, and usually LOO-CV is pretty biased. Make sure you use a few repeats for your Mfold CV.
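As a rough illustration of what "a few repeats" of Mfold CV means, here is a minimal sketch in plain Python (not the mixOmics implementation). It assumes 24 samples per genotype, 5 folds and 10 repeats, and it stratifies folds by class; all of these, along with the function names and the seed, are assumptions for illustration. Each repeat reshuffles the samples, so the error rate can be averaged over different partitions instead of relying on a single split.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, rng):
    """Split sample indices into n_folds folds, balancing classes across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def repeated_mfold(labels, n_folds=5, n_repeats=10, seed=1):
    """Return one list of folds per repeat, each from a fresh shuffle."""
    rng = random.Random(seed)
    return [stratified_folds(labels, n_folds, rng) for _ in range(n_repeats)]


# Assumed layout: 24 WT and 24 TCR samples within one age group.
labels = ["WT"] * 24 + ["TCR"] * 24
splits = repeated_mfold(labels)

for fold in splits[0]:
    n_wt = sum(labels[i] == "WT" for i in fold)
    print(len(fold), "samples held out,", n_wt, "WT")
```

With these assumed numbers every held-out fold contains 8 to 10 samples, whereas LOO would hold out a single sample 48 times with no reshuffling to average over.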

Regarding the timepoints, you can use the withinVariation() function to extract a matrix that takes age into account. However, I am currently in discussion with some users where, in the case of two time points, the output seems to ‘oppose’ the time points (see previous ‘multilevel’ posts). You can follow those steps and first assess whether there is a time effect between the two time points. If not, maybe you don’t need to worry about it, and you can complement your analyses with some linear mixed models, one variable at a time.




thanks so much for your reply!

I should clarify that my samples are not repeated measures, but rather two independent groups, and thus Age is more of an ordered categorical variable. In this case, is a multilevel approach still appropriate?
I can tell there is quite a significant age effect purely based on PCA and the output of an attempted omnibus PLS-DA…in both cases the samples cluster strongly by age even without providing that information to the latter model.
Given the effect of age is so strong (much stronger than genotype, which was expected), is there anything more I can do to validate within-timepoint models (~50 samples each)? Or, would you still recommend the omnibus approach (100 samples) even if the Age-related variation within each genotype is large?

Thanks so much again!


Hi @mirG,

If you do not have repeated measures on the same individuals, then no, it does not make sense to use a multilevel approach.
I’d recommend you do both analyses: one with all samples, showing the strong age effect, and then one per age group to dig further into your analysis.

