Sample size (again)

mirG · August 11, 2023, 5:05pm

Hi all!

I know this topic has been discussed previously, but I was just hoping to get some insights into whether my approach using DIABLO is justified WRT my data structure.

I don’t actually have “multi-omic” data per se, but rather mRNA expression data for a custom panel of ~120 genes across 4 tissues for each animal.
I’ve set up my analysis treating each tissue type as a separate data “block” for the N-integration.
I have a total of 96 animals, across 2 timepoints (postnatal day [P] 6 and P24), 2 genotypes (WT and TCR) and 2 sexes (for a total of n = 6 per group).

For the purposes of this analysis I want to examine the multi-tissue expression signature that discriminates genotypes.
Given the large variation in gene expression across development, I’ve currently partitioned my data according to age and conducted separate analyses on each, so each analysis only contains ~48 samples.
I’ve seen the analysis performed in the literature with a similar sample size (40-60), but I want to make sure what I’m doing makes sense. As per previous related posts, I’ve tried setting up the tuning procedure using both LOO CV and Mfold CV, and am tuning the number of features between 5 and 90% of the number of samples. I’ve also set up a data-driven design matrix based on pairwise PLS regressions, which gave a correlation between datasets of roughly 0.58.
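For readers following along, here is a rough sketch of what the tuning grid and design matrix described above might look like. It is plain Python for illustration, not mixOmics code. It assumes the "5" means five features (rather than 5% of samples), and the grid spacing and block names are hypothetical; only the sample size, panel size and 0.58 correlation come from the post.

```python
# Illustrative sketch only: plain Python, not mixOmics code.
n_samples = 48   # samples in one within-age analysis (from the post)
n_genes = 120    # approximate panel size per tissue block (from the post)

# Candidate numbers of features per block: from 5 up to 90% of the samples,
# capped at the number of genes available in a block.
upper = min(int(0.9 * n_samples), n_genes)
step = 5         # hypothetical grid spacing
keep_grid = list(range(5, upper + 1, step))

# Design matrix: zeros on the diagonal, the estimated between-block
# correlation everywhere else.
blocks = ["tissue_1", "tissue_2", "tissue_3", "tissue_4"]  # hypothetical names
rho = 0.58       # pairwise PLS correlation reported in the post
design = [
    [0.0 if i == j else rho for j in range(len(blocks))]
    for i in range(len(blocks))
]

print(keep_grid)
for name, row in zip(blocks, design):
    print(name, row)
```

A coarser or finer grid is equally plausible; the point is only that the largest candidate stays below the number of samples in each analysis.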

My results actually look quite good for Mfold CV; I’m able to achieve a low error rate (<0.15) with a reasonable number of components (3-4) and features (5-20) chosen for each block across these components. LOO, on the other hand, produces pretty awful results with higher error rates that don’t get below 0.3 in some blocks.

Using the 3 components from the Mfold tuning and the selected number of features, I ran the block.splsda function and seem to have achieved very good clustering of my samples by group across data blocks.

Given all the information here, is my approach reasonable? Is there some way to perform an omnibus analysis that “controls” for age, and would this be recommended instead? How should I be interpreting my current results and what additional measures can I take to check their validity?

Thanks in advance!!!

kimanh.lecao · August 18, 2023, 12:54am

hi @mirG,

Yes, your approach regarding Mfold CV is appropriate. It does not make sense to use LOO-CV when you have more than 10 samples, and usually LOO-CV is pretty biased. Make sure you use a few repeats for your Mfold CV.
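To make "a few repeats" concrete, here is a minimal sketch of repeated, class-balanced M-fold partitioning. It is plain Python for illustration and is not how mixOmics implements it internally. The 24/24 genotype split, 5 folds and 10 repeats are assumptions chosen for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, rng):
    """Split sample indices into n_folds folds, balancing classes across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)              # new random order for each repeat
        for pos, idx in enumerate(members):
            folds[pos % n_folds].append(idx)
    return folds


def repeated_mfold(labels, n_folds=5, n_repeats=10, seed=1):
    """Return one independent stratified partition per repeat."""
    rng = random.Random(seed)
    return [stratified_folds(labels, n_folds, rng) for _ in range(n_repeats)]


# Assumed layout: 48 samples in one age group, 24 per genotype.
labels = ["WT"] * 24 + ["TCR"] * 24
partitions = repeated_mfold(labels)

for fold in partitions[0]:
    n_wt = sum(labels[i] == "WT" for i in fold)
    print(len(fold), "samples held out,", n_wt, "WT")
```

Each repeat reshuffles the samples, so the error rate can be averaged over several different partitions. LOO has only one possible partition, so repeating it adds nothing.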

Regarding the timepoints, you can use the withinVariation() function to extract a matrix that would take into account age. However, I am currently in discussion with some users where, in the case of two time points, the outputs seem to ‘oppose’ the time points (see previous ‘multilevel’ posts). You can follow those steps and first assess if there is a time effect with the two time points. If not, maybe you don’t need to worry about it and you can complement your analyses with some linear mixed models, one variable at a time.

Kim-Anh

mirG · August 18, 2023, 2:20pm

Hi,

thanks so much for your reply!

I should clarify that my samples are not repeated measures, but rather two independent groups, and thus Age is more of an ordered categorical variable. In this case, is a multilevel approach still appropriate?
I can tell there is quite a significant age effect purely based on PCA and the output of an attempted omnibus PLS-DA…in both cases the samples cluster strongly by age even without providing that information to the latter model.
Given the effect of age is so strong (much stronger than genotype, which was expected), is there anything more I can do to validate within-timepoint models (~50 samples each)? Or, would you still recommend the omnibus approach (100 samples) even if the Age-related variation within each genotype is large?

Thanks so much again

Miranda

kimanh.lecao · August 31, 2023, 11:28pm

Hi @mirG,

If you do not have repeated measures on the same individuals, then no, it does not make sense to use a multilevel approach.
I’d recommend you do both analyses: one with all samples, showing the strong age effect, and then one per age group to dig further into your analysis.

Kim-Anh
