Using DIABLO with unmatched samples in one dataset, ideas?

Hi!

I am trying to integrate microbiome relative abundance data, neural gene expression (transcriptomics) data, and behavioral metrics using DIABLO. These data come from a toxicity study where I have 6 treatment groups of adult zebrafish exposed to a neurotoxicant. The microbiome and transcriptomics data are paired (n=8 per treatment); however, the behavioral data (n=36 per treatment) are not paired to the other 'omics datasets. All datasets are related by treatment group, though, so I think it should still be possible to perform the analysis, but in a modified way.

I have thought about aggregating all the datasets to the treatment groups and using those as my samples across all three datasets, but I think the problem here will be too few samples to split into meaningful training and testing sets.

I’ve also thought about just aggregating the behavior data and entering those averages to match with the two paired 'omics datasets. Or perhaps I might subsample the behavioral data down to 8 using applications like Fair Subset to get the best representation of behavior samples per treatment. My concern with either of these approaches is that they may violate important statistical assumptions of DIABLO, since it is designed with paired samples across all datasets in mind.
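For example, here is a rough sketch of the averaging idea in R (all object names are made up, just to illustrate what I mean):

```r
# Hypothetical sketch: average the behavioral metrics within each treatment
# group, then attach the group mean to every paired 'omics sample.
# 'behavior' is a data frame of individual fish with a 'treatment' column;
# 'omics_meta' holds the treatment label of each paired sample (made-up names).
behavior_means <- aggregate(. ~ treatment, data = behavior, FUN = mean)

behavior_matched <- behavior_means[match(omics_meta$treatment,
                                         behavior_means$treatment), -1]
rownames(behavior_matched) <- rownames(omics_meta)  # one row per paired sample
```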

So my question is, have others attempted to use DIABLO with uneven sample sizes? And if so, how would one go about doing so?

Thank you for your time,

Lauren

hi @LaurenG,

You will have to break your analysis into different parts.

  1. Consider first a PLS or DIABLO with just the microbiome and transcriptomics data. That will give you some initial insight into how discriminative the data are before you decide to include the behaviour data (see the code sketch after this list).

  2. As you mentioned, moving forward you will then add the behaviour data by making some strong assumptions or by subsampling for best representation. In all our PLS methods (incl. DIABLO) we do assume we have matched samples, so I think you just need to be very careful in your interpretation when you link behaviour with the other datasets (hence my point 1 above). In the literature we refer to this problem as ‘mosaic’ information, where neither the samples nor the variables match across datasets, and I don’t think anyone has found a solution yet!
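For point 1, a minimal sketch of the two-block analysis with mixOmics (object names are placeholders for your own matrices, with samples in matching rows):

```r
library(mixOmics)

# PLS between microbiome and transcriptomics (exploratory, regression mode)
pls_res <- pls(X = microbiome, Y = transcriptomics, ncomp = 2)

# or a DIABLO-style discriminant analysis on the same two blocks
X <- list(microbiome = microbiome, transcriptomics = transcriptomics)
design <- matrix(0.1, nrow = length(X), ncol = length(X),
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

diablo_res <- block.plsda(X, Y = treatment, ncomp = 2, design = design)
plotIndiv(diablo_res, legend = TRUE)  # how well do treatments separate?
```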

Kim-Anh

Hi!

Thank you so much for the response and guidance! I have previously analyzed the paired transcriptomic and microbiome data using hierarchical all-against-all association testing (GitHub - biobakery/halla) to determine which transcripts have significant associations with particular microbial taxa, so my plan is to filter my data to those features before incorporating the behavior data with DIABLO, which I’m thinking will provide better resolution for biological interpretation. I will also try a PLS with just the transcripts and microbiome, since it will be interesting to see how it compares to the other method. For that earlier analysis I also pre-filtered the datasets using multivariable association with linear models (GitHub - biobakery/Maaslin2: MaAsLin2: Microbiome Multivariate Association with Linear Models) to features that had a correlation across our dosing curve before running the hierarchical all-against-all analysis. So, I’m also thinking I should use the same pre-filtered datasets with the PLS for a fair comparison, or perhaps I should just use the original un-filtered data and see what I get? There are still a few things I need to consider for my specific analysis.
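To make the filtering idea concrete, this is roughly what I have in mind (the HAllA feature lists and object names here are made up):

```r
library(mixOmics)

# Keep only the transcripts and taxa that HAllA flagged as significantly
# associated; 'halla_transcripts' and 'halla_taxa' are hypothetical vectors
# of feature names from that analysis.
transcripts_filt <- transcriptomics[, colnames(transcriptomics) %in% halla_transcripts]
taxa_filt <- microbiome[, colnames(microbiome) %in% halla_taxa]

X <- list(transcriptomics = transcripts_filt,
          microbiome = taxa_filt,
          behavior = behavior_matched)  # behavior averaged as sketched earlier
design <- matrix(0.1, nrow = 3, ncol = 3, dimnames = list(names(X), names(X)))
diag(design) <- 0
diablo_res <- block.plsda(X, Y = treatment, ncomp = 2, design = design)
```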

I was reading a previous post where someone asked about running PLS methods with small sample sizes and skipping the validation part of the modeling, and was wondering whether that is still an appropriate way to analyze data with these methods? If so, I think I would be more comfortable drawing inferences from our data by aggregating the samples to treatment-group means, giving n=6. Plus, with pre-filtering to transcripts and taxa that already have a statistically established relationship, the number of variables should be lower, and hopefully so should any “noise” in the analysis?

Again, thank you so much for the advice and your time!

Lauren

hi @LaurenG,

> So, I’m also thinking I should use the same pre-filtered datasets with the PLS for a fair comparison, or perhaps I should just use the original un-filtered data and see what I get?

It’s up to you, I often try to be ‘consistent’ and have comparable results for a paper.

> I was reading a previous post where someone asked about running PLS methods with small sample sizes and skipping the validation part of the modeling, and was wondering whether that is still an appropriate way to analyze data with these methods?

Thanks for looking through past posts on our (now extensive) discussion forum! (It saves us some time.) The regression methods (PLS, block.pls) remain exploratory, as their validation relies on predicting continuous variables, which is very tricky in this multivariate context. So skipping that part is OK (I think most articles using these methods only investigate the graphical outputs and selected variables, e.g. Infant airway microbiota and topical immune perturbations in the origins of childhood asthma | Nature Communications).
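For example, with a sparse PLS you would stop at the graphical outputs and the variable selection, along these lines (object names are placeholders):

```r
library(mixOmics)

# Exploratory sparse PLS: skip formal validation, inspect the outputs instead
spls_res <- spls(X = microbiome, Y = transcriptomics, ncomp = 2,
                 keepX = c(20, 20), keepY = c(20, 20))

plotIndiv(spls_res)            # sample plots per component
plotVar(spls_res)              # correlation circle plot
selectVar(spls_res, comp = 1)  # variables selected on component 1
```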

If you use a classification method (PLS-DA, DIABLO), then you can use leave-one-out cross-validation to assess performance, as the classification error rate is a much easier metric to work with.
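A minimal sketch of that assessment ('X', 'Y' and 'design' as in the earlier sketch, all placeholder names):

```r
library(mixOmics)

# Leave-one-out cross-validation of the classification performance
diablo_res <- block.plsda(X, Y, ncomp = 2, design = design)
perf_loo <- perf(diablo_res, validation = "loo")

perf_loo$error.rate  # classification error rates per block and component
plot(perf_loo)
```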

Just beware of overfitting in your analysis: if you pre-filter the dataset for promising variables and then run cross-validation, the performance results will be optimistic, since the variables were pre-filtered on the whole dataset to do exactly what you are then evaluating them on.

Kim-Anh