Independent studies with multilevel structure

Hello,

I am relatively new to MixOmics, and I find it to be an exciting and powerful tool for analyzing microbiome data.
I am working with multiple independent studies with microbiome datasets, which also have multilevel structures. I would greatly appreciate your guidance on a few specific questions regarding how best to use MixOmics:

  1. sPLS-DA with Multilevel Structures
    I understand that multilevel sPLS-DA can account for within-individual variation. However:
    Can it also appropriately handle multiple studies and account for study effects as confounders?
    I am considering removing the study-specific batch effects first using sPLS-DA, then applying sPLS-DA to the within-subject variation to account for the multilevel structure.

  2. Data Preparation for Multiple Studies
    When working with datasets from multiple studies:
    Should I merge the data into a single matrix along with the corresponding metadata for analysis?
    Or is it better to analyze each study separately and combine results later?

  3. MINT-sPLS-DA and Multilevel Structures
    I came across the MINT framework in MixOmics, which is designed to integrate multiple independent studies while accounting for study effects. However:
    I have not seen examples in the MixOmics tutorials demonstrating how MINT can handle multilevel structures.
    Does MINT-sPLS-DA support multilevel designs? If so, could you provide guidance or examples for such an analysis?

  4. Normalization with Centered Log-Ratio (CLR)
    For data normalization:
    Should I apply the centered log-ratio (CLR) transformation separately for each study before integration?
    Or should I integrate all studies first and then apply the CLR transformation to the combined dataset?

Thank you

Hi @HAly,

There are different options you can try:

A. MINT has been shown to be well suited to removing unwanted variation directly inside the method, see this study here (you can find a preprint on Research Square). There are other studies that have used MINT for similar purposes, such as this one from Poirier et al.
B. To account for the multilevel aspect, you can first apply the external function withinVariation() to each study separately, then extract the result and use it as input to MINT (see ?withinVariation).
C. In terms of normalisation, yes, I would recommend CLR if you have microbiome data. It is applied sample-wise after filtering the low-count taxa, so it is done study by study (I think we have tried other options; I will ask Saritha to share her experience). The trick for you is that you need the same subset of taxa across all studies before input into MINT.
D. Finally you could try PLS-DA batch if you are interested in correcting for batch effect, but I am not sure it would be efficient (you would still have to go through points C and B).
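
To make the order of operations in points B and C concrete, here is a minimal Python sketch of the two per-study steps: a sample-wise CLR transform followed by removal of each subject's mean. It is an illustration of the idea only, not the mixOmics implementation. The helper names `clr` and `within_subject`, the pseudocount of 1 and the toy counts are all assumptions made for this example; in practice you would use the package's own functions on each study (see ?withinVariation).

```python
import math
from collections import defaultdict

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample (a list of taxa counts).

    The pseudocount (an assumption here) avoids log(0) for absent taxa.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def within_subject(rows, subjects):
    """Subtract each subject's per-taxon mean from that subject's samples."""
    groups = defaultdict(list)
    for i, subject in enumerate(subjects):
        groups[subject].append(i)
    n_taxa = len(rows[0])
    out = [None] * len(rows)
    for idx in groups.values():
        means = [sum(rows[i][j] for i in idx) / len(idx) for j in range(n_taxa)]
        for i in idx:
            out[i] = [rows[i][j] - means[j] for j in range(n_taxa)]
    return out

# Hypothetical toy study: 2 subjects x 2 time points, 3 taxa (already filtered).
study_counts = [[10, 0, 5], [20, 1, 3], [7, 7, 7], [0, 30, 2]]
study_subjects = ["s1", "s1", "s2", "s2"]

# Step C: CLR is applied sample by sample, within this study only.
study_clr = [clr(sample) for sample in study_counts]
# Step B: keep only the within-subject variation, again within this study.
study_within = within_subject(study_clr, study_subjects)
```

Each study would go through these two steps on its own, and the resulting matrices would then be restricted to a common set of taxa before being passed to MINT.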

Kim-Anh

Hi @HAly

When working with datasets from different studies, preprocessing each study separately (like filtering and CLR transformation) tends to work better because it keeps the study-specific differences intact. After that, you can combine the datasets by taking the union of taxa. The downside is that this often creates NA values for taxa that aren’t present in all studies, so you’ll need to either fill in those NAs or use methods that can handle missing data. (This was our preferred approach when we combined five type 2 diabetes datasets, which initially contained 369, 601, 493, 613, and 572 taxa, respectively. After filtering and taking the union of the taxa, we were left with 361 features.)
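
As a rough illustration of the union approach, the Python sketch below stacks per-study matrices over the union of taxa and marks the taxa a study does not contain as missing. The function name `merge_union`, the data layout and the toy values are assumptions made for this example, and how the missing entries are then filled in is left open.

```python
def merge_union(studies):
    """Stack studies over the union of their taxa.

    studies: dict mapping study name -> (list of taxa, list of sample rows).
    Returns (all_taxa, merged_rows, study_labels); taxa absent from a
    study are recorded as None for that study's samples.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in studies.values())))
    merged, labels = [], []
    for name, (taxa, rows) in studies.items():
        column = {taxon: j for j, taxon in enumerate(taxa)}
        for row in rows:
            merged.append([row[column[t]] if t in column else None for t in all_taxa])
            labels.append(name)
    return all_taxa, merged, labels

# Hypothetical, already preprocessed studies (values stand in for CLR data).
toy_studies = {
    "A": (["Bacteroides", "Prevotella"], [[0.5, -0.5], [0.1, -0.1]]),
    "B": (["Prevotella", "Roseburia"], [[0.3, -0.3]]),
}
union_taxa, union_rows, union_labels = merge_union(toy_studies)
n_missing = sum(value is None for row in union_rows for value in row)
```

Counting the `None` entries, as in `n_missing`, gives a quick sense of how much imputation the union would require.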

Another option is to combine the datasets by using only the taxa that overlap across studies (the intersection). This avoids missing values but usually leaves you with fewer taxa, which might limit what you can do in later analyses.
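
The intersection approach can be sketched the same way. The Python snippet below keeps only the taxa found in every study, so the stacked matrix has no missing values. The name `merge_intersection`, the data layout and the toy values are assumptions made for this example.

```python
def merge_intersection(studies):
    """Stack studies over the taxa shared by all of them.

    studies: dict mapping study name -> (list of taxa, list of sample rows).
    Requires at least one study. Returns (shared_taxa, merged_rows, study_labels).
    """
    taxa_sets = [set(taxa) for taxa, _ in studies.values()]
    shared = sorted(set.intersection(*taxa_sets))
    merged, labels = [], []
    for name, (taxa, rows) in studies.items():
        column = {taxon: j for j, taxon in enumerate(taxa)}
        for row in rows:
            merged.append([row[column[t]] for t in shared])
            labels.append(name)
    return shared, merged, labels

# Hypothetical, already preprocessed studies (values stand in for CLR data).
example_studies = {
    "A": (["Bacteroides", "Prevotella", "Roseburia"], [[0.5, -0.2, -0.3]]),
    "B": (["Prevotella", "Roseburia"], [[0.3, -0.3], [0.4, -0.4]]),
}
shared_taxa, shared_rows, shared_labels = merge_intersection(example_studies)
```

The label vector records which study each row came from, which is the kind of per-sample study indicator that MINT needs alongside the data matrix.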

If you merge the datasets first and then preprocess, you’re likely to keep only the taxa that are abundant in most studies. This can also reduce the number of taxa and might miss out on study-specific details.

Saritha

Thank you for all the valuable information! I really appreciate it.
Should I use supervised or unsupervised MINT sPLS-DA in this case?

Hi @HAly,

This will depend on whether you have an outcome of interest that you would like to include in your model. MINT sPLS-DA is a supervised model with a categorical outcome. For more information on which model to use for your analysis, check out our Select your Method guide.

Cheers,
Eva