Selecting method for integrating multiple data

AdrianLG · June 6, 2024, 2:10pm

Hi!
I am struggling a bit with the mixOmics tools. I have a set of samples with three types of data associated: a functional metagenomics profile, a taxonomic metagenomics profile and a matrix of (quantitative) environmental measures. I am trying to find associations between the three matrices, i.e., whether any gene family is correlated with any taxonomic group and/or with the environmental measures. I tested multiblock sPLS and sgcca, but I am not sure which model is more appropriate for my data.

kimanh.lecao · June 13, 2024, 10:31pm

hi @AdrianLG,

Looking at the correlation between the variates across the different data sets from block.spls or sgcca puts you in the right direction.

Also, these two methods differ slightly in their application: one takes a Y matrix as a response (block.spls) and the other does not. That should also guide your analysis.


Kim-Anh

AdrianLG · June 14, 2024, 9:28am

Hi @kimanh.lecao

Thank you for your answer, I suppose my case fits more with sgcca. I have a few more questions regarding the analysis:

1. Following your tutorials, I can see that the model allows calculating the correlation between two variables as the cosine of the angle between their component coordinate vectors. However, the correlation I get is different from the one I calculate from the input data. So what is the utility of this correlation? What is the difference in interpretation compared to simply calculating the pairwise correlations in the raw datasets?
2. Since there is no method for tuning a multi-block sPLS or sgcca (as far as I know, correct me if I’m wrong), I have been trying combinations arbitrarily. How does selecting a higher/lower number of components, or higher/lower values of keepX, impact the model?
3. Since I have no observation groups, plotIndiv seems of little use in my case. Is there a way to use a continuous variable as the group variable to color the observations as a gradient? Or an alternative to plotIndiv? The same applies to circosPlot.
4. In your tutorials, you use plotArrow and circosPlot with a subset of the original model. I was only able to do this by fitting the model again with a smaller keepX. Is there another way?

Thanks in advance!!

kimanh.lecao · June 27, 2024, 10:28pm

hi @AdrianLG,

Following your tutorials, I can see that the model allows calculating the correlation between two variables as the cosine of the angle between their component coordinate vectors. However, the correlation I get is different from the one I calculate from the input data. So what is the utility of this correlation? What is the difference in interpretation compared to simply calculating the pairwise correlations in the raw datasets?

It will be more robust, because a simple pairwise correlation between the raw variables is likely to give you spurious results. You can read: "Visualising associations between paired ‘omics’ data sets" (BioData Mining); this is also covered in Chapter 6 of our book.
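
To illustrate why the two numbers differ, here is a minimal base R sketch on simulated data. It is not mixOmics code: `t1` and `t2` are hypothetical stand-ins for the components of a fitted model, and `gene` and `taxon` are made-up variables. Each variable is placed on the correlation circle by its correlation with each component, and the association is then read from those coordinates rather than from the raw values.

```r
set.seed(1)
n <- 40

# Stand-ins for two model components (assumption: simulated, not from mixOmics).
t1 <- rnorm(n)
# Residuals of a regression on t1, so t2 is uncorrelated with t1 in this sample.
t2 <- residuals(lm(rnorm(n) ~ t1))

# One hypothetical variable per block, both driven by t1 plus independent noise.
gene  <- t1 + rnorm(n)
taxon <- t1 + rnorm(n)

# Coordinates on the correlation circle: correlation with each component.
coords <- rbind(
  gene  = c(cor(gene, t1),  cor(gene, t2)),
  taxon = c(cor(taxon, t1), cor(taxon, t2))
)

# Association read from the components: inner product of the coordinates.
assoc_components <- sum(coords["gene", ] * coords["taxon", ])

# Cosine of the angle between the two coordinate vectors.
cos_angle <- assoc_components /
  (sqrt(sum(coords["gene", ]^2)) * sqrt(sum(coords["taxon", ]^2)))

# Association computed directly on the raw variables.
assoc_raw <- cor(gene, taxon)

print(c(components = assoc_components, cosine = cos_angle, raw = assoc_raw))
```

The component-based value only uses the part of each variable that the components capture, so it will not in general equal the raw Pearson correlation, which also includes whatever the noise terms happen to share.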

Since there is no method for tuning a multi-block sPLS or sgcca (as far as I know, correct me if I’m wrong), I have been trying combinations arbitrarily. How does selecting a higher/lower number of components, or higher/lower values of keepX, impact the model?

You are correct, we have not implemented tuning functions. Assuming you don’t go too far in your number of components or keepX, the impact is to potentially include some spurious associations, or to make interpretation more difficult. As you say below, your approach seems to be relatively exploratory, so try to focus your interpretation on the top associations.

Since I have no observation groups, plotIndiv seems of little use in my case. Is there a way to use a continuous variable as the group variable to color the observations as a gradient? Or an alternative to plotIndiv? The same applies to circosPlot.

You can use plotIndiv() with style = ‘graphics’ or ‘lattice’, which will give you a bit more freedom to color samples according to a gradient (this is not provided in the package). We do not have a circosPlot option for sgcca.
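
One possible workaround, sketched here in base R only, is to build one color per sample from the continuous variable and draw the sample coordinates yourself. Everything in this example is an assumption for illustration: `variates` is simulated and stands in for one block's sample coordinates (in a fitted model these would typically sit under `model$variates`, which is worth confirming with `str(model)`), and `temperature` is a hypothetical covariate.

```r
set.seed(2)
n <- 30

# Stand-in for one block's sample coordinates on two components (simulated).
variates <- cbind(comp1 = rnorm(n), comp2 = rnorm(n))

# Hypothetical continuous covariate, one value per sample.
temperature <- runif(n, min = 5, max = 25)

# Map the covariate onto a 100-step blue-to-red palette.
n_steps <- 100
palette_fun <- colorRampPalette(c("blue", "red"))
bins <- cut(temperature, breaks = n_steps, labels = FALSE)  # integer codes 1..100
sample_cols <- palette_fun(n_steps)[bins]

# Scatter plot of the samples, colored along the gradient.
plot(variates[, "comp1"], variates[, "comp2"],
     col = sample_cols, pch = 19,
     xlab = "Component 1", ylab = "Component 2")
```

The lowest covariate value gets the first palette color and the highest gets the last, so the gradient always spans the observed range of the variable.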

In your tutorials, you use plotArrow and circosPlot with a subset of the original model. I was only able to do this by fitting the model again with a smaller keepX. Is there another way?

We use plotArrow to compare the embeddings from the different data sets that are integrated, not to compare different models. One way would be to calculate the correlation between the variates of the different models and see whether there is some improvement (but that can be a bit crude).
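
A rough base R sketch of that comparison, under stated assumptions: `variates_full` and `variates_sparse` are simulated stand-ins for the same block's variates from two fits (for example a larger and a smaller keepX), with samples in the same row order.

```r
set.seed(3)
n <- 30

# Stand-ins for one block's variates from two different fits (simulated).
variates_full <- cbind(comp1 = rnorm(n), comp2 = rnorm(n))
variates_sparse <- variates_full + matrix(rnorm(n * 2, sd = 0.3), nrow = n, ncol = 2)

# Correlate matching components between the two fits.
# abs() is used because the sign of a component is arbitrary.
agreement <- abs(diag(cor(variates_full, variates_sparse)))
names(agreement) <- colnames(variates_full)

print(round(agreement, 2))
```

Values close to 1 suggest the two fits place the samples similarly on that component; lower values suggest that changing the parameters changed the embedding.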

Kim-Anh
