My background is in biology, so even after reading some articles and the package documentation, some concepts are tough to understand. I would like to ask some questions that started with the wrapper.sgcca() function but escalated to other doubts (which I think are somehow related):
(i) Regarding the wrapper.sgcca(), if I apply no penalty (no variable selection) and input X with only two blocks, will the function behave as a “regular” CCA (cancor)?
(ii) I read that most (or all?) mixOmics algorithms are based on PLS. Is wrapper.sgcca then more similar to CCA (cancor) or to PLS (canonical mode)? I mean, is wrapper.sgcca maximizing correlation (CCA) or covariance (PLS)?
(iii) When we use PLS in the canonical mode, is it maximizing correlation instead of covariance?
(iv) Quoting one of your articles: “The values in the similarity matrix can be seen as a robust approximation of the Pearson correlation”. Can the correlation values that I see after running the network() function then be interpreted as the R value from Pearson? Can I “R square” them?
I am sorry for so many questions, but I am using the mixOmics package in one of my articles and want to be able to interpret the results properly and at least briefly explain why I chose a particular algorithm instead of the classical x or y.
No, it won’t. The two use different algorithmic implementations with differing parameters.
The wrapper.sgcca() function uses a modified version of the sGCCA algorithm from the RGCCA package (a paper can be found here). It can ultimately be thought of as more akin to CCA than to PLS, as it does seek to maximise correlations between pairs of components.
Not quite. The different PLS modes relate to the way in which the Y matrix is deflated. In “regression” mode, Y is deflated based on information from X (as X is being used as a predictor). In “canonical” mode, Y is deflated based on information from Y, such that X and Y are treated symmetrically. The mode takes its name from CCA, which likewise treats the two dataframes symmetrically rather than using one as a predictor. Even in “canonical” mode, PLS is still maximising covariance.
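To make the covariance-versus-correlation distinction a bit more concrete, here is a minimal sketch in plain Python (not mixOmics code). The scores `t` and `u` are made-up numbers standing in for a pair of components, and the helper functions are hypothetical names written just for this illustration. It shows that covariance equals correlation multiplied by the two standard deviations, so stretching a component changes its covariance with the other component but leaves the correlation untouched.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance with an (n - 1) denominator."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def sd(a):
    """Sample standard deviation."""
    return math.sqrt(cov(a, a))


def cor(a, b):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return cov(a, b) / (sd(a) * sd(b))


# Made-up component scores for five samples (illustration only).
t = [-2.0, -1.0, 0.0, 1.0, 2.0]
u = [-1.5, -1.0, 0.5, 0.5, 1.5]

# Stretch t by a factor of 3: same direction, larger variance.
t_big = [3.0 * x for x in t]

print("correlation:", cor(t, u), "vs", cor(t_big, u))  # unchanged by stretching
print("covariance: ", cov(t, u), "vs", cov(t_big, u))  # scales with the stretch
```

A covariance criterion therefore rewards components that are both well correlated and carry a lot of variance, whereas a correlation criterion ignores how much variance each component carries.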
I’m not sure what you mean by “Can I R square them”, sorry. R-squared is a statistic that represents the degree to which the variation in Y is explained by X.
Hi @MaxBladen ,
Thank you very much for your answers and your patience.
I read (or tried to read) the paper you cited. Unfortunately, there are too many equations in it and I was not able to fully understand them. I started using wrapper.sgcca() because I have more than two datasets and I wanted to do variable selection. After exploring the data, I saw that it would be better to perform the variable selection before integration, and when I put all the datasets together only a few samples remained (I did not have the same samples across all datasets). So I decided to compute the correlations between pairs of experiments (to have a higher number of samples in common). But in the end, I was wondering whether wrapper.sgcca() is the best algorithm for this, because I am not using its two main advantages: the “s” (I am not doing variable selection) and the “g” (I am correlating only two blocks at a time). Do you think another algorithm will be more appropriate or even using the function this way the correlations are still reliable?
My doubt about R or R-squared is how exactly I can interpret the correlation values given by the relevance networks. Can I say a lower value is weak and a higher value is a strong correlation? Because I want to compare the pair-wise associations among all my relevance networks (because now I am running the function with only two blocks, using all possible combinations among my five different experiments). Then, hypothetically, if I have a cutoff of 0.75 in one network and the same cutoff in another, can I say that the associations depicted are equally strong (that’s why I asked about R-squared), either positive or negative? By the way, do the correlation values range from -1 to 1 (my results only had values in this range)?
I hope I made myself clear, and I am sorry for any nonsense questions.
Do you think another algorithm will be more appropriate or even using the function this way the correlations are still reliable?
Seeing as you’re just using two dataframes, I’d consider basic PLS (via the pls() function) or maybe rCCA (via
Can I say a lower value is weak and a higher value is a strong correlation?
I would avoid thinking about R-squared values as “correlations”. Again, this statistic represents the amount of variation in a feature, Y, explained by a set of predictor features, X. If your R-squared is 0.6, it means 60% of the variation in Y is explained by X. For a correlation value, a negative sign denotes an inverse relationship between X and Y.
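As a small illustration of how the two statistics relate, here is a sketch in plain Python (again not mixOmics code; the data and the function names are made up for this example). It covers only the simplest case of one predictor and one response fitted with a straight line, where R-squared equals the square of the Pearson correlation, so the sign of the association is lost once you square.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared(x, y):
    """R-squared of the least-squares line y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Made-up measurements (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.1, 3.9, 6.2, 7.8, 10.1]   # increases with x
y_down = [-v for v in y_up]          # mirrored: decreases with x

for label, y in (("increasing", y_up), ("decreasing", y_down)):
    print(label, "r =", pearson_r(x, y), "R-squared =", r_squared(x, y))
```

Both responses give the same R-squared, but the correlations have opposite signs, which is why a correlation value tells you the direction of an association and R-squared does not.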
Because I want to compare the pair-wise associations among all my relevance networks
network() produces and uses correlation matrices; R-squared is not involved in this process.
if I have a cutoff of 0.75 in one network and the same cutoff in another
Yes, these could be classed as comparable.
do the correlation values range from -1 to 1