I am trying to understand the methods behind DIABLO and I am confused about the roles of SGCCA and PLS. In DIABLO’s paper, it is explained that DIABLO extends SGCCA, while in YouTube tutorials it seems that DIABLO is based on PLS. From the block.splsda() manual, I understand that DIABLO is actually based on both SGCCA and PLS.
I know, from Tenenhaus & Tenenhaus, 2011 and from Tenenhaus et al., 2014, that RGCCA, from which SGCCA derives, employs an approach from PLS which allows it to indicate the degree of connection between the different blocks of data.
I would like to ask the following questions:
What is the relationship between SGCCA and PLS? From what I read, I understand that both of them are methods to reduce the dimensions of a matrix. So, in what way does DIABLO employ SGCCA and in what way does it employ PLS?
How does PLS make it possible to indicate the degree of connection between blocks of data? I know that the design matrix is employed to achieve this, but I don’t understand why SGCCA needs to employ PLS to do it.
Could you explain what this connection between blocks of data actually is? If DIABLO is going to maximize the covariance between the latent variates of the different blocks of data, what is the difference between specifying a null design matrix and a full design matrix? I know the first one means that the program is going to focus more on the discriminant variates of each block with respect to the groups of samples, instead of on the connection between blocks of data, but I don’t understand the methods behind this. From equation (1) in DIABLO’s paper, I understand that the design matrix is multiplied by the covariances between blocks of data for each component; so, if the design matrix is 0, then the result of the equation should also be 0. Could you explain this? I am new to this type of method and I am lost.
Coming back to my first questions, I understand that this equation comes from SGCCA, so, how can I explain the impact of PLS on the calculation of the principal components? (both on the component scores and on the loading vectors?)
Here I paste a picture of the equation I am referring to:
Thank you very much!
I’ve been doing some digging through the mixOmics source code to answer your questions, @Jeni. Here’s what I’ve found:
sGCCA and sPLS are complementary feature reduction methods. However, sGCCA (note the “G”) is a “generalised” form of sCCA that works in multiblock contexts. rGCCA (note the “r”) is the “regularised” form of GCCA. Within mixOmics specifically, sGCCA and DIABLO (block.splsda()) essentially call the exact same code, and both derive from the PLS algorithm. However, rGCCA uses methodology derived from Tenenhaus, A. and Guillemot.
It’s easier to think of DIABLO as being able to behave as either a pseudo-rGCCA algorithm or a PLS-derived one, not both simultaneously.
The design matrix is employed at each iteration of the PLS algorithm and is involved in the calculation of loadings and variates (see here). PLS is not required for sGCCA, but a portion of the PLS algorithm was repurposed to format the data into a form sGCCA can use effectively (see here).
I feel you have somewhat answered your own question here. If the covariance between a specific pair of blocks is not important to consider, then set its value in the design matrix to 0. That pair then makes no contribution to the above equation, but note that we are summing over all design values, so the other pairs still contribute. This is only a problem if all design values are 0, but then there’s no point in using this method if the design matrix is all 0s.
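To make the “summing over all design values” point concrete, here is a minimal sketch in plain Python (not mixOmics code). The variates, the block names and the helper functions are all made up for illustration; it only assumes that the equation is a design-weighted sum of covariances between pairs of block variates.

```python
def covariance(x, y):
    """Sample covariance between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def weighted_covariance_sum(variates, design):
    """Sum of design[i][j] * cov(variates[i], variates[j]) over all pairs i != j."""
    total = 0.0
    for i in range(len(variates)):
        for j in range(len(variates)):
            if i != j:
                total += design[i][j] * covariance(variates[i], variates[j])
    return total


# Hypothetical component scores: one variate per block, four samples each.
variates = [
    [1.0, 2.0, 3.0, 4.0],  # block A
    [2.0, 4.0, 6.0, 8.0],  # block B
    [1.0, 1.0, 2.0, 4.0],  # block C
]

full_design = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
# Same as the full design, but the A-C link is switched off.
partial_design = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
null_design = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]

full_value = weighted_covariance_sum(variates, full_design)
partial_value = weighted_covariance_sum(variates, partial_design)
null_value = weighted_covariance_sum(variates, null_design)

print(full_value, partial_value, null_value)
```

Switching off a single link only removes that pair’s term; the sum becomes zero only when every entry is zero. Note that this toy sum covers only the data blocks; any link to the outcome is left out.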
I’m a bit confused as to what you’re actually asking here, sorry. Do you mean at a mathematical, a programmatic or a conceptual level?
Thanks for your answers!
Regarding 4: my understanding is that the loading vectors and component scores are calculated from the equation shown in the figure. However, this equation comes from SGCCA. So, in which part of the equation is PLS involved?
I think this passage may help. Note the first line of the first paragraph: “[sGCCA], which contrary to what its name suggests, generalises PLS and PLS-DA for multiple, matching data sets”.
That equation which you sent is derived from the PLS algorithm rather than CCA (Canonical Correlation Analysis). The difference between the two is that PLS seeks to maximise covariance whereas CCA seeks to maximise correlation. The equation uses the cov() function, not cor().
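If the covariance versus correlation distinction is unclear, this small plain-Python sketch may help. The two score vectors and the helper functions are hypothetical and have nothing to do with the mixOmics internals; the only point is that rescaling a variate changes its covariance with another variate but leaves the correlation untouched.

```python
import math


def covariance(x, y):
    """Sample covariance between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical component scores from two blocks.
t1 = [1.0, 2.0, 3.0, 4.0]
t2 = [1.0, 3.0, 2.0, 5.0]
# The same second variate, stretched by a factor of 10.
t2_scaled = [10 * value for value in t2]

cov_original = covariance(t1, t2)
cov_scaled = covariance(t1, t2_scaled)
cor_original = correlation(t1, t2)
cor_scaled = correlation(t1, t2_scaled)

print(cov_original, cov_scaled)  # covariance grows with the scale of t2
print(cor_original, cor_scaled)  # correlation does not
```

So a criterion built on cov() rewards variates that both agree across blocks and carry a lot of variance, whereas one built on cor() only rewards agreement.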
As a result of that, you may ask why the CCA acronym is used rather than PLS in the available resources. Within mixOmics, “sGCCA” uses the PLS algorithm to decompose the input data, but algorithms from the rGCCA package are utilised to deflate the dataframes. Refer to ** at bottom for explanation of deflation.
I don’t know why it was named as it was, but sGCCA is sometimes used this way in these resources. I definitely see how the language of some of the available resources is potentially confusing.
** This algorithm constructs its components in an iterative manner. That’s why we can have first, second, etc. components, as they’re calculated one after the other. Upon completion of a component, the input dataframe is “deflated”. This is the process of removing the variability captured by the component from the input dataframe. It helps reduce the correlation between successive components (e.g. components 2 and 3 in a block are unlikely to be highly correlated: when component 3 is calculated, the information of component 2 is no longer in the dataframe due to deflation).
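As a rough illustration of deflation, here is a minimal plain-Python sketch. It assumes the textbook PLS-style step (regress each column of the block on the component scores, then subtract the fitted part); the exact deflation used inside mixOmics or the rGCCA package may differ, and the data, the loading vector and the helper names are all hypothetical.

```python
def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def dot(u, v):
    """Dot product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def deflate(X, t):
    """Remove from X the part explained by component scores t (X - t p^T)."""
    tt = dot(t, t)
    # Regress each column of X on t to obtain that column's loading p_j.
    p = [dot(column(X, j), t) / tt for j in range(len(X[0]))]
    return [[X[i][j] - t[i] * p[j] for j in range(len(p))] for i in range(len(X))]


# Hypothetical centred data block: 4 samples x 2 variables.
X = [[-1.5, -1.0], [-0.5, 1.0], [0.5, -1.0], [1.5, 1.0]]
# Hypothetical unit-norm loading vector for the first component.
w = [0.6, 0.8]

# Component scores: projection of each sample onto the loading vector.
t = [dot(row, w) for row in X]

X_deflated = deflate(X, t)

# Each column of the deflated block is now orthogonal to t (up to rounding).
leftover = [dot(column(X_deflated, j), t) for j in range(len(X[0]))]
print(leftover)
```

Because nothing in the deflated block points along the first component any more, the next component has to be built from whatever variability remains.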