Hello everybody,
I am writing my first post because I have just started using the rCCA method to integrate two datasets, i.e. DEGs and MixMC-filtered 16S data. Before trying rCCA, I made several attempts with sPLS, but I could not get any valid model out of my data.
Now rCCA seems to be working, but I have some doubts about the reliability of what I get. With other methods provided by mixOmics, like sPLS, you use a threshold to decide whether your Q2 values are acceptable or not.
With rCCA, however, I do not see an equivalent metric that could be used as guidance. Of course, one could look at the scree plots and judge whether the R2 values are high 'enough' to be considered significant. In my case, the first three components are roughly in the 0.6 - 0.8 range.
I also have another question. I get 'good' results, i.e. the ones I have just mentioned, when I use the 'shrinkage' method, but when I use the 'CV' one the R2 values drop dramatically. Does this mean I should go for the 'shrinkage' method, or is there some bias to consider?
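For reference, this is roughly what I am running (X and Y are placeholders for my two data matrices, and the tuning grid is only an example):

```r
library(mixOmics)

# X: ~800 DEGs (samples x genes), Y: ~80 MixMC-filtered genera (samples x genera)

# 'shrinkage': regularisation parameters estimated analytically
rcc.shrink <- rcc(X, Y, ncomp = 3, method = "shrinkage")

# 'CV': regularisation parameters tuned by cross-validation, then used with the ridge method
tune.res <- tune.rcc(X, Y,
                     grid1 = seq(0.001, 1, length.out = 20),
                     grid2 = seq(0.001, 1, length.out = 20),
                     validation = "loo")
rcc.cv <- rcc(X, Y, ncomp = 3, method = "ridge",
              lambda1 = tune.res$opt.lambda1, lambda2 = tune.res$opt.lambda2)

# canonical correlations per component (what I read off the scree plots)
rcc.shrink$cor
rcc.cv$cor
```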
Thanks in advance for your help. Best regards,
Marco
Hi @Moroldo,
You are right, rCCA is a very exploratory approach, so there is no real numerical criterion to objectively evaluate how well the method is performing.
What we use as guides (a quick code sketch follows the list):
- the canonical correlations: if they are too close to 1 and do not decrease fast enough, it is an indication that the matrices are ill-conditioned, or that the regularisation parameters are not well estimated (perhaps that is what is happening with the shrinkage method)
- the sample plots: look at them separately in the X and Y space (or use plotArrow()) to make sure some agreement is extracted between the two data sets
- the correlation circle plots: plotVar() will show whether you are unravelling potential associations between the two data types (the closer the variables sit to the large circle, the better).
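In code, these checks look roughly like this (assuming your fitted object is called result.rcc):

```r
# sample plots in the X and Y spaces separately, plus the arrow plot
plotIndiv(result.rcc, comp = c(1, 2), rep.space = "X-variate")
plotIndiv(result.rcc, comp = c(1, 2), rep.space = "Y-variate")
plotArrow(result.rcc)

# correlation circle plot; only variables correlated above the cutoff are displayed
plotVar(result.rcc, comp = c(1, 2), cutoff = 0.5)

# canonical correlations per dimension (these should decrease reasonably fast)
result.rcc$cor
```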
I have also noticed that the shrinkage method tends to give higher canonical correlations, but the checks listed above should help you decide.
If you send me your email via private message on Discourse, I'll send you some documentation we are currently working on.
Kim-Anh
Dear Kim-Anh,
Thanks for your prompt answer. In fact, the canonical correlations seem to decrease pretty fast in my case – which should be good according to what you say.
I need to take a look at the other plots you mention, because so far I have only produced the scree plot and the CIM.
You can contact me at marco.moroldo@inrae.fr.
Best regards,
Marco
[Appending the rest of the conversation here]
Dear Kim-Anh,
I hope you are doing well. Following your last emails, I have decided to compare the 'shrinkage' and 'CV' methods more thoroughly on my dataset.
I have included the main plots in the attached .ppt file. The 'X' dataset corresponds to roughly 800 genes, and the 'Y' dataset to roughly 80 MixMC-filtered genera.
As you can see, there are many differences, but at this stage I am trying to focus on the CIM figures (not included in the .ppt file), because the most highly correlated genes and genera should be the same.
Do you have any suggestions or remarks about these results? In my opinion, the 'shrinkage' method performs better, for instance in terms of the scree plot. The X-Y space also seems somewhat more 'balanced', whereas with the 'CV' method two samples look like outliers, which is not actually the case.
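For completeness, this is roughly how I produced the comparison plots (rcc.shrink and rcc.cv being the two fitted models, as in my earlier sketch):

```r
# scree plots of the canonical correlations for the two models
barplot(rcc.shrink$cor, main = "shrinkage", ylab = "canonical correlation")
barplot(rcc.cv$cor, main = "CV (ridge)", ylab = "canonical correlation")

# sample plots in the combined X-Y space
plotIndiv(rcc.shrink, comp = c(1, 2), rep.space = "XY-variate")
plotIndiv(rcc.cv, comp = c(1, 2), rep.space = "XY-variate")

# CIMs, to compare the most highly correlated genes and genera
cim(rcc.shrink, comp = 1:3, xlab = "genera", ylab = "genes")
cim(rcc.cv, comp = 1:3, xlab = "genera", ylab = "genes")
```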
Thanks in advance for your help and for your time. Best regards,
Marco
Dear Marco,
Your results are similar to what I highlighted in our chapter: the shrinkage method seems to maximise the correlation better, but the correlations do not decrease well. With CV, you seem to highlight some sample outliers. Do they make sense? Do you have any phenotypic information about those samples? You could also apply a more stringent cutoff for the CV.
At this stage, you will have to rely on the biology. My feeling is that the shrinkage method extracts more correlation (but with potential overlap across dimensions) and less correlation structure between the variables, whereas CV does the opposite, at the expense of highlighting outliers (have a look at the expression levels of the highly associated features on both axes to work out what those outliers are about).
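A rough way to look into the outliers, assuming the CV model is stored in rcc.cv with its canonical factors in $loadings$X (as in recent mixOmics versions), the gene matrix in X, and placeholder names for the two suspect samples:

```r
# rank the genes by the absolute value of their loading on component 1 of the CV model
top.genes <- names(sort(abs(rcc.cv$loadings$X[, 1]), decreasing = TRUE))[1:20]

# boxplots of the expression levels of these genes across all samples
boxplot(X[, top.genes], las = 2, cex.axis = 0.6,
        main = "Top component-1 genes (CV model)")

# overlay the two suspect samples (placeholder names) to see whether they drive the signal
outliers <- c("sample_A", "sample_B")
for (s in outliers) {
  points(seq_along(top.genes), X[s, top.genes], col = "red", pch = 19)
}
```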
Kim-Anh
Dear Kim-Anh,
As you suggest, I will try to rely on the biology to make a choice between the two methods. So far, I still haven't found an explanation for the outliers, which may indeed suggest that the 'CV' method tends to produce 'false' outliers in my case.
Have a nice day and thanks for your help and your time,
Marco