Selection of number of CVs from rCCA for downstream investigation

Hi,

Thanks very much for this fantastic resource, which I’ve recently started exploring (apologies for the novice questions!). I am working with a longitudinal dataset consisting of several hundred individuals with two omics (transcriptomics and metabolomics) measured on them at three or more timepoints. I have been exploring use of rCCA for analysis of these data.

I have run rCCA using the shrinkage method (for the purposes of this specific analysis I am ignoring the longitudinal aspect, but am interested in exploring this further in the future), and have found that the resulting canonical correlations are extremely high (the first 300 are above 0.9). Each canonical variate explains quite a small amount of the variance in the transcriptomic data, but a larger amount of the variance in the metabolomic data. The canonical correlations seem to be reproducibly high. Are values this high necessarily a problem?

Assuming they are robust, are you able to offer any suggestions on how I should go about selecting the number of canonical variates to include in downstream analyses? I am interested in identifying the top features in each ‘omic, accounting for the covariance between the two datasets, and in visualising these features and the relationships between them in a network plot, potentially using Gephi.

I’d be very grateful for any advice you could offer. Thanks!

Julia

Hi @jsem,

rCCA does not perform well when the variables are highly correlated, or when the total number of variables in X and Y is much larger than the number of samples. The very high canonical correlations you are seeing are an indicator of this problem.
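To illustrate the point about having many more variables than samples, here is a minimal NumPy sketch. It is not mixOmics code and it uses plain, unregularised CCA rather than the shrinkage rCCA from the question; the function name `canonical_correlations`, the matrix sizes and the random data are all assumptions chosen for illustration. It shows that pure noise produces canonical correlations of 1 once each block has more variables than samples, whereas the same computation on low-dimensional noise gives small values.

```python
import numpy as np

def canonical_correlations(X, Y, tol=1e-10):
    """Unregularised canonical correlations between two data blocks.

    Hypothetical helper: builds an orthonormal basis for the column space
    of each centred block, then takes the singular values of their
    cross-product (the cosines of the principal angles).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Ux, sx, _ = np.linalg.svd(Xc, full_matrices=False)
    Uy, sy, _ = np.linalg.svd(Yc, full_matrices=False)
    # Drop directions with (numerically) zero variance.
    Ux = Ux[:, sx > tol * sx.max()]
    Uy = Uy[:, sy > tol * sy.max()]
    return np.linalg.svd(Ux.T @ Uy, compute_uv=False)

rng = np.random.default_rng(0)

# Pure noise, 20 samples, 50 and 30 variables: no real association at all.
wide = canonical_correlations(rng.normal(size=(20, 50)),
                              rng.normal(size=(20, 30)))

# Pure noise again, but 200 samples and only 2 variables per block.
tall = canonical_correlations(rng.normal(size=(200, 2)),
                              rng.normal(size=(200, 2)))

print("variables >> samples:", np.round(wide[:5], 3))
print("samples >> variables:", np.round(tall, 3))
```

Regularisation pulls these values below 1, but the same mechanism can keep them very high, which is why high canonical correlations alone say little about real shared signal.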

I’d suggest you move directly to PLS (mode = "canonical"), as it handles high collinearity better and the visualisations are interpreted in a similar way. There is also a sparse version, sPLS, to select variables, although in your case you will need to decide yourself how many variables to select. Assuming you are still at the exploration stage, I’d suggest selecting about 50 genes and 5-10 metabolites per component, then looking at the plotIndiv and plotVar outputs to assess whether the results make sense. You should not need many components: I don't know anything about your data, but 3 components should probably be enough to extract most of the covariance.
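For intuition about what sparse PLS in canonical mode does with a fixed number of variables per component, here is a rough NumPy sketch. It is not the mixOmics implementation: the helper names `keep_top` and `sparse_pls_canonical`, the fixed iteration count and the toy data are assumptions, and it presumes no constant columns. Each component alternates soft-thresholded loading updates on the cross-product matrix, then each block is deflated by its own variate. The defaults mirror the numbers suggested above (3 components, 50 X variables and 10 Y variables per component).

```python
import numpy as np

def keep_top(w, k):
    """Soft-threshold w so only its k largest-magnitude entries stay non-zero."""
    if k >= w.size:
        return w
    threshold = np.sort(np.abs(w))[-(k + 1)]  # (k+1)-th largest magnitude
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

def sparse_pls_canonical(X, Y, ncomp=3, keep_x=50, keep_y=10, n_iter=100):
    """Return, per component, the indices of the selected X and Y variables."""
    # Centre and scale each variable (assumes no constant columns).
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    selected = []
    for _ in range(ncomp):
        M = X.T @ Y  # cross-product between the two blocks
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        a, b = U[:, 0], Vt[0]  # dense starting loadings
        for _ in range(n_iter):
            a = keep_top(M @ b, keep_x)
            a = a / np.linalg.norm(a)
            b = keep_top(M.T @ a, keep_y)
            b = b / np.linalg.norm(b)
        t, u = X @ a, Y @ b  # variates for this component
        selected.append((np.flatnonzero(a), np.flatnonzero(b)))
        # Canonical mode: deflate each block by its own variate.
        X = X - np.outer(t, t @ X) / (t @ t)
        Y = Y - np.outer(u, u @ Y) / (u @ u)
    return selected

# Toy data (assumption): one shared latent signal drives the first
# 5 columns of X and the first 3 columns of Y; everything else is noise.
rng = np.random.default_rng(1)
n = 100
z = rng.normal(size=n)
X = rng.normal(size=(n, 200))
Y = rng.normal(size=(n, 40))
X[:, :5] += 3 * z[:, None]
Y[:, :3] += 3 * z[:, None]

selected = sparse_pls_canonical(X, Y, ncomp=2, keep_x=5, keep_y=3)
for comp, (ix, iy) in enumerate(selected, start=1):
    print(f"component {comp}: X vars {ix.tolist()}, Y vars {iy.tolist()}")
```

In this toy setting the first component should pick up the planted variables, while later components select noise. This is the kind of pattern the sample and variable plots help you spot when deciding how many components to keep.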

You can look at other posts on time-course analysis later :slight_smile: (there is also a YouTube video on that topic on our website).

Kim-Anh