Issues with hierarchical clustering after running tune.spls()

Hello,
I have been using the mixOmics package for sPLS analysis. The main aim is to select the response variables to include in downstream analyses based on their correlation with explanatory variables. I start with a matrix Y of 127 samples and 67,968 response variables and a matrix X with 26 explanatory variables. I’ve been running some tests using an arbitrary choice for keepY and ncomp. This resulted in 3 clusters of response variables related to explanatory variables (Plot 1).

Trying to optimize the keepY and ncomp parameters, I used the tune.spls() function, which returned a choice of keepY that I used to fit an sPLS model, which I then passed to the cim() function. The number of variables chosen was higher on the first component and lower on the second compared to what I was using before running tune.spls(). However, when using the cim() function, this resulted in a hierarchical clustering that makes no sense with regard to the correlation patterns shown on the heatmap (Plot 2), nor does it make sense on the variable plot (Plot 3). Would you have any explanation as to how I get such inconsistent clusters of response variables? Is the clustering not based on Pearson's correlation when using the cim() function?
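For reference, the tuning and model calls looked roughly like this (object names are placeholders and the candidate keepY grid is illustrative rather than the exact one I used):

library(mixOmics)
# X: 127 x 26 matrix of explanatory variables; Y: 127 x 67,968 matrix of response variables
tuned <- tune.spls(X, Y, ncomp = 2,
                   test.keepY = seq(500, 5000, by = 500),   # illustrative grid of keepY values
                   measure = "cor", mode = "regression",
                   validation = "Mfold", folds = 4, nrepeat = 5)
tuned$choice.keepY                                           # keepY values selected by the tuning
spls.tuned <- spls(X, Y, ncomp = 2, mode = "regression",
                   keepY = tuned$choice.keepY)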

Setting aside those inconsistent clusters, I also have some trouble understanding how the tune.spls() function chose the optimal keepY parameters (Plot 4). For the second component, the choice seems to match the highest mean correlation value, but for the first component it matches the lowest. Is this normal behaviour, and if not, do you know why this could happen?

Plot 1: Plot of variables with clusters of response variables highlighted as returned by the hierarchical clustering from the cim() function. The sPLS model was run with ncomp=2, keepY=c(3000,3000).
Plot 2: The output of the cim() function on the sPLS model using the same matrix, but with keepY=c(4000,500). Response variables are represented as columns, and explanatory variables are represented as rows.
Plot 3: Corresponding variable plot with clusters highlighted as suggested by the dendrogram in Plot 2.
Plot 4: tune.spls() output with measure = "cor", mode = "regression", and 5 repeats of 4-fold CV.

Many thanks

Hi @lmdelpech ,

Thanks for the comprehensive post. Working my way through your questions:

this resulted in a hierarchical clustering that makes no sense with regard to the correlation patterns shown on the heatmap (Plot 2), nor does it make sense on the variable plot (Plot 3)

Based on Plot 2, I would disagree with this statement. Under the assumption that there are 4 clusters, I can see four distinct "blocks" of genetic expression within the clustered image map. I've drawn some crude vertical lines (lime green) to show roughly where I see these blocks, which correspond to the four clusters you get if you cut the tree at the horizontal green line.


When I look at Plot 2 and Plot 3, I don't see anything particularly wrong. The sizes of the clusters in the dendrogram seem to align with those seen in Plot 3. Each of these clusters has fairly homogeneous within-cluster expression, and they are well separated on the first two components. However, I may be missing something. Would you mind clarifying what you mean by it making "no sense"?

Would you have any explanation as to how I get such inconsistent clusters of response variables?

When you say inconsistent, do you mean an inconsistency between Plots 2 and 3 compared to Plot 1? The difference between these is caused by the varying keepY values over your two components. If you mean comparing Plot 2 and Plot 3, I see little inconsistency.

Is the clustering not based on Pearson's correlation when using the cim() function?

If we look at the documentation (accessed via ?cim), the dist.method parameter is described as follows:

character vector of length two. The distance measure used in clustering rows and columns. Possible values are "correlation" for Pearson correlation and all the distances supported by dist, such as "euclidean", etc.

Its default value is "euclidean", so unless you specifically state the distance method you want - seemingly dist.method = c("correlation", "correlation") - it will not be based on Pearson's correlation. Did you explicitly set this within your cim() call?
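For example (just a sketch, with spls.model standing in for your fitted sPLS object):

# default: rows and columns are clustered on Euclidean distance
cim(spls.model)
# explicit request for 1 - Pearson correlation as the distance for both rows and columns
cim(spls.model, dist.method = c("correlation", "correlation"))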

but for the first component it matches the lowest. Is this normal behaviour, and if not, do you know why this could happen?

Now this is definitely something which raises an eyebrow. We did have an issue with the tune.spls() function (read an in-depth explanation here) which resulted in the keepY with the minimal correlation being selected, rather than the maximal correlation. What I’d suggest is running the following command and rerunning the tune.spls() call:

install_github("mixOmicsTeam/mixOmics", ref = github_pull("181"))

This will update the tune.spls() function to operate properly when utilising the correlation as the assessment criterion. Please let me know if this resolves your issue, or raises any other questions.

Looking forward to your response

Hi @MaxBladen,

Thank you very much for your reply.

Did you explicitly set this within your cim() call?

dist.method was indeed set to c("correlation", "correlation") in the call to cim(); sorry, I did not specify this.

When you say inconsistent, do you mean an inconsistency between Plots 2 and 3 compared to Plot 1? The difference between these is caused by the varying keepY values over your two components. If you mean comparing Plot 2 and Plot 3, I see little inconsistency.

To clarify what I see as inconsistent: I do agree that Plot 2 and Plot 3 are consistent; they should in fact show exactly the same clusters. What I think makes 'no sense' here is rather the clustering made on the heatmap of Plot 2 (the colors of the dots on Plot 3 are chosen based on the 4 clusters you showed with the green lines on the CIM plot). I do see 4 clear 'blocks' on this heatmap, but if we number these four clusters from 1 to 4 going from left to right in the tree, it makes no sense to me that cluster 3 can possibly cluster closer to cluster 4 than to cluster 1 with regard to the Pearson correlations. Cluster 3 shows a pattern of correlations that seems much closer to cluster 1 than to cluster 4, at least to my eyes…

Moreover, it seems very clear to me on Plot 3 that there are 3 clusters of points, and I don't understand how 4 branches emerge this high in the tree. To me, everything from the colors of the heatmap to the distribution of points on the correlation circle points towards 3 clusters (Cluster 1+3 / Cluster 2 / Cluster 4) (see Plot 5 below).

Plot 5: Correspondence between CIM clusters (on components 1 and 2 of the sPLS) and dispersion in the correlation plots. cim() was run with dist.method = c("correlation", "correlation"). Upper panels: sPLS run with keepY = c(3000, 3000). Lower panels: sPLS run with keepY = c(4000, 500). When cutting the tree along the dashed line, the numbered clusters correspond to the highlighted clusters in the left-hand plots (correlation plots).
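To make the comparison concrete, this is the kind of sanity check I have in mind (a sketch only; cluster stands for a named vector of cluster labels for the selected Y variables, with names matching columns of Y, obtained by cutting the CIM dendrogram into the 4 groups shown):

# mean correlation profile of each cluster's Y variables against the X variables
prof <- sapply(1:4, function(k) {
  rowMeans(cor(X, Y[, names(cluster)[cluster == k]]))
})
# similarity between the cluster profiles; I would expect cluster 3 to come out
# closer to cluster 1 than to cluster 4 here
cor(prof)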

Now this is definitely something which raises an eyebrow. We did have an issue with the tune.spls() function (read an in-depth explanation here) which resulted in the keepY with the minimal correlation being selected, rather than the maximal correlation. What I’d suggest is running the following command and rerunning the tune.spls() call:
install_github("mixOmicsTeam/mixOmics", ref = github_pull("181"))

Regarding the tune.spls() keepY choice, I encountered an issue when trying the suggested command:
install_github("mixOmicsTeam/mixOmics", ref = github_pull("181"))
Downloading GitHub repo mixOmicsTeam/mixOmics@issue-164
Error in utils::download.file(url, path, method = method, quiet = quiet, : cannot open URL 'https://api.github.com/repos/mixOmicsTeam/mixOmics/tarball/issue-164'

So I instead installed the development version of the package, since I guess the fix would have been merged there?
BiocManager::install("mixOmicsTeam/mixOmics@devel")

I will keep you posted when the function tune.spls() has run again. I am currently also encountering some parallelization issues for this function when using BiocParallel::SnowParam() or BiocParallel::MulticoreParam(), so it takes several days to run, but I have yet to test whether this is a machine-related issue…
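For reference, the parallel call looks roughly like this (a sketch; I am assuming the devel tune.spls() takes the back-end via a BPPARAM argument, which is how I have been passing it):

library(BiocParallel)
# same tune.spls() call as in my first post, with the parallel back-end added
tuned <- tune.spls(X, Y, ncomp = 2, test.keepY = seq(500, 5000, by = 500),
                   measure = "cor", mode = "regression", folds = 4, nrepeat = 5,
                   BPPARAM = SnowParam(workers = 4))   # or MulticoreParam(workers = 4)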

Thanks again for the help

it makes no sense to me that cluster 3 can possibly cluster closer to cluster 4 than to cluster 1 with regard to the Pearson correlations

I see what you mean. I've racked my brain for what could be causing this, unfortunately to no avail. Without any context or an idea of your data, it's hard to provide reasoning as to why this is occurring. I will say - in my (albeit limited) experience - that hierarchical clustering on thousands of features becomes a bit of a "black box": the decisions the model makes can fail to match our intuition when we look at the result. While it is a very effective clustering method, at this scale the interpretability can suffer. Please keep me updated if you have a breakthrough with this; I'm very curious.
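One thing that may be worth a try (a sketch only; if I recall correctly, cim() also exposes the agglomeration method through its clust.method argument, with complete linkage as the default): rerun the CIM with a different linkage and see whether the 3-cluster structure you expect falls out of the tree.

cim(spls.model,
    dist.method  = c("correlation", "correlation"),
    clust.method = c("average", "average"))   # average linkage for both rows and columns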

I now realise that the pull request (and associated branch) I referred to with that command no longer exists - so you did the correct thing downloading the devel build. Let me know how it fares.

Sorry that I couldn’t be more helpful with this post. If I think of any potential causes, I’ll be sure to let you know.

Cheers,
Max