Hello,

I have been using the mixOmics package for sPLS analysis. The main aim is to select the response variables to include in downstream analyses based on their correlation with explanatory variables. I start with a matrix Y of 127 samples and 67,968 response variables and a matrix X with 26 explanatory variables. I’ve been running some tests using an arbitrary choice for `keepY` and `ncomp`. This resulted in 3 clusters of response variables related to explanatory variables (**Plot 1**).
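For reference, here is a minimal sketch of that first run. It uses small simulated matrices as stand-ins for my data (200 response variables instead of 67,968, and `keepY = c(30, 30)` instead of `c(3000, 3000)`), and the object names, `mode = "regression"`, and the `cutree()` step for extracting 3 clusters from the `cim()` dendrogram are assumptions for illustration rather than my exact script.

```r
library(mixOmics)

set.seed(1)
n <- 127

# Simulated stand-ins for the real data (assumption: sizes reduced for speed)
X <- matrix(rnorm(n * 26), n, 26,
            dimnames = list(paste0("s", 1:n), paste0("x", 1:26)))
B <- matrix(rnorm(26 * 200, sd = 0.3), 26, 200)
Y <- X %*% B + matrix(rnorm(n * 200), n, 200)
colnames(Y) <- paste0("y", 1:200)

# sPLS with an arbitrary number of response variables kept per component
fit <- spls(X, Y, ncomp = 2, mode = "regression", keepY = c(30, 30))

# Clustered image map; the returned list holds the column dendrogram (ddc)
cim_res <- cim(fit, comp = 1:2)

# Cut the response-variable dendrogram into 3 clusters (hypothetical step)
clusters <- cutree(as.hclust(cim_res$ddc), k = 3)
table(clusters)

# Variable plot on the first two components
plotVar(fit, comp = 1:2)
```

The cluster labels from `cutree()` are what I then used to colour the response variables on the variable plot.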

Trying to optimize the `keepY` and `ncomp` parameters, I used the `tune.spls()` function, which returned a choice for `keepY` that I used for an sPLS model as input for the `cim()` function. Compared with what I was using before running `tune.spls()`, the number of variables chosen was higher on the first component and lower on the second. However, when using the `cim()` function, this resulted in a hierarchical clustering that makes no sense with regard to the correlation patterns shown on the heatmap (**Plot 2**), nor does it make sense on the variable plot (**Plot 3**). Would you have any explanation as to how I get such inconsistent clusters of response variables? Is the clustering not based on the Pearson’s correlation value when using the `cim()` function?

Regardless of those inconsistent clusters being created, I have some trouble understanding how the `tune.spls()` function chose the optimal `keepY` parameters (**Plot 4**). For the second component, the choice seems to match the highest mean correlation value, but for the first component it matches the lowest. Is this normal behaviour, and if not, do you know why this could happen?
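In case it helps, the tuning call looked roughly like the sketch below, with the settings listed under **Plot 4**. The data are again small simulated stand-ins, the `test.keepY` grid is hypothetical, keeping all X variables via `test.keepX = ncol(X)` is an assumption, and it assumes a mixOmics version in which `tune.spls()` accepts `test.keepY` and `measure = "cor"`.

```r
library(mixOmics)

set.seed(1)
n <- 127

# Simulated stand-ins for the real data (assumption: sizes reduced for speed)
X <- matrix(rnorm(n * 26), n, 26,
            dimnames = list(paste0("s", 1:n), paste0("x", 1:26)))
B <- matrix(rnorm(26 * 200, sd = 0.3), 26, 200)
Y <- X %*% B + matrix(rnorm(n * 200), n, 200)
colnames(Y) <- paste0("y", 1:200)

# 5 repeats of 4-fold CV, correlation as the performance measure
tune_res <- tune.spls(X, Y, ncomp = 2,
                      test.keepX = ncol(X),         # keep all X variables
                      test.keepY = c(10, 50, 100),  # hypothetical grid
                      validation = "Mfold", folds = 4, nrepeat = 5,
                      measure = "cor", mode = "regression")

# keepY selected for each component
tune_res$choice.keepY

# Tuning plot of the kind shown in Plot 4
plot(tune_res)
```

The values in `tune_res$choice.keepY` are the ones I passed as `keepY` to the second sPLS model.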

**Plot 1**: Plot of variables with clusters of response variables highlighted as returned by the hierarchical clustering from the `cim()` function. The sPLS model was run with `ncomp=2`, `keepY=c(3000,3000)`.

**Plot 2**: The output of the `cim()` function on the sPLS model using the same matrix, but with `keepY=c(4000,500)`. Response variables are represented as columns, and explanatory variables are represented as rows.

**Plot 3**: Corresponding variable plot with clusters highlighted as suggested by the dendrogram in **Plot 2**.

**Plot 4**: `tune.spls()` output with `measure = "cor"`, `mode = "regression"`, and 5 repeats of 4-fold CV.

Many thanks