Hello,
I have been using the mixOmics package for sPLS analysis. The main aim is to select the response variables to include in downstream analyses based on their correlation with explanatory variables. I start with a matrix Y of 127 samples and 67,968 response variables and a matrix X with 26 explanatory variables. I’ve been running some tests using an arbitrary choice for keepY and ncomp. This resulted in 3 clusters of response variables related to explanatory variables (Plot 1).
Trying to optimize the keepY and ncomp parameters, I used the tune.spls() function, which returned a choice for keepY that I then used to fit the sPLS model passed to the cim() function. Compared with what I was using before running tune.spls(), the number of variables chosen was higher on the first component and lower on the second. However, the cim() function then produced a hierarchical clustering that makes no sense with regard to the correlation patterns shown on the heatmap (Plot 2), nor does it make sense on the variable plot (Plot 3). Would you have any explanation as to how I get such inconsistent clusters of response variables? Is the clustering not based on the Pearson’s correlation value when using the cim() function?
Regardless of those inconsistent clusters, I have some trouble understanding how the tune.spls() function chose the optimal keepY parameters (Plot 4). For the second component, the chosen value seems to match the highest mean correlation value, but for the first component it matches the lowest. Is this normal behaviour, and if not, do you know why this could happen?
Plot 1: Plot of variables with clusters of response variables highlighted as returned by the hierarchical clustering from the cim() function. The sPLS model was run with ncomp=2, keepY=c(3000,3000).
Plot 2: The output of the cim() function on the sPLS model using the same matrix, but with keepY=c(4000,500). Response variables are represented as columns, and explanatory variables are represented as rows.
Plot 3: Corresponding variable plot with clusters highlighted as suggested by the dendrogram in Plot 2.
Plot 4: tune.spls() output with measure = "cor", mode = "regression", and 5 repeats of 4-fold CV.
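For reference, the workflow described above corresponds roughly to the sketch below. It is only an illustration under stated assumptions: the data are simulated placeholders (200 response variables rather than 67,968), the test.keepY grid is hypothetical, and the dist.method and clust.method values passed to cim() are written out explicitly on the assumption that they match the function's defaults. Only ncomp, measure, mode, and the 5 x 4-fold CV settings come from the description above.

```r
library(mixOmics)

set.seed(1)
n <- 127  # number of samples, as in the post

# Placeholder X: 26 explanatory variables (simulated, not the real data)
X <- matrix(rnorm(n * 26), n, 26,
            dimnames = list(NULL, paste0("x", 1:26)))

# Placeholder Y: 200 simulated response variables instead of 67,968
Y <- matrix(rnorm(n * 200), n, 200,
            dimnames = list(NULL, paste0("y", 1:200)))
Y[, 1:20] <- Y[, 1:20] + X[, 1]  # inject some X-Y signal so there is something to find

# Tune keepY with the settings quoted for Plot 4.
# The test.keepY grid is hypothetical; all X variables are kept.
tuned <- tune.spls(X, Y, ncomp = 2,
                   test.keepX = ncol(X),
                   test.keepY = c(10, 25, 50, 100),
                   validation = "Mfold", folds = 4, nrepeat = 5,
                   measure = "cor", mode = "regression")
tuned$choice.keepY  # one selected keepY value per component

# Refit the sPLS model with the tuned keepY
fit <- spls(X, Y, ncomp = 2, keepY = tuned$choice.keepY, mode = "regression")

# Clustered heatmap; distance and linkage are stated explicitly
# (assumed here to be the defaults) for rows and columns respectively
res <- cim(fit, comp = 1:2,
           dist.method = c("euclidean", "euclidean"),
           clust.method = c("complete", "complete"))
```

Writing the distance and linkage arguments out makes it easier to check which dissimilarity the dendrograms in Plot 2 are actually built on, which is the point of the question about Pearson's correlation.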
Many thanks