Unable to understand selectVar() output in sPLS-DA

Hello,

I am trying to find class-specific features in a gene expression data set using an sPLS-DA model. After tuning for the optimal number of components and variables, I built the final sPLS-DA model using the code below:

splsda.res.final = mixOmics::splsda(X = training_data, Y = response.variable.training, ncomp = 6, keepX = c(743, 124, 372, 268, 619, 248), mode = "regression", scale = FALSE, tol = 1e-06, max.iter = 100, near.zero.var = FALSE, all.outputs = TRUE);

>splsda.res.final
sPLS-DA (regression mode) with 6 sPLS-DA components. 
You entered data X of dimensions: 380 12368 

 You entered data Y with 5 classes. 

Selection of [743] [124] [372] [248] [619] [248] variables on each of the sPLS-DA components on the X data set. 

No Y variables can be selected.

I expected 743 variables on component 1, but when I used selectVar(), I got 995 variables.

length(mixOmics::selectVar(object = splsda.res.final, comp = 1)$name)
[1] 995

Can someone help me to understand this observation?

Hi @srikantverma,

Thanks for using mixOmics and reporting this issue.

I have seen this issue before but, unfortunately, I haven't been able to reproduce it. It could be caused by a high proportion of missing values in many features, in which case you can filter some of those features out before the analysis.

Running the following should give you some insight into the percentage of missing values in each feature.

# percentage of NA values per feature (column), relative to the number of samples (rows)
col_na <- apply(training_data, 2, function(x) 100 * sum(is.na(x)) / length(x))
hist(col_na, main = 'Histogram of NA percentages in features')
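
If some features do turn out to be mostly missing, the filtering step mentioned above could look something like the sketch below. It uses a made-up toy matrix in place of training_data, and the 20% cutoff is a hypothetical value chosen only for illustration, not a mixOmics recommendation.

```r
# Toy data standing in for training_data (samples in rows, features in columns).
set.seed(1)
toy <- matrix(rnorm(50), nrow = 10, ncol = 5,
              dimnames = list(NULL, paste0("gene", 1:5)))
toy[1:6, "gene2"] <- NA   # 6 of 10 samples missing
toy[1, "gene4"] <- NA     # 1 of 10 samples missing

# Percentage of missing values per feature.
na_pct <- 100 * colMeans(is.na(toy))

# Keep only features at or below a hypothetical missingness cutoff.
max_na_pct <- 20
toy_filtered <- toy[, na_pct <= max_na_pct, drop = FALSE]

colnames(toy_filtered)
```

The same two lines (computing na_pct and subsetting the columns) should carry over to a real data matrix, with the cutoff chosen to suit the data.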

In any case, this is a problem we need to fix and/or inform users about. Would you be able to send us your data so we can reproduce and fix it, please?

You can click on this text to send us an email.
Alternatively, you can right-click on the above text and choose ‘Copy Email Address’

Best,

Al

Thanks a lot @aljabadi for your reply.
Regarding missing values, there are none in the data that I have used.
I have shared the data along with an R script for reproducing the observation, and I am hopeful that the team will be able to resolve the issue soon. In the meantime, could you please advise whether I should work with all 995 features, or sort them by their loading weights and take the top 743 features for downstream analysis?
Regards
Srikant

Hi @srikantverma,

Thanks for the email with the data and fully reproducible and well-described code!

More variables appear to be selected because some of the feature loadings are so small (< 1e-14) that they are indistinguishable from 0 for some R functions. The underlying cause is that the tune function is not quite optimal at recommending the number of features to keep, which is something we are working on at the moment. What I recommend is to use the plot function on the tune object to see what the optimal number of features is, especially on the first and second components, where the gain in accuracy from selecting many more features can be minimal compared to the added complexity. The optimal number could be lower than what the algorithm recommends. In general, the recommended parameters are advisory and depend on other hyper-parameters (folds, repeats, etc.), so it is important to use the diagnostic summaries and visualisations provided throughout your analysis.
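
To make the near-zero loadings point concrete, here is a small sketch that uses a made-up loading vector rather than a real model. It assumes the loadings for one component are available as a named numeric vector (for a fitted model I would expect something like splsda.res.final$loadings$X[, 1], but treat that as an assumption), and it borrows the 1e-14 magnitude mentioned above as the tolerance.

```r
# Made-up loadings for one component: three real signals, two values that are
# numerically nonzero but negligible, and one exact zero.
loadings_comp1 <- c(geneA = 0.70, geneB = -0.50, geneC = 3e-16,
                    geneD = 0.10, geneE = -8e-17, geneF = 0)

# A strict "not equal to zero" test also counts the negligible values.
n_strict <- sum(loadings_comp1 != 0)

# Applying a tolerance keeps only the features that actually contribute.
tol <- 1e-14
kept <- loadings_comp1[abs(loadings_comp1) > tol]
n_tol <- length(kept)

# Rank the retained features by absolute loading, largest first.
ranked <- names(kept)[order(abs(kept), decreasing = TRUE)]

# Take the top n features by absolute loading (n = 2 is a hypothetical choice).
top_n <- head(ranked, 2)

print(c(strict = n_strict, with_tolerance = n_tol))
print(ranked)
```

Ranking by absolute loading and keeping the top entries is essentially the "sort by loading weights" option raised earlier in the thread; the tolerance step just shows why the strict count can be larger than the requested number.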

Hope it helps

Al

Thanks a lot @aljabadi !
Your recommendation will certainly help the entire mixOmics user community.