Confusion about feature selection with timeOmics multi-block sPLS

I am trying to implement timeOmics feature selection with sparse multi-block sPLS, but I'm confused about how the feature selection works. I'm aware that it selects the optimal number of features using lasso and the silhouette coefficient, but I noticed that when I change the test.list.keepX values in the grid, different results come out. My assumption was that it selected the features with the largest loading values?

I also have questions regarding some of the results from the published examples in your papers. For the seasonality study, I noticed that you did something similar with feature selection, but the plots do not show the selected number of features from feature selection. The selected number of RNA features was 34 in the negative cluster, but that is not what seems to be plotted in the sparse expression plot. Is there a reason for this?

Hello,

Thanks for using timeOmics!

The feature selection step in timeOmics relies on the sparse (lasso) methods implemented in mixOmics. What changes is the optimisation process used to determine the best keepX: in timeOmics, we track the variation of the silhouette coefficient across the grid of values to be tested, rather than using the cross-validation procedure from mixOmics.
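To make that concrete, here is a minimal sketch based on the timeOmics vignette. The `data` list, `Y` matrix, and grid values are placeholders, and the exact arguments of tuneCluster.block.spls may differ in your installed version (check its help page). The point is that the silhouette is evaluated for every combination in the grid, so changing or extending the grid can change which keepX is retained, which is why you observed different results:

```r
library(timeOmics)
library(mixOmics)

# hypothetical grid of keepX values per block; a different grid can lead
# to a different silhouette optimum, hence different selected features
test.list.keepX <- list("RNA"  = c(10, 20, 30, 40),
                        "prot" = c(5, 10, 15))

# tune keepX by maximising the silhouette coefficient over the grid
tune.res <- tuneCluster.block.spls(X = data, Y = Y,
                                   test.list.keepX = test.list.keepX,
                                   ncomp = 2)
tune.res$choice.keepX  # keepX retained for the final model

# final sparse model fitted with the selected keepX
final.block <- block.spls(X = data, Y = Y, ncomp = 2,
                          keepX = tune.res$choice.keepX)
```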

Cluster assignment is similar to what the selectVar function proposes: each feature is assigned based on its maximum absolute loading value.
With sparse methods, some loadings are set to 0 by the lasso penalty; this step is performed within mixOmics.
As you have noticed, there can therefore be a difference between the number of features requested and the number actually returned.
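As an illustration of this assignment rule (a sketch of the principle, not the package internals), the cluster of each selected feature can be recovered from the loading matrix of a fitted model: the component carrying the maximum absolute loading gives the cluster number, and the sign of that loading gives its direction, following the 1/-1, 2/-2 labelling used by timeOmics. Features whose loadings were all set to 0 by the lasso are simply dropped:

```r
# loadings of the RNA block from a fitted block.spls model
# (rows = features, columns = components)
loadings.rna <- final.block$loadings[["RNA"]]

# lasso sets entire rows to 0 for non-selected features: drop them
selected <- loadings.rna[rowSums(abs(loadings.rna)) > 0, , drop = FALSE]

# assign each feature to the component with its maximum absolute loading,
# signed by the direction of that loading (clusters 1, -1, 2, -2, ...)
comp.max <- apply(abs(selected), 1, which.max)
sign.max <- sign(selected[cbind(seq_len(nrow(selected)), comp.max)])
cluster  <- comp.max * sign.max
table(cluster)  # number of selected features per cluster
```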

This difference between requested and returned feature numbers has already been noticed in mixOmics and could be related to this post:

However, concerning the plot, the 34 RNA features are indeed present but are hidden by others because they have similar scaled profiles. To convince yourself, you can run the code and dig into the object returned by getCluster(final.block, user.block = "RNA") to find the 34 RNA features.
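For instance, something along these lines (assuming the getCluster output contains molecule and cluster columns, as in the timeOmics documentation) lets you verify that all 34 RNA features are there:

```r
rna.cluster <- getCluster(final.block, user.block = "RNA")
table(rna.cluster$cluster)   # number of RNA features per cluster

# the negative cluster on component 1
neg1 <- subset(rna.cluster, cluster == -1)
nrow(neg1)      # 34 RNA features selected in this cluster
neg1$molecule   # their identifiers, even if overplotted in the figure
```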

Regards,
Antoine

OK, so for scaling and feature selection, what would be the best way to test the number of features, since the results may differ depending on the grid of values specified? I'm also unsure whether scaling is detrimental to the feature selection and the biological interpretation, since the selected features will also change.

I would say it's important to check data standardisation before integration. As you noticed, it affects the methods. You can do it beforehand, or let the method apply the scale() function block by block.
However, in certain scenarios, I don't think it's a bad idea to skip standardisation (e.g. to preserve the high expression of certain features of interest).
Knowing that, it's up to you to make the most of your data.
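In practice the two options look like this (a sketch with a placeholder `data` list of blocks; scale = TRUE is the default in the mixOmics methods):

```r
library(mixOmics)

# (a) standardise each block yourself before integration:
# scale() centres and scales every feature (column) within a block
data.scaled <- lapply(data, scale)

# (b) or let the method standardise each block internally
res <- block.pls(X = data, indY = 1, ncomp = 2, scale = TRUE)
```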

OK, thanks for the response. I believe that scaling (standardisation) is the best option in my case, since I'm using transcriptomic datasets consisting of normalised read counts, and I have noticed that the variance is driven by very highly expressed features in the PCA of the separate datasets. I'm also assuming that when running block.pls with transcriptomics and proteomics, which are measured on different scales, the blocks would need to be scaled before computing the covariance? Am I thinking correctly about applying scaling with block.pls?

As I mentioned earlier, unless you have a good reason not to, I would normalise multi-omics data.
Scaling in block.spls is performed for each block separately.
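A quick toy check of the covariance point (purely illustrative numbers): without block-wise standardisation, the covariance between a counts-scale feature and an intensity-scale feature inherits the larger unit.

```r
set.seed(1)
rna  <- rexp(20, rate = 1/1000)  # count-like values, large magnitudes
prot <- rnorm(20)                # intensity-like values, small magnitudes

cov(rna, prot)                # dominated by the scale of the RNA units
cov(scale(rna), scale(prot))  # comparable after standardisation
```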