How to deal with varying number of features and high feature correlation in DIABLO?

sergiomosquim · February 19, 2024, 8:57am

Hi,

I am working with three datasets (proteomics, PTM and transcriptomics) with over 100 samples. I am interested in building a binary or multiclass classification model integrating these three data types. After reading some of the suggestions and running a few rounds of tuning on different models, I came across a few issues and would appreciate some insight.

1. When selecting the optimal number of features, I run 5-fold CV with 100 repeats. I start with a coarse grid, say seq(1,301,50), to roughly estimate where further tuning should focus. If the suggested keepX values differ widely, for instance 1, 50 and 300 for different blocks, what is the best way to proceed? My first thought is to potentially sacrifice some model performance and use, say, seq(1,71,10) for a second round and go from there.
2. My second question relates to the first. I suspect one of my issues is high correlation between features, especially in the PTM data. Since I would like to keep the resolution of the data, is there a good way to handle this? On all three datasets, I already filter prior to integration, keeping features with, say, less than 30% NA values and in the top 30% by variance. So I am working with around 3,000 features in each block, down from an initial 20k.

Thank you in advance for the assistance.

sergiomosquim · February 21, 2024, 8:29am

Sorry. For point 1, I meant largely different keepX values for different components within the same block. For instance, I end up with 4 and 100 for components 1 and 2 respectively on the PTM block. Given that the other blocks seem more stable, I wonder if it could be a result of high correlation within the block.
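The pre-filtering and tuning grids described in the question can be sketched in base R as follows. This is only an illustration on simulated data: `filter_block()` and `make_block()` are hypothetical helpers, the block sizes are made up, and the keepX values other than the 4 and 100 quoted for the PTM block are placeholders.

```r
# Hypothetical helper: drop features with too many NAs, then keep the
# most variable fraction of what remains.
filter_block <- function(x, max_na = 0.3, top_var = 0.3) {
  na_frac <- colMeans(is.na(x))                # fraction of NA per feature
  x <- x[, na_frac < max_na, drop = FALSE]
  v <- apply(x, 2, var, na.rm = TRUE)          # per-feature variance
  n_keep <- round(top_var * ncol(x))
  keep <- order(v, decreasing = TRUE)[seq_len(n_keep)]
  x[, sort(keep), drop = FALSE]                # preserve original column order
}

# Simulated stand-in data: 100 samples x 1200 features, about 5% missing
make_block <- function(n = 100, p = 1200) {
  x <- matrix(rnorm(n * p), n, p,
              dimnames = list(NULL, paste0("f", seq_len(p))))
  x[sample(length(x), round(0.05 * length(x)))] <- NA
  x
}

set.seed(1)
X <- list(proteomics = make_block(),
          ptm = make_block(),
          transcriptomics = make_block())
X_filt <- lapply(X, filter_block)

# Round 1: the coarse grid from the question, capped at the block size
coarse <- seq(1, 301, 50)
test_keepX_coarse <- lapply(X_filt, function(x) coarse[coarse <= ncol(x)])

# Round 2: a narrower grid, again one vector per block
fine <- seq(1, 71, 10)
test_keepX <- lapply(X_filt, function(x) fine[fine <= ncol(x)])

# keepX for a fitted model takes one value per component within each block,
# e.g. 4 and 100 for components 1 and 2 of the PTM block (other values are
# placeholders):
keepX_example <- list(proteomics = c(10, 10),
                      ptm = c(4, 100),
                      transcriptomics = c(10, 10))

# With mixOmics loaded, and an outcome factor Y and a design matrix defined
# (both assumed here), the grid would be passed along the lines of:
# tune.block.splsda(X_filt, Y, ncomp = 2, test.keepX = test_keepX,
#                   design = design, validation = "Mfold",
#                   folds = 5, nrepeat = 100)
```

The grids are kept as one named vector per block so that each block can be tuned over a different range if the first round suggests very different optima.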

kimanh.lecao · February 29, 2024, 9:41pm

hi @sergiomosquim,

1 - Yes, sometimes a small number of selected variables indicates that the classification is easy with only a top few. It is best to look at the classification error rates on the first component (using perf() after you fit the block.splsda with some chosen keepX values) to see if this is the case. Usually a large number of variables means there is a lot of noise and no clear minimum in the error rate. You have to think about how you would ‘like’ the results to look, a small or a large signature, and choose your grid accordingly (or not tune at all; you can still estimate the performance with perf() on your final model). Also, since you mention it in point 2, perhaps the NA values also create some instability. Have a look at the stability output from perf() to see what is happening in relation to the number of variables selected.

2 - I think it is fine to keep 3,000 features. Stability will tell you a lot about the correlation between these features and whether the method selects some PTM variables interchangeably.

Kim-Anh
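As a complement to the stability output, one rough way to see which features might be selected interchangeably is to group features by pairwise correlation within a block. The sketch below uses base R only and simulated data; it is not a mixOmics function, the 0.9 threshold is an arbitrary assumption, and the feature names are hypothetical.

```r
set.seed(42)
n <- 100

# Simulated stand-in for a PTM block: three sites driven by one latent
# signal, plus five independent features
latent <- rnorm(n)
ptm <- cbind(
  siteA = latent + rnorm(n, sd = 0.1),
  siteB = latent + rnorm(n, sd = 0.1),
  siteC = latent + rnorm(n, sd = 0.1),
  matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("other", 1:5)))
)

# Absolute pairwise correlations, tolerating missing values
cor_mat <- abs(cor(ptm, use = "pairwise.complete.obs"))

# Average-linkage clustering on 1 - |r|, cut at a height of 0.1, which
# roughly corresponds to an average |r| of 0.9 between merged groups
tree <- hclust(as.dist(1 - cor_mat), method = "average")
clusters <- cutree(tree, h = 1 - 0.9)

# Keep only groups containing more than one feature
groups <- split(colnames(ptm), clusters)
multi <- groups[lengths(groups) > 1]
print(multi)
```

If several features in the same group each show low selection frequency in the stability output, while at least one of them is picked in most folds, that would be consistent with the method choosing among correlated variables interchangeably.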
