How to deal with varying number of features and high feature correlation in DIABLO?


I am working with three datasets (proteomics, PTM and transcriptomics) measured on over 100 samples. I am interested in building a binary or multiclass classification model that integrates these three data types. After reading some of the suggestions and running a few rounds of tuning on different models, I came across a few issues and would appreciate some insight.

  1. When selecting the optimal number of features, I run 5-fold CV with 100 repeats. I start with a coarse grid, say seq(1,301,50), to roughly estimate where further tuning should focus. When the suggested keepX values are very different, for instance 1, 50 and 300 for different blocks, what is the best way to proceed? My first thought is to potentially sacrifice some model performance and stick with, say, seq(1,71,10) for a second round and go from there.

  2. My second question relates to the first one. I suspect one of the issues I have is high correlation between features, especially in the PTM data. As I would like to keep the resolution of the data, is there a good way to work around this? On all three datasets, I already filter prior to integration, keeping features with, say, less than 30% NA values and then the top 30% by variance. So I am working with around 3000 features in each block, down from an initial 20k.

Thank you in advance for the assistance.

Sorry. For point 1, I meant largely different keepX values for different components within the same block. For instance, I end up with 4 and 100 for components 1 and 2 respectively on the PTM block. Given that the other blocks seem more stable, I wonder if it could be a result of high correlation within the block.

hi @sergiomosquim,

1 - Yes, a small number of selected variables sometimes indicates that the classification is easy with only the top few. The best check is to look at the classification error rates on the first component (using perf() after you fit block.splsda with some chosen keepX values) to see if this is the case. A large number of selected variables usually means there is a lot of noise and no clear minimum in the error rate. You have to think about how you would ‘like’ the results to look, a small or a large signature, and choose your grid accordingly (or not tune at all; you can still estimate the performance of your final model with perf()). Also, since you mention them in point 2, perhaps the NA values create some instability too. Have a look at the stability output from perf() to see what is happening in relation to the number of variables selected.
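As a rough illustration of that workflow, here is a minimal sketch that fits block.splsda with hand-picked keepX values and then runs perf(). Everything specific in it is an assumption: it uses the breast.TCGA example data shipped with mixOmics as a stand-in for the three blocks in the question, the keepX values and the 0.1 design weight are arbitrary, and nrepeat is kept small so it runs quickly.

```r
library(mixOmics)

# Stand-in data: the breast.TCGA training set from mixOmics
# (replace with your own named list of samples x features matrices).
data(breast.TCGA)
X <- list(mirna   = breast.TCGA$data.train$mirna,
          mrna    = breast.TCGA$data.train$mrna,
          protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype

# Weakly connected design between blocks (0.1 is an arbitrary choice).
design <- matrix(0.1, nrow = length(X), ncol = length(X),
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

# Hand-picked keepX: one value per component for each block.
# These numbers are illustrative, not tuned.
keepX <- list(mirna = c(10, 10), mrna = c(4, 100), protein = c(10, 10))

fit <- block.splsda(X, Y, ncomp = 2, keepX = keepX, design = design)

# Repeated 5-fold CV on the fitted model (increase nrepeat for real use).
set.seed(1)
perf.res <- perf(fit, validation = "Mfold", folds = 5, nrepeat = 5)

# Error rates per block and component, then the combined weighted vote.
print(perf.res$error.rate)
print(perf.res$WeightedVote.error.rate)

# Selection stability across folds; inspect the structure first, since
# its exact nesting may differ between mixOmics versions.
str(perf.res$features$stable, max.level = 3)
```

If the error rate on component 1 is already low with a handful of variables, a small keepX there is plausible; low stability values for the variables selected on a component would instead suggest that the selection is moving between folds.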

2 - I think it is fine to keep 3,000 features. Stability will tell you a lot about the correlation between these features and whether the method selects some PTM variables interchangeably.
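Alongside the stability output, one simple way to see how much redundancy a block contains is to list highly correlated feature pairs directly. The sketch below is not part of the suggestion above, only a complementary check under stated assumptions: the data are simulated, the feature names are made up, and the 0.9 cutoff is arbitrary.

```r
set.seed(42)
n <- 100

# Simulated stand-in for a PTM block: three sites driven by one shared
# signal (e.g. several sites on the same protein) and two independent ones.
shared <- rnorm(n)
ptm <- cbind(
  siteA = shared + rnorm(n, sd = 0.1),
  siteB = shared + rnorm(n, sd = 0.1),
  siteC = shared + rnorm(n, sd = 0.1),
  siteD = rnorm(n),
  siteE = rnorm(n)
)

# Sprinkle in a few missing values, as in the filtered real data.
ptm[sample(length(ptm), 20)] <- NA

# Pairwise correlations, using all complete pairs of observations.
r <- cor(ptm, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and flag |r| above the cutoff.
cutoff <- 0.9
high <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
pairs <- data.frame(
  feature1 = rownames(r)[high[, "row"]],
  feature2 = colnames(r)[high[, "col"]],
  r        = r[high]
)
print(pairs)
```

Features that appear in many such pairs are the ones most likely to be selected interchangeably across cross-validation folds, which is what a low stability value would reflect.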