How to determine test.keepX

Hello, how can we determine test.keepX for a DIABLO analysis? I plan to integrate metabolome and microbiome data for 100 adults and would like to have robust results at the end. I read in another thread that when you would like to identify a minimal signature for prediction only, a small grid is obviously preferred (up to 10 or maybe 20), but when you want to interpret biology (e.g. by looking into protein-protein interactions), use a larger grid (up to 50, 100 or even 150 in rare cases). Based on this, do I go with the first option below if I want to interpret biology and also because fine grids are suggested to provide very precise results? Also, is this related in any way to sample size or the number of features in the two omics datasets? If so, what would you advise if my sample size is 100 and I have 100 metabolites and 300 microbiome features? Thanks!

test.keepX <- list(microbiome = 1:100,
                   metabolome = 1:100)
OR
test.keepX <- list(microbiome = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
                   metabolome = c(5:9, seq(10, 18, 2), seq(20, 30, 5)))

Hello again @DJT,

The answer to your question depends on whether a model using a larger number of features (e.g. 100) performs significantly better than a model with a smaller set of features (e.g. 10). If the improvement in predictive performance is negligible, then the 'simpler' model is certainly the better option.

when you would like to identify a minimal signature for prediction only, a small grid is obviously preferred

This is data dependent. If all the features in your data are highly correlated with one another, then a grid with a smaller maximum value is the better choice, to avoid including spurious relations in your model.

do I go with the first option below if I want to interpret biology and also because fine grids are suggested to provide very precise results?

This may not be the most useful answer, but it's hard to say without an idea of what your data looks like. My suggestion would be to generate models using keepX grids with both low and high maximum values. Then, test these models on the same novel samples and evaluate the differences in performance.

When you tune these models, make sure you use an adequately high number of repeats (i.e. nrepeat = 100). As this may take a lot of time, I'd also suggest tuning over multiple steps, increasing the resolution and decreasing the range of the grid at each iteration. E.g. start with test.keepX = seq(10, 150, 10) and, based on the output (let's say it selects 50), tune again with test.keepX = seq(30, 70, 5); rinse and repeat.
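The multi-step tuning described above could be sketched roughly as follows. This is only an illustration in base R: `refine_grid` is a hypothetical helper (not part of mixOmics), and the selected value of 50 is the made-up example from this post rather than a real tuning result. In practice each grid would be passed as test.keepX to your tuning call, and the selected value read from its output.

```r
# Hypothetical helper: build a narrower, finer grid centred on the keepX
# value selected in the previous tuning round.
refine_grid <- function(selected, half_width, step, lower = 1, upper = Inf) {
  lo <- max(lower, selected - half_width)  # never go below 'lower'
  hi <- min(upper, selected + half_width)  # never exceed 'upper' (e.g. number of features)
  seq(lo, hi, by = step)
}

# Round 1: coarse grid over a wide range
coarse <- seq(10, 150, 10)

# Suppose round 1 selected 50 (assumed value, for illustration only).
# Round 2: halve the step and search 20 either side of the selected value.
fine <- refine_grid(selected = 50, half_width = 20, step = 5)
print(fine)
```

The `upper` argument is there so the grid can be capped at the number of features in a block (e.g. `upper = 100` for 100 metabolites).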

is this related in any way to sample size or the number of features in the two omics datasets

Not especially. If your model selects more features than you have samples, that's a decent indicator that something may have gone wrong, though this is not a hard and fast rule.

Sorry that the info I've given was a bit vague, but I hope it's helped somewhat.

I'll keep my eyes on your posts in the coming days,
Max.

This was extremely helpful. Thank you so much Max!!

Hi Max,

I wanted to ask a question on your above comment on feature selection by the model. You mentioned

Not especially. If your model selects more features than you have samples, that's a decent indicator that something may have gone wrong, though this is not a hard and fast rule.

However, does this not depend on the nature of the data? E.g. if one has a dataset of 30 RNA-seq samples from three conditions (a rather decent size), would you say that selecting 50 genes for integration with another modality is too many?

Hi @DJT,

Based on this, do I go with the first option below if I want to interpret biology and also because fine grids are suggested to provide very precise results?

You can try both if you have enough compute time (potentially, if you don't use enough nrepeat in the cross-validation, results might differ). The answer is yes, although a fine grid does not necessarily mean you will get more accurate results, only that the search for the optimum is more thorough.
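To get a feel for the compute time involved, the sketch below counts how many keepX combinations each of the two grids from the question would produce per component. It assumes the tuning evaluates every combination of values across the two blocks; `n_combinations` is a hypothetical helper, not a mixOmics function.

```r
# The two candidate grids from the question
fine_grid <- list(microbiome = 1:100,
                  metabolome = 1:100)
coarse_grid <- list(microbiome = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
                    metabolome = c(5:9, seq(10, 18, 2), seq(20, 30, 5)))

# Hypothetical helper: number of keepX combinations across blocks,
# assuming every combination is evaluated
n_combinations <- function(grid) prod(lengths(grid))

n_fine <- n_combinations(fine_grid)
n_coarse <- n_combinations(coarse_grid)
print(c(fine = n_fine, coarse = n_coarse))
```

Under that assumption, each combination is then cross-validated nrepeat times, so the gap in run time between the two grids grows with the number of repeats.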

Also, is this related in any way to sample size or the number of features in the two omics datasets? If so, what would you advise if my sample size is 100 and I have 100 metabolites and 300 microbiome features?

No, the number of features to select is not related to sample size. We are using a soft thresholding approach, not a proper lasso.
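As a rough illustration of why the number of selected features is set by keepX rather than by sample size, here is a simplified soft-thresholding sketch in base R. It is not the mixOmics implementation: `soft_threshold` and `keep_top` are hypothetical helpers, the loading values are made up, and ties between absolute values are ignored.

```r
# Soft thresholding: shrink every value towards zero by lambda,
# setting anything with absolute value <= lambda to exactly zero
soft_threshold <- function(a, lambda) sign(a) * pmax(abs(a) - lambda, 0)

# Hypothetical helper: pick lambda so that exactly keepX entries stay
# non-zero (assumes no ties among the absolute values)
keep_top <- function(a, keepX) {
  if (keepX >= length(a)) return(a)
  lambda <- sort(abs(a), decreasing = TRUE)[keepX + 1]
  soft_threshold(a, lambda)
}

loadings <- c(0.9, -0.1, 0.4, 0.05, -0.7)  # made-up loading vector
sparse <- keep_top(loadings, keepX = 2)
print(sparse)
print(sum(sparse != 0))  # number of features retained
```

In this sketch the threshold is derived from keepX alone, so the number of retained features does not depend on how many samples were used to estimate the loadings.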

Kim-Anh