How to determine test.keepX

Hello, how can we determine test.keepX for a DIABLO analysis? I plan to integrate metabolome and microbiome data for 100 adults and would like to have robust results at the end. I read in another thread that when you would like to identify a minimal signature for prediction only, a small grid is obviously preferred (up to 10 or maybe 20), but when you want to interpret biology (e.g. by looking into protein-protein interactions), you should use a larger grid (up to 50, 100 or even 150 in rare cases). Based on this, do I go with the first option below if I want to interpret biology, and also because fine grids are suggested to provide very precise results? Also, is this related in any way to sample size or the number of features in the two omics datasets? If so, what would you advise if my sample size is 100 and I have 100 metabolites and 300 microbiome features? Thanks!

test.keepX <- list(microbiome = 1:100,
                   metabolome = 1:100)
OR
test.keepX <- list(microbiome = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
                   metabolome = c(5:9, seq(10, 18, 2), seq(20, 30, 5)))

Hello again @DJT,

The answer to your question really depends on whether a model using a larger number of features (e.g. 100) performs significantly better than a model with a smaller set of features (e.g. 10). If the improvement in predictive performance is negligible, then the ‘simpler’ model is certainly the better option.
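As a rough illustration of "prefer the simpler model when the gain is negligible", here is a minimal base R sketch. It uses a one-standard-deviation heuristic that I am swapping in for illustration, which is not necessarily what mixOmics does internally, and the error rates are made-up numbers, not results from any real data.

```r
# Made-up error rates for illustration only; in practice these would come
# from repeated cross-validation of your own models.
keepX    <- c(5, 10, 20, 50, 100)
mean_err <- c(0.30, 0.22, 0.21, 0.20, 0.20)  # mean error per keepX value
sd_err   <- c(0.04, 0.03, 0.03, 0.03, 0.03)  # spread across repeats

# Find the best-performing keepX, then accept the smallest keepX whose
# mean error is within one standard deviation of that best error.
best      <- which.min(mean_err)
threshold <- mean_err[best] + sd_err[best]
chosen    <- min(keepX[mean_err <= threshold])

print(chosen)
```

With these invented numbers, the gain from going beyond a handful of features is smaller than the noise between repeats, so the smaller signature would be kept.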

> when you would like to identify a minimal signature for prediction only, a small grid is obviously preferred

This is data dependent. If all the features in your data are highly correlated with one another, then a grid with a smaller maximum value is the better choice, so as to avoid including spurious relationships in your model.

> do I go with the first option below if I want to interpret biology and also because fine grids are suggested to provide very precise results?

This may not be the most useful answer, but it’s hard to say without an idea of what your data looks like. My suggestion would be to generate models using keepX grids with both low and high maximum values. Then, test these models on the same novel samples and evaluate the differences in performance.

When you tune these models, make sure you use an adequately high number of repeats (e.g. nrepeat = 100). As this may take a lot of time, I’d also suggest doing the tuning over multiple steps, increasing the resolution and decreasing the range of the grid at each iteration. For example, start with test.keepX = seq(10, 150, 10) and, based on the output (let’s say it selects 50), tune again with test.keepX = seq(30, 70, 5); rinse and repeat.
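The coarse-to-fine idea can be sketched in base R as below. refine_grid() is a hypothetical helper of my own (it is not part of mixOmics), and the "selected" values are made up for illustration. The grids are capped at the number of features per block from the question (300 microbiome features, 100 metabolites), since keepX cannot exceed the number of features available.

```r
# Hypothetical helper (not part of mixOmics): build a finer keepX grid
# centred on the value chosen in the previous tuning round.
refine_grid <- function(selected, old_step, n_features,
                        halfwidth_steps = 2, shrink = 2) {
  new_step <- max(1, old_step %/% shrink)                  # finer resolution
  lower <- max(1, selected - halfwidth_steps * old_step)   # narrower range,
  upper <- min(n_features, selected + halfwidth_steps * old_step)  # clamped
  seq(lower, upper, by = new_step)
}

# Round 1: coarse grids, capped at the number of features in each block.
coarse <- list(microbiome = seq(10, 150, by = 10),
               metabolome = seq(10, 100, by = 10))

# Suppose round 1 selected 50 microbiome and 30 metabolome features
# (made-up values for illustration only).
fine <- list(microbiome = refine_grid(50, 10, 300),
             metabolome = refine_grid(30, 10, 100))

print(fine)
```

Each list would be passed as test.keepX to the tuning call for that round (I am assuming tune.block.splsda() here), and the values it reports as chosen would feed the next call to refine_grid(). Stop once the step size reaches 1 or the selected values no longer change between rounds.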

> is this related in any way to sample size or the number of features in the two omics datasets

Not especially. That said, if your model selects more features than you have samples, that’s a decent indicator that something may have gone wrong, though this is not a hard and fast rule.

Sorry that the info I’ve given was a bit vague but I hope it’s helped somewhat.

I’ll keep my eyes on your posts in the coming days,
Max.

This was extremely helpful. Thank you so much Max!!