With these sorts of dimension reduction techniques, there is a constant balancing act between model performance and model complexity. Generally, increasing the model complexity (i.e. feature count) will increase model performance. You must draw a line as to how complex you are willing to let your model become, and where that line sits depends on the biological question at hand. Unfortunately, in some cases (yours being one of them, it seems) restricting the model complexity will directly reduce your model's performance.
My suggestion would be: decide on an absolute upper limit on the number of features you're willing to include for each component (e.g. 50) and don't go beyond it. By the sounds of it, your protein data requires many of its features to discriminate its classes - but if interpretability is your priority and you can't compromise on it, then enforce this hard limit.
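One way to enforce such a cap is to build it into the tuning grid itself, so the model never sees candidate `keepX` values above your limit. A minimal sketch, assuming a standard DIABLO workflow - `X` (a named list of blocks), `Y` and `design` are placeholders for your own data:

```r
library(mixOmics)

max.features <- 50  # your chosen absolute upper limit per component

# Candidate keepX grids capped at max.features, one per block
capped.grids <- lapply(X, function(block) seq(5, max.features, 5))

tune.res <- tune.block.splsda(
  X = X, Y = Y, ncomp = 2, design = design,
  test.keepX = capped.grids,
  validation = "Mfold", folds = 5, nrepeat = 100
)
tune.res$choice.keepX  # selected keepX, guaranteed <= max.features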
My second suggestion involves quoting myself from this post:
Ensure when you tune these models that you use an adequately high number of repeats (i.e. `nrepeat = 100`). As this may take a lot of time, I'd also suggest doing the tuning over multiple steps - increasing the resolution and decreasing the range of the grid at each iteration. E.g. start with `test.keepX = seq(10, 150, 10)`, and based on the output (let's say it selects 50), tune again with `test.keepX = seq(30, 70, 5)`; rinse and repeat.
In your application of this, you increased the upper bound while maintaining the same resolution. If your model selects `keepX = 50` from `{20, 30, 40, 50}`, then in the next run, rather than using `{50, 60, 70}`, consider trying `{40, 42, 44, 46, 48, 50}` or something of the like. The aim for each iteration of `keepX` tuning is to decrease the total range of values while increasing the resolution (i.e. decreasing the interval between each value).
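A small helper can make this shrink-and-refine step mechanical. This is purely illustrative base R (not a mixOmics function), and the shrink factor of 5 is an arbitrary choice:

```r
# Given the keepX chosen in the previous round and the previous grid's
# step size, build a narrower grid at finer resolution around it.
refine.grid <- function(chosen, prev.step) {
  new.step <- max(1, prev.step %/% 5)  # e.g. step of 10 becomes step of 2
  seq(max(new.step, chosen - prev.step), chosen + prev.step, by = new.step)
}

refine.grid(50, 10)
# 40 42 44 46 48 50 52 54 56 58 60
```

Each call narrows the search window to one old step either side of the previous winner, so after a few rounds you converge on a precise `keepX` without ever evaluating a huge grid in one go.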
As per your last point, I'll refer to what I said above. However much you want to optimise the model's performance, the context of your study may force you to impose limitations on it. If you feel ~300 proteins is too many, then restrict the total number of features your model can use. Just know that, depending on your data, this may come at the cost of some of your model's discriminative ability.
I might also suggest exploring the correlations between your features (you could use `plotVar()`). Potentially, try removing features that are highly correlated with another feature and rerunning the DIABLO framework.
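For instance, something along these lines - `diablo.res` stands in for your fitted `block.splsda` object, the block name `proteins` and the 0.9 threshold are hypothetical choices for illustration:

```r
library(mixOmics)

# Show only variables correlated with a component above the cutoff
plotVar(diablo.res, cutoff = 0.5)

# One simple pre-filter: within a block, drop one feature from each pair
# whose absolute pairwise correlation exceeds 0.9, then rerun DIABLO
cors <- cor(X$proteins)
high <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)
if (nrow(high) > 0) {
  X$proteins <- X$proteins[, -unique(high[, "col"]), drop = FALSE]
}
```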
Let me know if this clears things up.
Cheers,
Max.