DIABLO: Handling high dimensionality and tuning keepX

Hi all,

I’m currently working on multi-omics integration using DIABLO. My datasets include whole-genome sequencing data (pruned, N ≈ 200k SNPs), metabolites (N = 150), and proteins (N = 4000). I’m not sure whether DIABLO can handle high-dimensional genotype data. In my case, even after pruning, we still have ~200k SNPs. Would that be a problem?

My second question is about tuning keepX. Do you typically expect the number of selected features to be proportional to the total number of features in the input dataset? In my case #metabolites < #proteins < #SNPs, so I would expect more SNPs than proteins to be selected. However, the model with the lowest BER suggests #proteins = 70 and #SNPs = 40. What do you think? Would that be a problem?

Thanks!

Yuqing

G’day @Yuqing,

Regarding your first question, the DIABLO model will be able to handle that many dimensions; the mixOmics package was designed with massive feature counts in mind. I will say, however, that the runtime (especially of the perf() function) may be extremely long. I would recommend exploring the BPPARAM parameter to parallelise this calculation and save some time.
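To make that concrete, here is a minimal sketch of parallelising perf(). The object name `diablo_model` and the fold/worker counts are illustrative, not from your setup:

```r
library(mixOmics)
library(BiocParallel)

# 'diablo_model' is assumed to be a fitted block.splsda() object
# Spread the repeated cross-validation in perf() across 4 workers
perf_res <- perf(diablo_model,
                 validation = "Mfold", folds = 5, nrepeat = 10,
                 BPPARAM = MulticoreParam(workers = 4))  # SnowParam() on Windows

plot(perf_res)  # BER per component and distance metric
```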

The number of input features will have little to no impact on the number of selected features; the latter is dictated purely by the performance of the models. One possible explanation is that a higher proportion of your SNP features are highly correlated with one another compared with your protein features. Hence, fewer SNPs need to be selected to achieve optimal performance.

There may be a totally different explanation for what causes this in your data specifically. A discrepancy between the ratio of input features and the ratio of selected features should not be a point of concern in and of itself.

Cheers,
Max.


Thanks Max for your quick response!

A follow-up question on the number of selected features: my outcome has only 2 categories, cases and controls. In this case, I should look at component 1 only, right? perf() shows that the BER (Mahalanobis distance) decreases from 1 component to 3 components. Do you suggest we also look at the second and third components? Thanks.

Regards,
Yuqing

While it is a general guideline that you will only need K-1 components to describe your model (where K is the number of classes - in your case 2), this is not a hard and fast rule. Again, there are a variety of different explanations as to what might cause this.

Without knowledge of your data, I can’t guarantee that the above output is accurate, though I would trust the perf() function. One thing I will warn against is a low number of nrepeat within the perf() function. If you are constructing your finalised model, nrepeat = 100 is the minimum. If the model still suggests 3 components with this number of repeats, then use three components.
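As a sketch, rerunning perf() with that many repeats might look like the following (again, `diablo_model` and the fold count are illustrative assumptions):

```r
library(mixOmics)

# 'diablo_model' is assumed to be a block.splsda() fit with a generous ncomp
perf_stable <- perf(diablo_model,
                    validation = "Mfold", folds = 5,
                    nrepeat = 100)   # 100 repeats for a stable BER estimate

perf_stable$choice.ncomp   # suggested number of components per distance metric
plot(perf_stable)          # BER against the number of components
```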

Thanks for your suggestion. I’ll try increasing nrepeat and see if the final model still suggests 3 components. If so, I will end up with 200 to 300 proteins selected across the 3 components. I tried to tune the keepX parameter using the multi-stage approach you suggested in another post: I started with {20, 30, 40, 50}, and 50 had the lowest BER; then I tried {50, 60, 70}, and 70 had the lowest BER; so I then tried {70, 80, 90, 100}, and 100 had the lowest BER. Every time the model gives me the upper limit of the choices. I’m currently at 100 and don’t want to go higher. Any suggestions?

Another question: even if I stay at 100, I will end up with ~300 proteins across the 3 components. My aim is to look into the biological mechanism, but I’m afraid 300 would still be too many.

With these sorts of dimension reduction techniques, there is a constant balancing act between model performance and model complexity. Generally, increasing the model complexity (ie. feature count) will increase the model performance. You must draw a line as to how complex you are willing to have your model, and this line depends on the biological question at hand. Unfortunately, in some cases (yours being one of them it seems) constricting the model complexity will directly constrict your model performance.

My suggestion would be: consider an absolute upper limit on the number of features you’re willing to include for each component (eg. 50) and don’t go beyond it. By the sounds of it, your protein data requires many of its features to discriminate its classes - but if you can’t compromise on prioritising interpretability, then enforce this hard limit.
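As a rough sketch of enforcing such a cap (the block names and keepX values below are purely illustrative, and X, Y and design are assumed to already exist in your session):

```r
library(mixOmics)

# Hard cap: at most 50 features per block on each of the 3 components
keepX_capped <- list(snps        = c(40, 40, 40),
                     metabolites = c(20, 20, 20),
                     proteins    = c(50, 50, 50))

# Fit DIABLO with the capped feature counts rather than the tuned optimum
diablo_capped <- block.splsda(X, Y, ncomp = 3,
                              keepX = keepX_capped, design = design)
```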

My second suggestion involves quoting myself from this post:

Ensure when you tune these models that you use an adequately high number of repeats (ie. nrepeat = 100). As this may take a lot of time, I’d also suggest doing the tuning over multiple steps, increasing the resolution and decreasing the range of the grid at each iteration. Eg. start with test.keepX = seq(10, 150, 10) and, based on the output (let’s say it selects 50), undergo tuning again with test.keepX = seq(30, 70, 5); rinse and repeat.

In your application of this, you increased the upper bound while maintaining the resolution. If your model selects keepX = 50 from {20, 30, 40, 50}, then in the next run, rather than using {50, 60, 70}, consider trying {40, 42, 44, 46, 48, 50} or something similar. The aim of each iteration of keepX tuning is to decrease the total range of values while increasing the resolution (ie. decreasing the interval between each value).
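To make the two-stage idea concrete, a hedged sketch (block names and grid values are illustrative; X, Y and design are assumed to exist, and the non-protein blocks are held at a single value to keep the grid small):

```r
library(mixOmics)

# Pass 1: wide range, coarse resolution for the protein block
tune1 <- tune.block.splsda(X, Y, ncomp = 1, design = design,
                           test.keepX = list(snps        = 40,
                                             metabolites = 20,
                                             proteins    = seq(10, 150, 10)),
                           validation = "Mfold", folds = 5, nrepeat = 100)
tune1$choice.keepX  # suppose this selects proteins = 50

# Pass 2: narrower range centred on 50, finer resolution
tune2 <- tune.block.splsda(X, Y, ncomp = 1, design = design,
                           test.keepX = list(snps        = 40,
                                             metabolites = 20,
                                             proteins    = seq(40, 60, 2)),
                           validation = "Mfold", folds = 5, nrepeat = 100)
```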

As per your last point, I’ll refer to what I said above. Despite the desire to optimise the performance of the model, depending on the context of your study - you must impose limitations on the model. If you feel ~300 proteins is too much, then restrict the total number of features your model can use. Just know that depending on your data, this may come at the cost of your model’s discriminative ability to some degree.

I might also suggest exploring the correlations between your features (you could use plotVar()). Potentially, try removing features that are highly correlated with one another and rerun the DIABLO framework.
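One possible way to do this pruning (an assumption on my part; caret is not part of mixOmics, and the `proteins` matrix name is illustrative) is with caret::findCorrelation():

```r
library(caret)

# 'proteins' is assumed to be a samples x features numeric matrix
cor_mat  <- cor(proteins)
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)  # indices of redundant features

# Guard against the case where nothing exceeds the cutoff
proteins_reduced <- if (length(drop_idx)) proteins[, -drop_idx] else proteins

# Rebuild the block list with 'proteins_reduced' and rerun block.splsda()
```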

Let me know if this clears things up.

Cheers,
Max.

Thanks Max! That makes sense to me. I’ll try it out and let you know if I have any other problems.

Thanks again! It’s a great discussion forum!!!

Hi Max,

I have tuned the top 5 components and reran the perf() function to test the performance of my model. The BER increases at component 2 but then decreases. I have a binary outcome (cases and controls). In this case, should I just use component 1?

If you are favouring model interpretability, a simpler model (ie. ncomp=1) will be of more use. If this is not so much of a constraint, according to your figure then including more components (ie. ncomp=3 or 4) will improve the classification ability of the model.

Out of curiosity, how many repeats was this perf() run over?

In this case, should I just use component 1?

Without any context, it’s hard to say. Personally, seeing as the improvement in BER is ~2% for ncomp=4 compared to ncomp=1, and that you are dealing with a binary classification problem, one component would be my choice. I would suggest extensively exploring your data and increasing your nrepeat before making up your mind, though.

Hi @MaxBladen

I am trying to identify a set of discriminant features between two groups of patients by integrating 5 blocks of different types of data (using DIABLO’s block.splsda()).

When tuning the model to find how many components to use, I found that component 2 has higher error rates.

However, it is also the component which best discriminates between both groups of samples.

[Image: DIABLO tuning plot (diablotuning1) showing error rates per component]

So, I was wondering if it would make sense to implement the model with 2 components even though the errors could increase.

Note: I am aware that all the errors are high, but my dataset is composed of only 14 samples, so I believe this could be the cause.

Thanks!

However, it is also the component which best discriminates between both groups of samples.

Apart from qualitatively looking at the sample plot, how are you determining this?

So, I was wondering if it would make sense to implement the model with 2 components even though the errors could increase.

It’s hard to say, but with a sample size that small, and a change of 5% in error rate, I think the second component is worth keeping in your model.
