Design / Weighted Vote Error interpretation

Hello everybody!

A few months ago I wrote about an analysis I am doing with mixOmics (here). Today I come with further doubts about the same analysis; a quick summary anyway:

I want to study the interaction between the host (transcriptomics data) and the bacteria (microbiome data). With mixOmics I have integrated both datasets and run tests to see which design and distance metric work best, keeping in the end the models that obtained the best Weighted Vote error, as suggested in my previous query. But I still have some doubts:

1. I am a bit confused as to the meaning of the values chosen for design. From what I understand, a value of design = 0 indicates that my datasets are not related (I don’t know what the actual relationship is), and a value of design = 1 indicates that they are very closely related. But I have also read that if design = 1, the model takes into account information from both datasets; if design = 0.5, the model gives more weight to one dataset than the other; and if design = 0, one of the datasets is excluded from the model. Can you confirm how much of this is true? I have been reading a lot about this topic and I am still in doubt.

2. Regarding the design chosen, do I finally go with the model that obtains a better Weighted Vote Error? If the results of the model with a higher Weighted Vote Error make more biological sense, would it be justifiable to say that we choose the model with the higher error because the results are more interpretable (as long as the error is still reasonable)? An example in case it is simpler: would it be correct to choose model 2 if I get more biologically meaningful interactions?

Model 1:
design = 0.5, distance = mahalanobis
comp1 = 0.39
comp2 = 0.28
comp3 = 0.27
comp4 = 0.26

mean = 0.300

Model 2:
design = 0, distance = mahalanobis
comp1 = 0.39
comp2 = 0.38
comp3 = 0.35
comp4 = 0.36

mean = 0.370

3. We will finally keep a model with 4 components so as not to lose information. Is it correct to report the model error as the average of the errors of the four components, or should each be reported separately?

4. Do you think the metrics obtained indicate a decent model?

Best regards and thanks for your time!
Marta

hi @Margonmon,

1. I am a bit confused as to the meaning of the values chosen for design. From what I understand, a value of design = 0 indicates that my datasets are not related (I don’t know what the actual relationship is), and a value of design = 1 indicates that they are very closely related. But I have also read that if design = 1, the model takes into account information from both datasets; if design = 0.5, the model gives more weight to one dataset than the other; and if design = 0, one of the datasets is excluded from the model. Can you confirm how much of this is true? I have been reading a lot about this topic and I am still in doubt.

design = 1: you want to maximise the correlation between the two datasets (of course, this assumes a relationship exists; if there is none, the method will fail to extract any correlation).
design = 0: both datasets are still included in the model, but the aim is to maximise the discrimination with the outcome Y instead -> less ‘integration’ is happening.
design = 0.5: a trade-off between the two.
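As a sketch of how this choice is expressed in code (block names and the commented model call below are illustrative, not from the original analysis): the design is a square matrix with one row/column per block, and the off-diagonal value is what is being discussed here.

```r
# Two blocks: host transcriptomics and microbiome (names are illustrative)
blocks <- c("transcriptome", "microbiome")

# design = 0.5: a trade-off between correlating the blocks
# and discriminating the outcome Y
design <- matrix(0.5, nrow = 2, ncol = 2,
                 dimnames = list(blocks, blocks))
diag(design) <- 0  # a block is never linked to itself

# The matrix is then passed to the DIABLO model, e.g. (assuming
# library(mixOmics) is loaded and X1, X2, outcome exist):
# model <- block.splsda(X = list(transcriptome = X1, microbiome = X2),
#                       Y = outcome, ncomp = 4, design = design)
```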

2. Regarding the design chosen, do I finally go with the model that obtains a better Weighted Vote Error? If the results of the model with a higher Weighted Vote Error make more biological sense, would it be justifiable to say that we choose the model with the higher error because the results are more interpretable (as long as the error is still reasonable)? An example in case it is simpler: would it be correct to choose model 2 if I get more biologically meaningful interactions?

The error rate gives you an indication of how well the model generalises to test data sets. It should not be the criterion for choosing the best model per se (it is more an interpretation of what is happening). However, it does give you a clearer understanding of the correlation structure between your two data sets. So the short answer is: it depends on what story you want to tell.

3. We will finally keep a model with 4 components so as not to lose information. Is it correct to report the model error as the average of the errors of the four components, or should each be reported separately?

Your interpretation is incorrect. The error rate is cumulative across components, and the final error rate is the one obtained on the last component you have chosen. E.g. model 1 with 4 components → 0.26.
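In other words, using the model 1 numbers from the post above, the value to report for the 4-component model is the last entry of the per-component vector, not its mean (a minimal base R illustration):

```r
# Per-component weighted vote error rates for model 1 (from the post)
err <- c(comp1 = 0.39, comp2 = 0.28, comp3 = 0.27, comp4 = 0.26)

# Each entry is cumulative: it already reflects all components up to that
# point, so the error of the chosen 4-component model is the last one.
reported <- err[length(err)]  # 0.26, not mean(err) = 0.30
```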

4. Do you think the metrics obtained indicate a decent model?

I can’t really comment on that; it depends on the data and their complexity. It seems OK, but some people get an error rate of 0.05 and others of 0.8…

Kim-Anh


Thank you for the answers! Now I understand my analysis better.

Best,
Marta

Hello again. I have worked on this analysis further, and in the end we have chosen design = 0 because the results we obtain are more robust.

My question relates to this:

I didn’t understand this phrase “aim is to maximise the discrimination with Y instead - > less ‘integration’ is happening”.

Could you explain it to me again in more detail?

Thank you very much for your time,
Marta

hi @Margonmon

You can refer to the DIABLO paper, or the mixOmics book!

Kim-Anh