Data pre-processing

Thanks for developing this great package!!

I have six omics datasets (transcriptome, proteome, cytokines, metabolome, gut16s, nares16s) to integrate using DIABLO. Each of them has been normalized with a method appropriate to its platform.

My first question is: should I also transform the data after the normalization step, e.g. for RNA-seq, median-of-ratios normalization followed by a variance-stabilizing transformation? I am asking because my log2-transformed and normalized proteome data showed better separation in plotIndiv than the data that were only normalized. However, I am concerned that the data will then also be scaled in mixOmics, and I am unsure whether all of this will distort the true values of the data.
Note: some of the datasets do not follow a normal distribution after normalization.

My second question is: if I have unbalanced groups, do you recommend randomly selecting an equal number of samples from each group, or should I just rely on the BER? My BER is high (~0.35).

Many thanks in advance

I am kindly awaiting your support.

Hi @Oweda

Thanks for using mixOmics and getting in touch regarding your questions.

Regarding the preprocessing, all of our methods assume that appropriate preprocessing has already been applied, so unfortunately we cannot advise on specific preprocessing methods. However, log-transformation of abundance data is typically a good idea, as fold changes are more relevant than absolute changes.
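To make the fold-change point concrete, here is a minimal Python sketch (standard library only, not mixOmics code). It assumes that "scaling" means centering each feature and dividing by its standard deviation; the function names and the `pseudocount` parameter are hypothetical and chosen only for illustration. It shows that after a log2 transform each doubling becomes a constant step of 1, and that centering and scaling afterwards is a linear rescaling of that feature which leaves the sample ordering unchanged.

```python
import math
import statistics


def log2_transform(values, pseudocount=0.0):
    """Log2-transform abundances; use a pseudocount > 0 if zeros are present."""
    return [math.log2(v + pseudocount) for v in values]


def center_scale(values):
    """Center one feature to mean 0 and scale it to unit standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical abundances of one feature: each sample doubles the previous one.
abundances = [100.0, 200.0, 400.0, 800.0]

logged = log2_transform(abundances)
scaled = center_scale(logged)

# On the raw scale the steps are 100, 200, 400; on the log2 scale each is 1.
raw_steps = [b - a for a, b in zip(abundances, abundances[1:])]
log_steps = [b - a for a, b in zip(logged, logged[1:])]

print("raw steps: ", raw_steps)
print("log2 steps:", [round(s, 6) for s in log_steps])
print("scaled:    ", [round(s, 4) for s in scaled])
```

Scaling changes the units of each feature but not the relative positions of the samples within it, which is why it is usually applied on top of, rather than instead of, a platform-appropriate transformation.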

Regarding the model with the unbalanced groups, the BER (balanced error rate) should conceptually achieve the same thing as equal subsampling, in that it ensures equal representation of all groups in the model performance evaluation. You can, however, try the different prediction distances (dist) and see which one achieves the best BER.
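As an illustration of why the BER is informative for unbalanced groups, the sketch below computes it as the average of the per-class error rates and compares it with the overall error rate. This is plain Python, not the mixOmics implementation, and the class sizes (90 and 10) and the "always predict the majority class" classifier are made up for the example.

```python
from collections import defaultdict


def overall_error_rate(y_true, y_pred):
    """Fraction of all samples that are misclassified."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def balanced_error_rate(y_true, y_pred):
    """Average of the per-class error rates, so every class weighs equally."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t != p:
            errors[t] += 1
    return sum(errors[c] / totals[c] for c in totals) / len(totals)


# Hypothetical unbalanced data: 90 controls and 10 condition samples.
y_true = ["control"] * 90 + ["condition"] * 10
# A useless classifier that always predicts the majority class.
y_pred = ["control"] * 100

overall = overall_error_rate(y_true, y_pred)
ber = balanced_error_rate(y_true, y_pred)
print(f"overall error rate: {overall:.2f}")
print(f"balanced error rate: {ber:.2f}")
```

The overall error rate looks good here only because the minority class is small, whereas the BER exposes that one class is always misclassified.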

Hope it helps,


Hello AI,

I am very grateful for your reply.

Regarding the unbalanced groups (210 control and 41 condition samples): I randomly selected 41 control samples to include in the model. Here is the performance evaluation for the balanced and unbalanced data, respectively:



Clearly the balanced data give a better classification. However, I am worried about the random selection itself, as the condition group is small relative to the control group. Does the model perform sample stratification to handle the unbalanced groups? In my case, do you recommend using the balanced data, or sticking with the unbalanced data and the Mahalanobis distance?

Thanks for your support

Hi @Oweda,

The model does perform stratified subsampling for cross-validation, which means the proportion of cases and controls in each fold remains the same as in the full data (so still unbalanced). Mahalanobis distance seems to be a better choice of distance measure. What happens to the model performance when you include more components (say 10)?
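To show what stratified fold assignment means in practice, here is a small Python sketch of the general idea. It is not the mixOmics internals; `stratified_folds` is a hypothetical helper, and 5 folds is an arbitrary choice. It uses the group sizes from this thread (210 control, 41 condition).

```python
import random
from collections import Counter


def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal the shuffled samples of this class round-robin across the folds.
        for pos, idx in enumerate(idxs):
            folds[pos % n_folds].append(idx)
    return folds


labels = ["control"] * 210 + ["condition"] * 41
folds = stratified_folds(labels, n_folds=5)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]

for k, counts in enumerate(fold_counts, start=1):
    print(f"fold {k}: {counts['control']} control, {counts['condition']} condition")
```

Every fold ends up with roughly the same 5:1 ratio as the full data, so stratification keeps the folds representative but does not remove the imbalance.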



Hi AI,

Thanks again for your support

Here is the model performance for the unbalanced data with 10 components. The best BER was obtained at the sixth component.

Also, for the balanced data, I tried another three different random subsets of control samples and got a BER ranging from 0.1 to 0.15.
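The repeated subsampling described above can be sketched as follows. This is a minimal Python illustration, not mixOmics code: `balanced_subsets` is a hypothetical helper, the seed is arbitrary, and model fitting is left out entirely. Each draw keeps all 41 condition samples and a fresh random set of 41 controls.

```python
import random


def balanced_subsets(labels, majority, n_repeats, seed=0):
    """Draw repeated balanced index sets by undersampling the majority class."""
    rng = random.Random(seed)
    majority_idx = [i for i, lab in enumerate(labels) if lab == majority]
    minority_idx = [i for i, lab in enumerate(labels) if lab != majority]
    subsets = []
    for _ in range(n_repeats):
        # Sample as many majority samples as there are minority samples.
        drawn = rng.sample(majority_idx, k=len(minority_idx))
        subsets.append(sorted(drawn + minority_idx))
    return subsets


labels = ["control"] * 210 + ["condition"] * 41
subsets = balanced_subsets(labels, majority="control", n_repeats=4)

for k, subset in enumerate(subsets, start=1):
    n_control = sum(1 for i in subset if labels[i] == "control")
    print(f"subset {k}: {n_control} control, {len(subset) - n_control} condition")
```

Fitting the model on each index set and reporting the range of BER values across draws, as done here with 0.1 to 0.15, gives a sense of how much the result depends on which controls happened to be selected.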

Thank you.