Data pre-processing

Oweda · March 11, 2021, 11:09am

Hi,
Thanks for developing this great package!!

I have six omics datasets (transcriptome, proteome, cytokines, metabolome, gut16s, nares16s) to integrate using DIABLO. Each has been normalized with a method appropriate to its platform.

My first question is: should I also transform the data after the normalization step, e.g. for RNA-seq, median-of-ratios normalization followed by a variance-stabilizing transformation? I ask because my log2-transformed and normalized proteome data showed better separation in plotIndiv than the data that were only normalized. However, I am concerned that the data will then be scaled again in mixOmics, and I am unsure whether all of this will distort the true values of the data.
Note: some of the datasets do not follow a normal distribution after normalization.

My second question is: if I have unbalanced groups, do you recommend randomly selecting an equal number of samples from each group, or should I just rely on the BER? My BER is high (~0.35).

Many thanks in advance

Oweda · March 19, 2021, 3:31pm

I would be grateful for your support; I am still awaiting a reply.

aljabadi · March 22, 2021, 7:56am

Hi @Oweda

Thanks for using mixOmics and getting in touch regarding your questions.

Regarding the preprocessing, all of our methods assume that appropriate preprocessing has already been applied, so unfortunately we cannot advise on specific preprocessing methods. However, log-transformation of abundance data is typically a good idea, as fold changes are more relevant than absolute changes.
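
As a minimal illustration of the fold-change point (a Python sketch using only the standard library; the abundance values are hypothetical and not taken from this thread), a doubling and a halving look very different on the raw scale but become symmetric after a log2 transform:

```python
import math

# Hypothetical abundances for a single feature (illustrative values only).
baseline = 100.0
doubled = 200.0   # 2-fold increase relative to baseline
halved = 50.0     # 2-fold decrease relative to baseline

# Raw scale: the two 2-fold changes differ in magnitude.
raw_up = doubled - baseline
raw_down = halved - baseline

# log2 scale: the same two changes are equal in size and opposite in sign.
log_up = math.log2(doubled) - math.log2(baseline)
log_down = math.log2(halved) - math.log2(baseline)

print("raw differences: ", raw_up, raw_down)
print("log2 differences:", round(log_up, 6), round(log_down, 6))
```

Whether a log transform suits a given platform still depends on the data (for example, zeros need handling before taking logs), so this only shows the general motivation.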

Regarding the model with the unbalanced groups, BER should conceptually achieve the same thing as equal subsampling in ensuring equal representation of all groups in the model performance evaluation. You can, however, look at different distances (dist) and see which one achieves the best BER.
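
To make the BER point concrete, here is a small Python sketch (standard library only). It uses the 210/41 group sizes reported later in this thread, but the predictions are hypothetical: a classifier that labels every sample as control. The function name `error_rates` is made up for this example and is not part of mixOmics.

```python
def error_rates(confusion):
    """Return (overall error rate, balanced error rate).

    confusion[true_class][predicted_class] holds sample counts.
    """
    total = sum(sum(row.values()) for row in confusion.values())
    wrong = sum(
        count
        for true_class, row in confusion.items()
        for predicted, count in row.items()
        if predicted != true_class
    )
    overall = wrong / total

    # BER: average of the per-class error rates, so each class
    # contributes equally regardless of its size.
    per_class = []
    for true_class, row in confusion.items():
        class_total = sum(row.values())
        class_wrong = sum(c for p, c in row.items() if p != true_class)
        per_class.append(class_wrong / class_total)
    ber = sum(per_class) / len(per_class)
    return overall, ber


# Hypothetical predictions: everything is called "control".
confusion = {
    "control": {"control": 210, "condition": 0},
    "condition": {"control": 41, "condition": 0},
}
overall, ber = error_rates(confusion)
print(f"overall error = {overall:.3f}, BER = {ber:.3f}")
# prints: overall error = 0.163, BER = 0.500
```

The overall error looks acceptable only because the control group dominates; the BER exposes that one class is never predicted correctly.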

Hope it helps,

Al

Oweda · March 24, 2021, 11:33am

Hello Al,

I am very grateful for your reply.

Regarding the unbalanced groups (210 control and 41 condition): I randomly selected 41 control samples to include in the model. Here is the performance evaluation for the balanced and unbalanced data, respectively:

1- [Image: performance evaluation, balanced data]

2- [Image: performance evaluation, unbalanced data]

Clearly the balanced data give a better classification. However, I am worried about the random selection itself, as the condition group is small relative to the control group. Does the model perform sample stratification to handle unbalanced groups? In my case, do you recommend using the balanced data, or sticking with the unbalanced data and the Mahalanobis distance?

Thanks for your support

aljabadi · March 25, 2021, 12:25am

Hi @Oweda,

The model does perform stratified subsampling for cross-validation, which means the proportion of case & control would remain the same as the full data (so still unbalanced). Mahalanobis distance seems to be a better choice of distance measure. What happens to the model performance when you include more components (say 10)?
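
The idea of stratified fold assignment can be sketched as follows (Python, standard library only). This is not the mixOmics implementation: the function `stratified_folds`, the choice of 5 folds, and the seed are assumptions made for illustration, applied to the 210/41 group sizes from this thread.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of this class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


labels = ["control"] * 210 + ["condition"] * 41
folds = stratified_folds(labels, n_folds=5)
for i, fold in enumerate(folds):
    print(i, dict(Counter(labels[j] for j in fold)))
```

Each fold ends up with 42 controls and 8 or 9 condition samples, so the roughly 5:1 imbalance is preserved within every fold rather than corrected.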

Best,

Al

Oweda · March 25, 2021, 10:41am

Hi Al,

Thanks again for your support

Here is the model performance for the unbalanced data with 10 components. We got the best BER with the sixth component.

Also, for the balanced data, I tried another three random subsets of control samples and got a BER ranging from 0.1 to 0.15.
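
Drawing several balanced subsets reproducibly can be sketched like this (Python, standard library only). The helper `balanced_subsets`, the seed, and the count of four subsets are assumptions for illustration; the model fitting and BER evaluation on each subset are left out.

```python
import random


def balanced_subsets(labels, majority, n_subsets, seed=0):
    """Draw n_subsets index sets: all minority samples plus an equally
    sized random sample (without replacement) of the majority class."""
    rng = random.Random(seed)
    majority_idx = [i for i, lab in enumerate(labels) if lab == majority]
    minority_idx = [i for i, lab in enumerate(labels) if lab != majority]

    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority_idx, len(minority_idx))
        subsets.append(sorted(sampled + minority_idx))
    return subsets


labels = ["control"] * 210 + ["condition"] * 41
subsets = balanced_subsets(labels, majority="control", n_subsets=4)
for subset in subsets:
    n_control = sum(labels[i] == "control" for i in subset)
    print(len(subset), "samples,", n_control, "controls")
```

Fitting the model on each subset and reporting the spread of the resulting BER values gives a rough sense of how much the result depends on which controls happened to be selected.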

Thank you.
