Need help in reviewing data analysis

Hi mixOmics team,

I have run a multivariate analysis to compare the metabolomes of authentic and adulterated rice samples using the mixOmics package (version 6.15.0) in R (version 4.0.4). The analysis completed without any issues, but I am not sure my analysis pipeline is sound, and there is no one on my team who can help me check the results. Does the mixOmics team provide help in such situations?

Best regards,
Hoa

Hi @hoanq8x,
Please feel free to ask all the mixOmics-related questions you have. If you want feedback on your pipeline, you can just post your script, outputs, etc. here, and we will do our best to help you.

  • Christopher

Dear Christopher,

Thank you for your kind help. Here is my script and outputs:

[Script]
(7.9 KB file on MEGA)

[Outputs]
(536.9 KB file on MEGA)

I tried to upload the files directly to this post, but without success. Could you help me check my pipeline and the relevant results? Thank you again!

Hi @hoanq8x,

Everything seems to be done correctly. I have some minor comments:

  1. You can set a cutoff in the biplot or do a sparse PCA model to avoid so many overlapping arrows.
  2. The final sPLS-DA model has 3 components, but you don’t show what’s happening on component 3.
  3. What does most “significant” metabolites mean? Did you filter out metabolites based on what is significant in the volcano plot, or did you set a threshold for the VIP score? You can also use the cim function to create a heatmap with clustering for the variables selected on comp 1, comp 2 and comp 3 separately, and a heatmap for all the selected variables combined.
  • Christopher
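
The first suggestion could be sketched roughly as follows. This is only an illustration under assumptions: the `srbct` example data shipped with mixOmics stands in for the rice metabolomics data (not shown in this thread), and the `cutoff` and `keepX` values are arbitrary placeholders.

```r
library(mixOmics)

# Stand-in data shipped with mixOmics (assumption: NOT the rice data)
data(srbct)
X <- srbct$gene    # samples x variables
Y <- srbct$class   # factor of group labels

# Option A: ordinary PCA, but only draw arrows for variables whose
# correlation with the plotted components reaches the cutoff
pca.res <- pca(X, ncomp = 3, scale = TRUE)
biplot(pca.res, cutoff = 0.7, group = Y, ind.names = FALSE)

# Option B: sparse PCA, keeping an arbitrary 10 variables per component,
# so that only the selected variables carry non-zero loadings
spca.res <- spca(X, ncomp = 3, keepX = c(10, 10, 10))
biplot(spca.res, group = Y, ind.names = FALSE)
```

Both options reduce the number of arrows; the cutoff filters what is displayed, while sparse PCA changes the model itself.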

Hi Christopher,

Thanks for your quick response!

  1. You can set a cutoff in the biplot or do a sparse PCA model to avoid so many overlapping arrows.
  => Good suggestion!

  2. The final sPLS-DA model has 3 components, but you don’t show what’s happening on component 3.
  => Yes, that’s true. The first 2 components show quite clear separation among the groups, so I did not plot component 3 against the others. I should include such plots anyway.

  3. What does most “significant” metabolites mean? Did you filter out metabolites based on what is significant in the volcano plot, or did you set a threshold for the VIP score? You can also use the cim function to create a heatmap with clustering for the variables selected on comp 1, comp 2 and comp 3 separately, and a heatmap for all the selected variables combined.
  => I will try the heatmap again using cim; I have used it for my RNA-seq data. By the most “significant” metabolites I mean those that can actually differentiate among the groups, as well as the targeted metabolites we want to filter. We kept metabolites with a VIP score greater than 1 and an AUC greater than 0.7. After filtering, we ran a further logistic regression analysis to confirm their effect on group differentiation. What do you think about this?

Best,
Hoa
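
A VIP filter and the cim heatmaps discussed above might look roughly like this. It is a sketch under assumptions: the `srbct` example data from mixOmics replaces the rice data, `keepX` is arbitrary, and only the VIP > 1 threshold comes from the post.

```r
library(mixOmics)

# Stand-in data shipped with mixOmics (assumption: NOT the rice data)
data(srbct)
X <- srbct$gene
Y <- srbct$class

# Arbitrary keepX; in practice it would come from tuning
splsda.res <- splsda(X, Y, ncomp = 3, keepX = c(10, 10, 10))

# VIP scores: one row per variable, one column per component
vip.scores <- vip(splsda.res)
vip.above.1 <- rownames(vip.scores)[vip.scores[, 1] > 1]

# Variables the sparse model actually selected on component 1
selected.comp1 <- selectVar(splsda.res, comp = 1)$name

# Clustered heatmap of the variables selected on component 1 only
cim(splsda.res, comp = 1,
    row.sideColors = color.mixo(as.numeric(Y)),
    title = "Component 1")

# Clustered heatmap of the variables selected on any of the 3 components
cim(splsda.res, comp = c(1, 2, 3),
    row.sideColors = color.mixo(as.numeric(Y)),
    title = "Components 1 to 3")
```

In a sparse model, variables that were not selected on a component have a zero loading there, so on component 1 the VIP > 1 set is a subset of the selected variables.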

Hi @hoanq8x,

I am not sure about this one. @aljabadi maybe you can answer this? :smiley:

  • Christopher

Hi @hoanq8x,

Christopher provided some great feedback on your pipeline, so I’ll just encourage you to check out our updated mixOmics vignette, which expands on all the available methods and their functionalities.

On a side note, I noticed that the first 2 components do not separate two of the groups. I recommend looking into the 3rd component, either with another 2D plot or with plotIndiv(style='3d').

I think “signature metabolites” or “selected metabolites” would be a better term to describe the selected features.

Hope it helps.

Al
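
The third component can be inspected along these lines. This is a sketch only: it assumes the `srbct` example data from mixOmics in place of the rice data and an arbitrary `keepX`; the 3D call is left commented out because it needs the rgl package.

```r
library(mixOmics)

# Stand-in data shipped with mixOmics (assumption: NOT the rice data)
data(srbct)
X <- srbct$gene
Y <- srbct$class

splsda.res <- splsda(X, Y, ncomp = 3, keepX = c(10, 10, 10))

# Additional 2D views that involve component 3
plotIndiv(splsda.res, comp = c(1, 3), group = Y, ind.names = FALSE,
          ellipse = TRUE, legend = TRUE, title = "sPLS-DA, comp 1 vs 3")
plotIndiv(splsda.res, comp = c(2, 3), group = Y, ind.names = FALSE,
          ellipse = TRUE, legend = TRUE, title = "sPLS-DA, comp 2 vs 3")

# 3D view of all three components (requires the rgl package)
# plotIndiv(splsda.res, comp = 1:3, style = "3d", group = Y, ind.names = FALSE)
```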


Hi @aljabadi,

Christopher provided some great feedback on your pipeline, so I’ll just encourage you to check out our updated mixOmics vignette, which expands on all the available methods and their functionalities.

I will check the updated vignette once again.

I think “signature metabolites” or “selected metabolites” would be a better term to describe the selected features.

Thanks for your suggestion!

Also, could you help me with my question about further statistical analysis of the detected signature metabolites?

The most “significant” metabolites refer to those which could actually differentiate among the groups, and also the targeted metabolites we want to filter. We set the VIP score greater than 1 and AUC greater than 0.7. After filtering, we did some further logistic regression analysis to confirm their effects on group differentiation again.

Thank you!

Best regards,
Hoa

Hi @hoanq8x,

We don’t recommend using the AUC on its own as a criterion for model performance. You might want to cross-validate the model using the perf function and look at the error rates first. The AUC can be a complementary measure.

Also, you can certainly compare the multivariate analysis outcomes with those from logistic regression and investigate any differences/agreements. However, it’s outside the scope of what we can advise you in detail.

Hope it helps,

Al
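
A cross-validation run of this kind could be sketched as below. The `srbct` example data, the `keepX` values, and the fold and repeat settings are all assumptions chosen to keep the example short; they are not taken from the rice analysis.

```r
library(mixOmics)

# Stand-in data shipped with mixOmics (assumption: NOT the rice data)
data(srbct)
X <- srbct$gene
Y <- srbct$class

set.seed(42)  # the cross-validation folds are drawn at random
splsda.res <- splsda(X, Y, ncomp = 3, keepX = c(10, 10, 10))

# Repeated M-fold cross-validation (small settings to keep the run short)
perf.res <- perf(splsda.res, validation = "Mfold", folds = 5,
                 nrepeat = 10, progressBar = FALSE)

# Overall and balanced error rates, per component and prediction distance
print(perf.res$error.rate$overall)
print(perf.res$error.rate$BER)
plot(perf.res)

# AUC as a complementary measure, here for the model with 2 components
auc.res <- auroc(splsda.res, roc.comp = 2)
```

The error-rate matrices have one row per component, which makes it easy to see where adding a component stops improving the classification.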

Hi @aljabadi,
Sorry for reviving this topic. As I continued my data analysis, I ran into some problems that confused me.
I have a dataset of more than 5000 metabolites obtained from a metabolomic fingerprinting analysis. Our main aims were to use this large amount of data to discriminate between 2 rice groups and to identify potential biomarkers for group differentiation. I ran pca and plsda for an overview of the group difference, and perf for cross-validation and to choose the optimal number of components. The results were straightforward: they indicated a clear separation between the 2 groups, and using 2 components gave an error rate of about 0.02. I then ran the tune function over a list of keepX values ranging from 5 to 200 compounds to identify the optimal number of variables to retain in the final model. It returned a robust result, with 5 variables for the optimal model, all with high VIP values. However, when I checked for a group difference using each of those 5 variables in a logistic regression analysis, none of them was significant. Also, a heatmap constructed using only those 5 variables did not show a clear separation between the samples of the two rice groups; 1-2 samples were mixed between them. Could you tell me which step I did wrong? And how could I explain such results?

Here is my script:

Rscript

Thank you so much!
Hoa
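
For reference, the tuning step described above might be sketched like this. It is not the linked script: the `srbct` example data from mixOmics stands in for the rice data, and the grid, fold and repeat settings are assumptions chosen to keep the run short.

```r
library(mixOmics)

# Stand-in data shipped with mixOmics (assumption: NOT the rice data)
data(srbct)
X <- srbct$gene
Y <- srbct$class

set.seed(42)  # the cross-validation folds are drawn at random

# Candidate numbers of variables to keep per component (5 to 200)
grid.keepX <- c(5, 10, 25, 50, 100, 200)

tune.res <- tune.splsda(X, Y, ncomp = 2, validation = "Mfold", folds = 5,
                        nrepeat = 5, test.keepX = grid.keepX,
                        dist = "max.dist", measure = "BER",
                        progressBar = FALSE)

print(tune.res$choice.keepX)  # selected keepX for each component
plot(tune.res)                # error rate against keepX

# Final model refitted with the tuned keepX
final.res <- splsda(X, Y, ncomp = 2, keepX = tune.res$choice.keepX)
print(selectVar(final.res, comp = 1)$name)
```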

Hi @hoanq8x,

Please note that logistic regression is a generalized linear model and, like linear regression, is not a suitable method when the variables are highly correlated, whereas (s)PLS-DA models are highly robust even with highly collinear variables.

Thanks

Al
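
The effect of collinearity on logistic regression can be illustrated with a small base-R simulation. Everything here is an assumption for illustration (simulated data, five near-duplicate variables driven by one latent signal); it is not the rice data and does not reproduce Hoa's analysis.

```r
set.seed(1)
n <- 100
z <- rnorm(n)                       # shared latent signal
y <- rbinom(n, 1, plogis(2 * z))    # group label depends on z only

# Five near-duplicate variables: z plus a little independent noise
X <- sapply(1:5, function(j) z + rnorm(n, sd = 0.1))
colnames(X) <- paste0("x", 1:5)
dat <- data.frame(y = y, X)

# One variable at a time versus all five together
fit.single <- glm(y ~ x1, data = dat, family = binomial)
fit.joint  <- glm(y ~ ., data = dat, family = binomial)

se.single <- summary(fit.single)$coefficients["x1", "Std. Error"]
se.joint  <- summary(fit.joint)$coefficients["x1", "Std. Error"]

# Likelihood-ratio test of the joint model against the intercept-only model
lrt.p <- pchisq(fit.joint$null.deviance - fit.joint$deviance,
                df = fit.joint$df.null - fit.joint$df.residual,
                lower.tail = FALSE)

print(round(cor(X)[1, 2], 3))                         # correlation of x1, x2
print(c(single = se.single, joint = se.joint))        # standard errors of x1
print(summary(fit.joint)$coefficients[, "Pr(>|z|)"])  # per-variable p-values
print(lrt.p)                                          # whole-model p-value
```

When the variables are nearly interchangeable, the standard error of each coefficient in the joint model is inflated, so individual p-values can look unconvincing even though the variables together clearly carry the group signal.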

Hi @aljabadi,

Thank you so much for your response. Now I understand the difference between a logistic regression and splsda model.

Best,
Hoa