PLS-DA questions

Hello, I am new to mixOmics and this is my first time posting a question on the forum.

I am working on a swine project and plan to use multi-omics to predict the phenotype of pigs.
My data: 836 samples with phenotype, transcriptome (~15000), proteome (~400), and metabolome (~50). As a first step, I plan to predict the phenotype from each omics dataset on its own. I found that PLS-DA can do this and followed what the tutorial suggests. I used the "perf" function to estimate how many components I should use, and the results for the transcriptome are shown below:
single transcriptome:

Based on these results, I have no idea how many components I should use in my study. Can you give me some ideas about the results?
Another question: can we include additional factors in the model when using the plsda function? For example, I know that batch is another factor affecting the phenotype. How can this kind of factor be accounted for when using plsda?

Looking forward to your reply!

Best regards,

Hi @yuluc,

As a rule of thumb, the number of components should be approximately the number of phenotype levels minus 1. How many phenotype levels do you have?

The large difference between the BER and the overall classification error rate indicates that you have unbalanced group sizes, so you should set measure = "overall" when tuning the model.
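To see why the two error rates drift apart when group sizes are unbalanced, here is a small library-free Python sketch. It is not mixOmics code: the labels are made up, the helper names overall_error and balanced_error_rate are hypothetical, and it assumes the BER is the plain average of the per-class error rates.

```python
from collections import defaultdict

def overall_error(y_true, y_pred):
    """Fraction of all samples that are misclassified."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

def balanced_error_rate(y_true, y_pred):
    """Average of the per-class error rates (assumed definition of BER)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t != p:
            errors[t] += 1
    return sum(errors[c] / totals[c] for c in totals) / len(totals)

# Hypothetical unbalanced data: 90 samples of class "A", 10 of class "B".
# The classifier predicts "A" for every sample.
y_true = ["A"] * 90 + ["B"] * 10
y_pred = ["A"] * 100

overall = overall_error(y_true, y_pred)    # 10 wrong out of 100
ber = balanced_error_rate(y_true, y_pred)  # (0/90 + 10/10) / 2

print(f"overall error: {overall:.2f}, BER: {ber:.2f}")
```

In this toy case the overall error looks low because the large class dominates it, while the BER is much higher because the small class is always misclassified. With equal group sizes the two numbers would coincide.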

The next question is which distance to use. You might have a complex classification problem, in which case you should choose the Mahalanobis distance, but it is a little difficult to tell without information about the phenotype. However, this might help you:

In practice we found that the centroid-based distances, and specifically the Mahalanobis distance, led to more accurate predictions than the maximum distance for complex classification problems and N-integration problems. The centroid distances consider the prediction in an H-dimensional space using the predicted scores, while the maximum distance considers a single point estimate using the predicted dummy variables on the last dimension of the model.
Source: mixOmics: An R package for ‘omics feature selection and multiple data integration - Supplemental information page 4
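The two prediction rules in the quote can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the mixOmics implementation: all numbers, class names, and helper names are hypothetical, and plain Euclidean distance stands in for the centroid-based family (the Mahalanobis variant would additionally take the covariance of the scores into account).

```python
import math

def predict_max_dist(dummy_scores, classes):
    """Maximum distance rule: pick the class whose predicted
    dummy variable has the largest value."""
    best = max(range(len(classes)), key=lambda k: dummy_scores[k])
    return classes[best]

def predict_centroid_dist(sample_scores, centroids):
    """Centroid rule: pick the class whose centroid is closest to the
    sample in the space of predicted component scores (Euclidean here)."""
    return min(centroids, key=lambda c: math.dist(sample_scores, centroids[c]))

# Hypothetical three-class problem with H = 2 components.
classes = ["low", "mid", "high"]

# Made-up predicted dummy variables for one new sample.
dummy_scores = [0.40, 0.35, 0.25]

# Made-up class centroids and predicted scores for the same sample.
centroids = {"low": (-2.0, 0.0), "mid": (0.5, 1.0), "high": (2.0, -1.0)}
sample_scores = (0.4, 0.8)

max_pred = predict_max_dist(dummy_scores, classes)
cen_pred = predict_centroid_dist(sample_scores, centroids)

print("maximum distance rule:", max_pred)
print("centroid distance rule:", cen_pred)
```

The values were chosen so that the two rules disagree: the dummy variables narrowly favour one class, while the sample sits closest to another class centroid in the score space. That is the kind of situation in which the choice of distance changes the prediction.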

Regarding the question about batch effects: yes, a method has recently been developed to account for them. You can find the paper here: and the code and vignettes here: GitHub - EvaYiwenWang/PLSDAbatch: R package for batch effect correction

  • Christopher