PLS-DA questions

Hello, I am new to the mixOmics and this is my first time posting my question in the forum.

I am working on a swine project and plan to use multi-omics to predict the phenotype of pigs.
For the data I have: 836 samples with phenotype, transcriptome (~15000), proteome(~400), and metabolome(~50). First I plan to use the single omics data to predict the phenotype. I found PLS-DA can do this job and tried what the tutorial suggests. I tried “perf” function to estimate how many components I should use for this study and the results for transcriptome was shown below:
single transcriptome:
image

Based on these results, I have no idea how many components I should use in my study. Can you give me some ideas about the results?
Another question, can we fit more factors in the model when we use the function plsda? For example, I know the batch is another factor affecting the phenotype, how to fit these kind of factors in the model when we use plsda.

Looking forward to your reply!

Best regards,
Yulu

Hi @yuluc,

As a rule of thumb the number of components should be approx. phenotype levels - 1. How many phenotype levels do you have?

The big difference between BER and overall classification error rate, indicates that you have unbalanced group sizes, and you should therefore set measure = “overall” when tuning the model.

The next question is what distance to use. You might have a complex classification problem and should therefore choose Mahalanobis distance, but it little difficult to tell without information about phenotype. However, this might help you:

In practice we found that the centroid-based distances, and specically the Mahalanobis distance led tomore accurate predictions than the maximum distance for complex classication problems and N-integrationproblems. The centroid distances consider the prediction in a Hdimensional space using the predicted scores, while the maximum distance considers a single point estimate using the predicted dummy variableson the last dimension of the model.
Source: mixOmics: An R package for ‘omics feature selection and multiple data integration - Supplemental information page 4

Regarding the question about batch effects. Yes, a method has recently been developed to account for batch effects. You can find the paper here: https://www.biorxiv.org/content/10.1101/2020.10.27.358283v1 and codes/vignettes here GitHub - EvaYiwenWang/PLSDAbatch: Development repository for R package 'PLSDAbatch ' hosted on Bioconductor.

  • Christopher
3 Likes

Hi Christopher,

Really thank you for your answer! It helps me a lot!
I have 836 phenotype data. The phenotype data is a binary data (0/1). There were 647 samples with a phenotype of 0 and 189 samples with a phenotype 1.
This time, I tried what you suggested to tune the model and get the optimal number of components based on t-test on the error rate was 5.
tune.splsda.srbct ← tune.splsda(x, y, ncomp = 7,
validation = ‘Mfold’,
folds = 3, dist = ‘mahalanobis.dist’, progressBar = FALSE,
measure = “overall”, test.keepX = list.keepX,
nrepeat = 10)
Then, used the final model to conduct the performance assessment and looked at the results.
perf.srbct ← perf(MyResult.splsda.final, validation = “Mfold”, folds = 5,
dist = ‘mahalanobis.dist’, nrepeat = 10,
progressBar = FALSE)

perf.srbct$error.rate
$overall
mahalanobis.dist
comp1 0.4703349
comp2 0.4375598
comp3 0.4383971
comp4 0.4321770
comp5 0.4122010

It seems like the overall error rate per class decrease as more components are added but not decrease much.
I also attached the final model results but it seems like it cannot really separate the data.
image

Another question, can we fit batch effect into the DIABLO model?

Looking forward to your reply!

Best regards,
Yulu

Hi again @yuluc,

I am glad to hear it helped you. I suggest you try to do it again in the reverse order: perf first and then tune.splsda. The reason why you would do this, is to identify the “optimal” number of components for tuning, and then tune it to find out how many variables to keep on each component. Try this:

MyPLSDA <- plsda(X, Y, ncomp = 10)
set.seed(1234)
MyPerf ← perf(MyPLSDA, validation = “Mfold”, folds = 3, nrepeat = 100, progressBar = TRUE)
MyPerf$choice.ncomp ## Choose the appropriate distance and number of components

After chosing the appropriate distance and number of components, you then tune the model:

list.keepX <- c(3:9,  seq(10, 100, 5))
tunePLSDA <- tune(X, Y, ncomp = ?,  test.keepX = list.keepX, validation = 'Mfold', folds = 3, progressBar = TRUE, dist = '?', measure = "overall", nrepeat = 100, cpus = 2) ## Note that i have increased the number of repeats.

ncomp <- tunePLSDA$choice.ncomp$ncomp
select.keepX <- tunePLSDA$choice.keepX[1:ncomp]
select.keepX
plot(tunePLSDA)

MySPLSDA <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX) ## Your final model
plotIndiv(MySPLSDA, group = Y, ind.names = FALSE,  ellipse = TRUE, legend = TRUE)

Let me know if this changed anything. If you still get the same results, it might be indicative, that your two phenotypes are very similar, and cannot be discriminated from each other using sPLS-DA.

Regarding correction of batch effects in DIABLO: I don’t think this has been developed yet, but I am not 100% sure.

  • Christopher

Hi Christopher,

Really thank you for your answer!
This time I run based on your suggestion.

  1. Run perf
    MyPerf.plsda ← perf(srbct.plsda, validation = “Mfold”, folds = 3,
    progressBar = FALSE, nrepeat = 10) # I choose 10 because the large dataset
    MyPerf.plsda$choice.ncomp
    max.dist centroids.dist mahalanobis.dist
    overall 1 1 1
    BER 1 1 1
    I would like to know whether the appropriate number of components is 1.
  2. tune the model
    set.seed(30)
    list.keepX ← c(3:9, seq(10,100,5))
    tune.splsda.srbct ← tune.splsda(x, y, ncomp = 1, # we suggest to push ncomp a bit more, e.g. 4
    validation = ‘Mfold’,
    folds = 3, dist = ‘mahalanobis.dist’, progressBar = FALSE,
    measure = “overall”, test.keepX = list.keepX,
    nrepeat = 10)
    ncomp ← tune.splsda.srbct$choice.ncomp

ncomp
$ncomp
NULL

$values
NULL

The number of components after tune the model shows NULL and the error rate is shown below:
comp1
3 0.4662679
4 0.4678230
5 0.4651914
6 0.4617225
7 0.4599282
8 0.4575359
9 0.4565789
10 0.4556220
15 0.4569378
20 0.4570574
25 0.4556220
30 0.4565789
35 0.4555024
40 0.4552632
45 0.4546651
50 0.4523923
55 0.4522727
60 0.4532297
65 0.4543062
70 0.4543062
75 0.4541866
80 0.4534689
85 0.4534689
90 0.4546651
95 0.4558612
100 0.4562201

I think the results get worse and weird. Can you explain a little bit?

Best regards,
Yulu

Hi @yuluc,

Regarding your first question: This does not mean that the number of components should be equal to 1, but it is a good indication. The rule of thumb as to how many components to retain is number of phenotypes - 1. Thus, in this case where you have 2 phenotypes, we would actually expect 1 component to be enough. However, it is a good idea to test a few extra components, or to add an extra component for visualization purposes (plotIndiv, etc.). Based on your previous results, I am actually a little surprised, that only 1 component is retained. Try to increase the folds to 5 again and increase nrepeat to at least 50. I know that it sounds like a lot, but it is really worth it in terms of reproducibility and “precision” of the model.

The second question: Why is ncomp = NULL? The $choice.ncomp output is based on T-tests (Do we get a reduction in error rate?). Since you only tune 1 component, a t-test cannot be conducted. If you re-run the tuning with 2 components or more, it will solve the “problem”.

If you want to, we can arrange a remote session and look at it together.

  • Christopher

Hi Christopher,

That’s would be great if we can arrange a remote session. Can you tell me how to arrange the remote session? I am in the middle of U.S. The time is different compared with you. Please tell me your local time when you schedule the remote session. Thank you very much!

Looking forward to your reply!

Best regards,
Yulu

hi @yuluc,
we are currently implementing this method as a side package: https://www.biorxiv.org/content/10.1101/2020.10.27.358283v1 to correct for batch effects, before you can input in any analysis method, but it requires that your treatment / biological variation is quite strong compared to the batch effect. From the sample plot you showed it seems that your treatment variation is not (very) strong.

Kim-Anh

Hi Kim-Anh,

Thank you for your clarify! You mean if my biological variation is not really strong, it’s not very recommended to correct for batch effect if I understand correct. Please correct me if I understand is wrong.

Some additional questions:1. whether the MixOmics package can training the model based on artificial selected training data and then calculate the accuracy of the testing data by using the model from the training data. 2. where can I find the overall misclassification error rate and balanced error rate equation?

Looking forward to your reply!

Best regards,
Yulu

Hi @christoa

That’s would be great if we can arrange a remote session. Can you tell me how to arrange the remote session?

Looking forward to your reply!

Best regards,
Yulu

@yuluc, I have sent you a private message :slight_smile: