Choice of DIABLO design

egc · June 20, 2024, 4:35pm

Hello,

I am trying to integrate lipidomics and proteomics data from a case-control study (with 15 cases and 16 controls). My goal is to obtain a signature composition for the condition of interest. I have seen that for two omics datasets, the approach is usually PLS or sPLS. However, I want to include the group information, in addition to the lipidomic and proteomic data. Therefore, is it possible to use DIABLO (block.splsda) in this specific case? If so, I have a few more questions:

1. So far, I have performed a 2PLS for exploratory purposes with the lipidomic and proteomic data in canonical mode (as neither is directly dependent on the other). The correlation between them in the 1st component was 0.71. Should I use a null or a complete matrix for the DIABLO?
2. I tried to tune the number of variables to choose from each block of data with tune.block.splsda (I have an idea of how many to select from the single omics analysis, but I wanted to check some combinations to see if a different number is better in this approach). However, in the model with 1 and 2 components, the error rate standard deviation is always 0.

Thank you!!

kimanh.lecao · June 27, 2024, 10:35pm

Hi @egc

So far, I have performed a 2PLS for exploratory purposes with the lipidomic and proteomic data in canonical mode (as neither is directly dependent on the other). The correlation between them in the 1st component was 0.71. Should I use a null or a complete matrix for the DIABLO?

Yes, this is a good approach. Just bear in mind that the higher the value in the design matrix, the less discrimination you might be able to achieve (there is a trade-off between correlating the blocks and discriminating the groups).

I tried to tune the number of variables to choose from each block of data with tune.block.splsda (I have an idea of how many to select from the single omics analysis, but I wanted to check some combinations to see if a different number is better in this approach). However, in the model with 1 and 2 components, the error rate standard deviation is always 0.

For the sd, are you doing repeated cross-validation? Or do you mean just classification error rate? For the latter, while this might be unusual, it can happen that you only need a few variables to have a perfect separation.

Kim-Anh

egc · August 7, 2024, 5:32pm

Hi, thank you for the answers!

Ok, so then would it be a good idea to use the correlation value as the value of the design matrix? Or should I guide this decision otherwise?

This is the code I’m using to tune the number of variables. The error.rate.sd is always 0.


  tune <- tune.block.splsda(X, Y, ncomp = 1, 
                            test.keepX = list("Proteomic" = c(1:20), 
                                              "Lipidomic" = c(25, 50, 75, 100)),
                            validation = "Mfold", folds = 10, 
                            measure = "BER", design = design, tol = 1e-05, 
                            max.iter = 40)

Aside from this, I’m having some difficulty understanding the meaning of the loading signs. Especially in the loading plot (with plotLoadings), what does it mean that a variable has a loading and is assigned to one group? For example, the value of the loading is -0.5 and it’s assigned to the Ctrl group.

Again, thanks,
Elena.

kimanh.lecao · August 8, 2024, 10:52pm

hi @egc,

kimanh.lecao:

Yes this is a good approach. Just bear in mind that the higher this value the less discrimination you might be able to achieve (because there is a trade-off to meet).

Ok, so then would it be a good idea to use the correlation value as the value of the design matrix? Or should I guide this decision otherwise?

You can base it either on your own assumption or priority (i.e. what is most important to you: to discriminate the groups, or to maximise the correlation between the blocks?), or on your previous analysis (which indicates a somewhat strong correlation of 0.7).
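To make the options concrete, here is a minimal sketch in base R of the three design matrices discussed in this thread (null, correlation-based, and full). The helper `make_design` is hypothetical, not a mixOmics function, and the block names are taken from the `test.keepX` list in the tuning code posted earlier. It assumes the design only describes links between the X blocks and that the link to the outcome is handled by the function itself.

```r
# Block names as used in the test.keepX list of the tuning call.
blocks <- c("Proteomic", "Lipidomic")

# Hypothetical helper (not part of mixOmics): square design matrix
# with a single shared off-diagonal weight between 0 and 1.
make_design <- function(block_names, weight) {
  stopifnot(weight >= 0, weight <= 1)
  d <- matrix(weight,
              nrow = length(block_names), ncol = length(block_names),
              dimnames = list(block_names, block_names))
  diag(d) <- 0  # a block is not linked to itself
  d
}

design_null <- make_design(blocks, 0)    # favours discrimination of the groups
design_cor  <- make_design(blocks, 0.7)  # weight taken from the exploratory PLS correlation
design_full <- make_design(blocks, 1)    # favours correlation between the blocks

print(design_cor)
```

Any of these could then be passed as the `design` argument; comparing the tuned error rates across two or three weights is one way to see how much discrimination is lost as the weight increases.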

kimanh.lecao:

For the sd, are you doing repeated cross-validation? Or do you mean just classification error rate? For the latter, while this might be unusual, it can happen that you only need a few variables to have a perfect separation.

This is the code I’m using to tune the number of variables. The error.rate.sd is always 0.

For 29 samples I’d advise you do 5-fold cross-validation rather than 10-fold. It’s quite unusual to have an SD of 0 (it means that whichever test fold you run in the cross-validation, you get exactly the same classification error rate). I recommend you set nrepeat = 50 (or 10, depending on your compute time).
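As a rough sketch of what that change could look like, the call below keeps the same arguments as the tuning code posted above but uses `folds = 5` and `nrepeat = 10`. The data are simulated stand-ins (an assumption, so the block can run on its own), the keepX grid is shortened only to keep the run time down, and the tuning step is skipped if mixOmics is not installed.

```r
# Simulated stand-in data (assumption): 31 samples, two blocks, two groups.
set.seed(1)
n <- 31
Y <- factor(rep(c("Case", "Ctrl"), times = c(15, 16)))

sim_block <- function(p, prefix) {
  matrix(rnorm(n * p), nrow = n,
         dimnames = list(paste0("s", seq_len(n)), paste0(prefix, seq_len(p))))
}
X <- list(Proteomic = sim_block(60, "prot"),
          Lipidomic = sim_block(120, "lip"))

# Design weight of 0.7, as discussed above; zero on the diagonal.
design <- matrix(0.7, 2, 2, dimnames = list(names(X), names(X)))
diag(design) <- 0

tune_args <- list(
  X = X, Y = Y, ncomp = 1,
  test.keepX = list(Proteomic = c(5, 10, 20),
                    Lipidomic = c(25, 50, 75, 100)),
  design = design,
  validation = "Mfold",
  folds = 5,      # about 6 test samples per fold with 31 samples
  nrepeat = 10,   # repeat the cross-validation with different fold splits
  measure = "BER"
)

# Run the tuning only when mixOmics is available.
if (requireNamespace("mixOmics", quietly = TRUE)) {
  tune <- do.call(mixOmics::tune.block.splsda, tune_args)
  print(tune$error.rate.sd)
}
```

With your real X, Y and design in place of the simulated objects, the repeated splits should give error.rate.sd something to vary over, unless the groups really are separated perfectly for every keepX combination.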

For plotLoadings, you can look at either our vignette or the examples on our website, as we give quite a few detailed explanations. You can also search this discussion forum, as it’s a frequent question!
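In short, and as far as the plotLoadings documentation describes it: the length and sign of a bar are the variable's loading weight on the component, while the colour (with contrib = "max" and method = "mean") shows the group in which that variable has the highest mean. So a loading of -0.5 coloured as Ctrl would mean the variable is, on average, more abundant in the controls. The base R sketch below mimics only the colour assignment on simulated data; the variable names are made up and it does not call mixOmics.

```r
# Simulated data (assumption): two variables, each elevated in one group.
set.seed(2)
group <- factor(rep(c("Case", "Ctrl"), times = c(15, 16)))
x <- cbind(
  lipidA = rnorm(31, mean = ifelse(group == "Ctrl", 2, 0)),  # higher in Ctrl
  lipidB = rnorm(31, mean = ifelse(group == "Case", 2, 0))   # higher in Case
)

# Mean of each variable within each group (rows = groups, columns = variables).
group_means <- apply(x, 2, function(v) tapply(v, group, mean))

# Group with the highest mean for each variable: the analogue of the bar colour.
contrib_max <- rownames(group_means)[apply(group_means, 2, which.max)]
names(contrib_max) <- colnames(x)

print(round(group_means, 2))
print(contrib_max)
```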

Kim-Anh
