I am trying to integrate lipidomics and proteomics data from a case-control study (15 cases and 16 controls). My goal is to obtain a signature for the condition of interest. I have seen that for two omics datasets the usual approach is PLS or sPLS. However, I want to include the group information in addition to the lipidomic and proteomic data. Is it therefore possible to use DIABLO (block.splsda) in this specific case? If so, I have a few more questions:
So far, I have performed a two-block PLS for exploratory purposes with the lipidomic and proteomic data in canonical mode (as neither dataset is directly dependent on the other). The correlation between the blocks on the first component was 0.71. Should I use a null or a full design matrix for DIABLO?
I tried to tune the number of variables to select from each block with tune.block.splsda (I have an idea of how many to select from the single-omics analyses, but I wanted to test some combinations to see whether a different number works better in this integrated setting). However, for the models with 1 and 2 components, the standard deviation of the error rate is always 0.
So far, I have performed a two-block PLS for exploratory purposes with the lipidomic and proteomic data in canonical mode (as neither dataset is directly dependent on the other). The correlation between the blocks on the first component was 0.71. Should I use a null or a full design matrix for DIABLO?
Yes, this is a good approach. Just bear in mind that the higher the design value, the less discrimination you might be able to achieve (there is a trade-off between maximising the correlation between blocks and maximising the discrimination between groups).
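For two blocks, the two extremes look like this (a quick base-R sketch; the block names are just examples):

```r
blocks <- c("lipidomics", "proteomics")

# Null design: no correlation constraint between blocks, so the model
# focuses on discriminating the groups
design.null <- matrix(0, nrow = 2, ncol = 2,
                      dimnames = list(blocks, blocks))

# Full design: off-diagonal 1s push the model to maximise the
# correlation between blocks, at some cost to discrimination
design.full <- matrix(1, nrow = 2, ncol = 2,
                      dimnames = list(blocks, blocks))
diag(design.full) <- 0
```

Any value in between (e.g. 0.1 or 0.5) weights the compromise between the two objectives.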
I tried to tune the number of variables to select from each block with tune.block.splsda (I have an idea of how many to select from the single-omics analyses, but I wanted to test some combinations to see whether a different number works better in this integrated setting). However, for the models with 1 and 2 components, the standard deviation of the error rate is always 0.
For the SD, are you doing repeated cross-validation? Or do you mean just the classification error rate? For the latter, while it might be unusual, it can happen that you only need a few variables to achieve perfect separation.
Aside from this, I’ve got some difficulties understanding the meaning of the loading signs, especially in the loading plot (with plotLoadings): what does it mean when a variable has a given loading and is assigned to one group? For example, the loading value is -0.5 and the variable is assigned to the Ctrl group.
OK, so would it then be a good idea to use the correlation value as the value of the design matrix? Or should I base this decision on something else?
It can be based either on your own priorities (i.e. what matters most to you: discriminating the groups, or maximising the correlation between blocks?), or on your previous analysis (which indicates a somewhat strong correlation of 0.7).
This is the code I’m using to tune the number of variables. The error.rate.sd is always 0.
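In essence it looks like this (a minimal sketch; the object names and the keepX grid are placeholders, not my actual data):

```r
library(mixOmics)

# Two blocks with the 31 samples in rows and the features in columns
X <- list(lipidomics = lipid_data, proteomics = prot_data)
Y <- group  # factor with levels "Case" / "Ctrl"

# Design matrix with the PLS correlation (0.7) on the off-diagonal
design <- matrix(0.7, nrow = 2, ncol = 2,
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

# Grid of candidate numbers of variables to keep per block
test.keepX <- list(lipidomics = c(5, 10, 15, 20),
                   proteomics = c(5, 10, 15, 20))

tune.res <- tune.block.splsda(X, Y, ncomp = 2, design = design,
                              test.keepX = test.keepX,
                              validation = "Mfold", folds = 10,
                              nrepeat = 10)
tune.res$error.rate.sd  # comes back as all zeros
```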
For 31 samples I’d advise 5-fold cross-validation rather than 10-fold. It’s quite unusual to have an SD of 0 (it means that whichever folds the cross-validation draws, you get exactly the same classification error rate). I recommend nrepeat = 50 (or 10, depending on your compute time).
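In the sketch above, that would amount to changing the cross-validation settings, e.g.:

```r
# With 31 samples, 5-fold CV leaves ~6 samples per fold; repeating the
# CV 50 times gives a meaningful spread (and hence SD) across repeats
tune.res <- tune.block.splsda(X, Y, ncomp = 2, design = design,
                              test.keepX = test.keepX,
                              folds = 5, nrepeat = 50)
```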
For plotLoadings, you can look at either our vignette or the examples on our website, where we give quite detailed explanations, or at this discussion forum, as it’s a frequent question!
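In short, a minimal sketch along the lines of the objects above (the keepX values are illustrative):

```r
# Final model with the tuned numbers of variables per block
final <- block.splsda(X, Y, ncomp = 2, design = design,
                      keepX = list(lipidomics = 10, proteomics = 15))

# Each bar is a selected variable's loading weight on component 1; the
# colour marks the group in which that variable has its maximal median
# value (contrib = "max", method = "median"). So a loading of -0.5
# coloured "Ctrl" means the variable contributes negatively to the
# component and is, on average, highest in the control samples.
plotLoadings(final, comp = 1, contrib = "max", method = "median")
```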