Does it make sense to use design "null" to find biomarkers to discriminate between two groups?

Hi!

I am analyzing 5 blocks of data measured across the same samples with DIABLO. I have seen that, if I don’t specify a design matrix or if I use design=c('full'), the samples do not group properly when I plot them, for example with plotArrow(): samples from groupA are not well differentiated from samples from groupB. However, when using design=c('null'), samples from groupA are clearly different from samples from groupB.

From what I understand, if I indicate a ‘null’ design, the algorithm is going to find correlations with Y, but not correlations among my blocks of data. Am I correct? If this is the case, does it make sense to specify a ‘null’ design in order to find biomarkers that differentiate these two groups? I ask because I initially thought that looking for correlations between the blocks of data was more interesting.

Thanks!

That’s quite an interesting observation. I can’t say what might be causing that.

"... means that the algorithm is going to find correlations with Y, but not correlations among my blocks of data. Am I correct?"

Partially. sPLSDA (and by extension, DIABLO) constructs the components to maximise the covariance between equivalent components in different blocks (e.g. the first component of every X block and of Y is built such that their covariance is maximised). However, we can control the degree to which this maximisation is prioritised, or weighted, for each pair of blocks. Setting the design between two blocks (or between all block pairs) to null means the model will not take the covariance between those blocks into account when constructing components.

So when you set design to null, the components are constructed in a block-independent way, such that each is built to discriminate your classes as well as possible. If you want to examine the relationships between your blocks, I don’t think you can set design to null. Maybe try setting it between 0.1 and 0.3 to see what the results are like?
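A minimal sketch of how the intermediate designs could be built for comparison, in base R. The helper name make_design and the block names A to E are hypothetical, standing in for the five blocks in the question. The commented lines at the end assume X, Y and list.keepX already exist in your own session.

```r
# Build a DIABLO-style design matrix: one row/column per X block,
# `weight` on every off-diagonal entry and 0 on the diagonal.
make_design <- function(block_names, weight) {
  stopifnot(is.numeric(weight), length(weight) == 1, weight >= 0, weight <= 1)
  n <- length(block_names)
  design <- matrix(weight, nrow = n, ncol = n,
                   dimnames = list(block_names, block_names))
  diag(design) <- 0
  design
}

# Hypothetical block names standing in for the five blocks in the question.
blocks <- c("A", "B", "C", "D", "E")

# Weights to compare: 0 behaves like design = 'null', 1 like design = 'full'.
weights <- c(0, 0.05, 0.1, 0.3, 1)
designs <- lapply(weights, function(w) make_design(blocks, w))
names(designs) <- paste0("w", weights)

print(designs$w0.1)

# With mixOmics loaded, and X (named list of blocks), Y (class factor) and
# list.keepX defined as in your own analysis, each design could be tried with:
# fit <- block.splsda(X, Y, ncomp = 2, keepX = list.keepX, design = designs$w0.1)
# plotIndiv(fit, legend = TRUE)
```

Refitting with each entry of designs and looking at the sample plots side by side should show at which weight the group separation starts to degrade.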

Hi! I tried different values, and setting the values in the design matrix to 0.05 is the point where I start to see a clear differentiation between the groups.

I am confused because, when I calculate the correlation between two blocks individually, I get high values, for example:

library(mixOmics)
pls.resAB <- pls(X$A, X$B, ncomp = 3)
# pls() returns its variates as $X and $Y, whatever the inputs were called
AB <- mean(diag(cor(pls.resAB$variates$X, pls.resAB$variates$Y)))
AB
# 0.9538261

I first tried filling the design matrix for DIABLO with the corresponding correlation for each pair of blocks, but any value above 0.05 gives poor discrimination between the groups. What do you think I should do? My interest is in finding biomarkers that could be used in the future to predict which of the groups a patient belongs to. I would like to integrate the blocks of data, but the discrimination aspect is key to me.

Note: I based this code on this thread (https://mixomics-users.discourse.group/t/choosing-diablo-design-matrix/204?u=jeni)
Thanks!

I think you have somewhat answered your own question. If discrimination is of greater importance to you, then restrict the degree to which you integrate the blocks in favour of maximising classification accuracy. I’d suggest leaving the design non-zero (at 0.05) so that you can still make use of functions like cim() and network() to draw some conclusions about the data’s structure.
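One way to put numbers on that trade-off is to compare cross-validated error rates across a few design weights. The sketch below is only illustrative. It uses the breast.TCGA example data shipped with mixOmics as a stand-in for your blocks, and block.plsda() (no variable selection) for brevity; you would swap in block.splsda() with your own keepX. The weights, folds and nrepeat values are arbitrary choices.

```r
# Design weights to compare; 0 corresponds to 'null' and 1 to 'full'.
weights <- c(0, 0.05, 0.3, 1)
error_rates <- list()

# Only runs when mixOmics is installed.
if (requireNamespace("mixOmics", quietly = TRUE)) {
  library(mixOmics)
  data(breast.TCGA)  # example data, a stand-in for your own five blocks
  X <- list(mirna   = breast.TCGA$data.train$mirna,
            mrna    = breast.TCGA$data.train$mrna,
            protein = breast.TCGA$data.train$protein)
  Y <- breast.TCGA$data.train$subtype

  set.seed(42)  # the cross-validation folds are drawn at random
  for (w in weights) {
    # Same weight between every pair of X blocks, 0 on the diagonal.
    design <- matrix(w, nrow = length(X), ncol = length(X),
                     dimnames = list(names(X), names(X)))
    diag(design) <- 0

    fit <- block.plsda(X, Y, ncomp = 2, design = design)
    cv  <- perf(fit, validation = "Mfold", folds = 5, nrepeat = 3)

    # Error rates of the weighted-vote prediction, per distance and component.
    error_rates[[as.character(w)]] <- cv$WeightedVote.error.rate
  }
  print(error_rates)
}
```

If the error rate at 0.05 is close to the one at 0, keeping the small non-zero weight costs little in discrimination while retaining some between-block structure.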
