Hi Team,

First of all, I wish you a happy new year.

I have built two models, using DIABLO and MINT (with the breast.TCGA and stemcells datasets), and I have the same question for both. I used the predict function on a test set:

pred <- stats::predict(DIABLO, datatestDIABLO)

Could you tell me how you generated the pred$predict values, please? Could you also tell me what the pred$B.hat values mean?

Thanks a lot.

Lionel

Apologies for such a belated reply @lpanneel.

First, I’ll quickly outline the basics of DIABLO so we’re all on the same page. You hand an empty model a set of training data. From this, it yields a set of loading values for each input feature (across all matching datasets). These loading values define the components and are chosen such that they maximise the covariance between each pair of datasets (as well as each predictor dataset with the outcome).

Now, we get to `predict()`. Via this function, we provide the model our test data. Using the fitted model from above, a variate (a linear combination of the input features weighted by the associated loading values) is produced for each test sample. These variates then allow the model to produce dummy variables, which essentially denote the likelihood that a given sample belongs to a given class. These are calculated via linear-model algebra.
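To make that concrete, here is a minimal base-R sketch of the idea on toy data. This is a single-component, single-block analogue of the procedure, not the exact mixOmics internals: the loading vector is taken from the dominant singular direction of the X–Y cross-product (the PLS-style covariance criterion), and all data here are made up.

```r
set.seed(42)
# Toy training data: 10 samples, 5 features, 2 classes (dummy-coded outcome)
X <- matrix(rnorm(50), nrow = 10)
Y <- cbind(classA = rep(c(1, 0), each = 5),
           classB = rep(c(0, 1), each = 5))

# One-component "model": loading vector w maximising covariance with Y
w <- svd(crossprod(X, Y))$u[, 1]

# Variate (score) for each training sample, then regress Y on it
t_train <- X %*% w
B <- solve(crossprod(t_train), crossprod(t_train, Y))  # regression coefficients

# Prediction for new samples: project onto w, then apply B
X_new <- matrix(rnorm(10), nrow = 2)
Y_hat <- (X_new %*% w) %*% B  # analogous in spirit to pred$predict
Y_hat
```

The rows of `Y_hat` are the dummy-variable scores per test sample; the class with the highest score would be the predicted class under a max-distance rule.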

The `B.hat` component represents the regression coefficient estimates for each feature within each dataset.

**This page** may provide some more insight into your questions. Hope this response was helpful.

Cheers,

Max.


Hi,

Many thanks again for this great forum!! Following up on the above question and answer, would you be able to help with some closely related questions about Y-hat scores (i.e. `pred$predict`) and their interpretation?

As reference, here is the Y-hat definition from the supplementary info of the mixOmics paper:

Y_new = X_new W (D^T W)^(-1) B

where W, D and B are derived from the X and Y training data sets. W is a P × H matrix containing the loading vectors associated with X, D is a P × H matrix containing the regression coefficients of X on its H latent components, and B is an H × K matrix containing the regression coefficients of Y on the H latent components associated with X. Therefore, Y_new is the prediction from a multivariable (several columns) multivariate model.
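For what it's worth, that definition can be typed out almost literally in base R with toy matrices. Everything below is illustrative — random matrices with the stated dimensions, not values from a fitted model — just to show how W, D and B combine:

```r
set.seed(1)
P <- 6; H <- 2; K <- 3; n_new <- 4  # features, components, outcome columns, test samples

X_new <- matrix(rnorm(n_new * P), n_new, P)
W <- matrix(rnorm(P * H), P, H)  # loading vectors associated with X
D <- matrix(rnorm(P * H), P, H)  # regression coefficients of X on its H components
B <- matrix(rnorm(H * K), H, K)  # regression coefficients of Y on the H components

# Y_new = X_new W (D^T W)^(-1) B
Y_hat <- X_new %*% W %*% solve(t(D) %*% W) %*% B
dim(Y_hat)  # one row per new sample, one column per outcome class
```

Note the middle factor `solve(t(D) %*% W)`: it converts the raw projections onto W into the "corrected" variates, which is why D appears in the formula at all.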

My questions:

- Can you explain what `D` is in the above formula?
- Is it true to say that Y-hat is predicted using a simple linear regression model with Y as response and the latent variates as predictors? And that there is one linear regression per omic? (assuming I’m running DIABLO with >1 omics).
- The `pred$WeightedPredict` combines Y-hat predictions over all omics, weighted by how much each omic's latent variables correlate with the outcome?
- AUC calculations are based on these Y-hat scores, but all other metrics (error rate, BER, …) that consider actual class predictions are based on distances and do not use the Y-hat scores at all? (and therefore neither the regression coefficients B)

I hope I’m not mixing things up… Many thanks in advance, I greatly appreciate your responsiveness and help!

Efrat

Many thanks again.

I think my confusion stems from unclarity about the terminological differences between “loadings”/“weights”/“regression coefficients” in the context of the multi-block PLS methods. While loadings determine how each latent variable is calculated from the original features, I’m not sure when/how you use the other terms. If you have any reading recommendations specifically clarifying the terminology (I’ve read through the mixOmics and DIABLO papers and supplementary material, of course), please let me know. Thanks!!

Unfortunately, `mixOmics` documentation uses slightly different terminology to other packages and researchers. I’ve met other developers who would use the term “weights” for what we describe as “loadings”. As far as `mixOmics` is concerned:

- “loadings”: the RELATIVE final contribution of each input feature after some processing and regularisation
- “weights”: can be thought of as the “raw” loading values, i.e. the specific coefficient values used to create the linear combination resulting in each component
- “regression coefficients”: these are used as part of the model generation process. In an iterative manner, the components (and their loadings) are refined by using repeated regressions. These influence the final value of the above two terms.

Thanks again Max. That’s helpful!