Maths: how did you generate the predict values?

Hi Team,

First of all, I wish you a happy new year.

I have built two models, using DIABLO and MINT (with the breast.TCGA and stemcells datasets), and I have the same question for both. I have used the predict function on a test set:

pred <- stats::predict(DIABLO, datatestDIABLO)

Could you tell me how you generated the pred$predict values, please? Could you also tell me what the pred$B.hat values mean?

Thanks a lot.
Lionel

Apologies for such a belated reply @lpanneel.

First, I’ll quickly outline the basics of DIABLO so we’re all on the same page. You hand an empty model a set of training data. From these data, the model yields a set of loading values for each input feature (across all matching datasets). These loading values define the components and are chosen such that they maximise the covariance between each pair of datasets (as well as between each predictor dataset and the outcome).

Now, we get to predict(). Via this function, we provide the model with our test data. Using the model described above, a variate (a linear combination of the input data weighted by the associated loading values) is produced for each test sample. These variates then allow the model to produce dummy variables, which essentially denote the likelihood that a given sample belongs to a given class; these are calculated via linear-model algebra.

The B.hat component represents the regression coefficient estimates for each feature within each dataset.
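
In case it helps to see the moving parts, here is a minimal sketch along the lines of the breast.TCGA example from your question (the block choices, keepX values and design matrix are arbitrary placeholders, not recommendations):

# Minimal DIABLO sketch on the breast.TCGA data shipped with mixOmics.
# The keepX values and design matrix below are arbitrary placeholders.
library(mixOmics)
data(breast.TCGA)

# Two training blocks and the outcome (tumour subtype)
X.train <- list(mirna = breast.TCGA$data.train$mirna,
                mrna  = breast.TCGA$data.train$mrna)
Y.train <- breast.TCGA$data.train$subtype

# A simple design matrix linking the two blocks
design <- matrix(0.1, nrow = 2, ncol = 2,
                 dimnames = list(names(X.train), names(X.train)))
diag(design) <- 0

# Fit DIABLO (block sPLS-DA) with 2 components and a small signature per block
DIABLO <- block.splsda(X.train, Y.train, ncomp = 2,
                       keepX = list(mirna = c(10, 10), mrna = c(10, 10)),
                       design = design)

# Predict on the matching test blocks
X.test <- list(mirna = breast.TCGA$data.test$mirna,
               mrna  = breast.TCGA$data.test$mrna)
pred <- predict(DIABLO, newdata = X.test)

str(pred$predict, max.level = 1)  # per-block dummy (Y-hat) predictions
str(pred$B.hat,   max.level = 1)  # per-block regression coefficient estimates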

This page may provide some more insight into your questions. Hope this response was helpful.

Cheers,
Max.


Hi,

Many thanks again for this great forum!! Following up on the above question and answer, would you be able to help with some closely related questions about Y-hat scores (i.e. pred$predict) and their interpretation?

As reference, here is the Y-hat definition from the supplementary info of the mixOmics paper:

Ŷ_new = X_new W (Dᵀ W)⁻¹ B

where W is a P × H matrix containing the loading vectors associated with X, D is a P × H matrix containing the regression coefficients of X on its H latent components, and B is an H × K matrix containing the regression coefficients of Y on the H latent components associated with X. Therefore, Y_new is the prediction from a multivariable (several columns) multivariate model.
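
Just to check that I am reading the dimensions right, here is a toy base-R sketch of that formula with made-up matrices (purely illustrative, not the actual mixOmics internals):

# Toy illustration of Y_new = X_new W (D^T W)^(-1) B with made-up numbers:
# n = 5 test samples, P = 4 features, H = 2 components, K = 3 classes.
set.seed(1)
n <- 5; P <- 4; H <- 2; K <- 3
X.new <- matrix(rnorm(n * P), n, P)  # new (scaled) data,                  n x P
W     <- matrix(rnorm(P * H), P, H)  # loading vectors of X,               P x H
D     <- matrix(rnorm(P * H), P, H)  # regression coefs of X on the comps, P x H
B     <- matrix(rnorm(H * K), H, K)  # regression coefs of Y on the comps, H x K

Y.hat <- X.new %*% W %*% solve(t(D) %*% W) %*% B
dim(Y.hat)  # 5 x 3: one continuous "dummy" score per sample and class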

My questions:

  1. Can you explain what D is in the above formula?
  2. Is it true to say that Y-hat is predicted using a simple linear regression model with Y as response and the latent variates as predictors? And that there is one linear regression per omic? (assuming I’m running DIABLO with >1 omics).
  3. The pred$WeightedPredict combines Y-hat predictions over all omics, weighted by how much each omic’s latent variables correlate with the outcome?
  4. AUC calculations are based on these Y-hat scores, but all other metrics (error rate, BER, …) that consider actual class predictions are based on distances and do not use the Y-hat scores at all? (and therefore neither do the regression coefficients B)

I hope I’m not mixing things up… Many thanks in advance, I greatly appreciate your responsiveness and help!

Efrat

G’day @efratmuller

  1. D (or sometimes referred to as P) is the cross product between the (training) X data frame and the latent components produced by the sPLS algorithm. As the excerpt describes, these are called the “regression coefficients” for the features in X. They contribute to determining the weight associated with each feature in order to make a prediction.
  2. That is correct! For each block of data (omic), there will be a unique set of loadings, weights and variates. Each block therefore has its own set of predictions made using that block’s information, and these per-block predictions are then combined. The combined class calls can be accessed via the MajorityVote and WeightedVote components of the output of predict().
  3. Exactly. We calculate the average correlation between each component (i.e. latent variable) of a given omic block and the components of the response variable. The intuition is that blocks whose components correlate more strongly with the response will aid prediction to a greater degree, hence we weight them more.
  4. While AUROC iterates over a range of thresholds, Y.hat remains constant. The threshold affects the actual class predictions that are made (resulting in unique specificity and sensitivity values at each threshold). ER & BER can be thought of as being locked at a single threshold, hence both Y.hat and the resulting class predictions are constant. So, to answer your question: AUROC does use Y.hat, and ER & BER do use class predictions. But remember that, in the case of ER & BER, those class predictions are made via a distance based on Y.hat (see the sketch after this list).
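
Here is a rough sketch of both of these points, continuing the breast.TCGA example from earlier in the thread; the slot and argument names are as I recall them from recent mixOmics versions, so please double-check against your installation:

# Rough sketch continuing the breast.TCGA example above (slot/argument names
# may differ slightly between mixOmics versions, so please double-check).
Y.test <- breast.TCGA$data.test$subtype

# Per-block weights used to combine the block-level predictions
pred$weights

# Distance-based class calls -> error rate / BER
class.pred <- pred$WeightedVote$centroids.dist[, 2]  # calls using 2 components
table(truth = Y.test, predicted = class.pred)        # confusion matrix feeding ER/BER

# Continuous Y-hat scores -> AUROC (computed from the dummy predictions)
auroc(DIABLO, roc.block = "mrna", roc.comp = 2)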

Many thanks again.
I think my confusion stems from the unclear distinction between “loadings”/“weights”/“regression coefficients” in the context of the multi-block PLS methods. While the loadings determine how each latent variable is calculated from the original features, I’m not sure when/how the other terms are used. If you have any reading recommendations that specifically clarify the terminology (I’ve read through the mixOmics and DIABLO papers and supplementary material, of course), please let me know :slight_smile: Thanks!!

Unfortunately, the mixOmics documentation uses slightly different terminology to other packages and researchers. I’ve met other developers who would use the term “weights” for what we describe as “loadings”. As far as mixOmics is concerned:

  • “loadings”: the RELATIVE final contribution of each input feature after some processing and regularisation
  • “weights”: can be thought of as the “raw” loading values, i.e. the specific coefficient values used to create the linear combination that results in each component
  • “regression coefficients”: these are used as part of the model-generation process. The components (and their loadings) are refined iteratively via repeated regressions, and these coefficients influence the final values of the two terms above.
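
If it helps, this is roughly where those quantities surface in a fitted model and its predictions, continuing the sketch above (slot names and structures may vary slightly between versions, so check str() on your own objects):

# Roughly where these quantities live, continuing the earlier sketch.
head(DIABLO$loadings$mrna)  # loading (weight) vectors, one column per component
head(DIABLO$variates$mrna)  # components: linear combinations of the (scaled) mrna block
str(pred$B.hat$mrna)        # regression coefficient estimates used for prediction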

Thanks again Max. That’s helpful!