Identifying the data set that best segregates samples, interpreting loading values across data sets, and error rate per data set

Hi all,

I am considering using DIABLO for a multi-omics analysis of a dataset with 3 data types, for which I have a total of 41 individuals distributed across three groups. The DIABLO framework seems ideal for what I want to do. However, one of the questions I would like to answer is which of the three data types better segregates the samples across the three treatment groups / better predicts group membership. Looking at the vignette (https://mixomicsteam.github.io/Bookdown/diablo.html), it seemed to me that this might be possible using the AUROC curves. For example, would it make sense to compare the AUROC curves for component 1 between the mRNA and proteomic data?

Related to this, can the loadings across different data types be directly compared? E.g. if on component 1 my top loading for mRNA is 0.8 and for proteomics it is 0.3, can we say the top loading is higher for mRNA?

Thanks for any help

hi @rramiro,

We don't recommend using the AUROC to conclude which data set gives the best segregation. The perf() function outputs classification error rates per class as well as per data set (see ?perf and the output $error.rate). These performance measures are preferable to the AUROC, which is often inflated (even with cross-validation) and not entirely appropriate to reflect the performance of DIABLO.
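If it helps, here is a minimal sketch of how to pull these outputs out of perf() (the model name model.diablo and the cross-validation settings below are placeholders, not taken from your data; check ?perf and str() for the exact slot names in your mixOmics version):

library(mixOmics)

# repeated cross-validation of a fitted DIABLO (block.splsda) model
perf.diablo <- perf(model.diablo, validation = 'Mfold',
                    folds = 10, nrepeat = 50, dist = 'centroids.dist')

# overall classification error rate, per data set and per component
perf.diablo$error.rate

# inspect the full output to locate the per-class error rates
str(perf.diablo, max.level = 1)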

Regarding your question about the loadings, the answer is no: loading values are not comparable across data sets, as their values depend on the number of features per data set and the number of features selected per data set. We recommend you focus on the top features, whatever their loading values.
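In case it is useful, a minimal sketch of how the top features can be extracted per data set (assuming a fitted DIABLO model named model.diablo with a block called 'mRNA'; both names are placeholders, and the exact output structure can differ slightly between mixOmics versions, so check with str()):

# selected features and their loadings on component 1, listed per block
sel.comp1 <- selectVar(model.diablo, comp = 1)
sel.comp1$mRNA$name    # names of the selected mRNA features
sel.comp1$mRNA$value   # their loadings (not comparable across blocks)

# graphical equivalent, one panel per data set
plotLoadings(model.diablo, comp = 1, contrib = 'max', method = 'median')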

Kim-Anh

Hi @kimanh.lecao,

Thanks a lot for your reply. Could you please explain the values returned for the error rate?

I have run the following:

# assemble the three data sets and the outcome
X <- list(A = a_mat, 
          B = b_mat, 
          C = c_mat)
Y <- metadata$diet

# number of features to keep per data set and component, from tuning
list.keepX <- tune.keepX.diablo$choice.keepX

# full design: all data sets connected, no self-connections
design <- matrix(1, ncol = length(X), nrow = length(X), 
                 dimnames = list(names(X), names(X)))
diag(design) <- 0

# final DIABLO model
final.diablo <- block.splsda(X = X, Y = Y, ncomp = 3, 
                             keepX = list.keepX, design = design)

# performance assessment with repeated cross-validation
perf.final.diablo <- perf(final.diablo, validation = 'Mfold', dist = 'centroids.dist',
                          folds = 10, nrepeat = 50, progressBar = TRUE, auc = TRUE)

perf.final.diablo$error.rate

which returns

$A
      centroids.dist
comp1      0.3746341
comp2      0.2770732
comp3      0.2692683

$B
      centroids.dist
comp1      0.5063415
comp2      0.3541463
comp3      0.3443902

$C
      centroids.dist
comp1      0.3590244
comp2      0.2560976
comp3      0.2663415 

How should I interpret these values? E.g. can I say that, based on dataset A on comp1, I get an erroneous classification for 37% of the cases? Moreover, would you recommend any approach to statistically compare the error rates for the same component across datasets? (I guess I could compare $error.rate +/- $error.rate.sd, but I am wondering if there is a better approach.)

Best,

Ramiro

hi @rramiro,

How should I interpret these values? e.g. can i say that based on dataset A on comp1, I get an erroneous classification on 37% of the cases?
Yes, your interpretation is correct: 37% of samples are misclassified based on the component associated with the A data set on comp 1, with this distance.
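As a rough check, with your 41 samples an error rate of 0.3746 corresponds to about 15 samples misclassified on average across the cross-validation folds and repeats.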

Moreover, would you recommend any approach to statistically compare the error rates for the same component, across datasets? (I guess I could compare $error.rate +/- $error.rate.sd , but I am wondering if there is a better approach)
I would say they are comparable. Remember that the model fits sets of components associated with each data set so that the covariance between them is maximised (see our website for many resources, including the webinar and articles), so the components are ‘comparable’ to each other across data sets. This output just gives you more insight into which data set is more discriminative than the others.
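If useful, here is one way to line up the error rates with their standard deviations (a sketch only; it uses the object names from your code above and assumes each element of $error.rate and $error.rate.sd is a matrix with components in rows and the distance in the column, as in your output):

# mean error rate and SD per data set (columns) and component (rows)
err    <- sapply(perf.final.diablo$error.rate,    function(m) m[, 'centroids.dist'])
err.sd <- sapply(perf.final.diablo$error.rate.sd, function(m) m[, 'centroids.dist'])

# compare data sets on component 1: mean error +/- one SD
data.frame(dataset     = colnames(err),
           error.comp1 = err['comp1', ],
           sd.comp1    = err.sd['comp1', ])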

Kim-Anh