Continuous response variable Y in DIABLO?

I am looking to integrate a large transcriptomic data set using DIABLO. I was just wondering, can my outcome data be continuous (such as age) or do I need to make it categorical? Thank you

Hi Santi,

Thanks for using mixOmics!

If you want to use a continuous Y, you can use block.spls with mode = "regression" and no variable selection on Y (that is, without setting keepY). You must provide Y (age) as a matrix with one column. See the example below:

library(mixOmics)
data("breast.TCGA")
# this is the X data as a list of mRNA and miRNA
X_block = list(mrna = breast.TCGA$data.train$mrna, mirna = breast.TCGA$data.train$mirna)
numeric_Y = as.matrix(breast.TCGA$data.train$protein[,1]) ## your "age" data goes here
dim(numeric_Y)
# set up a full design where every block is connected
design = matrix(1, ncol = length(X_block), nrow = length(X_block),
                dimnames = list(names(X_block), names(X_block)))
diag(design) =  0
# set number of components per data set
ncomp = 2
# set number of variables to select, per component and per data set (this is set arbitrarily)
list.keepX = list(mrna = rep(20, 2), mirna = rep(10,2))

TCGA.block.spls = block.spls(X = X_block, Y = numeric_Y, mode = "regression",
                             ncomp = ncomp, keepX = list.keepX, design = design)
TCGA.block.spls

With this approach, you assume that the predictors change continuously with the response (age) across its whole range. Make sure this assumption is valid: for example, an average person’s height and weight increase with age, but only up to a certain age.

As for DIABLO, using age directly as the response variable in a discriminant analysis is not advisable unless you derive relevant and distinct categories from it (baby, adult, elderly, etc.) and use those instead.
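Deriving such categories can be sketched in base R with cut(). This is only an illustration: the `age` vector and the cut-offs below are made up, and the thresholds should be chosen to make sense for your own study.

```r
# Hypothetical ages in years, one per sample (replace with your own data)
age <- c(0.5, 3, 17, 25, 42, 64, 65, 80)

# Arbitrary cut-offs, for illustration only
age_group <- cut(age,
                 breaks = c(0, 2, 18, 65, Inf),
                 labels = c("baby", "child", "adult", "elderly"),
                 right = FALSE)  # intervals are [lower, upper)

# Check how many samples fall in each class before modelling
table(age_group)
```

The resulting factor could then be supplied as Y to block.splsda instead of the numeric age. Very unbalanced or very small classes would be worth merging first.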

Hope it helps.

I would like to add to Al’s answer that choosing the optimal keepX values for block.spls is not straightforward, so we have not developed or implemented a tuning function for it (whereas one is implemented for block.splsda, a.k.a. DIABLO).

However, if you are only interested in a first-pass exploratory analysis (looking at the plots, identifying the top variables), setting your own keepX values may be enough. Let us know if you wish to go further in the analysis.

Kim-Anh

Hi, I have a question. If I were to use DIABLO with a continuous response, it would be best to use the “block.spls” function. Is there a recommended method, such as cross-validation, for choosing the optimal keepX values?

Hi @YEE99,

It’s a bit tricky (or at least the output of the analysis is not straightforward to interpret). You could use cross-validation and assess how well Y is predicted. However, we have only implemented the tuning function for sPLS so far, because I think the multi-block case requires further methodological development first (for some inspiration, you can have a look at predict / tune / perf).
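As a rough illustration of the cross-validation idea, here is a hand-rolled sketch, not a mixOmics tuning function. The helper `cv_rmse`, the choice of RMSE as the criterion, the use of the block-averaged prediction, and the two candidate keepX settings are all assumptions made for this example.

```r
library(mixOmics)
data("breast.TCGA")

# Same toy setup as the example earlier in the thread
X_block <- list(mrna = breast.TCGA$data.train$mrna,
                mirna = breast.TCGA$data.train$mirna)
numeric_Y <- as.matrix(breast.TCGA$data.train$protein[, 1])
design <- matrix(1, nrow = length(X_block), ncol = length(X_block),
                 dimnames = list(names(X_block), names(X_block)))
diag(design) <- 0

# Hypothetical helper: M-fold cross-validated RMSE for one keepX setting
cv_rmse <- function(X, Y, keepX, design, ncomp = 2, folds = 5, seed = 1) {
  set.seed(seed)
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(Y)))
  sq_err <- numeric(0)
  for (k in seq_len(folds)) {
    test <- fold_id == k
    X_train <- lapply(X, function(b) b[!test, , drop = FALSE])
    X_test  <- lapply(X, function(b) b[test, , drop = FALSE])
    fit <- block.spls(X = X_train, Y = Y[!test, , drop = FALSE],
                      mode = "regression", ncomp = ncomp,
                      keepX = keepX, design = design)
    pred <- predict(fit, newdata = X_test)
    # Prediction averaged over blocks, using all ncomp components
    y_hat <- pred$AveragedPredict[, 1, ncomp]
    sq_err <- c(sq_err, (Y[test, 1] - y_hat)^2)
  }
  sqrt(mean(sq_err))
}

# Two arbitrary candidate settings; a real grid would be larger
grid <- list(small = list(mrna = rep(10, 2), mirna = rep(5, 2)),
             large = list(mrna = rep(20, 2), mirna = rep(10, 2)))
cv_results <- sapply(grid, function(kx) cv_rmse(X_block, numeric_Y, kx, design))
cv_results
```

A lower RMSE suggests better prediction of Y for that keepX setting, but with a single random split into folds the differences can be noisy, so repeating the procedure with several seeds would be sensible before drawing conclusions.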

Kim-Anh