Block.plsda: X block names error and data format?

Hello,

I’m interested in using block.plsda and DIABLO to integrate RNAseq, ATACseq, 16S, and morphological data. When I try to run the block.plsda step, I get the following error:

Error in Check.entry.wrapper.mint.block(X = X, Y = Y, indY = indY, ncomp = ncomp,  : 
Each block of 'X' must have a unique name.

My data frames of sequencing results are formatted so that the row names are the samples (and in the same order) and the columns are the gene names/OTUs (with unique identifiers added to account for some duplicated gene names).

I’m running:

X <- list(rna, atac, bacseq)
Y <- morph$Length.measurement_.micron.

result.diablo.MD <- block.plsda(X, Y)

where the rna, atac, and bacseq inputs are counts of each gene/OTU and Y is a list of the body length of each sample.

Any insight on where I’m going wrong with sample input? Thanks!

hi @atan

Have you checked that your column names in each of the data sets are unique?

i.e. check that length(unique(colnames(rna))) equals ncol(rna) for each data set. There might still be some duplicated names somewhere.
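A minimal sketch of that check in base R, using a made-up toy matrix (`rna_toy` and its gene names are illustrative, not the poster's data). It reports whether the column names are all distinct and, if not, which ones repeat:

```r
# Toy stand-in for one block: samples in rows, features in columns.
# 'rna_toy' and its contents are invented for illustration only.
rna_toy <- matrix(
  1:12, nrow = 3,
  dimnames = list(paste0("sample", 1:3),
                  c("geneA", "geneB", "geneA", "geneC"))
)

# anyDuplicated() returns 0 when every name is distinct
all_unique <- anyDuplicated(colnames(rna_toy)) == 0

# List the names that occur more than once, if any
dup_names <- unique(colnames(rna_toy)[duplicated(colnames(rna_toy))])

print(all_unique)
print(dup_names)
```

The same two lines can be repeated for each data set in turn.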

Have you run each data set on its own with sPLSDA? If you get the same error for only one data set, that might point you to the issue (plus, we recommend analysing each data set individually first for a better understanding of your data).

Kim-Anh

Hi Kim-Anh,

I’ve double checked the column names for each set and they’re all unique. Running the sPLSDA works fine for each data set individually!

I was also able to run the block.plsda on the breast cancer TCGA dataset without errors.

Am I missing something about how the X sets need to be distinguished from each other? The RNAseq and ATACseq sets use the same gene name formats, would that be a problem? The Y set has a single value for each sample because each sample was generated from a pool of larvae. Should that set be provided in some other way (like a mean value across treatment groups that’s listed for each sample)?

hi @atan

Difficult for me to say, but it’s possible that the error comes from the fact that your colnames are repeated across the sets. It would be better to name them, for example r_genename and a_genename for the RNA-seq and ATAC-seq.
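A rough sketch of the renaming, in base R with invented toy matrices (`rna_toy`, `atac_toy` and `X_toy` are illustrative names only). It also names each element of the list. This is an assumption not confirmed in this thread: the error message mentions the name of each block, which may refer to names(X) rather than to the column names, and list(rna, atac, bacseq) leaves names(X) empty.

```r
# Toy blocks that share gene names; all objects here are invented.
samples  <- paste0("sample", 1:3)
rna_toy  <- matrix(1:6,  nrow = 3, dimnames = list(samples, c("geneA", "geneB")))
atac_toy <- matrix(7:12, nrow = 3, dimnames = list(samples, c("geneA", "geneB")))

# Prefix feature names so they are unique across blocks
colnames(rna_toy)  <- paste0("r_", colnames(rna_toy))
colnames(atac_toy) <- paste0("a_", colnames(atac_toy))

# Assumption: give every block its own name in the list
X_toy <- list(rna = rna_toy, atac = atac_toy)

# No feature name should now appear in both blocks
shared <- intersect(colnames(X_toy$rna), colnames(X_toy$atac))

# Block names should exist and be distinct
block_names_ok <- !is.null(names(X_toy)) && anyDuplicated(names(X_toy)) == 0

print(shared)
print(block_names_ok)
```

With the real data the equivalent would be a named list such as list(rna = rna, atac = atac, bacseq = bacseq) passed as X.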

Kim-Anh