Data preparation for PLSDA classification across multiple data sets

Hello mixOmics Team, thanks for this great package!

My question may be very basic, but I haven’t found any comments in your tutorial on data preparation / normalisation / scaling of the training and test data relative to each other before applying PLSDA.

I have a human gene expression (rnaseq) training data set with known classes (molecular subtypes) that I am using to train a PLSDA model for classification. Then I want to use the predict() function to classify the samples of various test data sets (also human gene expression data) into these molecular subtypes.

How do I need to prepare my training and test data? Do I need to normalise them together, scale or log transform them together to bring them to the same scale and then split them for the actual training and testing steps? Would I run the analysis on only those genes in common between the train and test data sets? What if the data sets have different units of measurement, e.g. rnaseq counts vs fpkm vs tpm? Would it be valid to apply the PLSDA model across these data sets or do the train and test data need to be in the same unit of measurement? What if my train and test data span across bulk-rnaseq, single.cell-rnaseq and microarray data? And how important is the pre-filtering of lowly expressed genes?

hi @rocanja,
Lots of questions!

How do I need to prepare my training and test data? Do I need to normalise them together, scale or log transform them together to bring them to the same scale and then split them for the actual training and testing steps? Would I run the analysis on only those genes in common between the train and test data sets?

If you normalise all your studies together, you introduce a bias in the analysis, because the test samples then influence how the training data are transformed. So we usually treat them as completely separate studies (I think this is your case, and it makes even more sense when you have RNA-seq vs microarray, etc.). You do then need to filter on the same genes.
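To make the "filter on the same genes" step concrete, here is a minimal sketch in plain Python. The gene names, values and the helper `restrict_to_common_genes` are made up for illustration; it assumes each study is stored as a mapping from gene name to per-sample values, and it leaves each study's values untouched so the studies stay separate.

```python
def restrict_to_common_genes(train, test):
    """Keep only genes present in both studies, in the same (sorted) order.

    train, test: dict mapping gene name -> list of per-sample values.
    Values are not modified, so each study keeps its own scale.
    """
    common = sorted(set(train) & set(test))
    train_common = {gene: train[gene] for gene in common}
    test_common = {gene: test[gene] for gene in common}
    return common, train_common, test_common


# Hypothetical toy data: two training samples, one test sample.
train = {"TP53": [5.1, 4.8], "BRCA1": [2.0, 2.4], "MYC": [7.3, 6.9]}
test = {"MYC": [120.0], "TP53": [80.0], "EGFR": [15.0]}

common, train_c, test_c = restrict_to_common_genes(train, test)
print(common)  # ['MYC', 'TP53']
```

The shared, ordered gene list is what matters here: the model is trained on those columns only, and the test data are passed to prediction with the same columns in the same order.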

What if the data sets have different units of measurement, e.g. rnaseq counts vs fpkm vs tpm? Would it be valid to apply the PLSDA model across these data sets or do the train and test data need to be in the same unit of measurement?

You will need to find another data transformation to account for this difference in units of measurement. We have already thought about these questions, with YuGene (sorry, we don't maintain it anymore! CRAN - Package YuGene) and more recently with rank transformations, which I think would be better suited to your case.
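As a rough illustration of why a rank transformation helps with differing units, here is a generic within-sample rank transform in plain Python. This is only a sketch and not necessarily the exact transformation used in the papers mentioned; the function name and the toy values are assumptions. Each sample is ranked on its own, so no information is shared between studies.

```python
def rank_transform(values):
    """Replace each value in one sample by its average rank divided by n.

    Ties receive the average of the ranks they span (1 = lowest value).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to cover the whole group of tied values starting at i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks


# Hypothetical sample measured on two different scales (same gene order).
sample_scale_a = [0, 10, 500, 10]
sample_scale_b = [0.0, 1.2, 63.5, 1.2]

print(rank_transform(sample_scale_a))  # [0.25, 0.625, 1.0, 0.625]
print(rank_transform(sample_scale_a) == rank_transform(sample_scale_b))  # True
```

The two toy vectors are simply rescaled versions of each other, so their ranks match exactly. Real counts, FPKM and TPM also differ through gene length and library size, so ranks from real data would be similar rather than identical.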

What if my train and test data span across bulk-rnaseq, single.cell-rnaseq and microarray data? And how important is the pre-filtering of lowly expressed genes?

Have a look at Sincast for single-cell and bulk data. I think it is still quite a tricky problem to address. If you had only bulk data, also have a look at MINT in mixOmics. The rank transformation and the full pipeline in the two papers above would remove the lowly expressed genes. Prefiltering is extremely important when you are trying to combine or test on independent studies.
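For readers who want to see what a simple prefiltering step could look like, here is a minimal sketch in plain Python. The thresholds (`min_value`, `min_fraction`), the gene names and the helper itself are assumptions for illustration only; the pipelines in the papers handle filtering in their own way. The gene list is chosen from the training study alone and would then be applied unchanged to each test study.

```python
def keep_expressed_genes(expr, min_value=1.0, min_fraction=0.5):
    """Return genes expressed at >= min_value in at least min_fraction of samples.

    expr: dict mapping gene name -> list of per-sample values.
    Both thresholds are illustrative and should be tuned to the data.
    """
    kept = []
    for gene, values in expr.items():
        n_expressed = sum(v >= min_value for v in values)
        if n_expressed / len(values) >= min_fraction:
            kept.append(gene)
    return kept


# Hypothetical training study with three samples.
train = {
    "TP53": [5.1, 4.8, 6.0],
    "GENE_LOW": [0.0, 0.2, 1.5],
    "MYC": [7.3, 0.0, 6.9],
}

kept = keep_expressed_genes(train)
print(kept)  # ['TP53', 'MYC']
```

Deciding the filter on the training data only keeps the test studies from influencing which features the model sees, in line with treating the studies as separate.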

Kim-Anh

Thank you so very much for the advice and the super fast reply!