Some questions about the DIABLO case study with the Breast TCGA dataset

This case study achieves particularly good classification performance: the classification error rate is only 0.02539683. I think this may be related to the high correlations among the omics data.
For all three PLS models, the correlation between the first components of the datasets is very strong: the coefficients are 0.88, 0.83 and 0.93.
In our experience, the correlation between different omics datasets rarely reaches such a high level. Was any method used during data processing to select variables that are strongly correlated across datasets? The article does not say much about this. I would be very grateful if someone could help me with this.

Hi @ada

You might be interpreting the results incorrectly. A low classification error rate is a good result.

The correlations are between the PLS components, which summarise the information from both data sets. The criterion in block.splsda is to maximise the agreement (covariance) between these components, so it makes sense that their correlation is high. It is not the correlation between pairs of variables. Variables were filtered to keep the most variable ones, mostly so that we could store the data in Bioconductor, not to bias the results. You can read more about PLS methods etc. in our resources.
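
To illustrate the difference between component-level and variable-level correlation, here is a minimal base R sketch on simulated data. It is not the breast.TCGA analysis and does not call block.splsda: the latent signal `z`, the loadings and the block sizes are all assumptions made up for the example, and the first PLS component pair is approximated by the leading singular vectors of the cross-product matrix.

```r
set.seed(42)
n <- 100; p <- 50; q <- 40

# Assumed toy setup: both blocks share one latent signal z, buried in noise
z <- rnorm(n)
X <- outer(z, runif(p, 0.3, 0.7)) + matrix(rnorm(n * p), n, p)
Y <- outer(z, runif(q, 0.3, 0.7)) + matrix(rnorm(n * q), n, q)
X <- scale(X); Y <- scale(Y)

# First pair of PLS weight vectors: leading singular vectors of X'Y,
# i.e. the weights that maximise the covariance between the two components
s  <- svd(crossprod(X, Y), nu = 1, nv = 1)
t1 <- X %*% s$u   # first component of block X
u1 <- Y %*% s$v   # first component of block Y

comp_cor      <- abs(cor(t1, u1)[1, 1])   # correlation between components
mean_pair_cor <- mean(abs(cor(X, Y)))     # average |correlation| between variable pairs

cat("component correlation:        ", round(comp_cor, 2), "\n")
cat("mean pairwise |correlation|:  ", round(mean_pair_cor, 2), "\n")
```

In this kind of setup the components aggregate many weakly related variables, so their correlation can be much higher than the typical correlation between any two individual variables.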

Kim-Anh

Thank you for your reply, but I still have some questions. In the breast cancer data included in the mixOmics package, the mRNA block has 220 samples and 200 variables. However, when I apply the preprocessing described in "DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays", I obtain far more than 200 variables. How was the breast cancer data in mixOmics processed? Thanks.

Hi @ada

From memory, for this data example we randomly selected 220 out of the most variable mRNAs because of storage constraints in the package.

In your case, you should simply pre-filter the normalised data to keep the most variable features (between 500 and 5,000 for each omics), then run the analysis.
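
A minimal base R sketch of this kind of variance-based pre-filtering is shown below. The matrix `X`, the feature names and the cut-off `n_keep` are hypothetical placeholders; in practice `X` would be your own normalised data with samples in rows and features in columns, and the filter would be applied to each omics block separately.

```r
set.seed(123)

# Hypothetical normalised matrix: 50 samples (rows) x 2000 features (columns)
X <- matrix(rnorm(50 * 2000), nrow = 50,
            dimnames = list(NULL, paste0("gene_", 1:2000)))

n_keep <- 500  # assumed cut-off; the advice above suggests 500 to 5,000 per omics

# Variance of each feature across samples (unsupervised: no class labels used)
feature_var <- apply(X, 2, var)

# Indices of the n_keep most variable features
top <- order(feature_var, decreasing = TRUE)[seq_len(min(n_keep, ncol(X)))]

X_filtered <- X[, top, drop = FALSE]
dim(X_filtered)
```

Because the filter only looks at the variance and never at the outcome, it does not use the class labels to choose features.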

Kim-Anh

Thank you. But why does the mRNA block use normalized data, while the miRNA block uses raw data?
mRNA: illuminahiseq_rnaseqv2-RSEM_genes_normalized;
miRNA: illuminahiseq_mirnaseq-miR_gene_expression and illuminaga_mirnaseq-miR_gene_expression

Hi @ada

I am not sure what you are referring to, but in the example itself, all data are normalised, and this is how they should be before being input into mixOmics analyses.

library(mixOmics)
data('breast.TCGA')
# first 5 samples and 5 features of the miRNA and mRNA training blocks
breast.TCGA$data.train$mirna[1:5, 1:5]
breast.TCGA$data.train$mrna[1:5, 1:5]

Kim-Anh

Sorry, maybe I didn't describe it clearly. The raw data come from http://firebrowse.org/ . My question is why the mRNA block uses illuminahiseq_rnaseqv2-RSEM_genes_normalized (which is normalized data), while the miRNA block uses illuminahiseq_mirnaseq-miR_gene_expression and illuminaga_mirnaseq-miR_gene_expression (which is raw data).

I also want to know if the data was screened for differentially expressed genes using limma voom, or if it was only normalized using limma voom.

Thank you very much.

Hi @ada

No, the data were not pre-filtered based on DE genes. We do not recommend doing this, as it introduces overfitting into the analysis.
All data were normalised.
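
The overfitting concern can be illustrated with a toy base R simulation. This is only a sketch under stated assumptions, not the procedure used for breast.TCGA: the data are pure noise, a simple t-like statistic stands in for a DE analysis such as limma voom, and a nearest-centroid rule stands in for the classifier. The helpers `tstat`, `predict_nc` and `loo_error` are hypothetical names written for this example.

```r
set.seed(1)
n <- 40; p <- 1000
X <- matrix(rnorm(n * p), n, p)   # pure noise: no real class signal
y <- rep(c(0, 1), each = n / 2)

# Absolute t-like statistic per feature (stand-in for a DE analysis)
tstat <- function(X, y) {
  a <- X[y == 0, , drop = FALSE]
  b <- X[y == 1, , drop = FALSE]
  se <- sqrt(apply(a, 2, var) / nrow(a) + apply(b, 2, var) / nrow(b))
  abs(colMeans(a) - colMeans(b)) / se
}

# Nearest-centroid prediction for one held-out sample
predict_nc <- function(Xtr, ytr, xte) {
  c0 <- colMeans(Xtr[ytr == 0, , drop = FALSE])
  c1 <- colMeans(Xtr[ytr == 1, , drop = FALSE])
  as.numeric(sum((xte - c1)^2) < sum((xte - c0)^2))
}

# Leave-one-out error with the supervised filter applied either once on all
# samples (filter_inside = FALSE) or re-done inside each fold (TRUE)
loo_error <- function(X, y, k, filter_inside) {
  keep_all <- order(tstat(X, y), decreasing = TRUE)[1:k]
  wrong <- 0
  for (i in seq_len(nrow(X))) {
    keep <- if (filter_inside) {
      order(tstat(X[-i, ], y[-i]), decreasing = TRUE)[1:k]
    } else {
      keep_all
    }
    pred <- predict_nc(X[-i, keep, drop = FALSE], y[-i], X[i, keep])
    wrong <- wrong + (pred != y[i])
  }
  wrong / nrow(X)
}

err_biased <- loo_error(X, y, k = 20, filter_inside = FALSE)
err_honest <- loo_error(X, y, k = 20, filter_inside = TRUE)
cat("filter on all samples first:", err_biased, "\n")
cat("filter inside each fold:    ", err_honest, "\n")
```

Since the data contain no signal, an honest error estimate should be close to chance. Selecting features with the class labels before cross-validation lets the held-out samples influence the selection, which tends to make the estimated error look far better than it really is.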

Kim-Anh