Data processing for use with MINT

Dear All

I am very interested in mixOmics and want to use it in our research.
I have bulk RNA-seq data from cancer patients (TCGA) and bulk RNA-seq data from normal tissues (GTEx).
I want to integrate them to find events that occur only in cancer patients.
I think MINT is the best way to integrate them.
I’d like to integrate the TPM data downloaded from each site.

Now that I have learned more about MINT, I am worried that combining the TPM data in a single data sheet is not an appropriate approach.

In the paper listed under the P-integration methods, the authors obtained the raw data from the archives, then processed them in their local environment for the MINT analysis.

Meanwhile, another paper introducing DIABLO says that DIABLO does not assume particular data distributions, but that all datasets should be normalized appropriately according to each omics platform and pre-processed if necessary.

Thus, I have a question.

Can we use the TPM data downloaded from the archives with MINT without pre-processing them ourselves?
Or do we need to process the raw data ourselves?

If we need to process the raw data in our own environment, which is the proper way to process the data?

(A)
(1) combine the raw data from different studies in a single data sheet
(2) process the data in the sheet to obtain the normalized data for MINT analysis

(B)
(1) put the raw data from each study in its own data sheet
(2) process the data in each data sheet, then combine them in one sheet
(3) normalize the combined sheet to obtain the integrated dataset

I’m looking forward to your opinions.

Hi @MotoMoto and welcome
(thanks for your patience - I was on leave)

Normalisation is a tricky and lengthy process that is underappreciated in data analysis, and no one will have a definite answer to ‘what is the best normalisation method’, as it depends on what your data look like. Often we go through cycles of normalisation, exploration, full analysis and results interpretation to work out whether the normalisation was appropriate, so I advise you to do the same.

Regarding your questions

(A)
(1) combine the raw data from different studies in a single data sheet
(2) process the data in the sheet to obtain the normalized data for MINT analysis

(B)
(1) put the raw data from each study in its own data sheet
(2) process the data in each data sheet, then combine them in one sheet
(3) normalize the combined sheet to obtain the integrated dataset

What we did in MINT is (B): we assumed that each study is obtained independently and is thus normalised independently, which is a realistic scenario.
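
A minimal sketch of the "normalise each study independently, then combine" idea, written in Python purely for illustration (mixOmics itself is an R package). Everything here is an assumption made for the example: the toy counts, the helper names, and the use of log2 counts-per-million as a stand-in for whatever processing suits your data. The point is only that each study is processed using its own samples, the studies are then stacked on their shared genes, and a study label is kept for every sample so that a multi-study method can use it.

```python
import math

def log2_cpm(counts):
    """Per-sample normalisation (illustrative choice): log2(counts per million + 1)."""
    total = sum(counts.values())
    return {gene: math.log2(1e6 * c / total + 1.0) for gene, c in counts.items()}

def process_study(samples):
    """Normalise every sample of one study using that study's data only."""
    return {sample_id: log2_cpm(counts) for sample_id, counts in samples.items()}

def combine_studies(studies):
    """Process each study on its own, then stack them on the shared genes."""
    processed = {name: process_study(samples) for name, samples in studies.items()}

    # Keep only genes measured in every sample of every study.
    shared = None
    for samples in processed.values():
        for values in samples.values():
            genes_here = set(values)
            shared = genes_here if shared is None else shared & genes_here
    genes = sorted(shared or [])

    # One row per sample, plus a parallel vector recording the study of origin.
    X, study, sample_ids = [], [], []
    for name, samples in processed.items():
        for sample_id, values in samples.items():
            X.append([values[g] for g in genes])
            study.append(name)
            sample_ids.append(sample_id)
    return genes, sample_ids, study, X

# Hypothetical toy data: two studies with partly different gene sets.
studies = {
    "study_1": {
        "s1": {"geneA": 10, "geneB": 90, "geneC": 0},
        "s2": {"geneA": 30, "geneB": 60, "geneC": 10},
    },
    "study_2": {
        "s3": {"geneA": 5, "geneB": 5, "geneD": 90},
    },
}

genes, sample_ids, study, X = combine_studies(studies)
print(genes)   # shared genes only
print(study)   # study membership of each row of X
```

In mixOmics, the combined matrix and the study vector would then be passed to a MINT function, which takes a `study` argument alongside the data.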

For (A), it depends on the normalisation method. Some methods are sample-based, so it does not matter whether you combine the studies or not. Others are gene-based across all samples, and here you will introduce a bias: you are already asking the datasets to look similar, so the performance evaluation of MINT will be optimistic (as we use ‘leave one dataset out’ for evaluation).
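
The difference between the two kinds of method can be illustrated with a small Python sketch. The data and the two helper functions are hypothetical and chosen only for the example: scaling by the sample total stands in for a sample-based method, and a per-gene z-score stands in for a gene-based method. The sample-based result is identical whether the studies are processed separately or pooled, while the gene-based result changes once the studies are pooled, because each gene's mean and standard deviation are then computed from all studies, including any study later held out for evaluation.

```python
import statistics

def sample_scale(sample):
    """Sample-based: divide each value by the sample's own total."""
    total = sum(sample)
    return [v / total for v in sample]

def gene_zscore(matrix):
    """Gene-based: centre and scale each gene (column) across the given samples."""
    columns = list(zip(*matrix))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in matrix]

# Hypothetical toy data: rows are samples, columns are two genes.
study_1 = [[10.0, 90.0], [20.0, 80.0]]
study_2 = [[60.0, 40.0], [80.0, 20.0]]
pooled = study_1 + study_2

# Sample-based: processing per study or on the pooled sheet gives the same values.
sample_separate = [sample_scale(s) for s in study_1] + [sample_scale(s) for s in study_2]
sample_pooled = [sample_scale(s) for s in pooled]
print(sample_separate == sample_pooled)

# Gene-based: pooling changes the values, since each study now influences the other.
gene_separate = gene_zscore(study_1) + gene_zscore(study_2)
gene_pooled = gene_zscore(pooled)
print(gene_separate == gene_pooled)
```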

Kim-Anh

Dear Kimono.Lecao and mixOmics

Thank you for your reply.
I understand it well now.
I’m astonished that “no one will have a definite answer to ‘what is the best normalisation method’”.

I will use our data in our analysis as instructed.