Need help with pre processing data (normalization)

Dear community,

I have been reading the mixOmics book (I have reached chapter 8) to prepare for analysing a dataset of mine. I still have some questions about sample processing that I could not resolve, so I am asking for your help.

I understand that, in order to run the mixOmics functions, my data must be normalized, centered and scaled (from what I have read, these are three different processes).

I have two sets of data, obtained from an experiment on rats that we have done in the lab:

  • A table of Microbiota abundance at the Family level (coming from 16S sequencing).

Here, we were given a very clear pre-processing protocol:

http://mixomics.org/mixmc/mixmc-preprocessing/

First question. Is the transformation of the 16S data to CLR just a step prior to normalization, centering and scaling? Or does this protocol leave the data ready to use?

  • Metabolomics tables from the Metabolon, Inc. service.

Metabolon provided me with different versions of the metabolomics data: Peak Area Data, Batch-normalized Data, Batch-norm Imputed Data, Log Transformed Data.

Originally, I was planning to use one of the last two tables, and scale and center it. However, I was presented with an issue that I don’t know how to resolve.

Not all of the rats have both microbiota and metabolomics data. That is, there are a few samples that I need to remove. If I remove a few subjects from the Metabolon data, should I perform the normalization again (without the rats I remove)?

In that case, I was looking in several sites, and I did not find a clear normalization protocol for beginners that I can apply. Does anyone have a protocol in R?

Also, I would like to know whether normalization alone is enough. Can the scaling and centering be done automatically when you run PLS, or should I do it before I run the script?

Thank you very much!

Is the transformation of the 16S data to CLR just a step prior to normalization, centering and scaling? Or does this protocol leave the data ready to use?

Your first assumption is correct. While the methods in mixOmics can handle skewed and/or non-mesokurtic features, such features will likely degrade the performance of the model. Hence, if your exploration of the post-CLR data suggests that it needs centering and/or scaling, apply those steps.
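To make the distinction concrete, here is a minimal base-R sketch of what the CLR step does and how you might inspect the result afterwards. The count table, the offset of 1 and the helper name clr_transform are made up for illustration; mixOmics ships its own logratio.transfo() for this, and the linked MixMC protocol should take precedence over this sketch.

```r
# Toy count table: 4 samples (rows) x 3 families (columns). Values are made up.
counts <- matrix(c(10,  0, 90,
                   20, 30, 50,
                    5, 15, 80,
                   40, 40, 20),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("rat", 1:4), c("famA", "famB", "famC")))

# Hypothetical helper: centered log-ratio computed per sample.
# An offset is added first because log(0) is undefined.
clr_transform <- function(x, offset = 1) {
  logx <- log(x + offset)
  logx - rowMeans(logx)   # subtract each sample's mean log abundance
}

clr <- clr_transform(counts)

rowSums(clr)        # each sample sums to (numerically) zero after CLR
colMeans(clr)       # feature means are generally not zero yet
apply(clr, 2, sd)   # feature spreads can still differ

# Centering and scaling per feature is a separate, later step:
clr_scaled <- scale(clr, center = TRUE, scale = TRUE)
```

CLR works across the features within each sample, whereas centering and scaling work across the samples within each feature, which is why one does not replace the other.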

should I perform the normalization again (without the rats I eliminate)?

Yes. Any transformation should be applied to exactly the samples being used, no more and no less.
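A rough sketch of that order of operations in base R, assuming both tables are matrices with samples in rows and sample IDs as row names. The object names, the IDs and the log-plus-scale step are placeholders, not the Metabolon normalisation itself.

```r
set.seed(42)

# Made-up tables; real ones would carry the rat identifiers as row names.
microbiota <- matrix(rnorm(5 * 3), nrow = 5,
                     dimnames = list(c("rat1", "rat2", "rat3", "rat4", "rat5"), NULL))
metabolome <- matrix(rexp(4 * 2), nrow = 4,
                     dimnames = list(c("rat2", "rat3", "rat5", "rat6"), NULL))

# 1. Keep only the samples present in both tables, in the same order.
shared <- intersect(rownames(microbiota), rownames(metabolome))
X <- microbiota[shared, , drop = FALSE]
Y <- metabolome[shared, , drop = FALSE]

# 2. Only now apply transformations whose result depends on the sample set.
#    (log + scale is a stand-in for whichever method you end up choosing.)
Y_log    <- log(Y)
Y_scaled <- scale(Y_log, center = TRUE, scale = TRUE)
```

Subsetting first matters because per-feature means and standard deviations are computed from whichever samples are present at that point.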

Does anyone have a protocol in R [for normalisation]?

Note, normalisation in the general sense means adjusting the scale of features to make them possess a notionally comparable scale. There are various forms of normalisation (from simple to complex). You’ll need to determine the best for your specific application. This paper may assist.

To answer your question: in terms of packages, there are many! Which one to use depends heavily on the type of normalisation you plan to apply, as there is quite a variety. Once you have determined which method(s) are best, a brief web search is likely to turn up existing R packages, so you don't need to reinvent the wheel.

Can the scaling and centering be done automatically when you run PLS?

Yes, all PLS-based functions in mixOmics automatically center the data and have a scale parameter. By default, scale is set to TRUE.
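As an illustration only, the sketch below shows in base R what centering and scaling do to each feature, followed by an optional pls() call that runs only if mixOmics is installed. The simulated X and Y and the choice of ncomp = 2 are arbitrary assumptions.

```r
set.seed(1)

# Simulated data: 20 samples, 5 predictors and 2 responses (made up).
X <- matrix(rnorm(20 * 5, mean = 10, sd = 3), nrow = 20)
Y <- matrix(rnorm(20 * 2), nrow = 20)

# What centering and scaling do, done by hand in base R:
X_manual <- scale(X, center = TRUE, scale = TRUE)
round(colMeans(X_manual), 10)   # every feature now has mean 0
apply(X_manual, 2, sd)          # ... and standard deviation 1

# With mixOmics installed, the raw X and Y can be passed directly and
# the scaling requested through the scale argument.
if (requireNamespace("mixOmics", quietly = TRUE)) {
  fit <- mixOmics::pls(X, Y, ncomp = 2, scale = TRUE)
}
```

So there is no need to call scale() yourself before pls() unless you want to inspect the scaled data.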

Max,

First, I wanted to thank you for taking the time to respond to me.
I just have one question left. In your reply you recommended that I choose the normalization method that best suits my dataset. However, I have also read that the same type of normalization should be used for all datasets. That is, in my case, where I have metabolome and microbiota datasets, I should use the same normalization method for both. Is that true?

Then, the protocol would look like this:

  • Transformation of microbiota data into CLR.

  • Normalization by fold change method for the microbiota CLR data and the peak area metabolome data.

  • Centering and scaling of the data (I can directly do it with the PLS function).

Does this sound logical?

Thank you very much!