Valid data types?

Dear forum members,

I am new to mixOmics and I am wondering whether it is good practice to use continuous data whenever possible or, on the contrary, whether it is better to keep categorical/binary data as it is.

I mean, I have some data that would normally be coded as categorical (or binary) values but that can also be coded as continuous values. This is the case for genomic variants, which can be coded as 1 or 0 depending on whether the gene of interest is mutated or not, but which can also be modeled as allele frequencies ranging between 0 and 1.

Another example is copy number variations (CNV), which can take the values 0 (absence), 1 (one copy), 2 (diploid, so the normal state in humans), 3 (one extra copy), 4 (two extra copies)…and so on. These values can be transformed into log2 ratios, making them continuous too.

So here’s my question: what’s more desirable? Is there any specific type of data modality that we should avoid?

Thank you!

Hi @David,

Welcome to the mixOmics community and thank you for your query.

In short, our methods assume that the data are continuous, even if you input binary variables. The one exception is the Y variable in discriminant analyses (PLS-DA), which is assumed to be categorical. So yes, it’s best to keep the continuous nature of the data, except in that case. More generally, you can also incorporate categorical variables in PLS models, but if a variable has more than 2 categories, it should be ordinal. For instance, it is not possible to incorporate tissue type as a continuous variable (we typically can’t assume lung > liver > pancreas), whereas the CNV example you mentioned is an ordinal categorical variable.
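
To make the ordinal vs. non-ordinal distinction concrete, here is a minimal sketch in plain Python (mixOmics itself is an R package, so this only illustrates the encoding idea and is not mixOmics code). The function name `cnv_to_log2_ratio` and the pseudocount of 0.5 are assumptions made for illustration; the pseudocount is only there so that a copy number of 0 stays finite.

```python
import math

def cnv_to_log2_ratio(copies, ploidy=2, pseudocount=0.5):
    """Hypothetical helper: log2 ratio of a copy number relative to normal ploidy.

    The pseudocount is an assumption of this sketch; without it,
    0 copies would give log2(0), which is undefined.
    """
    return math.log2((copies + pseudocount) / (ploidy + pseudocount))

# CNV is ordinal: 0 < 1 < 2 < 3 < 4 is a meaningful order,
# so a monotone continuous transform preserves the information.
copy_numbers = [0, 1, 2, 3, 4]
log2_ratios = [cnv_to_log2_ratio(c) for c in copy_numbers]

# Tissue type is not ordinal: any integer coding depends on an arbitrary
# ordering, so two equally "valid" codings disagree with each other.
tissues = ["lung", "liver", "pancreas"]
codes_as_listed = {t: i for i, t in enumerate(tissues)}
codes_alphabetical = {t: i for i, t in enumerate(sorted(tissues))}

print(log2_ratios)
print(codes_as_listed)
print(codes_alphabetical)
```

With this coding the diploid state maps to a log2 ratio of 0, losses are negative and gains are positive. The two tissue codings differ, which is why such a variable should stay categorical (for example as the Y in PLS-DA) rather than be treated as continuous.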

Hope it helps.

Al
