Comparing PCA/mixOmics tools with other methods

[user via email]

We have developed an omic integration tool to find clusters in extensive data sets, as a way to complement current methods that work very well in smaller sample sizes. We wish to describe similar tools available to highlight the best context to use the different existing statistical tools. We also want to determine if our tool has adequate performance by evaluating it in comparison with mixOmics.

However, to be fair and to acknowledge the capabilities of current methods, we want to ensure we are using the appropriate version and the combination of arguments (parameters) that gives your package the highest possible performance.

Below are the version and arguments we have used so far.

The idea here is to run sparse PCA on a matrix x representing concatenated omic blocks, assuming 479 “signal” features.

out_spca <- spca(X = x, ncomp = 2, keepX = c(479,479))

mixOmics_6.13.11

We would appreciate it if you could give us feedback on the above.

Thank you very much,

Agustin

Hi Agustin,

Here:

out_spca <- spca(X = x, ncomp = 2, keepX = c(479,479))

you are assuming that

  • the correlation / variance structure of the data can be summarised in 2 dimensions, i.e. there are 2 distinct sources of variation to extract, one on each dimension / component
  • the 479 variables selected on component 1 should be non-overlapping (according to the PC definition) with the other 479 variables selected on component 2.
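A minimal sketch of how the second assumption could be checked in practice: fit the sparse PCA, list the variables selected on each component with selectVar(), and count how many appear on both. The toy data here (50 samples, 200 features, 20 of them carrying signal) are purely hypothetical stand-ins for x, and keepX is scaled down to match.

```r
library(mixOmics)

set.seed(1)

# Hypothetical toy data standing in for x: 50 samples x 200 features,
# where only the first 20 features carry a shared signal.
n <- 50
p <- 200
n_signal <- 20
x <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("feat_", seq_len(p))))
latent <- rnorm(n)
# The vector is recycled down each column, so sample i gets 2 * latent[i].
x[, seq_len(n_signal)] <- x[, seq_len(n_signal)] + 2 * latent

# Same call shape as in the question, with keepX scaled to the toy data.
out_spca <- spca(X = x, ncomp = 2, keepX = c(n_signal, n_signal))

# Names of the variables with non-zero loadings on each component.
sel1 <- selectVar(out_spca, comp = 1)$name
sel2 <- selectVar(out_spca, comp = 2)$name

# Variables selected on both components.
shared <- intersect(sel1, sel2)
length(shared)
```

If length(shared) is far from zero on the real data, that is a hint that the two selections are not capturing two separate sets of features.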

You did not mention how many variables you have in x. In many of our papers where we benchmark the approaches, we proceed as follows:

  • define the variable selection size according to prior knowledge, but sometimes only focusing on dimension/component 1
  • define the variable selection size according to a tuning process (relevant for sPLS-DA or block.splsda() where we have such functions available, see case studies SRBCT, or mixDIABLO examples on our website).
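As a rough illustration of the tuning option, the sketch below runs tune.splsda() over a small grid of candidate keepX values and refits splsda() with the selected sizes. The data, the class labels Y, the grid in test.keepX and the cross-validation settings are all hypothetical choices made for the example; the case studies on the website remain the reference for a full workflow.

```r
library(mixOmics)

set.seed(2)

# Hypothetical toy data: 60 samples x 100 features, two classes,
# with the first 10 features shifted in class "B".
n <- 60
p <- 100
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(paste0("s", seq_len(n)),
                            paste0("feat_", seq_len(p))))
Y <- factor(rep(c("A", "B"), each = n / 2))
X[Y == "B", 1:10] <- X[Y == "B", 1:10] + 1.5

# Cross-validated search over candidate selection sizes per component.
tune_res <- tune.splsda(X, Y, ncomp = 2,
                        test.keepX = c(5, 10, 20, 50),
                        validation = "Mfold", folds = 5, nrepeat = 3,
                        dist = "max.dist", measure = "BER",
                        progressBar = FALSE)

# Selected number of variables on each component.
keepX_tuned <- tune_res$choice.keepX
keepX_tuned

# Final model refitted with the tuned selection sizes.
fit <- splsda(X, Y, ncomp = 2, keepX = keepX_tuned)
```

In a benchmark, more repeats and a finer grid would usually be preferable; the small values here only keep the example quick to run.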

Kim-Anh

Thank you very much for your prompt reply! Indeed, we are assuming the matrix x has a simple structure that can be summarized with two PCs. The original dimensions of x were 500 and 3000 (it is a simple simulation where only 479 features contribute to the variability across subjects). After considering your answer, we will adapt the benchmark to use sPLS-DA and include the tuning process.
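For readers who want to reproduce a comparable setup, here is one way such a simulation could look in base R. It is only a sketch under stated assumptions, not the simulation actually used: it assumes 500 subjects in rows and 3000 features in columns, Gaussian noise, and two latent sources of variation split across the 479 signal features. The effect size of 1.5 and the 240/239 split are arbitrary.

```r
set.seed(42)

# Assumed layout: 500 subjects (rows) x 3000 features (columns).
n_samples <- 500
n_features <- 3000
n_signal <- 479

# Start from pure Gaussian noise.
x <- matrix(rnorm(n_samples * n_features),
            nrow = n_samples, ncol = n_features,
            dimnames = list(paste0("subj_", seq_len(n_samples)),
                            paste0("feat_", seq_len(n_features))))

# Two latent sources of variation across subjects (assumption).
scores <- matrix(rnorm(n_samples * 2), ncol = 2)

# Each signal feature loads on exactly one latent source:
# the first 240 on source 1, the remaining 239 on source 2.
weights <- matrix(0, nrow = 2, ncol = n_signal)
first_block <- seq_len(240)
weights[1, first_block] <- 1.5
weights[2, -first_block] <- 1.5

# Add the low-rank signal to the first 479 features only.
signal_idx <- seq_len(n_signal)
x[, signal_idx] <- x[, signal_idx] + scores %*% weights

# Sanity check: signal features should be more variable than noise features.
feature_var <- apply(x, 2, var)
mean(feature_var[signal_idx])
mean(feature_var[-signal_idx])
```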

Best wishes
