Hello. Thank you for the great package. I would like to ask some questions about choosing appropriate approaches and applying models to my omics data.
For context, I have two omics datasets: DNAme microarray data (Illumina EPICv2; 10+ samples, ~890,000 probes) and mass-spec proteomics data (20+ samples, ~7,000 genes). Nine samples are shared between them, derived from the same biological samples. The cells were damaged and stimulated, and the goal is to see how the DNAme and proteomics profiles differ by these variables, and to show that the two profiles are associated with each other.
So I decided to 1) use PLS-DA to discriminate samples by their outcome category (PCA didn't show clear clusters, and we only have a class variable for both datasets), and 2) combine the datasets with N-integration (DIABLO) to analyze them together.
- For PLS-DA, I used the already-normalized DNAme and proteomics data (n = 10+) and ran the code below, following the mixOmics vignette. I first tried running only the samples shared between the DNAme and proteomics datasets, but that didn't work well, so I ran each single-omics dataset on its own to get the overall direction of the data.
plsda_dname <- plsda(t(betas_norm), class, ncomp = 10) # Since the total
# set.seed(30)
perf.plsda_dname <- perf(plsda_dname, validation = 'Mfold', folds = 2,
                         progressBar = FALSE, nrepeat = 10)
plot(perf.plsda_dname, sd = TRUE, legend.position = 'horizontal')
Both validation = 'Mfold' and validation = 'loo' fail on these data with this error message:
Error in solve.default(Sr): system is computationally singular: reciprocal condition number = 1.30614e-17
This might indicate that my data matrix is effectively singular:
- a singular (square) matrix is a matrix that is not invertible;
- a matrix is not invertible if its determinant equals zero;
- and the DNAme data likely trigger this because there are far more probes than samples.
In this case, what can I try in order to apply PLS-DA to my single-omics data? (I haven't shown the proteomics data here, but I think a similar approach could be tested.)
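For reference, one workaround I am considering for the singularity error is to pre-filter the probes before fitting, since there are far more probes than samples. Below is a minimal sketch, using a small simulated stand-in for betas_norm (the real matrix would replace it) and assuming probes are rows, as the t() call above implies; the probe count, component count, and class labels are placeholders:

```r
library(mixOmics)

set.seed(1)
# Toy stand-in for betas_norm (probes x samples); replace with the real matrix
betas_norm <- matrix(runif(2000 * 12), nrow = 2000,
                     dimnames = list(paste0("cg", 1:2000), paste0("S", 1:12)))
class <- factor(rep(c("damaged", "stimulated"), each = 6))

# Keep only the most variable probes so p is closer to n
probe_var <- apply(betas_norm, 1, var)
keep <- order(probe_var, decreasing = TRUE)[1:500]
betas_filt <- betas_norm[keep, ]

# Use far fewer components than samples to avoid a singular fit
plsda_filt <- plsda(t(betas_filt), class, ncomp = 2)
perf_filt  <- perf(plsda_filt, validation = 'Mfold', folds = 2, nrepeat = 10)
```

Would this kind of variance-based pre-filter be a reasonable first step, or is there a recommended filter for EPICv2 data?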
- For N-integration, I ran the code below using DIABLO.
X <- list(dname = betas_norm_6000k, # 6,000 probes randomly sampled to reduce size
          protein = ms_norm)
Y <- class
summary(Y)

# Selecting the design matrix
design <- matrix(0.1, ncol = length(X), nrow = length(X),
                 dimnames = list(names(X), names(X)))
diag(design) <- 0
design
# Deciding the number of components (ncomp)
diablo.tcga <- block.plsda(X, Y, ncomp = 5, design = design, near.zero.var = TRUE)
perf.diablo.tcga <- perf(diablo.tcga, validation = 'loo', folds = 2, nrepeat = 10)
plot(perf.diablo.tcga)
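As an aside, if I read the mixOmics DIABLO vignette correctly, the perf() output also carries a suggested number of components, which I was planning to use once the model runs cleanly; a sketch, with the field and dimension names assumed from that vignette:

```r
# Suggested ncomp per error-rate type and distance metric
perf.diablo.tcga$choice.ncomp$WeightedVote

# e.g. take the value for the balanced error rate with the centroids distance
ncomp <- perf.diablo.tcga$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]
```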
The resulting plot seems to show that the model performs poorly, and validation = 'Mfold' fails with the same kind of error (system is computationally singular: reciprocal condition number = 5.55298e-31).
For this matter, I would like to ask:
(1) Based on previous comments about data that are too large or have too many variables, it seems I need to filter the DNAme data in a more structured way. The recommended approach is to filter by variance, but the variance distribution looks like an asymmetric bimodal shape.
Given this, how should I filter the probes and optimize the model? Should I keep the highest-variance probes or filter probes at random?
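To make the question concrete, here is the kind of filter I had in mind: inspect the variance distribution first and cut at a quantile rather than a fixed count. The 0.95 quantile below is an arbitrary placeholder, and the M-value step is just the standard log2(beta / (1 - beta)) transform sometimes used for methylation, since beta-value variance is compressed near 0 and 1:

```r
# Per-probe variance on the beta-values (probes x samples)
probe_var <- apply(betas_norm, 1, var)
hist(probe_var, breaks = 100, main = "Per-probe variance")  # shows the bimodal shape

cutoff <- quantile(probe_var, 0.95)             # keep the top 5% most variable probes
betas_filt <- betas_norm[probe_var > cutoff, ]

# Alternative: compute the variance on M-values instead of beta-values
mvals <- log2(betas_norm / (1 - betas_norm))
probe_var_m <- apply(mvals, 1, var)
```

Would selecting by a variance quantile like this be preferable to random sampling, given the bimodal distribution?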
(2) The downstream analysis of these results looks a bit odd, specifically the plotDiablo output:
tune.diablo <- tune.block.splsda(X, Y, ncomp = 2,
                                 test.keepX = test.keepX, design = design,
                                 validation = 'Mfold', folds = 2, nrepeat = 1,
                                 # Mfold should be loo, but Mfold also works here
                                 BPPARAM = BiocParallel::SnowParam(workers = 2),
                                 dist = "centroids.dist")
The plot shows two clearly separated clusters running from the top right to the bottom left, which seems to indicate that the datasets are not associated, but I wonder whether the model is simply inappropriate for these data.
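For completeness, this is the downstream code I was running after the tuning, following the DIABLO workflow (the tune.diablo$choice.keepX field is assumed from the vignette). My understanding is that plotDiablo shows the correlation between the per-block components, which is the more direct check of whether the two datasets are associated, while plotIndiv shows class separation per block:

```r
# Final sparse model using the tuned number of variables per block
list.keepX <- tune.diablo$choice.keepX
final.diablo <- block.splsda(X, Y, ncomp = 2, keepX = list.keepX, design = design)

# Correlation between the dname and protein components on component 1
plotDiablo(final.diablo, ncomp = 1)

# Sample plots per block, colored by class
plotIndiv(final.diablo, ind.names = FALSE, legend = TRUE)
```

Am I interpreting the plotDiablo clusters correctly, or does the separation reflect the class structure rather than a lack of association between the blocks?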
Please let me know if you need any other information, and thank you!
