N-integration in single-cell data

I am working with single-cell multi-omics data (RNA-seq, EM-seq, and ATAC-seq) from a single sample (either normal or disease group). My goal is to analyze the interactions among these three omics datasets at either the single-cell level or the cluster level. I have a few specific questions:

1) N-Integration Method Applicability:
Can the N-Integration method be used to analyze interactions within a single state (e.g., only normal group or only disease group)?

2) Data Structure for Integration:
My dataset consists of 6,000 cells grouped into 12 cell types. I am unsure whether the input matrix should have 6,000 rows (single-cell level) or 12 rows (cluster-level averages).

If using 12 rows, there are no replicates, causing perf() to report an error.

If using 6,000 rows, the matrix is highly sparse, and correlation coefficients (calculated via cor()) appear unreliable (either too high or too low).
What would be the recommended approach?

3) Parameter Settings and Method Selection:

Based on the second article, how should the folds parameter in perf() be set for cluster-level analysis? For analyzing interactions at the cluster level, which method is more appropriate: sPLS or sPLS-DA? My understanding is that it is the former, but I would like confirmation.

I would greatly appreciate your guidance on these questions. Thank you for your time and assistance!

Hi @mixOmics_user,

  1. Yes, you can run N-integration across your modalities (RNA-seq, EM-seq and ATAC-seq) within a single group using multiblock (s)PLS. If you would also like to see which variables across the three modalities distinguish your disease group from your normal group, you can use multiblock (s)PLS-DA (also called DIABLO).

  2. I agree that cluster-level analysis of single-cell data is not ideal, as you greatly reduce the number of samples and don’t have any replicates for each cell type. As demonstrated in this case study, mixOmics multiblock sPLS models can be successfully built on single-cell data across different modalities. Single-cell gene expression data should not be too sparse, but the methylation and ATAC data could be, depending on how they have been pre-processed. In the linked case study, methylation and accessibility data were summarised across regions of the genome (gene bodies and promoters), which can help with sparsity; a similar approach might be applicable to your data. Another way to overcome sparsity is to group cells into metacells, which are much smaller than clusters and should therefore give you enough replicates for robust analysis. There are various tools to generate metacells; SEACells is one example.

  3. The number of folds you set for cross-validation depends on your number of samples. We generally recommend making sure you have 5-6 samples per fold; see this page for more details.
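
As a rough illustration of the 5-6 samples per fold guideline, the arithmetic can be sketched in base R. The helper name `suggest_folds`, its default of 5 samples per fold, and the cap of 10 folds are assumptions made for this example; none of them are part of mixOmics.

```r
# Hypothetical helper (not a mixOmics function): the largest number of folds
# that still leaves at least `min_per_fold` samples in every fold.
suggest_folds <- function(n_samples, min_per_fold = 5, max_folds = 10) {
  folds <- floor(n_samples / min_per_fold)
  # Cross-validation needs at least 2 folds
  if (folds < 2) {
    stop("Too few samples for M-fold cross-validation at this fold size.")
  }
  # The cap is an arbitrary choice for this sketch
  min(folds, max_folds)
}

suggest_folds(12)    # 12 cluster-level rows: 2 folds
suggest_folds(6000)  # 6,000 cells: capped at 10 folds
```

The resulting value is what you would pass as the folds argument of perf(). For a discriminant (DA) model, it is probably safer to apply the same rule to the smallest class rather than to the total number of samples.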

Hope that helps!
Eva

I wanted to keep some low-variance features for downstream interaction analysis, but running the code gives an error saying “There are features with zero variance in block ‘meta’.” What should I do? The only suggestion I can find is to adjust the threshold used by nearZeroVar, but the block.splsda function only offers TRUE and FALSE for this option. Thank you.

Hi @missjing,

In this case I would recommend removing these features for model building with mixOmics and then adding them back in just before running your downstream interaction analysis.
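
A minimal base R sketch of that remove-then-restore step is shown here, assuming the error is triggered only by features whose variance is exactly zero. The toy `meta` matrix and the names `meta_model`, `meta_aside` and `meta_full` are made up for illustration.

```r
# Toy block standing in for one modality (rows = cells, columns = features).
# The values are invented for this example.
set.seed(1)
meta <- cbind(f1 = rnorm(8), f2 = rep(0, 8), f3 = rnorm(8), f4 = rep(2, 8))

# Flag features whose variance is exactly zero
zero_var <- apply(meta, 2, var) == 0

# Features used for model building (low but non-zero variance is retained)
meta_model <- meta[, !zero_var, drop = FALSE]
# Features set aside until the downstream interaction analysis
meta_aside <- meta[, zero_var, drop = FALSE]

# Later: recombine and restore the original column order
meta_full <- cbind(meta_model, meta_aside)[, colnames(meta), drop = FALSE]
```

Here `meta_model` is the matrix that would take the place of the 'meta' block in the X list, while `meta_full` is what the downstream analysis would use.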

Hope that helps!
Eva

Thank you for your reply. In the step "model = block.splsda(X = X, Y = Y, ncomp = 3, keepX = list.keepX, design = design)", does X use all the features?