Tuning spls keeps running into error

Hello,

I’ve been trying to run spls on my gene expression and microbial abundance data. No matter how I try to tune keepX, I run into errors. My code is below:

dim(meta); dim(filtered_gene)
# 136 148
# 136 4953

ncomptry=10
MyResult.spls <- spls(meta, filtered_gene, ncomp=ncomptry, near.zero.var = FALSE) # this finishes no problem

# Below code snippet runs fine, no problem here. Determined that ncomp=5 should be kept.
set.seed(22) 
perf.pls <- perf(MyResult.spls, validation="Mfold", folds=5, progressBar=TRUE, nrepeat=50)
plot(perf.pls, criterion  = 'Q2.total')
X=seq_len(ncomptry)
Y=perf.pls$measures$Q2.total$summary$sd
par(mar=c(5,5,5,5))
plot.new()
plot(X,Y)
abline(h = 0.0975) # keep 5 components to test

# Tuning
list.keepX <- c(25, 50, 100, 500, 1000, 2500, 3000)
set.seed(22)
tune.spls.cor <- tune.spls(meta, filtered_gene, ncomp = 5,
                           test.keepX = list.keepX,
                           validation = "Mfold", folds = 5,
                           nrepeat = 50, progressBar = TRUE,
                           measure = NULL) 

# below is the output I always obtain
tuning component: 1
[=======                                           ] 14%
Error in X.test %*% a.cv : non-conformable arguments

It always crashes at 14%. Changing the set.seed value makes no difference; it fails no matter what I do. The only time this ever worked was with a very small subset of my data (31 taxa instead of 148), but I have several datasets of microbial abundances that I want to run spls on.

I realize my microbial abundance data has a lot of zeros and that this is somehow affecting it: when I first ran spls, it would fail unless I removed any taxa with fewer than 10 counts across the samples. But zeros are a normal value in abundance data, so I don’t want to keep arbitrarily removing taxa with low counts. That is meaningful data that I want to compare with host gene expression.

How can I overcome this?

My first comment is that your list.keepX has values above 148, which is the number of features in your X dataframe (meta). Not sure why you are trying to select more features than exist in that dataframe.
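
As a minimal sketch of one way to guard against this, you could drop any candidate keepX that is larger than the number of columns in X before tuning. The value 148 is taken from your dim(meta) output; n_features and list.keepX.valid are just illustrative names.

```r
# Number of columns in X (meta), from the dim(meta) output: 136 x 148
n_features <- 148

list.keepX <- c(25, 50, 100, 500, 1000, 2500, 3000)

# Keep only the candidates that do not exceed the number of available features
list.keepX.valid <- list.keepX[list.keepX <= n_features]
list.keepX.valid
```

In a real session you would use ncol(meta) rather than a hard-coded number.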

You have 7 different values in list.keepX, and the method fails at 14%. Once the keepX == 25 iteration completes, the progress is about 14.3% (= 1/7). Hence, this likely indicates that it’s failing when the method moves on to keepX == 50. This is in line with your attempt succeeding when you use 31 taxa. Hence, the method doesn’t like the input data frame once it has somewhere between 31 and 50 features.
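
To spell out that arithmetic, here is a small sketch. It assumes the progress bar advances by an equal step after each keepX value is finished, which is my reading of the output rather than something I have checked in the source.

```r
list.keepX <- c(25, 50, 100, 500, 1000, 2500, 3000)

# Assumption: the bar moves forward by 1/length(list.keepX) after each value
progress_after <- round(100 * seq_along(list.keepX) / length(list.keepX), 1)
names(progress_after) <- list.keepX

# Percent complete once each keepX value has been tested
progress_after
```

Under that assumption, a bar stuck at 14% means the first value (25) finished and the second (50) did not.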

By the looks of it, you haven’t applied any of the necessary preprocessing for your microbial data. Have a read of this page on the website for some more information on what is required before any analysis can take place.

With all this info, my best guess is that the lack of preprocessing is causing this issue: a dataframe with 31-50 features is far too sparse. Otherwise, there may be something specific to your data. Let me know how you go with implementing the required preprocessing.

My microbial data is already pre-processed, in that I apply a filtering step that removes all contigs below 0.1% relative contig abundance. Adding an offset of +1 would skew my data a great deal: I am working with such low taxa counts that adding 1 would change the dataframe a lot, and many taxa have values around 1. I am concerned this would just “fake” my data in some way.

But I will give it a try with the offset since it seems to be applicable for microbial data - I am just working with metagenomic microbial data (so not 16S rRNA which would give very high counts of data).

I’ve tried preprocessing it with the CLR transformation, but once I run logratio.transfo, I cannot coerce the matrix back into a dataframe, which I need in order to split my data into test and training datasets along with my gene expression dataframe. Is there a reason CLR can’t be in a dataframe?

Applying an offset to all samples and taxa won’t skew the data at all as it is being universally applied. You don’t even need to use a value of 1, it can be any positive, non-zero value. You just need there to be no zeroes in order to apply the CLR.
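
To make the offset step concrete, here is a hand-rolled CLR in base R on a made-up count table. The counts, the names and the offset of 1 are all illustrative; this is a sketch of the calculation, not the mixOmics implementation.

```r
# Made-up counts (3 samples x 4 taxa), purely for illustration
counts <- matrix(c(0, 3, 1, 12,
                   2, 0, 0,  7,
                   1, 1, 4,  0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("sample", 1:3), paste0("taxon", 1:4)))

offset <- 1  # any positive value works; 1 is only an example

# CLR: log of each value minus the mean log of its sample (row)
log_counts <- log(counts + offset)
clr <- log_counts - rowMeans(log_counts)

# Each sample's CLR values sum to zero (up to floating point error)
rowSums(clr)
```

Without the offset, log(0) would give -Inf and the row means would no longer be usable, which is why the zeros have to go first.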

It can be a data.frame - you just need to change the class from “clr” to “matrix”:

library(mixOmics)
data(diverse.16S)
X <- diverse.16S$data.TSS

CLR <- logratio.transfo(X = X, logratio = 'CLR')  # returns an object of class "clr"
class(CLR) <- "matrix"                            # reset the class so it can be coerced
CLR <- as.data.frame(CLR)
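
Once the CLR output is a data.frame, splitting it alongside the gene expression data comes down to reusing the same row indices for both tables. Below is a sketch with simulated stand-ins; taxa_df, genes_df and the 80/20 split are hypothetical choices, not something from your data.

```r
set.seed(22)

# Simulated stand-ins for the CLR-transformed taxa and the gene expression data
n <- 20
taxa_df  <- as.data.frame(matrix(rnorm(n * 5), nrow = n,
                                 dimnames = list(paste0("s", 1:n), paste0("taxon", 1:5))))
genes_df <- as.data.frame(matrix(rnorm(n * 8), nrow = n,
                                 dimnames = list(paste0("s", 1:n), paste0("gene", 1:8))))

# Draw the training rows once and reuse the same indices for both tables
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

taxa_train  <- taxa_df[train_idx, ]
taxa_test   <- taxa_df[-train_idx, ]
genes_train <- genes_df[train_idx, ]
genes_test  <- genes_df[-train_idx, ]

# Samples stay matched between the two blocks
identical(rownames(taxa_train), rownames(genes_train))
```

This relies on both tables having their samples in the same row order before the split.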

Oh thanks, this fixed it for me. Cheers!