Tuning spls keeps running into error

kdb.chau · August 25, 2022, 1:50pm

Hello,

I’ve been trying to run spls on my gene expression and microbial abundance data. No matter how I tune keepX, I run into errors. My code is below:

dim(meta); dim(filtered_gene)
# 136 148
# 136 4953

ncomptry=10
MyResult.spls <- spls(meta, filtered_gene, ncomp=ncomptry, near.zero.var = FALSE) # this finishes no problem

# Below code snippet runs fine, no problem here. Determined that ncomp=5 should be kept.
set.seed(22) 
perf.pls <- perf(MyResult.spls, validation="Mfold", folds=5, progressBar=TRUE, nrepeat=50)
plot(perf.pls, criterion  = 'Q2.total')
X=seq_len(ncomptry)
Y=perf.pls$measures$Q2.total$summary$sd
par(mar=c(5,5,5,5))
plot.new()
plot(X,Y)
abline(h = 0.0975) # keep 5 components to test

# Tuning
list.keepX <- c(25, 50, 100, 500, 1000, 2500, 3000)
set.seed(22)
tune.spls.cor <- tune.spls(meta, filtered_gene, ncomp = 5,
                           test.keepX = list.keepX,
                           validation = "Mfold", folds = 5,
                           nrepeat = 50, progressBar = TRUE,
                           measure = NULL) 

# below is the output I always obtain
tuning component: 1
[=======                                           ] 14%Error in X.test %*% a.cv : non-conformable arguments

It always crashes at 14%. Even if I change the set.seed value, it fails no matter what I do. The only time this ever worked was when I used a very small subset of my data (31 taxa instead of 148), but I have different datasets of microbial abundances that I want to run spls for.

I realize my microbial abundance has a lot of zeros and somehow this is affecting it - because when I just ran spls in the first place, it would fail unless I removed any taxa with less than 10 counts across the samples. But the issue is - zeros are quite a normal value to be assessed in abundance data so I don’t want to just keep arbitrarily removing taxa with low counts because that is actually meaningful data that I want to compare with host gene expression.

How can I overcome this?

MaxBladen · August 28, 2022, 11:48pm

My first comment is that your list.keepX has values above 148, which is the number of features in your X dataframe (meta). Not sure why you are trying to select more features than exist in that dataframe.
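
As a rough illustration of that first point, the candidate grid can be capped at the number of columns in the X block before tuning. This is only a sketch: the matrix `X` below is a random stand-in with the same dimensions as meta, and `valid.keepX` is a hypothetical name, not something from the original post.

```r
# Hypothetical stand-in for the X block (meta): 136 samples x 148 taxa.
set.seed(1)
X <- matrix(rpois(136 * 148, lambda = 2), nrow = 136, ncol = 148)

list.keepX <- c(25, 50, 100, 500, 1000, 2500, 3000)

# Keep only candidates that do not ask for more features than X contains.
valid.keepX <- list.keepX[list.keepX <= ncol(X)]
print(valid.keepX)
```

The filtered vector could then be passed as test.keepX in place of the full list.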

You have 7 different values in list.keepX, and the method fails at 14%. Upon completion of the keepX == 25 iteration, the progress is 14.3% (= 1/7). Hence, this likely indicates that it’s failing when the method is using keepX == 50. This is in line with your attempts succeeding when you use 31 taxa. So the method doesn’t seem to like an input data frame with somewhere between 31 and 50 features.
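
The 1/7 arithmetic can be checked quickly. The sketch below assumes, as the reasoning above does, that the progress bar advances in equal steps, one per keepX value; that is an assumption about the display, not something confirmed from the mixOmics source.

```r
list.keepX <- c(25, 50, 100, 500, 1000, 2500, 3000)

# Assumption: one equal-sized progress step per candidate keepX value.
progress <- seq_along(list.keepX) / length(list.keepX) * 100
names(progress) <- list.keepX

# Percentage shown once each candidate has finished.
print(round(progress, 1))
```

Under that assumption, a bar stuck at about 14% means the first candidate finished and the second one (keepX == 50) did not.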

By the looks of it, you haven’t applied any of the necessary preprocessing for your microbial data. Have a read of this page on the website for some more information on what is required before any analysis can take place.

With all this info, my best guess would be that the lack of preprocessing is causing this issue. It means a dataframe with 31-50 features is far too sparse. Otherwise, there may be something specific to your data. Let me know how you go with implementing the required preprocessing.
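
One way to get a feel for how sparse the table is: look at the fraction of zeros per taxon, and count how many taxa become constant within a cross-validation training set. This is a sketch on simulated counts; `counts`, `zero.frac`, `fold` and `constant.in.fold` are hypothetical names, and whether constant features within a fold are what triggers the error here is an assumption, not something established in this thread.

```r
set.seed(22)
n <- 136  # samples
p <- 148  # taxa

# Hypothetical very sparse count table standing in for the real data.
counts <- matrix(rpois(n * p, lambda = 0.05), nrow = n, ncol = p)

# Fraction of zero entries for each taxon.
zero.frac <- colMeans(counts == 0)
print(summary(zero.frac))

# Assign samples to 5 folds, then count taxa with zero variance
# in each training set (all samples outside the held-out fold).
fold <- sample(rep(1:5, length.out = n))
constant.in.fold <- sapply(1:5, function(k) {
  train <- counts[fold != k, , drop = FALSE]
  sum(apply(train, 2, var) == 0)
})
print(constant.in.fold)
```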

kdb.chau · August 29, 2022, 12:27pm

My microbial data is already pre-processed in that I conduct a filtering step removing all contigs with less than 0.1% relative contig abundance. Adding an offset of +1 would skew my data a lot: I am working with such low counts of taxa that adding 1 would change the dataframe substantially, and there are also many taxa with values around 1. I am concerned this would just “fake” my data in some way.

kdb.chau · August 29, 2022, 12:28pm

But I will give it a try with the offset since it seems to be applicable for microbial data - I am just working with metagenomic microbial data (so not 16S rRNA which would give very high counts of data).

kdb.chau · August 29, 2022, 12:51pm

I’ve tried preprocessing it with the CLR transformation, but once I run logratio.transfo, I cannot coerce the matrix back into a dataframe, which I need in order to split my data into test and training datasets along with my gene expression dataframe. Is there a reason CLR can’t be in a dataframe?

MaxBladen · August 29, 2022, 10:11pm

Applying an offset to all samples and taxa won’t skew the data at all as it is being universally applied. You don’t even need to use a value of 1; it can be any positive value. You just need there to be no zeroes in order to apply the CLR.
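
To see why the zeros have to go before the CLR, here is a minimal base R sketch of the transformation (log of each value minus the mean log of its sample). The tiny `counts` matrix and the `clr` helper are made up for illustration and are not the mixOmics implementation.

```r
# Hypothetical counts: 3 samples (rows) x 4 taxa (columns), with zeros.
counts <- matrix(c(0, 1, 3, 6,
                   2, 0, 0, 8,
                   1, 1, 4, 4), nrow = 3, byrow = TRUE)

# CLR: log of each value minus the mean log of its sample (row).
clr <- function(x) {
  logx <- log(x)
  logx - rowMeans(logx)
}

# log(0) is -Inf, so rows containing zeros give non-finite values.
print(any(!is.finite(clr(counts))))

# The same offset added to every entry makes all values finite.
clr.offset <- clr(counts + 1)
print(all(is.finite(clr.offset)))

# Each CLR-transformed sample sums to (numerically) zero.
print(rowSums(clr.offset))
```

In mixOmics itself, logratio.transfo should also accept an offset argument; check ?logratio.transfo for the version you have installed.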

It can be a data.frame - you just need to change the class from “clr” to “matrix”:

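library(mixOmics)

# Example 16S data shipped with mixOmics (TSS-normalised)
data(diverse.16S)
X <- diverse.16S$data.TSS

# Apply the CLR, then drop the "clr" class so as.data.frame() works
CLR <- logratio.transfo(X = X, logratio = 'CLR')
class(CLR) <- "matrix"
CLR <- as.data.frame(CLR)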
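
Once the CLR output is a data.frame, the train/test split mentioned earlier in the thread can reuse one set of row indices for both blocks so the samples stay paired. A minimal sketch with random stand-in data follows; `clr.df`, `gene.df`, `test.idx` and the 80/20 split are all assumptions for illustration.

```r
set.seed(22)
n <- 136

# Hypothetical stand-ins for the CLR-transformed taxa and the gene expression.
clr.df  <- as.data.frame(matrix(rnorm(n * 5), nrow = n))
gene.df <- as.data.frame(matrix(rnorm(n * 8), nrow = n))
sample.ids <- paste0("sample", seq_len(n))
rownames(clr.df)  <- sample.ids
rownames(gene.df) <- sample.ids

# Draw the held-out samples once and reuse the indices for both blocks.
test.idx <- sample(seq_len(n), size = round(0.2 * n))

X.train <- clr.df[-test.idx, ]
X.test  <- clr.df[test.idx, ]
Y.train <- gene.df[-test.idx, ]
Y.test  <- gene.df[test.idx, ]

# Sanity check: the two blocks contain the same samples in the same order.
stopifnot(identical(rownames(X.train), rownames(Y.train)))
stopifnot(identical(rownames(X.test), rownames(Y.test)))
```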

kdb.chau · August 30, 2022, 12:50pm

Oh thanks, this fixed it for me. Cheers!
