MyPerf.diablo$features$stable already returns a selection stability (%) for each nrep.
I thought that selection stability was the frequency of selection across repeats (i.e. selected 8 times / 10 nreps).
Why do we get a % for each nrep, and how could I then extract the overall selection stability of particular features?
Remember that this is a cross-validated model. You set folds = 5 within the function call, so for each repeat there are actually 5 models (one per fold), and the stability is calculated across those: it is the proportion of the 5 fold models in which a feature was selected.
Yes, you can average this across the repeats. I just realised that we had averaged across repeats for PLSDA but not for block.PLSDA. I'll mark this as a bug. Thanks for your feedback.
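In the meantime, here is a minimal sketch of how you could average one block/component yourself across two repeats (the object and block names, MyPerf.diablo and proteomics, are placeholders for your own; features absent from a repeat are counted as 0):
stab1 <- MyPerf.diablo$features$stable$nrep1$proteomics$comp1
stab2 <- MyPerf.diablo$features$stable$nrep2$proteomics$comp1
# repeats can select different features, so work on the union of feature names
feat <- union(names(stab1), names(stab2))
s1 <- ifelse(feat %in% names(stab1), stab1[feat], 0)
s2 <- ifelse(feat %in% names(stab2), stab2[feat], 0)
# average stability across the two repeats, most stable features first
avg.stab <- sort(setNames((s1 + s2) / 2, feat), decreasing = TRUE)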
I was wondering if there is a way or function to calculate the average stability of features across all repeats in a block.splsda model. My model includes 100 repeats and 4 components, and I am looking for a way to compute the average stability of the selected features per component.
For example, to retrieve the stability of features for the first repetition and the first component in the proteomics block, I use the following code:
diablo_final_perf$features$stable$nrep1$proteomics$comp1
Furthermore, as an additional question, I plan on doing an overrepresentation analysis with the list of selected features, comparing inpatients vs outpatients in heart failure using DIABLO. Would it be more appropriate to include only features above a certain stability threshold (e.g. 0.8)? Is there an established or commonly accepted stability threshold for this type of analysis?
We do not have such a function, but maybe @evahamrud can show you a quick example to 'unlist' the outputs.
It would make sense to set a threshold on the stable features, but we don't have an accepted threshold, as it really depends on your data (sometimes you can only go as high as 0.6). Some users also input all the selected variables and only check / validate specific features afterwards based on their stability.
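For example, a quick sketch using the object from your code above (the 0.8 cut-off is purely illustrative):
# keep only the features whose stability meets a chosen cut-off in one repeat
stab <- diablo_final_perf$features$stable$nrep1$proteomics$comp1
names(stab)[stab >= 0.8]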
Here is a short piece of code you can use to quickly summarise the feature stabilities across repeats, per component and variable.
## load data and build a DIABLO model with 2 components
library(mixOmics)
data(nutrimouse)
Y <- nutrimouse$diet
data <- list(gene = nutrimouse$gene, lipid = nutrimouse$lipid) # gene has 120 variables, lipid 21
nutrimouse.sgccda <- block.splsda(X = data,
                                  Y = Y,
                                  design = 'full',
                                  keepX = list(gene = c(10, 10), lipid = c(15, 15)),
                                  ncomp = 2)
# run perf() with 2 repeats and 3 folds
perf.diablo <- perf(nutrimouse.sgccda, folds = 3, nrepeat = 2)
# examine the structure of the output - stability values are averaged across
# folds but not across repeats
str(perf.diablo$features$stable)
# unlist the feature stability data; names take the form nrep.block.component.feature
unlisted_data <- unlist(perf.diablo$features$stable)
# Extract the repeat number, block (datatype) and component from the names and
# put these into separate columns of a data frame
split_names <- strsplit(names(unlisted_data), "\\.")
df <- data.frame(
  nrepeat   = sapply(split_names, function(x) x[1]),
  datatype  = sapply(split_names, function(x) x[2]),
  component = sapply(split_names, function(x) x[3]),
  # feature names may themselves contain '.', so re-join the remaining pieces
  feature   = sapply(split_names, function(x) paste(x[4:length(x)], collapse = ".")),
  value     = unlisted_data
)
df
# group the data as you wish, e.g. here we average each feature's stability
# (per block and component) across repeats
grouped_data <- aggregate(value ~ feature + component + datatype, data = df, FUN = mean)
grouped_data
Let me know if this works for you or if it needs modifications!
Something I am noticing, though, is that the features now listed in my grouped_data object are different from the selected features I obtained through the tune.block.splsda() function. I thought it would give the stability of the final selected features, but it is showing more features than were selected.
For example, if I run the code:
grouped_data <- grouped_data[grouped_data$component == "comp1",]
grouped_data
and I compare it to the selected features for each block using:
selectVar(diablo.4, block = 'proteomics', comp = 1)
selectVar(diablo.4, block = 'transcriptomics', comp = 1),
the features are different.
This is to be expected. The final selection of variables is made on your entire data set (obtained through selectVar()), whereas the stability of features is measured on subsets of your samples using cross-validation. You would hope that the variables you select are highly stable, but sometimes that is not the case.
This is why it is good to have a look at both measures: the selected variables with their loading coefficients, and their stability.
Have a look at the bottom of this example for the single-omics case, where the stability list is much larger than the final list of variables: sPLSDA SRBCT Case Study – mixOmics
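As a quick sketch with the nutrimouse example from earlier in this thread, you can cross-reference the final selection with the averaged stabilities in grouped_data:
# final selection on the full data for the gene block, component 1
sel.gene <- selectVar(nutrimouse.sgccda, block = 'gene', comp = 1)$gene$name
# averaged CV stabilities for the same block and component
stab.gene <- grouped_data[grouped_data$component == "comp1" &
                          grouped_data$datatype == "gene", ]
# stabilities of the finally selected genes; genes missing from stab.gene were
# never selected in any CV fold
stab.gene[stab.gene$feature %in% sel.gene, ]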