MyPerf.diablo$features$stable already returns a selection stability (%) for each nrep.
I thought that selection stability was the frequency of selection across repeats (i.e. selected 8 times / 10 nreps).
Why do we get a % for each nrep, and how could I then extract the overall selection stability of particular features?
Remember that this is a cross-validated model. You set folds = 5 within the function call, so for each repeat there are actually 5 models (one per fold), and the stability is calculated across those: it is the proportion of the 5 fold models in which a feature was selected.
Yes, you can average this across the repeats. I just realised that we had averaged across repeats for PLSDA but not for block.PLSDA. I'll mark this as a bug. Thanks for your feedback.
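In the meantime, here is a minimal sketch of how you could average one block/component yourself across two repeats (the object and block names, MyPerf.diablo and proteomics, are placeholders for your own; features absent from a repeat are counted as 0):
stab1 <- MyPerf.diablo$features$stable$nrep1$proteomics$comp1
stab2 <- MyPerf.diablo$features$stable$nrep2$proteomics$comp1
# repeats can select different features, so work on the union of feature names
feat <- union(names(stab1), names(stab2))
s1 <- ifelse(feat %in% names(stab1), stab1[feat], 0)
s2 <- ifelse(feat %in% names(stab2), stab2[feat], 0)
# average stability across the two repeats, most stable features first
avg.stab <- sort(setNames((s1 + s2) / 2, feat), decreasing = TRUE)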
I was wondering if there is a way or function to calculate the average stability of features across all repeats in a block.splsda model. My model includes 100 repeats and 4 components, and I am looking for a way to compute the average stability of the selected features per component.
For example, to retrieve the stability of features for the first repetition and the first component in the proteomics block, I use the following code:
diablo_final_perf$features$stable$nrep1$proteomics$comp1
Furthermore, as an additional question, I plan on doing an overrepresentation analysis with the list of selected features, comparing inpatients vs outpatients in heart failure using DIABLO. Would it be more appropriate to include only features above a certain stability threshold (e.g. 0.8)? Is there an established or commonly accepted stability threshold for this type of analysis?
We do not have such a function, but maybe @evahamrud can show you a quick example to 'unlist' the outputs.
It would make sense to set a threshold on the stable features, but we don't have an accepted threshold, as it really depends on your data (sometimes you can only go as high as 0.6). Some users also input all the selected variables and only check / validate specific features afterwards based on their stability.
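For example, a quick sketch using the object from your code above (the 0.8 cut-off is purely illustrative):
# keep only the features whose stability meets a chosen cut-off in one repeat
stab <- diablo_final_perf$features$stable$nrep1$proteomics$comp1
names(stab)[stab >= 0.8]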
Here is a short piece of code you can use to quickly summarise the feature stabilities across repeats, per component and variable.
## load data and build a DIABLO model with 2 components
library(mixOmics)
data(nutrimouse)
Y <- nutrimouse$diet
data <- list(gene = nutrimouse$gene, lipid = nutrimouse$lipid) # gene has 120 variables, lipid 21
nutrimouse.sgccda <- block.splsda(X = data,
                                  Y = Y,
                                  design = 'full',
                                  keepX = list(gene = c(10, 10), lipid = c(15, 15)),
                                  ncomp = 2)
# run perf() with 2 repeats and 3 folds
perf.diablo <- perf(nutrimouse.sgccda, folds = 3, nrepeat = 2)
# examine the structure of the output - stability values are averaged across
# folds but not across repeats
str(perf.diablo$features$stable)
# unlist the feature stability data; names take the form nrep.block.component.feature
unlisted_data <- unlist(perf.diablo$features$stable)
# Extract the repeat number, block (datatype) and component from the names and
# put these into separate columns of a data frame
split_names <- strsplit(names(unlisted_data), "\\.")
df <- data.frame(
  nrepeat   = sapply(split_names, function(x) x[1]),
  datatype  = sapply(split_names, function(x) x[2]),
  component = sapply(split_names, function(x) x[3]),
  # feature names may themselves contain '.', so re-join the remaining pieces
  feature   = sapply(split_names, function(x) paste(x[4:length(x)], collapse = ".")),
  value     = unlisted_data
)
df
# group the data as you wish, e.g. here we average each feature's stability
# (per block and component) across repeats
grouped_data <- aggregate(value ~ feature + component + datatype, data = df, FUN = mean)
grouped_data
Let me know if this works for you or if it needs modifications!
Something I am noticing, though, is that the features now listed in my grouped_data object are different from the selected features I obtained through the tune.block.splsda() function. I thought it would give the stability of the final selected features, but it is showing more features than were selected.
For example, if I run the code:
grouped_data <- grouped_data[grouped_data$component == "comp1",]
grouped_data
and I compare it to the selected features for each block using:
selectVar(diablo.4, block = 'proteomics', comp = 1)
selectVar(diablo.4, block = 'transcriptomics', comp = 1),
the features are different.
This is to be expected. The final selection of variables is made on your entire data set (obtained through selectVar()), whereas the stability of features is measured on subsets of your samples using cross-validation. You would hope that the variables you select are highly stable, but sometimes that is not the case.
This is why it is good to have a look at both measures: the selected variables with their loading coefficients, and their stability.
Have a look at the bottom of this example for the single-omics case, where the stability list is much larger than the final list of variables: sPLSDA SRBCT Case Study – mixOmics
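As a quick sketch with the nutrimouse example from earlier in this thread, you can cross-reference the final selection with the averaged stabilities in grouped_data:
# final selection on the full data for the gene block, component 1
sel.gene <- selectVar(nutrimouse.sgccda, block = 'gene', comp = 1)$gene$name
# averaged CV stabilities for the same block and component
stab.gene <- grouped_data[grouped_data$component == "comp1" &
                          grouped_data$datatype == "gene", ]
# stabilities of the finally selected genes; genes missing from stab.gene were
# never selected in any CV fold
stab.gene[stab.gene$feature %in% sel.gene, ]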