Help with the % variance explained in block.splsda (DIABLO)

I think this is the variance explained:

diablo.tcga$prop_expl_var
$microbiota
    comp1     comp2 
0.1560833 0.1093269 

$metabolon_feces
    comp1     comp2 
0.1138455 0.1318485 

$metabolon_plasma
    comp1     comp2 
0.1865667 0.1051642 

$Y
    comp1     comp2 
1.0000000 0.4195772 

For example, component 1 of the microbiota dataset (i.e. the selected microbiota variables) would explain 15% of the variance in the data. Is this assumption correct?

Now, I do not understand the % variance of $Y. This is my outcome: these are the groups into which I classify the samples, so why do I get a variance explained for them on each component? Is it because you choose the variables that maximize the differences between the two groups?

component 1 of the microbiota dataset (i.e. the selected microbiota variables) would explain 15% of the variance in the data. Is this assumption correct?

Almost: it explains 15% of the variance in the microbiota data specifically, as each block's proportion is computed relative to that block alone.

Is it because you choose the variables that maximize the differences between the two groups?

In DIABLO (and sPLS-DA), we select components whose primary objective is to discriminate the groups. Captured variance is not a primary focus of this method, unlike in PCA.

Having said that, I’ve seen multiple posts referring to this explained variance of 1 on the first Y component in DA scenarios. Based on the frequency of this observation, I’m beginning to think it is an error. I’m not sure when I’ll be able to look into this (as I’m focusing on other work at the moment), but I’ll let you know when I can.

Cheers,
Max.

Max,

Thank you very much for the prompt response!

So when talking about the analysis, instead of talking about variance explained, I should talk about the BER, right? What BER is usually considered good in one of these analyses?

Is there any other parameter used to evaluate the “efficiency” of the model result?

Thank you very much for all the work you do.

when talking about the analysis, instead of talking about variance explained, I should talk about the BER, right?

Explained variance is a good thing to mention if it has quite high (or low) values as it can give you an idea of the amount of non-discriminatory information in your dataset. However, when it comes to quantitative evaluation of your method, error rate (or BER) is definitely the better metric to use.
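
A minimal sketch of how you might compute this, assuming your fitted object is diablo.tcga (as in your output above); the folds and nrepeat values are placeholders to adjust for your sample size:

# Cross-validated error rates for a block.splsda model via mixOmics' perf()
perf.diablo <- perf(diablo.tcga, validation = "Mfold", folds = 5, nrepeat = 10)
perf.diablo$MajorityVote.error.rate  # per-class, overall and balanced (BER) error rates
plot(perf.diablo)                    # error rates across components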

What BER is usually considered good in one of these analyses?

There is no specific threshold, as every analysis is unique. For example, if you’re working on a preliminary study and have a minimal dataset with 4 response classes, a BER of 60% is actually quite good, as it represents a 15-percentage-point improvement over random class assignment. Being a preliminary study, you’re not looking for amazing results.
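
As a quick sanity check of that arithmetic (assuming 4 balanced classes):

1 - 1/4      # expected BER under random class assignment: 0.75
0.75 - 0.60  # a BER of 60% is a 15-percentage-point improvement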

However, for more stringent experimental designs, you may be looking to minimise BER to 10-15%. I’d say generally, aiming for about 25% BER is a good starting point, but don’t take that as gospel!

Is there any other parameter used to evaluate the “efficiency” of the model result?

This may be semantics, but I wouldn’t describe BER as measuring “efficiency”, but rather accuracy. For assessing accuracy, have a look at the auroc() function (via ?auroc). It provides some numerical and graphical ways for you to assess the specificity and sensitivity of your model. If you’re unsure what these metrics mean or what AUROC represents, feel free to ask.
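
A hedged sketch, again assuming the fitted object diablo.tcga; roc.block selects which block’s predictions to assess (here using your “microbiota” block as an example):

auroc(diablo.tcga, roc.block = "microbiota", roc.comp = 1)  # AUC per class, plus ROC plot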

Regarding efficiency: if you’re talking about runtime efficiency, then simply tracking the runtime (via Sys.time()) will provide an assessment. If you mean the efficiency of the method in explaining your data, this is where the proportion of explained variance is handy to think about. Additionally, examining the loadings (via the plotLoadings() function) will show how effective individual features were at discriminating the classes.
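
For example (data.blocks, list.keepX and design.matrix below are placeholders for whatever you used to fit your model):

t0 <- Sys.time()
diablo.tcga <- block.splsda(X = data.blocks, Y = Y, ncomp = 2,
                            keepX = list.keepX, design = design.matrix)
Sys.time() - t0                                       # runtime of the fit
plotLoadings(diablo.tcga, comp = 1, contrib = "max")  # most discriminative features per block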

Hope this info helps!

Hi @Lorengol ,

After doing some thinking and some reading, I think I’ve determined the cause of the explained variance equaling 1 for the first Y component.

I can assume that your response variable (Y) only has two classes, correct? In your scenario, the Y data frame is represented by a single variable (0 or 1 for each class). As far as the method is concerned, this is considered its own “block” - in the same way your various X blocks are treated.

Components generated for the X blocks use a combination of all the input features. For example, if you have three features, the loadings for the first component might be 0.3, 0.8 and 0.5. Using these weights in a linear combination of the input features allows us to represent all three features “simultaneously” with the one component.
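
As a toy illustration of that linear combination (plain base R, not actual mixOmics output):

set.seed(1)
X <- matrix(rnorm(5 * 3), nrow = 5)  # 5 samples, 3 features
w <- c(0.3, 0.8, 0.5)                # the example loadings above
comp1 <- X %*% w                     # one component score per sample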

Now, when we try to do the same for the Y block, there is only a single variable to generate a component from. Therefore, it just uses this variable as is (sometimes flipping the sign), so the resulting loading will just be 1 (or -1). Hence, when calculating the explained variance, the original Y data and the first Y component are essentially identical, meaning the proportion of explained variance is equal to 1.
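
A toy check of that point (ignoring centring and scaling, which don’t change the conclusion):

y  <- c(0, 0, 1, 1, 1)  # a single 0/1 response variable
t1 <- 1 * y             # first Y "component": the variable itself (loading = 1)
cor(t1, y)^2            # proportion of Y's variance explained = 1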

When calculating the explained variance for subsequent Y components, the process is a little more complicated and subject to a few different requirements. Hence, the second (and further) components are not identical to the Y vector, resulting in an explained variance lower than 1.

Hope this clarifies things a bit

Max,

Thank you very much for taking the time. Both answers were very clear. Of course, with the DIABLO block sPLS-DA I get the variables that best classify the Y groups, which is why I should not focus so much on the explained variance (I guess that matters more if I do a PLS or PCA).

I am familiar with ROC curves, so I will use them to get another metric for my model.

On the other hand, it is also clear to me now what was going on with the Y explained variance values; it makes sense. In the plotIndiv output, I saw that the Y groups were separated along component 1. Now, if the explained variance of Y component 2 is 0.41, does it mean that this component has some influence on Y? Or should I not pay attention to that number?

Thank you very much for your willingness to help. Everything you do is a great help!

Or should I not pay attention to that number?

If you can see that the first component discriminates your sample groups perfectly (a clear, linear decision boundary), then the second component may not even be worth keeping in your model (save for the use of the plotIndiv() and plotVar() functions). Either way, don’t ignore this number, but understand it isn’t worth focusing on, as the model doesn’t build its components to maximise explained variance.
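
To judge that visually (assuming the same diablo.tcga object):

plotIndiv(diablo.tcga, comp = c(1, 2), ind.names = FALSE, legend = TRUE)  # which component separates the groups?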
