
Adversarial attacks on "Explanation models"

Before we start our discussion on attacks, let us understand the explanation model, why we need it in the first place and how it works.

As you can see in the workflow, if the model you have developed is complex, you will have to adopt a "post-hoc" method to explain the model. Essentially it means you will build a simplified model on top of your complex model. This simplified model is called the explanation model.

Why do we need an explanation model?

The explanation model is needed to explain the complex model you built for your use case. It explains individual predictions of any black box model by learning an interpretable model (e.g. a linear model) locally around each prediction. The output is a feature importance for each individual prediction, which describes how the black box model behaves near that data point rather than how it behaves overall.

How does your explanation model work?

As you can see in the diagram below, complex models have complex decision boundaries, but locally, around a single prediction, the boundary can be approximated by a simpler one (the dotted line).

But how will you generate simpler boundaries? You perturb the data: you change the feature values slightly, either by adding or subtracting a small value or by applying a function that slightly shifts the data points. You do this for a small sample of data points to create a neighbourhood. The class labels of the perturbed points are assigned by the original black box model.

Once you have a neighbourhood, you fit a local linear model to it.
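To make these steps concrete, here is a minimal sketch of the perturb, label, and fit loop using only NumPy. It is an illustration in the spirit of LIME, not the LIME library itself. The `black_box` function, the Gaussian perturbation, the kernel width, and the sample count are all assumptions chosen for the example.

```python
import numpy as np

def black_box(X):
    """Hypothetical complex model: predicts 1 inside a circle of radius 1.5."""
    return (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5 ** 2).astype(float)

def explain_locally(predict_fn, x, n_samples=2000, scale=0.5,
                    kernel_width=0.75, seed=0):
    """Fit a weighted linear model around x and return one weight per feature."""
    rng = np.random.default_rng(seed)
    # 1. Perturb: add small Gaussian noise to x to create a neighbourhood.
    neighbourhood = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    # 2. Label: the black box, not the true labels, classifies each perturbed point.
    labels = predict_fn(neighbourhood)
    # 3. Weight: perturbed points closer to x count more (assumed exponential kernel).
    distances = np.linalg.norm(neighbourhood - x, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # 4. Fit: weighted least squares with an intercept column.
    design = np.hstack([neighbourhood, np.ones((n_samples, 1))])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(design * sw, labels[:, None] * sw, rcond=None)
    return coef[:-1, 0]  # drop the intercept, keep the feature weights

# Explain a point that sits close to the circular decision boundary.
point = np.array([1.4, 0.0])
importance = explain_locally(black_box, point)
print("local feature importance:", importance)
```

At this point the first feature should get a clearly negative weight, because moving further right leaves the circle, while the second feature matters much less. That is a local statement about this one prediction only.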

What is the vulnerability here?

As we discussed, you perturb the data to create an explanation model. The assumption here is that the perturbed data is similar to the original data. In reality, as you can see in the graph, the perturbed data points are often out of distribution: the red points lie away from the original data points in blue.
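One rough way to see this gap is to measure how far points lie from their nearest real neighbour. The sketch below assumes a hypothetical dataset with two strongly correlated features and perturbs each feature independently. The data, the noise levels, and the nearest-neighbour measure are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical real data: the second feature is almost exactly twice the first,
# so real points lie close to a line.
x0 = rng.normal(0.0, 1.0, size=400)
real = np.column_stack([x0, 2 * x0 + rng.normal(0.0, 0.05, size=400)])
train, held_out = real[:300], real[300:]

# Perturb each feature independently, ignoring the correlation between them.
perturbed = held_out + rng.normal(0.0, 1.0, size=held_out.shape)

def nn_distance(points, reference):
    """Distance from each point to its nearest neighbour in the reference set."""
    diffs = points[:, None, :] - reference[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)

real_gap = nn_distance(held_out, train).mean()
perturbed_gap = nn_distance(perturbed, train).mean()
print("mean distance to nearest real point (real):     ", real_gap)
print("mean distance to nearest real point (perturbed):", perturbed_gap)
```

Under these assumptions the perturbed points sit much further from the real data than unseen real points do, which is exactly what makes them detectable.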

So if you are an adversary, you can exploit this difference in distribution to control what the explanation model shows. What is the implication of this vulnerability? Suppose that, as a scammer, you want to use a protected feature such as race or gender and build a biased model. You create an adversarial classifier that behaves like the original biased classifier on real data but behaves in an unbiased way when the data is perturbed. Since the explanation model is built from the perturbed data, your model will appear unbiased.
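The following toy sketch shows the shape of such an adversarial classifier. Everything in it is hypothetical: the two binary features, the `biased_model` and `innocuous_model` functions, and especially `looks_perturbed`, which is a deliberately simple stand-in for the detector an attacker would normally train to separate real from perturbed points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical real data: column 0 is a protected attribute (0/1),
# column 1 is an unrelated feature (0/1). Real rows contain only exact 0/1 values.
real_X = rng.integers(0, 2, size=(500, 2)).astype(float)

def biased_model(X):
    """Decides purely on the protected attribute."""
    return (X[:, 0] > 0.5).astype(float)

def innocuous_model(X):
    """Decides purely on the unrelated feature."""
    return (X[:, 1] > 0.5).astype(float)

def looks_perturbed(X, tol=1e-6):
    """Toy out-of-distribution check: real rows are exact integers, perturbed rows are not."""
    return np.any(np.abs(X - np.round(X)) > tol, axis=1)

def adversarial_model(X):
    """Biased on real-looking rows, innocuous on rows that look perturbed."""
    return np.where(looks_perturbed(X), innocuous_model(X), biased_model(X))

# On real data the adversarial model is exactly the biased model.
same_on_real = np.array_equal(adversarial_model(real_X), biased_model(real_X))

# A perturbation-based explanation only ever sees the innocuous behaviour.
x = np.array([1.0, 0.0])
perturbed = x + rng.normal(0.0, 0.5, size=(2000, 2))
design = np.hstack([perturbed, np.ones((2000, 1))])
coef, *_ = np.linalg.lstsq(design, adversarial_model(perturbed), rcond=None)
importance = coef[:2]  # weights for [protected, unrelated]

print("matches biased model on real data:", same_on_real)
print("surrogate weights [protected, unrelated]:", importance)
```

The linear surrogate fitted on the perturbed neighbourhood puts its weight on the unrelated feature, even though every real decision is driven by the protected attribute.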

Let me explain this with a small story. I buy petrol from a nearby petrol pump and always fill the whole tank. The petrol price fluctuates frequently in India, so I used to pay a different amount every time, usually slightly more than the last time due to global factors. But on one occasion, the change was drastic. I was confused. However, I paid and came back.

I wondered why there was such a huge rise. When I checked the petrol price, it had not increased this drastically. Secondly, my car's petrol tank does not grow; it is still the same size. There was no leakage in the petrol tank either. Then why did I pay more? So I planned to check the meter reading the next time I filled the tank. As planned, I went to the same petrol pump and asked for a full tank. Surprisingly, the petrol dispenser reading showed 60 litres, while my tank's capacity is still 55. Clearly, the meter had been tampered with. But as a user, you don't know until you pay attention.

Likewise, the explanation model can be fooled into showing an unbiased output while your original black box model is biased. Though there are many 'post-hoc' methods, I am restricting the discussion to model-agnostic, perturbation-based 'post-hoc' techniques such as LIME and Kernel SHAP. I hope you enjoyed reading the article. Please like, share and comment.

Views are personal.

Image Credit:

Banner created using

Photo by Ron Lach:
