Robustness Metrics for ML Models based on Deep Learning Methods

Data Science Milan
Aug 3, 2022

On July 21st, 2022 Data Science Milan held a Meetup at Alkemy headquarters with Davide Posillipo as a speaker.

“Robustness Metrics for ML Models based on Deep Learning Methods”, by Davide Posillipo, Lead Data Scientist at Alkemy

What happens if we pass a chicken to LeNet?

Davide started the talk by introducing one of the most famous datasets in Deep Learning, i.e. MNIST, and showing how easily digits can be identified with a LeNet architecture. But what happens if we pass a picture that has nothing to do with digits, e.g. an image of a chicken, to a LeNet trained on MNIST?

In an ideal setting, since the chicken image comes from a completely different distribution than the digits, we would expect the algorithm to have no idea what to do with it. However, the funny thing is that the LeNet produces a quite confident prediction that the chicken is actually a 6 (see Fig.1).

Fig.1 Passing a chicken to LeNet — Expectation vs Reality

Usually Statisticians, or more generally people familiar with the subject, provide a set of ‘metrics’ together with their estimates, to help us understand how confident we can be about a prediction (e.g. the p-value, to judge whether this is a good prediction or just noise). Unfortunately, complex Machine Learning or Deep Learning models generally don’t come with these ‘safety metrics’, and looking exclusively at the probability of each prediction can be misleading. The same thing happens if we try something a bit more systematic. Let’s assume images from another famous dataset, FashionMNIST, are passed to the LeNet: as with the chicken image, the network provides quite ‘confident’ predictions, with almost 90% of the examples classified with a probability greater than 0.75.

One way to handle this problem is to use ‘robustness metrics’, the actual topic of Davide’s talk.

Robustness metrics

After seeing the weird behaviors that occur even on an easy problem like MNIST, a question comes up: can we trust deep learning predictions? And what if we are operating in a very sensitive area like health care or autonomous driving?

Sometimes, in contexts where risk must be controlled, avoiding a prediction is a much better option than going with a random guess. Within this framework we would need a metric that works as follows: if the metric value does not reach a pre-defined threshold, an exception is raised and no prediction is computed. A crucial point here is that the focus is not on fixing the prediction (or rather fixing the model), but on avoiding the prediction itself!

Once we agree that this could be a strategy, the next question is: what should a robustness metric look like? First things first, it should tell us whether the prediction is reliable (or whether it is essentially a random guess) by providing a measure of ‘confidence’. On top of this we can define a checklist of properties that our metric should have:

  • It should be computable without labels
  • Computable in ‘real-time’ (in case of a stream of inputs that we want to process one after the other)
  • Easy to plug into a working ML pipeline
  • Low false positive rate, but also low false negative rate: high effectiveness in detecting anomalous points (low false negative rate), and a low rate of discarded ‘good predictions’ (low false positive rate)

But let’s take a step back! We are talking about robustness metrics, but we haven’t defined what robustness is. Well, let’s try to answer this question with a quote from Davide himself:

Robustness is the quality of an entity that doesn’t break too easily if stressed in some way

Generally, approaches in Statistics involve some modification of the predictive model/estimator (‘if you get into a stressful situation, handle it better’), for instance trimming for the average, or changing the weights of the model, etc., often with a loss of performance as a side effect. But what if, instead of modifying our models, we protect them from threats (‘don’t get into stressful situations’)? We look for the chickens and keep them away from our model; we prevent them from getting into our system.

We are going to see two approaches to define robustness metrics:

  • GAN-based approach: decide if a prediction is worth the risk, checking the ‘stability’ of the classifier for the new input
  • VAE-based approach: decide if a prediction is worth the risk, checking the true ‘origin’ of the new input data

First approach: WGAN + Inverter

Let’s start with a quick description of GANs (Generative Adversarial Networks). In a nutshell, a GAN is a composition of two neural networks (Fig.2): the first one, called Generator, takes some random noise and produces inputs (like the digits we saw before); the second one, called Discriminator (or Critic), decides whether the input coming from the Generator is authentic or artificial. The idea is that at some point, after many iterations, the Generator becomes so good at producing artificial inputs that the Discriminator can no longer distinguish between the real inputs and the artificial ones.

Fig.2 GAN architecture
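
To make the setup concrete, here is a minimal, hypothetical PyTorch sketch of the two networks and one adversarial training step. The latent dimension, layer sizes, optimizers and the plain binary cross-entropy loss are illustrative choices, not details from the talk.

```python
import torch
import torch.nn as nn

LATENT_DIM = 64

# Generator: random noise -> fake 28x28 image (flattened, values in [-1, 1])
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
)

# Discriminator (Critic): image -> "real vs. artificial" score (a logit)
discriminator = nn.Sequential(
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(real_images):
    # real_images: tensor of shape (batch, 784), scaled to [-1, 1]
    batch = real_images.size(0)
    z = torch.randn(batch, LATENT_DIM)

    # 1) Discriminator: real images should score 1, generated ones 0
    fake = generator(z).detach()
    d_loss = bce(discriminator(real_images), torch.ones(batch, 1)) + \
             bce(discriminator(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator: try to make the Discriminator believe fakes are real
    g_loss = bce(discriminator(generator(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```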

In the approach proposed by Davide, a slightly fancier GAN is used: the WGAN (Wasserstein GAN), whose loss function is based on the Wasserstein distance, a measure of the distance between two probability distributions. For more information about the WGAN please refer to this link: https://lilianweng.github.io/posts/2017-08-20-gan/.

What is relevant in the proposed approach is that, compared to the normal GAN setting where there are only two networks, there is a third one called the Inverter. This additional network takes the inputs that the Generator created, like the fake digits, and translates them back into the latent representation that the Generator uses. It is essentially the inverse function of the Generator (Fig.3).

Fig.3 WGAN + Inverter approach
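
The Inverter is not part of a standard (W)GAN, so, as a rough illustration only, here is one way it could be trained, assuming the Generator from the previous sketch is already trained and kept frozen: ask the Inverter to recover the latent code that produced a generated image. Real implementations typically add further terms (e.g. an image-space reconstruction loss) that are omitted here.

```python
import torch
import torch.nn as nn

LATENT_DIM = 64

# Inverter: image -> latent code (the 'inverse' of the Generator)
inverter = nn.Sequential(
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, LATENT_DIM),
)
opt_i = torch.optim.Adam(inverter.parameters(), lr=1e-4)

def inverter_step(generator, batch_size=128):
    # Sample latent codes, generate images, then try to recover the codes
    z = torch.randn(batch_size, LATENT_DIM)
    with torch.no_grad():
        x_fake = generator(z)            # Generator is frozen here
    z_hat = inverter(x_fake)
    loss = ((z_hat - z) ** 2).mean()     # latent-space reconstruction error
    opt_i.zero_grad(); loss.backward(); opt_i.step()
    return loss.item()
```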

With this framework in mind, how do we build our robustness metrics?

Given a new data point x (i.e. a chicken is coming), for which the classifier makes a prediction (it says 6), let’s find the closest point to x in the latent space that, once translated back to the original space, is able to confound the classifier f.

The idea behind this logic is that, starting from z*, the latent representation of the chicken, we look around it and search for the closest point, in that representation, that can fool the classifier. Applied to the chicken example: if the classifier was initially predicting 6, now it will predict 8. The rationale is that, if only a little effort is needed to make the classifier change its mind, then perhaps the classifier is not really confident about its prediction.

The robustness metric at this point could be defined as the distance between the translated chicken in the latent space and the closest point that is able to fool the classifier.

If this distance is really small then the classifier was very easy to fool, whereas if it is large then the classifier is ‘stable’.
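
Putting the pieces together, here is a hedged sketch of how such a metric might be computed: invert the new input into the latent space, search around the resulting code for a nearby point whose decoded image flips the classifier’s prediction, and return the distance to it. The incremental random search used below is just one possible strategy, not necessarily the one used in the talk, and the classifier is assumed to accept the Generator’s (flattened) output format.

```python
import torch

def robustness_metric(x, classifier, generator, inverter,
                      step=0.05, max_radius=5.0, n_samples=64):
    """Distance in latent space to the closest found point that
    changes the classifier's prediction for x."""
    with torch.no_grad():
        z_star = inverter(x.view(1, -1))                       # latent code of x
        original_label = classifier(x.view(1, -1)).argmax(dim=1)

        radius = step
        while radius <= max_radius:
            # Random candidates on a sphere of the current radius around z*
            noise = torch.randn(n_samples, z_star.size(1))
            noise = radius * noise / noise.norm(dim=1, keepdim=True)
            candidates = z_star + noise
            labels = classifier(generator(candidates)).argmax(dim=1)
            changed = labels != original_label
            if changed.any():
                # Metric: distance to the closest label-flipping point found
                return noise[changed].norm(dim=1).min().item()
            radius += step
        return max_radius  # nothing flipped the prediction within the search
```

A small returned value means that a tiny latent perturbation was enough to change the prediction, i.e. the classifier is not stable around this input.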

Obviously the question here is: how do we decide what is big or small? How do we find the threshold at which we should make a decision?

To answer those questions Davide trained the WGAN + the Inverter. After the training he looked at the distribution of this distance on the test set and took, as reference, the 5th percentile as the threshold: the goal is to compare this threshold with the distance computed for the chicken. The latter was way smaller than the threshold, so in this setting the prediction would be discarded because there is something weird in the input.

Reasonably, we should check not just a single data point, but many (chicken) inputs. To do that the FashionMNIST dataset was used and, again taking the 5th percentile as threshold, we get a 5% positive rate by definition but a 35% loss of good predictions (Fig.4). Converting this back to our chicken problem, in 35% of the cases we are finding chickens that are not there, which of course is not a good result.

One option to overcome this problem could be to tune the threshold value.

Fig.4 Results of the WGAN + Inverter experiment
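
In code, the thresholding logic behind these numbers could be expressed as in the sketch below (NumPy; the two arrays of pre-computed metric values are assumed to be available).

```python
import numpy as np

def evaluate_threshold(test_metrics, anomaly_metrics, percentile=5):
    """test_metrics: robustness metric on clean (in-distribution) test inputs;
    anomaly_metrics: the same metric on 'chicken-like' inputs (e.g. FashionMNIST)."""
    threshold = np.percentile(test_metrics, percentile)
    # Below the threshold -> the prediction is discarded
    fpr = np.mean(test_metrics < threshold)      # good predictions discarded (~5% by construction)
    fnr = np.mean(anomaly_metrics >= threshold)  # anomalies that slip through
    return threshold, fpr, fnr
```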

Second approach: Variational AutoEncoder

The second approach to define a robustness metric is based on the AutoEncoder, a composition of two networks: the Encoder, which takes the inputs and ‘compresses’ them into a latent representation, and the Decoder, which learns how to take that compression and expand it back to the original space (Fig.5). If this works properly, after many iterations it is not possible to tell the original input apart from the reconstructed one. The idea does not differ much from the GAN: both are generative approaches.

Fig.5 VAE architecture

What happens with the VAE is that the input is mapped into a distribution, rather than a fixed vector. In this scenario we have a probabilistic encoder and a probabilistic decoder that are, in a way, approximations of the posterior, the likelihood and some prior. You can find more info about the topic at the following link: https://lilianweng.github.io/posts/2018-08-12-vae/
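
As a rough, hypothetical PyTorch sketch (layer sizes and latent dimension are illustrative), a VAE for MNIST-like images and its per-example loss, the quantity used as robustness metric below, could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(28 * 28, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(256, latent_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z from the encoded distribution
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def per_example_loss(model, x):
    """Reconstruction error + KL term for each input (x: (batch, 784) in [0, 1])."""
    x_hat, mu, logvar = model(x)
    recon = F.binary_cross_entropy(x_hat, x, reduction="none").sum(dim=1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return recon + kl   # one loss value per example
```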

But how is it possible to leverage this approach to define a robustness metric? If we pay attention, we see that the loss function of a VAE is, in a way, an indirect measure of the probability that an observation comes from the same distribution underlying the training set. Basically, the loss function tells us how hard the decoding was after the encoding: if the decoding were ‘perfect’ we would have a loss equal to 0; if instead the loss is large, the reconstruction did not work.

The idea is that an ‘unlikely’ input would produce a large loss, and this loss is our actual robustness metric! And the coolest thing here is that we don’t need the classifier at all.

As with the first approach, Davide trained the architecture and checked the loss distribution. However, in this case a conservative approach was followed and the threshold was set equal to the maximum observed loss: no matter what, if the loss generated by the new input is higher than that maximum, the prediction is discarded.

As before, the FashionMNIST dataset was used as new input for the VAE model: using the maximum test loss as threshold, we get a 0% false positive rate (no good predictions are discarded) and a 3.55% false negative rate (anomalous inputs that go undetected), as shown in Fig.6.

Fig.6 Results of the VAE experiment

The VAE approach can also be ‘easily’ deployed in production, following this list of steps:

  • Train your classifier
  • Train a VAE on your training set
  • Get the distribution of the VAE losses on your test set
  • Define a more or less ‘conservative’ threshold
  • Implement a conditional classifier (see Fig.7 and the sketch below)
Fig.7 Conditional classifier
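
A minimal sketch of that conditional classifier, reusing the VAE and the per_example_loss function from the earlier sketch, and assuming the conservative choice of threshold, i.e. the maximum loss observed on the test set:

```python
import torch

def fit_threshold(vae, test_loader):
    """Conservative threshold: the maximum VAE loss observed on the test set."""
    losses = []
    with torch.no_grad():
        for x, _ in test_loader:
            losses.append(per_example_loss(vae, x.view(x.size(0), -1)))
    return torch.cat(losses).max().item()

def conditional_predict(x, classifier, vae, threshold):
    """Return a prediction only if the input looks in-distribution."""
    x = x.view(1, -1)
    with torch.no_grad():
        if per_example_loss(vae, x).item() > threshold:
            return None          # 'chicken detected': refuse to predict
        return classifier(x).argmax(dim=1).item()
```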

Conclusions

To summarize, Davide presented two different approaches to define a robustness metric. Of the two, the one based on the VAE seems more appealing, especially because it does not require the classifier at all.

All in all, the VAE approach comes with several pros, although a couple of cons can’t be avoided. On the plus side, the VAE-based metric can be used to monitor many different ML models (for a given dataset) at the same time, and it is applicable to any kind of data (tabular, image, etc.). In addition, it is relatively ‘easy’ to explain and to plug into existing pipelines. On the other hand, arbitrary thresholds must be set by Data Scientists and, finally, the robustness metric itself is an additional model that needs to be maintained in production.

Recording & Slides

Written by Matteo Boscato

Originally published at https://medium.com on August 3, 2022.
