Adversarial Machine Learning

Marcus Gruneau
Jan 7, 2021 · 5 min read

Imagine that you train a convolutional neural network to solve the task of predicting some class given an image (e.g. the network can answer the question “what type of animal is present in this image?”, given a random image of an animal).
You train it on a dataset of images and proceed to test it on some new, unseen images, resulting in a great test classification accuracy.

You become excited and decide to tell one of your best friends about your success. You share your algorithm with your friend, who decides to test it out using some of her own images of animals. After half an hour you receive a rather disappointing message, in which your friend explains that the performance of your classifier isn’t even close to what you claimed.

You tell your friend to send you the images she used, and when you look at them they seem very similar to the ones in your test set, some even being duplicates. You decide to pick out one of the images your friend used, which you know should be classified as a donkey (since you have the same image in your test set). You run it through your classifier and it classifies it (with high certainty) as a pelican. What is going on here?

Although they seem identical to the human eye, the “same” image of a donkey your friend used is actually a slightly modified version of the original. It has been modified in such a way that it looks the same to a human, but is different enough to trick your classifier.

This is what is known as an adversarial sample. The main objective of adversarial machine learning is to find small perturbations of the input data that maximise the loss of the model.

So how does one generate adversarial samples? And, perhaps more interestingly, how can you make your classifier more robust against adversaries?

There are several ways of doing this, and the right one depends on how much access you have to the algorithm. However, as with any security problem, you should always assume the worst, i.e. that the adversary has full access to your algorithm.

The setting in the example above is called a white box attack, because the attacker (your friend) was given full access to our algorithm. If you have full access to the model you can use a method called FGSM (Fast Gradient Sign Method) [1]. The TL;DR of the algorithm is: calculate the gradient of the loss function with respect to a data point x and take a small step in the direction of the sign of that gradient, since we want the loss to increase:
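In symbols, using the notation from [1] (J is the loss function, θ the model parameters, x the input and y the true label):

x_adv = x + ϵ · sign(∇ₓ J(θ, x, y))

Below is a minimal PyTorch sketch of this step. It assumes a trained classifier `model`, inputs scaled to [0, 1], and a cross-entropy loss; the helper name `fgsm_attack` and the value of `epsilon` are just illustrative choices.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Craft an FGSM adversarial example for input batch x with true labels y."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Take a small step in the direction of the sign of the gradient
    # (this increases the loss), then clamp back to the valid pixel range [0, 1].
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```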

Why does this work? (The following section is explained more thoroughly in the paper cited above).

Consider a sample x (an image) from our data set. To simplify, we can think of every element of x (a pixel) as taking a value in the greyscale set {0, 1, 2, …, 255}, where 0 is black and 255 is white. Since the dynamic range of each pixel is limited to 8 bits, some precision has already been lost. Because of this, it would not make sense for the classifier to make different decisions for a real data sample x and a “fake” one (x + perturbation) if every element of the perturbation is smaller than the precision of the images. In other words, we can bound the infinity norm of the perturbation vector (let’s call it η) by some small value ϵ: ‖η‖∞ ≤ ϵ.

Consider what happens when an adversarial input reaches a single linear unit, i.e. an activation given by a dot product with a weight vector w:
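Writing the adversarial input as x̃ = x + η, the dot-product argument from [1] is:

wᵀx̃ = wᵀ(x + η) = wᵀx + wᵀη

The first term is the activation on the clean input; the second term, wᵀη, is the extra signal contributed by the perturbation.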

As you can see, the activation grows with the perturbation term wᵀη. By setting η to ϵ times the sign of w, we maximise this increase, subject to the max norm bound on η. If the dimension of w is n and the average magnitude of its elements is m, the activation will grow by ϵmn. This means that many small changes to the input add up to a large change in the output, since the effect grows linearly with the dimension of w.
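As a rough, made-up illustration: a 224 × 224 RGB image has n ≈ 150,000 input dimensions, so with ϵ = 0.01 the activation can shift by about 0.01 · m · 150,000 = 1,500·m, even though no single pixel moved by more than 0.01.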

The argument the authors use to extend this to non-linear models, such as neural networks with sigmoid activations, is that these networks are usually tuned to operate in their linear, non-saturated regions and are therefore “too linear to resist linear adversarial perturbation”.

So how do you train your model to be more robust against these kinds of attacks?

The answer is simple: you include FGSM in the training of the network. We start by doing a normal forward pass and calculate the gradient of the loss w.r.t. the input. We use that gradient to perturb the original input and create an adversarial sample. We then train the network on the adversarial sample and update the weights accordingly.
Is there a trade-off in accuracy? The authors optimise a mix of two loss functions, one being the loss w.r.t. the original sample and the other being the loss w.r.t. the adversarial sample. The ratio is a hyper-parameter that can be tuned, but the authors state that a 50–50 ratio seems to work well.
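In the paper this mixed objective is written as:

J̃(θ, x, y) = α · J(θ, x, y) + (1 − α) · J(θ, x + ϵ · sign(∇ₓ J(θ, x, y)), y), with α = 0.5

Here is a sketch of one such training step in PyTorch, reusing the `fgsm_attack` helper from above; the values of `alpha` and `epsilon` are illustrative.

```python
def adversarial_training_step(model, optimizer, x, y, epsilon=0.03, alpha=0.5):
    """One training step mixing the clean loss and the FGSM adversarial loss."""
    # Craft adversarial examples with the current weights (FGSM helper above).
    x_adv = fgsm_attack(model, x, y, epsilon)

    # zero_grad() also clears the parameter gradients accumulated inside fgsm_attack.
    optimizer.zero_grad()
    clean_loss = F.cross_entropy(model(x), y)
    adv_loss = F.cross_entropy(model(x_adv), y)
    # Weighted mix of the two objectives; alpha = 0.5 is the 50-50 ratio mentioned above.
    loss = alpha * clean_loss + (1 - alpha) * adv_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```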

What about black box attacks?

If you do not have access to the algorithm, you can generate input-output pairs by classifying some data using the black box. Once you have collected a substantial number of such pairs, you can train your own neural network on them and generate adversarial samples against it using the white box method described earlier. It turns out that this works surprisingly well: adversarial samples crafted against the “proxy” network often transfer and cause trouble for the original neural network.
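A rough sketch of this substitute-model idea in PyTorch, reusing `fgsm_attack` and `F` from above; `black_box_predict`, `surrogate`, and `queries` are hypothetical placeholders for the remote classifier, your own stand-in network, and the inputs you send to it.

```python
def black_box_attack(black_box_predict, surrogate, optimizer, queries, epsilon=0.03, epochs=5):
    """Attack a black-box classifier by training a surrogate model on its answers."""
    # 1) Build (input, output) pairs by querying the black box.
    labels = [black_box_predict(x) for x in queries]

    # 2) Train our own network (the surrogate) to imitate the black box.
    for _ in range(epochs):
        for x, y in zip(queries, labels):
            optimizer.zero_grad()
            loss = F.cross_entropy(surrogate(x), y)
            loss.backward()
            optimizer.step()

    # 3) Run the white-box FGSM attack against the surrogate; the resulting
    #    samples often transfer to the original black-box model.
    return [fgsm_attack(surrogate, x, y, epsilon) for x, y in zip(queries, labels)]
```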

Thank you for reading!

References:

[1] Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy, “Explaining and Harnessing Adversarial Examples” (2015).

