Adversarial attacks: A detailed review – Part 2

In the previous part, we understood what an adversarial attack is and how it can be classified based on various attributes (in case you missed it, you can read it here). In this part, we will study some of the most common types of attacks in detail.


In this part we will overview some of the most common attacks on image classifiers and implement a very popular method, the FGSM (Fast Gradient Sign Method) attack proposed by Goodfellow et al. We will understand how it works, and how it challenges the notion that the non-linearity of neural networks is the reason adversarial attacks succeed.

L-BFGS Attack

This was one of the earliest attacks: Szegedy et al. first discovered the vulnerability of deep visual models to adversarial perturbations by solving the following optimization problem:

minimize ‖ρ‖₂ subject to C(x + ρ) = ℓ and x + ρ ∈ [0, 1]^m

Equation for the L-BFGS attack: C is the classifier, x the clean input, ℓ the target label, and ρ the perturbation

where we are trying to minimize the ℓ₂-norm of the adversarial signal ρ while forcing the classifier to assign the perturbed input a chosen label. If we look closely, this is similar to the equation described in Part 1, with norm value p = 2.

For this problem, Szegedy et al. computed an approximate solution with the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm, after which the method is named. However, solving this optimization for a large number of examples is computationally prohibitive, which is what the next method addresses.
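To make the idea concrete, here is a minimal SciPy sketch: a toy linear "classifier", a squared ℓ₂ penalty standing in for the exact constrained formulation, and SciPy's L-BFGS-B solver enforcing the box constraint. All names and constants below are illustrative, not from the original paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
dim = 50
w = rng.normal(size=dim)               # weights of a toy linear classifier
x = np.where(w > 0, 0.9, 0.1)          # clean input, firmly scored as the positive class

def score(v):
    return w @ v                       # > 0 -> positive class, < 0 -> negative class

c = 0.01                               # trade-off between perturbation size and misclassification

def objective(rho):
    # c * ||rho||_2^2 plus a logistic loss that rewards pushing the score negative
    return c * rho @ rho + np.log1p(np.exp(score(x + rho)))

res = minimize(objective, np.zeros(dim), method="L-BFGS-B",
               bounds=[(-xi, 1.0 - xi) for xi in x])  # keeps x + rho inside [0, 1]
rho = res.x
print(score(x), score(x + rho))        # the score flips sign under the perturbation
```

Szegedy et al. actually minimize c‖ρ‖₂ plus the network's classification loss for a chosen target label under the same box constraint; the sketch above keeps only the structure of that problem.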

FGSM Attack

The FGSM is among the most influential attacks in the existing literature, especially in the white-box setup. Its core concept of performing gradient ascent over the model’s loss surface to fool it is the basis for a plethora of adversarial attacks, and many follow-up attacks are closely related to the original FGSM idea. The most widely reproduced image of an adversarial attack is, in fact, of the FGSM attack:

[Image] A common example of an FGSM attack

The FGSM is a one-step gradient-based method that computes norm-bounded perturbations, focusing on the ‘efficiency’ of perturbation computation rather than on achieving high fooling rates. Goodfellow et al. also used this attack to corroborate their linearity hypothesis, which considers the linear behavior of modern neural networks in high-dimensional spaces (induced by components such as ReLUs) a sufficient reason for their vulnerability to adversarial perturbations. At the time, the linearity hypothesis was in sharp contrast to the prevailing idea that adversarial vulnerability was a result of the high ‘non-linearity’ of complex modern networks.

Goodfellow et al. claimed that adversarial examples expose fundamental blind spots in our training algorithms, and that linear behavior in high-dimensional spaces is sufficient to cause them. Building on this view, they designed a fast method of generating adversarial examples that makes adversarial training practical.

Linear explanation of adversarial examples

We start by explaining the existence of adversarial examples for linear models. Since the precision of the features is limited, we will see how the classifier can be forced to respond differently to an input x than to an adversarial input x̃ = x + η, even when every element of the perturbation η is smaller than the precision of the features.

Consider the dot product between a weight vector w and the adversarial input x̃:

w^T x̃ = w^T x + w^T η

The perturbation grows the activation by w^T η. Subject to the max-norm constraint ‖η‖∞ ≤ ε, this growth is maximized by η = ε · sign(w). If w has n dimensions and the average magnitude of a weight element is m, the activation grows by εmn.

Thus, for high dimensional problems, we can make many infinitesimal changes to the input that will add up to one large change to the output.
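A quick NumPy check of this accumulation effect (the dimensionalities and ε below are arbitrary): each coordinate of η changes by at most ε, yet the activation shift w · η = ε‖w‖₁ grows linearly with the dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01                             # per-feature change, below typical feature precision

for dim in (10, 1_000, 100_000):
    w = rng.normal(size=dim)           # weight vector of a linear model
    eta = eps * np.sign(w)             # max-norm-bounded perturbation aligned with w
    # every coordinate moves by only 0.01, but the activation shift is eps * ||w||_1
    print(dim, w @ eta)
```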

Linear Perturbation of Non-Linear Models

The linear view of adversarial examples suggests a fast way of generating them. We hypothesize that neural networks are too linear to resist linear adversarial perturbation. LSTMs, ReLUs, and maxout networks are all intentionally designed to behave in very linear ways, so that they are easier to optimize. More nonlinear models such as sigmoid networks are carefully tuned to spend most of their time in the non-saturating, more linear regime for the same reason. This linear behavior suggests that cheap, analytical perturbations of a linear model should also damage neural networks. Let us see how adversarial examples can be generated for the neural networks.

η = ε · sign(∇ₓ J(θ, x, y))

Adversarial perturbation for neural networks: θ denotes the network parameters, x the input, y the label associated with x, and J(θ, x, y) the cost function used to train the network.

We refer to this as the “fast gradient sign method” of generating adversarial examples. Note that the required gradient can be computed efficiently using backpropagation.

Code Implementation for FGSM

Let us see how we can implement this in code. We will be using the code from this link, which is part of the official TensorFlow documentation. We will analyze the function ‘create_adversarial_pattern’, as it implements the crux of the paper, i.e. computing the gradient sign.

def create_adversarial_pattern(input_image, input_label):
  with tf.GradientTape() as tape:
    tape.watch(input_image)  # track the input tensor so we can differentiate w.r.t. it
    prediction = pretrained_model(input_image)
    loss = loss_object(input_label, prediction)

  # Get the gradients of the loss w.r.t. the input image.
  gradient = tape.gradient(loss, input_image)
  # Get the sign of the gradients to create the perturbation.
  signed_grad = tf.sign(gradient)
  return signed_grad



Using the lines below, we ask TensorFlow to record all computations involving ‘input_image’ so that gradients can later be taken with respect to it. Because input_image is a plain tensor rather than a trainable variable, the tutorial also calls tape.watch on it:

with tf.GradientTape() as tape:
    tape.watch(input_image)


Using the two lines below, we make a prediction for ‘input_image’ and compute the loss of that prediction against the true label.

prediction = pretrained_model(input_image)
loss = loss_object(input_label, prediction)


Using the line below, we compute the gradient of the loss with respect to the input image; this gives us the ‘gradient’ term in FGSM (Fast Gradient Sign Method).

gradient = tape.gradient(loss, input_image)


Once we have the gradients, we apply the ‘sign’ function, i.e. sign(gradient), which gives us the ‘gradient sign’ term in FGSM (Fast Gradient Sign Method). Below is the input image for which we are calculating the adversarial perturbation (η).

[Image] Prediction on the input image


The adversarial perturbation (η) calculated with the FGSM method comes out as:

[Image] Calculated adversarial perturbation (η) using FGSM
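To see the whole pipeline with concrete numbers, here is a minimal NumPy sketch of FGSM on a hypothetical logistic-regression "model" (the weights, input, and ε below are made up for illustration; the tutorial instead uses a pretrained image classifier):

```python
import numpy as np

dim = 100
w = np.linspace(-1.0, 1.0, dim)        # hypothetical logistic-regression weights
x = 0.5 + 0.05 * np.sign(w)            # an input sitting on the class-1 side of the boundary
y = 1                                  # true label

def predict_proba(v):
    # probability of class 1 under the toy model
    return 1.0 / (1.0 + np.exp(-(w @ v)))

# Gradient of the cross-entropy loss w.r.t. the input (what tape.gradient computes here):
# dJ/dx = (sigmoid(w @ x) - y) * w
grad = (predict_proba(x) - y) * w

eps = 0.1
eta = eps * np.sign(grad)              # the FGSM perturbation: eps * sign(gradient)
adv_x = np.clip(x + eta, 0.0, 1.0)     # keep the input in its valid range

print(predict_proba(x))                # high confidence in the true class
print(predict_proba(adv_x))            # confidence collapses after the attack
```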

The adversarial image that fools the model is computed with the code below; the tutorial then clips the result so it stays within the model’s valid input range:

adv_x = image + eps*perturbations
adv_x = tf.clip_by_value(adv_x, -1, 1)


Here is what the generated result looks like for different values of ϵ:

[Image] Generated adversarial examples for different values of epsilon (0.01, 0.1, 0.15)
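The same trade-off can be reproduced numerically. The sketch below sweeps ε over a toy linear model (all values hypothetical); for such a model the signed gradient of the true-class loss is simply -sign(w), and confidence in the true class falls monotonically as ε grows:

```python
import numpy as np

w = np.linspace(-1.0, 1.0, 100)        # hypothetical linear model weights
image = 0.5 + 0.05 * np.sign(w)        # clean input, scored as the true class
perturbations = -np.sign(w)            # sign of the loss gradient for this toy model

probs = []
for eps in (0.0, 0.01, 0.1, 0.15):
    adv_x = image + eps * perturbations
    adv_x = np.clip(adv_x, 0.0, 1.0)   # as in the tutorial, clip to the valid input range
    probs.append(1.0 / (1.0 + np.exp(-(w @ adv_x))))  # confidence in the true class

print(probs)                           # drops steadily as eps increases
```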

Goodfellow et al. drew the following conclusions from these experiments:

  • Adversarial examples can be explained as a property of high-dimensional dot products. They are a result of models being too linear, rather than too nonlinear.
  • The generalization of adversarial examples across different models can be explained as a result of adversarial perturbations being highly aligned with the weight vectors of a model, and different models learning similar functions when trained to perform the same task.
  • The direction of perturbation, rather than the specific point in space, matters most. Space is not full of pockets of adversarial examples that finely tile the reals like the rational numbers.
  • Because it is the direction that matters most, adversarial perturbations generalize across different clean examples.
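The "direction matters" claim is easy to verify for a linear model (a toy setup, not the paper’s experiment): the score shift produced by one fixed perturbation direction η = ε · sign(w) is ε‖w‖₁ for every clean input, so the same direction works across examples.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 200
w = rng.normal(size=dim)               # weights of a toy linear model
eta = 0.1 * np.sign(w)                 # one fixed perturbation direction

xs = rng.normal(size=(5, dim))         # five different clean inputs
shifts = (xs + eta) @ w - xs @ w       # score change caused by the same eta
print(shifts)                          # the same shift, eps * ||w||_1, for every input
```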

Other attacks

[Image] A single Universal Adversarial Perturbation can fool a model on multiple images; fooling of GoogLeNet is shown here. These perturbations often transfer well across different models.

Thus, we studied the FGSM attack in detail and walked through its TensorFlow implementation. We also briefly looked at some other attacks on image classification. In further parts, we will move beyond classification and see how adversarial attacks can be performed on other tasks such as face recognition, object detection, and object tracking, and how they can affect the real world in several ways.
