Audio Adversarial Perturbations

Part 1 - Standard evaluation

Comparison of different distortion levels according to the metric:
$\hspace{1.5cm} dB_{x,max}(v)= dB_{max}(v)-dB_{max}(x)$ ,
where
$\hspace{1.5cm} dB_{max}(x)= \max_i \ 20\cdot \log_{10}(|x_i|).$

This metric is used by convention in previous work on audio adversarial examples for speech recognition, where distortion levels below -32dB are assumed to be acceptable. However, we show that this metric is not representative for speech-related tasks: even at distortion levels that are small under this metric, the perturbations are easily detectable.
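As an illustration, the peak-based metric defined above can be computed as follows. This is a minimal sketch, not the authors' code: the function names `db_max` and `db_x_max` and the toy sample values are assumptions made for this example, and signals are taken to be plain lists of sample amplitudes.

```python
import math

def db_max(x):
    """Peak level of a signal in dB: max_i 20*log10(|x_i|)."""
    peak = max(abs(s) for s in x)
    if peak == 0:
        # log10(0) is undefined, so an all-zero signal has no finite level.
        raise ValueError("signal is all zeros; dB level is undefined")
    return 20.0 * math.log10(peak)

def db_x_max(v, x):
    """Relative distortion of perturbation v with respect to signal x."""
    return db_max(v) - db_max(x)

# Toy data (hypothetical): a signal with peak 2000 and a perturbation with peak 20.
x = [0, 1000, -2000, 500]
v = [10, -20, 5, 20]
print(db_x_max(v, x))
```

Here the peak ratio is 20/2000 = 0.01, so the result is about -40dB. Note that only the single loudest sample of each signal enters the computation, which is one reason the value says little about how audible the perturbation is in quieter parts of the clip.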

Standard evaluation: -25dB | Proposed evaluation: -9dB

Original
[Click to Show Prediction] No
Adversarial Example
[Click to Show Prediction] Go

Standard evaluation: -30dB | Proposed evaluation: -13dB

Original
[Click to Show Prediction] Yes
Adversarial Example
[Click to Show Prediction] Left

Standard evaluation: -32dB | Proposed evaluation: -7dB

Original
[Click to Show Prediction] No
Adversarial Example
[Click to Show Prediction] Go

Standard evaluation: -35dB | Proposed evaluation: -8dB

Original
[Click to Show Prediction] Go
Adversarial Example
[Click to Show Prediction] Up

Standard evaluation: -40dB | Proposed evaluation: -19dB

Original
[Click to Show Prediction] Right
Adversarial Example
[Click to Show Prediction] Unknown

Part 2 - Proposed evaluation

Comparison of different distortion levels according to the metric:
$\hspace{1.5cm} dB_{x,mean}(v)= dB_{mean}(v)-dB_{mean}(x)$ ,
where
$\hspace{1.5cm} dB_{mean}(x)=20\cdot \log_{10}\left(\frac{1}{d}\sum_{i=1}^d{|x_i|}\right).$
This metric is applied to the background part of the audio signal.

We found that measuring the distortion in both the vocal and the background parts leads to more representative results. In particular, the perturbation is more likely to be detected in the background part than in the vocal part, because the sound intensity there is lower. This metric also correlates better with human judgment, as lower distortion levels lead to lower detectability. For these reasons, we propose more rigorous approaches to measuring the distortion of audio adversarial examples, in order to promote a deeper study of these vulnerabilities and the risk they pose.
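The mean-based metric, restricted to a background segment, can be sketched as follows. This is an illustration under stated assumptions: how the background part is located is not specified here, so the `start` and `end` sample indices are hypothetical inputs, and the function names and toy data are invented for the example.

```python
import math

def db_mean(x):
    """Mean level of a signal in dB: 20*log10((1/d) * sum_i |x_i|)."""
    mean_abs = sum(abs(s) for s in x) / len(x)
    if mean_abs == 0:
        raise ValueError("signal is all zeros; dB level is undefined")
    return 20.0 * math.log10(mean_abs)

def db_x_mean(v, x):
    """Relative distortion of perturbation v with respect to signal x."""
    return db_mean(v) - db_mean(x)

def background_distortion(v, x, start, end):
    """Mean-based distortion measured only on samples [start, end).

    The segment boundaries are assumed to be given; they are not derived here.
    """
    return db_x_mean(v[start:end], x[start:end])

# Toy clip (hypothetical): 4 quiet background samples, then 4 loud vocal samples.
x = [10, -10, 10, -10, 1000, -1000, 1000, -1000]
v = [1, -1, 1, -1, 1, -1, 1, -1]  # constant-amplitude perturbation

print(background_distortion(v, x, 0, 4))  # background only
print(db_x_mean(v, x))                    # whole clip
```

On this toy clip the background-only value is about -20dB, while the whole-clip value is roughly -54dB. The same perturbation therefore looks far less severe when the loud vocal part dominates the measurement, which matches the observation that the perturbation is easier to detect in the background.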

Standard evaluation: -48dB | Proposed evaluation: -25dB

Original
[Click to Show Prediction] Go
Adversarial Example
[Click to Show Prediction] No

Standard evaluation: -48dB | Proposed evaluation: -30dB

Original
[Click to Show Prediction] Left
Adversarial Example
[Click to Show Prediction] Right

Standard evaluation: -50dB | Proposed evaluation: -32dB

Original
[Click to Show Prediction] On
Adversarial Example
[Click to Show Prediction] Unknown

Standard evaluation: -54dB | Proposed evaluation: -35dB

Original
[Click to Show Prediction] Right
Adversarial Example
[Click to Show Prediction] Left

Standard evaluation: -57dB | Proposed evaluation: -40dB

Original
[Click to Show Prediction] Right
Adversarial Example
[Click to Show Prediction] Unknown

Part 3 - Intensity of the original signal

We also found that the perception of a perturbation changes with the intensity of the audio signal to which it is applied. For this reason, we classify the audio clips into three intensity levels (low, medium and high) according to the $dB_{mean}$ metric:

Low intensity level: audio clips with a mean intensity below 50dB.
Medium intensity level: audio clips with a mean intensity between 50dB and 70dB.
High intensity level: audio clips with a mean intensity above 70dB.

In the following examples, we add the same perturbation to three audio clips with different intensity levels, to illustrate its effect on the perception of the distortion.
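The effect can also be seen numerically. The sketch below assigns each clip to one of the three levels using the 50dB and 70dB thresholds and then measures the same perturbation against each clip. It rests on several assumptions: the thresholds are read as $dB_{mean}$ values of the original signal, samples are taken to be on an integer amplitude scale (such as 16-bit audio), the boundary values 50dB and 70dB are assigned to the medium level, and the names and toy clips are hypothetical.

```python
import math

def db_mean(x):
    """Mean level of a signal in dB: 20*log10((1/d) * sum_i |x_i|)."""
    mean_abs = sum(abs(s) for s in x) / len(x)
    if mean_abs == 0:
        raise ValueError("signal is all zeros; dB level is undefined")
    return 20.0 * math.log10(mean_abs)

def intensity_level(x, low=50.0, high=70.0):
    """Classify a clip as 'low', 'medium' or 'high' by its mean level.

    Treating exactly 50dB and 70dB as 'medium' is an assumption of this sketch.
    """
    level = db_mean(x)
    if level < low:
        return "low"
    if level <= high:
        return "medium"
    return "high"

# The same perturbation (mean level 20dB) is compared against three toy clips.
perturbation = [10, -10, 10, -10]
clips = {
    "quiet": [100, -100, 100, -100],          # mean level 40dB
    "normal": [1000, -1000, 1000, -1000],     # mean level 60dB
    "loud": [10000, -10000, 10000, -10000],   # mean level 80dB
}

for name, x in clips.items():
    relative = db_mean(perturbation) - db_mean(x)
    print(name, intensity_level(x), round(relative, 1))
```

The three toy clips fall into the low, medium and high levels, and the identical perturbation measures about -20dB, -40dB and -60dB against them respectively. A fixed perturbation is thus relatively much stronger in a quiet clip than in a loud one.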

Low intensity level

Original
[Click to Show Prediction] Right
Adversarial Example
[Click to Show Prediction] Unknown

Medium intensity level

Original
[Click to Show Prediction] Right
Adversarial Example
[Click to Show Prediction] Unknown

High intensity level

Original
[Click to Show Prediction] Right
Adversarial Example
[Click to Show Prediction] Unknown

On the Human Evaluation of Audio Adversarial Examples