Universal Audio Adversarial Perturbations

Part 1 - Standard evaluation

Comparison of different distortion levels according to the metric:
$\hspace{1.5cm} dB_{max,x}(v) = dB_{max}(v) - dB_{max}(x)$,
where
$\hspace{1.5cm} dB_{max}(x) = \max_i \ 20\cdot \log_{10}(|x_i|)$

By convention, previous work on audio adversarial examples for speech recognition uses this metric and assumes that distortion levels below -32 dB are acceptable. However, we show that the metric is not representative for speech-related tasks: even at distortion levels that are small under this metric, the perturbations are easily detectable.
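As a rough illustration, the peak-based metric above can be sketched in NumPy. The sketch assumes that `x` is the clean audio signal and `v` the perturbation, both given as arrays of samples with a nonzero peak; the function names `db_max` and `db_max_rel` are hypothetical and not taken from the authors' code.

```python
import numpy as np


def db_max(x):
    """Peak level in decibels: max_i 20 * log10(|x_i|).

    Assumes the signal has at least one nonzero sample.
    """
    x = np.asarray(x, dtype=np.float64)
    # log10 is monotonic, so the maximum can be taken before the logarithm.
    peak = np.max(np.abs(x))
    return float(20.0 * np.log10(peak))


def db_max_rel(v, x):
    """Relative peak level of the perturbation v with respect to the signal x."""
    return db_max(v) - db_max(x)


# Toy signals (illustrative values only, not real audio).
x = np.array([0.0, 0.5, -1.0, 0.25])      # assumed clean signal, peak 1.0
v = np.array([0.01, -0.005, 0.002, 0.0])  # assumed perturbation, peak 0.01
print(db_max_rel(v, x))
```

Because only the single loudest sample of each signal enters the computation, the result says nothing about how the perturbation compares with the quieter portions of the audio.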

Standard evaluation: -25 dB | Proposed evaluation: -7 dB

Original
Prediction: No
Adversarial Example
Prediction: Left

Standard evaluation: -30 dB | Proposed evaluation: -11 dB

Original
Prediction: Go
Adversarial Example
Prediction: Unknown

Standard evaluation: -32 dB | Proposed evaluation: -8 dB

Original
Prediction: Down
Adversarial Example
Prediction: Unknown

Standard evaluation: -35 dB | Proposed evaluation: -8 dB

Original
Prediction: On
Adversarial Example
Prediction: Left

Standard evaluation: -40 dB | Proposed evaluation: -19 dB

Original
Prediction: No
Adversarial Example
Prediction: Down

Part 2 - Proposed evaluation

Comparison of different distortion levels according to the metric:
$\hspace{1.5cm} dB_{mean,x}(v) = dB_{mean}(v) - dB_{mean}(x)$,
where
$\hspace{1.5cm} dB_{mean}(x) = \frac{1}{d} \sum_{i=1}^{d} 20\cdot \log_{10}(|x_i|)$
Metric applied to the background part of the audio signal.

We found that measuring the distortion in both the vocal and the background parts leads to more representative results. In particular, the perturbation is more likely to be detected in the background part than in the vocal part, because the sound intensity is lower there. This metric also correlates better with human judgment, as lower distortion levels lead to lower detectability. For these reasons, we propose the use of more rigorous approaches to measure the distortion of audio adversarial examples, in order to promote a deeper study of these vulnerabilities and the risk they pose.
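The mean-based metric can be sketched in the same way. The text does not say how the vocal and background parts are separated, so this sketch simply takes a boolean mask as input; the mask, the `eps` floor that avoids `log10(0)` on silent samples, and the function names are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-10  # assumed floor to avoid log10(0); not specified in the text


def db_mean(x, eps=EPS):
    """Mean level in decibels: (1/d) * sum_i 20 * log10(|x_i|)."""
    x = np.asarray(x, dtype=np.float64)
    return float(np.mean(20.0 * np.log10(np.maximum(np.abs(x), eps))))


def db_mean_rel(v, x, mask=None):
    """Relative mean level of the perturbation v with respect to the signal x.

    If a boolean mask is given, only the selected samples (for example,
    the background part) are used. How the mask is obtained is left open.
    """
    v = np.asarray(v, dtype=np.float64)
    x = np.asarray(x, dtype=np.float64)
    if mask is not None:
        mask = np.asarray(mask, dtype=bool)
        v, x = v[mask], x[mask]
    return db_mean(v) - db_mean(x)


# Toy signals (illustrative values only, not real audio).
x = np.array([0.01, 0.01, 1.0, 1.0])            # quiet background, then loud speech
v = np.array([0.001, 0.001, 0.001, 0.001])      # constant-amplitude perturbation
background = np.array([True, True, False, False])

whole = db_mean_rel(v, x)                        # about -40 dB over the whole signal
background_only = db_mean_rel(v, x, background)  # about -20 dB in the background
print(whole, background_only)
```

In this toy case the same perturbation sits about 20 dB closer to the signal level in the background than over the whole clip, which matches the observation that perturbations are easier to detect where the sound intensity is low.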

Standard evaluation: -38 dB | Proposed evaluation: -20 dB

Original
Prediction: Yes
Adversarial Example
Prediction: Left

Standard evaluation: -47 dB | Proposed evaluation: -25 dB

Original
Prediction: On
Adversarial Example
Prediction: Left

Standard evaluation: -45 dB | Proposed evaluation: -30 dB

Original
Prediction: Go
Adversarial Example
Prediction: No

Standard evaluation: -50 dB | Proposed evaluation: -32 dB

Original
Prediction: Right
Adversarial Example
Prediction: Left

Standard evaluation: -47 dB | Proposed evaluation: -35 dB

Original
Prediction: Yes
Adversarial Example
Prediction: Left

Universal Adversarial Perturbations for Speech Command Classification