On the Human Evaluation of Audio Adversarial Examples

Headphones recommended.








Part 1 - Standard evaluation

Comparison of different distortion levels according to the metric:

[Equation not preserved in this copy.]

This metric is used by convention in previous work on audio adversarial examples for speech recognition, where distortion levels below -32dB are assumed to be acceptable. However, we show that this metric is not representative for speech-related tasks. Note that even at distortion levels that are small under this metric, the perturbations are easily detectable.
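As a rough illustration of how such a value can be computed, the sketch below assumes that the conventional metric is the peak-based relative loudness used in earlier work on audio adversarial examples (for example Carlini and Wagner's attack on speech-to-text): dB_x(delta) = dB(delta) - dB(x), with dB(x) = max_i 20*log10(|x_i|). The function names and the toy sine signals are hypothetical and are not taken from this page.

```python
import numpy as np


def peak_db(signal):
    """Peak level in dB: max_i 20*log10(|x_i|). Assumes a non-silent signal."""
    peak = np.max(np.abs(np.asarray(signal, dtype=np.float64)))
    return 20.0 * np.log10(peak)


def relative_distortion_db(original, perturbation):
    """Assumed conventional metric: dB(perturbation) - dB(original)."""
    return peak_db(perturbation) - peak_db(original)


# Toy signals (assumption): one second at 16 kHz, amplitudes in [-1, 1].
t = np.arange(16000) / 16000.0
x = 0.5 * np.sin(2 * np.pi * 440 * t)         # stand-in for the clean audio
delta = 0.005 * np.sin(2 * np.pi * 3000 * t)  # stand-in for the perturbation

# The peak ratio is 0.005 / 0.5 = 0.01, i.e. roughly -40 dB.
print(round(relative_distortion_db(x, delta), 2))
```

Because only the single loudest sample of each signal enters the computation, the value says nothing about how the perturbation compares with the quieter stretches of the recording.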

Standard evaluation: -25dB | Proposed evaluation: -9dB

Standard evaluation: -30dB | Proposed evaluation: -13dB

Standard evaluation: -32dB | Proposed evaluation: -7dB

Standard evaluation: -35dB | Proposed evaluation: -8dB

Standard evaluation: -40dB | Proposed evaluation: -19dB

Part 2 - Proposed evaluation

Comparison of different distortion levels according to the metric:

[Equation not preserved in this copy.]

where [term not preserved] is the metric applied to the background part of the audio signal.

We found that measuring the distortion separately in the vocal part and in the background part leads to more representative results. In particular, the perturbation is more likely to be detected in the background part than in the vocal part, because the sound intensity is lower there. This metric also correlates better with human judgment: lower distortion levels lead to lower detectability. For these reasons, we propose more rigorous approaches to measuring the distortion of audio adversarial examples, in order to promote a deeper study of these vulnerabilities and the risk they pose.
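The idea of measuring the two parts separately can be sketched as follows. The segmentation rule used here (a frame counts as vocal when its peak lies within 30 dB of the global peak), the frame length, and all function names are assumptions made purely for illustration; the page does not state how the vocal and background parts are separated, and the peak-based dB measure is likewise assumed.

```python
import numpy as np


def peak_db(signal):
    """Peak level in dB: max_i 20*log10(|x_i|). Assumes a non-silent signal."""
    return 20.0 * np.log10(np.max(np.abs(signal)))


def vocal_mask(original, frame_len=400, threshold_db=-30.0):
    """Hypothetical split: per-sample mask, True where the frame is 'vocal'."""
    x = np.asarray(original, dtype=np.float64)
    n_frames = len(x) // frame_len
    frames = np.abs(x[: n_frames * frame_len]).reshape(n_frames, frame_len)
    limit = np.max(np.abs(x)) * 10.0 ** (threshold_db / 20.0)
    return np.repeat(frames.max(axis=1) >= limit, frame_len)


def per_part_distortion_db(original, perturbation, mask):
    """Relative distortion in the vocal and background parts.

    Assumes both parts are non-empty and non-silent.
    """
    x = np.asarray(original, dtype=np.float64)[: len(mask)]
    d = np.asarray(perturbation, dtype=np.float64)[: len(mask)]
    vocal = peak_db(d[mask]) - peak_db(x[mask])
    background = peak_db(d[~mask]) - peak_db(x[~mask])
    return vocal, background


# Toy signal (assumption): loud first half, quiet second half, at 16 kHz.
t = np.arange(16000) / 16000.0
envelope = np.where(t < 0.5, 0.5, 0.01)
x = envelope * np.sin(2 * np.pi * 440 * t)
delta = 0.005 * np.sin(2 * np.pi * 3000 * t)  # same perturbation everywhere

vocal, background = per_part_distortion_db(x, delta, vocal_mask(x))
print(round(vocal, 1), round(background, 1))
```

In this toy case the same perturbation sits far below the loud part but only a few dB below the quiet part, which is the effect described above.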

Standard evaluation: -48dB | Proposed evaluation: -25dB

Standard evaluation: -48dB | Proposed evaluation: -30dB

Standard evaluation: -50dB | Proposed evaluation: -32dB

Standard evaluation: -54dB | Proposed evaluation: -35dB

Standard evaluation: -57dB | Proposed evaluation: -40dB

Part 3 - Intensity of the original signal

We also found that the perception of a perturbation changes with the intensity of the audio signal to which it is applied. For this reason, we classify the audio clips into three levels of intensity (low, medium and high) according to the metric:

[Equation not preserved in this copy.]

In the following examples, we add the same perturbation to three audio clips with different levels of intensity, to illustrate the effect of intensity on the perception of the distortion.

Low intensity level

Medium intensity level

High intensity level
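The effect of intensity can also be seen numerically. The sketch below adds one fixed perturbation to three toy signals and reports the perturbation level relative to each signal. It assumes a peak-based dB measure; the intensity metric used on this page is not reproduced here, and the amplitudes chosen for the low, medium and high levels are hypothetical.

```python
import numpy as np


def peak_db(signal):
    """Peak level in dB: max_i 20*log10(|x_i|). Assumes a non-silent signal."""
    return 20.0 * np.log10(np.max(np.abs(signal)))


def relative_distortion_by_level(perturbation, signals):
    """Map each label to dB(perturbation) - dB(signal)."""
    return {
        name: peak_db(perturbation) - peak_db(signal)
        for name, signal in signals.items()
    }


# Toy data (assumption): one second at 16 kHz, three amplitude levels.
t = np.arange(16000) / 16000.0
base = np.sin(2 * np.pi * 440 * t)
signals = {"low": 0.05 * base, "medium": 0.2 * base, "high": 0.8 * base}
delta = 0.005 * np.sin(2 * np.pi * 3000 * t)  # identical in all three cases

result = relative_distortion_by_level(delta, signals)
for name, value in result.items():
    print(f"{name}: {value:.1f} dB")
```

The perturbation itself never changes, yet its relative level is highest for the low-intensity signal and lowest for the high-intensity one.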