music and noise fingerprinting and reference … · 2014. 12. 3. · anil alexander1, oscar forth1...

ANIL ALEXANDER1, OSCAR FORTH1 AND DONALD TUNSTALL2

1 Oxford Wave Research Ltd, United Kingdom

{anil|oscar}@oxfordwaveresearch.com

2 Digital Audio Corporation, USA

[email protected]

Audio Engineering Society 46th Conference on Audio Forensics

Denver, Colorado

June 14-16, 2012

MUSIC AND NOISE FINGERPRINTING AND REFERENCE CANCELLATION

APPLIED TO FORENSIC AUDIO ENHANCEMENT

Introduction

In surveillance audio recordings, it is common to come across:

Interfering music or a television playing in the background in locations like pubs, cafes, cars, etc.

Other speakers in the background who mask the speech of interest

Target speakers who turn on their music players or their televisions, as they begin to speak, especially when they suspect they are being monitored, in order to mask their speech.

The loud music or background noise drowns out the words or makes the speech of the speakers hard to decipher and transcribe.

Research Questions

Is it possible to reduce or remove:

I - interfering music from non-contemporaneous reference material and to bring the voice of the speaker to the forefront?

II - background noises, speech of other speakers, music, etc. from contemporaneous recordings made in the same acoustic environment, to bring the voice of the main speaker to the forefront?

Example (1, 2): Car or Hotel Room

Hotel room. Noise sources: radio, television, music player

In a car. Noise sources: road noise, car radio, other passengers

Example (3): Pub/Hall with Music

Noise Sources: Television, Jukebox, Radio, Bar Noise, Other Speakers

Research Question (I)

Is it possible to reduce or remove interfering music from non-contemporaneous reference material and to bring the voice of the speaker to the forefront? (Alexander and Forth, 2011)

Why is this difficult?

“Is it possible to reduce or remove interfering music and to bring the voice of the speaker to the forefront?”

Straightforward subtraction of the audio will not remove the music, as the effects of the room are not considered.

Cancellation is sensitive to clipping and compression.

It often has to be applied to a single channel of audio (without simultaneous reference recordings).

The exact song that is playing has to be identified and perfectly time-aligned, which is time- and labour-intensive.

Reducing Background Music

Tasks involved:

Identifying the music/song being played

Aligning the tracks to the exact moment in time, within the file being analysed, that the song or music begins

Applying a noise- and distortion-robust echo cancellation algorithm to remove or reduce the music while mostly leaving the target speech intact.

Automatic Music Identification

Commercial applications of acoustic fingerprinting include identifying tunes, songs, videos, advertisements and radio broadcasts, as well as anti-piracy initiatives.

Recent proliferation of music identification systems such as Shazam™.

A short segment of audio (noisy, distorted or otherwise poor) is sent through to an internet-based recognition server for identification.

The server compares features of this recording to a pre-indexed database of songs.

It selects the most probable candidate(s) for the song.

Noise-Robust Audio Fingerprinting

[Figure: two spectrograms, frequency axis 0 to 4000. Top panel: Query audio (about 0 to 25 s). Bottom panel: Match: 1-05 The Road To Hell (Part 2) at 179.744 sec (about 180 to 205 s)]

Attributes for a ‘fingerprint’ [Wang (2003)]

Temporally localized

Translation invariant

Robust

Sufficiently Entropic

Spectral peak pairs are thus temporally localized, robust to noise and transmission distortions

Landmark-based Audio Fingerprinting Algorithm (1)

• ‘Peaks’ are chosen on the basis of having higher energy than their neighbours

• Spectrogram is reduced into a ‘constellation map’ containing spectral peaks.

• Pairs of peaks selected as landmark ‘hashes’ that provide reference anchor points in time and frequency.

• Landmark hash extraction is performed on query audio.
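The landmark extraction steps above can be sketched in a few lines of Python. This is only an illustration, not the implementation used in this work or in Ellis (2009): the function names, the FFT size, the neighbourhood size, the -40 dB floor and the fan-out are all assumptions chosen for the example, and the test signal is synthetic.

```python
import numpy as np

def spectrogram(x, n_fft=512, hop=256):
    """Magnitude spectrogram, shape (frequency bins, frames), Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def constellation(spec, neighbourhood=5, floor_db=-40.0):
    """Keep (frame, bin) points whose level is the maximum of their local patch."""
    log_spec = 20.0 * np.log10(spec + 1e-10)
    threshold = log_spec.max() + floor_db  # assumed floor to skip very weak peaks
    n_bins, n_frames = log_spec.shape
    peaks = []
    for t in range(n_frames):
        for f in range(n_bins):
            if log_spec[f, t] < threshold:
                continue
            patch = log_spec[max(0, f - neighbourhood):f + neighbourhood + 1,
                             max(0, t - neighbourhood):t + neighbourhood + 1]
            if log_spec[f, t] >= patch.max():
                peaks.append((t, f))
    return peaks  # ordered by frame, then by bin

def landmark_hashes(peaks, fan_out=3, max_dt=32):
    """Pair each anchor peak with up to fan_out later peaks.

    Each landmark is ((f1, f2, dt), t1): the key depends only on the two
    frequencies and their time difference, so it is translation invariant.
    """
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt <= 0:
                continue
            if dt > max_dt:
                break
            hashes.append(((f1, f2, dt), t1))
            paired += 1
            if paired == fan_out:
                break
    return hashes

# Synthetic query: a sequence of short tone bursts (hypothetical test signal).
fs = 8000
tones_hz = [440, 880, 1320, 660, 1760, 990, 1500, 550]
n = np.arange(800)
query = np.concatenate([np.hanning(800) * np.sin(2 * np.pi * f * n / fs)
                        for f in tones_hz])
peaks = constellation(spectrogram(query))
hashes = landmark_hashes(peaks)
print(len(peaks), "peaks,", len(hashes), "landmark hashes")
```

Because each hash key stores only the two peak frequencies and the time difference between them, the same music produces the same keys wherever it occurs in a recording; only the anchor time changes.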


[Diagram labels: Query audio; Reference audio containing music or noise; Landmark hashes; Matching hashes; Time of match (t); ∆t]

• Constellation maps are then compared to find the position in time at which hashes from the query audio match hashes from the reference audio.

• The file with the largest number of hash matches is selected as the reference audio file.

• An accurate estimate of the time of match is also returned by this algorithm.
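The matching step can be illustrated with a small voting scheme over time offsets. This is a sketch under stated assumptions: the function name `best_match`, the toy hash lists and the file names are hypothetical, and the code counts only those matches that agree on a single time offset (as described in Wang (2003)), which is a slightly stricter rule than a raw count of matching hashes.

```python
from collections import Counter

def best_match(query_hashes, reference_db):
    """Pick the reference with the most hash matches at one consistent offset.

    query_hashes: list of (hash_key, frame) pairs from the query audio.
    reference_db: dict mapping a reference name to a list of such pairs.
    Returns (name, offset_in_frames, votes).
    """
    best = (None, None, 0)
    for name, ref_hashes in reference_db.items():
        index = {}
        for key, t_ref in ref_hashes:
            index.setdefault(key, []).append(t_ref)
        votes = Counter()
        for key, t_query in query_hashes:
            for t_ref in index.get(key, []):
                votes[t_ref - t_query] += 1  # offset of query within reference
        if votes:
            offset, count = votes.most_common(1)[0]
            if count > best[2]:
                best = (name, offset, count)
    return best

# Toy data: keys are (f1, f2, dt) triples, times are frame indices.
reference_db = {
    "song_a": [((10, 20, 3), 100), ((15, 30, 2), 104),
               ((40, 12, 5), 110), ((22, 25, 1), 120)],
    "song_b": [((10, 20, 3), 7), ((90, 91, 4), 50)],
}
query_hashes = [((10, 20, 3), 0), ((15, 30, 2), 4),
                ((40, 12, 5), 10), ((77, 78, 2), 12)]

name, offset_frames, votes = best_match(query_hashes, reference_db)
hop, fs = 256, 8000                      # assumed analysis hop and sample rate
offset_s = offset_frames * hop / fs      # time of match in seconds
print(name, offset_frames, votes, offset_s)
```

The offset with the most votes is the estimated time of match; converting it from frames to seconds needs only the hop size and sample rate used when the hashes were extracted.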

[Figure: two spectrograms, frequency axis 0 to 4000. Top panel: Query audio (about 0 to 25 s). Bottom panel: Match: 1-05 The Road To Hell (Part 2) at 179.744 sec (about 180 to 205 s)]

Ellis (2009) Robust Landmark-Based Audio Fingerprinting

Result Example - Time Domain

[Figure: three time-domain waveforms, Time (s) from 0 to 4.0, amplitude from -0.5 to 0.5. Panels: Original Signal (Speech and Music); Identified Music Signal; Resulting Speech (Original - Music)]

Marked reduction in the noise floor

Result Example - Frequency Domain

Echo Cancellation (1)

Echo cancellation addresses a similar problem:

playback from the speakers and simultaneous recording from the microphones

the playback should not ‘seep in’ to the recording in the microphone

An acoustic echo canceller could provide a good solution to the problem

Echo cancellation algorithms are generally LMS-based (Least Mean Squares); either time domain or frequency domain approaches can be used

In this application we use an echo canceller software module (compliant with the ITU-T G.167 and G.168 specifications) using the Intel Performance Primitives (IPP) library and the DAC CARDINAL.

LMS-based Echo Cancellation

[Block diagram labels: (S+N) Speech + noise/music; (N’) Identified time-aligned noise/music; Electronic Response estimate; summing junction (+, -); (N”) Residual; (S+N’ –N”) Speech + residual noise/music]

LMS / NLMS Coefficient Update

Each FIR coefficient h, index n, updated each sample interval i as follows:

$h_n(i+1) = h_n(i) + \Delta h_n(i)$

Update increment, $\Delta h_n(i)$, computed by LMS algorithm as follows:

$\Delta h_n(i) = \mu \cdot e(i) \cdot x(i-n)$

NLMS uses a slightly different µ value, as follows:

$\Delta h_n(i) = \mu' \cdot e(i) \cdot x(i-n)$

where µ’ is the specified µ value (or “adapt rate”), scaled inversely to the average input signal power
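The coefficient update above can be written out as a short NLMS loop. This is a sketch under assumptions, not the IPP-based module or the DAC CARDINAL processing: the function name `nlms_cancel`, the toy room response and the stand-in signals are hypothetical, and µ’ is computed here by dividing µ by the input energy in the filter window, which is one common way of scaling inversely to input power.

```python
import numpy as np

def nlms_cancel(primary, reference, n_taps=64, mu=0.5, eps=1e-8):
    """Reference cancellation with an NLMS-adapted FIR filter.

    primary:   speech + noise/music as recorded (S+N)
    reference: identified, time-aligned noise/music (N')
    Returns (e, h): the error signal e(i), which is the enhanced output,
    and the final FIR coefficients h_n.
    """
    h = np.zeros(n_taps)
    x_buf = np.zeros(n_taps)           # x_buf[n] holds x(i - n)
    e = np.zeros(len(primary))
    for i in range(len(primary)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = reference[i]
        y = h @ x_buf                  # electronic response estimate
        e[i] = primary[i] - y
        mu_norm = mu / (eps + x_buf @ x_buf)   # mu' (assumed normalisation)
        h += mu_norm * e[i] * x_buf            # h_n(i+1) = h_n(i) + mu' e(i) x(i-n)
    return e, h

# Synthetic demonstration with made-up signals.
rng = np.random.default_rng(0)
n_samples = 8000
reference = rng.standard_normal(n_samples)              # stands in for N'
room = np.array([0.0, 0.0, 0.8, 0.0, -0.3, 0.1])        # hypothetical room response
echo = np.convolve(reference, room)[:n_samples]         # noise as heard at the microphone
speech = 0.1 * np.sin(2 * np.pi * 0.01 * np.arange(n_samples))  # stands in for S
primary = speech + echo

e, h = nlms_cancel(primary, reference, n_taps=16, mu=0.5)
residual_power = np.mean((e[-2000:] - speech[-2000:]) ** 2)
echo_power = np.mean(echo[-2000:] ** 2)
print("largest coefficient at tap", int(np.argmax(np.abs(h))))
print("residual / echo power:", residual_power / echo_power)
```

In this toy case the largest coefficient settles at the tap matching the strongest path in the made-up room response, which is the “big spike” associated with the direct path.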

Electronic Response Estimate

FIR filter coefficients represent an electronic simulation of the room’s acoustical environment

The filter must have a sufficient number of taps, N, to account not only for the direct acoustic path (A) but also for the longest significant reverberation path (B)

At a 16000 Hz sample rate, the required N for the example shown would be 0.070 s * 16000/s = 1120 taps

We typically estimate the minimum required filter length in milliseconds as 5 times the largest dimension of the room in feet

[Room diagram labels: A: Direct Path (13’); B: Longest significant path (70’); dimension label 15’. Sound: 1 ft ~ 1 msec]
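The filter-length arithmetic on this slide can be checked with a few lines of Python. The helper names are hypothetical, and treating the 15’ label as the largest room dimension is an assumption made only for the example.

```python
import math

def taps_for_ms(duration_ms, sample_rate_hz):
    """Number of FIR taps needed to span duration_ms at the given sample rate."""
    return math.ceil(duration_ms * sample_rate_hz / 1000)

def taps_for_path(path_ft, sample_rate_hz, ms_per_ft=1.0):
    """Taps needed to cover an acoustic path, using 1 ft ~ 1 msec."""
    return taps_for_ms(path_ft * ms_per_ft, sample_rate_hz)

def min_filter_ms(largest_room_dimension_ft, factor=5.0):
    """Rule of thumb: filter length in ms = 5 x largest room dimension in ft."""
    return factor * largest_room_dimension_ft

print(taps_for_path(70, 16000))                 # longest significant path B: 1120
print(taps_for_path(13, 16000))                 # direct path A: 208
print(min_filter_ms(15))                        # assumed 15 ft dimension: 75.0
print(taps_for_ms(min_filter_ms(15), 16000))    # 1200
```

Under that assumption the rule of thumb gives 75 ms, slightly longer than the 70 ms needed for path B, so it errs on the side of a longer filter.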

Time Alignment Drift

If there is a speed differential between the primary and reference tracks, the time alignment will “drift” as the processing progresses

This can be observed in the FIR coefficient response as a movement of the “big spike” (the large coefficient associated with the direct path signal correlation), either to the right or the left

If drift is significantly fast (e.g. more than 1-2 coefficients every 5-10 seconds), the LMS algorithm will never be able to converge the FIR coefficients to an optimal solution

Also, should the spike drift beyond either the beginning or the end of the filter, all cancellation will be lost
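A rough calculation shows how a small speed differential turns into drift of the spike. The sketch below is illustrative only: the function names are hypothetical, the speed error is expressed in parts per million as an assumption, and the 0.2 coefficients per second limit (1 coefficient every 5 seconds) is just one reading of the 1-2 coefficients every 5-10 seconds guideline.

```python
def drift_rate_taps_per_s(speed_error_ppm, sample_rate_hz):
    """Coefficients per second that the spike moves for a given speed error."""
    return speed_error_ppm * sample_rate_hz / 1e6

def is_too_fast(rate_taps_per_s, limit_taps_per_s=0.2):
    """True if drift exceeds the assumed limit for LMS convergence."""
    return abs(rate_taps_per_s) > limit_taps_per_s

def seconds_until_spike_leaves(spike_tap, n_taps, rate_taps_per_s):
    """Time before the spike drifts past either end of the filter.

    A positive rate moves the spike towards the end of the filter,
    a negative rate towards the beginning.
    """
    if rate_taps_per_s == 0:
        return float("inf")
    margin = (n_taps - 1 - spike_tap) if rate_taps_per_s > 0 else spike_tap
    return margin / abs(rate_taps_per_s)

rate = drift_rate_taps_per_s(50, 16000)     # 50 ppm speed error at 16000 Hz
print(rate, is_too_fast(rate))
print(seconds_until_spike_leaves(208, 1120, rate))
```

Even a speed error of a few tens of parts per million can exceed the guideline at a 16000 Hz sample rate, which is why drift matters most for analogue or otherwise speed-unstable sources.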

Research Question (II)

“Is it possible to reduce or remove, from contemporaneous recordings made in the same acoustic environment, interfering music, background noises, and speech of other speakers, to bring the voice of the main speaker to the forefront?”

Will having two microphones in the same environment allow for effective cancellation?

Applying ‘Audio Fingerprinting’ to Background Noise

Having two microphones in the same acoustic environment, perfectly time-aligned, can greatly help bring out the voice of one speaker over the other

This rarely happens in practice

Aligning noise is a more difficult problem, as sufficient spectral peaks may not be available in both recordings.

By applying less stringent criteria for matching, we can accurately time-align audio from two independent recorders in the same acoustic environment.

Applications to Noise Identification

[Block diagram labels: (S+N) Speech + noise/music; (N’) Identified time-aligned noise/music; Electronic Response estimate; summing junction (+, -); (N”) Residual; (S+N’ –N”) Speech + residual noise/music]

Scenarios

Scenario 1: Two independent recordings using two smartphones in the same acoustic environment

Scenario 2: Two fixed microphones in the same acoustic environment

Scenario 3: White noise interference

Scenario 1: Two independent recordings using two smartphones

Two mobile phones, an iPhone 4S and an iPhone 3GS, were used to record a conversation between two speakers in the same acoustic environment.

Two independent devices (not synchronized to each other in any way).

A smaller number of hashes was observed, but sufficient for time alignment

Queried test audio (iPhone 4S) matched and time-aligned against a reference recording (iPhone 3GS)

Scenario 2: Two fixed microphones in the same acoustic environment

• Interfering noise was a television broadcast (2 speakers in a room)

• Relatively small number of matching hashes as compared with the music

• Land-marking experiments -> sufficient matches to time-align the two files correctly

Scenario 3: White Noise Interference

[Figure: three time-domain waveforms, Time (s) from 0 to 60, amplitude from -1 to 1. Panels: Original speech and white noise; Identified and aligned reference white noise; Resulting speech (Original - white noise)]

• White noise as the interfering source.

• Exceedingly difficult to find any distinctive spectral peaks

• The number of matching hashes was significantly lower than that observed with either music or regular noise

• However, we were able to identify a very small number of matching hashes that were sufficient to allow time-alignment.

• Reference cancellation applied using this time-alignment showed significant improvement in intelligibility.

Limitations

This method is not applicable to:

Badly clipped recordings

Compressed recordings

Recordings where there is a ‘drift’ or stretch between the playback time of the music (more applicable to analogue recordings)

Note: What is extracted may still not be of sufficient quality for forensic voice comparison

Conclusions

A combination of audio-fingerprinting and echo cancellation can be used to reduce the effect of interfering radio and television noises.

This approach could be extended to non-music speech sources by using two independent recordings in the same recording environment

A significant improvement in the intelligibility is obtained which could benefit forensic audio enhancement and transcription.

References

A. Wang (2003) “An Industrial-Strength Audio Search Algorithm”, Proc. 2003 ISMIR International Symposium on Music Information Retrieval, Baltimore, MD, Oct. 2003.

J. Benesty, D. Morgan and M. Sondhi (1997) “A better understanding and an improved solution to the problems of stereophonic acoustic echo cancellation”, Proc. ICASSP, 97, 303

D. P. W. Ellis (2009) Robust Landmark-Based Audio Fingerprinting. http://labrosa.ee.columbia.edu/matlab/fingerprint/

A. Alexander and O. Forth (2011) “'No, thank you, for the music': An application of audio fingerprinting and automatic music signal cancellation for forensic audio enhancement”, International Association of Forensic Phonetics and Acoustics Conference 2011, Vienna, Austria, July 2011
