Audio Fingerprinting Introduction

A Bachelor Thesis Project Presentation

(First Evaluation)

on

Audio Fingerprinting for Song Identification

under the guidance of

Dr. Padam Kumar

Department of Electronics & Computers, IIT Roorkee

Team –

Rishabh Sood, B.Tech. CSE IV Yr., 070820

Santosh Kumar, B.Tech. CSE IV Yr., 070824

Vikesh Khanna, B.Tech. CSE IV Yr., 070829

Contents

1. Objective
   1.1 Problem statement
   1.2 Motivation

2. Theory
   2.1 Audio Fingerprint definition
   2.2 System Parameters

3. Design
   3.1 Architecture
   3.2 Flow Diagram
   3.3 Codec Layer
   3.4 Fingerprint Layer
   3.5 Protocol Layer
   3.6 Search Algorithm
   3.7 Database Architecture

5. Demonstration

6. Progress timeline

7. References

Problem Statement

To build a robust audio fingerprinting system which can be used to identify songs efficiently from a large database with limited computing resources and input.

Motivation

Robust audio fingerprinting has a wide range of applications in industry.

Broadcast Monitoring

Automating royalty collection by monitoring broadcast channels

Media Plugins

Plugins for playlist generation and for identifying similar tracks

P2P Filtering

Filtering copyrighted material from P2P networks, even if filenames and metadata have been tampered with

Language Translation

Identifying audio content in foreign languages, which is not possible by textual search

Audio Fingerprint definition

An audio fingerprint is essentially a hash function that maps an audio object consisting of a large number of bits to a ‘fingerprint’ of only a limited number of bits. The audio object can be uniquely identified from this bit string.

[Figure: the fingerprint function F maps a 5 MB audio object to a 100 KB fingerprint]

Audio Fingerprint v/s Cryptographic hash functions

1. Mathematical equivalence v/s perceptual similarity
Assume X and Y are two objects that a cryptographic hash function H maps to H(X) and H(Y). Strict mathematical equality of H(X) and H(Y) implies equality of X and Y with a very low probability of error. For audio, we are interested not in strict mathematical equivalence but in perceptual similarity.

2. Transitivity property
If two sound tracks X and Y are perceptually similar, and Y and Z are perceptually similar, it does NOT follow that X and Z are perceptually similar. Equality of mathematical hash values, by contrast, is always transitive.

Therefore, instead of mathematical equivalence, we use threshold comparisons:

|F(X) – F(Y)| ≤ T implies X and Y are similar

|F(X) – F(Y)| > T implies X and Y are not similar
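The threshold comparison can be sketched in a few lines. This is only an illustration: it assumes a fingerprint is a sequence of 32-bit integers and that the distance |F(X) – F(Y)| is measured as the fraction of differing bits (bit error rate). The function names and the default threshold of 0.35 are assumptions and are not values stated in this presentation.

```python
def bit_error_rate(fp_a, fp_b, bits_per_word=32):
    """Fraction of bits that differ between two equal-length fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    if not fp_a:
        raise ValueError("fingerprints must not be empty")
    # XOR leaves a 1 wherever the two words disagree; count those bits.
    differing = sum(bin(a ^ b).count("1") for a, b in zip(fp_a, fp_b))
    return differing / (len(fp_a) * bits_per_word)


def is_similar(fp_a, fp_b, threshold=0.35):
    """True when the distance is at most the threshold T (assumed value)."""
    return bit_error_rate(fp_a, fp_b) <= threshold
```

A lower threshold reduces false positives at the cost of more false negatives, which is the trade-off between the reliability and robustness parameters.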

System Parameters

Robustness: Low false negative rate.

Reliability: Low false positive rate.

Fingerprint Size: How many bits per song?

Granularity: What is the minimum input size?

Search Speed: How fast is the search for a particular database size?

Architecture

A layered approach

[Figure: client-server architecture. CLIENT: Audio input → Codec Layer → samples in unsigned char format → Fingerprint Layer → fingerprint → Protocol Layer → HTTP POST request. SERVER: Database (Search Algorithm) → Metadata → XML generator → XML data. CLIENT: XML Parser → Album, Artist, Lyrics.]

Codec Layer

An audio codec is a computer program that compresses or decompresses audio data in a given file format for storage or playback.

[Figure: example audio formats handled by codecs: AAC, MP3, WMA]


i) Samples (unsigned char* samples): A buffer of the actual data samples (2 bytes, i.e. 16 bits, per sample).

ii) Byte order (int byteOrder): The byte order of the samples, either CONST_LITTLE_ENDIAN or CONST_BIG_ENDIAN.

iii) Number of samples (long size): The number of samples read.

iv) Sample rate (int sRate): The number of samples per second of audio (samples/sec).

v) Stereo (bool stereo): Boolean value indicating whether the audio is stereo.

vi) Duration: Duration of the original audio, regardless of the number of samples read.

vii) Format: Format of the original audio, expressed as a file extension (.mp3, .wav etc.).

WAV: Uncompressed PCM (WAV format) [4]

[Figure: a WAV file is decoded into an AudioData object]

Endian   Field offset (bytes)   Field name      Field size (bytes)
big      0                      ChunkID         4
little   4                      ChunkSize       4
big      8                      Format          4
big      12                     Subchunk1ID     4
little   16                     Subchunk1Size   4
little   20                     AudioFormat     2
little   22                     NumChannels     2
little   24                     SampleRate      4
little   28                     ByteRate        4
little   32                     BlockAlign      2
little   34                     BitsPerSample   2
big      36                     Subchunk2ID     4
little   40                     Subchunk2Size   4
little   44                     Data            Subchunk2Size

Offsets 0 to 11: the "RIFF" chunk descriptor. The format "WAVE" requires two subchunks, "fmt" and "data".

Offsets 12 to 35: the "fmt" subchunk, which describes the format of the data in the "data" subchunk.

Offset 36 onward: the "data" subchunk, which indicates the size of the sound information and contains the raw sound data.
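A minimal sketch of how a codec layer might read this header, written in Python for brevity rather than the project's C/C++. It assumes the canonical 44-byte layout with the "fmt" subchunk immediately followed by the "data" subchunk; real files may contain extra chunks that this sketch does not handle. The names WavHeader and parse_wav_header are hypothetical and do not come from the project's wavefile.cpp.

```python
import struct
from dataclasses import dataclass


@dataclass
class WavHeader:
    num_channels: int
    sample_rate: int
    byte_rate: int
    block_align: int
    bits_per_sample: int
    data_size: int


def parse_wav_header(header: bytes) -> WavHeader:
    """Parse a canonical 44-byte PCM WAV header (assumed layout)."""
    if len(header) < 44:
        raise ValueError("a canonical WAV header is 44 bytes long")
    # "RIFF" chunk descriptor: offsets 0 to 11.
    chunk_id, _chunk_size, wave_format = struct.unpack_from("<4sI4s", header, 0)
    if chunk_id != b"RIFF" or wave_format != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    # "fmt " subchunk: offsets 12 to 35, numeric fields are little-endian.
    (sub1_id, _sub1_size, audio_format, num_channels, sample_rate,
     byte_rate, block_align, bits_per_sample) = struct.unpack_from(
        "<4sIHHIIHH", header, 12)
    if sub1_id != b"fmt " or audio_format != 1:
        raise ValueError("expected an uncompressed PCM 'fmt ' subchunk")
    # "data" subchunk header: offsets 36 to 43; samples start at offset 44.
    sub2_id, data_size = struct.unpack_from("<4sI", header, 36)
    if sub2_id != b"data":
        raise ValueError("expected the 'data' subchunk at offset 36")
    return WavHeader(num_channels, sample_rate, byte_rate,
                     block_align, bits_per_sample, data_size)
```

From the parsed values, the number of samples per channel is data_size divided by block_align, and the stereo flag corresponds to num_channels being 2.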

Fingerprint Layer

The fingerprint layer carries out the core mathematical analysis of the audio, converting a 5 MB audio file into a 100 KB fingerprint (bit string).

[Figure: WAV (5 MB) → Fingerprint Layer → fea690b1-b11dce98-a… (100 KB)]


Fingerprint extraction scheme [1]:

1. Framing
Divide the audio file into equally sized frames.

2. Sub-fingerprinting
For each frame, degradation-invariant features are calculated. Well-known audio features include Fourier coefficients, Mel Frequency Cepstral Coefficients (MFCC), spectral flatness, sharpness and Linear Predictive Coding (LPC) coefficients. These features are mapped into a more compact representation by using classification algorithms like Hidden Markov Models (HMM) or quantization.

3. Generate a fingerprint block
One sub-fingerprint is not sufficient for identification of an audio clip. The basic unit that is sufficient to identify an audio clip is called a fingerprint block.
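The framing step can be sketched as follows. The slides only say that frames are equally sized; the hop_size parameter, which lets consecutive frames overlap, is an assumption added for illustration. Setting hop_size equal to frame_size gives non-overlapping frames.

```python
def split_into_frames(samples, frame_size, hop_size):
    """Return equally sized frames of `frame_size` samples, `hop_size` apart.

    A trailing partial frame is dropped so that all frames have equal length.
    """
    if frame_size <= 0 or hop_size <= 0:
        raise ValueError("frame_size and hop_size must be positive")
    last_start = len(samples) - frame_size
    return [samples[start:start + frame_size]
            for start in range(0, last_start + 1, hop_size)]
```

With a small hop relative to the frame size, consecutive frames share most of their samples, so the features derived from them change slowly over time.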

E(n,m) = energy of band m of frame n

F(n,m) = m-th bit of the sub-fingerprint of frame n

\[
F(n,m) =
\begin{cases}
1 & \text{if } E(n,m) - E(n,m+1) - \bigl(E(n-1,m) - E(n-1,m+1)\bigr) > 0 \\
0 & \text{if } E(n,m) - E(n,m+1) - \bigl(E(n-1,m) - E(n-1,m+1)\bigr) \le 0
\end{cases}
\]

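[Figure: fingerprint extraction pipeline. Framing → feature extraction → ABS → Band Division → Energy Computation (∑ x² per band) → differences between neighbouring bands and between consecutive frames (delay elements T, adders with + and − inputs) → threshold (> 0) → output bits F(n,0), F(n,1), …, F(n,30), F(n,31)]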
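The bit derivation can be sketched as below. It assumes that both energy differences are taken between bands m and m+1, as in the scheme of [1], and that bit F(n,0) is stored as the most significant bit of the result; the bit ordering and the function name are assumptions made for illustration.

```python
def sub_fingerprint(prev_energies, cur_energies):
    """Derive the sub-fingerprint of frame n from band energies.

    `prev_energies` holds E(n-1, m) and `cur_energies` holds E(n, m).
    M + 1 bands yield M bits, so 33 bands give a 32-bit sub-fingerprint.
    """
    if len(prev_energies) != len(cur_energies) or len(cur_energies) < 2:
        raise ValueError("need two frames with the same number (>= 2) of bands")
    value = 0
    for m in range(len(cur_energies) - 1):
        diff = ((cur_energies[m] - cur_energies[m + 1])
                - (prev_energies[m] - prev_energies[m + 1]))
        bit = 1 if diff > 0 else 0
        # Append the bit; F(n,0) ends up as the most significant bit (assumed).
        value = (value << 1) | bit
    return value
```

Because each bit depends only on the sign of an energy difference, moderate changes in overall loudness leave the bits unchanged.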

Protocol Layer

The protocol layer accepts the fingerprint from the fingerprint layer and makes an HTTP POST request to the server for the relevant metadata.

The protocol layer has two major modules:

1. HTTP module
This module implements the POST request to the server, with the fingerprint in the request message.

2. XML parser
The returned metadata is in XML format. The protocol layer has a parser module to retrieve the required information, such as the artist, album and lyrics.

HTTP POST

    POST /path/script.cgi HTTP/1.0
    From: [email protected]
    User-Agent: HTTPTool/1.0
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 32

    client_id=42&fingerprint=fea690b1b11dce98a…

Database

XML

    <?xml version="1.0" encoding="UTF-8"?>
    <metadata fp="fea690b1b11dce98a…" id="42">
      <album>Dark Side of the moon</album>
      <song>Comfortably Numb</song>
      <artist>Pink Floyd</artist>
    </metadata>

XML Parser

    Album    Dark Side of the moon
    Song     Comfortably Numb
    Artist   Pink Floyd
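The two protocol-layer modules can be sketched with the Python standard library. The sketch only builds the form-encoded request body and parses a response; it performs no network call. The function names and the fingerprint value "0123abcd" are hypothetical placeholders, while the field names client_id and fingerprint follow the example request above.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_post_body(client_id, fingerprint_hex):
    """Form-encode the request body sent with the HTTP POST."""
    fields = {"client_id": client_id, "fingerprint": fingerprint_hex}
    return urlencode(fields).encode("ascii")


def parse_metadata(xml_bytes):
    """Map each child element of <metadata> to its text content."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: child.text for child in root}


# Response body as it might arrive from the server (hypothetical fp value).
SAMPLE_RESPONSE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<metadata fp="0123abcd" id="42">'
    b"<album>Dark Side of the moon</album>"
    b"<song>Comfortably Numb</song>"
    b"<artist>Pink Floyd</artist>"
    b"</metadata>"
)
```

The Content-Length header of the request would be the length in bytes of the value returned by build_post_body.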

Server side

Search algorithm and scalability

Database Architecture

To understand the search algorithm, it is essential to understand the database architecture first.

Database Implementation

Tables used:

a) look_up
   subfingerprint INT
   link_list BLOB

b) songs
   song_id INT
   song_fingerprint MEDIUMBLOB

c) Metadata
   This table stores the song name, album, artist, genre, lyrics, year etc.

Note:

• Indexing has been added to speed up search within a table.

• The link list is stored as a binary large object (BLOB) via object serialization. Each entry contains the following fields:
  i) songId
  ii) offset
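One way the link list could be turned into a BLOB is sketched below. The slides only say that object serialization is used; fixed-width packing of (songId, offset) pairs as unsigned 32-bit little-endian integers is an assumption chosen for illustration, as are the function names.

```python
import struct

# One entry = songId and offset, both unsigned 32-bit (assumed widths).
_ENTRY = struct.Struct("<II")


def pack_link_list(entries):
    """Serialize a list of (song_id, offset) pairs into bytes for a BLOB."""
    return b"".join(_ENTRY.pack(song_id, offset) for song_id, offset in entries)


def unpack_link_list(blob):
    """Inverse of pack_link_list: recover the (song_id, offset) pairs."""
    if len(blob) % _ENTRY.size:
        raise ValueError("blob length is not a multiple of the entry size")
    return list(_ENTRY.iter_unpack(blob))
```

With this layout each entry occupies 8 bytes, so the number of positions stored for a sub-fingerprint is the BLOB length divided by 8.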

Search algorithm

A brute force matching approach takes O(n) time, which is unacceptable for any commercial deployment with a large database. For example, consider a moderate fingerprint database of 10,000 songs with an average length of 5 minutes. Recall that every 11.6 ms of audio generates a sub-fingerprint, so:

Number of sub-fingerprints = (5 × 60 × 10,000) / (11.6 × 10^-3) ≈ 258 million

Assuming a rate of 2 × 10^5 fingerprint comparisons per second [1] on a modern PC, an O(n) algorithm takes about 1,290 seconds, i.e. more than 20 minutes, to search this database.

Optimized Algorithm

Assumption: At least one sub-fingerprint has an exact match in the correct song.

The positions in the database where a specific 32-bit sub-fingerprint occurs are retrieved using the database architecture described above. The fingerprint database contains a lookup table (LUT) with an entry for every possible 32-bit sub-fingerprint. Every entry points to a list of pointers to the positions in the stored song fingerprints where that 32-bit sub-fingerprint occurs.
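A minimal in-memory sketch of the LUT search, using a Python dictionary in place of the database tables. It relies on the stated assumption that at least one query sub-fingerprint matches exactly, and compares the whole query block only at the candidate positions. All names, and the use of a bit-error count as the acceptance test, are assumptions made for illustration.

```python
from collections import defaultdict


def build_lookup_table(songs):
    """Map each sub-fingerprint to the (song_id, offset) positions holding it.

    `songs` maps song_id to that song's list of 32-bit sub-fingerprints.
    """
    table = defaultdict(list)
    for song_id, subs in songs.items():
        for offset, sub in enumerate(subs):
            table[sub].append((song_id, offset))
    return table


def bit_errors(block_a, block_b):
    """Number of differing bits between two equal-length blocks."""
    return sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))


def search(query, songs, table, max_bit_errors):
    """Return (song_id, start_offset) of the best match, or None."""
    best = None
    for q_pos, sub in enumerate(query):
        for song_id, offset in table.get(sub, ()):
            start = offset - q_pos  # align the query block with the song
            if start < 0:
                continue
            candidate = songs[song_id][start:start + len(query)]
            if len(candidate) != len(query):
                continue  # query would run past the end of the song
            errors = bit_errors(query, candidate)
            if errors <= max_bit_errors and (best is None or errors < best[0]):
                best = (errors, song_id, start)
    return None if best is None else (best[1], best[2])
```

Only positions reached through an exact sub-fingerprint hit are compared in full, which is why the cost depends on the average list size rather than on the size of the database.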

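Assume the same 10,000-song database with each song approximately 5 minutes long, giving about 250 million sub-fingerprints. Assuming all 32-bit sub-fingerprint values are equally likely, the average number of positions in a list is:

Average list size = 250,000,000 / 2^32 ≈ 0.058

Each query fingerprint block is taken to contain 256 sub-fingerprints, so:

Average number of comparisons per identification = 0.058 × 256 ≈ 15

In practice sub-fingerprints are not uniformly distributed, and [1] reports that the number of comparisons rises by roughly a factor of 20. At 2 × 10^5 comparisons per second:

Average time for the algorithm = (15 × 20) / (2 × 10^5) s = 1.5 ms

Improvement over brute force = (20 × 60 s) / (1.5 × 10^-3 s) = 800,000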
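The size estimates above can be reproduced with a short calculation. The default values are the figures quoted on the slides; the function and key names are hypothetical. The sketch stops at the number of comparisons per identification under the uniform-distribution assumption and does not model search time.

```python
def search_cost_estimates(songs=10_000, song_seconds=5 * 60,
                          sub_fp_interval=11.6e-3,
                          comparisons_per_second=2e5,
                          sub_fps_per_block=256, key_bits=32):
    """Back-of-the-envelope search cost figures for a fingerprint database."""
    total_sub_fps = songs * song_seconds / sub_fp_interval
    brute_force_seconds = total_sub_fps / comparisons_per_second
    # Uniform assumption: positions spread evenly over all 2^32 LUT entries.
    avg_list_size = total_sub_fps / 2 ** key_bits
    comparisons_per_query = avg_list_size * sub_fps_per_block
    return {
        "total_sub_fps": total_sub_fps,
        "brute_force_minutes": brute_force_seconds / 60,
        "avg_list_size": avg_list_size,
        "comparisons_per_query": comparisons_per_query,
    }
```

Changing the number of songs shows how the estimates scale: the brute-force time and the average list size both grow linearly with the database size.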


Demonstration

Codec Layer in action

Progress Timeline

Work | Details | Scheduled Time | Status

Literature review of audio fingerprinting | Research papers | November 2010 to 1st Jan 2011 | COMPLETE

Designing the architecture of the system | Conceptualizing the layers and their functions | 1st Jan 2011 to 15th Jan 2011 | COMPLETE

Writing data structures for the protocol layer | C++ class implementation of AudioData and TrackInfo classes | 16th to 20th January 2011 | COMPLETE

Implementing the Codec Layer for uncompressed PCM format | C/C++ implementation of wavefile.cpp | By 24th Jan 2011 | COMPLETE

Extending the Codec Layer to include compressed formats | Installing the open source LAME plugin for MP3 | By 25th February 2011 | PENDING

Creating the database structure and functions | | By 15th March 2011 | PENDING

Developing the fingerprint layer | | By 15th April 2011 | PENDING

Testing of the fingerprint layer | Testing the fingerprint algorithm using diverse inputs and codecs | By 20th April 2011 | PENDING

Server side development | Search algorithm and metadata population | By 30th April 2011 | PENDING

Alpha Testing | Developer phase testing | By 10th May 2011 | PENDING

Beta Testing | Deploying an actual server and third-party testing | By 20th May 2011 | PENDING

References

[1] Jaap Haitsma and Ton Kalker, “A highly robust audio fingerprinting system”, Philips Research , Eindhoven, The Netherlands, October 2001

[2] Music IP corporation, Available HTTP: musicip.com

[3] Neuschmied H., Mayer H. and Battle E., “Identification of Audio Titles on the Internet”, Proceedings of the International Conference on Web Delivering of Music 2001, Florence, Italy, November 2001

[4] Microsoft-IBM Wave file format, Available HTTP: ccrma.stanford.edu/courses/422/projects/WaveFormat/

[5] Haitsma J., Kalker T. and Oostveen J., “Robust Audio Hashing for Content Identification”, Content Based Multimedia Indexing 2001, Brescia, Italy, September 2001

Thank you. Any questions?
