Audio Fingerprinting Introduction
A Bachelor Thesis Project Presentation
(First Evaluation)
on
Audio Fingerprinting for Song Identification
under the guidance of
Dr. Padam Kumar
Department of Electronics & Computers, IIT Roorkee
Team:
Rishabh Sood, B.Tech. CSE IV Yr., 070820
Santosh Kumar, B.Tech. CSE IV Yr., 070824
Vikesh Khanna, B.Tech. CSE IV Yr., 070829
Contents
1. Objective
   1.1 Problem statement
   1.2 Motivation
2. Theory
   2.1 Audio Fingerprint definition
   2.2 System Parameters
3. Design
   3.1 Architecture
   3.2 Flow Diagram
   3.3 Codec Layer
   3.4 Fingerprint Layer
   3.5 Protocol Layer
   3.6 Search Algorithm
   3.7 Database Architecture
5. Demonstration
6. Progress timeline
7. References
Problem Statement
To build a robust audio fingerprinting system which can be used to identify songs efficiently from a large database with limited computing resources and input.
Motivation
There is immense scope for robust audio fingerprinting applications in industry.
Broadcast Monitoring
Automating royalty collection by monitoring broadcast channels.
Media Plugins
Plugins for playlist generation and for identifying similar tracks.
P2P Filtering
Filtering copyrighted material from P2P networks, even if filenames and metadata have been tampered with.
Language Translation
Identifying audio content in foreign languages, which is not possible by textual search.
Audio Fingerprint definition
An audio fingerprint is essentially a hash function that maps an audio object of a large number of bits to a ‘fingerprint’ of only a limited number of bits. The audio object can be uniquely identified from this bit string.
[Figure: the fingerprint function F maps a 5 MB audio object to a 100 KB fingerprint]
Audio Fingerprint v/s Cryptographic hash functions
1. Mathematical equivalence v/s perceptual similarity
Assume X and Y are two objects that a cryptographic hash function H maps to H(X) and H(Y). Strict mathematical equality of H(X) and H(Y) implies equality of X and Y with a very low probability of error. For audio, we are not interested in strict mathematical equivalence but in perceptual similarity.
2. Transitivity property
Equality of hash values is transitive: if H(X) = H(Y) and H(Y) = H(Z), then H(X) = H(Z). Perceptual similarity is not: if sound tracks X and Y are perceptually similar, and Y and Z are perceptually similar, it does NOT follow that X and Z are perceptually similar.
Therefore, instead of mathematical equivalence, we use threshold comparisons on the fingerprints F(X) and F(Y), with a threshold T:
$|F(X) - F(Y)| \le T$ implies X and Y are similar
$|F(X) - F(Y)| > T$ implies X and Y are not similar
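The threshold rule can be illustrated with a short Python sketch. It assumes that the distance between two fingerprints is the Hamming distance (number of differing bits); the slides do not name the distance measure, and the function names here are hypothetical.

```python
def hamming_distance(fp_x: bytes, fp_y: bytes) -> int:
    """Count the bit positions in which two equal-length fingerprints differ."""
    if len(fp_x) != len(fp_y):
        raise ValueError("fingerprints must have the same length")
    return sum(bin(a ^ b).count("1") for a, b in zip(fp_x, fp_y))


def is_similar(fp_x: bytes, fp_y: bytes, threshold: int) -> bool:
    """Threshold comparison: similar when the distance is at most T."""
    return hamming_distance(fp_x, fp_y) <= threshold
```

With T = 2, the one-byte fingerprints 0x00, 0x03 and 0x0F show the lack of transitivity: the first and second are similar, the second and third are similar, but the first and third are not.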
System Parameters
Robustness: low false negative rate.
Reliability: low false positive rate.
Fingerprint Size: how many bits per song?
Granularity: what is the minimum input size?
Search Speed: how fast is the search for a particular database size?
Architecture
A layered approach
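Codec Layer
Fingerprint Layer
Protocol Layer
[Figure: layered architecture]
CLIENT: Audio input → Codec Layer → samples in unsigned char format → Fingerprint Layer → fingerprint → Protocol Layer → HTTP POST request
SERVER: HTTP POST request → Database (Search Algorithm) → Metadata → XML generator → XML data
CLIENT: XML data → XML Parser → Album, Artist, Lyrics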
Codec Layer
An audio codec is a computer program that compresses audio for storage or transmission and decompresses it for playback.
Example formats: AAC, MP3, WMA
Fields of the decoded audio data:
i) Samples (unsigned char* samples): a buffer of the actual data samples (2 bytes, i.e. 16 bits, per sample).
ii) Byte order (int byteOrder): the byte order of the samples, either CONST_LITTLE_ENDIAN or CONST_BIG_ENDIAN.
iii) Number of samples (long size): the number of samples read.
iv) Sample rate (int sRate): the number of samples per second of audio.
v) Stereo (bool stereo): whether the audio is stereo.
vi) Duration: the duration of the original audio, regardless of the number of samples read.
vii) Format: the format of the original audio, expressed as a file extension (.mp3, .wav, etc.).
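As a rough illustration of this record, here is a Python sketch that mirrors the fields listed above. The project itself uses a C++ class; the numeric values of the byte-order constants, the unit of the duration, the signedness of the samples and the `sample_values` helper are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical values: the slides only name these constants.
CONST_LITTLE_ENDIAN = 0
CONST_BIG_ENDIAN = 1


@dataclass
class AudioData:
    samples: bytes      # raw sample buffer, 2 bytes per sample
    byte_order: int     # CONST_LITTLE_ENDIAN or CONST_BIG_ENDIAN
    size: int           # number of samples read
    sample_rate: int    # samples per second
    stereo: bool        # True if the audio is stereo
    duration: float     # duration of the original audio (seconds assumed)
    format: str         # file extension of the original audio, e.g. ".wav"

    def sample_values(self) -> List[int]:
        """Decode the buffer into integers (signed 16-bit samples assumed)."""
        order = "little" if self.byte_order == CONST_LITTLE_ENDIAN else "big"
        return [
            int.from_bytes(self.samples[i:i + 2], order, signed=True)
            for i in range(0, 2 * self.size, 2)
        ]
```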
[Figure: WAV file header layout; the codec layer decodes a WAV file into AudioData]

Endian | Field offset (bytes) | Field name    | Field size (bytes)
-------|----------------------|---------------|-------------------
big    | 0                    | ChunkID       | 4
little | 4                    | ChunkSize     | 4
big    | 8                    | Format        | 4
big    | 12                   | Subchunk1 ID  | 4
little | 16                   | Subchunk1 Size| 4
little | 20                   | Audio Format  | 2
little | 22                   | Num channels  | 2
little | 24                   | Sample rate   | 4
little | 28                   | Byte Rate     | 4
little | 32                   | Block Align   | 2
little | 34                   | BitsPerSample | 2
big    | 36                   | Subchunk2 ID  | 4
little | 40                   | Subchunk2 Size| 4
little | 44                   | Data          | Subchunk2 Size

Offsets 0 to 11: the “RIFF” chunk descriptor. The format “WAVE” requires two subchunks, “fmt” and “data”.
Offsets 12 to 35: the “fmt” subchunk, which describes the format of the data in the “data” subchunk.
Offsets 36 onwards: the “data” subchunk, which gives the size of the sound information and contains the raw sound data.
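A minimal Python sketch of reading this header is shown below. It assumes the canonical 44-byte layout, with a 16-byte “fmt” subchunk immediately followed by the “data” subchunk; real files may carry extra chunks that this sketch does not handle. The function name and dictionary keys are hypothetical, and the project's own reader (wavefile.cpp) is written in C/C++.

```python
import struct


def parse_wav_header(header: bytes) -> dict:
    """Decode the canonical 44-byte PCM WAV header described above."""
    if len(header) < 44:
        raise ValueError("a canonical WAV header is 44 bytes long")
    # The 4-byte ID fields are ASCII tags read as raw bytes; the numeric
    # fields are little-endian, hence the '<' prefix.
    chunk_id, chunk_size, wave_format = struct.unpack_from("<4sI4s", header, 0)
    (sub1_id, sub1_size, audio_format, num_channels, sample_rate,
     byte_rate, block_align, bits_per_sample) = struct.unpack_from(
        "<4sIHHIIHH", header, 12)
    sub2_id, sub2_size = struct.unpack_from("<4sI", header, 36)
    if (chunk_id, wave_format, sub1_id, sub2_id) != (
            b"RIFF", b"WAVE", b"fmt ", b"data"):
        raise ValueError("not a canonical RIFF/WAVE header")
    return {
        "audio_format": audio_format,      # 1 means uncompressed PCM
        "num_channels": num_channels,
        "sample_rate": sample_rate,
        "byte_rate": byte_rate,
        "block_align": block_align,
        "bits_per_sample": bits_per_sample,
        "data_size": sub2_size,            # bytes of raw sound data
    }
```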
Uncompressed PCM (WAV format) [4]
Fingerprint Layer
The fingerprint layer carries out the core mathematical analysis of the audio, converting a 5 MB audio file into a 100 KB fingerprint (bit string).
[Figure: WAV (5 MB) → fea690b1-b11dce98-a… (100 KB)]
Fingerprint extraction scheme [1]:
1. Framing
Divide the audio file into equally sized frames.
2. Sub-fingerprinting
For each frame, degradation-invariant features are calculated. Well-known audio features include Fourier coefficients, Mel Frequency Cepstral Coefficients (MFCC), spectral flatness, sharpness and Linear Predictive Coding (LPC) coefficients. These features are mapped into a more compact representation using classification algorithms such as Hidden Markov Models (HMM) or quantization.
3. Generating a fingerprint block
One sub-fingerprint is not sufficient to identify an audio clip. The basic unit that is sufficient to identify an audio clip is called a fingerprint block.
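The framing step can be sketched in a few lines of Python. The `hop_size` parameter is an assumption added for illustration: with the default it yields the non-overlapping, equally sized frames described above, while a smaller hop gives overlapping frames.

```python
def split_into_frames(samples, frame_size, hop_size=None):
    """Split a sample sequence into equally sized frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if hop_size is None:
        hop_size = frame_size  # non-overlapping frames by default
    if frame_size <= 0 or hop_size <= 0:
        raise ValueError("frame_size and hop_size must be positive")
    last_start = len(samples) - frame_size
    return [samples[start:start + frame_size]
            for start in range(0, last_start + 1, hop_size)]
```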
E(n,m) = energy of band m of frame n
F(n,m) = m-th bit of the sub-fingerprint of frame n

$$
F(n,m) =
\begin{cases}
1 & \text{if } E(n,m) - E(n,m+1) - \bigl(E(n-1,m) - E(n-1,m+1)\bigr) > 0 \\
0 & \text{if } E(n,m) - E(n,m+1) - \bigl(E(n-1,m) - E(n-1,m+1)\bigr) \le 0
\end{cases}
$$
[Figure: fingerprint extraction block diagram. Stages: Framing, ABS, Band Division, Energy Computation (∑ x²), delay elements (T), sum and difference, threshold (> 0). Outputs: F(n,0), F(n,1), …, F(n,30), F(n,31)]
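The bit derivation can be sketched in Python as follows, using the energy-difference rule of [1]: each bit compares the energy difference between neighbouring bands m and m+1 in the current frame with the same difference in the previous frame. The function name is hypothetical, and packing F(n,0) as the most significant bit is an assumption. Producing the 32 bits F(n,0) to F(n,31) requires 33 band energies per frame.

```python
def sub_fingerprint(prev_energies, curr_energies):
    """Derive one sub-fingerprint from the band energies of two frames.

    prev_energies holds E(n-1, m) and curr_energies holds E(n, m).
    A list of B band energies yields B - 1 bits.
    """
    if len(prev_energies) != len(curr_energies) or len(curr_energies) < 2:
        raise ValueError("need two equally long lists of at least 2 energies")
    value = 0
    for m in range(len(curr_energies) - 1):
        diff = ((curr_energies[m] - curr_energies[m + 1])
                - (prev_energies[m] - prev_energies[m + 1]))
        bit = 1 if diff > 0 else 0
        value = (value << 1) | bit  # F(n, 0) ends up as the top bit
    return value
```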
Protocol Layer
The protocol layer accepts the fingerprint from the fingerprint layer and makes an HTTP POST request to the server for the relevant metadata.
The protocol layer has two major modules:
1. HTTP module
This module implements the POST request to the server, with the fingerprint in the request message.
2. XML parser
The returned metadata is in XML format. The parser module retrieves the required information, such as the artist, album and lyrics.
Protocol Layer: example exchange

HTTP POST request (client to server):

    POST /path/script.cgi HTTP/1.0
    From: [email protected]
    User-Agent: HTTPTool/1.0
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 32

    client_id=42&fingerprint=fea690b1b11dce98a…

XML response (from the server database):

    <?xml version="1.0" encoding="UTF-8"?>
    <metadata fp="fea690b1b11dce98a…" id="42">
      <album>The Wall</album>
      <song>Comfortably Numb</song>
      <artist>Pink Floyd</artist>
    </metadata>

Output of the XML parser:

    Album   The Wall
    Song    Comfortably Numb
    Artist  Pink Floyd
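The two protocol-layer modules can be sketched with the Python standard library. This is only an illustration of the message formats shown above: it builds the form-encoded request body and parses a metadata response, without opening a network connection. The function names and the placeholder values in the usage note are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_post_body(client_id: int, fingerprint_hex: str) -> bytes:
    """Form-encode the request body sent with the HTTP POST."""
    fields = {"client_id": client_id, "fingerprint": fingerprint_hex}
    return urlencode(fields).encode("ascii")


def parse_metadata(xml_bytes: bytes) -> dict:
    """Return the child elements of <metadata> as a tag-to-text mapping."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "metadata":
        raise ValueError("expected a <metadata> root element")
    return {child.tag: (child.text or "") for child in root}
```

For example, `build_post_body(42, "abc123")` gives `b"client_id=42&fingerprint=abc123"`, and the value of the Content-Length header is the length of that byte string.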
Server side
Search algorithm and scalability
Database Architecture
To understand the search algorithm, it is essential to understand the database architecture first.
Database Implementation
Tables used:
a) look_up
   subfingerprint INT
   link_list BLOB
b) songs
   song_id INT
   song_fingerprint MEDIUMBLOB
c) metadata
   This table stores the song name, album, artist, genre, lyrics, year etc.
Note:
• Indexing has been added to speed up search within a table.
• The list (link_list) is stored as a binary large object via object serialization. Each entry contains the following fields:
   i) songId
   ii) offset
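The look_up and songs tables can be sketched with Python's built-in sqlite3 module. This swaps in SQLite for whatever database engine the project uses, and it assumes one possible serialization for the list: each (songId, offset) entry packed as two little-endian unsigned 32-bit integers. The helper names are hypothetical.

```python
import sqlite3
import struct


def pack_link_list(entries):
    """Serialize (song_id, offset) pairs into a BLOB, 8 bytes per entry."""
    return b"".join(struct.pack("<II", song_id, offset)
                    for song_id, offset in entries)


def unpack_link_list(blob):
    """Inverse of pack_link_list."""
    return [struct.unpack_from("<II", blob, i)
            for i in range(0, len(blob), 8)]


def create_schema(conn):
    """Create the look_up and songs tables; primary keys are indexed."""
    conn.execute("CREATE TABLE look_up ("
                 "subfingerprint INTEGER PRIMARY KEY, link_list BLOB)")
    conn.execute("CREATE TABLE songs ("
                 "song_id INTEGER PRIMARY KEY, song_fingerprint BLOB)")


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO look_up VALUES (?, ?)",
             (0xFEA690B1, pack_link_list([(1, 10), (2, 99)])))
row = conn.execute("SELECT link_list FROM look_up WHERE subfingerprint = ?",
                   (0xFEA690B1,)).fetchone()
positions = unpack_link_list(row[0])
```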
Search algorithm
A brute force matching approach takes O(n) time in the number of stored sub-fingerprints, which is unacceptable for a commercial deployment with a large database. For example, consider a moderate fingerprint database of 10,000 songs with an average length of 5 minutes. In the scheme of [1], every 11.6 ms of audio generates a sub-fingerprint, so:
$\text{Number of sub-fingerprints} = \dfrac{5 \times 10{,}000 \times 60}{11.6 \times 10^{-3}} \approx 2.59 \times 10^{8}$ (about 259 million)
Assuming a rate of $2 \times 10^{5}$ fingerprint comparisons per second [1] on a modern PC, an O(n) algorithm takes about 20 minutes (roughly 1,300 s) to search this database.
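The brute-force estimate can be reproduced with a few lines of Python. The function and parameter names are hypothetical; the numbers are the ones used in the example above.

```python
def brute_force_estimate(num_songs, minutes_per_song,
                         seconds_per_sub_fp, comparisons_per_second):
    """Return (number of sub-fingerprints, brute-force search time in s)."""
    total_audio_seconds = num_songs * minutes_per_song * 60
    num_sub_fps = total_audio_seconds / seconds_per_sub_fp
    search_seconds = num_sub_fps / comparisons_per_second
    return num_sub_fps, search_seconds


count, seconds = brute_force_estimate(10_000, 5, 11.6e-3, 2e5)
print(f"{count / 1e6:.1f} million sub-fingerprints, "
      f"{seconds / 60:.1f} minutes per search")
```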
Optimized Algorithm
Assumption: at least one sub-fingerprint of the query has an exact match at the corresponding position in the correct song.
Using the database architecture described above, the positions in the database where a specific 32-bit sub-fingerprint occurs can be retrieved directly. The fingerprint database contains a lookup table (LUT) with one entry for every possible 32-bit sub-fingerprint. Each entry points to a list of pointers to the positions in the actual fingerprint lists where that sub-fingerprint occurs. Only these candidate positions need a full fingerprint comparison.
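The lookup-table search can be sketched in Python with an in-memory dictionary standing in for the LUT. This is a simplified illustration, not the project's implementation: songs are lists of integer sub-fingerprints, the comparison counts differing bits over the whole query, and all names are hypothetical.

```python
from collections import defaultdict


def build_lookup_table(songs):
    """Map each sub-fingerprint to the (song_id, offset) pairs where it occurs."""
    lut = defaultdict(list)
    for song_id, sub_fps in songs.items():
        for offset, sub_fp in enumerate(sub_fps):
            lut[sub_fp].append((song_id, offset))
    return lut


def bit_errors(block_a, block_b):
    """Total number of differing bits between two equally long blocks."""
    return sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))


def search(query, songs, lut, max_bit_errors):
    """Return (song_id, start, errors) of the best candidate, or None."""
    best = None
    for q_pos, sub_fp in enumerate(query):
        for song_id, offset in lut.get(sub_fp, ()):
            start = offset - q_pos  # where the query would begin in the song
            if start < 0:
                continue
            candidate = songs[song_id][start:start + len(query)]
            if len(candidate) < len(query):
                continue
            errors = bit_errors(query, candidate)
            if errors <= max_bit_errors and (best is None or errors < best[2]):
                best = (song_id, start, errors)
    return best
```

Only positions reached through an exact sub-fingerprint hit are compared in full, which is why the assumption of at least one exact match matters.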
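Assume the same 10,000-song database, with each song approximately 5 minutes long, giving about 250 million sub-fingerprints. If all 32-bit values are equally likely, the average number of positions per list is:
$\text{Average list size} = 250{,}000{,}000 / 2^{32} \approx 0.058$
A query fingerprint block contains 256 sub-fingerprints [1], so:
$\text{Average number of comparisons per identification} = 0.058 \times 256 \approx 15$
In practice sub-fingerprints are not uniformly distributed, which increases the number of comparisons by roughly a factor of 20 [1]. At $2 \times 10^{5}$ comparisons per second:
$\text{Average search time} = (15 \times 20) / (2 \times 10^{5})\ \text{s} = 1.5\ \text{ms}$
$\text{Improvement over brute force} = (20 \times 60\ \text{s}) / (1.5 \times 10^{-3}\ \text{s}) = 800{,}000$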
Demonstration
Codec Layer in action
Progress Timeline

Work | Details | Scheduled Time | Status
-----|---------|----------------|-------
Literature review of audio fingerprinting | Research papers | November 2010 to 1st Jan 2011 | COMPLETE
Designing the architecture of the system | Conceptualizing the layers and their functions | 1st Jan 2011 to 15th Jan 2011 | COMPLETE
Writing data structures for the protocol layer | C++ class implementation of the AudioData and TrackInfo classes | 16th to 20th January 2011 | COMPLETE
Implementing the Codec Layer for uncompressed PCM format | C/C++ implementation of wavefile.cpp | By 24th Jan 2011 | COMPLETE
Extending the Codec Layer to include compressed formats | Installing the open source LAME plugin for MP3 | By 25th February 2011 | PENDING
Creating the database structure and functions | | By 15th March 2011 | PENDING
Developing the fingerprint layer | | By 15th April 2011 | PENDING
Testing of the fingerprint layer | Testing the fingerprint algorithm using diverse inputs and codecs | By 20th April 2011 | PENDING
Server side development | Search algorithm and metadata population | By 30th April 2011 | PENDING
Alpha Testing | Developer phase testing | By 10th May 2011 | PENDING
Beta Testing | Deploying an actual server and third-party testing | By 20th May 2011 | PENDING
References
[1] Haitsma J. and Kalker T., “A Highly Robust Audio Fingerprinting System”, Philips Research, Eindhoven, The Netherlands, October 2001.
[2] Music IP corporation, Available HTTP: musicip.com
[3] Neuschmied H., Mayer H. and Battle E., “Identification of Audio Titles on the Internet”, Proceedings of the International Conference on Web Delivering of Music 2001, Florence, Italy, November 2001.
[4] Microsoft-IBM Wave file format, Available HTTP: ccrma.stanford.edu/courses/422/projects/WaveFormat/
[5] Haitsma J., Kalker T. and Oostveen J., “Robust Audio Hashing for Content Identification”, Content Based Multimedia Indexing 2001, Brescia, Italy, September 2001.
Thank you
Any Questions?