mp3 surround: efficient and compatible coding of multi-channel

14
MP3 Surround: Efficient and Compatible Coding of Multi-Channel Audio Juergen Herre 1 , Christof Faller 2 , Christian Ertel 1 , Johannes Hilpert 1 , Andreas Hoelzer 1 , and Claus Spenger 1 1 Fraunhofer Institute for Integrated Circuits IIS, 91058 Erlangen, Germany 2 Agere Systems, Allentown, PA 18109, USA ABSTRACT Finalized in 1992, the MP3 compression format has become a synonym for personalized music enjoyment for millions of users. The paper presents a novel extension of this popular format which adds support for the coding of multi-channel signals, including the widely used 5.1 surround sound. As a prominent feature of the extended format, complete backward compatibility with existing stereo MP3 decoders is retained, i.e. standard decoders reproduce a full stereo downmix of the multi-channel sound image. The paper discusses the underlying advanced technology enabling the representation of multi-channel sound at bitrates that are comparable to what is currently used to encode stereo material. Results for subjective sound quality are presented and related activities of the MPEG standardization group are reported. 1. INTRODUCTION With the broad availability of the Internet and modern computer technology, a palette of perceptual audio coding schemes has found widespread use in multimedia applications. Among these codecs, the popular MP3 compression format is one of the most frequently used coding schemes for both hardware- based playback devices (such as portable audio players, PDAs, and CD/DVD players) and software-based applications (media players, music jukebox software etc.). Initially mainly associated with Internet audio, MP3 has turned into a synonym for personalized music enjoyment for millions of users worldwide. Formally speaking, the name MP3 refers to the Layer 3 coding scheme of the ISO/MPEG-1 and ISO/MPEG-2 Audio specifications [21] [22] that were finalized in 1992 and 1994, respectively 1 . Compliant to these standards, MP3 capability usually supports the 1 MPEG-2 Audio also specifies (matrixed) backward compatible multi-channel audio coding schemes. These are not well represented in the marketplace and are not included in MP3 implementations.

Upload: dangthuy

Post on 24-Jan-2017

220 views

Category:

Documents


0 download

TRANSCRIPT

MP3 Surround: Efficient and CompatibleCoding of Multi-Channel Audio

Juergen Herre1, Christof Faller2, Christian Ertel1, Johannes Hilpert1, Andreas Hoelzer1,and Claus Spenger1

1 Fraunhofer Institute for Integrated Circuits IIS, 91058 Erlangen, Germany

2 Agere Systems, Allentown, PA 18109, USA

ABSTRACT

Finalized in 1992, the MP3 compression format has become a synonym for personalized music enjoyment formillions of users. The paper presents a novel extension of this popular format which adds support for the coding ofmulti-channel signals, including the widely used 5.1 surround sound. As a prominent feature of the extended format,complete backward compatibility with existing stereo MP3 decoders is retained, i.e. standard decoders reproduce afull stereo downmix of the multi-channel sound image. The paper discusses the underlying advanced technologyenabling the representation of multi-channel sound at bitrates that are comparable to what is currently used toencode stereo material. Results for subjective sound quality are presented and related activities of the MPEGstandardization group are reported.

1. INTRODUCTION

With the broad availability of the Internet and moderncomputer technology, a palette of perceptual audiocoding schemes has found widespread use inmultimedia applications. Among these codecs, thepopular MP3 compression format is one of the mostfrequently used coding schemes for both hardware-based playback devices (such as portable audio players,PDAs, and CD/DVD players) and software-basedapplications (media players, music jukebox softwareetc.). Initially mainly associated with Internet audio,

MP3 has turned into a synonym for personalized musicenjoyment for millions of users worldwide.

Formally speaking, the name MP3 refers to the Layer 3coding scheme of the ISO/MPEG-1 and ISO/MPEG-2Audio specifications [21] [22] that were finalized in1992 and 1994, respectively1. Compliant to thesestandards, MP3 capability usually supports the

1MPEG-2 Audio also specifies (matrixed) backward compatible

multi-channel audio coding schemes. These are not well representedin the marketplace and are not included in MP3 implementations.

Herre et al. MP3 Surround

Page 2 of 14

encoding/decoding of mono or stereo2 audio at anumber of common sampling rates (16, 22.05, 24, 32,44.1, 48kHz) and bit-rates up to 320 kbit/s. Thus, MP3technology has been synonymous to stereo (non-multi-channel) audio storage and transmission for a period ofmore than 10 years.

More recently, however, multi-channel soundreproduction setups in home environment have becomemore popular. This trend is driven mostly by multi-channel sound that comes together with movie content,be it as discrete 5.1 channel sound (DVD Video/AC-3)or as matrixed analog multi-channel sound (e.g.Prologic, Logic7). Although market penetration inEurope is lagging behind the US, the trend towardsowning a �home theater� setup is clearly detectable. Afurther growing application area in this context is multi-channel for automotive applications. More and morehigh-end cars will be equipped with multi-channel audiocapabilities by default.

The consequence of this trend will be a clear demandfor multi-channel audio-only material that is played onhome theater setups without a video component. In thehigh-end segment, this desire will be served byemerging media, such as DVD-Audio and SA-CD. Forbandwidth-constrained multi-channel applications, theMPEG-2/4 Advanced Audio Coding (AAC) scheme hasbeen developed as an efficient multi-channel audioformat providing excellent sound quality at bit-rates inthe range of 256-320kbit/s for 5-channel material [20][5].

The general increase in significance of multi-channelsound in consumer audio raises the question about amulti-channel MP3 audio format that could serve theMP3 user community with efficient representation of�surround sound�. Naturally, it is important for such anadvanced format to maintain some degree ofcompatibility with existing MP3 systems in order toleverage the existing base of deployed systems. As canbe expected for any transition to a new generation oftechnology, there will always be a vast number ofconsumers who will stick to their existing stereoreproduction setups for the foreseeable future. Toaccommodate a smooth transition towards multi-channel transmission and reproduction technology, anideal advanced MP3 coding format needs to supportequally well both user populations (i.e. owners of

2In this paper the term �stereo� always refers to two-channel

stereophony.

traditional stereo setups and multi-channel enabledreproduction setups).

This paper introduces the concept behind a new formatthat extends current MP3 capabilities towards theefficient representation of multi-channel audio. Basedon recent advances in multi-channel audio codingtechnology, it permits the representation of 5.1 sound atbit-rates that are comparable to those currently used forrepresenting 2-channel material. Due to its underlyingstructure, this �MP3 Surround� format is fully backwardcompatible with existing MP3 decoders in the sense thatthese will decode a stereo downmix of the multi-channelsound image. Enhanced decoders make use of additionalaspects of the extended bitstream format in order toreproduce the full multi-channel sound image.

An alternative non-backward compatible approach forextending MP3 technology towards multi-channel in thecontext of MPEG-4 has been discussed in [18] and willnot be elaborated in this paper.

The subsequent sections briefly review state of the artmethods in perceptual coding of stereo and multi-channel audio and give a short overview of BinauralCue Coding (BCC), a coding method for spatial audio,which forms part of the technological basis of the MP3Surround format. Next, it discusses how the traditionalBCC approach is extended to permit the expansion ofstereo signals into a multi-channel sound image. Then,first results of subjective listening tests are presentedand the status of recent related standardization activitieswithin the MPEG group is described. This will berounded up with a discussion of several applications inmulti-channel sound enabled by the MP3 Surroundformat.

2. PERCEPTUAL CODING OF STEREO ANDMULTI-CHANNEL AUDIO

The basic approach of monophonic perceptual audiocoding relies on two general principles:

� The reduction of the signal�s redundancy by means ofentropy coding, utilization of the �nonflatness� of itsspectral envelope or predictive approaches facilitate amore compact representation of the signal withoutirreversible loss of information.

� The exploitation of the signal�s irrelevance by meansof shaping the coder quantization noise in frequencyand time allows a signal representation for which the

Herre et al. MP3 Surround

Page 3 of 14

coding distortion is not perceptible to the humanlistener (�masking�) while consuming only a fractionof the signal�s original data rate.

When coding audio material which is represented byseveral audio channels (i.e. stereo or multi-channelmaterial), state of the art coding schemes generallymake use of a number of additional techniques thatenable the efficient joint coding of these audio channelsand account for the spatial aspects of human auditoryperception. The two most well known techniques forjoint stereo coding of signals are Mid/Side (M/S) stereocoding (also known as �Sum/Difference coding�) andintensity stereo coding. Both joint coding strategiesmay be combined by selectively applying them todifferent frequency regions. A brief review will begiven next.

2.1. M/S Stereo Coding

In M/S stereo coding [16] the sum and the differencesignals of the two stereo channels are quantized andcoded rather than processing the left and right channelsignals separately (L/R coding). Switching between L/Rand M/S coding effectively controls the imaging ofcoding noise, in relation to the imaging of the originalsignal and in this way prevents spatial unmasking.Specifically, this technique addresses the issue of�Binaural Masking Level Difference� (BMLD) [4],where a signal at lower frequencies (below 2 kHz) canresult in up to 20 dB difference in the masking thresholddepending on the phase and correlation of the signal andprobe noise present. In such cases, proper coding of thestereo signal in L/R mode can require more bits thantwo transparently coded monophonic signals whereasM/S coding provides both proper spatial masking and �very frequently � further bitrate savings.

M/S stereo coding has been used extensively in the MP3and MPEG-2/4 AAC audio coders. In both cases, theswitching state (L/R or M/S coding) is determinedselectively in time on a block-by-block basis andtransmitted to the decoder. As a refinement, AAC alsoallows for frequency selective switching (for each scalefactor band the L/R vs. M/S decision is madeindividually).

In order to code multi-channel audio, M/S stereo codingis applied to channel pairs of the multi-channel signal,i.e. between pairs of channels that are arrangedsymmetrically on the left/right listener axis. In this

way, imaging problems due to spatial unmasking areavoided to a large degree [17].

2.2. Intensity Stereo Coding

Intensity stereo coding is a joint stereo coding techniqueapplied to reduce �perceptually irrelevant information�between audio channels and is described in [29].Specifically, intensity stereo coding exploits the factthat the perception of high frequency sound componentsmainly relies on the analysis of their energy-timeenvelopes [4]. When applied to each coder frequencyband, the corresponding spectral coefficients (orsubband samples) of both signal channels are replacedby a single sum signal and a direction angle (azimuth orequivalent information). The azimuth controls theintensity stereo position of the auditory event created atthe decoder. Only one azimuth is transmitted for a coderfrequency band. Using this information, the originalenergy-time envelopes of the coded channels arepreserved approximately by means of a scalingoperation of the transmitted sum signal such that eachchannel signal is reconstructed with its original levelafter decoding. Intensity stereo coding is capable ofsignificantly reducing the bitrate for stereo and multi-channel audio due to the reduction in the effectivenumber of transmitted spectral coefficients. However,its application is limited since intolerable distortions canoccur if it is applied to the full audio bandwidth or foruse with signals with a highly dynamic and wide spatialimage [15] [28]. Potential improvements of theapproach are constrained because the time-frequencyresolution is given by the core audio coder and cannotbe modified without considerable complexity beingadded.

Intensity stereo coding has been used widely undervarious names in both MPEG audio coders (MPEG-1,MPEG-2 [27] and MPEG-2/4 AAC [17]) and otherschemes (e.g. AC-3 [8]).

For the coding of multi-channel audio, intensity stereocoding can be applied either to pairs of channels that arearranged symmetrically on the left/right listener axis(similar to M/S stereo coding) or to arbitrary groups ofaudio channels.

3. BINAURAL CUE CODING (BCC)

Binaural cue coding (BCC) [11] [13] is a codingscheme based on a compact and perceptually motivatedrepresentation of multi-channel audio signals. BCC has

Herre et al. MP3 Surround

Page 4 of 14

different applications in scenarios where such signalsneed to be represented compactly. Specifically, BCC fornatural rendering [12] [13] is a parametric multi-channel audio coding technique, similar to intensitystereo coding.

Some of the limitations of intensity stereo coding areovercome by BCC in the way that different filterbanksare employed for coding of the audio waveform and forparametric stereo [1]. Most audio coders use a ModifiedDiscrete Cosine transform (MDCT) for coding of audiowaveforms. The advantages of using a different andmore suitable filterbank for parametric stereo arereduced aliasing [1] and more flexibility, such as theability to efficiently synthesize not only intensities(ICLD), but also time delays and coherence between theaudio channels. Another notable conceptual differencebetween intensity stereo coding and BCC is that thelatter is able to operate on the full-band audio signal andtransmits a single monophonic time domain signalwhereas intensity stereo coding is usually applied tohigh frequency components of a signal and thustransmits only spectral portions as a mono signalcomponent.

Figure 1 shows a scheme for BCC for natural rendering.The input audio channels are downmixed to one singlechannel, denoted sum signal. From psychoacoustics it isknown that the level difference (ICLD), time difference(ICTD), and coherence (ICC) between audio channelsare relevant parameters for the perception of theauditory spatial image [2]. (The �IC� in theabbreviations denote inter-channel). Given its multi-channel input, the BCC encoder estimates ICLD, ICTD,and ICC between channel pairs as a function offrequency (i.e. in different subbands) and time andtransmits these estimates to the decoder as sideinformation. Given the sum signal, the decodergenerates a multi-channel signal such that ICLD, ICTD,and ICC of the synthesized signal approximate those ofthe original multi-channel signal.

Figure 2 illustrates the scheme for ICTD, ICLD, andICC synthesis for generating a multi-channel audiosignal given the transmitted sum signal. By simplyimposing different delays and applying different gainfactors in subbands, a multi-channel signal with specificICTD and ICLD can be generated [12] [13]. It is lessobvious how to synthesize ICC. Several approaches forsynthesizing ICC for parametric stereo coding wereproposed previously [13] [26].

Figure 1: Generic BCC scheme. A number of inputsignals are downmixed to one channel and transmitted

to the decoder together with side information.

Figure 2: ICTD, ICLD, and ICC synthesis scheme.Processing is carried out individually for each frequency

band. ICTD and ICLD are synthesized by imposingdifferent delays and scale factors, respectively.

4. MP3 SURROUND CODING

The Binaural Cue Coding approach, as described in thepreceding section, may be combined with any type oflow bitrate audio coder to form an efficient system fortransmission and storage of multi-channel soundproviding two main functional aspects:

� Firstly (and probably most importantly), it enables abitrate-efficient representation of multi-channel audiosignals. Compared to a transmission of C discreteaudio channel signals, only one audio signal has to besent to the decoder together with a compact set ofspatial side information which results in impressivebitrate savings. As an example, the usual 5 channel(3/2) format is reduced into a single sum audio

Herre et al. MP3 Surround

Page 5 of 14

channel corresponding to an overall data reduction ofabout 80% (i.e. 4 out of 5 channels are dropped,neglecting the compact BCC side information).

� Secondly, the transmitted sum signal corresponds to amono downmix of the multi-channel signal. Forreceivers that do not support multi-channel soundreproduction, listening to the transmitted sum signalis thus a valid method of presenting the audiomaterial on low-profile monophonic reproductionsetups. Conversely, BCC can therefore also be usedto enhance existing services involving the delivery ofmonophonic audio material towards multi-channelaudio.

The latter aspect of BCC can be regarded as a bridgingfunction between monophonic and multi-channelrepresentation. When looking at today�s consumerelectronics world, however, the dominant sound formatis clearly 2-channel stereophony rather than amonophonic presentation. This motivates the use of astereo sound representation as the basis for a BCC-typealgorithm which then could scale up the informationcontained in these channels towards a multi-channelsound image. This is exactly the core idea behind theMP3 Surround approach, to be summarized as follows:

� Two audio channels are transmitted from the encoderto the decoder side forming a compatible stereodownmix of the multi-channel sound to berepresented.

� A BCC-type algorithm produces multi-channel soundat the decoder end by making best possible use of theinformation contained in the transmitted stereodownmix signal.

� The compact spatial side information is embeddedinto the basic stereo MP3 bitstream in a compatibleway, such that a standard MP3 decoder is notaffected.

The proposed algorithm adds scalability to the basicBCC scheme in terms of transmitting more than oneaudio channel. Due to the increase in informationavailable to the decoder, it is expected that the proposedscheme achieves higher quality than conventional BCCwith one transmission channel. Another way of addingscalability to BCC is to use a regular multi-channelcoder at lower frequencies and a mono audio coder withBCC at higher frequencies, as presented in [3].

4.1. Basic Scheme

Figure 3 illustrates the general structure of an MP3Surround encoder for the case of encoding a 3/2 multi-channel signal (L, R, C, Ls, Rs). As a first step, a two-channel compatible stereo downmix (Lc, Rc) isgenerated from the multi-channel material by adownmixing processor or other suitable means. Theresulting stereo signal is encoded by a conventionalMP3 encoder in a fully standards compliant way. At thesame time, a set of spatial parameters (ICLD, ICTD,ICC) are extracted from the multi-channel signal,possibly considering the stereo downmix signals. Thesespatial parameters are encoded and embedded assurround enhancement data into the ancillary data fieldof the MP3 bitstream within a suitable data containerthat unambiguously identifies the presence of such datafor decoders with corresponding extended capabilities(i.e. MP3 Surround decoding). This mechanism ofextending MPEG-1 Audio bitstream syntax has beenused repeatedly in the past to �hide� various types ofenhancement data in a backward compatible way, e.g.for embedding multi-channel audio extension [22] orbandwidth extension [30] data.

LRCLsRs

CompatibleDownmix

StandardMP3

Encoder

BCCEncoder

Lc

Rc

MP3 Surround

Bitstream

Surround Enhancement

AncillaryData

Figure 3: General structure of an MP3 Surround encoder(principle).

Figure 4 shows the decoder side of the transmissionchain. The MP3 Surround bitstream is decoded into acompatible stereo downmix signal that is ready forpresentation over a conventional 2-channel reproductionsetup (speakers or headphones). Since this step is basedon a fully compliant MPEG-1 Audio bitstream, anyexisting MP3 decoding device can perform this step andthus produce stereo output. MP3 Surround enableddecoders will furthermore detect the presence of theembedded surround enhancement information and, ifavailable, expand the compatible stereo signal into a fullmulti-channel audio signal using a BCC-type decoder.

Herre et al. MP3 Surround

Page 6 of 14

LRCLsRs

StandardMP3

Decoder

Lc

Rc

MP3 Surround

Bitstream

Surround Enhancement

AncillaryData

BCC Decoder

Figure 4: General structure of an MP3 Surround decoder(principle).

While the preceding example discussed theencoding/decoding of a 5 channel audio signal, othermulti-channel configurations can be supported in thesame way with this approach. This also includes the useof a subwoofer (LFE = �Low Frequency Enhancement�)channel, as it is used frequently for the representation ofmovie sound (5.1 configuration).

4.2. Multi-channel vs. Stereo Representation

As can be seen from the previous section, the MP3Surround process involves information from both amulti-channel version of the signal (for extraction ofspatial parameters) and a stereo version (for actualcompression and transmission). Therefore, there is aneed to provide both versions of the audio itemsimultaneously for which a number of options areexplored subsequently. The most common approach toobtaining a stereo version of a multi-channel signal iscalled downmixing and involves a linear combination ofcertain multi-channel signals to obtain the desired stereosignals. When downmixing multi-track/multi-channelsound material into a stereophonic representation, anumber of considerations come into play which aremotivated by both psychoacoustics and productionpractices. On one hand, it is desired to present all partsof the multi-channel sound image also to the listener ofa stereo reproduction setup. On the other hand, it isknown that � by collapsing front and back channels intothe front-only stereo reproduction � the listener�s abilityto separate the sound components diminishes due to thelack of spatial separation between front and back soundsources. Consequently, sound sources from backchannels are usually attenuated within a stereo mixdownin order to guarantee good audibility of the importantfront sound sources. In practice, there are different waysto produce corresponding stereo material for a multi-channel audio item:

Manual mix: In many cases, the sound engineerproduces a �manual� downmix of the multi-channel

sound sources into stereo using hand-optimized mixingparameters, thus preserving a maximum amount ofartistic freedom. If further flexibility is desired, differentrecording methods may be used for the production ofthe stereo version (e.g. different microphoneconfigurations).

Simple automatic downmixing: The most basic approachto create an automatic downmix from a given multi-channel recording is to use a fixed downmixingequation, such as the set of standard downmixingequations recommended by ITU-R for compatible 2-channel stereo reproduction of multi-channel signals [6][31]. As an example, the well-known equations for thecase of 3/2 multi-channel audio are:

Lc = t * ( L + a*Ls + b*C )Rc = t * ( R + a*Rs + b*C)

where Lc and Rc denote the compatible stereo signalsand t, a and b are constants. Usually, a=b=1/√2 arestandard choices and t is chosen such that an appropriateoutput signal level is achieved while avoiding clipping.For material with a high level of ambient rearcomponents, choosing a=0.5 is a common alternative.Even though a fixed downmixing approach is clearlysuboptimal compared to a dedicated manual stereo mix,it may in practice be sufficient for most applications.

Dynamic/advanced automatic downmixing: Over time,more advanced methods for automated multi-channeldownmix have become available which take intoconsideration factors such as absolute sourcepositioning, panning laws, the way sound sources weremixed into multi-channel signals and inter-channelphase relationships. Such advanced algorithms adapttheir downmixing behavior to the processed materialand may achieve a sonic quality that is comparable tothat of a �manual downmix� [14]. Dynamicdownmixing has also been applied to BCC for the caseof one transmission channel [13] and can be used forgenerating the MP3 Surround compatible stereodownmix signals likewise.

The basic approach of MP3 Surround does not imposeany restriction on what option for stereo downmixinghas to be used. In fact, the downmixing process can beconsidered as a system component that is notnecessarily part of the general coding scheme. This isillustrated in Figure 5 by showing a system that acceptsboth a 5-channel audio signal and a correspondingstereo version.

Herre et al. MP3 Surround

Page 7 of 14

LRCLsRs

StandardMP3

Encoder

BCC Encoder

Lc

Rc

MP3 Surround

Bitstream

Surround Enhancement

AncillaryData

Lc

Rc

Figure 5: MP3 Surround encoding using an externaldownmix process.

Looking at the decoding side (Figure 4), it becomesclear that the transmitted stereo signals form the basis ofthe recreated multi-channel sound image. Henceforth,the sound components that are desired to be present inthe multi-channel sound also need to be present in thestereo signal in an appropriate way to ensuresatisfactory reproduction results. As a simple thoughtexperiment, a solo instrument will not be reproducedproperly in the multi-channel signal if omitted from theunderlying stereo signal. Clearly, the use of well-behaved automatic downmixing processes guaranteesthat such pathological conditions do not occur. Morework is needed to establish a set of minimumrequirements for consistency between multi-channel andstereo signals that guarantee a proper decoded multi-channel sound image also for manual stereodownmixing.

In summary, different options exist for the productionprocess of the stereo signal version that allow a trade-off between the delivered quality on the stereo and onthe multi-channel side. These range from �use automaticstereo downmix optimized for best possible multi-channel reproduction� to �transmit best possible stereoquality, possibly at the expense of multi-channel soundquality�.

5. QUALITY EVALUATION

In order to assess the subjective sound quality of MP3Surround, a listening test was carried out to compare theperformance of the proposed scheme with that ofestablished multi-channel audio codecs. The generalperformance expectation of the MP3 Surround schemecan be derived from the following considerations:

� In MP3 Surround, the multi-channel sound image ismultiplexed (downmixed) into a stereo signal and�unpacked� (expanded) again at the decoder side.This is analogous to the well-known formats formatrixed surround, such as Prologic, Logic 7 etc.

� Contrary to such matrixed surround formats, MP3Surround makes use of some side information inorder to reconstruct the multi-channel sound image atthe decoder side. Ideally, the improvement due to thetransmission of side information might deliver asystem performance approaching that of a fullyindependent transmission of multi-channel material,i.e. a discrete surround format.

Consequently, both a common format for matrixedsurround and a state-of-the-art discrete multi-channelcodec were used for comparison with the MP3 Surroundcodec. For the matrixed surround format, DolbyProLogic II [9] was chosen, a coder that is primarilyintended for the transmission of multi channel audioover an analog stereo transmission line with the highestpossible stereo compatibility. This format is widely usedin current consumer electronic equipment. As a discretemulti-channel codec, MPEG-2/4 AAC was used at abitrate of 320 kbit/s to produce near transparent quality.This coder was found to deliver EBU broadcast qualityfor 5-channel material at a bitrate of 256-320 kbit/s informal verification tests [5]. The MP3 Surround coderwas run at a total bitrate of 192 kbit/s (including sideinformation) which is common for high quality stereoMP3.

In order to obtain both an absolute grade for each of thecodecs as well as a consistent relative rating amongthem, the listening test method chosen closely resemblesthe ITU recommendation BS.1534 (MUSHRA) [7].Several time-aligned audio signals were presented to thelistener who performed on-the-fly switching betweenthese signals using a keyboard and a screen. The signalsincluded the original signal, which was labeled as�Reference�, and several anonymized items, arranged atrandom. Using a simple graphic user interface software,the test subjects had to grade the basic audio quality ofthe anonymized items on a graphical scale with fiveequally sized regions labeled �Excellent�, �Good�,�Fair�, �Poor� and �Bad�. To check the listener�sreliability and enable relating the ratings to results ofother tests, a hidden reference (original) and an anchorwere included in addition to the three coded/decodeditems. The anchor consisted of a 3.5 kHz bandwidthreduced version of the reference, as is compulsory for

Herre et al. MP3 Surround

Page 8 of 14

the MUSHRA test methodology. There was no limit ofthe number of repetitions the test subjects could listen tobefore they would deliver their ratings and proceed tothe next test item. Due to its interactive nature, the testwas taken by one subject at a time.

Eleven items were selected for the listening test, ten ofthem being commercial music of different styles (3 popmusic, 3 jazz music, 4 classical music). The remainingitem was created artificially and features a strongperceived dissimilarity between different channelgroups (item �fountain�: the sound of a fountain on thecenter channel, a piano on the front side channels andsinging birds on the surround channels). The items werepresented in an acoustically isolated listening labequipped with high quality loudspeakers. Details on theequipment used and the listening test software can befound in the annex of this paper.

First experiments showed that it is very difficult toremember the surround image of the previous test signalby the time the next item is played. It is, therefore, veryhelpful for the listener to have the possibility of settingstart and stop markers to allow looping at arbitrarypositions, as is suggested in the BS.1116-1 [32] testspecification and can be used for MUSHRA testslikewise. Furthermore, the possibility of instantaneousswitching between different signals while they areplaying contributes greatly to enhancing the sensitivityof the listening test. The switching process wasimplemented by a 100ms fade-out of one signalfollowed by a 100ms fade-in on the next signal, bothusing raised cosine windows. In this way, all transitionsbetween signals sound identical. A software toolenabled these features and provided the possibility ofrating and randomization of presentation orderaccording to the MUSHRA specification [7]. Listenerswere allowed to set the playback level according to theirindividual preference.

Eight of the ten subjects were expert listeners with yearsof experience in audio coding, while the other two wereless experienced. Prior to the test phase the listenerswere instructed to take into account both the faithfulreproduction of the signal�s spatial image and distortionby perceptual coding artifacts for their ratings.

Figure 6 shows the results of the listening test as meanrating and 95% confidence interval for individual testitems together with their overall mean. As can be seen,the ratings of MPEG-2/4 AAC encoded items overlapwith those of the hidden reference in their confidenceinterval for all items. This shows that the listeners werenot able to distinguish between the coded/decoded itemsand the reference in a statistical sense. This outcome isconsistent with previous characterizations of the codecat this bitrate.

The quality of the ProLogic II encoded/decoded signalswas rated mostly in the �good� region of the gradingscale. Although there is some degree of variability in thesubjective ratings for this format, it is clearly visible thatlisteners are able to distinguish these signals from theoriginal. Listeners frequently reported a change in boththe general perception of the sound stage as well as thepositioning of certain sound components. As a note tothis result, it should be pointed out that in commercialapplications of matrixed surround formats it is commonto try to alleviate the limitations of the underlyingformat by optimizing the multi-channel mix at the timeof its production for good reproduction after Prologicencoding/decoding [10].

The quality of the MP3 Surround encoded/decodedsignals was mostly rated within the �excellent� range ofthe grading scale, and their confidence intervals overlapwith those of the original signal for two out of the 11test items. For the other nine cases, the listeners wereable to distinguish statistically between the MP3Surround coded version and the original. Test listenersreported no significant alterations of the spatial soundimage.

Considering that the basic coding efficiency of the MP3coder kernel is significantly lower than of MPEG-2/4AAC and that MP3 Surround uses a far lower bitratethan the MPEG-2/4 AAC coder (192 kbit/s instead of320 kbit/s), this test outcome can be seen as an excellentresult. In summary, it appears that the overall subjectivesound quality provided by the MP3 Surround system ismuch closer to that of a fully discrete multi-channelsystem than to a matrixed surround format even thoughMP3 Surround spends only a small fraction of itsoverall bitrate on encoding of the spatial information.

Herre et al. MP3 Surround

Page 9 of 14

Figure 6: Results of the subjective listening test.

6. INTERNATIONAL STANDARDIZATION

As it may have become clear from the precedingdiscussions, the general idea of applying BCC-typeprocessing to expand a compatible mono or stereosignal into a multi-channel sound image does not relyon the use of a particular type of audio coder. Due to theattractive properties of this approach, the MP3 Surroundscheme can be considered merely the first commercialapplication of a new coding paradigm for multi-channelaudio, and other schemes are likely to follow.Demonstrations of such combinations have beenprovided recently, including MPEG-2/4 AAC + BCC(67th MPEG meeting, Hawaii 2003, 5.1 multi-channel atca. 140 kbit/s) and MPEG-4 High-Efficiency AAC +

BCC (EBU workshop, Geneva 2/2004, 5.1 multi-channel at about 64 kbit/s).

In the area of international standardization, theISO/MPEG Audio group has noted these recentadvances and their market potential [19] and started anew work item with the working title Spatial AudioCoding. This process aims at complementing theexisting MPEG-4 AAC-based general audio codingschemes with a tool for efficient and compatiblerepresentation of multi-channel audio. It addresses bothtechnology that expands stereo signals into multi-channel sound (called �2-to-n� scheme) and the moretraditional mono variant (called �1-to-n� scheme). Thekey requirements are [25]:

� Best possible approximation of original perceivedmulti-channel sound image

excel.

Average and 95% Confidence Intervals

good

fair

poor

bad

America

SambaBarToccata

All ItemsChopin

FountainGershwin

HarpPops

PoulencQueen

1. Hidden Reference 2: Ref. 3.5kHz 3: MPEG4-AAC 4: MP3 Surround 5: ProLogic II

Herre et al. MP3 Surround

Page 10 of 14

� Minimal bitrate overhead compared to conventionaltransmission of 1 or 2 audio channels

� Bitstream backward compatibility by using anMPEG-4 AAC codec for the audio transmission

� Backward compatibility of transmitted audio signalwith existing mono or stereo reproduction systems,i.e. the transmitted audio channels shall represent acompatible (mono or stereo) audio signalrepresenting all parts of the multi-channel soundimage

As of the time of writing of this paper, the MPEG grouphas issued a �Call for Information� [24] to seekadditional input on the requirements for this new workitem and anticipates issuing a �Call for Proposals� (CfP)at its 68th meeting in March 2004 [25].

7. APPLICATIONS

Considering the general trend towards surround soundin consumer and professional audio, the followingsection illustrates some examples of applications thatare enabled through spatial audio coding in general andMP3 Surround technology specifically, focusing oncompatible multi-channel enhancements of existingservices.

The vast majority of current audio-visual distributioninfrastructure is tailored to delivering stereo rather thanmulti-channel audio, both in terms of availabletransmission bandwidth as well as the underlyingtechnical system structure. Thus, in order to enableupgrading such distribution media to multi-channelaudio, it is essential to deliver multi-channel audio atbitrates comparable to what is usually needed for thetransmission of stereo which is one of the key featuresof the MP3 Surround/spatial audio coding approach.

� Music download service: Currently, a number ofcommercial music download services are availableand working with considerable commercial success.Such services could be seamlessly extended toprovide multi-channel enabled services while stayingcompatible for stereo users. On computers with 5.1channel setups the MP3 Surround files are decoded insurround sound while on portable MP3 players thesame files are played back as stereo music.

� Streaming music service / Internet radio: ManyInternet radios are operating currently under severely

constrained bandwidth conditions and, therefore, canoffer only mono or stereo content. MP3 Surroundcould extend this to a full multi-channel servicewithin the permissible range of bit-rates. Sinceefficiency is of paramount importance in thisapplication, the compression aspect of MP3 Surroundcomes into play. As an example, representation of 5presentation channels from two transmitted basischannels corresponds to a bit-rate saving of (5-2)/5 =60% compared to full multi-channel coding,neglecting the small amount of surroundenhancement side information.

� Digital Audio Broadcasting: Due to available channelcapacity, the majority of existing or planned systemsfor digital broadcasting of audio content cannotprovide multi-channel sound to the users. Adding thisfeature would, however, be a strong motivation forusers to make the transition from their traditional FMreceivers to the new digital systems. Again,compatibility with existing stereo reproduction setupsis mandatory and inherent to the MP3 Surround /spatial audio coding concept.

� Teleconferencing: Although teleconferencing isbecoming increasingly popular and important intoday�s global business world, most systems are stillworking with rather low bandwidth. A spatial audiocoding approach can help expand the sound image tomulti-channel sound and, thus, allow a bettersubjective separation and resolution of the audiocontributions of each speaker in the teleconference.

� Audio for Games: Many personal computers havebecome �personal gaming engines� and are equippedwith a 5.1 computer speaker setup. Synthesizing 5.1sound from a backward compatible stereo soundbasis allows for an efficient storage of multi-channelbackground music.

If storage efficiency is of chief importance for specificapplications, a spatial audio coding approach based on amono channel (�1-to-n scheme�) can be used. For 5channels, a bit-rate saving of 80% compared to adiscrete multi-channel transmission is then achieved.For stereo-only applications, a �1-to-2� scheme may bedesirable which corresponds to the case of parametricstereo as is addressed in the upcoming ISO/MPEG-4Extension specification for use with a parametric audiocoder [23].

Herre et al. MP3 Surround

Page 11 of 14

8. CONCLUSIONS

This paper introduces an extension of the popular MP3audio coding format towards the bitrate-efficientrepresentation of multi-channel audio signals, mostprominently the 5.0 and 5.1 channel configurations.This is achieved by extending a stereo MP3 scheme bymeans of Binaural Cue Coding (BCC) that serves as aspatial pre/post processor to the MP3 encoder/decoderchain. An important feature of the resulting format isfull backward compatibility to existing MP3 decoders,which reproduce a complete stereo downmix of themulti-channel sound material. In order to guide thespatial decoding process in an MP3 Surround decoder, asmall amount of spatial side information is hiddeninside the MP3 bitstream in a compatible way within theancillary data field.

Contrary to earlier approaches to parametricrepresentation of stereo and multi-channel audio, theproposed algorithm transmits two (stereo compatible)signal channels rather than only a single channel. This isa significant and important extension of the generaltechnical paradigm because it enables backwardcompatibility with the huge number of stereoreproduction setups and media currently in existence.The MP3 Surround scheme and the underlying generalideas pave the way for using multi-channel sound for anumber of attractive applications which wereinconceivable in the past, such as multi-channel Internetradio and music download services.

Results of first informal listening tests indicate that theproposed scheme provides an excellent combination oflow bitrate and sound quality which is significantlybetter than that of matrixed surround formats. The ideaof expanding backward compatible monophonic andstereophonic sound signals has been embraced by theMPEG standardization group. The new work item�Spatial Audio Coding� was initiated to undertakefurther development of this concept in the context ofMPEG-4 Audio. Over time we will see further technicaldevelopment of such algorithms and combinations withmore powerful audio coders which will bring the visionof multi-channel audio at very low bitrates (<64 kbit/s)into reality.

9. ANNEX

The subjective listening test was conducted in anacoustically isolated listening lab that is designed topermit high-quality listening tests conforming to theBS.1116 [32] test methodology for high-quality audiolistening tests. The room dimension was 5.5m x 7.1m x2.4m. All test signals were presented on 5 Geithain RL901 active studio loudspeakers and a Geithain TT 920active subwoofer driven by Bryston pre-amplifiers.Playback was controlled from a 2 GHz Linux computerwith an RME Hammerfall digital sound output interfaceconnected to Lake People DAC F20 D/A converters.

For the generation of the AAC bitstreams theFraunhofer IIS professional MPEG-4 AAC LowComplexity Profile encoder was used. The encoder wasinitialized with default psychoacoustic parameters at abitrate of 320 kbps.

The Dolby ProLogic II signals were generated with theSurCode for Dolby ProLogic II Version 2.0.3 softwareby Minnetonka Audio Software using the defaultsettings. In order to guarantee perceptually equalloudness of the Dolby ProLogic II signals, a loudnessalignment procedure was necessary. The amplificationfactor for each ProLogic II decoded signal wasdetermined by adjusting its amplification in steps of 1dB until there was no perceptual difference in loudnesscompared to the original signal when switching betweenboth signals. The procedure was repeated for severallisteners. This resulted in an equal amplification of 3 dBfor all ProLogic II decoded test signals. The signals hadsufficient headroom to enable this gain change withoutintroducing clipping.

The original items were obtained partly from specialaudio test compilations and partly recorded from theanalog output of a Panasonic DVD A7 DVD-Audioplayer using an RME ADI 8 DS A/D converter.

Herre et al. MP3 Surround

Page 12 of 14

Item Format Title Artist Album Style

America 5.1 Head and Heart America HomecomingDVD Audio

Pop

Bar 5.1 My Love Is Here M. Matthews DVD Audio(Panasonic/Technics)

Jazz

Chopin 5.0 Piano Sonata Nr. 3Finale

n/a Multi-Channel DemoMaterial (DenonElectronic GmbH)

Classic

Fountain 5.0 Fountain Music - MPEG-2 Multi-ChannelVerification Test Material

n/a

Gershwin 5.0 I Got Rhythm n/a Multi-Channel DemoMaterial (DenonElectronic GmbH)

Jazz

Harp 5.0 G.F. Händel: ConcertoFor Harp And Strings

New YorkSymphonicEnsemble

DVD Audio(Panasonic/Technics)

Classic

Pops 5.0 n/a n/a n/a Pop

Poulenc 5.0 Concerto In G MinorFor Organ, String AndTimpani

n/a Multi-Channel DemoMaterial (DenonElectronic GmbH)

Classic

Queen 5.1 You´re My Best Friend Queen A Night At The OperaDVD Audio

Pop

Samba 5.0 One Note Samba Hamamura Quintett Audionet STANDARDSNo. 1: RETOLD, DVDAudio

Jazz

Toccata 5.0 J.S. Bach: Toccata AndFugue In D Minor

Mika Oi DVD Audio(Panasonic/Technics)

Classic

Table 1: Description of the listening test items.

10. REFERENCES

[1] F. Baumgarte and C. Faller: �Why Binaural CueCoding is better than Intensity Stereo,� 112th AESConvention, Munich 2002. Preprint 5575

[2] F. Baumgarte and C. Faller: �Binaural Cue Coding- Part I: Psychoacoustic fundamentals and designprinciples,� IEEE Trans. on Speech and AudioProc., vol. 11, no. 6, November 2003

[3] F. Baumgarte, C. Faller, P. Kroon: �Audio CoderEnhancement using Scalable Binaural Cue Codingwith Equalized Mixing,� 116th AES Convention,Berlin 2004

[4] J. Blauert, �Spatial Hearing: The Psychophysics ofHuman Sound Localization�, revised edition, MITPress, 1997

[5] M. Bosi, K. Brandenburg, S. Quackenbush, L.Fielder, K. Akagiri, H. Fuchs, M. Dietz, J. Herre,G. Davidson, Oikawa: �ISO/IEC MPEG-2Advanced Audio Coding�, Journal of the AES, Vol.45, No. 10, October 1997, pp. 789-814

[6] ITU-R Recommendation BS.775-1, �Multi-channelStereophonic Sound System with or withoutAccompanying Pic ture� , In terna t ionalTelecommunications Union, Geneva, Switzerland,1992-1994

[7] ITU-R Recommendation BS.1534-1, �Method forthe Subjective Assessment of Intermediate SoundQ u a l i t y ( M U S H R A ) � , I n t e r n a t i o n a l

Herre et al. MP3 Surround

Page 13 of 14

Telecommunications Union, Geneva, Switzerland,2001

[8] M. Davis, �The AC-3 Multichannel Coder�, 95thAES Convention, New York, 1993, Preprint 3774

[9] Dolby Publication, Roger Dressler: �DolbySurround Prologic Decoder � Principles ofOperation�,http://www.dolby.com/tech/whtppr.html

[10] D o l b y P u b l i c a t i o n , J i m Hilson:�Mixing with Dolby Pro Logic II Technology�,http://www.dolby.com/tech/PLII.Mixing.JimHilson.html

[11] C. Faller and F. Baumgarte, �Binaural Cue Coding:A novel and efficient representation of spatialaudio,� Proc. ICASSP 2002, Orlando, Florida, May2002

[12] C. Faller and F. Baumgarte, �Binaural Cue Codingapplied to stereo and multi-channel audiocompression,� 112th AES Convention, Munich2002. Preprint 5574

[13] C. Faller and F. Baumgarte, �Binaural Cue Coding- Part II: Schemes and applications,� IEEE Trans.on Speech and Audio Proc., vol. 11, no. 6, Nov.2003

[14] D. Griesinger: �Surround from stereo�, Workshop#12, 115th AES Convention, New York, 2003

[15] J. Herre, K. Brandenburg, D. Lederer: �IntensityStereo Coding�, 96th AES Convention, Amsterdam1994, Preprint 3799

[16] J. D. Johnston, A. J. Ferreira, �Sum-DifferenceStereo Transform Coding�, IEEE ICASSP 1992,pp. 569-571

[17] J. D. Johnston, J. Herre, M. Davis, U. Gbur:�MPEG-2 NBC Audio - Stereo and MultichannelCoding Methods�, 101st AES Convention, LosAngeles 1996, Preprint 4383

[18] M. Lutzky, M. Weishart, J. Hilpert, B. Grill, H.Gernhardt, M. Haertl: �MP3 in MPEG-4�, 115thAES Convention, New York 2003, Preprint 5870

[19] ISO/IEC JTC1/SC29/WG11 (MPEG), DocumentM10378, �Spatial Audio Coding: Market Contextand Requirements�, Hawaii 2003

[20] ISO/IEC JTC1/SC29/WG11 MPEG, InternationalStandard ISO/IEC 13818-7 �Generic Coding ofMoving Pictures and Associated Audio: AdvancedAudio Coding�, 1997

[21] ISO/IEC JTC1/SC29/WG11 (MPEG), InternationalStandard ISO/IEC 11172-3 �Coding of movingpictures and associated audio for digital storagemedia at up to about 1.5 Mbit/s�, 1992

[22] ISO/IEC JTC1/SC29/WG11 (MPEG), InternationalStandard ISO/IEC 13818-3 �Generic Coding ofMoving Pictures and Associated Audio: Audio�,1994

[23] ISO/IEC JTC1/SC29/WG11 (MPEG), DocumentN6130, �Text of ISO/IEC 14496-3:2001/FDAM2�,Hawaii 2003

[24] ISO/IEC JTC1/SC29/WG11 (MPEG), DocumentM10378, �Call for Information on Spatial AudioCoding�, Hawaii 2003

[25] ISO/IEC JTC1/SC29/WG11 (MPEG), DocumentM10378, �Draft Call for Proposals on SpatialAudio Coding�, Hawaii 2003

[26] E. Schuijers, W. Oomen, B. den Brinker, and J.Breebaart, �Advances in parametric coding forhigh-quality audio,� 114th AES Convention,Amsterdam 2003, Preprint 5852

[27] G. Stoll, G. Theile, S. Nielsen, A. Silzle, M. Link,R. Sedlmayer, A. Breford, �Extension ofISO/MPEG-Audio Layer II to Multi-ChannelCoding: The Future Standard for Broadcasting,Te l ecommunica t ion , and Mul t imed iaApplications�, 94th AES Convention, Berlin 1994,Preprint 3550

[28] AES Technical Committee of Coding of AudioSignals: �Perceptual Audio Coders: What to listenfor�, CD-ROM with tutorial information and audioexamples, AES publications

[29] R.G. v.d.Waal, R.N.J. Veldhuis, �Subband Codingof Stereophonic Digital Audio Signals�, IEEEICASSP 1991, pp 3601-3604.

Herre et al. MP3 Surround

Page 14 of 14

[30] T. Ziegler, A. Ehret, P. Ekstrand, M. Lutzky,�Enhancing MP3 with SBR: Features andCapabilities of the New MP3PRO Algorithm�,112th AES Convention, Munich 2002, Preprint5560

[31] S. K. Zielinski, F. Rumsey: �Effects of Down-MixAlgorithms on Quality of Surround Sound�, Journalof the AES, pp. 790, September 2003

[32] ITU-R Recommendation BS.1116-1 �Methods forthe Subjective Assessment of Small Impairments inAudio Systems Including Multichannel SoundSystems�, International TelecommunicationsUnion, Geneva Swizerland, 1994-1997