

Automatic Timbre Mutation of Drum Loops

Jason A. Hockman NYU Music Technology

Submitted in partial fulfillment of the requirements for the Master of Music in Music Technology

In the Department of Music and the Performing Arts Professions In The Steinhardt School

New York University

Advisors: Kenneth J. Peacock, Robert J. Rowe 2007/05/04


Acknowledgements

As this work is an abstraction of previous work, I must use this space to extend thanks to those people who have helped me in developing the algorithm as it exists thus far. First and foremost, I must credit the knowledge and guidance of Dr. Juan P. Bello, whose tutoring in the use of the MATLAB coding environment and amassed knowledge of the subject material aided me through the various phases of the project's design and implementation. In particular, Dr. Bello's previous work on onset detection (Bello et al., 2004) and on rhythm modification of drum loops (Ravelli et al., 2007) made this project possible.

I must also thank the NYU Music Technology Program, which has brought me further into sound construction than I knew possible only two years ago. I had originally applied to the program thinking that I would continue working in the field of electronic dance music production; however, through the guidance of Kenneth Peacock, Robert Rowe, Dafna Naphtali, and Richard Boulanger, I have since expanded my scope to include research in Music Information Retrieval (MIR) and Digital Audio Effects (DAFX). The NYU Music Technology Forum, which generally meets on a bi-monthly basis, has been a constant source of inspiration for the project, and has provided essential feedback on the various decisions made throughout the algorithm's development.

I would also like to extend special thanks to Melissa Czajkowski, Liana Eagle, and Ernest Li for their considerate critique and thoughtfulness throughout the project. Finally, I must thank my parents for making such an opportunity possible.


Abstract

The presented algorithm demonstrates a method by which a percussive loop is automatically segmented into its constituent parts, which are then classified and appropriately resequenced to match the components of a second loop, which undergoes the same process. The two loops are then synthesized together, component by component, and output is presented as 1) a unified .wav file, 2) individual slices (also output as .wav files), and 3) a MIDI file, for those musicians who desire to work with a traditional sequencer and sampler. The spectral mix between the two loops is defined by simple coefficient controls over both magnitude and phase values. The algorithm builds upon recent work by Ravelli, Bello, and Sandler (Automatic Rhythm Modification of Drum Loops, 2007) [1], and is relevant to music composers/producers who wish to easily and efficiently create rhythmic and timbral variations of static loops.


CONTENTS

Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.1 The Early Years . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Innovation in Timbre . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Computer Approaches . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.1 Onset Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Complex-Domain Onset Detection . . . . . . . . . . . . . . . . . . . 13
2.3 Peak-Picking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 k-means Classification . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Cluster Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4 Resequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.1 Substitution Method . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Proportional Centroid Distance . . . . . . . . . . . . . . . . . . . 23
4.3 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5 Timbre Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.1 Possible Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Phase Vocoding Mutation . . . . . . . . . . . . . . . . . . . . . . . 27
5.3 User-Defined Mix Levels . . . . . . . . . . . . . . . . . . . . . . . 30
5.4 Window Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

6 Further Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

6.1 MIDI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.2 Slicing and Combined Output . . . . . . . . . . . . . . . . . . . . . 33
6.3 Gaps and Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.4 Omission of Timestretching Functionality . . . . . . . . . . . . . . 34

7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35


8 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

8.1 TimbreMut.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
8.2 DetFuncA.m & DetFuncB.m . . . . . . . . . . . . . . . . . . . . . . . 42
8.3 LocalMaxima.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.4 PhaseVoxMut.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46


1. INTRODUCTION


“Rhythm imposes unanimity among the divergent”

- Yehudi Menuhin

Modern electronic musicians and producers benefit from the manipulation and processing of pre-recorded drum loops. Whether for commercial or personal purposes, musicians rely increasingly upon the timbres achieved through the use of large databases of drum loops. Oftentimes, however, these samples begin to sound "flat" or mundane to the ear, both sonically and rhythmically, and further modification is required to generate subtle nuance and uniqueness.

This paper presents a method for the automatic timbre and rhythmic mutation of two percussive loops. The process combines algorithms for segmentation, classification, resequencing, and synthesis. The aim of the project is to introduce a system by which a mono .wav file is systematically separated into its constituent events, which are then given a simple classification, resequenced to the rhythmic timing of another percussive .wav file, and finally morphed together in a cross-synthesis technique. The main goals of the project are thus 1) to provide a robust segmentation method capable of proper functionality given a large database of diverse samples, while 2) minimizing the occurrence of class mismatches, such that a kick will not be synthesized with a non-kick. Interface goals are 1) overall timeliness of the algorithm, 2) dutiful efficiency and accurate response, and 3) sonic clarity. Timeliness is gauged by timing functions within the coding structure; accuracy is determined through analysis with database samples; and sonic clarity is a subjective measure assessed by listener tests.

The remainder of the paper is presented as follows: after a brief history of loop-based music and of technology specifically related to loop-based music production, Section 2 presents the detection algorithm and peak-picking process. Section 3 deals with classification and event labeling. Resequencing methodology is discussed in Section 4. The timbre mutation technique is presented in Section 5. The algorithm's output and further considerations are explained in Section 6, and conclusions follow in Section 7. The MATLAB code for the presented algorithm is printed in the Appendix, along with database information.


1.1 The Early Years*

The Sugar Hill Gang released their seminal hit "Rapper's Delight" in the summer of 1979; this was the world's first glimpse of a new form of urban music later defined as hip hop, and the first popular-culture hit to sample a drum riff from another band. Sylvia Robinson, Sugar Hill's producer, had used a drum break (used synonymously with riff or loop, meaning a brief drum solo, often in backbeat form) from Chic's chart-topping "Good Times". Although not truly the first hip hop song (most recognize the Fatback Band's "King Tim III (Personality Jock)" as the original), "Rapper's Delight" heralded the beginning of an era of repetitive percussive patterns, spliced together from previously existing material, as a basic building block to layer upon.

"Rapper's Delight" was nothing new to the residents of east coast tenements in 1979. In fact, there had been a long-standing tradition of privately owned sound-system and DJing culture within urban centers for the majority of the decade. Building upon disco DJ methods, Clive Campbell, aka Kool Herc, was the first to use a mixer to quickly "cut" between two copies of the same record to prolong the sonic climax of a break. Campbell, then Afrika Bambaataa and Grandmaster Flash with greater precision, soon began to layer the breaks by beatmatching (a technique made possible by the variable user-defined pitch control on performance turntables) and by cutting in edits, or phrases from different breaks. During beatmatching, a DJ matches the tempo of one record to that of another. Generally, this process is performed on percussive sections of rhythm-based music, often for the purpose of vertically aligning structural components by measure. By 1979, the use of breaks as a percussion element had become a standard, not just at block parties but as a production technique as well. By the end of that same summer, breakbeats had completely overtaken the backing tracks of top 40 R&B hits.

1.2 Innovation in Timbre

For the next few years, the timbres of hip-hop breaks remained unchanged from the recorded drum sounds of the previous generation of rhythm and blues hits. The early 1980s provided users with low-cost drum machines, namely Roland's TR-808, equipped with a flexible sequencer as well as duration and pitch control over a variety of drum timbres; a freedom that liberated artists from the confines of pre-recorded loop structures and legal issues [16]. "Planet Rock" (1982) by Afrika Bambaataa demonstrated the possibilities of these new timbres, and with its release created a sub-genre of futurism that would be continued years later in the layered textures of European techno, then jungle and drum and bass.

* As explained by A. Light, "The History of Hip Hop", New York: Three Rivers Press, 1999 [16].


The need to remain innovative pushed producers to create a hybridization of electronic and live drum timbres in the mid-to-late 1980s, and as sampling technology became more affordable, the two could exist side by side on hardware samplers such as the 8-bit Ensoniq Mirage (1985) and the later 16-bit Akai MPC and E-Mu E-series samplers. On-board DSP processors on samplers from the early 1990s gave groups such as Eric B & Rakim and Public Enemy the capability of blending samples through equalization or sideband compression, and more recently through convolution between samples (as seen on the E-mu E4 Platinum).

Layering drum loops provides musicians and producers with a simple yet effective method to create stronger percussive events, as well as to add necessary variation to standard loop configurations which might otherwise sound commonplace and mundane. The blending of loops, in itself a complex process involving segmentation, compression, equalization, and amplitude control, has been sped up drastically by developments in graphical displays and processor speeds, yet it is still a daunting task.

1.3 Computer Approaches

Modern advances in personal computing have created a new market for the laptop/desktop musician. Sampling and synthesis have, for the most part, moved into the software realm, alongside sequencing programs such as Pro Tools, Digital Performer, Cubase, and Logic. Programs such as Propellerhead's ReCycle (http://www.propellerheads.se/products/recycle) have provided digital musicians with an alternative to hand-selecting the divisions of a loop: a user-set segmentation intensity level (0-100), which operates by time-domain transient detection coupled with detection of sudden amplitude-envelope deviations. The unfortunate truth is that for most loops the detection algorithm finds inaccurate onsets, and for those correctly chosen, places slice points several samples after the actual inception of the onsets, causing clipping at both the start and end of the slices. Generally, a user must then select the segmentation intensity which provides the closest-fit onset schematic for each loop, then systematically add missing onsets, remove false ones, and verify the accuracy of each onset point. A useful function of ReCycle, however, is its ability to output sequential individual slices and a MIDI (Musical Instrument Digital Interface) file for control purposes. As the majority of electronic musicians have not yet abandoned the MIDI platform, this practical feature is also available in the presented segmentation process.

FXpansion offers GURU (http://www.fxpansion.com/product-guru-main.php), a software groove box that provides factory-preset loop manipulation, as well as segmentation and resequencing of input loops. The slicing is quite accurate; however, the present implementation only allows the definition of singular note events, such that there can only be one kick, one snare, and one hat determined per input loop. The appearance of complexity within loop-based music is partially derived from subtle alterations between samples, such that there are several kicks, snares, and hats, as opposed to one. Still, GURU does offer several processing effects, such as LFOs and tape effects, for this kind of manipulation.


Ravelli et al. (2007) have created a method of segmentation and resequencing which provides the basis for this paper's presented algorithm [1, 2]. In their work, two mono drum loops are segmented and spectrally classified, and one loop, the "original", is resequenced to the rhythm of the "model" loop. Segmentation is executed through complex-domain onset detection [3], classification through the k-means clustering method, and resequencing via a substitution matrix. A more detailed explanation of this procedure can be found in [1] and [2]; however, as I incorporate similar detection and classification schemes in the presented algorithm, these will be reviewed in subsequent sections of this paper.


1.4 Algorithm Overview

The algorithm comprises three stages: 1) Analysis, 2) Resequencing, and 3) Synthesis. The analysis stage can be seen as a sequence of two steps: i) onset detection and segmentation of two percussive samples, file A and file B; ii) categorization via a simple classification iteration. Resequencing involves the matching of file B to file A. The synthesis stage involves i) phase vocoder mutation and ii) resynthesis (combined .wav output). Figure 1 shows a flow chart of the algorithm.

Figure 1: Timbre Mutation Overview

The TimbreMut.m algorithm is designed to function with a wide variety of standard drum loops; however, its analysis architecture also makes it applicable to non-traditional musical timbres, such as found sounds and sound effects. The algorithm may, for example, be used to blend the characteristics of one loop with another's, as an effect on a pre-existing drum loop, or simply to generate interesting non-rhythmic combinations of sounds.
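The three-stage flow described above can be illustrated with a toy sketch. This is hypothetical Python, not the thesis's MATLAB: pre-labelled events stand in for audio slices, and the function and variable names are illustrative only.

```python
# Toy illustration of Analysis -> Resequencing -> Synthesis using labelled
# events instead of audio. Each event is (onset_time_sec, class_label),
# i.e. what analysis of a real loop would produce.
loop_a = [(0.0, 'k'), (0.5, 'h'), (1.0, 's'), (1.5, 'h')]   # "model" rhythm
loop_b = [(0.0, 'k'), (0.7, 's'), (1.2, 'h')]               # "original" timbres

def resequence(model, original):
    """Pair each model event with a same-class event from the original loop."""
    schedule = []
    for t, cls in model:
        matches = [e for e in original if e[1] == cls]
        schedule.append((t, cls, matches[0] if matches else None))
    return schedule

schedule = resequence(loop_a, loop_b)
# Synthesis would then cross-mix each matched pair of slices at time t.
```

The key property the sketch preserves is that a kick is only ever matched with another kick, never with a non-kick, which is the class-mismatch constraint stated in the introduction.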


2. SEGMENTATION


2.1 Onset Detection

The aim of onset detection is to provide a clear inception point of an event, prior to the attack portion of the event, at which the amplitude is elevated and may be measured. Bello et al. (2005) define an onset as the specific moment chosen to denote the inception of a temporally extended transient, or period during which the signal may be characterized as in excitation [5]. There are several methods by which this calculation may be made for percussive signals, in either the time or frequency domain.

The most basic form of onset detection may be performed through analysis of amplitude elevations in the time domain. Success of this analysis scheme is limited to signals in which clear separation between loud note events exists. Softer onsets, such as those often found in the kick drum and hat/cymbal classes, are often ignored by these detection schemes, bypassed in favor of the more prevalent noisy characteristics of the snare drums. For the purposes of the algorithm discussed here, a more robust detection method is needed, which requires an examination of the signal within the frequency domain. In (1), the incoming signal is parsed into individual equal-length frames, windowed, and input into a discrete Fourier transform [6].

$X(k) = \mathrm{DFT}[x(n)] = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}$   (1)
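The framing and transform in (1) can be sketched in Python/NumPy (a stand-in for the thesis's MATLAB; the 512-sample window and 50% hop are assumptions here, though a 512-sample window at 44.1 kHz is mentioned later in the text):

```python
import numpy as np

# Split a signal into equal-length windowed frames and take the DFT of each,
# i.e. a basic short-time Fourier transform as described around (1).
def stft(x, frame_len=512, hop=256):
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.fft(frames, axis=1)    # rows: frames n, columns: bins k

x = np.sin(2 * np.pi * 440 * np.arange(4410) / 44100)   # 0.1 s test tone
X = stft(x)                                             # complex (frames x bins)
```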

The transform produces various spectral features, assessed in different detection functions. Local energy, a detection function also possible in the temporal domain, is achieved spectrally through simple analysis of the magnitudes across the FFT bins.

$LE(n) = \frac{1}{N} \sum_{k=-N/2}^{N/2} \left| X_k(n) \right|^2$   (2)

Rodet and Jaillet (2001) present a more robust analysis, devised by weighting the individual frequency bins, which results in increased sensitivity towards the higher frequencies, a characteristic which exploits alterations in percussive signals well [7].

$HFC(n) = \frac{1}{N} \sum_{k=-N/2}^{N/2} \left| X_k(n) \right|^2 k$   (3)

Both local energy and HFC approaches often bypass softer or tonal onsets, instead opting for louder, noisy onsets.
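The two magnitude-based functions (2) and (3) can be sketched as follows (Python/NumPy stand-in; for simplicity the bin index here runs over non-negative bins only, rather than the symmetric range in the equations):

```python
import numpy as np

# Local energy (2): mean squared magnitude per STFT frame.
def local_energy(X):
    return np.sum(np.abs(X)**2, axis=1) / X.shape[1]

# High-frequency content (3): the same sum, weighted by the bin index k,
# so energy in high bins counts for more.
def hfc(X):
    k = np.arange(X.shape[1])
    return np.sum((np.abs(X)**2) * k, axis=1) / X.shape[1]

# Two synthetic frames with equal energy: one low-frequency partial,
# one high-frequency partial.
X = np.zeros((2, 8), dtype=complex)
X[0, 1] = 1.0    # low-bin partial
X[1, 6] = 1.0    # high-bin partial
```

Both frames carry identical local energy, but the bin weighting makes HFC respond much more strongly to the high-frequency frame, which is exactly why it favors noisy percussive onsets.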


Alternatively, Bello and Sandler (2003) determine onsets through deviation of the phase outputs of the FFT. Analysis of the mean absolute phase deviation provides an efficient characterization of tonal onsets. Unlike the aforementioned magnitude-based calculations, the phase-based approach fares well in the detection of kick drums and tonal percussive sounds, regardless of amplitude levels. In each STFT frame, the phases of each bin are output and stored. During onset excitation, significant differentiation will be present due to high variance within the signal over a short period of time, resulting in large peaks, while steady sinusoidal points will demonstrate derivative values nearer to zero. In (4), the change in phase for frame n is represented by an analysis of the present frame ($\varphi_k(n)$) and the two prior frames ($\varphi_k(n-m)$) [8].

$\Delta\varphi_k(n) = \varphi_k(n) - 2\varphi_k(n-1) + \varphi_k(n-2) \approx 0$   (4)

(5) demonstrates this measure as the summed absolute phase deviations across the N bins [8]. Although the mean absolute phase deviation is a well-suited measure for tonal onsets, it does not represent noisy transients as well, as high-frequency behavior is not as easily tracked from frame to frame.

$\eta_p(n) = \frac{1}{N} \sum_{k=1}^{N} \left| \Delta\varphi_k(n) \right|$   (5)

2.2 Complex-Domain Onset Detection

Bello et al. (2004) describe a method by which the phase- and energy-based approaches may be combined, capable of extracting onsets from both high- and low-energy events, as well as events characterized by low or high frequencies. The method, fully explained in [3], creates present magnitude ($|X_k(n)|$) and target magnitude ($|\hat{X}_k(n)|$), and present phase ($\varphi_k(n)$) and target phase ($\hat{\varphi}_k(n)$) values from the STFT calculation. The present values are derived from (6):

$X_k(n) = \left| X_k(n) \right| e^{j \varphi_k(n)}$   (6)

and the target calculations are derived from (7):

$\hat{X}_k(n) = \left| \hat{X}_k(n) \right| e^{j \hat{\varphi}_k(n)}$   (7)

The target amplitude may then be understood as the previous frame's value, as in (8), and using the phase-unwrapping method princarg, the target phase may be obtained by analysis of the previous two frames (9).


$\left| \hat{X}_k(n) \right| = \left| X_k(n-1) \right|$   (8)

$\hat{\varphi}_k(n) = \mathrm{princarg}\left[ 2\varphi_k(n-1) - \varphi_k(n-2) \right]$   (9)

Then, bin stationarity is measured by a comparison of past to present STFT frames:

$\Gamma_k(n) = \left\{ \left| \hat{X}_k(n) \right|^2 + \left| X_k(n) \right|^2 - 2 \left| \hat{X}_k(n) \right| \left| X_k(n) \right| \cos\!\left( \Delta\varphi_k(n) \right) \right\}^{1/2}$   (10)

Summing these values across the bins k, the full complex-domain onset detection function may then be represented by:

$CDOD(n) = \sum_{k=1}^{K} \Gamma_k(n)$   (11)

This method, chosen for the TimbreMut.m algorithm, is presented in Appendix 8.2, within DetFuncA.m on lines 39-40, or alternatively in DetFuncB.m on lines 37-38. Through this method, onsets may be visualized by simply plotting the output vector given by onDet (see Appendix 8.2). The derivative of the detection function may then be taken to provide a precise point of event inception. Figure 2 provides a graphical display of three stages of the detection process: (a) wavread, (b) complex-domain onset detection, and (c) derivative of the detection function.
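The complex-domain function (6)-(11) can be sketched as follows (Python/NumPy stand-in for the MATLAB in Appendix 8.2; the princarg implementation is a commonly used form, assumed rather than taken from the thesis code):

```python
import numpy as np

def princarg(phi):
    """Map a phase value into the (-pi, pi] range (phase-unwrap helper)."""
    return np.mod(phi + np.pi, -2 * np.pi) + np.pi

def cdod(X):
    """Complex-domain onset detection over a (frames x bins) STFT matrix."""
    mag, phi = np.abs(X), np.angle(X)
    det = np.zeros(X.shape[0])
    for n in range(2, X.shape[0]):
        target_mag = mag[n - 1]                               # (8)
        target_phi = princarg(2 * phi[n - 1] - phi[n - 2])    # (9)
        dphi = princarg(phi[n] - target_phi)
        g2 = (target_mag**2 + mag[n]**2
              - 2 * target_mag * mag[n] * np.cos(dphi))       # (10), squared
        det[n] = np.sqrt(np.maximum(g2, 0.0)).sum()           # (11)
    return det

# A steady partial predicts perfectly (near-zero detection values), while an
# amplitude jump at frame 4 produces a clear peak.
X = np.array([[0, (2.0 if n == 4 else 1.0) * np.exp(1j * 0.3 * n), 0, 0]
              for n in range(6)])
det = cdod(X)
```

Because the prediction error combines magnitude and phase deviation in one term, the same function flags both loud noisy onsets and quiet tonal ones, which is the motivation given above for preferring it over (2), (3), or (5) alone.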


(a) Wavread of 10.wav

(b) Complex-Domain Onset Detection of 10.wav

(c) Derivative of Complex-Domain Onset Detection of 10.wav

Figure 2: Onset Detection Preparation

2.3 Peak-Picking

Once the detection function has characterized the events, a peak-picking method is required to determine which of the available points are actually onsets. To ensure a more accurate decision during peak-picking, Ravelli et al. (2007) present the use of a difference function to streamline the data such that the output becomes even less noisy [1].


Another precursor to the actual peak-picking function is thresholding. The two methods available are fixed and adaptive thresholding. If the incoming signal were prepared such that it was always of comparable amplitude, by, say, a hard limiter, then a fixed threshold might be an ideal choice. However, as percussive events may be encountered across a wide range of styles and onset amplitudes, it is difficult to define a static amplitude threshold which satisfies all (or even many) percussive signals.

2.3.1 Adaptive Thresholding

Automatic threshold calculation is an essential characteristic of the algorithm, as it removes the otherwise constant threshold adjustments necessitated by each specific sample's amplitude deviations. Adaptive thresholding is performed through a mean-absolute filter that efficiently follows the shape of the detection vector, successfully eliminating floor noise while allowing local maxima to remain after peak-picking. The adaptive threshold used in TimbreMut.m, a variation of the threshold presented by Duxbury et al. in [9], is created as the sum of 1) the derivative of (11) multiplied by 0.6, and 2) the lower of either the floor threshold or the mean absolute derivative of (11). Fine-tuning of the algorithm demonstrated that a floor threshold of 20 was the most reliable value tested.

$\delta_{adpt}(n) = 0.6\,\mathrm{diff}\!\left(CDOD(n)\right) + \min\!\left( \delta_{floor},\ \frac{1}{N} \sum \left| \mathrm{diff}\!\left(CDOD\right) \right| \right)$   (12)

2.3.2 Local Maxima

As mentioned above, peak-picking is performed by selecting the local maxima within the vector, utilizing the find command to locate those points for which a) point n is greater than point n-1, b) point n is greater than point n+1, and c) point n is greater than the adaptive threshold at that point. LocalMaxima.m thus parses the data given by onDet, returning a truncated vector comprised only of the sample points at which onsets are initialized.
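The thresholding and local-maxima steps can be combined in one short sketch (Python stand-in for LocalMaxima.m; (12) is taken literally as written, and the returned indices refer to the differenced detection function):

```python
import numpy as np

def pick_onsets(cdod, floor=20.0):
    """Difference the detection function, threshold it per (12), and keep
    local maxima that exceed both their neighbours and the threshold."""
    d = np.diff(np.asarray(cdod, dtype=float))
    # (12): 0.6 * derivative, plus min(floor, mean absolute derivative)
    thresh = 0.6 * d + min(floor, np.mean(np.abs(d)))
    return [n for n in range(1, len(d) - 1)
            if d[n] > d[n - 1] and d[n] > d[n + 1] and d[n] > thresh[n]]

# The weak peak (5) falls under the adaptive threshold; the strong peak (10)
# survives as the single detected onset.
onsets = pick_onsets([0, 0, 0, 5, 0, 0, 0, 10, 0, 0])
```

The example shows the intended behavior of the adaptive term: with no fixed level to tune, the threshold scales itself to the detection function, suppressing floor noise and minor bumps while passing genuine maxima.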


3. CLASSIFICATION


3.1 Introduction

Classification of the slices between onset points is intrinsic to the subsequent matching stage of the algorithm, as it provides the necessary parsing of the output onsets vector into sub-groups defined by average spectral makeup. It should be mentioned here that the chosen classification system also provides the measure (proportional centroid distance) on which later resequencing is based; this is explained in the following section.

3.2 k-Means Classification

Once the input signal has been analyzed for onset start times, DetFuncA.m and DetFuncB.m utilize k-means (MacQueen, 1967), a simple classification method, to define the information between onset vector points as belonging to the family of kick drums (K), snare drums (S), or hats/cymbals (H). The goal of the k-means classification procedure is to locate the mean vectors of a distribution, serving to create Voronoi cells that enclose the data under an ownership labeling (Duda et al., 2001) [4]. By minimizing a squared-error distance calculation (13), k-means produces maximum-likelihood estimates of ownership [4]. In (13), the sum of squared distances of each point $x_j$ in set $S_i$ from its prescribed mean $\mu_i$ is calculated; these sums are in turn summed over the k clusters.

$V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - \mu_i \right\|^2$   (13)

During onset detection, the matrix BNC is created, containing the first eight STFT frequency bins of a detected onset frame, along with the following seven frames for validation of the frequency information. The eight bins are selected in the initialization of the DetFuncA.m and DetFuncB.m functions (Appendix 8.2) by the command 1:5:36, representing frequencies between 0-344 Hz (as the sampling frequency is 44100 samples per second and the STFT window size is 512 samples in length). These bins are chosen because this configuration provides an accurate representation of the three classes: kick drums will generally be represented by large values in bins 1:3; snares will demonstrate the most significant values in bins 5:8 (mostly in the eighth bin); while hats, although present in the vector due to onset detection, will have a minimal presence across these low bins. The means of these eight frames for each STFT bin are then calculated and input into an adapted version of Kmeans.m (Abonyi et al., 2004), part of MATLAB's Fuzzy Clustering Toolbox [11].

The k-means method of classification was designed to analyze characteristics of data in a feature space of a given number of dimensions. In this scenario there are eight dimensions, given by the eight STFT bins extracted during onset detection. The data is represented in the feature space along with k centroids (centers of mass, or barycenters), whose initial locations are generated randomly. The k value chosen for the algorithm is 3, representing the three classes (K, S, H) to which each data point will be attributed. Through an iterative process, each data point assigns itself to a particular centroid, and each centroid moves to the center of the data set by which it is defined. The process continues until each point is defined and no centroid remains in motion.

Through this iterative process each onset may be assigned to a particular centroid; however, because of the random placement of centroids, there can be no predictive description of which centroid corresponds to which class, and post-analysis description is therefore necessary. Because the sample signals being evaluated are drum loops (or perhaps non-rhythmic percussive sounds), we can assume the sounds fall into basic categories based on the design of the instruments used to create them. If the loops contain onsets from the three basic components of a drum kit (kick drum, snare drum, and hat/cymbal), we can assume that the data will cluster into three main areas of the feature space. Floor tom presence most often results in additional K (kick drum) onsets; simple tweaking of the k value from 3 to 4 will provide four centroids which may be used for the definition of drum sounds. For most production loops, however, k = 3 is sufficient, as the occasional tom will not skew the data: it will generally exist as an outlier, and will not be chosen for resequencing unless another outlier exists in the opposing loop.
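A minimal k-means sketch over eight-dimensional low-bin features follows (Python/NumPy stand-in for the adapted Kmeans.m; the synthetic feature rows, and the deterministic seeding of one initial centroid per visible clump for reproducibility, are illustrative assumptions — the toolbox places its centroids randomly):

```python
import numpy as np

# Synthetic 8-bin feature rows mimicking the description in the text:
# kicks heavy in bins 1-3, snares heavy in bins 5-8, hats low everywhere.
rng = np.random.default_rng(0)
kick_mask  = np.array([1, 1, 1, .1, .1, .1, .1, .1])
snare_mask = np.array([.1, .1, .1, .1, 1, 1, 1, 1])
kicks  = rng.uniform(0.9, 1.0, (5, 8)) * kick_mask
snares = rng.uniform(0.9, 1.0, (5, 8)) * snare_mask
hats   = rng.uniform(0.0, 0.05, (5, 8))
data = np.vstack([kicks, snares, hats])

def kmeans(x, init, iters=20):
    """Alternate nearest-centroid assignment (minimizing (13)) and
    centroid recomputation until the labels settle."""
    centroids = init.copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dist2 = ((x[:, None, :] - centroids[None])**2).sum(-1)
        labels = np.argmin(dist2, axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(data, init=data[[0, 5, 10]])
```

With well-separated clumps like these, the iteration converges in a step or two and each onset row ends up owned by the centroid of its own class.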

3.3 Cluster Labeling
Amongst other data, the output of kmeans.m provides the locations of all three centroids in all eight dimensions (given by result.cluster.v), a 1-0 truth table representing the membership of every data point to each centroid (result.data.f), and a distance metric for each point within the dataset (result.data.d). The means of result.cluster.v are calculated, creating three values. In determining which centroid is associated with which instrument class, some assumptions must be made and defined in the code. As mentioned above, kick drums will center around bins 1:3, snares around 5:8, and hats will be uniformly neutral. Furthermore, the snare centroid will contain the most energy of the three, and may therefore be defined as the maximum of the mean centroids, and the hats as the minimum. Because k = 3, through the process of elimination the kick drum centroid is associated with the remainder from the simple equation:

CentroidK = 6 - CentroidS - CentroidH    (14)

where:
CentroidS = max(mean(result.cluster.v))
CentroidH = min(mean(result.cluster.v))
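Equation (14)'s labeling-by-elimination can be sketched in Python. Indices here are 0-based, so the three labels sum to 0 + 1 + 2 = 3 rather than the 1-based 6 of the equation; the centroid matrix is a made-up stand-in for result.cluster.v.

```python
import numpy as np

# hypothetical 3x8 centroid matrix (rows: centroids, cols: STFT-bin features)
centroids = np.array([
    [0.9, 0.8, 0.7, 0.1, 0.0, 0.0, 0.0, 0.0],  # energy in low bins -> kick
    [0.2, 0.3, 0.4, 0.6, 0.9, 0.9, 0.8, 0.8],  # most overall energy -> snare
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],  # uniformly low -> hat
])

means = centroids.mean(axis=1)       # mean energy per centroid
snare_idx = int(means.argmax())      # snare: maximum mean energy
hat_idx   = int(means.argmin())      # hat: minimum mean energy
kick_idx  = 3 - snare_idx - hat_idx  # kick: the remainder (0 + 1 + 2 = 3)
```

The elimination trick works because the three indices always sum to a known constant, so once two are fixed the third follows for free.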

The marking of 'k', 's', or 'h' is then assigned to each onset in the data set, and lists of kicks, snares, and hats may then be exported. DetFuncA.m and DetFuncB.m define these outputs as output.list.kick, output.list.snare, and output.list.hat. A secondary list group, comprised of output.class.k, output.class.s, and output.class.h, provides TimbreMut.m with each data point's distance from its own class centroid, for resequencing capabilities. Both DetFuncA.m and DetFuncB.m also return a general list of onsets in the vector output.onsetinfo.

Figure 3: k-Means Output and Categorization Determination. (a) three clusters generated and result.cluster.v; (b) determination of [S]; (c) determination of [H]; (d) determination of [K].

Plotting of the data and centroid movement may be viewed in MATLAB's Figure window, and separation should be clearly visible between data groupings. In Figure 3(a), the kick centroid sits in the bottom left corner, and the data defined by its ownership is represented by blue [+]. Snares are represented as red [.], with the snare centroid located at the barycenter of these four points. The hat centroid is located on the lower right side of the graph; hat data points are represented by green [x]. Figure 3 also demonstrates the steps by which these determinations are made. Across the eight features, the snare centroid will demonstrate the highest average energy (Figure 3(b)) and the hats the least (Figure 3(c)), as the bins analyzed are much too low to exhibit any significant energy levels. As kicks are the only class remaining, they can be determined by simple subtraction: the three groups total a value of six, and we simply subtract the determined values of snares and hats from this (Figure 3(d)).

4. RESEQUENCING

4.1 Substitution Method
Once DetFuncA.m and DetFuncB.m have completed and their outputs have been returned to TimbreMut.m, a simple matching is needed to resequence the 'B' file onset list to that of the 'A' file onsets. Work by Ravelli et al. (2007) in this area introduces a substitution matrix, as presented in Figure 4(a): a systematic procedure of assigning rewards and punishments to specific pairings, such that non-class pairings (such as an 'A' kick with a 'B' snare) are discouraged, while strong on-beat events such as kick drums on the first beat of a loop are more likely to be paired. Once rewards and punishments have been assessed for each pairing, a resequenced vector is created by navigating backwards through the matrix on the path that provides the optimal score (Figure 4(b)). This implementation excels by following the resequence pattern that provides the highest possible score, and thus the most likely fit.

Figure 4: (a) substitution matrix with penalties and rewards for the same loop; (b) Needleman-Wunsch-Sellers algorithm for matching two different loops (Ravelli et al., 2007)

Figure 4(b) demonstrates the resequencing of loop 'B' (LMLM) to loop 'A' (LMLLLM). On-beat (boldface)/off-beat (regular font) determination is performed by a separate beat tracking analysis stage. As the presented method does not make such distinctions, it is not possible to utilize the substitution matrix as implemented in the Ravelli method. As such, any implementation of a substitution matrix would require some method of differentiation between similarly-classed events. However, if these events are already parsed into separate vectors, the need for a substitution matrix is removed. Pairing is then only possible by a comparative analysis of some feature produced during classification. The following section presents the only such measure output by the classification scheme: a distance metric from a centroid.
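For reference, the alignment machinery Ravelli et al. build on can be sketched as a plain Needleman-Wunsch global alignment. The scoring values below are illustrative, not those of the cited substitution matrix, and the L/M sequences are taken from the Figure 4(b) example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: fill a score matrix, then backtrack for the pairing."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i-1] == b[j-1] else mismatch
            F[i][j] = max(F[i-1][j-1] + s, F[i-1][j] + gap, F[i][j-1] + gap)
    # navigate backwards through the matrix along an optimal-score path
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        s = match if (i > 0 and j > 0 and a[i-1] == b[j-1]) else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i-1][j-1] + s:
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j-1]); j -= 1
    return F[n][m], ''.join(reversed(out_a)), ''.join(reversed(out_b))

# loop 'A' (LMLLLM) aligned against loop 'B' (LMLM), as in Figure 4(b)
score, al_a, al_b = needleman_wunsch('LMLLLM', 'LMLM')
```

Aligning the four-event 'B' loop to the six-event 'A' loop forces two gaps, so the optimal score here is 4 matches minus 2 gap penalties.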

4.2 Proportional Centroid Distance (PCD)
Sequence alignment of two loops relies upon the basic assumption that there is some discernible characteristic within each of the vectors being analyzed that can be used as the source of comparison. Spectral similarity can only bring us so far: once we have delineated between classes, there truly is no reliable source of frequency comparison

between loops. A feature related to spectral similarity can be created, however, through a stored output of kmeans.m, namely the aforementioned output.class.k/s/h, which holds each event's distance to its ownership centroid, per class. Unfortunately centroid distances are not predictable: strong variation between in-class events, such as rimshots and brushes on a snare drum, may result in a large variation in centroid distances. It is therefore essential to adapt this data to a comparative form. The method undertaken here is to use the proportional centroid distance (PCD), achieved simply by dividing each value in a particular subset by its largest distance. To prevent incorrect class mismatches, resequencing in TimbreMut.m is performed sequentially on individual class lists (TimbreMut.m, lines 68-83), resulting in resequenced 'B' class lists (B_kicks_nu, B_snares_nu, B_hats_nu), which must then be combined for the cross-synthesis technique in the following section. Although PCD is irrelevant to spectral comparison between 'A' and 'B' beyond the class delineation (i.e. kick versus snare), it can be seen to effectively associate those onsets within loop 'B' with onsets of 'A' that are of similar deviation from the idealized timbres of the classes, represented by the centroids.

Figure 5: PCD pairing and Resequencing of A and B kicks

Figure 5 demonstrates the comparison method that PCD uses for resequencing 'B' onsets. Each vector of distances (listed as 'A' kicks and 'B' kicks) is first divided by its maximum component, resulting in components ranging between 0 and 1. This provides a method of comparison between the two lists. To find the most appropriate pairings, PCD compares each 'A' onset with every 'B' onset, locating the absolute minimum deviation between the two lists. In the case of A_kicks(1), the correct pairing becomes B_kicks(13), because the smallest calculated difference exists between 0.5683 and 0.7186. Continuing in this fashion results in a resequenced pattern of [13,9,5,1,5,4,5,1,9], defined by B_kicks_nu. The process is then repeated for the creation of B_snares_nu and B_hats_nu. PCD is an especially quick operation, generally executed in under 0.025 seconds. It provides reliable results for loops with standard instrumentation, and is able to provide proper pairing for ghost notes (stick bouncing) and the occasional odd timbral event, such as a rimshot, so long as these exist in both loops being matched.
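A minimal sketch of the PCD pairing follows (NumPy, 0-based indices rather than the 1-based MATLAB indices above; the distance vectors are made up, not the actual kmeans.m output).

```python
import numpy as np

def pcd_match(a_dists, b_dists):
    """Pair each 'A' onset with the 'B' onset of closest proportional distance."""
    a = np.asarray(a_dists, float) / max(a_dists)  # scale each vector to 0-1
    b = np.asarray(b_dists, float) / max(b_dists)
    # for every A onset, pick the B onset with minimum absolute deviation
    return [int(np.argmin(np.abs(b - av))) for av in a]

# hypothetical intra-class centroid distances for the kick lists of two loops
a_kicks = [2.1, 0.4, 3.7]
b_kicks = [1.5, 6.0, 3.2]
order = pcd_match(a_kicks, b_kicks)  # -> [2, 0, 1]
```

Note that, as in the thesis, a 'B' onset may be chosen more than once: the pairing is a nearest-neighbour lookup per 'A' onset, not a one-to-one assignment.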

One interesting issue created through the use of centroid distance is that directionality is completely ignored. As a comparison of spectral characteristics is not performed on intra-class slices, a sort of spectral circumference is generated, and comparisons are based only on distance from an idealized class structure (ideal kick drum, snare drum, or hat/cymbal). The result of this is neither good nor bad, as the spectral makeup of different loops cannot be assumed to be similar. For example, a solid down-beat kick of loop A could most closely resemble, in spectrum, the off-beat syncopated kick of loop B, which would be an undesirable pairing. Instead, the distance metric is used without regard to spectral positioning within the circumference.

4.3 Outliers
It is important to keep in mind that unwanted results may occur if one of the loops contains a majority of non-standard sounds such as rimshots and relatively few full snare hits, as this will move the centroid towards the rimshot timbre, making this the normal sound. When paired with another loop that uses the full snare as a comparative centroid element, the full snare will be matched with a rimshot. Of course, the user of this algorithm should be aware that such a pairing is to be expected: there are only three centroids, and if a rimshot exists as the main source of timbral expression for the snare centroid, then it should come as no surprise that its timbre will be duly represented within the final output .wav files.

5. TIMBRE MUTATION

5.1 Possible Methods
There are several methods by which the slices of one loop 'A' may be combined with another 'B'. Any chosen synthesis technique should provide consistent quality and do so quickly, as the algorithm must iterate several times. Other considerations include a variable quality control for faster results when choosing loops and setting timbre configurations. Certain input characteristics, such as variation in file length and pitch height, also play a role in the decision as to which method should be incorporated. Spectral modeling synthesis (SMS), linear predictive coding (LPC), and phase vocoder morphing are all appropriate methods by which to implement repetitive cross-synthesis; however, SMS and LPC require longer durations for analysis and high quality output than the phase vocoder method: LPC, for example, requires a large number of coefficients to carefully model the spectral envelope of a sound. It should be noted that because specific frequency blending characteristics are not present (the princarg function is not utilized), the phase vocoder method should not generally be described as a cross-synthesis technique for tonal mutation applications; for this application, however, the method proves quite appropriate. The function PhaseVoxMut.m (Appendix 8.4) provides analysis, transformation, and resynthesis functionality to the TimbreMut.m algorithm.

5.2 PhaseVoxMut.m
Analysis and resynthesis performed on the time-frequency model employed during phase vocoding allows simple modification of magnitude and phase values. In a block-by-block approach, samples are input into a sliding-time reference FFT calculation, such that:

X(n,k) = Σ_{m=-∞}^{∞} x(m) h(n-m) e^{-j2πmk/N}    (15)

Here, h(n) is a sliding window whose length is the same as the frame size. It is important to note that in the block-by-block approach, phases are not unwrapped by the princarg function, and as such their values lie between 0 and 2π. Amplitude values lie between -1 and 1.

In resynthesis, the IFFT is applied to each STFT spectrum, and the signal is then reconstructed through the use of the overlap-add method (Arfib et al., 2002). The

combined process, also known as the direct FFT/IFFT approach, is demonstrated below in Figure 6.

Figure 6: Overview of Overlap-Add Phase Vocoder method (adapted from Zölzer et al., 2002)

Provided that the sum of the overlapped STFT windows is unity, as is seen in Figure 7, a perfectly reconstructed signal is possible.
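The unity-sum condition can be checked directly: a periodic Hann window overlapped at 50% sums exactly to one in the interior. The following quick NumPy check is independent of the thesis code; the frame count and window length are arbitrary.

```python
import numpy as np

N, hop, frames = 1024, 512, 8
# periodic Hann window: w[n] = 0.5 * (1 - cos(2*pi*n/N))
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))

total = np.zeros((frames - 1) * hop + N)
for f in range(frames):
    total[f * hop : f * hop + N] += w  # overlap-add the bare windows

interior = total[N:-N]  # the edges lack a full complement of overlaps
```

Since w[n] + w[n + N/2] = 1 for a periodic Hann window, every interior sample receives exactly unit gain, which is the precondition for the perfect reconstruction claimed above.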

Figure 7: Overlap-Add Summation Model (adapted from Zölzer et al., 2002)

Arfib et al. (2002) explain that mutation between two sounds may be performed through the direct FFT/IFFT method, whereby two sounds are analyzed by individual sliding FFT windows in tandem. Within each STFT frame, phase and magnitude values of each sound are extracted.

The phase components of sound A, φ_A(n), and sound B, φ_B(n), are combined:

φ_A(n) + φ_B(n) = φ_comb(n)    (16)

and the same method is used to determine the overall amplitude values from the individual A, X_A(n), and B, X_B(n), values:

X_A(n) + X_B(n) = X_comb(n)    (17)

At the end of each window calculation, phase and magnitude components are then merged:

y(n) = X_comb(n) · e^{jφ_comb(n)}    (18)

And finally, the IFFT returns the spectrum to a time-domain representation of the signal:

outputPVM = IFFT(y(n))    (19)

Figure 8 demonstrates the process of phase vocoder mutation.

Figure 8: Phase Vocoder Mutation

In an iterative process, PhaseVoxMut.m is provided correlated 'A' and 'B' (i.e. A_kicks and B_kicks_nu) samples; it analyzes their lengths and determines the longer sample, which becomes the new sample length. The frame-by-frame STFT values from the earlier analysis in DetFuncA.m and DetFuncB.m are not stored, and as such must be recalculated in PhaseVoxMut.m. This calculation is quite brief, and its separation from the previous processes is essential for the separation of magnitudes and phases, as well as for the variable quality control measure discussed previously in this section.

5.3 User-Defined Mix Levels
For innovative phase vocoder mutation, [13] recommend experimenting with adjustments to the process of addition of magnitudes, r = r1 + r2, where r is the overall magnitude, and r1 and r2 are the 'A' sample and 'B' sample magnitudes (per frame), respectively. To this end, I have implemented coefficients α and θ, for magnitude and phase respectively, so that the user may control the amounts of each of these values from the command line, resulting in a modified equation which now reads (PhaseVoxMut.m, line 51):

X_comb(n) = X_A(n) · (1 - α) + X_B(n) · α    (20)

This same principle is applied for phase control, described by theta on line 52:

φ_comb(n) = φ_A(n) · (1 - θ) + φ_B(n) · θ    (21)

The adjusted magnitudes and phases are then recombined as per equation (18) in line 53. This returns the imaginary and real values for the IFFT transformation, which returns the result of (18) to the time domain. PhaseVoxMut.m then outputs amplitude values, similar to the wavread output used to create the hybridization, and uses the two sample names to output a concatenated name. The phase vocoding method results in the separation of magnitudes and phases within the frequency domain; morphing with this technique produces two sets of such magnitudes and phases, and allows for their individual manipulation. At the command line, once filenames are called, the magnitude and phase coefficients provide a 0.0-1.0 control over each parameter, such that a 0.2 value would coincide with a 20% mix (80% 'A', 20% 'B'). For example, the command line may read:

TimbreMut('clarknova4bbcutm.wav', 'eel.wav', '0.8', '0.2')
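Equations (15) through (21) can be sketched end to end. This is a simplified NumPy stand-in for PhaseVoxMut.m, not the thesis code itself; the toy kick slices, window choice, and hop size are assumptions. With alpha = theta = 0 it reduces to overlap-add reconstruction of the 'A' signal.

```python
import numpy as np

def phase_vox_mut(x_a, x_b, alpha=0.8, theta=0.8, N=1024):
    """Frame-wise morph: weighted mix of magnitudes (alpha) and phases (theta)."""
    hop = N // 2
    w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))  # periodic Hann
    L = max(len(x_a), len(x_b))                           # longer sample wins
    x_a = np.pad(x_a.astype(float), (0, L + N - len(x_a)))
    x_b = np.pad(x_b.astype(float), (0, L + N - len(x_b)))
    y = np.zeros(L + N)
    for s in range(0, L, hop):
        A = np.fft.rfft(w * x_a[s:s+N])                        # eq. (15), per frame
        B = np.fft.rfft(w * x_b[s:s+N])
        mag = np.abs(A) * (1 - alpha) + np.abs(B) * alpha      # eq. (20)
        ph  = np.angle(A) * (1 - theta) + np.angle(B) * theta  # eq. (21)
        y[s:s+N] += np.fft.irfft(mag * np.exp(1j * ph))        # eqs. (18)-(19), OLA
    return y[:L]

sr = 44100
t = np.arange(sr // 4) / sr
kick_a = np.sin(2 * np.pi * 60 * t) * np.exp(-8 * t)  # toy 'A' slice
kick_b = np.sin(2 * np.pi * 90 * t) * np.exp(-5 * t)  # toy 'B' slice
out = phase_vox_mut(kick_a, kick_b, alpha=0.0, theta=0.0)
```

Because the analysis windows sum to unity at 50% overlap, the alpha = theta = 0 case reproduces the 'A' slice in the interior, which is a convenient sanity check before turning the mix coefficients up.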

which corresponds to 80% (0.8) 'eel.wav' magnitudes and 20% (0.2) 'eel.wav' phases. It is important to mention here that the decorrelation of magnitudes and phases results in seemingly random pairings of phases with magnitudes, and the larger the divide, the more pronounced the effect. The most obvious disconnect will result when the magnitude and phase values differ by 1.0, such that the values read either mag = 0.0 with phase = 1.0, or mag = 1.0 with phase = 0.0.

In both cases cavernous phaser effects are created. In the first case, TimbreMut.m will return a resequenced .wav in the rhythm of the first loop, which utilizes the enveloping of the sounds of 'clarknova4bbcutm.wav' while returning the phase components (and thus sounds) of 'eel.wav'. While remaining rhythmically consistent with the first case, the second will contain the envelopes of 'eel.wav' while maintaining the phases of

'clarknova4bbcutm.wav' – an interesting effect, considering that this file has also provided the output file with its rhythmic component. Both of the aforementioned settings are extreme variations on the design of the algorithm, resulting in phaser-like effects upon the output waveform. Standard usage, however, will generally keep the magnitude and phase values consistent with one another; in fact the default values are set to 0.8 and 0.8, which results in a clean mix of 80% 'B' loop in both magnitudes and phases. Equal values throughout the scale of 0.0-1.0 will then provide predictably clear results, with correlated phase values.

Setting both coefficients to 1.0 will result in a full replacement of both phases and magnitudes; the output of the algorithm will thus be the timbre of the 'eel.wav' loop playing the rhythm of the 'clarknova4bbcutm.wav' sample.

5.4 Window Size
The final user-defined option within the TimbreMut.m algorithm is the window size. The default size is 1024 samples; however, any power of two will work. The purpose of this control is to allow the user to quickly create a transformation for testing. Timbre morphing is often a difficult process to predict, even without user-defined options such as phase and magnitude amounts. Larger window sizes are effective for accurate frequency representation, but tend to create non-linearity within the input-output relationship. Therefore, to fine-tune the morphing parameters, the user has access to a larger window size, which will generate low quality versions of the transformation prior to a higher quality representation. As the loss of frequency resolution is less critical than that of temporal resolution, a final, higher quality version may be produced by using a window size of 512, 256, or even 128 samples.

6. FURTHER CONSIDERATIONS

6.1 MIDI
This algorithm is designed not only to produce new timbral effects, but to perform autonomously as a traditional beat slicing program such as Propellerheads' ReCycle. Of course the addition of MIDI is not simply for this purpose, but more intrinsically because MIDI techniques are still widely used within the electronic music community. Specifically, audio manipulation within software sequencers such as Pro Tools, Digital Performer, Logic, and Cubase is often executed with the use of software samplers and MIDI control. It is for this express reason that TimbreMut.m exports individual .wav slices and a .mid file for their control. DetFuncA.m exports an 'nmat' file which stores the sample points at which onsets occur, which is then used by TimbreMut.m to create a MIDI file on lines 171 to 172 (Appendix 8.1). DetFuncA.m also contains code for MIDI file creation, the result of altered code from the MATLAB MIDI Toolbox. Notes are numbered from 1 to the length of the onset vector (output.mMIDI.notes), and their durations (output.mMIDI.dur) are prepared as the sample-point difference between successive onsets. The createnmat.m command from the MIDI Toolbox is used here to create a notematrix, or pre-MIDI file, containing notes, durations, and note-on and note-off times. The notematrix is then output from DetFuncA.m as output.mMIDI.nmat. These files are returned to TimbreMut.m and configured as a renamed MIDI file there. DetFuncB.m does not contain a MIDI preparation block, nor is one necessary: the flow of information shows that the second loop is resequenced fully to the rhythm of the first, so the process will never require a second MIDI file. This was a purposeful decision to limit the outflow of information from the detection algorithms, thereby reducing analysis processing time significantly. The absence of this feature is the only difference between the two functions.
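The nmat preparation can be sketched as follows. This is plain Python, not the MIDI Toolbox's createnmat.m; the onset sample points, loop length, and note numbering are illustrative assumptions.

```python
# onset sample points (as stored via output.mMIDI) and loop length, in samples
sr = 44100
onsets = [0, 11025, 22050, 33075]  # hypothetical detected onsets
loop_len = 44100

# durations: sample-point difference between successive onsets
durations = [b - a for a, b in zip(onsets, onsets[1:] + [loop_len])]

# a notematrix-like list of (onset_sec, duration_sec, note_number),
# with notes numbered 1..len(onsets) as in the thesis
nmat = [(on / sr, dur / sr, i + 1)
        for i, (on, dur) in enumerate(zip(onsets, durations))]
```

Each entry then carries everything a note-on/note-off pair needs, so writing the actual .mid file reduces to a format-conversion step.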

6.2 Slicing and Combined Output PhaseVoxMut.m is called iteratively within TimbreMut.m (Appendix 8.1), on lines 99, 110, 125, 136, 151 and 162. The internal output of the command is, as mentioned above, an individual slice as a .wav, named for the two input samples. PhaseVoxMut.m also returns an amplitude vector to TimbreMut.m, which, in combination with the stored ‘notes’ vector, is used for a final resequenced, combined .wav.

6.3 Gaps and Clipping
During rhythm resequencing, substantial slice length deviation between 'A' and 'B' samples may result in either a gap between the end of one slice and the beginning of the next, or clipping if a slice has not completed its full playback. The solution undertaken to solve both of these problems is two-fold. Before either problem is addressed, it must be understood that PhaseVoxMut.m will, and must, choose a specific sample length. The first step therefore prescribes a minimum and maximum label to each sample: PhaseVoxMut.m always creates an empty vector to the length of the longer sample, regardless of whether it is 'A' or 'B', creating a situation whereby an 'A' loop with shorter durations will always clip (so long as the mix is above 0.0 for at least the magnitude value). Once this has been done, a simple 100-sample fade (Appendix 8.1, lines 181-182) placed at the end of each prescribed MIDI note length will remove any non-linearity associated with the continuance of the note events, and pass unnoticed by listeners.
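The 100-sample fade can be sketched in NumPy. This is a stand-in for the MATLAB lines cited above, not a copy of them, and the constant-amplitude slice is synthetic.

```python
import numpy as np

def fade_out(slice_, n=100):
    """Apply a linear n-sample fade to the tail of a slice to avoid clicks."""
    out = slice_.astype(float).copy()
    out[-n:] *= np.linspace(1.0, 0.0, n)  # ramp from full level to silence
    return out

slice_ = np.ones(1000)  # toy constant-amplitude slice
faded = fade_out(slice_)
```

Truncating a slice mid-decay produces a step discontinuity (a click); the short ramp replaces that step with a smooth approach to zero at the note boundary.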

6.4 Omission of Timestretching Functionality
The decision to omit a timestretching function was made only after attempting varied techniques and after careful consideration of several unavoidable scenarios. Certain assumptions must be made in considering the timestretching option. First and foremost is that each loop will contain slices of different sample lengths (unless of course the user is manipulating a loop with itself, simply for the slicing benefit of the algorithm). If the difference between sample lengths is greater than a certain listener-defined tasteful duration, then how might an algorithm containing only three centroids to define class (along with centroid distance) deal with length? If the first loop contains long-duration ride cymbals and the second only closed hats, should the algorithm, when pairing occurs, really stretch the closed hat to the duration of the ride cymbal? Second, and closely related to the first point: if this is permitted and timestretching is utilized, at which point does the sample difference become uncomfortable to the ear? Beyond transient smearing, stretching even the decay of kick drums often results in a slight ringing and phasy tone, which many electronic musicians and producers might frown upon. More importantly, a high quality timestretching algorithm is, in itself, another process that takes time to complete. For these reasons, I have chosen to avoid implementing a timestretching block in the algorithm, and will instead rely on the user to perform any such manipulation prior to input.

7. CONCLUSIONS

In this paper I have presented a method by which two percussion loops may be automatically manipulated such that 1) the second loop is modified to the rhythm of the first, and 2) the loops are then systematically morphed together. The TimbreMut.m algorithm appears to be a useful tool for segmentation and manipulation of percussive rhythm and timbre. It consists of: 1) a detection function which effectively locates the onsets of percussive events, 2) a classification stage which uses basic principles of logic and inference in a deterministic categorization of the events into one of three groups, 3) resequencing using a component from the classification stage, and finally 4) a mutation/morphing element which relies on phase vocoder techniques. These stages have been designed with an emphasis on modularity, such that each component is comprised of block forms for easy access or redefinition via alternate functions. Quantification of the robustness of the algorithm is difficult without proper sample licensing. The database on which I have performed my tests is a private collection of 355 recorded funk and jazz vinyl records/CDs, along with a second set of 30 prepared studio drum loops comprised of more basic beat structures and less timbral complexity. To test the accuracy of the onset detection, TimbreMut.m was modified to output individual slices of the A loop alone. On studio drum loops, DetFuncA.m performed impeccably, locating onset points precisely each time. It should be noted that these loops have been created with the intention of further manipulation by electronic music composers, and as such have a moderate amount of separation between elements. Detection with the recorded funk loops was slightly less predictable; however, no false onsets were found.
The most repeated error was that the detection was, in a sense, too accurate: if a percussionist played a kick and hat closely together, but their attacks were not indistinct, the function would see the two as individual onsets, resulting in a short slice. For simple slicing of a loop this is fine; however, in timbre morphing or full replacement scenarios, the replaced event would not fit comfortably, and a 'chopped' effect occurs. To remedy this situation, a simple if-statement could be included in the synthesis block, whereby slices must be of a given duration to enter PhaseVoxMut.m, and otherwise will automatically be used in the final output. Listener tests have been optimistic towards the output of the algorithm when the PhaseVoxMut.m frame size is set to moderate (512 samples) or high quality (128 samples) settings. Below these, the sound becomes "grainy" (at 1024 samples) or "noisy" (2048 samples and above). The most frequent criticism regarding the recombined waveform was that the individual slice decays were often cut short. As the onset points for the rhythm cannot be altered, and a timestretching function is not a presently entertained option, there is little that can be done to prevent this problem. The effect is, however, often a desirable sonic event within electronic dance music genres such as drum and bass or glitch. Nevertheless, electronic musicians were for the most part interested in the individual output .wav files (along with the accompanying .mid file), and

possibilities with non-standard drum sounds for alternate percussion timbres. This group was eager to try the algorithm to generate new timbres quickly for compositions. Potentially the most problematic area of the algorithm is the classification stage. The simple three-class categorization method works well for loops whose constituent events are comprised of timbres defined by kick drum (K), snare drum (S), or hat/cymbal (H). For the most part, then, k-Means relegates non-K, S, H timbres to a positioning in relation to the K, S, H tessellation. For example, floor toms are seen as outliers to the K class, and bells or triangles will be perceived as hats. If the number of non-percussive events such as triangles predominates over standard hats in a loop, then k-Means will still iterate its centroid towards the barycenter of all events, making the idealized hat a composite of both the hats and triangles. Therefore, the more non-hat elements encompassed by the H centroid, the more loosely the definition will fit the slices. Again, this proves to be an issue only during mutation or full replacement, as a triangle from one loop may be blended with a hat from the second. In operations with the database of funk loops, this situation was common, as live performances and solos often include a stylistic individuality marked not only by heightened rhythmic complexity, but by timbral uniqueness as well. As TimbreMut.m has been designed to work more robustly with studio drum loops, the user must use her/his scrutiny in loop choice. In future implementations, a nested k-Means method may be incorporated, such that the first delineation would be between low, middle, and high events; subclasses could then be derived from a secondary k-Means calculation, whose (secondary) k value is generated from the Euclidean distance of each of these events.
Another answer may be found in the k-medoid function, whereby a centroid is not chosen; rather, the data point closest to the center is chosen as the center of mass. Comparative analysis may then be used to determine outlier validity. The PCD method of resequencing could also be made more robust against outliers through an outer-limit thresholding, such that once the distances have been brought within a certain range, an adaptive thresholding function could selectively eliminate all data points beyond it. Presently, TimbreMut.m is not intended to operate in real time; as its usage is intended for the processing and preparation of percussive loops, it is not required that it rely on user input once initial calculations have begun. A future implementation may include real-time resequencing with live audio, a feature provided by Nick Collins' BBCut for the SuperCollider language, but this would require the reworking of several components of the algorithm's architecture within another program. In further implementations of the program, I would also like to incorporate a multi-resolution analysis phase, similar to that found in [15]. This would allow for both a more accurate representation of the presence of onsets within each frequency range chosen, as well as a better-suited vector for the classification stage. In all, however, the TimbreMut.m algorithm provides a fast method of morphing between percussive loops, generating novel timbres from pre-recorded samples, and its architecture provides a strong basis for further development.


8. APPENDIX


8.1 TimbreMut.m

function [] = TimbreMut(filename1,filename2,mag,phase,framesize);
%Jason A. Hockman
%NYU Music Technology
%This function, given two different mono drumbeats, will return individual
%blended conglomerate files made up of the like sounds of the two loops, a
%midi file of the same root name, as well as a resequencing of the two
%individual files, made to play to the rhythm of the first loop. All these
%outputs are then found in a folder, within the main MATLAB catalog, with
%a combined name.
%
% e.g. A) '10.wav' B) 'Brockie.wav' Folder name will be: '10Brockie'
%
cd('/Applications/MATLAB_SV71'); %this should be set to main Matlab folder
%===========================
% ---- Onset Detection ----
%===========================
if (exist('mag') ~= 1) %optional args for spectral magnitude and phase mix
    mag=0.8;
else
    mag=str2num(mag);
end
if (exist('phase') ~=1)
    phase=0.8;
else
    phase=str2num(phase);
end
if (exist('framesize') ~=1) %opt args for framesize...useful to start high
    framesize=1024;
else
    framesize=str2num(framesize);
end
output1 = DetFuncA(filename1); % complex domain detection function for A
output2 = DetFuncB(filename2); % complex domain detection function for B
fs=output1.audio.SR;
%A_defs
A=output1.audio.file;
A_kicks=output1.list.kick; A_snares=output1.list.snare;
A_hats=output1.list.hat;
A_k=output1.class.k; A_s=output1.class.s; A_h=output1.class.h;
A_ON=output1.onsetinfo; dur=output1.mMIDI.dur;
%B_defs
B=output2.audio.file;
B_kicks=output2.list.kick; B_snares=output2.list.snare;
B_hats=output2.list.hat;
B_k=output2.class.k; B_s=output2.class.s; B_h=output2.class.h;
B_ON=output2.onsetinfo;
%=============================
% -------- Filename ---------
%=============================


fn1=filename1(1:strfind(filename1,'.')-1);
fn2=filename2(1:strfind(filename2,'.')-1);
if length(fn1)>4; fn1=fn1(1:4); end
if length(fn2)>4; fn2=fn2(1:4); end
fn=[fn1 fn2]; mkdir(fn); %used for creation of folder+filename
%===========================
% ----- P C D Pairing -----
%===========================
m=1; n=1; p=1;
%this is performed via proportional centroid proximity
for i=1:length(A_k);
    [Y(:,m),A_feign_kick(:,m)]=min(abs((A_k(i)/max(A_k))-...
        ((B_k(1:length(B_k)))/max(B_k))));
    m=m+1;
end
for i=1:length(A_s);
    [Y(:,n),A_feign_snare(:,n)]=min(abs((A_s(i)/max(A_s))-...
        ((B_s(1:length(B_s)))/max(B_s))));
    n=n+1;
end
for i=1:length(A_h);
    [Y(:,p),A_feign_hat(:,p)]=min(abs((A_h(i)/max(A_h))-...
        ((B_h(1:length(B_h)))/max(B_h))));
    p=p+1;
end
B_kicks_nu=B_kicks(A_feign_kick);
B_snares_nu=B_snares(A_feign_snare);
B_hats_nu=B_hats(A_feign_hat);
%===============================
% -- Phase Vocoding & Slices --
%===============================
E=zeros(max(output1.mMIDI.dur),length(A_ON)-1);
m=1; n=1; p=1;
for i=1:length(A_kicks);
    if A_kicks(i)<length(A_ON)-1;
        A_kick=A(A_ON(A_kicks(i)):A_ON(A_kicks(i)+1));
        if B_kicks_nu(i)<length(B_ON)-1;
            B_kick=B(B_ON(B_kicks_nu(i)):B_ON(B_kicks_nu(i)+1));
        else
            B_kick=B(B_ON(B_kicks_nu(i)):length(B));
        end
        j=eval(['A_kicks(' num2str(m) ')']);
        pvmout=PhaseVoxMut(A_kick, B_kick, ([fn num2str(j)]),...
            fn,mag,phase,framesize);
        E(1:length(pvmout.result),j)=pvmout.result;
    else
        A_kick=A((A_ON(A_kicks(i))):A_ON(end));
        if B_kicks_nu(i)<length(A_ON)-1;
            B_kick=B(B_ON(B_kicks_nu(i)):B_ON(B_kicks_nu(i)+1));
        else
            B_kick=B(B_ON(B_kicks_nu(i)):B_ON(end));
        end
        j=eval(['A_kicks(' num2str(m) ')']);
        pvmout=PhaseVoxMut(A_kick, B_kick, ([fn num2str(j)]),...
            fn,mag,phase,framesize);
        E(1:length(pvmout.result),j)=pvmout.result;
    end
    m=m+1;
end
for i=1:length(A_snares);
    if A_snares(i)<length(A_ON)-1;
        A_snare=A(A_ON(A_snares(i)):A_ON(A_snares(i)+1));
        if B_snares_nu(i)<length(B_ON)-1;


            B_snare=B(B_ON(B_snares_nu(i)):B_ON(B_snares_nu(i)+1));
        else
            B_snare=B(B_ON(B_snares_nu(i)):length(B));
        end
        j=eval(['A_snares(' num2str(n) ')']);
        pvmout=PhaseVoxMut(A_snare, B_snare, ([fn num2str(j)]),...
            fn,mag,phase,framesize);
        E(1:length(pvmout.result),j)=pvmout.result;
    else
        A_snare=A((A_ON(A_snares(i))):length(A));
        if B_snares_nu(i)<length(A_ON)-1;
            B_snare=B(B_ON(B_snares_nu(i)):B_ON(B_snares_nu(i)+1));
        else
            B_snare=B(B_ON(B_snares_nu(i)):length(B));
        end
        j=eval(['A_snares(' num2str(n) ')']);
        pvmout=PhaseVoxMut(A_snare, B_snare, ([fn num2str(j)]),...
            fn,mag,phase,framesize);
        E(1:length(pvmout.result),j)=pvmout.result;
    end
    n=n+1;
end
for i=1:length(A_hats);
    if A_hats(i)<length(A_ON)-1;
        A_hat=A(A_ON(A_hats(i)):A_ON(A_hats(i)+1));
        if B_hats_nu(i)<length(B_ON)-1;
            B_hat=B(B_ON(B_hats_nu(i)):B_ON(B_hats_nu(i)+1));
        else
            B_hat=B(B_ON(B_hats_nu(i)):length(B));
        end
        j=eval(['A_hats(' num2str(p) ')']);
        pvmout=PhaseVoxMut(A_hat, B_hat, ([fn num2str(j)]),...
            fn,mag,phase,framesize);
        E(1:length(pvmout.result),j)=pvmout.result;
    else
        A_hat=A((A_ON(A_hats(i))):length(A));
        if B_hats_nu(i)<length(A_ON)-1;
            B_hat=B(B_ON(B_hats_nu(i)):B_ON(B_hats_nu(i)+1));
        else
            B_hat=B(B_ON(B_hats_nu(i)):length(B));
        end
        j=eval(['A_hats(' num2str(p) ')']);
        pvmout=PhaseVoxMut(A_hat, B_hat, ([fn num2str(j)]),...
            fn,mag,phase,framesize);
        E(1:length(pvmout.result),j)=pvmout.result;
    end
    p=p+1;
end
%===========================
% -------- MIDI ----------
%===========================
MIDIfileName=[fn '.mid']; %additional functionality for ReCycle-like use
MIDIfile=writemidi(output1.mMIDI.nmat,MIDIfileName);
%=============================
% Resequenced Combined Output
%=============================
df=(-1:.01:0); df=abs(df)'; %used for fade
notes=[1:length(A_ON)-1]; start=A_ON(1:length(A_ON)-1);
buff=zeros(1,length(A)); E=E';
for i=1:length(notes);
    drumfade=[ones(dur(i)-101,1); df]; %fade for clip prevention
    buff(1,start(i):start(i)+dur(i)-1)=E(i,1:dur(i)).*drumfade';
end


%===========================
% ------ Resequence -------
%===========================
wavwrite(buff,fs,[fn num2str('full.wav')]);
cd('/Applications/MATLAB_SV71');

8.2 DetFuncA.m & DetFuncB.m

function output=DetFuncA(filename);
%Jason A. Hockman
%NYU Music Technology
%This function, which is used to find the onsets of each event within a
%percussive audio file, is based on the research of Bello et al. It
%incorporates both phase (proper for kick detection) and STFT energy
%assessment for high freq elements (snares+hats). This function outputs
%onsets, their start times, as well as a MIDI file.
[y,fs]=wavread(filename);
output.audio.file=y;
output.audio.SR=fs;
% ===========================
% ---- Initializations ----
% ===========================
m=1;
v=1:5:36;
q=1;
thresh=20;
N=512;
hop=N/2;
win=blackman(N);
file_name=[];
tic
% ==========================================
% ---- Complex-Domain Onset Detection ----
% ==========================================
A=abs(spectrogram(y,win,hop,N,fs));
pin=0;
pout=0;
pend=length(A)-2;
while pin<pend
    fta_prior=real(A(1:hop,pin+2)); %prior angle
    fta_priorprior=real(A(1:hop,pin+1)); %priorprior angle
    STFT_present=abs(A(1:hop,pin+3)); %present STFT
    STFT_prior=abs(A(1:hop,pin+2)); %prior STFT
    newthresh(:,m)=median(STFT_present);
    d = princarg((2*fta_prior) - fta_priorprior); %phase unwrap
    onDet(:,m)=(((STFT_prior).^2)+((STFT_present).^2)- ...
        ((2*STFT_prior).*(STFT_present).*(cos(d)))).^0.5; %OnDet
    bnc(:,m)=(STFT_present(v)); %storing first 8 fft bins
    m=m+1;
    pin=pin+1;
    pout;
end
pk=[zeros(1,2) sum(onDet)];
nt=[zeros(1,2) (newthresh)];
peaks1=(diff(pk)); %derivative of onsets
thresh = min(thresh, mean(abs(peaks1)));
adpk=pk.*0.6;
adthresh=diff(adpk);
peaks=[-20 LocalMaxima(peaks1,thresh,adthresh)]; %peakpicking


for n = 2:length(peaks);
    if ((peaks(n))-(peaks(n-1))) > 15
        P1(q) = peaks(n);
        q = q+1;
    end
end
P1 = [P1 size(bnc,2)];
for k = 1:length(P1)-1
    Mx(:,k) = mean(bnc(:,P1(k):min(P1 ...
        (k)+7,P1(k+1)-1)),2);
end
ONSETS=(P1(1:end-1))*hop;
% ===========================
% ------- K-Means -------
% ===========================
data.X=Mx';
[N,n]=size(data.X);
data=clust_normalize(data,'range'); %data normalization
plot(data.X(:,8),data.X(:,8),'.')
hold on
param.c=3; % # of Clusters
param.vis=1; % visualization parameter (set to 0 or 1)
param.val=2;
result=Kmeans(data,param);
hold on
plot(result.cluster.v(:,1),result.cluster.v(:,2),'ro')
result=validity(result,data,param);
result.validity;
%plot(result.cluster.v')
CS1=(result.cluster.v(1,:));
CS2=(result.cluster.v(2,:));
CS3=(result.cluster.v(3,:));
%means of centroids
SC1m=(sum(CS1))/8;
SC2m=(sum(CS2))/8;
SC3m=(sum(CS3))/8;
SCm=[SC1m SC2m SC3m];
[h,s]=max(SCm); %these 3 lines are defining the data
[n,h]=min(SCm);
k=6-s-h;
[output.list.kick,J,output.class.k]=find(result.data.d(:,k).*result.data.f(:,k));
[output.list.snare,J,output.class.s]=find(result.data.d(:,s).*result.data.f(:,s));
[output.list.hat,J,output.class.h]=find(result.data.d(:,h).*result.data.f(:,h));
m=1;
ON = [ONSETS length(y)];
for i=1:length(ON)-1
    dur(m)=ON(i+1)-ON(i);
    m=m+1;
end
output.onsetinfo=ON;
output.mMIDI.notes=(1:length(ONSETS))';


output.mMIDI.dur=dur;
dur=(dur/fs);
output.mMIDI.nmat=createnmat(output.mMIDI.notes,dur);
output.mMIDI.start=ONSETS';
output.mMIDI.finish=ON(2:length(ON))';
toc

8.3 LocalMaxima.m

function peaks = LocalMaxima(peaks1, thresh, adthresh);
% peaks = LocalMaxima(peaks1): return indices of elements of peaks1
% which are positive peaks
i = 2:length(peaks1)-1;
peaks = find((peaks1(i)>peaks1(i-1))&(peaks1(i)>peaks1(i+1))...
    &(peaks1(i)>(thresh + adthresh(i))));

8.4 PhaseVoxMut.m

function pvmout = PhaseVoxMut(file1, file2, file3, file4,r_coef,t_coef,framesize);
%Jason A. Hockman
%NYU Music Technology
%This function, based on the design of a phase vocoder from the DAFx book,
%provides a simple method of combining the two alike classified sounds
%through spectral analysis and combination. The modification here which
%makes the process different, however, is a two tiered coefficient method,
%r_coef and t_coef, which control the spectral makeup and envelope amounts
%from A or B, respectively; providing user control.
A=file1; B=file2;
output=file3;
outputfldr=(['//Applications/MATLAB_SV71/' num2str(file4)]);
%===========================
% ---- INITIALIZATIONS ----
%===========================
n1=framesize; fs=44100;
n2=n1;
N=8192;
win=hanningz(N/2);
c=min(length(A),length(B)); d=max(length(A),length(B));
X1=zeros(d,1); X2=zeros(d,1);
X1(1:length(A))=A;
X2(1:length(B))=B;
L=min(length(X1),length(X2));
X1=[zeros(N,1);X1;zeros(N-mod(L,n1),1)]/max(abs(X1));
X2=[zeros(N,1);X2;zeros(N-mod(L,n1),1)]/max(abs(X2));
X_out=zeros(length(X1),1);


tic
%============================
% ------- Phase Vox --------
%============================
pin=40;
pout=0;
pend=length(X1)-2*N;
while pin<pend
    grain1 = X1(pin+1:pin+N).* win;
    grain2 = X2(pin+1:pin+N).* win;
    f1 = fft(fftshift(grain1));
    r1 = abs(f1);
    theta1 = angle(f1);
    f2 = fft(fftshift(grain2));
    r2 = (abs(f2));
    theta2 = angle(f2);
    r = (r1*(1-r_coef))+(r_coef*r2);
    theta = (theta1*(1-t_coef))+(t_coef*theta2);
    ft = (r.* exp(i*theta));
    grain = fftshift(real(ifft(ft))).*win; %resynthesis
    X_out(pout+1:pout+N) = X_out(pout+1:pout+N) + grain;
    pin = pin + n1;
    pout = pout + n2;
end
toc
%===========================
% -------- Output ---------
%===========================
cd(outputfldr);
X1 = X1(N+1:N+L);
X_out = X_out(N+1:N+L) / max(abs(X_out));
wavwrite(X_out, fs, output);
pvmout.result=X_out;
pvmout.c=c;
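The core of PhaseVoxMut.m, interpolating magnitudes and phases with separate coefficients, can be restated as a short Python sketch. This is an illustrative reduction, not a port: it omits the windowing, fftshift, and overlap-add of the MATLAB version and shows only the per-bin blend.

```python
import numpy as np

def blend_frames(grain1, grain2, r_coef=0.8, t_coef=0.8):
    """Blend two grains in the spectral domain: magnitudes are mixed with
    r_coef, phases with t_coef, as in the r/theta lines of PhaseVoxMut.m."""
    f1, f2 = np.fft.fft(grain1), np.fft.fft(grain2)
    r = (1 - r_coef) * np.abs(f1) + r_coef * np.abs(f2)
    theta = (1 - t_coef) * np.angle(f1) + t_coef * np.angle(f2)
    return np.real(np.fft.ifft(r * np.exp(1j * theta)))

# At (0, 0) the blend returns grain1 unchanged; at (1, 1) it returns grain2;
# intermediate settings mix the spectral makeup of A with the phase of B.
x = np.sin(np.linspace(0, 2 * np.pi, 64, endpoint=False))
y = np.cos(np.linspace(0, 2 * np.pi, 64, endpoint=False))
assert np.allclose(blend_frames(x, y, 0.0, 0.0), x)
assert np.allclose(blend_frames(x, y, 1.0, 1.0), y)
```

Separating the two coefficients is what lets the user take, for instance, the spectral envelope of one drum with the transient phase behaviour of the other.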


9. REFERENCES


[1] E. Ravelli, J.P. Bello, and M. Sandler, "Automatic Rhythm Modification of Drum Loops," IEEE Signal Processing Letters, April 2007.

[2] J.P. Bello, E. Ravelli, and M. Sandler, "Drum Sound Analysis for the Manipulation of Rhythm in Drum Loops," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 5, May 2006, pp. 233-236.

[3] J.P. Bello, C. Duxbury, M. Davies, and M. Sandler, "On the Use of Phase and Energy for Musical Onset Detection in the Complex Domain," IEEE Signal Processing Letters, vol. 11, no. 6, June 2004.

[4] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd ed. New York: John Wiley and Sons – Interscience.

[5] J.P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M.B. Sandler, "A Tutorial on Onset Detection in Music Signals," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, September 2005.

[6] U. Zolzer, "Introduction," in DAFx - Digital Audio Effects, Wiley and Sons, London, 2002.

[7] X. Rodet and F. Jaillet, "Detection and Modeling of Fast Attack Transients," in Proceedings of the International Computer Music Conference, 2001.

[8] J.P. Bello and M. Sandler, "Phase-Based Note Onset Detection for Music Signals," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-03), Hong Kong, 2003.

[9] C. Duxbury, J.P. Bello, M. Sandler, and M. Davies, "Comparison Between Fixed and Multiresolution Analysis for Onset Detection in Musical Signals," in Proc. 7th Conf. on Digital Audio Effects, Naples, Italy, October 2004.

[10] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1.

[11] B. Balazsko, J. Abonyi, and B. Feil, "Fuzzy Clustering and Data Analysis Toolbox," 2004.

[12] D. Arfib, F. Keiler, and U. Zolzer, "Time Frequency Processing," in DAFx - Digital Audio Effects, Wiley and Sons, London, 2002.


[13] R.E. Crochiere, "A Weighted Overlap-Add Method of Short-Time Fourier Analysis/Resynthesis," IEEE Transactions on Acoustics, Speech, and Signal Processing, 281(1), 1980.

[14] T. Eerola and P. Toivianinen, "MIDI Toolbox: MATLAB Tools for Music Research," Department of Music, University of Jyväskylä, Finland, 2004.

[15] C. Duxbury, J.P. Bello, M. Sandler, and M. Davies, "A Comparison Between Fixed and Multiresolution Analysis for Onset Detection in Music Signals," in Proc. 7th Conf. on Digital Audio Effects, Naples, Italy, October 2004.

[16] A. Light, The Vibe History of Hip Hop, New York: Three Rivers Press, 1999.

[17] http://www.propellerheads.se/products/recycle/index.cmf

[18] http://www.fxpansion.com/product-guru-main.php