

Article

Prosodic Structure Shapes the Temporal Realization of Intonation and Manual Gesture Movements

Núria Esteve-Gibert^a and Pilar Prieto^b

Purpose: Previous work on the temporal coordination between gesture and speech found that the prominence in gesture coordinates with speech prominence. In this study, the authors investigated the anchoring regions in speech and pointing gesture that align with each other. The authors hypothesized that (a) in contrastive focus conditions, the gesture apex is anchored in the intonation peak and (b) the upcoming prosodic boundary influences the timing of gesture and intonation movements.

Method: Fifteen Catalan speakers pointed at a screen while pronouncing a target word with different metrical patterns in a contrastive focus condition and followed by a phrase boundary. A total of 702 co-speech deictic gestures were acoustically and gesturally analyzed.

Results: Intonation peaks and gesture apexes showed parallel behavior with respect to their position within the accented syllable: They occurred at the end of the accented syllable in non–phrase-final position, whereas they occurred well before the end of the accented syllable in phrase-final position. Crucially, the position of intonation peaks and gesture apexes was correlated and was bound by prosodic structure.

Conclusions: The results refine the phonological synchronization rule (McNeill, 1992), showing that gesture apexes are anchored in intonation peaks and that gesture and prosodic movements are bound by prosodic phrasing.

Key Words: pointing, temporal synchronization, intonation–gesture coordination, prosody

Studies on the relation between gestures and speech in human communication in recent years have shown that gesture and speech are tightly integrated. In his groundbreaking book, McNeill (1992) gives five main reasons to argue that gesture and speech form a single system in communication: (a) gestures occur with speech in 90% of cases, (b) gesture and speech are phonologically synchronous, (c) gesture and speech are semantically and pragmatically coexpressive, (d) gesture and speech develop together in children, and (e) gesture and speech break down together in aphasia. According to McNeill, gesture and speech synchronize at three different levels: at a semantic level, because gesture and speech present the same meanings at the same time; at a pragmatic level, because the two modalities perform the same pragmatic functions when co-occurring; and at a phonological level, as the most prominent part of the gesture seems to be integrated into the phonology of the utterance.

Broadly speaking, studies on the alignment between gestures and speech assume that gesture and speech are temporally aligned in the sense that the most prominent part of the gesture and the most prominent part of the speech co-occur (Birdwhistell, 1952, 1970; Kendon, 1972, 1980). Yet the specific way this temporal alignment operates is still unclear. To date, there are few empirical studies that have investigated this alignment, and they have used distinct methodologies: in some studies, for example, the prominent part of the gesture aligning with speech is the stroke phase—that is, the interval of time in which the peak of effort in the gesture occurs (McNeill, 1992). For other studies, what temporally coordinates with speech is the apex—that is, the point in time in which the movement reaches its kinetic "goal" (Loehr, 2004). These different methodologies are examined below.

Similarly, there is no consensus in the literature on what part of the speech aligns with the most meaningful part of the gesture (i.e., the stroke phase or the apex). Previous studies can be separated into four groups according to the conclusion that they reached. The first group concluded that the prominence in gesture coordinates with the focused word (Butterworth & Beattie, 1978; Roustan & Dohen, 2010); the second, that the prominence in gesture coordinates with the lexically stressed syllable (Loehr, 2004, 2007; Rochet-Capellan, Laboissière, Galván, & Schwartz, 2008); the third, that the prominence in gesture coordinates with syllables with intonation peaks (De Ruiter, 1998, second experiment; Nobe, 1996); and the fourth, that there is, in fact, no alignment between the prominence in gesture and speech prominence (De Ruiter, 1998, first experiment; McClave, 1994; Rusiewicz, 2010). All of these studies used different methodologies, so it is worth taking a closer look at them in order to better understand their results.

^a Universitat Pompeu Fabra, Barcelona, Spain
^b Institució Catalana de Recerca i Estudis Avançats, Universitat Pompeu Fabra, Barcelona, Spain
Correspondence to Núria Esteve-Gibert: [email protected]
Editor: Jody Kreiman
Associate Editor: Shari Baum
Received February 7, 2012
Revision received August 10, 2012
Accepted September 20, 2012
DOI: 10.1044/1092-4388(2012/12-0049)
Journal of Speech, Language, and Hearing Research, Vol. 56, 850–864, June 2013. © American Speech-Language-Hearing Association

Butterworth and Beattie (1978) studied iconic gestures and found that gestures tend to start in a pause just before a corresponding noun, verb, adverb, or adjective. Similarly, Roustan and Dohen (2010) also found that the apex coordinates with prosodic focus in a broad sense. They tested 10 adult French speakers in an experimental task in which they had to point at an object appearing on a monitor screen or perform a beat gesture while pronouncing a word in a contrastive focus condition. The results showed that the apex of pointing and beat gestures overlapped with the focused word.

Another group of results showed that the most meaningful part of the gesture coordinates with prominent (or pitch-accented/stressed) syllables of speech. First in his dissertation and later in an article, Loehr (2004, 2007) investigated the alignment between gesture strokes and pitch-accented syllables in 15 English speakers engaged in natural conversations. The author selected four clips from these conversations and studied all the gestures they produced (iconics, deictics, metaphorics, emblems, beats, and head movements). Results showed that gesture apexes were significantly more likely to align with pitch-accented than with non–pitch-accented syllables. Two other studies investigated the gesture–syllable alignment in experimental settings. Rochet-Capellan et al. (2008) investigated the temporal coordination between deictic pointing gestures and jaw movements in a task in which 20 Brazilian Portuguese speakers had to point at a target on a screen while naming a nonword. The alignment of two measures was analyzed: the pointing plateau (the amount of time during which the finger remained pointed at the target) and the jaw-opening apex. Results showed that when the stress was in the first syllable position (e.g., ˈpapa), the beginning of the pointing plateau aligned with the jaw-opening apex of stressed syllables. However, when the stress was in the second syllable position (e.g., paˈpa), the end rather than the beginning of the pointing plateau synchronized with the jaw-opening apex.

The coordination between gesture and speech has also been studied with a focus on the intonational–metrical structure of speech. In general, these studies have shown that prominent syllables with intonation peaks coordinate with gestural strokes or apexes. Nobe (1996, Experiment 4) studied iconic, deictic, and metaphoric gestures in cartoon narrations of six English speakers. The author selected a sample of data (no explanation was given regarding the selection criteria) and, after analyzing it, found that most of the gestures occurred in speech fragments including both a pitch peak and an intensity peak. A more precise study was carried out by De Ruiter (1998, Experiment 2), who tested eight Dutch speakers in a controlled experimental setting in order to investigate the alignment of deictic gestures and prosodic structures in a context of contrastive focus production. Results revealed that prosodic structure influenced the gesture movement because the gesture launch (i.e., the distance in time between the beginning of the preparation phase and the apex) was longer when the word had the stress in phrase-final position and shorter when the word had the stress in non–phrase-final position. He thus showed that the apex aligns with the accented syllable irrespective of the metrical structure of the target word. The author concluded that the contrastive focus condition was crucial to obtaining these results because it enhanced the phonetic realization of stress and introduced a more marked intonational contour in the production of speech.

Such findings were not obtained, however, in the first experiment in De Ruiter (1998). The author analyzed the coordination between deictic pointing gestures and lexically stressed syllables (and not pitch-accented syllables, as in his second experiment) in disyllabic and trisyllabic words with different metrical patterns. Participants were eight Dutch speakers who were presented with a controlled setting in which they had to point at a target picture on a screen while naming it. The results showed that the temporal parameters of the gesture apex were not affected by stress position. In contrast with his previous hypotheses, he actually found that gesture apexes did not always occur in stressed syllables. Thus, when the stress was in phrase-final position, apexes occurred in non–phrase-final position, and vice versa. Therefore, not all studies investigating the coordination between the prominent part of gestures and prominent syllables have found that they are synchronized. Actually, the first author to study concrete examples of gesture–speech combinations (McClave, 1994) did not find temporal coordination between the meaningful part of the gesture and stressed syllables, either. This author studied beat gestures in spontaneous conversations of four speakers, and her quantitative results revealed that apexes of beat gestures coincided with both stressed and unstressed syllables. More recently, Rusiewicz (2010) analyzed the influence of contrastive pitch accents and syllable position on the apexes of deictic pointing gestures. Fifteen English speakers participated in a controlled task in which they had to point at a target object on a screen while naming it. The results of this study showed that metrical structure did not influence the timing of the gesture because gesture apexes were time-aligned with the initial onset of the target word rather than occurring with prosodically prominent syllables.

Thus, results of most of the previous studies have shown that in multimodal communication, the most prominent part of gestures coordinates with prominent syllables, whether lexically stressed syllables or pitch-accented syllables. However, the coordination of gesture and speech has been studied rather imprecisely by comparing two distinct measures: Some have investigated the alignment of two intervals of time—namely, the stroke of the gesture and the prominent syllable or prominent word—and others have investigated the alignment between one specific moment and an interval of time—namely, the apex of the gesture and the most prominent syllable or word. However, as far as we know, no study has tried to investigate the alignment between two specific points in time: the apex of the gesture and a specific point within the most prominent syllable. Only Rusiewicz (2010) took the vowel midpoint as a measure to compare its alignment with the apex of the gesture. The vowel midpoint is nevertheless an arbitrary anchoring point in the syllable, with no relation to its intonational–metrical structure. Instead of the vowel midpoint measure, we propose another anchoring point that may attract the prominence in gesture—namely, the peak of the fundamental frequency line: the intonation peak. To our knowledge, by investigating the position of the pitch peak in relation to the peak of the gesture (i.e., the apex), our study will be the first to take two specific points in time as measures for studying the synchrony between gesture and speech.

The goal of this study was to investigate how gesture and speech structures synchronize in time. To investigate this issue, the timing of co-speech deictic gestures was analyzed with respect to the prosodic structure of target words and sentences. Crucially, the target words had distinct metrical structures and were produced in a contrastive focus condition. In Catalan, contrastive focus is produced with fronting of the focused word, which is produced with a prominent L+H* pitch accent (a rising pitch movement during the accented syllable) and crucially followed by a low boundary tone, which allowed us to trigger distinct stress and intonation peak positions (see the Materials section for further details). We tested two hypotheses: (a) that it is the peak of the focused pitch accent in the intonational structure that serves as an anchoring site for the most prominent part of gestures and (b) that both intonation and gestural movements are bound by prosodic structure, and thus the presence of the upcoming low boundary tone should produce similar effects in both intonation and gestural movements. From the analyses, two potential conclusions can be drawn: (a) that the gesture apexes coordinate with the accented syllable (if they occur during the accented syllable but their position with respect to the accented syllable does not vary when the accented syllable changes its position in the intonational phrase) or (b) that the gesture apexes coordinate with F0 peaks and show the same alignment patterns (if their position varies depending on the position of the accented syllable with respect to the phrase boundary).
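The alignment measure implied by these hypotheses can be sketched as follows: express the time of the F0 peak and of the gesture apex as a relative position within the accented syllable (0 = syllable onset, 1 = syllable offset), then compare phrase-final and non-phrase-final tokens. All timestamps, field names, and values below are invented for illustration; this is not the authors' analysis code.

```python
from dataclasses import dataclass

@dataclass
class Token:
    """One gesture-speech combination (times in seconds from trial onset)."""
    syllable_on: float   # start of the accented syllable
    syllable_off: float  # end of the accented syllable
    f0_peak: float       # time of the intonation peak
    apex: float          # time of the gesture apex
    phrase_final: bool   # accented syllable immediately precedes the boundary

def rel_position(t: float, on: float, off: float) -> float:
    """Position of event t within the syllable: 0 = onset, 1 = offset."""
    return (t - on) / (off - on)

# Hypothetical tokens: a trochee (stress non-phrase-final) and an iamb
# (stress phrase-final), with made-up timing values.
trochee = Token(0.50, 0.74, 0.73, 0.72, phrase_final=False)
iamb = Token(0.62, 0.90, 0.70, 0.71, phrase_final=True)

for tok in (trochee, iamb):
    peak = rel_position(tok.f0_peak, tok.syllable_on, tok.syllable_off)
    apex = rel_position(tok.apex, tok.syllable_on, tok.syllable_off)
    print(f"phrase_final={tok.phrase_final}: peak at {peak:.2f}, apex at {apex:.2f}")
```

Under hypothesis (b), relative positions near 1 are expected in non-phrase-final tokens and well below 1 in phrase-final tokens, for both events in parallel.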

The present study focused on the temporal coordination of co-speech deictic pointing gestures. Pointing deictic gestures were chosen because contradictory results have been obtained when studying the synchronization between gesture and speech for this gesture type (De Ruiter, 1998; Loehr, 2004, 2007; Rusiewicz, 2010). Our aim was to contribute more evidence to the previous work on the temporal coordination of gesture and speech prominence.

Method

Participants

The participants were 15 adult native speakers of Catalan (two men, 13 women) ages 20–30 years. They were paid €5 for participating in the study and were all right-handed. They were not aware of the purpose of the experiment.

Procedure

Participants were audiovisually recorded using a Panasonic HD AVCCAM, which records at 25 frames per second. The camera was placed at a distance of approximately 1 m from the participant, as close as possible so as to capture the entire pointing-arm movement, and focused on the participant's left profile. Also, a RØDE NTG-2 microphone was placed 20 cm from the participant's mouth.

The procedure we followed was based on Rochet-Capellan et al. (2008).[1] Each participant was seated at a table that was approximately 50 cm from the screen (see Figure 1 for an illustration of the experimental setting). Participants had to perform a pointing–naming task in which they had to point at a smiley face projected on the screen. The projected face was first red, and after a short period of time (varying from 1 s to 3.5 s), the red face turned to green. At that specific point in time, participants had to pronounce a sentence projected directly under the colored face. This color change worked as a "go" signal and was used to avoid anticipatory responses and to make participants pay attention to the experiment. A black square pasted on the middle of the table was used to indicate the finger-resting position. We chose an experimental setting rather than a naturalistic conversation to test our hypothesis because the pointing–naming task permitted us to control with high accuracy two sets of variables that were crucial for our purpose—that is, arm trajectory and prosodic structure. Also, other studies in the field (De Ruiter, 1998; Rusiewicz, 2010) used a methodology consisting of an experimental pointing–naming task; thus, we would be able to compare our results with theirs.

[1] Our procedure was based on Rochet-Capellan et al. (2008) because we elicited the pointing–naming gestures following their experimental design and procedure. However, our data recording and postprocessing methodology differed from theirs, as they used Optotrak for their analysis of arm and jaw movements, whereas we used acoustic and visual measures. See the subsection titled "Coding."

Figure 1. Experimental setting.
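The randomized red-to-green delay described above can be sketched as a simple draw from a uniform distribution; the function name, seed, and rounding are our own illustrative choices, not part of the original protocol.

```python
import random

def go_signal_delays(n_trials: int, lo: float = 1.0, hi: float = 3.5,
                     seed: int = 0) -> list[float]:
    """Draw one red-to-green delay per trial, uniform on [lo, hi] seconds,
    so that participants cannot anticipate the 'go' signal."""
    rng = random.Random(seed)
    return [round(rng.uniform(lo, hi), 2) for _ in range(n_trials)]

delays = go_signal_delays(48)  # one delay per experimental token
assert all(1.0 <= d <= 3.5 for d in delays)
```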



Prior to the experiments, participants were told to imagine that the instructor had just pointed at an object on the screen and pronounced the name of that object incorrectly. Then, participants were told that they now had to point at the smiley face to indicate its exact position in relation to the instructor while pronouncing the object's name properly. This pointing–naming task allowed us to obtain a co-speech deictic gesture with a target word focalized with respect to both the speech and the gesture: The target word was pronounced in a contrastive focus condition, and the pointing gesture focally served to establish joint attention and to inform about the location (Enfield, Kita, & De Ruiter, 2007). All the information relative to how the instructor mispronounced the object and its correct pronunciation appeared below the smiley face in a sentence such as "PApa, es diu, i no paPA" (in English, It's pronounced PApa, not paPA). Thus, participants had to point at the smiley face while pronouncing the focus word PApa. The accented syllables of the target words were underlined to facilitate the reading task. These instructions also appeared on alternate slides during the experimental trials. In addition, before starting the experiment, participants were briefly trained to become familiar with the task: First, they had to pronounce aloud the words that would appear during the experiment, and second, they had to practice the pointing–naming task they would perform.

The experiment was divided into three blocks. The first block was a familiarization task in which participants were asked to read aloud the sentences containing the target words they would encounter during the experimental trial. The second block was also a familiarization task in which participants had to practice the pointing–naming task with one repetition of each particular stimulus (12 items total). Finally, the third block consisted of the experimental trial, in which four repetitions per stimulus were randomly displayed on the screen (12 items × 4 repetitions = 48 tokens).
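The composition of the experimental block (12 items × 4 repetitions, randomized) can be sketched as follows; the item labels and the seed are placeholders of our own, not the actual stimulus names.

```python
import random

# Hypothetical labels standing in for the 12 target words.
items = [f"item_{i:02d}" for i in range(12)]

def experimental_trials(items: list[str], reps: int = 4, seed: int = 1) -> list[str]:
    """Third block: every stimulus repeated `reps` times, in random order."""
    trials = items * reps
    random.Random(seed).shuffle(trials)
    return trials

trials = experimental_trials(items)
assert len(trials) == 48  # 12 items x 4 repetitions
```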

During the second block, in which participants had to practice the pointing–naming task, it was possible that participants would point at the screen while naming other parts of the sentence but not the target word. In such cases, they were reminded of the purpose of their pointing with the following instruction: "Remember that you are correcting my pronunciation of the target word while pointing to its exact position. The important information is the proper name 'MAma' (for instance), so the rest of the sentence is as if you are telling this information to yourself." Once this reinforcement of the instructions was given, the second block with the familiarization of the pointing–naming task was run again. Nevertheless, only two of the 15 participants in the study had to repeat this second block with the familiarization materials.

Materials

The target words in focus appearing below the smiley face had varying numbers of syllables and varying stress positions, namely, ˈCV, CVˈCV, and ˈCVCV. The target consonants also varied to keep participants' attention focused on the task. Thus, the following 12 target words were used: four monosyllables ([ˈpa], [ˈta], [ˈma], [ˈna]); four trochees ([ˈpapə], [ˈtatə], [ˈmamə], [ˈnanə]); and four iambs ([pəˈpa], [təˈta], [məˈma], [nəˈna]). These test materials included voiceless consonants surrounding some of the target accented vowels (as in the words [ˈpapə], [ˈtatə], etc.). This was not a problem for the location of the F0 peaks because, as mentioned in previous studies on the location of F0 peaks in contrastive focus pitch accents (L+H*) in Catalan (Vanrell, 2011), F0 peaks in contrastive focus pitch accents are reliably located at the end of the accented vowel.
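The stimulus set has a fully crossed structure, four consonant onsets by three metrical templates, which can be sketched as below (transcriptions follow the article's IPA forms; the helper name is our own).

```python
# Four consonants crossed with three metrical patterns = 12 target words.
onsets = ["p", "t", "m", "n"]

def build_targets(onsets: list[str]) -> list[str]:
    mono = [f"ˈ{c}a" for c in onsets]         # [ˈpa], [ˈta], [ˈma], [ˈna]
    trochee = [f"ˈ{c}a{c}ə" for c in onsets]  # [ˈpapə], [ˈtatə], ...
    iamb = [f"{c}əˈ{c}a" for c in onsets]     # [pəˈpa], [təˈta], ...
    return mono + trochee + iamb

targets = build_targets(onsets)
assert len(targets) == 12
```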

The target words with different metrical patterns were pronounced in a contrastive focus condition and were followed by a clear phrase boundary (induced by the semantic meaning of the utterance and by an orthographic comma) to trigger different temporal positions of the intonation peaks within the accented syllables.

It is well established that F0 peak location varies with the presence of upcoming prosodic events, such as other pitch accents (clash situations) and boundary tones (for a summary, see Silverman & Pierrehumbert, 1990). There is a large body of cross-linguistic research on tonal compression effects caused by tonal crowding; this research has demonstrated that pitch pressure environments (namely, proximity to a boundary tone or to an upcoming pitch accent) can drastically affect surface H alignment patterns (Prieto, 1995, for Catalan; Prieto, van Santen, & Hirschberg, 1995, for Spanish; Silverman & Pierrehumbert, 1990). Specifically, the location of F0 peaks in rising pitch accents (namely, L+H*) can be directly influenced by the presence of an upcoming low boundary tone. When H* accents occur in phrase-final nuclear position before low boundary tones, they are moved to the left by the closeness of the boundary tone (see work on rising accents in English by Silverman & Pierrehumbert, 1990; in Spanish by Prieto et al., 1995; and in Italian by D'Imperio, 2001). That is, the articulatory gesture for the pitch accent is repelled earlier in time in order for the pitch accent and the following boundary tone to be produced in the same tone-bearing unit (the phrase-final syllable). Figure 2 displays a schematic representation of the distinct locations of the F0 peak in the two metrical patterns and in a narrow–contrastive focus condition. Figure 3 shows three examples as a comparison of the F0 movement produced by participants in the contrastive focus conditions when the stress was in the word-final syllable (central and bottom panels) and when the stress was in the penultimate syllable (top panel).

Figure 2. Schematic representation of the realization of the nuclear pitch accent in narrow–contrastive focus utterances, in words with penultimate stress and word-final stress (Prieto & Ortega-Llebaria, 2009).

The contrastive focus utterances produced during the pointing–naming task followed the expected prosodic pattern (see Vanrell, 2011; see Figure 4). Target sentences consisted of two prosodic phrases. The first prosodic phrase contained the fronted focus of the communicative act—that is, the highlighted element with sentence-level prominence, which was pronounced with a prominent L+H* pitch accent on the accented syllable, followed by a low boundary tone. The second prosodic phrase was the postfocal material with prosodic deaccenting (i.e., produced with global pitch range compression). Three examples of these sentences can be observed in Figure 3. In these examples, we can appreciate the location of the nuclear L+H* pitch accent on the accented syllable of the focal elements (NAna in the top panel and naNA and NA in the central and bottom panels, respectively). The difference between the three realizations of the focal prominence is that the low boundary tone has to be realized within the accented syllable in the iambic and monosyllabic conditions (e.g., naNA and NA in the central and bottom panels), whereas it can be realized in the post-tonic syllable in the trochaic condition (e.g., NAna; see top panel).

Coding

In total, 720 instances of deictic gesture–speech combinations were obtained from the experimental task. From these, 17 gesture–speech combinations were excluded from the analysis because the target words were mispronounced (16 items) or because the pointing movement was interrupted (one item). Thus, 703 instances of deictic gesture–speech combinations were acoustically and gesturally analyzed.

The gesture and acoustic measures were annotated in separate tiers using the ELAN software package (Lausberg & Sloetjes, 2009), as shown in Figure 5. For the gesture analysis, three distinct measures were coded. First, we measured the limits of the pointing gesture, from the start of the preparation phase (i.e., the initial arm movement) to the end of the retraction phase (i.e., the final arm movement; Tier 1 in Figure 5). Second, we measured gesture phases (Tier 2 in Figure 5). Following the method of McNeill (1992), the gesture phases taken into account were the preparation (the arm moves from the rest position until the peak of effort in the gesture), the stroke (the peak of effort in the gesture), and the retraction (the arm moves from its farthest extension back to the rest position). Third, we measured the apex (Tier 3 in Figure 5), which is a specific point in time within the most meaningful part of the gesture. The apex measurement has been used in other studies investigating gesture–speech synchronization because it makes it possible to identify the point of maximum prominence in the gesture.
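The three gesture tiers amount to labeled intervals plus one point annotation, which can be represented minimally as below; the phase boundaries and apex time are invented values for illustration, not measurements from the study.

```python
# Minimal sketch of the gesture tiers as labeled intervals
# (times in seconds; values are invented for illustration).
gesture_phases = [
    ("preparation", 0.10, 0.42),
    ("stroke", 0.42, 0.80),
    ("retraction", 0.80, 1.15),
]
apex = 0.66  # point annotation on Tier 3

# Phase durations, rounded to the nearest 10 ms.
durations = {label: round(off - on, 2) for label, on, off in gesture_phases}

# Consistency check: the apex must fall within the stroke interval.
stroke_on, stroke_off = gesture_phases[1][1], gesture_phases[1][2]
assert stroke_on <= apex <= stroke_off
print(durations)  # {'preparation': 0.32, 'stroke': 0.38, 'retraction': 0.35}
```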

In order to locate the stroke and apex of the pointinggesture, we examined the video file. ELAN allows precise,frame-by-frame navigation through the video recording.Even though the software program also allows a more preciseannotation (2 ms × 2 ms), this option could not be appliedbecause the video was recorded with a frame rate of 25 framesper second. From the one hand, the stroke of the gesture wasannotated in those video frames in which the arm was wellextended with no blurring of the image, with fingertip totallyextended or not. Despite the absence of blurring of the image,the arm was not totally static during the interval of thegesture stroke; the fingertip moved a few pixels back andforth. From the other hand, the gesture apex was annotated

Figure 3. Fundamental frequency of the target sentence pronounced by one participant in the three metrical structures: trochaic condition (top panel), iambic condition (middle panel), and monosyllabic condition (bottom panel).

Figure 4. Example of a slide projected in front of the participants with the sentence mama, es diu, i no mama (in English, It's pronounced mama, not mama).

854 Journal of Speech, Language, and Hearing Research · Vol. 56 · 850–864 · June 2013


in the specific video frame in which we located the farthest spatial excursion of the fingertip during the interval of time in which the arm was maximally extended. When participants performed the pointing gesture more slowly, this gesture peak could last more than one frame (typically, two frames). In those cases, the gesture peak was considered to be the last of these video frames.
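The frame-based timing convention just described can be sketched as follows. This is a hypothetical helper for illustration only, not the authors' annotation scripts; the 25-fps frame rate is the one reported for the video recordings above.

```python
# Sketch of the apex-timing convention: at 25 fps each video frame spans
# 40 ms, and when the maximal arm extension lasts more than one frame
# (slow gestures), the apex is coded as the LAST such frame.

FPS = 25
FRAME_MS = 1000 // FPS  # 40 ms per frame at 25 fps

def frame_to_ms(frame_index):
    """Start time (in ms) of a given 0-indexed video frame."""
    return frame_index * FRAME_MS

def apex_time_ms(apex_frames):
    """Apex time when the maximal extension spans one or more frames:
    the last frame of the run is taken as the apex."""
    return frame_to_ms(max(apex_frames))

print(apex_time_ms([112]))       # single-frame apex
print(apex_time_ms([112, 113]))  # slow gesture: apex coded on the last frame
```

The 40-ms frame duration also makes explicit why the gestural annotation cannot be finer-grained than the acoustic one.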

For the acoustic analysis, three measures were annotated using the Praat software package (Boersma & Weenink, 2012) and then imported into ELAN: the beginning and end of the target focused word (Tier 4 in Figure 5), the beginning and end of the accented syllable (Tier 5 in Figure 5), and the pitch peak of the F0 line (Tier 6 in Figure 5). Figure 6 is an example of the acoustic labeling in Praat that was later imported into ELAN. As can be observed, Praat allows annotating the acoustic signal with a time step of 0.001 seconds. The precise millisecond at which these acoustic measures occurred in Praat was later annotated in ELAN by means of the "accessing points in time" option that this annotation software offers. The pitch peak was easily located in the F0 line because target words were pronounced in a contrastive focus condition, so the rising contour was very salient. Following the method of Knight and Nolan (2006), if participants did not pronounce the target word emphatically and, as a consequence, the pitch peak was not salient in the F0 line, then the pitch peak was located at the end of the plateau. Knight and Nolan (2006) showed that whenever an H tone of a nuclear accent is realized as a flat stretch of contour rather than as a single turning point, the end of the plateau is the point that is anchored within the syllable; it appears equivalent to the alignment of H turning points and is thus a marker of linguistic structure.
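The peak-location rule just described (take the F0 maximum, or the end of the plateau when the H tone is realized as a flat stretch) can be sketched as follows. This is our own illustration, not the study's annotation procedure; the tolerance parameter `plateau_tol_hz` is a hypothetical threshold for deciding that adjacent samples form a plateau.

```python
def locate_pitch_peak(times_ms, f0_hz, plateau_tol_hz=2.0):
    """Return the time (ms) of the F0 peak. If the maximum is realized as a
    plateau (adjacent samples within plateau_tol_hz of the maximum), return
    the END of the plateau, following Knight & Nolan (2006)."""
    f0 = list(f0_hz)
    peak = max(f0)
    i = f0.index(peak)  # first sample at the maximum
    # extend rightwards while the contour stays within the plateau tolerance
    while i + 1 < len(f0) and f0[i + 1] >= peak - plateau_tol_hz:
        i += 1
    return times_ms[i]

# A salient single turning point: the peak sample itself is returned.
print(locate_pitch_peak([0, 10, 20, 30, 40], [120, 180, 220, 180, 150]))  # 20
# A flat H plateau: the end of the plateau is returned.
print(locate_pitch_peak([0, 10, 20, 30, 40, 50],
                        [120, 180, 200, 200, 199, 150]))  # 40
```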

Results

The results of the acoustic and gestural analyses show how gesture and prosodic structures are coordinated. We sought to answer the following research questions: (1) Does the position of the intonation peaks within the accented

Figure 5. Image stills of a pointing gesture annotated in ELAN. Below, five specific enlarged images during the gesture: (1) during the preparation phase, (2) during the stroke phase and before the apex is reached, (3) at the apex, (4) during the stroke and after the apex is reached, and (5) during the retraction phase.



syllable change when the position of the accented syllable with respect to the phrase boundary is different? If so, our results would corroborate previous findings by Silverman and Pierrehumbert (1990), Prieto et al. (1995), and D'Imperio (2001), in the sense that the position of the F0 peak varies depending on the position of the stress with respect to the phrase boundary. (2) In what way does the position of the apexes change with respect to the end of the accented syllable when the position of the accented syllable within the intonational phrase is different? (3) Does the timing of intonation peaks correlate with the timing of gesture apexes? If the results of Questions 1 and 2 show that intonation peaks and apexes vary depending on the distance of the accented syllable from the boundary tone, then we would show that intonation peaks and gesture apexes are synchronized and follow the same timing patterns.

Our results are presented in three subsections, one for each research question. The first subsection presents the results for the position of intonation peaks with respect to the accented syllable, taking the end of the accented syllable as the reference point to measure their position. The second subsection shows the results for the position of gesture apexes with respect to the end of the accented syllable, as well as results for the timing of the gesture onset with respect to the speech signal. The third subsection presents the results obtained for the third research question, namely, whether intonation peaks and gesture apexes align with each other.

Position of the Intonation Peaks With Respect to the End of the Accented Syllable

The data were analyzed using a repeated measures analysis of variance (RM ANOVA). The independent factor was pitch accent position within the word (three levels: monosyllables, trochees, iambs), and the dependent variable was the distance in milliseconds between the intonation peak and the end of the accented syllable.² Mauchly's test indicated that the assumption of sphericity had been violated, χ²(2) = 13.1, p < .01, so the degrees of freedom were corrected using Greenhouse–Geisser estimates of sphericity (ε = 0.95). The statistical analysis revealed that the position of the intonation peaks within the accented syllable was significantly different depending on the pitch accent position, F(2, 434) = 580, p < .001, ηp² = .717. As expected, Bonferroni pairwise comparisons showed that the distance in time between intonation peaks and the ends of accented syllables did not vary significantly between monosyllables and iambs (p = 1.000), whereas it differed significantly when comparing monosyllables and trochees (p < .001) and when comparing trochees and iambs (p < .001). Figure 7 shows the time (in milliseconds) between intonation peaks and the end of the accented syllable as a function of the distinct pitch accent positions analyzed.
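For readers unfamiliar with the sphericity correction applied here and in the analyses below, the Greenhouse–Geisser estimate can be computed from the covariance matrix of the repeated-measures conditions. The sketch below is our own illustration of the standard formula, not the authors' analysis code: ε = 1 indicates that sphericity holds, and the ANOVA degrees of freedom are multiplied by ε.

```python
import numpy as np

def gg_epsilon(S):
    """Greenhouse-Geisser epsilon from the k x k covariance matrix S of the
    repeated-measures conditions. Bounds: 1/(k-1) <= epsilon <= 1, with
    epsilon = 1 when sphericity holds exactly."""
    S = np.asarray(S, dtype=float)
    k = S.shape[0]
    C = np.eye(k) - np.ones((k, k)) / k   # centering matrix
    Sc = C @ S @ C                        # double-centered covariance
    return np.trace(Sc) ** 2 / ((k - 1) * np.sum(Sc ** 2))

# Compound symmetry (sphericity holds) -> epsilon of 1:
print(gg_epsilon([[3, 1, 1], [1, 3, 1], [1, 1, 3]]))  # ~1.0
# Maximal violation for k = 3 -> epsilon of 1/(k-1) = 0.5:
print(gg_epsilon([[1, 0, 0], [0, 0, 0], [0, 0, 0]]))  # ~0.5
```

With ε = 0.95, as reported above, the correction is mild: the violation of sphericity detected by Mauchly's test is small.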

These results confirm previous findings (D'Imperio, 2001; Prieto et al., 1995; Silverman & Pierrehumbert, 1990) showing that the position of H intonation peaks varies depending on the distance in time (in milliseconds) to the upcoming boundary tone. Thus, intonation peaks are retracted in monosyllables and in iambs in a contrastive focus condition, in such a way that the rising accent is realized as a complex rise–fall movement if in phrase-final

Figure 6. Example of acoustic labeling in Praat of the target word naˈna. Freq. = frequency; acc. = accented; syll. = syllable.

²In order to test whether consonant status had an effect on the timing of the intonation peak, a RM ANOVA was carried out, with consonant voicing as the independent factor (two levels: voiced nasals, unvoiced stops) and the distance (in ms) between the intonation peak and the end of the accented syllable as the dependent variable. The analysis indicated that the position of the F0 peak within the accented syllable was not significantly affected by the voicing of the consonant, F(1, 344) = 0.012, p = .913. This result confirms previous claims on the location of F0 H peaks for rising pitch accents in contrastive focus in Catalan: F0 peaks are consistently located at the end of the accented syllable, regardless of variation in consonant identity (see, e.g., Vanrell, 2011).



position. Thus, our results corroborate previous findings in the literature by showing that the intonation peaks are retracted in the case of monosyllables (M = 141.30 ms, SD = 53.04) and iambs (M = 137.52 ms, SD = 50.36) and that they are not retracted in the case of trochees (M = 11.90 ms, SD = 39.1), thus occurring almost at the end of the accented syllable.

Position of the Gesture Apexes With Respect to the End of the Accented Syllable

Again, the data were analyzed using a RM ANOVA. The independent factor was pitch accent position within the word (three levels: monosyllables, trochees, iambs), and the dependent variable was the distance in time (in milliseconds) between the apex and the end of the accented syllable. Mauchly's test indicated that the assumption of sphericity had been violated, χ²(2) = 9.5, p < .01, so the degrees of freedom were corrected using Greenhouse–Geisser estimates of sphericity (ε = 0.96). The statistical analysis revealed that the position of the gesture apexes within the accented syllable was significantly different depending on the pitch accent position it occupied, F(2, 440) = 197, p < .001, ηp² = .462. Bonferroni pairwise comparisons showed that the distance in time between gesture apexes and the end of accented syllables varied significantly between monosyllables and iambs (p < .001), between iambs and trochees (p < .001), and between monosyllables and trochees (p < .001). Figure 8 shows the distance in time between the gesture apex and the end of the accented syllable as a function of the distinct pitch accent positions analyzed.

When the box plots of the distance in time between intonation peaks and the end of the accented syllable (see Figure 7) are compared with the box plots of the distance in time between apexes and the end of the accented syllable (see Figure 8), it can be observed that both intonation peaks and apexes occur at the end of the accented syllable in trochaic words, that is, when the stress is in non–phrase-final position. When the stress is in phrase-final position, intonation peaks and apexes occur well before the end of the accented syllable. However, the statistical analyses show that there is, in fact, a difference between the results of these two analyses. Specifically, monosyllables and iambs behave similarly in terms of the position of the intonation peak within the accented syllable, whereas they behave significantly differently in terms of the position of the gesture apex within the accented syllable, which is less retracted in monosyllables than in iambs.

In order to find out why gesture apexes are less retracted in monosyllables than in iambs, whereas intonation peaks are equally retracted in both metrical structures, a closer analysis of the coordination between the gesture phases and the metrical structures was carried out. The specific metrical structure of monosyllabic words (i.e., words with the stress in phrase-final position and without pre-tonic segmental material), in comparison with the metrical structure of iambic words (i.e., words with the stress in phrase-final position and with pre-tonic segmental material), might influence the gesture phases and thus cause this lagging effect of gesture apexes.

We hypothesize that the gesture onset starts being realized before the accented syllable when the stress is in

Figure 7. Box plots of the distance in time (in milliseconds) between the intonation peak and the end of the accented syllable as a function of the three distinct accented syllable positions within the word. Error bars indicate 95% confidence intervals (CIs).



phrase-medial position, so that the gesture apex is not delayed by a preceding phrase boundary. This gesture onset timing then makes the gesture stroke occur at different positions with respect to the onset of the accented syllable. Our hypothesis is that the specific metrical structure of monosyllabic words, in comparison with that of iambic words, might have influenced the timing of the preparation phase, thus causing a lagging effect of gesture apexes. In order to test this hypothesis, two distinct analyses were carried out: first, to test whether the pre-tonic syllable in iambs contains part of the gesture preparation phase; second, to test whether the difference in timing was due to a difference in gesture velocity or to a different initiation time of the gesture with respect to the accented syllable.

For the first analysis, a RM ANOVA was carried out, with pitch accent position as the independent variable (three levels: monosyllables, trochees, iambs) and the distance between the stroke onset and the onset of the accented syllable as the dependent variable. Mauchly's test indicated that the assumption of sphericity had not been violated, χ²(2) = 2.6, p = .278, so sphericity was assumed (ε = 0.99). The statistical analysis revealed that the distance between the stroke onset and the onset of the accented syllable was significantly different depending on the position of the pitch accent in the target word, F(2, 458) = 108, p < .001, ηp² = .321. Bonferroni pairwise comparisons showed that the distance between the beginning of the stroke and the beginning of the accented syllable varied significantly between monosyllables and iambs (p < .001) and between trochees and iambs (p < .001), but not between monosyllables and trochees (p = 1.000). Figure 9 illustrates the distance in time (in milliseconds) between the beginning of the stroke and the beginning of the accented syllable as a function of the distinct pitch accent positions analyzed, showing that the stroke of the gesture starts when the accented syllable has already been initiated in monosyllables and trochees. In contrast, the stroke and the accented syllable start almost simultaneously in the iambic condition. These results show that when the stress is in non–phrase-initial position, the pre-tonic syllable already contains part of the preparation phase of the gesture.

We performed an additional RM ANOVA to test whether the onset of the gesture was different across conditions. The distance between the gesture onset and the onset of the accented syllable was the dependent variable, and stress position within the phrase was the independent variable. Mauchly's test indicated that the assumption of sphericity had been violated, χ²(2) = 187.4, p = .001, and the degrees of freedom were therefore corrected using Greenhouse–Geisser estimates of sphericity (ε = 0.637). The statistical analysis revealed that the distance between the gesture onset and the onset of the accented syllable was significantly affected by the position of the accented syllable with respect to the phrase boundary, F(2, 284) = 29, p < .001. Bonferroni post hoc tests showed that monosyllables and trochees did not differ significantly (p = 1.000), whereas iambs differed significantly from monosyllables (p < .001) and from trochees (p < .001). The mean values of this distance reveal that, in monosyllables, the gesture started 460.50 ms before the accented syllable; in iambs, the initiation of the gesture occurred 576.38 ms before the

Figure 8. Box plots of the distance in time (in milliseconds) between the apex and the end of the accented syllable as a function of the three different positions of the accented syllable within the phrase. Error bars indicate 95% CIs.



accented syllable started being pronounced; and in trochees, it occurred 475.82 ms before the accented syllable. Thus, the gesture is initiated at different points in time with respect to the speech signal, being initiated earlier in iambs than in the other two conditions.

Second, in order to test whether this difference in the timing of peak and speech prominence was due to a difference in the duration of the preparation phase (or launching of the pointing gesture, a measure that is indirectly related to the velocity of the launching movement), an additional RM ANOVA was performed. The distance between the gesture onset and the gesture apex was the dependent variable, and pitch accent position was the independent variable. Mauchly's test indicated that the assumption of sphericity had been violated, χ²(2) = 172.6, p < .001, and the degrees of freedom were therefore corrected using Greenhouse–Geisser estimates of sphericity (ε = 0.649). The statistical analysis revealed that the distance between these two points was not significantly affected by the pitch accent position, F(2, 290) = 3, p = .091. This result suggests that, because the duration of the launching gesture does not differ across conditions, the velocity profiles of the launching movements do not seem to be responsible for the different locations of the gesture apexes.

Altogether, the results of the previous analyses offer an explanation for why the gesture apexes occur later in monosyllables than in iambs with respect to the accented syllable: The gesture has to start later in monosyllables than in iambs (because monosyllables have no pre-tonic material to contain the preparation phase), and so the apex is also "pressed" toward the end of the accented syllable in the monosyllabic condition due to a lagging effect. Table 1 shows several still

images of the timing targets of the pointing gesture in the two conditions in which the stress is in phrase-final position. At the beginning of the accented syllable, the arm is realizing the preparation phase in monosyllables (top left), whereas it is already in the stroke phase in iambs (bottom left). Then, when the accented syllable ends, the arm is still extended in monosyllables (top right), whereas, in iambs, it is already in the retraction phase (bottom right). These findings suggest that prosodic phrase boundaries strongly determine the timing of the gesture phases and can cause retracting effects (when there is an upcoming phrase boundary) as well as lagging effects (when there is a preceding phrase boundary).

Does the Timing of Intonation Peaks Correlate With the Timing of Gesture Apexes?

The goal of our study was to investigate which anchoring regions in speech and in gesture align with each other, hypothesizing that it is the intonation peak and the gesture apex that are temporally coordinated. A Pearson correlation analysis was carried out on two variables: the timing of the gesture apex with respect to the onset of the accented syllable and the timing of the F0 peak with respect to the onset of the accented syllable. Results showed that the timing of gesture apexes and F0 peaks with respect to the onset of the accented syllable was significantly correlated, r(703) = .528, p < .01. Figure 10 illustrates that the timing of gesture apexes and the timing of intonation peaks are correlated.
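The correlation analysis above amounts to a simple Pearson r between two timing measures per token. A minimal sketch with synthetic data follows; the numbers are invented for illustration and are not the study's measurements.

```python
import numpy as np

# Two timing measures per gesture-speech token, both measured from the onset
# of the accented syllable (synthetic, hypothetical values).
rng = np.random.default_rng(0)
f0_peak_ms = rng.normal(250, 60, size=200)
apex_ms = 0.6 * f0_peak_ms + rng.normal(0, 70, size=200)  # correlated + noise

# Pearson correlation coefficient between the two timing variables.
r = np.corrcoef(apex_ms, f0_peak_ms)[0, 1]
print(f"Pearson r = {r:.3f}")  # a moderate positive correlation
```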

Interesting results are obtained when comparing the correlation between gesture apexes and F0 peaks (revealed as

Figure 9. Box plots of the distance in time (in milliseconds) between the beginning of the stroke and the beginning of the accented syllable as a function of the three different accented syllable positions with respect to the boundary tone. Error bars indicate 95% CIs.



significant in our study) with other values that, as previous studies have suggested, play a crucial role in the temporal coordination of gesture and speech prominence: the end of the accented syllable (as an alternative to the F0 peak) and the stroke onset and stroke offset (as alternatives to the gesture apex).

Thus, three additional Pearson correlation analyses were carried out. In the first analysis, the two variables were the gesture apex and the end of the accented syllable (as opposed to the F0 peak), both measured with respect to the onset of the accented syllable. The analysis revealed that even though the timing of the gesture apex and the end of the accented syllable were significantly correlated, the low correlation coefficient indicates that the relation is weak, r(703) = .276, p = .01. In the second additional analysis, the two variables were the end of the stroke (as opposed to the gesture apex) and the F0 peak, both with respect to the onset of the accented syllable. Results show that these two variables are also significantly correlated, yet to a lesser extent than in the apex–intonation peak

Table 1. Still images of arm positions in three specific moments of the speech signal (the beginning of the accented syllable, the F0 peak, and the end of the accented syllable) in the monosyllabic condition and in the iambic condition.

Figure 10. Scatter plot showing the relation between the timing of the gesture apexes and the timing of the F0 peaks with respect to the onset of the accented syllable.



analysis, r(703) = .451, p = .01. In the third additional analysis, the two variables were the stroke onset (as opposed to the gesture apex) and the F0 peak, both with respect to the onset of the accented syllable. Results show that these two variables were significantly (but weakly) correlated, r(703) = .397, p = .01. Altogether, given the r correlation coefficient values, these correlation analyses reveal that the two measurements that are most strongly correlated are the ones we predicted: the timing of the gesture apex and the timing of the F0 peak.

Discussion

This study had two main aims: (a) to determine the anchoring region in speech that aligns with the prominent part of the gesture in a contrastive focus condition and (b) to test whether intonation and gestural movements behave in a similar way and whether both are bound by prosodic structure. Our hypotheses were (a) that the peak of the focused pitch accent in the intonational structure serves as an anchoring site for gestural prominence and (b) that upcoming low boundary tones affect the timing of both intonation and gesture movements. In order to address both research questions, we analyzed 703 instances of co-speech deictic pointing gestures produced by monolingual adult Catalan speakers. During the experiment, the speech consisted of the pronunciation of a target word followed by a boundary tone in a contrastive focus condition. When designing the experiment, it was crucial to ensure that participants produced target focused words with different metrical patterns in a contrastive focus condition, which is typically realized prosodically with a prominent L+H* pitch accent followed by a low boundary tone. This automatically triggered distinct positions of the intonation peaks, which were retracted in the case of stress-final words.

After analyzing our data, we obtained two main findings: first, that the timing of gesture apexes behaves very much like that of intonation peaks with respect to their position within the accented syllable (both were retracted when a prosodic phrase boundary followed the accented syllable); second, that the position of the peak of the pitch movement (the intonation peak) and the position of the peak of the gesture movement (the apex) were the two measurements with the strongest correlation. We now discuss these two findings in turn.

The first main finding was that the temporal behavior of intonation peaks and apexes is similar. On the one hand, the statistical analyses showed that the distance between intonation peaks and the end of the accented syllable varied significantly depending on the position of the pitch accent in the target word with respect to the upcoming phrase boundary: When the target word had the stress in non–phrase-final position, intonation peaks were produced in close alignment with the end of the accented syllable. However, when the target word had the stress in phrase-final position (i.e., in monosyllables and iambs), intonation peaks were produced well before the end of the accented syllable. On the other hand, the statistical analyses showed that the position of gesture apexes within the accented syllable also varied significantly depending on the position of the pitch accent with respect to the phrase boundary: The apex was closely aligned with the end of the accented syllable when the accented syllable was in non–phrase-final position, and it was produced well before the end of the accented syllable when the accented syllable was in phrase-final position. In conclusion, time-wise, both intonation peaks and gesture apexes occur at the end of the accented syllable when the stress is not in phrase-final position, whereas they are temporally retracted when the stress is in phrase-final position.

The temporal retraction of intonation peaks within the accented syllable before upcoming prosodic boundaries is a well-documented phenomenon in Catalan and other languages (D'Imperio, 2001; Prieto et al., 1995; Silverman & Pierrehumbert, 1990). It is caused by the fact that when H* accents occur in phrase-final nuclear position before low boundary tones, they are moved to the left by the adjacency of the boundary tone, thus retracting the intonation peak toward the beginning of the accented syllable. By contrast, in non–phrase-final nuclear position, the accented syllable does not have to accommodate the entire complex pitch movement: The rising pitch movement is realized during the accented syllable, and the falling pitch movement is realized during the unaccented syllable. Similarly, the same "temporal adaptation" phenomenon might occur with gestures. The temporal behavior of gesture apexes in our experiment shows that when the nuclear pitch accent is in phrase-final position, the gesture apex needs to retract for the arm to have time to return to its rest position within the limits of the target segmental material. In a non–phrase-final condition, however, because the arm's return to its rest position does not have to be completed by the end of the accented syllable (which is not in phrase-final position), the rest of the arm trajectory can be accommodated during the post-tonic unaccented syllable.

Interesting results were obtained when analyzing the gestural timing patterns when the focal pitch accent was in phrase-final position: namely, the gesture apex is anchored at a different position within the accented syllable in monosyllables as compared with iambs. The statistical analyses showed that gesture apexes are retracted in both monosyllables and iambs but that the amount of retraction is significantly different. In iambs, the gesture apex occurred a mean of 141.11 ms before the end of the accented syllable, whereas, in monosyllables, it occurred a mean of 89.12 ms before the end of the accented syllable (see Figure 8, where this difference becomes clear). In order to understand why gesture apexes were more retracted in iambs than in monosyllables, we analyzed in detail the timing of the arm-launching movement with respect to the timing of speech in both metrical structures. These analyses showed that the distance to the beginning of the prosodic phrase had a direct influence on the timing of the gesture phases: When the stress is in phrase-initial position (in monosyllables), the preparation phase of the pointing gesture has to start during the accented syllable because there are no preceding unstressed syllables to which one can anchor the

Esteve-Gibert & Prieto: Temporal Coordination of Intonation and Manual ‘‘g’’ 861

Page 13: Prosodic Structure Shapes the Temporal Realization of ... · of Intonation and Manual Gesture Movements Nu´ria Esteve-Giberta and Pilar Prietob Purpose: Previous work on the temporal

preparation phase. However, when the stress is in phrase-medial position (in iambs), the preparation phase can start before the accented syllable because there is a previous unaccented syllable that can accommodate part of the preparation phase. Further analyses showed that these results are due more to a different initiation time of the gesture movement with respect to the accented syllable than to velocity effects: The initiation of the gesture with respect to the onset of the accented syllable occurs earlier in iambs (where there is a preceding unstressed anchoring syllable) than in monosyllables (where the accented syllable is in phrase-initial position).

The monosyllabic condition turned out to be a very interesting case because it was the only condition in which the target syllable was adjacent to both a phrase-initial and a phrase-final boundary: An adjacent phrase-final boundary caused a retraction effect in the location of the gesture apex, but an immediately preceding phrase start made this retraction less evident than in the iambic case, causing what we called a lagging effect. Thus, the timing of the launching gesture had to adjust to both utterance-initial and utterance-final prosodic requirements because there was no other segmental material between the phrase boundaries in which to accommodate the gestural initiation and retraction phases. In contrast, when there was a pre-tonic syllable available (as in iambs) or a post-tonic syllable available (as in trochees), the lagging effect was not observable in the timing of the gesture peaks.

The second main finding was that the position of the peak of the pitch movement (the intonation peak) and the position of the peak of the gesture movement (the apex) were the two measurements in gesture and speech with the highest correlation, thus corroborating our initial hypothesis. Some previous studies have suggested that the prominent part of the gesture coordinates with the stressed syllable (Loehr, 2004, 2007; Rochet-Capellan et al., 2008), whereas others have found that only stressed syllables containing an intonation peak (De Ruiter, 1998; Nobe, 1996) coordinate with the prominence in gesture. In our study, we defined more precisely the assumption that the prominent part of the gesture (the stroke phase) coordinates with the speech prominence (the accented syllable): The location of the gesture apex with respect to the accented syllable is highly correlated with the location of the intonation peak within the accented syllable in a contrastive focus condition and when followed by a phrase boundary. Comparing our results on the correlation between intonation and gesture peaks with other measurements taken into account in the literature (namely, the accented syllable in the speech modality and the stroke onset or stroke offset in the gesture modality), our results crucially show that the time location of the apex is less correlated with the end of the accented syllable than with the F0 peak, and that F0 peaks are less correlated with stroke onset or stroke offset than with gesture apexes, as gesture apexes and F0 peaks are both bound by prosodic structure.

All in all, our results show that prosodic structure influences the timing of gesture movements in two main ways: First, the gesture apex always occurs during the accented syllable, regardless of the position of the accented syllable with respect to the phrase boundary; second, the position of the gesture apex is highly correlated with the position of the F0 peak, regardless of their position within the accented syllable triggered by an upcoming phrase boundary tone. These results are in line with Rochet-Capellan et al.'s (2008) study in the sense that the position of the accented syllable within the phrase affected the timing of the gesture phases. Similarly, our results confirm De Ruiter's (1998) second experiment, in which the stress location directly affected the timing of the gesture: In those findings, the apex of the gesture occurred during the accented syllable, regardless of the metrical structure of the target word.

In addition, our findings confirm that the gesture apex is an important measure to take into account when studying temporal coordination between deictic gestures and speech (in line with De Ruiter, 1998; McClave, 1994; Rochet-Capellan et al., 2008; Roustan & Dohen, 2010; and Rusiewicz, 2010). When trying to characterize gesture–speech prominence alignment, other gesture measurements taken into account in the literature, such as the gesture onset or the gesture stroke, do not seem to be as telling. In our results, it is precisely the gesture apex that is strongly correlated with the intonation peak, independently of the position of the accented syllable within the prosodic phrase.

The main contribution of this article is to show that both intonation and gesture movements are bound by prosodic phrasing, causing retracting effects when there is an upcoming phrase boundary (as in monosyllables and iambs) and lagging effects when there is no pre-tonic or post-tonic syllable after a preceding phrase boundary to contain part of the gesture prominence (in monosyllables). We also defined in a precise way the temporal coordination between gesture and speech in a narrow–contrastive focus condition, showing that the anchoring point for the gesture apex is the intonation peak. By analyzing deictic gestures accompanied by target words with different metrical structures followed by a phrase boundary and in a contrastive focus condition, we showed that gesture apexes align with intonation peaks and that intonation and gesture structures exhibit parallel temporal behavior: When F0 peaks are produced at the end of the accented syllable, gesture apexes are also produced at the end of the accented syllable (e.g., in the trochaic condition), and when F0 peaks are retracted within the accented syllable, gesture apexes are also retracted within the accented syllable (e.g., in the iambic condition). Gesture apexes and F0 peaks are closely aligned even in monosyllabic words, whose gesture apexes are influenced by a lagging effect (because of the presence of a preceding prosodic phrase boundary) and by a retracting effect (because of an upcoming prosodic phrase boundary). Hence, the results of our study suggest that gesture structure adapts to prosodic structure and that pointing gesture movements are bound by prosodic phrasing in the same way that pitch movements are.

Our study confirms the results of most of the previous studies investigating the temporal coordination between gesture and speech, namely, that the prominence in gesture and the prominence in speech are temporally coordinated. Previous findings in the literature have shown that the most meaningful part of the gesture (i.e., the stroke or the apex) was temporally synchronized with the focused word (Butterworth & Beattie, 1978; Roustan & Dohen, 2010), with the lexically stressed syllable (Loehr, 2004, 2007; Rochet-Capellan et al., 2008), or with syllables with intonation peaks (Nobe, 1996; De Ruiter, 1998, second experiment).

After analyzing pointing gestures accompanying target words with varying positions of the accented syllable with respect to the phrase boundary, our results show not only that the most prominent part of the gesture (the apex) is coordinated with the accented syllable but also that the anchoring region in speech that synchronizes with the gestural prominence is the intonation peak. This study shows that, in a contrastive focus condition, the specific point in the speech signal that is correlated with the apex of deictic gestures is the intonation peak rather than the interval of time where the accented syllable is produced. Also, our findings suggest that both intonation and gesture movements are bound by prosodic phrasing, because the timing of their starting movements and prominence peaks (F0 peak and apex) varies if there is a preceding or an upcoming prosodic phrase boundary.

Further research is needed to form a complete description of how gesture and speech are temporally synchronized. First, our hypothesis should also be tested in a natural environment rather than in an experimental setting. The challenge is to obtain instances of spontaneous co-speech deictic gestures in different types of focus conditions. Second, these precise analyses could also be extended to other types of gestures in order to investigate the specific anchoring regions in prosodic and gesture structures that are temporally coordinated. Third, it will be interesting to test the alignment of gesture apexes with F0 peaks (and F0 falls) further, with experiments in which the F0 peak does not occur within the accented syllable or with languages that show distinct alignment patterns of rising and falling pitch accents within the accented syllable. Finally, future experimental studies will have to analyze (a) the role of temporal coordination patterns between gesture and prosody in successful language communication and (b) the role of temporal coordination patterns in language comprehension. Our hypothesis is that a lack of precise coordination between gesture and speech affects the perception, comprehension, and integration of the gesture–speech combination in the brain. If this is, indeed, the case, then we can be confident that our results are important for a full understanding of human communication.

Acknowledgments

This research was funded by two research grants awarded by the Spanish Ministry of Science and Innovation (Grant FFI2009-07648/FILO, "The Role of Tonal Scaling and Tonal Alignment in Distinguishing Intonational Categories in Catalan and Spanish," and Grant FFI2012-31995, "Gestures, Prosody and Linguistic Structure"); by a grant from the Generalitat de Catalunya (2009SGR-701), awarded to the Grup d'Estudis de Prosòdia; by Grant RECERCAIXA 2012 (for the project "Els Precursors del Llenguatge. Una Guia TIC per a Pares i Educadors," awarded by Obra Social "La Caixa"); and by the Consolider-Ingenio 2010 grant (CSD2007-00012). A preliminary version of this article was presented at the following conferences: Architectures and Mechanisms for Language Processing (Paris, France, September 1–3, 2011), Gesture and Speech in Interaction (Bielefeld, Germany, September 5–7, 2011), V International Conference for Gesture Studies (Lund, Sweden, July 24–27, 2012), and Laboratory Phonology (Stuttgart, Germany, July 27–29, 2012). We would like to thank participants at those meetings, especially Mark Swerts, Heather Rusiewicz, Stefan Kopp, Zofia Malisz, Stefanie Shattuck-Hufnagel, Adam Kendon, John J. Ohala, and Robert Port, for their helpful comments. We thank Aurora Bel, Louise McNally, and José Ignacio Hualde for being part of the PhD project defense and, in particular, for their useful advice regarding the methodology used in this study. We also thank Joan Borràs for his help with running the experiment and carrying out the statistical analysis, and Paolo Roseano for his comments on Praat annotations.

References

Birdwhistell, R. L. (1952). Introduction to kinesics: An annotated system for analysis of body motion and gesture. Washington, DC: Department of State, Foreign Service Institute.

Birdwhistell, R. L. (1970). Kinesics and context: Essays on body motion communication. Philadelphia, PA: University of Pennsylvania Press.

Boersma, P., & Weenink, D. (2012). Praat: Doing phonetics by computer (Version 5.3.04) [Computer program]. Retrieved from www.praat.org

Butterworth, B., & Beattie, G. (1978). Gesture and silence as indicators of planning in speech. In R. Campbell & G. T. Smith (Eds.), Recent advances in the psychology of language: Formal and experimental approaches (pp. 347–360). New York, NY: Plenum Press.

De Ruiter, J. P. (1998). Gesture and speech production (Unpublished doctoral dissertation). Katholieke Universiteit, Nijmegen, the Netherlands.

D'Imperio, M. (2001). Focus and tonal structure in Neapolitan Italian. Speech Communication, 33, 339–356.

Enfield, N. J., Kita, S., & De Ruiter, J. P. (2007). Primary and secondary pragmatic functions of pointing gestures. Journal of Pragmatics, 39, 1722–1741.

Kendon, A. (1972). Some relations between body motion and speech: An analysis of an example. In W. Siegman & B. Pope (Eds.), Studies in dyadic communication (pp. 177–210). New York, NY: Pergamon Press.

Kendon, A. (1980). Gesticulation and speech: Two aspects of the process of utterance. In M. R. Key (Ed.), The relationship of verbal and nonverbal communication (pp. 207–227). The Hague, the Netherlands: Mouton.

Knight, R. A., & Nolan, F. (2006). The effect of pitch span on intonational plateaux. Journal of the International Phonetic Association, 36(1), 21–38.

Lausberg, H., & Sloetjes, H. (2009). Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods, Instruments, & Computers, 41, 841–849.

Loehr, D. P. (2004). Gesture and intonation (Unpublished doctoral dissertation). Georgetown University, Washington, DC.

Loehr, D. P. (2007). Aspects of rhythm in gesture and speech. Gesture, 7, 179–214.

McClave, E. (1994). Gestural beats: The rhythm hypothesis. Journal of Psycholinguistic Research, 23, 45–66.



McNeill, D. (1992). Hand and mind. Chicago, IL: University of Chicago Press.

Nobe, S. (1996). Representational gestures, cognitive rhythms, and acoustic aspects of speech: A network/threshold model of gesture production (Unpublished doctoral dissertation). University of Chicago, Chicago, IL.

Prieto, P. (1995). Aproximació als contorns entonatius del català central [Description of the intonation contours of Central Catalan]. Caplletra, 19, 161–186.

Prieto, P., van Santen, J., & Hirschberg, J. (1995). Tonal alignment patterns in Spanish. Journal of Phonetics, 23, 429–451.

Prieto, P., & Ortega-Llebaria, M. (2009). Do complex pitch gestures induce syllable lengthening in Catalan and Spanish? In M. Vigário, S. Frota, & M. J. Freitas (Eds.), Phonetics and phonology: Interactions and interrelations (pp. 51–70). Philadelphia, PA: John Benjamins.

Rochet-Capellan, A., Laboissière, R., Galvan, A., & Schwartz, J. L. (2008). The speech focus position effect on jaw–finger coordination in a pointing task. Journal of Speech, Language, and Hearing Research, 51, 1507–1521.

Roustan, B., & Dohen, M. (2010, May). Co-production of contrastive prosodic focus and manual gestures: Temporal coordination and effects on the acoustic and articulatory correlates of focus. Proceedings of Speech Prosody 2010 Conference, 100110, 1–4. Retrieved from www.speechprosody2010.illinois.edu/papers/100110.pdf

Rusiewicz, H. L. (2010). The role of prosodic stress and speech perturbation on the temporal synchronization of speech and deictic gestures (Unpublished doctoral dissertation). University of Pittsburgh.

Silverman, K., & Pierrehumbert, J. (1990). The timing of prenuclear high accents in English. In J. Kingston & M. Beckman (Eds.), Papers in laboratory phonology I (pp. 72–106). Cambridge, United Kingdom: Cambridge University Press.

Vanrell, M. M. (2011). The phonological relevance of tonal scaling in the intonational grammar of Catalan (Unpublished doctoral dissertation). Universitat Autònoma de Barcelona, Bellaterra, Spain.

Journal of Speech, Language, and Hearing Research, Vol. 56, 850–864, June 2013

