
INTERSPEECH 2017 Situated interaction August 20-24, 2017 Stockholm, Sweden

BOOK OF ABSTRACTS

[Campus map: Södra Huset. Legend: Registration, Oral sessions, Poster areas (Poster 1–4), Coffee stations, Speaker check-in, Show & tell, Work stations, Exhibitors, Restrooms, Elevators. Rooms shown: A2, B3–B5, B307, C6, D7–D9, E10, E306, E397, F11.]

INTERSPEECH 2017
SITUATED INTERACTION

20–24 August 2017
Stockholm, Sweden

Copyright 2017 International Speech Communication Association (ISCA)
http://[email protected]
All rights reserved
Editors: Francisco Lacerda, David House, Mattias Heldner, Joakim Gustafson, Sofia Strömbergsson, Marcin Włodarczak

Abstracts and Proceedings USB production by
Causal Productions Pty Ltd
PO Box 100, Rundle Mall
SA 5000, Australia
http://[email protected]

Cover photo: Yanan Li/mediabank.visitstockholm.com

Organizers

Platinum Sponsors

Diamond Sponsors

Gold Sponsors


Silver Sponsors

Bronze Sponsors

ASM Solutions Ltd.
Mitsubishi Electric Research Labs
Yandex

Supporters

Beijing Magic Data Technology Co., Ltd.
Disney Research
EML European Media Laboratory GmbH
University of Washington

Exhibitors

Alibaba Group
Amazon Alexa
Appen
Apple
Beijing Huiting Technology Co., Ltd.
Beijing Magic Data Technology Co., Ltd.
Carstens Medizinelektronik GmbH
Cobalt Speech and Language
CVTE
Datatang
DiDi
Furhat Robotics
Globalme Language & Technology
Google
iMerit
Intel
INTERSPEECH 2018
INTERSPEECH 2019
ISCA
iSoftStone Inc.
Linguistic Data Consortium
Microsoft Corporation
NDI Europe GmbH
ReadSpeaker
Speechocean Ltd
Swedish Post and Telecom Authority (PTS)
The Institute for Language and Folklore (ISOF)
The Swedish Agency for Accessible Media (MTM)
Tobii Pro
Yahoo! JAPAN
Wikimedia Sverige (WMSE)

Institutional Supporters


Organizers

International Speech Communication Association (ISCA)

ISCA is a non-profit organization. Its original statutes were deposited on February 23 at the Prefecture of Grenoble, in France by René CARRÉ and registered on March 27, 1988.

The association started as ESCA (European Speech Communication Association) and, since its foundation, has been steadily expanding and consolidating its activities; it has offered an increasing range of services and benefits to its members and it has put its financial and administrative functions on a firm professional footing. Indeed, over the ten years of its existence, ESCA has evolved from a small EEC-supported European organization to a fully-independent and self-supporting international association.

At the General Assembly that took place during the last Eurospeech conference in Budapest (September 1999), ESCA became a truly international association in the global field of speech science and technology, changing its name to ISCA (International Speech Communication Association) and modifying its statutes accordingly.

The purpose of the association is to promote, in an international world-wide context, activities and exchanges in all fields related to speech communication science and technology. The association is aimed at all persons and institutions interested in fundamental research and technological development that aims at describing, explaining and reproducing the various aspects of human communication by speech, that is, without assuming this enumeration to be exhaustive, phonetics, linguistics, computer speech recognition and synthesis, speech compression, speaker recognition, aids to medical diagnosis of voice pathologies.

The objectives of ISCA are to stimulate scientific research and education, to organize conferences, courses and workshops, to publish and to promote publication of scientific works, to promote the exchange of scientific views in the field of speech communication, to encourage the study of different languages, to collaborate with all related associations, to investigate industrial applications of research results, and, more generally, to promote relations between Public and Private, and between Science and Technology.


Stockholm University

Stockholm University, located in Sweden’s capital city, is the region’s centre for higher education and research in humanities, law, the natural and social sciences, and a focus for the work of leading international researchers.

With 30,000 undergraduate and master’s students, 1,700 doctoral students and 5,000 employees, Stockholm University is one of the largest universities in Sweden and one of the largest employers in the capital. People of many different nationalities, with contacts throughout the world, contribute to the creation of a highly international atmosphere at Stockholm University.

The University is situated in the middle of the first National City Park in the world and is characterized by natural beauty, unique architecture and contemporary art and sculpture. The University is located only ten minutes from the urban buzz of the city with people, music, fashion, architecture, and culture.

Research at Stockholm University within the humanities, law, and the social and natural sciences is outstanding in many areas, contributing both to our understanding of the world around us and to its continued improvement. Our research maintains a high standard within a wide range of subjects, enabling our researchers to develop increasingly cross-disciplinary cooperation. Many of the University’s research groups find themselves at the cutting edge of their field of inquiry.

Research at the Department of Linguistics reflects the dynamic and multi-faceted character of modern language studies. The different research areas (phonetics, child language, computational linguistics, general linguistics, Swedish sign language, Swedish as a second language for the deaf, as well as typology and language documentation) investigate not only their individual domains but engage in active collaboration with one another. Research at the department also contributes to multidisciplinary dialogue within the humanities as well as with other social and natural sciences, both in Sweden and abroad. In particular, the department is part of the Stockholm University Brain Imaging Centre (SUBIC) – an advanced multidisciplinary infrastructure dedicated to non-clinical academic research on brain function and structure.


KTH Royal Institute of Technology

In the nearly two centuries that have passed since its founding, KTH Royal Institute of Technology has grown into its role as a leading technical university in Europe. It is the largest technical research and learning institution in Sweden, with around 12,000 full-time students, 2,000 PhD students and 3,700 full-time employees. KTH is situated at four campuses, with the main campus located in the northern part of Stockholm city since 1917.

In this environment, speech communication pioneer Gunnar Fant founded KTH Speech, Music and Hearing in 1951. Today, the department has become a truly multi-disciplinary research institution with a blend of fields as varied as speech technology, computer science, robotics, machine learning, linguistics, phonetics and cognitive science, firmly rooted in an engineering modelling approach. Its research forms the base for multimodal human-computer interaction systems in which speech, music, sound and gestures combine to create human-like communication in which social interaction and attitudes are allowed to play a significant part. Applications such as conversational spoken dialogue systems and social robots, as well as software for breathing expressive content into automatized music performances, make up prominent examples of its cutting-edge human-like communication implementations.

In general, KTH Speech, Music and Hearing consistently looks to enrich core speech technologies with conversational and human-like characteristics. As part of this goal, we strive for a deeper understanding of human communicative behaviours, and analysis and modelling of the human voice apparatus also brings us into contact with a wide range of disciplines, including acoustics, continuum mechanics, biomechanics, psychoacoustics, phoniatrics, speech-language pathology, and the related signal processing. Similarly, analysing what humans do when they talk to each other places high and particular demands on the availability both of analysis methods and of big data. KTH addresses this through a number of efforts and collaborations, both on the national and international level, that aim to create a Swedish speech technology infrastructure that allows us to maintain a leading position in speech research.


Karolinska Institutet

Karolinska Institutet (KI) is a modern medical university and one of the foremost in the world. KI’s mission is to make a significant contribution to the improvement of human health by conducting research and education and by interacting with the community.

With 6,000 full-time students taking educational and single-subject courses at Bachelor’s and Master’s levels, 2,200 doctoral students and 5,300 full-time employees, KI is Sweden’s single largest centre of medical academic research and offers the country’s widest range of medical courses and programmes. Most of the programmes lead to a professional exam. Several of the programmes also lead to a general degree.

Karolinska Institutet is situated at Campus Solna and Campus Flemingsberg, with Solna to the northwest and Flemingsberg to the south. A 10–15 minute walk or train ride is all it takes to get to Stockholm city centre from either campus.

Research at Karolinska Institutet spans the entire medical field, from basic experimental research to patient-oriented research. The Karolinska innovation system comprises a number of actors that provide complementary competences and services so as to optimally utilize ideas arising from research.

Activities at the Department of Clinical Science, Intervention and Technology reside at the interface with academic health care. Research at the Division of Speech and Language Pathology focuses on issues that increase our understanding of the causes and consequences of communication and swallowing problems. The research also strives to address the complex issues of characterising speech, language, and swallowing disorders, and to discover efficient and effective interventions and preventive measures.


Contemporary interdisciplinary research on phonetics employs a wide range of approaches, from instrumental measures to perceptual and neurocognitive procedures, to computational modelling, for investigating the properties and principles of phonetics in communicative settings across the world’s languages. It also ranges across styles, types of language users, and communicative modalities (speech, sign, song). Phonetica is an international forum for phonetic science that covers all aspects of the subject matter, from phonetic and phonological descriptions, to articulatory and signal-analytic measures of production, to perception, acquisition, and phonetic variation and change. Phonetica thus provides a platform for a comprehensive understanding of producer-perceiver interaction across languages and dialects, and of learning throughout the lifespan and across contexts. Papers published in this journal report expert original work that deals with theoretical issues, new empirical data, and innovative methods and applications that help to advance the field.

Phonetica
Founded: 1957
Category: Basic Research
Fields of Interest: Phonetics, Communication Disorders

Listed in bibliographic services, including: PubMed/MEDLINE, Web of Science, Google Scholar, Scopus

2017: Volume 74, 4 issues per volume
Language: English
ISSN 0031–8388
e-ISSN 1423–0321

More information at www.karger.com/pho

An interdisciplinary forum for phonetic science research and theory

Impact Factor: 0.458

Editor: C.T. Best, Bankstown, N.S.W. (Australia)

Associate Editors: S. Frota, Lisbon (Portugal); W. Gu, Nanjing (China); S. Hawkins, Cambridge (United Kingdom); R. Hayes-Harb, Salt Lake City, Utah (USA); A. Jongman, Lawrence, Kans. (USA); G. Khattab, Newcastle upon Tyne (United Kingdom); A. Kochetov, Toronto, Ont. (Canada); I. Mennen, Graz (Austria); M. Pouplier, Munich (Germany); P.C.M. Wong, Hong Kong (China)

Book Reviews: O. Niebuhr, Sonderborg (Denmark)

Editorial Board: P.A. Barbosa, Campinas (Brazil); W.J. Barry, Sindlesham-Wokingham (United Kingdom); A. Beckford Wassink, Seattle, Wash. (USA); P.S. Beddor, Ann Arbor, Mich. (USA); A. Bradlow, Evanston, Ill. (USA); K. Dziubalska-Kołaczyk, Poznan (Poland); J. Fletcher, Parkville, Vic. (Australia); L. Goldstein, Santa Monica, Calif. (USA); J. Hajek, Melbourne, Vic. (Australia); D. House, Stockholm (Sweden); S. Kawahara, Tokyo (Japan); L. Ménard, Montreal, Que. (Canada); C. Mooshammer, Berlin (Germany); F. Nolan, Cambridge (United Kingdom); R. Ogden, York (United Kingdom); L. Polka, Montreal, Que. (Canada); D. Recasens, Barcelona (Spain); J.C. Roux, Potchefstroom (South Africa); R. Smith, Glasgow (United Kingdom); M. Swerts, Tilburg (Netherlands); R. Walker, Los Angeles, Calif. (USA); K. Watson, Christchurch (New Zealand); D.H. Whalen, New York, N.Y. (USA)

Sounds and Prosodies in Speech Communication

Selected contributions
• Phonologically Constrained Variability in L1 and L2 Production and Perception: Vigário, M.; Butler, J.; Cruz, M. (Lisbon)

• Second Language Experience Can Hinder the Discrimination of Nonnative Phonological Contrasts: Holliday, J.J. (Seoul)

• Articulatory and Acoustic Characteristics of German Fricative Clusters: Pouplier, M.; Hoole, P. (Munich)

• Prosodic Typology II: The Phonology of Intonation and Phrasing: Botinis, A. (Athens)

• Telephone Transmission and Earwitnesses: Performance on Voice Parades Controlled for Voice Similarity: McDougall, K.; Nolan, F.; Hudson, T. (Cambridge)

• The Phonetic Realization of Devoiced Vowels in the Southern Ute Language: Oberly, S. (Tucson, Ariz.); Kharlamov, V. (Boca Raton, Fla.)

• Gestural Control in the English Past-Tense Suffix: An Articulatory Study Using Real-Time MRI: Lammert, A.; Goldstein, L.; Ramanarayanan, V.; Narayanan, S. (Los Angeles, Calif.)

• The Phonetics of Head and Body Movement in the Realization of American Sign Language Signs: Tyrone, M.E. (New Haven, Conn.); Mauk, C.E. (Pittsburgh, Pa.)


TO LEARN MORE, VISIT amazon.jobs/interspeech2017. CV AND RESUME SUBMISSIONS TO [email protected].

“Alexa, can I work for you?”
The Amazon Alexa team focuses on bringing user-delighting, voice-activated experiences to Amazon customers. The team began with the development of Amazon Echo, designed entirely around your voice.

But the team of speech scientists, voice designers, developers, and more didn’t stop there. The Echo family of devices has grown to include Echo Dot, Amazon Tap, and more.

The Alexa Voice Service is bringing Alexa to other connected products, and the Alexa Skills Kit (ASK) allows third-party developers to easily build and add their own skills to Alexa.

Join us to discover where Alexa will go next —and what she’ll say.

Table of Contents

Welcome to INTERSPEECH 2017 | 13

Welcome to Stockholm | 19

INTERSPEECH 2017 General Information | 21

Social Program | 25

INTERSPEECH 2017 Organizing Team | 26

Technical Program Committee | 28

Scientific Review Committee | 30

Future INTERSPEECH Conferences | 38

Satellite Workshops and Events | 40

Tutorials | 44

ISCA Medalist and Keynote Speeches | 48

Special Sessions | 53

Special Events | 59

ISCA-SAC Special Events | 62

Awards | 63

Daily Schedule | 67

Session Index | 72

Abstracts | 76

Author Index | 244

Welcome to INTERSPEECH 2017

President’s Welcome Message

Haizhou Li

Welcome to Stockholm – where INTERSPEECH makes a return to the Nordic countries!

I am indeed honoured to pen the first words as a welcome message. In the past year, research in speech communication science and technology continued to thrive in the ISCA community and all over the world. Research advancements have accelerated innovation in artificial intelligence, which has now become a hot topic in classrooms, boardrooms, and newsrooms. On this occasion, I share with you the excitement of our community as we gather again at our annual conference.

INTERSPEECH 2017 is special. We embrace the theme of “Situated interaction”, not only because speech is a situated interaction, but also because speech science and technology are well situated to interact with academia and industry at the conference.

During the conference, the community will honour Professor Fumitada Itakura as the recipient of the 2017 ISCA Medal for Scientific Achievement, for contributions in developing fundamental statistical algorithms in speech coding and recognition with broad and unparalleled impacts. We will also celebrate our members’ achievements by recognizing five ISCA Fellows 2017. They are Alan Black, Jean-Luc Gauvain, Bhuvana Ramabhadran, Giuseppe Riccardi, and Michael Riley. Please join me in giving them the warmest congratulations!

INTERSPEECH 2017 marks the 18th Annual Conference of ISCA, which continues the success of recent events. I feel privileged to be part of the conference preparation as the ISCA President in the second year of my term and as one of the 54 area chairs. We received a record number of 1711 paper submissions, of which 1582 went into the review process and 799 were accepted. The technical program committee, led by the Technical Program Chairs, Mattias Heldner, Joakim Gustafson, and Sofia Strömbergsson, deserves our gratitude for putting in an immense amount of work to prepare a quality technical program that covers the latest advances in speech science and technology. To ensure high quality in the paper review process, every paper submission has been reviewed by at least 3 reviewers. A total of 1395 reviewers contributed to the review process this year. A big thanks to all of you!

ISCA is grateful to have Stockholm University, KTH Royal Institute of Technology, and Karolinska Institutet, three prestigious institutions in Stockholm, as the co-organizers of the conference. Organizing an INTERSPEECH event takes enormous courage, endurance and dedication, and I would like to express my gratitude and appreciation to the General Chair, Francisco Lacerda, and General Co-Chair, David House, who led the team to bring INTERSPEECH to Stockholm for the first time. Finally, I do hope that you have an enjoyable and productive time in Stockholm, and that you will leave with fond memories of INTERSPEECH 2017. With my best wishes for a successful conference!

Haizhou Li
ISCA President


Welcome Message of the INTERSPEECH 2017 General Chair and Co-Chair

Francisco Lacerda David House

On behalf of the Organizing Committee, we would like to welcome you to INTERSPEECH 2017 in Stockholm, Sweden. INTERSPEECH is the world’s largest and most comprehensive conference on spoken language processing, emphasizing an interdisciplinary approach covering all aspects of speech science and technology, from basic theories to clinical and technological applications. This year’s INTERSPEECH conference is the 18th annual conference of the International Speech Communication Association (ISCA). We are extremely pleased to be able to welcome you to Sweden, with its long academic history of speech and language research and favourable conditions for combining technology, innovation and entrepreneurship as key aspects of Swedish business. Sweden was at the forefront of the IT boom, and continues to be a world leader as speech takes on an increasingly central role in technological development.

The conference theme this year is “Situated interaction”. The overall goal is to provide a broad approach to speech communication issues, integrating speech technology and pragmatic aspects of human conversational speech communication behaviour in different interaction contexts. All branches of speech communication science are included – be it human-machine interaction or human-human interaction in groups of different sizes, face-to-face or remote, exploring the affordances of speech communication technology.

INTERSPEECH 2017 will be held at the Stockholm University campus, which is attractively situated on the northern part of the Royal National City Park and served by Stockholm’s good metropolitan public transportation network. We are looking forward to a great scientifically inspiring event enhanced by the natural and harmonious environment of the Stockholm University campus, and we hope that INTERSPEECH 2017 will be a highly positive scientific, social and aesthetic experience for all of you.

In addition to the conference with its plenary talks, oral and poster presentations, tutorials, special sessions, show and tell sessions, exhibits and social events, we hope you will have the opportunity to enjoy the natural and architectural beauty of Stockholm and also some of its cultural attractions. The city’s reputation as one of the most beautiful capitals in the world is intensified during the summer months when daylight is abundant and greenery flourishes. Stockholm also offers a large number of galleries and museums and is well known for its modern, innovative cuisine, design and music.

INTERSPEECH 2017 is jointly organized by the Department of Linguistics, Stockholm University, the Department of Speech, Music and Hearing, KTH Royal Institute of Technology, and the Division of Speech and Language Pathology, Karolinska Institutet. These three departments represent a long and rich tradition of speech, language and voice research stretching back to 1951, when Gunnar Fant founded the Department of Speech, Music and Hearing at KTH. At Karolinska Institutet, the first Swedish academic degree program in Speech and Language Pathology was founded on the initiative of Gunnar Bjuggren in 1964, and the Department of Linguistics at the University of Stockholm was formed in 1965 with Björn Lindblom as head of the Phonetics Laboratory. We are very privileged indeed to have Björn Lindblom as one of our plenary speakers, along with Fumitada Itakura (ISCA Medalist), James Allen, and Catherine Pelachaud.


Organizing INTERSPEECH is the result of the dedication of a large number of individuals who gave freely of their time and efforts. We would like to thank the organizing committee, the Swedish speech community, ISCA, all of our generous sponsors and exhibitors, and Akademikonferens – our Professional Conference Organizers – for all the valuable help and advice during the preparation and realization of the conference.

Finally, we would like to thank all the authors who submitted a paper to INTERSPEECH 2017, the reviewers, the area chairs, the session chairs and our volunteers for all their work in preparing and running the conference.

We wish you a very warm welcome to Stockholm and hope that you will have a productive and enjoyable time at INTERSPEECH 2017.

Francisco Lacerda
Stockholm University
General Chair

David House
KTH Royal Institute of Technology
General Co-Chair


Message from the Technical Program Chairs

Mattias Heldner Joakim Gustafson Sofia Strömbergsson

Some time after we signed up to organize the Technical Program of INTERSPEECH 2017, a former Technical Program Chair encouraged us by saying “by Easter 2017 you will be in a state of panic, but you will probably have recovered sometime in 2018”. At the time of writing this, “Easter” is a lot closer than “sometime in 2018”. We are shaking, we are nervous, but we are also honoured to have been given the opportunity to organize the Technical Program, and we are proud of the result of our work. We hope that you will enjoy the conference and wish you all welcome to Stockholm and to INTERSPEECH 2017!

Although we are managing the Technical Program for the first time, we are working with a group of reliable colleagues who have really shown us the ropes. Especially the former Technical Program Chairs Panayiotis Georgiou (INTERSPEECH 2016), Bernd Möbius and Elmar Nöth (INTERSPEECH 2015), Lori Lamel (INTERSPEECH 2013 and ISCA Technical Committee), the developer of the START conference management software, Rich Gerber, and George Vokalek at Causal Productions (producer of the conference proceedings) have been exceptionally helpful in guiding us in this work. It has also been extremely valuable for us to have served as Area Chairs and to participate in the TPC meetings for INTERSPEECH 2014, 2015 and 2016 to see how the accept/reject decisions and the final program were made.

The Technical Program Committee consists of the three Technical Program Chairs and 54 Area Chairs, including the ISCA President (Haizhou Li) and other ISCA Board representatives (John Hansen, Douglas O’Shaughnessy, Mark Hasegawa-Johnson, Martin Cooke, Keikichi Hirose, Kate Knill, Lori Lamel, Torbjørn Svendsen, Sebastian Möller), the Past ISCA President Tanja Schultz, three of the INTERSPEECH 2018 Technical Program Chairs (Hema A. Murthy, Preeti Rao, Paavo Alku), Technical Program Chairs from INTERSPEECH 2016, 2015, 2014 and 2013 (Panayiotis Georgiou, Shrikanth Narayanan, Bernd Möbius, Elmar Nöth, Helen Meng, Lori Lamel), a Special Session Chair (Jens Edlund), and distinguished colleagues from near and far (Olov Engwall, Odette Scharenborg, Petra Wagner, Julia Hirschberg, Mary Beckman, Khiet Truong, Björn Schuller, Nigel Ward, Jean-Francois Bonastre, Kornel Laskowski, Michael Wagner, Jonas Beskow, Martti Vainio, Tom Bäckström, Mirjam Wester, Ingmar Steiner, Alan Black, Giampiero Salvi, Børge Lindberg, Jean-Luc Gauvain, Geoff Zweig, Michiel Bacchiani, Roger Moore, Florian Metze, Nikko Ström, Gabriel Skantze, Kristiina Jokinen, Agustin Gravano, Amanda Stent, Rolf Carlson, Arne Jönsson, Isabel Trancoso, Roland Kuhn). We are most grateful to our Area Chairs; their contribution in the process of creating a technical program cannot be overstated.

The Technical Program work also required a huge Scientific Review Committee. Our ambition was to have each paper reviewed by four reviewers. With 1711 initial submissions and 1582 papers remaining in the review process after withdrawals and removal of duplicates, we needed roughly 6330 reviews. The Technical Program Committee worked hard to increase the reviewer pool from past conferences and, in the end, a record number of 1395 reviewers contributed to the review process. A list of all the reviewers who completed review assignments this year can be found elsewhere in this book of abstracts. Thank you all reviewers! Your contributions are key to the success of INTERSPEECH conferences. We can neither confirm nor deny that we also keep a list of those that did not complete their review assignments.


When you have these diligent committees, all you have to add to make a technical program for a scientific conference is high-quality submissions. We received lots of them! 1582 went into the review process and 799 were accepted, so the acceptance rate was about 50%. We were happy to see that more than 50% of the accepted papers had students as contact authors. We thank all authors for their contributions. Your hard work and efforts built the body and soul of the conference!

In addition, the program contains 9 Tutorials (organized by Gabriel Skantze and Björn Granström), 4 special events (organized by Hatice Zora), 8 Show & Tell Sessions (organized by Jonas Beskow), and 13 Satellite Workshops orbiting the conference (organized by Anders Eriksson and Olov Engwall).

We are sure that you will find the technical program inspiring and equally sure that you will love end-of-summer Stockholm!

Mattias Heldner
Stockholm University

Joakim Gustafson
KTH Royal Institute of Technology

Sofia Strömbergsson
Karolinska Institutet


Talkin’ about your generation!
Let Cirrus Logic’s culture of innovation put your career in the spotlight!

#CirrusRocks

Now hiring the best and brightest in research and development in speech processing and speaker identification.

Work with our software and research teams in Austin, Salt Lake City, London, and Madrid to develop embedded solutions for the mobile and consumer markets.

Learn more at cirrus.com/careers

© 2017 Cirrus Logic, Inc. All rights reserved. Cirrus Logic, Cirrus, and the Cirrus Logic logo designs are trademarks of Cirrus Logic, Inc.

Welcome to Stockholm

About Stockholm

Stockholm, one of the most beautiful capitals in the world, is built on 14 islands connected by 57 bridges. The beautiful buildings, the greenery, the fresh air and the proximity to the water are distinctive traits of this city. The Royal National City Park (the first National City Park in the world) is a green space that breathes for the city, and a constant presence in the crush of the city.

With its 750-year history and rich cultural life, Stockholm offers a wide selection of world-class museums and attractions. Most of the city’s attractions can be reached on foot, and there’s a good chance of experiencing a lot of things in a short time. Experience big-city life, the history of civilization and natural scenery, all in the course of the same day.

Visit Stockholm City Hall. Climb the City Hall tower for a fantastic view of Stockholm. Don’t miss Gamla Stan, Stockholm’s oldest attraction and one of the best preserved medieval city centers in the world. Walk through small winding streets lined with stores full of handicrafts, antiques, art galleries and cafés. The Royal Palace and Stockholm Cathedral are also located in Gamla Stan.

The green island of Djurgården is home to some of the city’s most popular attractions. Visit the world-famous warship the Vasa, the world’s oldest open-air museum Skansen, or Astrid Lindgren’s Junibacken. And don’t miss the chance to see Stockholm from the water. Naturally, a city built on fourteen islands offers marvelous views over the water. There are many different sightseeing tours to choose from. And if fourteen islands aren’t enough, Stockholm offers a wonderful archipelago with 30,000 islands, islet rocks and skerries.

Stockholm is also where you find the most multinational companies, the largest stock market and, not least, the most visitors. People come to Stockholm for the food, the design and the music. Stockholm also offers a unique range of galleries and museums, and every year the eyes of the world are on Stockholm when the Nobel Prizes are awarded.

About Sweden

Sweden is one of the largest countries in Europe, with great diversity in its nature and climate. Its distinctive yellow and blue flag is one of the national emblems that reflect centuries of history between Sweden and its Nordic neighbours.

Sweden is a sparsely populated country, characterised by its long coastline, extensive forests and numerous lakes. It is one of the world’s northernmost countries. In terms of surface area it is comparable to Spain, Thailand or the US state of California.

Sweden experiences extreme contrasts between its long summer days and equally long winter nights. In the summer, the sun stays in the sky around the clock in the parts of Sweden north of the Arctic Circle, but even as far south as Stockholm (59°N) the June nights have only a few hours of semi-darkness.

With its variety of landscapes, Sweden has everything from bears and wolves in the north to roe deer and wild boar in the south. The country also has a wealth of flora and aquatic life, which contribute to its biological diversity.


INTERSPEECH 2017 General Information

Conference venue

Stockholm University
Universitetsvägen 10
114 18 Stockholm

Plenary sessions: Aula Magna
Oral sessions: Aula Magna and Södra Huset
Poster sessions: Södra Huset
Tutorials: Södra Huset

Registration and Information Desks

Sunday 20 August: 08:00–17:00, Södra Huset, House A
Monday 21 August: 08:00–10:00, Aula Magna; 08:00–17:00, Södra Huset, House A
Tuesday 22 August: 07:45–17:00, Södra Huset, House A
Wednesday 23 August: 07:45–17:00, Södra Huset, House A
Thursday 24 August: 07:45–17:00, Södra Huset, House A

Speaker Check-In

Speakers are required to use the computers provided by the conference for their oral presentations. Personal laptops may not be used. PowerPoint or PDF are the only accepted presentation formats. Please make sure that your presentation is made using an up-to-date version of PPT (filename.pptx) or PDF. It is recommended that multimedia sound or video files are embedded in the presentation file. If, for some reason, this is not possible, they must be provided at the speaker check-in together with the presentation file and clear instructions about which presentation file they belong to. The aspect ratio of the projectors will be 4:3. Presentations must be submitted on a USB pen drive at the Speaker Check-In Desk (room B307). You must submit your presentation at least two hours before the beginning of the session in which it will be used.

Speaker information

Oral presenters will have 15 minutes to present, followed by 4 minutes for questions and 1 minute for speaker change. The timing will be strict, and session chairs will be requested to stop speakers exceeding the 15-minute presentation slot. Presenters should introduce themselves to the session chairs during the break before the start of their oral session.

Please arrive at your presentation room 10 minutes prior to the session start time to familiarize yourself with the equipment and procedures. Please sit towards the front of the room in the session in which you present. The Session Chair will introduce your presentation as well as monitor the length of the presentation. All mobile phones must be turned off while you are presenting: mobile phones on silent will cause interference with the microphones. A laser pointer and slide advancer will be available at the podium for your use.

Poster information

The useful size of the poster boards (i.e. the area inside the metal frame) is 95 × 190 cm (W × H); 37 × 75 in. Pins will be provided. You will have access to the boards 30 minutes before the start of your poster session. The posters should be taken down at the end of the session. At least one of the authors must be present at your poster during the poster session.

21

Special session information

The format of special sessions may vary. Presenters in these sessions will receive specific information from their respective special session chairs.

Badges

The participant name badge will be provided at the registration desk. All participants are required to wear the badge throughout the conference. Only conference badge holders will be admitted to the sessions.

WiFi

The wireless network eduroam is available at Stockholm University. If you already have eduroam configured on your devices, you should already have internet access. If you do not have eduroam, you can get online using the one-time code listed below. This code will be valid during the entire conference.

Login instructions:

1. Start a new web browser

2. Choose the SU network

3. Select “One time code”

4. Enter the username (otc-jrkn) and password (Bjk7Gh8kZX)

You can find IT-related information, rules and regulations, as well as links to support articles, etc. at http://su.se/interspeech2017.

Work stations

Each house/section in Södra Huset has a Work Station area with seating and tables, where mobile charging stations (with secure check-in) are located, as well as extra power outlets to charge computers, etc.

Mobile app

The INTERSPEECH 2017 mobile app is an application for tablet and smartphone devices (iPhone and Android). The mobile app provides easy-to-use interactive capabilities to enhance your experience as an attendee.

To download the application, visit your app store and search for “Interspeech 2017”. Provide the email you used during the online registration process.

Coffee and lunch

Coffee is included in the registration fee and will be served according to the program in all work station areas in Södra Huset, as well as outside the lecture hall in Aula Magna. Lunch can be bought in a number of restaurants both in Södra Huset and in Allhuset (see the campus map).


Accessibility

Special services (e.g., wheelchair-accessible transportation, reserved seating) are available if requested in advance. Should you require assistance onsite, please visit the conference Registration Desk.

Smoking policy

Sweden has a non-smoking policy, i.e. smoking is prohibited in public buildings, public transport, taxis, buses and trains.

Emergency

Please notify Academic Conferences or any of the INTERSPEECH 2017 staff for basic medical assistance. The general emergency call number in Sweden is 112. First aid kits, fire extinguishers and heart defibrillators are available and clearly displayed in all campus buildings.

Force Majeure

The organizers are not liable for any claims for damages and/or losses if the entire conference has to be cancelled due to a force majeure incident.

Disclaimer

The organizers are not liable for damages and/or losses of any kind which may be incurred by the conference delegates or by any other individuals accompanying them, both during the official activities as well as going to/from the conference. Delegates are responsible for their own safety and belongings.

Insurance and Vaccinations

The registration fee does not cover insurance for the delegates. The organizers recommend that delegates take out insurance in their home country to cover pre-journey cancellation for personal reasons and necessary insurance to cover accidents, medical expenses and loss of personal belongings during the visit. No vaccinations are needed when visiting Sweden.

Liability

The conference organizers cannot accept liability for injuries or losses arising from accidents or other situations during or as a consequence of the conference.

Lost & Found

Lost & Found is located at the Registration Desk in Södra Huset (House A).



Social Program

Please remember to bring your badge and entrance tickets to all social events!

Welcome Reception

Monday, 21 August, 19:00–20:30
Stockholm City Hall, Hantverkargatan 1

We invite INTERSPEECH 2017 participants to a welcome reception at the City Hall – the Swedish capital’s landmark building and the venue of the annual Nobel Prize banquet. Walk down the steps of the Blue Hall like a winner and enjoy snacks in the Golden Hall, illuminated by 18 million pieces of gold mosaic depicting scenes from Stockholm’s history. This event is generously hosted by the City of Stockholm.

The City Hall is a short walk away from the central station (T-Centralen), across the Stadshusbron at the northern bank of Lake Mälaren.

Student Reception

Tuesday, 22 August, 19:30
Kägelbanan, Mosebacke Torg 1–3

Our students are invited to a party in Kägelbanan, at Södra Teatern.

Kägelbanan is located within 20 minutes of the university by subway: take the red line in the direction of Fruängen and get off at Slussen. From there, head west on Ryssgården, then continue onto Peter Myndes backe, turn left onto Götgatan and left again onto Urvädersgränd. Continue onto Mosebacke Torg until you see the sign to Södra Teatern.

Standing Banquet

Wednesday, 23 August, 19:00–22:00
Tekniska museet, Museivägen 7

The INTERSPEECH 2017 standing banquet will take place at the Tekniska Museet – the National Museum of Science and Technology. Apart from mingling with your colleagues among the exhibits (from Swedenborg’s ‘flying machine’ to early desktop computers), you will get to know Swedish innovators throughout the ages, and enjoy the many interactive attractions of the museum. For a multisensory experience, enter the MegaMind to make music with your whole body, paint with your eyes and create virtual sculptures. Of course, taste and smell will also be satisfied: we will serve fully organic, Nordic food, including lovingly selected wine pairings.

Buses will wait for you at the Stockholm University campus and bring you to the banquet venue. Alternatively, you can take underground line 14 to Östermalmstorg, walk down Birger Jarlsgatan to Nybroplan and arrive at the venue on bus 69 (stop: Museiparken).


INTERSPEECH 2017 Organizing Team

General Chair

Francisco Lacerda, Stockholm University

General Co-Chair

David House, KTH Royal Institute of Technology

Technical Program Chairs

Mattias Heldner, Stockholm University

Joakim Gustafson, KTH Royal Institute of Technology

Sofia Strömbergsson, Karolinska Institutet

Exhibits

Iris-Corinna Schwarz, Stockholm University

Special events

Hatice Zora, Stockholm University

Anders Eriksson, Stockholm University

Keynotes

Rolf Carlson, KTH Royal Institute of Technology

Special sessions

Jens Edlund, KTH Royal Institute of Technology

Show & Tell

Jonas Beskow, KTH Royal Institute of Technology

Tutorials

Gabriel Skantze, KTH Royal Institute of Technology

Björn Granström, KTH Royal Institute of Technology

Satellite Workshops

Anders Eriksson, Stockholm University

Olov Engwall, KTH Royal Institute of Technology


Sponsorship

Johan Boye, KTH Royal Institute of Technology

Mats Wirén, Stockholm University

Samer Al Moubayed, KTH Royal Institute of Technology

Anders Eriksson, Stockholm University

Student Affairs

Catharine Oertel, KTH Royal Institute of Technology

Volunteer Coordinators

Kätlin Aare, Stockholm University

Johanna Schelhaas, Stockholm University

Social Activities

Zofia Malisz, KTH Royal Institute of Technology

Publications

Marcin Włodarczak, Stockholm University

Information Technology

Mia Söderbärj, Stockholm University

Conference Catering

Malin Björk, Hörs

Professional Conference Organiser

Annica Hultfeldt, Academic Conferences

Lina Sarenius, Academic Conferences


Technical Program Committee

Technical Program Chairs

Mattias Heldner, Stockholm University

Joakim Gustafson, KTH Royal Institute of Technology

Sofia Strömbergsson, Karolinska Institutet

Special sessions

Jens Edlund, KTH Royal Institute of Technology

Show & Tell

Jonas Beskow, KTH Royal Institute of Technology

Area Chairs

Speech Perception, Production, and Acquisition
Olov Engwall, KTH Royal Institute of Technology, Sweden
Shrikanth Narayanan, University of Southern California, USA
Martin Cooke, University of the Basque Country, Spain
Odette Scharenborg, Radboud Universiteit Nijmegen, Netherlands
Bernd Möbius, Saarland University, Germany

Phonetics, Phonology, and Prosody
Petra Wagner, Bielefeld University, Germany
Keikichi Hirose, University of Tokyo, Japan
Julia Hirschberg, Columbia University, USA
Mary Beckman, Ohio State University, USA

Analysis of Paralinguistics in Speech and Language
Elmar Nöth, University of Erlangen-Nuremberg, Germany
Khiet Truong, University of Twente, Netherlands
Björn W. Schuller, University of Passau, Germany
Nigel Ward, The University of Texas at El Paso, USA
Panayiotis Georgiou, University of Southern California, USA

Speaker and Language Identification
Haizhou Li, National University of Singapore, Singapore
Jean-Francois Bonastre, University of Avignon, France
Kornel Laskowski, Voci Technologies, USA
Michael Wagner, TU Berlin, Germany

Analysis of Speech and Audio Signals
Jonas Beskow, KTH, Sweden
Martti Vainio, University of Helsinki, Finland
Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign, USA
Paavo Alku, Aalto University, Finland


Speech Coding and Enhancement
Tom Bäckström, Aalto University, Finland
Sebastian Möller, TU Berlin, Germany
John Hansen, University of Texas at Dallas, USA
Preeti Rao, IIT Bombay, Powai, India

Speech Synthesis and Spoken Language Generation
Hema A. Murthy, IIT Madras, India
Mirjam Wester, University of Edinburgh, UK
Ingmar Steiner, Saarland University, Germany
Alan Black, Carnegie Mellon University, USA

Speech Recognition — Signal Processing, Acoustic Modeling, Robustness and Adaptation
Douglas O’Shaughnessy, Université du Québec, Canada
Kate Knill, University of Cambridge, UK
Giampiero Salvi, KTH, Sweden
Børge Lindberg, Aalborg University, Denmark
Jean-Luc Gauvain, LIMSI, France

Speech Recognition — Architecture, Search, and Linguistic Components
Lori Lamel, LIMSI, France
Helen Meng, Chinese University of Hong Kong, China
Geoff Zweig, Microsoft, USA
Michiel Bacchiani, Google, USA

Speech Recognition — Technologies and Systems for New Applications
Torbjørn Svendsen, Norwegian University of Science and Technology, Norway
Roger Moore, University of Sheffield, UK
Florian Metze, Carnegie Mellon University, USA
Nikko Ström, Amazon, USA

Spoken Dialog Systems and Analysis of Conversation
Gabriel Skantze, KTH, Sweden
Kristiina Jokinen, University of Helsinki, Finland
Agustin Gravano, Universidad de Buenos Aires, Argentina
Amanda Stent, Bloomberg LP, USA

Spoken Language Processing: Translation, Information Retrieval, Summarization, Resources and Evaluation
Arne Jönsson, Linköping University, Sweden
Tanja Schultz, Universität Bremen, Germany
Isabel Trancoso, INESC-ID, Portugal
Roland Kuhn, National Research Council, Canada


Scientific Review Committee

Alberto AbadOssama Abdel-HamidNassima Abdelli-BeruhÅsa AbelinAlex AceroLauren AckermanAndre AdamiGilles AddaMartine Adda-DeckerJordi AdellMohamed AfifyYannis AgiomyrgiannakisShyam AgrawalManu AiraksinenMasato AkagiMasami AkamineMurat AkbacakYuya AkitaMd Jahangir AlamFelix AlbuJan AlexanderssonPaavo AlkuAlexandre AllauzenFil AllevaJens AllwoodJesús B. Alonso-HernándezTanel AlumäeEliathamby AmbikairajahGilbert AmbrazaitisAngélique AmelotNoam AmirOve AndersenTim AndersonWalter AndrewsJorn AnemullerPongtep AngkititrakulXavier AngueraTakayuki AraiShoko ArakiJulian David Arias LondoñoYasuo ArikiEbru ArisoySebastian ArndtMarc ArnelaHagai AronowitzLevent ArslanMichael AshbyPeter AssmannRamón AstudilloEva Liina AsuBishnu AtalKartik AudhkhasiNicolas AudibertCinzia AvesaniMatthew AylettHarald BaayenMolly Babel

Pierre BadinPaolo BaggiaLadan Baghai-RavaryMohamad Hasan BahariGerard BaillyJorge BaptistaPlinio BarbosaNelly BarbotEllen Gurman BardJon BarkerEtienne BarnardAnna BarneyDante BaroneRoberto Barra-ChicoteVincent BarriacNikoletta BasiouFernando BatistaAnton BatlinerStefan BaumannTimo BaumannAli Orkan BayerFrederic BechetTilman BeckerSteve BeetHomayoon BeigiPeter BellJerome BellegardaMohamed Ben JannetAtef Ben YoussefJose Miguel BenediŠtefan BenušMohamed Faouzi BenzeghibaElika BergelsonJens BergerChristina BergmannNicole BeringerKay BerklingJulie BerndsenGunilla BerndtssonJared BernsteinFrederic BerthommierNicola BertoldiLaurent BesacierPeter BirkholzJudith BishopJose Luis Blanco MurilloTobias BockletOcke-Schwen BohnAntonio BonafonteJean-Francois BonastreZinny BondDaniel BoneFrancesca BoninAnne BonneauJonas BorgstromHynek BorilTomáš Boril

Hans Rutger BoskerPhilippe Boula de MareüilGilles BoulianneHerve BourlardRachel BouserhalPierre-Michel BousquetSuzanne BoyceJohan BoyeFabian BrackhaneMichael BradyDavid BraudeBettina BraunAngelika BraunHervé BredinAndrew BreenCatherine BreslinJohn BridleLaurence BruggemanAlejna BrugosAlessio BruttiGreg BryantLuis BueraMurtaza BulutH Timothy BunnellHarry BuntL. Ann BurchfieldSusanne BurgerLukas BurgetFelix BurkhardtIan BurnettCarlos BussoDani ByrdTom BäckströmRonald BöckJoao CabralPeter CahillLuis Caldas de OliveiraZoraida CallejasJose Ramon Calvo de LaraNick CampbellWilliam CampbellValentín Cardeñoso-PayoPatrick CardinalChristopher CarignanRolf CarlsonRebecca CarrollFrancisco CasacubertaDiamantino CaseiroDiego CastanMaria Jose Castro-BledaChristophe CerisaraJan CernockýRupayan ChakrabortyWilliam ChanSenthilkumar ChandramohanDelphine CharletCiprian Chelba


Chandra Sekhar ChelluXin ChenLiping ChenBerlin ChenFei ChenFang ChenI-Fan ChenYun-Nung ChenXie ChenGuoguo ChenKuan-Yu ChenLing-Hui ChenNancy ChenJian ChengRathinavelu ChengalvarayanMohamed ChetouaniJonathan CheveluLuong Chi MaiJen-Tzung ChienK.K. ChinEng Siong ChngJeung-Yoon Elizabeth ChoiGérard CholletMathieu CholletMonojit ChoudhuryKhalid ChoukriHeidi ChristensenKenneth ChurchRobert ClarkCynthia ClopperMartin CmejrekLuísa CoheurJennifer ColeAlistair ConkieRobin CooperRicardo CordobaPiero CosiChristophe CouvreurAlejandrina CristiaOlivier CrouzetTamás Gábor CsapóHeriberto CuayahuitlJia CuiXiaodong CuiSandro CumaniNicholas CumminsFred CumminsFrancesco CutugnoChristophe d’AlessandroLuis Fernando D’HaroDeborah DahlRasmus DallGeraldine DamnatiJianwu DangFalavigna DanieleGiacobello DanieleKhalid DaoudiTran-Huy DatTorsten DauMarelie DavelChris Davis

Ken de JongCeline De LoozeJose Mario De MartinoRenato de MoriBert de VriesFebe De WetCarme de-la-MotaDavid DeanSalil DeenaGilles DegottexNajim DehakMichael DeisherPaul DekkerPhillip DeLeonHéctor DelgadoArnaud DelhayVeronique DelvauxAndrea DeMarcoGrazyna DemenkoCenk DemirogluKris DemuynckYasuharu DenBruce DenbyLi DengHuiqun DengAnoop DeorasNina DethlefsDavid DeVaultMaria-Gabriella Di BenedettoGiuseppe Di FabbrizioMireia DiezVassilis DigalakisDimitrios DimitriadisDiana DimitrovaSnezhina DimitrovaDileep Aroor DineshHongwei DingSascha DischPierre DivenyiOlga DmitrievaSimon DobnikGerry DochertyLaura Docio-FernandezRama Sanand DoddipatlaMarion DohenHans DolfingMinghui DongDavid DoukhanOlivier DerooChristoph DraxlerCarlo DrioliJasha DroppoThomas DrugmanAndrzej DrygajloHarishchandra DubeyJacques DuchateauSophie DufourRichard DufourStéphane DupontEmmanuel DupouxDaniel Duran

Thierry DutoitRobert EklundVaclav EkslerMounya ElhilaliBenjamin ElieDaniel EllisOlov EngwallJulien EppsHakan ErdoganDonna EricksonAnders ErikssonDaniel ErroEngin ErzinDavid EscuderoChristina EspositoYannick EstèveKeelan EvaniniNicholas EvansFlorian EybenMauro FalconeIsabel FaléTiago FalkXing FanJérôme FarinasKevin FarrellDavid FarrisMireia FarrúsFriedrich FaubelCamille FauthBenoit FauveBenoit FavreMarcello FedericoTibor FegyóJunlan FengRaul FernandezLaura Fernández GallardoEmmanuel FerragneIsabelle FerranéJavier FerreirosLuciana FerrerCarlos FerrerLionel FeugèreTim FingscheidtVolker FischerJanet FletcherJosé A. R. FonollosaEric Fosler-LussierGeorge FosterCecile FougeronRobert A. FoxHoracio FrancoCorinne FredouilleJohan FridGerald FriedlandDaniel FriedrichsSónia FrotaRobert FuchsSusanne FuchsGuillaume FuchsMark FuhsMasakiyo Fujimoto


Takashi FukudaTakahiro FukumoriSadaoki FuruiAdamantios GafosMark GalesOlivier GalibertAscension Gallardo-AntolinSriram GanapathyAnkur GandheSuryakanth V GangashettySiva Reddy GangireddySharon GannotFernando GarcíaDaniel Garcia-RomeroPhilip N. GarnerMaeva GarnierHarinath GarudadriRoberto GemelloMunir GeorgesKallirroi GeorgilaBranislav GerazovTove GerholmTimo GerkmannnBruce GerrattSayan GhoshPrasanta GhoshArnab GhoshalDafydd GibbonJonathan GinzburgLaurent GirinOndrej GlembekJuan Ignacio Godino LlorenteStefan GoetzeH. Gokhan IlkLouis GoldsteinPavel GolikChristian GollanAngel GomezPedro Gómez-VildaJian GongYifan GongJose A. GonzalezAllen GorinKyle GormanMária GósyMartijn GoudbeekPhilippe GournayEvandro GouveaMartin GraciarenaCalbert GrahamVolodya GrancharovBjörn GranströmAgustin GravanoGuillaume GravierDavid GraydenPhil GreenSteven GreenbergFrancis GrenezFrantisek GrezlGintare GrigonyteDavid Griol

Wentao GuOriol GuaschJon GudnasonCristina GuerreroTanaya GuhaRodrigo GuidoVishwa GuptaCarlos GussenhovenTino HaderleinReinhold Haeb-UmbachChristina HagedornSeongjun HahmStefan HahnThomas HainEva HajicovaDilek Hakkani-TurPierre HalleSimon HammondKyu HanCemal HanilciAmir Hossein Harati NejadTorbatiPhilip HardingJonathan HarringtonWilliam HartmannMadina HasanTaufiq HasanMark Hasegawa-JohnsonKei HashimotoVille HautamakiT. J. HazenXiaodong HeLei HeMartin HeckmannRajesh HegdeGeorg HeigoldPaul HeisterkampHartmut HelmkeMatthew HendersonJohn HendersonRichard HendriksNathalie HenrichGustav Eje HenterCaroline HentonChristian HerbstChristian HerffHynek HermanskyInma HernaezLuis Hernandez-GomezGabriel Hernandez-SierraJavier HernandoIngo HertrichWolfgang HessDirk HeylenJames HieronymusRyuichiro HigashinakaIvan HimawanRebecca HincksFlorian HintzYusuke HiokaHans-Guenter Hirsch

Ruediger HoffmannBjorn HoffmeisterVolker HohmannWendy HolmesKiyoshi HondaQingyang HongPierre-Edouard HonnetRon HooryTakaaki HoriChiori HoriMerle HorneJulian HoughDavid HouseIan HowardQiong HuRongqing HuangChien-Lin HuangPo-Sen HuangQiang HuangDongyan HuangRainer HuberMark HuckvaleThomas HueberDavid Huggins DainesMartina HuhtamäkiMelvyn HuntLluís-F. HurtadoAhmed Hussen AbdelazizBrian HutchinsonHarald HögeFlorian HönigOsamu IchikawaYusuke IjimaIrina IllinaSatoshi ImaizumiDavid ImsengElias IosifToshio IrinoToshiko Isei-JaakkolaCarlos IshiTakeshi IshiharaMasato IshizakiKhalil IskarousKen-ichi IsoAkinori ItoKoji IwanoBassam JabaianAdam JaninDavid JaniszekStefanie JannedyAren JansenPeter JaxMilan JelinekJesper JensenAlexandra JesseMichael JessenLuis M.T. JesusHui JiangMinho JinQin JinCheolwoo Jo


Michael JohnsonMichael JohnstonEmma JokinenOliver JokischCaroline JonesSzu-Chen Stan JouDenis JouvetTim JuergensPreethi JyothiRainer JäckelTokihiko KaburagiAbdellah KachaZdravko KacicJuliette KahnAlexander KainMarina KalashnikovaKaustubh KalgaonkarOzlem KalinliYutaka KamamotoHirokazu KameokaHerman KamperAhilan KanagasundaramNaoyuki KandaJohn KaneHong-Goo KangStephan KanthakArthur KantorVsevolod KapatsinskiMartin KarafiatFredrik KarlssonAlexey KarpovJames KatesHiroaki KatoAthanasios KatsamanisHideki KawaharaTatsuya KawaharaShinichi KawamotoHeysem KayaPatricia KeatingSimon KeizerFinnian KellyHeather KemberCasey KenningtonJoseph KeshetElie KhouryByeongchang KimHoi Rin KimHyung Soon KimJeesun KimHong Kook KimKee-Ho KimSamuel KimSeokhwan KimJangwon KimWooil KimNam Soo KimSanghun KimDoYeong KimJong-mi KimOwen KimballSimon King

Brian KingsburyTomi KinnunenKeisuke KinoshitaIrina KipyatkovaTatsuya KitamuraNorihide KitaokaEsther KlabbersDietrich KlakowFelicitas KleberThomas KleinbauerStefan KleinerKatarzyna KlessaHanseok KoTakao KobayashiAlexei KochetovMarcel KockmannSri Rama Murty KodukulaLaura KoenigTina KohlerDaniel KohlsdorfJachym KolarThomas KollarDorothea KolossaKazunori KomataniMariko KondoMyoung-Wan KooSunil Kumar KopparapuJacques KoremanTomoki KoriyamaTakafumi KoshinakaMaria KoutsogiannakiJarek KrajewskiIvan KraljevskiJody KreimanJelena KrivokapicChristian KroosGernot KubinOleg KudashevKshitiz KumarJimmy KunzmannGrace KuoGakuto KurataMikko KurimoChul Hong KwonOh-Wook KwonPietro LafaceCatherine LaiUnto K. LainePierre LanchantinIan LaneItshak LapidotYves LaprieAnthony LarcherRomain LarocheLars Bo LarsenMartha LarsonStaffan LarssonEva LasarcykLukas LataczJavier LatorreGalina Lavrentyeva

Aaron LawsonPhu LeSébastien Le MaguerJonathan Le RouxJeremie LecomteGwénolé LecorvéBenjamin LecouteuxChi-Chun LeeTan LeeSungjin LeeSungbok LeeLin-shan LeeHung-yi LeeKong Aik LeeChin-Hui LeeJaewon LeeRoch LefebvreFabrice LefevreMilan LegátYun LeiArne LeijonIolanda LeiteCheung-Chi LeungGary LeungMichael LevitRivka LevitanGina-Anne LevowNatalie LewandowskiAijun LiHaizhou LiMing LiBo LiJunfeng LiNing LiJinyu LiQi LiHui LiangHank LiaoRobin LickleyJean-Sylvain LienardCarlos LimaAmaro LimaJen-Chun LinGeorges LinaresJonas LindhAnders LindströmZhen-Hua LingPärtel LippusPierre LisonJia LiuGang LiuChaojun LiuWenju LiuPengfei LiuXunying LiuYi LiuEduardo Lleida SolanoJoaquim LlisterriDeborah LoakesAlexander LodermeyerAnders Lofqvist


Anette LohmanderDamien LoliveYanhua LongJosé LopesRamon Lopez-CozarPaula Lopez-OteroTeresa Lopez-SotoAnastassia LoukinaLiang LuXugang LuHeng LuJorge LuceroSteven LulichKristina Lundholm ForsSusann LuperfoyJordi LuqueAthanasios LykartsisJeff MaChangxue MaBin MaNing MaRoland MaasEwen MacDonaldJavier Macias-GuarasaIan MaddiesonSrikanth MadikeriKikuo MaekawaMathew Magimai DossShakuntala MahantaRanniery MaiaAndreas MaierManwai MakBrian MakFabrice MalfrèreZofia MaliszSri Harish MallidiLidia ManguKazunori ManoGautam MantenaKrzysztof MarasekErik MarchiStefania MarinEllen MarklundDavid MarksRainer MartinDavid Martínez GonzálezCarlos-D. Martínez-HinarejosDavid Martins de MatosRicard MarxerSameer MaskeyDom MassaroHinako MasudaTakashi MasukoAna Isabel MataMarco MatassoniPavel MatejkaSpyros MatsoukasTomoko MatsuiYuri MatveevLudo MaxAnita McAllister

Alan McCreeErik McDermottMitchell McLarenJames McQueenMichael McTearRaveesh MeenaDaryush MehtaSylvain MeignierAlexsandro MeirelesLucie MénardNorma Mendoza-DentonThomas MerrittAngeliki MetallinouMarie MeteerChristine MeunierFanny MeunierBernd T. MeyerYohann MeynadierLei MiaoMichael S. VitevitchMichel HoenAntonio MiguelBen MilnerYasuhiro MinamiNobuaki MinematsuMajid MirbagheriTaniya MishraAnanya MisraVikramjit MitraTakemi MochidaPeggy MokParham MokhtariHelena MonizSeung-Jae MoonElliot MooreTine MooshammerSylvia MoosmuellerNicolas MoralesJuan A. Morales-CordovillaMohamed MorchidAsuncion MorenoMasanori MoriseTakehiro MoriyaAlessandro MoschittiPetr MotlicekAthanasios MouchtarisAmr MousaEmily Mower ProvostPejman MowlaeeHannah MuckenhirnKaren MulakLudek MullerKevin MunhallBenjamin MunsonHema MurthyMarkus MüllerSara MyrbergSebastian MöllerNarendra N PCliment NadeuYasuko Nagano-Madsen

Venki NageshaTofigh NaghibiDevang NaikMaryam NajafianKazuhiro NakadaiSeiichi NakagawaAtsushi NakamuraSatoshi NakamuraTomohiro NakataniArun NarayananEva NavasMatteo NegriGéza NémethFriedrich NeubarthAndrew NevinsHermann NeyRaymond W. M. NgPatrick NguyenNoel NguyenTrung Hieu NguyenChongjia NiAilbhe Ní ChasaideMauro NicolaoOliver NiebuhrJan NiehuesKuniko NielsenThomas NieslerAnton RagniMattias NilssonKristina Nilsson BjörkenstamMasafumi NishidaMasayuki NishiguchiMasafumi NishimuraTakanobu NishiuraNobuyuki NishizawaSven NordholmTakashi NoseJan NouzaMirek NovakDavid NovickSergey NovoselovMarkus Nussbaum-ThomNicolas ObinYasunari ObuchiCatharine OertelAtsunori OgawaTetsuji OgawaYoon Mi OhYamato OhtaniHiroshi G. OkunoMohamed OmarIlya OparinRoeland OrdelmanJuan Rafael Orozco-ArroyaveAlfonso OrtegaMari OstendorfNele OtsSlim OuniKeiichiro OuraMukund PadmanabhanVincent Pagel


Serguei PakhomovYue PanYi-Cheng PanHo-hsien PanSankaran PanchapagesanAshish PandaPrem C. PandeyAasish PappuEmilia Parada-CabaleiroJosé PardoNaveen PariharAlok ParlikarPatrick ParoubekSHK (Hari) ParthasarathiSarangarajan ParthasarathyTanvina PatelHemant PatilKailash PatilMatthias PaulikSteffen PauwsVijayaditya PeddintiAntonio M. PeinadoCatherine PelachaudCarmen Peláez-MorenoJason PelecanosThomas PellegriniBryan PellomXavier PelorsonSharon PeperkampFernando PerdigãoJavier PerezRubén Pérez RamónJosé L. Pérez-CórdobaFranz PernkopfPascal PerrierOlivier PerrotinSandra PetersCaterina PetroneDijana PetrovskaMichael PichenyRoberto PieracciniOlivier PietquinJulien PinquierJohn PitrelliFerran PlaChristian PlahlOldrich PlchotAndrew PlummerJouni PohjalainenJoseph PolifroniLinda PolkaJosé PortêloFrançois PortetAlexandros PotamianosGerasimos PotamianosBlaise PotardMarianne PouplierDan PoveyRohit PrabhavalkarS R Mahadeva PrasannaKristin Precoda

Patti PriceRyan PriceMichael ProctorFelix PutzeManfred PützerYanmin QianYao QianThomas QuatieriCarl QuillenGanna RaboshchukTuomo RaitioPadmanabhan RajanNitendra RajputBhuvana RamabhadranVikram RamanarayananV RamasubramanianDaniel RamosVivek Kumar RangarajanSridharWei RaoK Sreenivasa RaoKanishka RaoRamya RasipuramAriya RastrowAnabela RatoAndreia RauberSuman RavuriChristian RaymondManny RaynerMelissa A. RedfordMario ReficeUwe ReichelPatrick ReidyEva ReinischNorbert ReithingerSteve RenalsSteven RennieFernando Gil Vianna ResendeJuniorDouglas ReynoldsDayana Ribas GonzalezRicardo RibeiroCarlos RibeiroGiuseppe RiccardiGaël RichardFred RichardsonKorin RichmondKorbinian RiedhammerVerena RieserLuca RigazioMichael RileyFabien RingevalChristian RitzTony RobinsonAmelie Rochet-CapellanEduardo Rodriguez BangaLuis Javier Rodriguez-FuentesAxel RoebelMikael RollRichard RoseOlivier Rosec

Andrew RosenbergSolange RossatoSophie RossetAntti-Veikko RostiJean-Luc RouasMickael RouvierAlexander RudnickyFrank RudziczVesa RuoppilaMartin RussellDavid RybachAnssi RämöOkko RäsänenSeyed Omid SadjadiRahim SaeidiSaeid SafaviYoshinori SagisakaLakshmi SaheerMd SahidullahTara SainathDaisuke SaitoSakriani SaktiGláucia Laís SalomãoElliot SaltzmanStan SalvadorK SamudravijayaRubén San Segundo HernándezJon SanchezVictoria SanchezJoan Andreu SánchezEmilio SanchisGermán Sanchis-TrillesBonny SandsAbhijeet SangwanAnanth SankarJoao Felipe SantosGeorge SaonMurat SaraclarIbon SaratxagaRuhi SarikayaAchintya sarkarPriyankoo SarmahMilton Sarria-PajaAkira SasouAntonio Satue VilarChristophe SavariauxMichelina SavinoAnn SawyerOscar SazThomas SchaafDietmar SchabusFelix SchaefflerEllika SchallingNicolas SchefferStefan SchererDavid SchlangenRalf SchlüterSven SchmeierJean SchoentgenJuergen SchroeterIris-Corinna Schwarz


Antje SchweitzerJim ScobbieJilt SebastianChandra Sekhar SeelamantulaEncarna SegarraJose SeguraFrank SeideEthan SelfridgeGregory SellMichael SeltzerChristine SenacGregory SenayCheol Jae SeongAntónio SerralheiroGuruprasad SeshadriVidhyasaharan SethuAbhinav SethyTuraj ShabestaryIzhak ShafranMatt ShannonNeeraj SharmaStefanie Shattuck-HufnagelSlava ShechtmanRavi Ranganath ShenoySven ShepstoneEthan Sherr-ZiarkoYoshinori ShigaTetsuya ShimamuraJiyoung ShinKoichi ShinodaTakahiro ShinozakiSayaka ShiotaSuwon ShonRyan ShostedIngo SiegertRosario SignorelloVered Silber-VarodJan SilovskyMichel SimardJuraj ŠimkoFlavio SimoesKonstantin SimonchikAdrian SimpsonElliot SingerRohit SinhaSabato Marco SiniscalchiOlivier SiohanSunayana SitaramMan-hung SiuSunil SivadasMatthias SjerpsJan SkoglundMalcolm SlaneyRaymond SlyhRudolph SockAlex SokolovMaria Josep SoléRubén Solera-UreñaHagen SoltauMitchell SommersFrank Soong

Victor SorokinRichard SproatKaavya SriskandarajaJacek StachurskiThemos StafylakisIan StavnessMark SteedmanStefan SteidlIngmar SteinerGeorg StemmerEvgeny StepanovRichard SternAndreas StolckeSimon StoneSvetlana StoyanchevStephanie StrasselHelmer StrikJane Stuart-SmithSebastian StükerYannis StylianouYi SuAswin ShanmugamSubramanianDavid Suendermann-OeftJun-Won SuhAlexandre SuireRafid SukkarXie SunMing SunQinghua SunHarshavardhan SundarShiva SundaramLoredana Sundberg CerratoMasayuki SuzukiPiergiorgio SvaizerMarc SwertsPawel SwietojanskiAnn SyrdalMichael Syskind PedersenEva SzekelyIgor SzokeMarija TabainMartha Yifiru TachbelieYuuki TachiokaSatoshi TakahashiToru TakahashiShinji TakakiTetsuya TakiguchiYik-Cheung TamFabio TamburiniTien Ping TanZheng-Hua TanKazuyo TanakaHiroki TanakaYun TangKevin TangJianhua TaoErin TavanoArvi TavastNaohiro TawaraAntónio Teixeira

Carlos TeixeiraJoão Paulo TeixeiraDominic TelaarLouis ten BoschJoseph TeppermanPr. Sten TernströmFabio TesserVeena ThenkanidiyoorBarry-John TheobaldSamuel ThomasWilliam ThorpeJill ThorsonMark TiedeMichael TjalveTomoki TodaMassimiliano TodiscoRoberto TogneriShinichi TokumaKanako TomaruLaura TomokiyoRong TongPedro Torres-CarrasquilloAttila Máté TóthLászló TóthAsterios ToutiosTien Dung TranDat TranJan TrmalJuergen TrouvainStavros TsakalidisYu TsaoShu-Chuan TsengChiu-yu TsengAndreas TsiartasKimiko TsukadaAlba TuninettiGokhan TurOytun TurkMichael TylerZoltán TüskeChristina TånnanderStefan UltesMasashi UnokiMaria UtherMartti VainioClaudio VairCassia Valentini-BotinhaoFrancisco J Valverde-AlbaceteDirk Van CompernolleRogier van DalenHenk van den HeuvelLaurens van der WerffWim van DommelenJan van DoornKristin Van EngenHugo Van hammeCharl van HeerdenAnnemie Van HirtumDavid van LeeuwenDaniel Van NiekerkJan van Santen


Rob van SonAmparo VaronaAdriana VasilacheIoana VasilescuNanette VeilleuxDimitra VergyriPieter VermeulenLyan VerwimpKlara VicsiMarina VigarioCoriandre VilainJesus VillalbaFernando VillavicencioEmmanuel VincentRavichander VipperlaTuomas VirtanenCarlos Vivaracho-PascualBogdan VlasenkoCarl VogelAdam VogelJan-Niklas Voigt-AntonsStephen VoranNgoc Thang VuAnil Kumar VuppalaMichael WagnerAnita WagnerAgnieszka WagnerPetra WagnerAlexander WaibelMichael WalshPatrick WambacqMichael WandLijuan WangHsin-Min WangHsiao-Chuan WangWilliam Yang WangYongqiang WangDongmei WangNigel WardShinji WatanabeCatherine WatsonOliver Watts

James WaymanJianguo WeiEugene WeinsteinMelanie WeirichBenjamin WeissChristian WellekensStanley WenndtStefan WernerAllison WetterlinRobert WhitmanDaniel WillettJason D WilliamsMats WirénMarcin WłodarczakWolfgang WokurekMatthias WolffKrzysztof WolkMaria WoltersPhil WoodlandChuck WootersJohan WoutersChung-Hsien WuZhizheng WuZhiyong WuChai WutiwiwatchaiRui XiaXiong XiaoShaofei XueJunichi YamagishiYoichi YamashitaYonghong YanUmit YapanelMahsa YarmohammadiKeiichi YasuGuoli YeBayya YegnanarayanaChing-Feng YehHani Camille YehiaSerdar YildirimEmre YilmazNestor Becerra YomaKiyoko Yoneyama

Chang YooDongsuk YookSu-Youn YoonKoichiro YoshinoChanghuai YouChengzhu YuKai YuDong YuJiahong YuanYoung-Sun YunFrançois YvonStephen ZahorianMilos ZeleznyIlija ZeljkovicMargaret ZellersHeiga ZenElisabeth ZetterholmAndrej ZgankWei ZhangYu Zhangpengyuan zhangJinsong ZhangZixing ZhangChunlei ZhangZhengchen ZhangChi (Leo) ZhangRui ZhaoYunxin ZhaoThomas Fang ZhengXinhui ZhouXiaodan ZhuXiaodan ZhuangAli ZiaeiFrank ZimmererImed ZitouniUdo ZoelzerHatice ZoraYuexian ZouEnrico ZovatoMarzena ZygisRobert Östling


Future INTERSPEECH Conferences

INTERSPEECH 2019

GRAZ – AUSTRIA SEPTEMBER 15th – 19th 2019

WWW.INTERSPEECH2019.ORG

»CROSSROADS OF SPEECH AND LANGUAGE«

GENERAL CHAIRS: Gernot Kubin (Graz), Zdravko Kacic (Maribor)
TECHNICAL CHAIRS: Thomas Hain (Sheffield), Björn Schuller (Passau/London)

© Graz Tourismus - Harry Schiffer



Satellite Workshops and Events

Young researchers roundtable on spoken dialogue systems (YRRSDS 2017)

www.yrrsds.org

Saarland University, Saarbrücken, Germany
13–14 August 2017

Organizers:
Adriana Camacho, University of Texas at El Paso, USA
Iona Gessinger, Saarland University, Germany
Ivan Gris, University of Texas at El Paso, USA
José David Lopes, KTH Royal Institute of Technology, Sweden
Ramesh Manuvinakurike, University of Southern California, USA
Maike Paetzel, Uppsala University, Sweden
Eran Raveh, Saarland University, Germany
Zahra Razavi, University of Rochester, USA
Maria Schmidt, KIT Karlsruhe Institute of Technology, Germany
Tiancheng Zhao, Carnegie Mellon University, USA

18th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL 2017)

http://www.sigdial.org/workshops/conference18/

Saarland University, Saarbrücken, Germany
15–17 August 2017

Organizers:
Kristiina Jokinen, University of Helsinki, Finland and University of Tartu, Estonia
Manfred Stede, University of Potsdam, Germany
David DeVault, University of Southern California, USA
Annie Louis, University of Essex, UK
Ivana Kruijff-Korbayova, University of Saarland and DFKI, Germany
Volha Petukhova, University of Saarland, Germany
Pierre Lison, Norwegian Computing Center, Norway
Ethan Selfridge, Interactions Corporation, USA
Amanda Stent, Bloomberg LP, USA
Jason Williams, Microsoft Research, USA

Disfluency in Spontaneous Speech (DiSS 2017)

http://diss2017.org/

KTH Royal Institute of Technology, Stockholm, Sweden
18–19 August 2017

Organizers:
Robert Eklund, Linköping University, Sweden
Robin Lickley, Queen Margaret University, UK


HSCR 2017: Second International Workshop on the History of Speech Communication Research

https://hscr2017.org/

University of Helsinki, Finland
18–19 August 2017

Organizers:
Martti Vainio, University of Helsinki, Finland
Reijo Aulanko, University of Helsinki, Finland
Juraj Šimko, University of Helsinki, Finland
Mona Lehtinen, University of Helsinki, Finland

1st International Workshop on Challenges in Hearing Assistive Technology (CHAT-2017)

http://spandh.dcs.shef.ac.uk/chat2017

Stockholm University, Sweden
19 August 2017

Organizers:
Jon Barker, University of Sheffield, UK
John Culling, University of Cardiff, UK
John Hansen, University of Texas, USA
Amir Hussain, University of Stirling, UK
Peter Nordqvist, KTH Royal Institute of Technology, Sweden

The 3rd International Workshop on Affective Social Multimedia Computing (ASMMC 2017)

http://www.nwpu-aslp.org/asmmc2017

Karolinska Institutet, Stockholm, Sweden
25 August 2017

Organizers:
Dong-Yan Huang, Institute for Infocomm Research, Singapore
Björn Schuller, University of Passau, Germany
Jianhua Tao, Chinese Academy of Sciences, China
Lei Xie, Northwestern Polytechnical University, China
Jie Yang, National Science Foundation, USA
Sven Bölte, Karolinska Institutet, Stockholm, Sweden
Dongmei Jiang, Northwestern Polytechnical University, China
Haizhou Li, National University of Singapore, Singapore


Grounding Language Understanding (GLU2017)

http://www.speech.kth.se/glu2017

KTH Royal Institute of Technology, Stockholm, Sweden
25 August 2017

Organizers:
Giampiero Salvi, KTH Royal Institute of Technology, Sweden
Jean Rouat, Université de Sherbrooke, Canada
with the support of the CHIST-ERA IGLU consortium

AVSP 2017: International Conference on Auditory-Visual Speech Processing

http://avsp2017.loria.fr/

KTH Royal Institute of Technology, Stockholm, Sweden
25–26 August 2017

Organizers:
Christopher Davis, University of Western Sydney, Australia
Jonas Beskow, KTH Royal Institute of Technology, Sweden
Slim Ouni, University of Lorraine, France
Alexandra Jesse, University of Massachusetts, USA

SLaTE 2017: The seventh ISCA workshop on Speech and Language Technology in Education

http://www.slate2017.org

Stockholm archipelago, Sweden
25–26 August 2017

Organizers:
Olov Engwall, KTH Royal Institute of Technology, Sweden
Iolanda Leite, KTH Royal Institute of Technology, Sweden
Helmer Strik, Radboud University Nijmegen, the Netherlands


Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR-2017)

http://vihar-2017.vihar.org/

University of Skövde, Sweden
25–26 August 2017

Organizers:
Robert Eklund, Linköping University, Sweden
Angela Dassow, Carthage College, Kenosha, USA
Ricard Marxer, University of Sheffield, UK
Roger K. Moore, University of Sheffield, UK
Bhiksha Raj, Carnegie Mellon University, USA
Rita Singh, Carnegie Mellon University, USA
Serge Thill, University of Skövde, Sweden
Benjamin Weiss, Technical University of Berlin, Germany

GESPIN 2017: 5th Gesture and Speech in Interaction conference

http://www.gespin.amu.edu.pl/

Adam Mickiewicz University, Poznan, Poland
25–27 August 2017

Organizers:
Maciej Karpinski, Adam Mickiewicz University, Poland
Małgorzata Fabiszak, Adam Mickiewicz University, Poland
Ewa Jarmołowicz-Nowikow, Adam Mickiewicz University, Poland
Anna Jelec, Adam Mickiewicz University, Poland
Konrad Juszczyk, Adam Mickiewicz University, Poland
Katarzyna Klessa, Adam Mickiewicz University, Poland

20th International Conference on Text, Speech and Dialogue (TSD 2017)

http://www.kiv.zcu.cz/tsd2017/

Charles University, Prague, Czech Republic
27–31 August 2017

Organizers:
Václav Matoušek, University of West Bohemia, Czech Republic
Kamil Ekštein, University of West Bohemia, Czech Republic
Miloslav Konopík, University of West Bohemia, Czech Republic
Roman Moucek, University of West Bohemia, Czech Republic
Tomáš Hercig, University of West Bohemia, Czech Republic
Eva Hajicová, Charles University in Prague, Czech Republic
Markéta Lopatková, Charles University in Prague, Czech Republic
Anna Kotešovcová, Charles University in Prague, Czech Republic


Tutorials

The INTERSPEECH 2017 Tutorial Committee, chaired by Gabriel Skantze and Björn Granström, is pleased to announce the following nine tutorials at the conference. They will be offered on Sunday, 20 August 2017. All tutorials are 3.5 hours long.

Deep Learning for Dialogue Systems

Sunday, 20 August, 09:00–12:30, B5

Organizers:
Yun-Nung Chen, National Taiwan University, Taiwan
Asli Celikyilmaz, Microsoft Research, USA
Dilek Hakkani-Tur, Google Research, USA

In the past decade, goal-oriented spoken dialogue systems (SDS) have been the most prominent component in today's virtual personal assistants (VPAs). Among these VPAs, Microsoft's Cortana, Apple's Siri, Amazon Alexa, Google Home, and Facebook's M have incorporated SDS modules in various devices, allowing users to speak naturally in order to finish tasks more efficiently. Traditional conversational systems have rather complex and/or modular pipelines. The advance of deep learning technologies has recently given rise to applications of neural models in dialogue modeling. Nevertheless, applying deep learning technologies to building robust and scalable dialogue systems is still a challenging task and an open research area, as it requires a deeper understanding of the classic pipelines as well as detailed knowledge of benchmarks for both prior and recent state-of-the-art models. This tutorial therefore gives an overview of dialogue system development, describes the most recent research on building dialogue systems, and summarizes the open challenges. The goal is to give the audience a sense of the developing trends in dialogue systems and a roadmap for getting started with the related work.
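As a minimal sketch of the modular pipeline that such tutorials typically contrast with end-to-end neural approaches (all function names, intents and rules below are invented placeholders, not part of the tutorial materials):

# Toy sketch of a modular goal-oriented dialogue pipeline (all names hypothetical).
# A real system would replace each stage with a trained, often neural, model.

def understand(utterance: str) -> dict:
    """Toy NLU: map an utterance to an intent and slot values."""
    intent = "book_flight" if "flight" in utterance else "unknown"
    slots = {"destination": "Stockholm"} if "Stockholm" in utterance else {}
    return {"intent": intent, "slots": slots}

def track_state(state: dict, nlu: dict) -> dict:
    """Toy dialogue state tracker: accumulate slot values across turns."""
    state.setdefault("slots", {}).update(nlu["slots"])
    state["intent"] = nlu["intent"]
    return state

def decide(state: dict) -> str:
    """Toy policy: request missing information or confirm."""
    if state["intent"] == "book_flight" and "date" not in state["slots"]:
        return "request(date)"
    return "confirm()"

def generate(action: str) -> str:
    """Toy NLG: template-based realisation of the system action."""
    return {"request(date)": "When would you like to travel?",
            "confirm()": "Okay, booking it."}.get(action, "Sorry?")

state = {}
for turn in ["I need a flight to Stockholm"]:
    state = track_state(state, understand(turn))
    print(generate(decide(state)))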

Insights from Qualitative Research: An Introduction to the Phonetics of Talk-in-interaction

Sunday, 20 August, 09:00–12:30, C307

Organizers:
Richard Ogden, University of York, UK
Jan Gorisch, Institute for the German Language (IDS), Germany
Gareth Walker, University of Sheffield, UK
Meg Zellers, University of Stuttgart, Germany

This tutorial will provide an overview of the methods and findings of Conversation Analysis (CA) through hands-on analysis of conversational data, exploring how qualitative analysis can inform quantitative analyses of speech. Analysis will focus on how speakers in conversation use the phonetic shape of their talk to provide recognisable places for others to take turns and which features are recognised as providing such opportunities. The tutorial will be led by experts working at the interface of CA and phonetics.


Creating Speech Databases of Less-Resourced Languages: A CLARIN Hands-On Tutorial

Sunday, 20 August, 09:00–12:30, C397

Organizers:
Christoph Draxler, Ludwig Maximilian University Munich, Germany
Florian Schiel, Ludwig Maximilian University Munich, Germany
Thomas Kisler, Ludwig Maximilian University Munich, Germany

The creation of speech databases for spoken language research and development, especially for less-resourced languages, is a time-consuming and largely manual task. In this tutorial we present a workflow comprising the specification, recording, transcription, segmentation and publication of spoken language. We will demonstrate how to use a) semi-automatic tools and b) crowdsourcing wherever possible to speed up the process. We will conclude by showing how such speech databases may be employed to adapt existing tools and services to new languages, thus facilitating access to these languages.

Statistical Parametric Speech Processing: Solving Problems with the Model-based Approach

Sunday, 20 August, 09:00–12:30, B4

Organizers:
Mads Græsbøll Christensen, Aalborg University, Denmark
Jesper Rindom Jensen, Aalborg University, Denmark
Jesper Kjær Nielsen, Aalborg University, Denmark

Parametric speech models have been around for many years but have always had their detractors. Two common arguments against such models are that it is too difficult to find their parameters and that the models do not take the complicated nature of real signals into account. In recent years, significant advances have been made in speech models and in robust, computationally efficient estimation based on statistical principles, and it has been demonstrated that, regardless of any deficiencies in the model, parametric methods outperform the more commonly used non-parametric methods (e.g., autocorrelation-based methods) for problems like pitch estimation. The application of these principles, however, extends well beyond that problem. In this tutorial, state-of-the-art parametric speech models and statistical estimators for finding their parameters will be presented and their pros and cons discussed. The merits of the statistical, parametric approach to speech modeling will be demonstrated on a number of well-known problems in speech, audio and acoustic signal processing, such as pitch estimation for non-stationary speech, distortionless speech enhancement, noise statistics estimation, speech segmentation, multi-channel modeling, and model-based localization and beamforming with microphone arrays.
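As a small illustration of the contrast drawn above (this is not the presenters' code; the signal, noise level and grid resolution are arbitrary), the sketch below compares an autocorrelation peak picker with a simple harmonic-model estimator on a synthetic vowel-like frame:

# Sketch: autocorrelation vs. a simple harmonic-model pitch estimator.
import numpy as np

fs = 8000
t = np.arange(0, 0.04, 1 / fs)                      # 40 ms frame
f0_true = 155.0
x = sum(np.cos(2 * np.pi * k * f0_true * t) / k for k in range(1, 6))
x += 0.3 * np.random.randn(len(t))                  # additive noise

# Non-parametric baseline: pick the autocorrelation peak in a plausible lag range.
r = np.correlate(x, x, mode="full")[len(x) - 1:]
lags = np.arange(int(fs / 400), int(fs / 60))       # 60-400 Hz search range
f0_acf = fs / lags[np.argmax(r[lags])]

# Model-based estimate: grid search over f0, projecting onto a harmonic basis
# (a crude stand-in for the statistically motivated estimators discussed above).
def harmonic_fit(f0, n_harm=5):
    basis = np.column_stack(
        [np.cos(2 * np.pi * k * f0 * t) for k in range(1, n_harm + 1)] +
        [np.sin(2 * np.pi * k * f0 * t) for k in range(1, n_harm + 1)])
    amp, *_ = np.linalg.lstsq(basis, x, rcond=None)
    return np.linalg.norm(basis @ amp) ** 2          # energy explained by the model

grid = np.arange(60.0, 400.0, 0.5)
f0_nls = grid[np.argmax([harmonic_fit(f) for f in grid])]
print(f"true {f0_true:.1f} Hz, autocorrelation {f0_acf:.1f} Hz, harmonic model {f0_nls:.1f} Hz")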


Real-world Ambulatory Monitoring of Vocal Behavior

Sunday, 20 August, 09:00–12:30, C6

Organizers:
Daryush D. Mehta, Massachusetts General Hospital, USA

Many of us take verbal communication for granted. Individuals suffering from voice disorders experience significant communication disabilities with far-reaching social, professional, and personal consequences. This tutorial provides an overview of long-term, ambulatory monitoring of daily voice use, with in-depth discussions of interdisciplinary research spanning biomedical technology, signal processing, machine learning, and clinical voice assessment. Innovations in mobile and wearable sensor technologies continue to aid the quantification of vocal behavior, which can be used to provide real-time monitoring and biofeedback to facilitate the prevention, diagnosis, and treatment of behaviorally based voice disorders.

Deep Learning for Text-to-Speech Synthesis, using the Merlin toolkit

Sunday, 20 August, 13:30–17:00, B5

Organizers:
Simon King, University of Edinburgh, UK
Oliver Watts, University of Edinburgh, UK
Srikanth Ronanki, University of Edinburgh, UK
Zhizheng Wu, Apple Inc., USA
Felipe Espic, University of Edinburgh, UK

This tutorial will combine the theory and practical application of Deep Neural Networks (DNNs) for Text-to-Speech (TTS). It will illustrate how DNNs are rapidly advancing the performance of all areas of TTS, including waveform generation and text processing, using a variety of model architectures. We will link the theory to implementation with the Open Source Merlin toolkit.
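As a toy illustration of the core regression task behind DNN-based TTS (this is not the Merlin toolkit's API; data shapes and hyperparameters are invented), a single hidden layer mapping binary linguistic features to acoustic parameters can be trained with plain gradient descent:

# Toy sketch: feedforward regression from linguistic features to acoustic frames.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_ling, n_acoustic, n_hidden = 2000, 300, 60, 128

X = rng.integers(0, 2, size=(n_frames, n_ling)).astype(float)          # linguistic features
W_true = rng.normal(size=(n_ling, n_acoustic))
Y = X @ W_true * 0.05 + 0.1 * rng.normal(size=(n_frames, n_acoustic))  # "acoustic" targets

W1 = rng.normal(scale=0.05, size=(n_ling, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.05, size=(n_hidden, n_acoustic)); b2 = np.zeros(n_acoustic)
lr = 1e-3
for step in range(200):
    H = np.tanh(X @ W1 + b1)                 # hidden layer
    pred = H @ W2 + b2                       # predicted acoustic frame
    err = pred - Y
    loss = np.mean(err ** 2)
    # Backpropagation (gradients scaled by the number of frames)
    dW2 = H.T @ err / n_frames; db2 = err.mean(0)
    dH = err @ W2.T * (1 - H ** 2)
    dW1 = X.T @ dH / n_frames; db1 = dH.mean(0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
print(f"final MSE: {loss:.4f}")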

Computational Modeling of Language Acquisition

Sunday, 20 August, 13:30–17:00, B4

Organizers:
Naomi Feldman, University of Maryland, USA
Emmanuel Dupoux, Ecole des Hautes Etudes en Sciences Sociales, France
Okko Räsänen, Aalto University, Finland

Children learn their native language simply by interacting with their environment. Computational modeling of language acquisition aims to understand the information processing principles underlying the human capability to learn spoken languages without formal instruction. In addition to its basic scientific value, understanding human language acquisition may aid the development of more advanced spoken language capabilities for machines. The goal of this tutorial is to introduce participants to the basics of computational cognitive modeling, especially in the context of learning linguistic structures from real acoustic speech without labeled training data, and to provide an overview of ongoing state-of-the-art research in the area.
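As a toy illustration of learning structure without labels (synthetic formant values, not the tutorial's models), unsupervised clustering can recover vowel-like categories from unlabeled tokens:

# Toy sketch: distributional learning of sound categories by clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three hidden vowel categories, each a cloud in (F1, F2) space (values in Hz).
centres = np.array([[300, 2300], [700, 1200], [400, 800]], dtype=float)
tokens = np.vstack([c + rng.normal(scale=60, size=(200, 2)) for c in centres])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(tokens)
print("recovered category centres (F1, F2):")
print(np.round(kmeans.cluster_centers_))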


Latest Advances in Computational Speech and Audio Analysis: Big Data, Deep Learning, and Whatnots

Sunday, 20 August, 13:30–17:00, C6

Organizers:
Björn W. Schuller, University of Passau, Germany
Nicholas Cummins, University of Passau, Germany

Conventional speech-based recognition and classification systems learn from information captured in hand-engineered features. These features have been purposely designed and meticulously refined over decades to capture certain aspects of speech production, acoustic properties or phonetic information inherent in speech. However, the feature representation paradigm is currently changing: the advent of newer learning paradigms such as deep neural networks and marked increases in computing power have resulted in a shift away from hand-crafted feature representations – they can now be determined by the system itself during the learning process, albeit often at the cost of requiring large(r) amounts of data. At the same time, speech and audio analysis is becoming broader and increasingly holistic, targeting the simultaneous extraction of a broad range of aspects inherent in the signal. This tutorial will cover the most important aspects of the latest advances around "big data" and "deep learning", to name the two major themes in recent computational speech and audio analysis, from new feature representation paradigms through to the tools needed to collect the big data required to fully harness and realise their potential. Besides covering these topics on a theoretical level, the tutorial will offer hands-on experience in which participants will receive training in relevant state-of-the-art toolkits. These include scripts for end-to-end learning, openSMILE and openXBOW for feature representations, CURRENNT and others for deep learning, and openCoSy and iHEARu-PLAY for rapid learning-data acquisition through efficient social media mining and its annotation by gamified dynamic cooperative crowd-learning.
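As a conceptual sketch of one such representation paradigm, the bag-of-audio-words idea (the principle behind tools such as openXBOW, though this is not their interface) quantises frame-level descriptors against a learned codebook and represents each utterance by a codeword histogram:

# Sketch: bag-of-audio-words utterance features from toy frame-level descriptors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Stand-ins for frame-level low-level descriptors (e.g. MFCC-like vectors) per utterance.
utterances = [rng.normal(size=(rng.integers(80, 120), 13)) for _ in range(20)]

codebook = KMeans(n_clusters=16, n_init=5, random_state=0)
codebook.fit(np.vstack(utterances))          # learn codewords on all training frames

def bag_of_audio_words(frames):
    codes = codebook.predict(frames)         # assign each frame to its nearest codeword
    hist = np.bincount(codes, minlength=16).astype(float)
    return hist / hist.sum()                 # normalised histogram = fixed-length feature

features = np.stack([bag_of_audio_words(u) for u in utterances])
print(features.shape)                        # (20 utterances, 16-dimensional features)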

Modelling Situated Multi-modal Interaction with the Furhat Robot Head

Sunday, 20 August, 13:30–17:00, C307

Organizers:
Gabriel Skantze, KTH Royal Institute of Technology, Sweden
André Pereira, Furhat Robotics, Sweden

Spoken face-to-face communication is likely to be the most important means of interaction with robots in the future. In addition to speech technology, this also requires the use of visual information in the form of facial expressions, lip movement and gaze. Human-robot interaction is also naturally situated, which means that the situation in which the interaction takes place is of importance. In such settings, there might be several speakers involved (multi-party interaction), and there might be objects in the shared space that can be referred to. Recent years have seen an increased interest in modeling such communication for human-robot interaction.

In this tutorial, we will start by providing the theoretical background of spoken face-to-face interaction and how this applies to human-robot interaction. We will then go through the state of the art of the different technologies needed and how this kind of interaction can be modeled. To make the tutorial as concrete as possible, we will use the Furhat social robot platform in order to show how different interaction patterns can be implemented, and (depending on the number of participants) give hands-on exercises on how to program human-robot interaction for a social robot.


ISCA Medalist

Fumitada Itakura
Nagoya University, Japan

Monday, 21 August, 9:45–10:15, Aula Magna

Biography

Fumitada Itakura was born in Toyokawa, Japan, in August 1940. He studied electronic engineering at Nagoya University from 1958 to 1963. He advanced to its graduate school and studied information engineering, with topics such as statistical optical character recognition and time series analysis of cardiac rhythmicity. After finishing his master's degree in 1965, he worked on speech signal processing using a statistical approach. He received his doctorate in engineering from Nagoya University in 1971 for his work on a statistical method for speech analysis and synthesis.

Itakura's early work on speech spectral envelope and formant estimation using maximum likelihood methods (1967) laid the groundwork for much of the research in speech signal processing in the three subsequent decades, ranging from vocoder designs for low bit-rate transmission to distance measures (the Itakura-Saito distance) for speech pattern recognition. He introduced the concepts of the auto-regressive model and partial auto-correlation to the speech area and developed the first mathematically tractable formulation of the speech recognition problem based on the minimum prediction residual principle, providing a solid framework for integrating speech analysis, representation, and pattern matching into a complete engineering system. His work on autoregressive modeling of speech is used in almost every low-to-medium bit-rate speech transmission system. The Line Spectral Pair (LSP) representation, which he developed in 1975, is now used in nearly every cellular phone system and handset. Itakura and Hong Wang's recent work on sub-band dereverberation algorithms has also become the foundation for many new breakthroughs. His singular and yet broad contributions to speech signal processing earned him the IEEE Morris Liebmann Award in 1986, the most prestigious Society Award from the IEEE Signal Processing Society in 1996, IEEE Fellowship in 2003, the Purple Ribbon Medal from the Japanese government in 2003, and the Distinguished Achievement and Contributions Award from IEICE in 2003. These technical achievements were accomplished mainly at Nagoya University (1965–1968, 1983–2003), the fourth research section of the Musashino Electrical Communication Laboratory of NTT (1963–1973, 1975–1983), the Acoustic Research Laboratory of Bell Telephone Laboratories, Murray Hill (1973–1975), and Meijo University (2003–2011).
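For reference, the standard textbook forms of two of the contributions named above (not reproduced from this volume) are the autoregressive speech model and the Itakura-Saito spectral distance:

% Autoregressive (linear prediction) model: a speech sample s[n] is predicted
% from its p previous samples plus an excitation e[n].
\[
  s[n] = \sum_{k=1}^{p} a_k\, s[n-k] + e[n]
\]
% Itakura--Saito distance between a measured power spectrum P(\omega)
% and a model spectrum \hat{P}(\omega):
\[
  d_{IS}\bigl(P,\hat{P}\bigr)
  = \frac{1}{2\pi} \int_{-\pi}^{\pi}
    \left[ \frac{P(\omega)}{\hat{P}(\omega)}
         - \ln \frac{P(\omega)}{\hat{P}(\omega)} - 1 \right] d\omega
\]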


Keynote Speeches

James Allen
Professor of Computer Science, University of Rochester
Associate Director of the Institute for Human and Machine Cognition, Pensacola, Florida

Tuesday, 22 August, 8:30–9:30, Aula Magna

Dialogue as Collaborative Problem Solving

Abstract

I will describe the current status of a long-term effort to develop dialogue systems that go beyond simple task-execution models to systems that involve collaborative problem solving. Such systems involve open-ended discussion, and the tasks cannot be accomplished without extensive interaction (e.g., 10 turns or more). The key idea is that dialogue itself arises from an agent's ability for collaborative problem solving (CPS). In such dialogues, agents may introduce, modify and negotiate goals; propose and discuss the merits of possible paths to solutions; explicitly discuss progress as the two agents work towards the goals; and evaluate how well a goal was accomplished. To complicate matters, user utterances in such settings are much more complex than those seen in simple task-execution dialogues and require full semantic parsing. A key question we have been exploring in the past few years is how much of dialogue can be accounted for by domain-independent mechanisms. I will discuss these issues and draw examples from a dialogue system we have built that, except for the specialized domain reasoning required in each case, uses the same architecture to perform three different tasks: collaborative blocks-world planning, where the system and user build structures and may have differing goals; biocuration, in which a biologist and the system interact in order to build executable causal models of biological pathways; and collaborative composition, where the user and system collaborate to compose simple pieces of music.

Biography

James Allen is the John H. Dessauer Professor of Computer Science at the University of Rochester and Associate Director of the Institute for Human and Machine Cognition in Pensacola, Florida. He is a Founding Fellow of the American Association for Artificial Intelligence (AAAI) and a Fellow of the Cognitive Science Society. He was editor-in-chief of the journal Computational Linguistics from 1983 to 1993 and authored the well-known textbook "Natural Language Understanding". His research concerns defining computational models of intelligent collaborative and conversational agents, with a strong focus on the connection between knowledge, reasoning, language comprehension and dialogue.


Catherine Pelachaud
Director of Research, CNRS at ISIR, University of Pierre and Marie Curie

Wednesday, 23 August, 8:30–9:30, Aula Magna

Conversing with social agents that smile and laugh

Abstract

Our aim is to create virtual conversational partners. To this end, we have developed computational models that enrich virtual characters with socio-emotional capabilities communicated through multimodal behaviors. The approach we follow to build interactive and expressive interactants relies on theories from the human and social sciences as well as on data analysis and user-perception-based design. We have explored specific social signals such as smile and laughter, capturing the variation in their production but also their different communicative functions and their impact in human-agent interaction. Lately we have been interested in modeling agents with social attitudes; our aim is to model how social attitudes color the multimodal behaviors of the agents. We have gathered a corpus of dyads that was annotated along two layers: social attitudes and nonverbal behaviors. By applying sequence-mining methods we have extracted behavior patterns involved in the change of perception of an attitude, and we are particularly interested in capturing the behaviors that correspond to such a change. In this talk I will present the GRETA/VIB platform, in which our research is implemented.

Biography

Catherine Pelachaud is a Director of Research at CNRS in the ISIR laboratory, University of Pierre and Marie Curie. Her research interests include embodied conversational agents, nonverbal communication (face, gaze, and gesture), expressive behaviors and socio-emotional agents. With her research team, she has been developing GRETA, an interactive virtual agent platform that can display emotional and communicative behaviors. She has been, and still is, involved in several European projects related to believable embodied conversational agents, emotion and social behaviors. She is an associate editor of several journals, including IEEE Transactions on Affective Computing, ACM Transactions on Interactive Intelligent Systems and the Journal on Multimodal User Interfaces. She has co-edited several books on virtual agents and emotion-oriented systems. She has participated in the organization of international conferences such as IVA, ACII and the virtual agent track of AAMAS. She is the recipient of the ACM SIGAI Autonomous Agents Research Award 2015.


Björn Lindblom
Professor emeritus, University of Stockholm, Sweden
Professor emeritus, University of Texas at Austin, USA

Thursday, 24 August, 8:30–9:30, Aula Magna

Re-inventing speech – the biological way

Abstract

The mapping of the Speech Chain has so far focused on the experimentally more accessible links – e.g., acoustics – whereas the brain's activity during speaking and listening has understandably received less attention. That state of affairs is about to change thanks to the sophisticated new tools offered by brain imaging technology.

At present, many key questions concerning human speech processes remain incompletely understood despite the significant research efforts of the past half century. As speech research goes neuro, we could do with some better answers.

In this paper I will attempt to shed some light on some of these issues. I will do so by heeding the advice that Tinbergen¹ once gave his fellow biologists on explaining behavior. I paraphrase: Nothing in biology makes sense unless you simultaneously look at it with the following questions at the back of your mind: How did it evolve? How is it acquired? How does it work here and now?

Applying the Tinbergen strategy to speech, I will, in broad strokes, trace a path from the small and fixed innate repertoires of non-human primates to the open-ended vocal systems that humans learn today.

Such an agenda will admittedly identify serious gaps in our present knowledge but, importantly, it will also bring an overarching possibility:

It will strongly suggest the feasibility of bypassing the traditional linguistic, operational approach to speech units and replacing it with a first-principles account anchored in biology.

I will argue that this is the road map we need for a more profound understanding of the fundamental nature of spoken language and for educational, medical and technological applications.

Biography

I began by studying for a medical degree, but gradually my focus shifted to music and languages. Planning to make a living as a foreign-language teacher, I attended classes that happened to include two lectures on acoustic phonetics by Gunnar Fant at KTH in Stockholm. "Anyone interested in a summer job? We could use people with a linguistics background." He then went on to describe the project. Although I cannot honestly say that I had understood much of the lectures, I volunteered and got lucky. I was completely blown away by the dynamics of the KTH lab and its research activities. This was the early sixties – the post-World War II era, with lavish funding for communications and computer technology.

Later in life, I came across an anecdote about Richard Feynman, the famous physicist, who is said to have left the following formulation permanently on the blackboard of his office: "What I cannot create I do not understand!"

¹ Tinbergen, Niko (1963) "On Aims and Methods of Ethology," Zeitschrift für Tierpsychologie, 20: 410–433.


Bingo! Was he referring to the acoustic theory of speech production and copy speech synthesis? In a way, he could have been. More importantly, I believe that in this short phrase he managed to capture the ultimate essence of good science – general knowledge based on first principles. It has been at the back of my mind for over fifty years as I have studied how spoken language works on-line, how it is learned and how it came to be.

Applying the Feynman criterion to our own broad field shows that we still have a long way to go. There would be nothing wrong with embarking on that voyage equipped with the tools of Big Data and modern hi-tech neuroscience – on the contrary. But ultimately the quality of our applications – e.g. clinical, educational – will be a function of how well we really understand how humans do it.

End of sermon. Chop, chop.


Special Sessions

Speech Technologies for Code-Switching in Multilingual Communities

Monday, 21 August, 11:00–13:00, F11
Monday, 21 August, 14:30–16:30, F11

Organizers:
Kalika Bali, Microsoft Research India
Alan W Black, Carnegie Mellon University
Mona Diab, George Washington University
Julia Hirschberg, Columbia University
Sunayana Sitaram, Microsoft Research India
Thamar Solorio, University of Houston

Speech technologies exist for many high-resource languages, and attempts are being made to reach the next billion users by building resources and systems for many more languages. In the past, the main focus of the speech community has been on building monolingual systems that are capable of processing speech in a single language. Multilingual communities pose special challenges for the design and development of speech processing systems. One of these challenges is code-switching, which is the switching between two or more languages at the conversation, utterance and sometimes even word level.

In addition to conversational speech, code-switching is now found in text in social media, instant messaging and blogs in multilingual communities. Monolingual natural language and speech systems fail when they encounter code-switched speech and text. There is also a lack of linguistic data and resources for code-switched speech and text, even though one or more of the languages being mixed may be high-resource.

Code-switching poses various interesting challenges to the speech community, such as language modeling for mixed languages, acoustic modeling of mixed-language speech, pronunciation modeling and language identification from speech.

Topics of interest for this special session include speech recognition of code-switched speech, language modeling for code-switched speech, speech synthesis of code-switched text, speech translation of code-switched languages, spoken dialogue systems that can handle code-switching, speech data and resources for code-switching, and language identification from speech. We expect participants from academia and industry spanning a wide variety of language pairs and data sets. We also expect discussions on how to create speech and language resources for code-switching and on the sharing of data.
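As a toy illustration of one of the listed topics, language modeling for mixed languages (all vocabularies, probabilities and the language pair below are invented), a code-switched utterance can be scored by interpolating two monolingual unigram models:

# Sketch: interpolated monolingual language models for a code-switched utterance.
import math

lm_en = {"i": 0.05, "want": 0.02, "coffee": 0.01}          # toy English unigram LM
lm_hi = {"mujhe": 0.03, "chahiye": 0.02, "coffee": 0.005}  # toy romanised Hindi unigram LM

def interpolated_logprob(tokens, lam=0.5, floor=1e-6):
    """log P(w) under lambda * P_en + (1 - lambda) * P_hi, with a small floor for OOVs."""
    total = 0.0
    for w in tokens:
        p = lam * lm_en.get(w, floor) + (1 - lam) * lm_hi.get(w, floor)
        total += math.log(p)
    return total

utterance = "mujhe coffee chahiye".split()     # Hindi-English code-switched utterance
for lam in (0.1, 0.5, 0.9):
    print(f"lambda={lam}: log-prob = {interpolated_logprob(utterance, lam):.2f}")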


The 2nd Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2017)

Monday, 21 August, 11:00–13:00, D8
Monday, 21 August, 14:30–16:30, D8

Organizers:
Tomi Kinnunen, University of Eastern Finland
M Sahidullah, University of Eastern Finland
Hector Delgado, Eurecom
Massimiliano Todisco, Eurecom
Nicholas Evans, Eurecom
Junichi Yamagishi, National Institute of Informatics and University of Edinburgh
Kong Aik Lee, Institute for Infocomm Research

Most research in automatic speaker verification (ASV) focuses on improving accuracy in the case of casual impostors. With new applications in user authentication, the security of ASV solutions is as important as their accuracy. The ASVspoof initiative (http://www.asvspoof.org) aims to improve the robustness of ASV technology to spoofing, or presentation attacks. The ASVspoof 2017 Challenge is the second in the series, following the 2015 edition focused on artificial speech attacks. The ASVspoof 2017 Challenge focused on the most prolific form of spoofing attack, replay attacks. It features a new corpus initiated as part of the EU-funded OCTAVE project; the data is a replayed version of the original RedDots corpus. Key to the challenge was the study of generalised spoofing detection across unknown environments, speakers and devices. The two ASVspoof 2017 Challenge special sessions at INTERSPEECH 2017 comprise an introductory talk by the organisers (30 min), 12 oral presentations given by challenge participants (15 min each) and a concluding discussion forum (30 min).
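Purely as a generic illustration of likelihood-ratio scoring for spoofing detection (this is not the official challenge baseline, and the features below are random placeholders), a two-GMM detector can be sketched as follows:

# Sketch: two-class GMM log-likelihood-ratio spoofing detector on placeholder features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Stand-ins for frame-level cepstral features pooled per class.
genuine_feats = rng.normal(loc=0.0, size=(2000, 20))
spoof_feats = rng.normal(loc=0.7, size=(2000, 20))   # replayed audio shifted slightly

gmm_genuine = GaussianMixture(n_components=8, covariance_type="diag",
                              random_state=0).fit(genuine_feats)
gmm_spoof = GaussianMixture(n_components=8, covariance_type="diag",
                            random_state=0).fit(spoof_feats)

def score(utterance_frames):
    """Average per-frame log-likelihood ratio; positive favours 'genuine'."""
    return (gmm_genuine.score_samples(utterance_frames).mean()
            - gmm_spoof.score_samples(utterance_frames).mean())

test_genuine = rng.normal(loc=0.0, size=(150, 20))
test_spoof = rng.normal(loc=0.7, size=(150, 20))
print(f"genuine trial score: {score(test_genuine):+.2f}")
print(f"spoof trial score:   {score(test_spoof):+.2f}")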

Speech & Human-Robot Interaction

Tuesday, 22 August, 10:00–12:00, F11

Organizers:
Gérard Bailly, Université Grenoble-Alpes
Gabriel Skantze, KTH Royal Institute of Technology
Samer Al Moubayed, Furhat Robotics

The topic "Speech & Human-Robot Interaction" encompasses many research fields: those which investigate speech in interaction, notably the characteristics of situated dialogs, turn-taking and accommodation; those interested in the relationship between speech and gesture; and those working to develop platforms for human-robot communication and interaction (e.g., a key topic for sociable humanoid robots), to name just a few. The session brings together researchers from many disciplines to share techniques and investigative methods as well as research findings. It provides a forum for researchers to explore the extent to which results concerning human communication are important for enabling social machines. Conversely, it provides an opportunity for researchers working with machines (e.g., computer vision, machine learning, robot design, etc.) to showcase developments in their field. The feedback between the two communities will be stimulating and rewarding.


Incremental Processing and Responsive Behaviour

Tuesday, 22 August, 13:30–15:30, F11

Organizers:
Timo Baumann, Universität Hamburg
Tomas Hueber, GIPSA-lab, CNRS
David Schlangen, Bielefeld University

Incremental processing is the online, and ideally real-time, processing of streams of information about an ongoing event, as it happens. Particularly for speech, which is an inherently sequential medium, is relatively slow, and in which pauses between contributions matter, incremental processing is a necessity in interactive situations where responsivity is key. Incremental processing in humans is a psycholinguistic fact, and in systems it makes it possible to fold processing time between modules, to integrate information across modules early for more natural behaviour based on partial understanding (e.g. in interactive situations), or to shape interactions in collaboration between the system and the interlocutor (e.g. using flexible turn-taking schemes). Incrementality is largely orthogonal to the typical session/topic layout at INTERSPEECH, and our special session unites contributions in the areas of speech synthesis, multimodal coordination of gesture and speech, phonetic responsivity in spoken dialog systems, and incremental dialog act recognition.
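A minimal sketch of the incremental idea (the hard-coded partial hypotheses below stand in for a streaming recogniser; nothing here is taken from the session's systems):

# Sketch: a consumer reacts to partial hypotheses instead of waiting for the final result.
import time

def streaming_recogniser():
    """Yield successively longer partial hypotheses, as an incremental ASR would."""
    for partial in ["could", "could you", "could you pass",
                    "could you pass the", "could you pass the salt"]:
        time.sleep(0.1)          # simulated recognition latency
        yield partial

def incremental_consumer():
    committed = ""
    for hypothesis in streaming_recogniser():
        new_words = hypothesis[len(committed):].strip()
        committed = hypothesis
        # A responsive system can already act on stable prefixes, e.g. start
        # preparing a response or produce a backchannel before the turn ends.
        print(f"partial: {hypothesis!r}  (new: {new_words!r})")
    print("final hypothesis committed; response can be finalised now")

incremental_consumer()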

Acoustic Manifestations of Social Characteristics (AMSCh)

Tuesday, 22 August, 16:00–18:00, F11

Organizers:
Melanie Weirich, Friedrich-Schiller Universität Jena
Stefanie Jannedy, ZAS Berlin

The objective of this special session is to bring together an interdisciplinary group of researchers and engineers (phoneticians, psychologists and others) working at the interface of speech production, perception, attitude and social identity to explore potential questions around the human-human and human-machine interface. We aim to explore and discuss the methodologies and results of multimodal experiments investigating macro-sociological categories or social roles, speech variability and speech interpretation, and their impact on modelling potential human-human and human-machine interaction. We hope to increase our knowledge of how these two complex systems of human sociality and language interact, in order to understand the workings of these speech production and attribution processes, how and in what way stereotypes are exploited in speech perception, and just what kinds of cues trigger or undo social clichés that are based on speech.


Data Collection, Transcription and Annotation Issues in Child Language Acquisition Settings

Wednesday, 23 August, 10:00–12:00, F11

Organizers:
Alejandrina Cristia, École Normale Supérieure
Elika Bergelson, Duke University
Tove Gerholm, Stockholm University
Kristina Nilsson Björkenstam, Stockholm University
Iris-Corinna Schwarz, Stockholm University

With the advent of more exhaustive recording possibilities, such as daylong recordings of a child's language environment, new avenues and challenges arise. Researchers in child language acquisition are suddenly faced with Big Data and need to adapt their analysis methods to the opportunities that come with this wealth of information. Among the issues are the optimal breadth and depth of annotation in submitted materials, which is limited by resources, and the recurring call for automatization of at least parts of the tedious and costly transcription process. This special session also incorporates methods of automatized data analysis and lays the foundation for the special session on computational modeling in child language acquisition.

Computational Models in Child Language Acquisition

Wednesday, 23 August, 13:30–15:30, F11

Organizers:
Christina Bergmann, Ecole Normale Supérieure
Emmanuel Dupoux, Ecole Normale Supérieure
Gintare Grigonyté, Stockholm University
Mats Wirén, Stockholm University
Ellen Marklund, Stockholm University

Databases containing data from large-scale longitudinal studies across the world permit a new type of input data for modeling child language acquisition, with greater complexity than ever before. Modeling child language acquisition no longer needs to restrict itself to single-component predictors; with the advent of large data sets, the integration of multiple components becomes possible. For the weighting of multiple contributors to come closer to the complexity of child language acquisition, the field needs a synthesis at this very point. Computational scientists need to get closer to dirty real data, and the child language researchers collecting these data need to gain a better understanding of what is required to feed models with optimal input data. This special session intends to bring these two lines of research together and elicit fruitful discussions across the disciplinary boundaries.


Digital Revolution for Under-resourced Languages (DigRev-URL)

Wednesday, 23 August, 10:00–12:00, A2
Wednesday, 23 August, 13:30–15:30, Poster 1
Wednesday, 23 August, 16:00–18:00, A2

Organizers:
Sakriani Sakti, Nara Institute of Science and Technology
Laurent Besacier, Université Grenoble Alpes
Oddur Kjartansson, Google Research
Kristiina Jokinen, University of Helsinki
Alexey Karpov, Russian Academy of Sciences
Charl van Heerden, TensorAI
Shyam Agrawal, Kamrah Institute of Information Technology

This special session aims to accelerate research activities on under-resourced languages and to provide a forum for linguistic and speech technology researchers, as well as their academic and industrial counterparts, to share achievements and challenges in all areas related to natural language processing and spoken language processing of under-resourced languages, mainly those used in South, Southeast and West Asia; North and Sub-Saharan Africa; and Northern and Eastern Europe. In particular, as INTERSPEECH 2017 is held in Sweden, we highly encourage submissions on under-resourced languages from the Nordic, Uralic, and Slavic regions. The theme of this special session is the digital revolution for under-resourced languages, including but not limited to: linguistic and cognitive studies, acquisition of text and speech corpora, zero-resource speech technologies, cross-lingual/multi-lingual acoustic and lexical modeling, code-switched lexical modeling, speech-to-text and speech-to-speech translation, speech recognition, text-to-speech synthesis, and dialog systems, as well as applications of spoken language technologies for under-resourced languages.

Voice Attractiveness

Wednesday, 23 August, 16:00–18:00, F11

Organizers:
Melissa Barkat-Defradas, University of Montpellier
Benjamin Weiss, TU Berlin
Jürgen Trouvain, Saarland University
John Ohala, ICSI

This special session on "voice attractiveness" would be the perfect setting for presenting research dealing with: perceived vocal preferences for men's, women's, and synthesized voices in well-defined social situations; acoustic correlates of voice attractiveness/pleasantness/charisma; interrelations between vocal features and individual physical and physiological characteristics; consequences for sexual selection; the predictive value of voice for personality and for other psychological traits; experimental definition of aesthetic standards for the vocal signal; cultural variation in voice attractiveness/pleasantness and standards; and the link between vocal pathology and vocal characteristics.


State of the Art in Physics-based Voice Simulation

Thursday, 24 August, 10:00–12:00, F11

Organizers:
Sten Ternström, KTH Royal Institute of Technology
Oriol Guasch, Universitat Ramon Llull

The physics of voice is very intricate, as it involves turbulent flows interacting with elastic solids that vibrate, deform and collide, generating acoustic waves which propagate through complex, time-varying, contorted ducts. It is very easy to pronounce a simple sound like /a/, but in doing so we are not aware of the tremendous number of physical phenomena that occur in our voice organ. In this special session we welcome, on the one hand, papers on numerical approaches to voice production, including finite element and finite difference methods as well as multimodal and waveguide approaches, among others. On the other hand, papers on experimental mechanical replicas that can elucidate aspects of voice generation will also be appreciated. The scope of the session is wide, covering the flow-driven oscillation of the vocal folds, the generation of static sounds like vowels, nasals and fricatives, the production of dynamic sounds like plosives, diphthongs or syllables, and expressivity effects that may be simulated on physical grounds.
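As one concrete example of the waveguide approaches mentioned, a heavily simplified Kelly-Lochbaum scattering simulation might look as follows (the tube areas, boundary reflections and excitation are arbitrary illustrative values, not a validated vocal-tract model):

# Sketch: a tiny Kelly-Lochbaum digital waveguide with forward/backward pressure waves
# scattered at junctions between tube sections of different cross-sectional area.
import numpy as np

areas = np.array([2.6, 2.0, 1.6, 1.3, 1.0, 1.5, 2.4, 3.2])   # tube section areas (cm^2)
k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])       # junction reflection coeffs
n_sections = len(areas)

fwd = np.zeros(n_sections)     # right-travelling wave in each section
bwd = np.zeros(n_sections)     # left-travelling wave in each section
output = []

for n in range(400):
    glottal = 1.0 if n % 80 == 0 else 0.0          # crude impulse-train excitation
    fwd_new = np.empty_like(fwd)
    bwd_new = np.empty_like(bwd)
    fwd_new[0] = glottal + 0.9 * bwd[0]            # partial reflection at the glottis
    for j in range(n_sections - 1):                # lossless scattering at each junction
        fwd_new[j + 1] = (1 + k[j]) * fwd[j] - k[j] * bwd[j + 1]
        bwd_new[j] = k[j] * fwd[j] + (1 - k[j]) * bwd[j + 1]
    bwd_new[-1] = -0.9 * fwd[-1]                   # partial reflection at the lips
    fwd, bwd = fwd_new, bwd_new
    output.append(fwd[-1])                          # transmitted pressure (unscaled)

print(f"simulated {len(output)} samples; peak amplitude {max(map(abs, output)):.2f}")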

INTERSPEECH 2017 Computational Paralinguistics ChallengE (ComParE)

Thursday, 24 August, 10:00–12:00, E10
Thursday, 24 August, 13:30–15:30, E10

Organizers:
Björn W. Schuller, University of Passau
Stefan Steidl, Friedrich-Alexander-University
Anton Batliner, University of Passau
Elika Bergelson, Duke University
Jarek Krajewski, University of Wuppertal
Christoph Janott, Technische Universität München

The INTERSPEECH 2017 Computational Paralinguistics ChallengE (ComParE) is an open challenge dealing with states of speakers as manifested in the acoustic properties of their speech signal. There have so far been eight consecutive challenges at INTERSPEECH since 2009 (cf. http://compare.openaudio.eu/), but there still exists a multiplicity of not yet covered, yet highly relevant paralinguistic phenomena. We therefore introduce three new, so far less touched upon tasks: the first-of-its-kind Addressee Sub-Challenge, which contributes directly to INTERSPEECH 2017's theme Situated Interaction, and the novel health- and well-being-oriented Cold Sub-Challenge and Snoring Sub-Challenge, which contribute indirectly. We further revisit sleepiness in the Drowsiness Sub-Challenge. Situated interaction benefits from knowing who is addressed in communication. Obviously, efficient interaction will also benefit if an interface is aware of a user's drowsiness or of their suffering from a cold. In addition, the value of the health-related tasks speaks for itself. The Snoring Sub-Challenge introduces for the first time a purely non-speech, yet vocal, inspiratory sound. A challenge is usually a great occasion to increase attention to the tasks and to unite expertise from different areas to advance the field.


Special Events

Show & Tell

Session 1: Monday, 21 August, 11:00–13:00 & 14:30–16:30, E306
Session 2: Monday, 21 August, 11:00–13:00 & 14:30–16:30, E397
Session 3: Tuesday, 22 August, 10:00–12:00 & 13:30–15:30, E306
Session 4: Tuesday, 22 August, 10:00–12:00 & 13:30–15:30, E397
Session 5: Wednesday, 23 August, 10:00–12:00 & 13:30–15:30, E306
Session 6: Wednesday, 23 August, 10:00–12:00 & 13:30–15:30, E397
Session 7: Thursday, 24 August, 10:00–12:00 & 13:30–15:30, E306

Show & Tell is a special event organized during the conference. Participants are given the opportunity to demonstrate their most recent progress or developments and to interact with the conference attendees in an informal way, such as with a poster, a mock-up, a demo, or any adapted format of their own choice. These contributions must highlight the innovative side of the concept and may relate to a regular paper. While the emphasis of Show & Tell is on the actual demonstration during the conference, all contributions are allocated two pages in the conference proceedings. Each submission has been peer-reviewed; reviewers have judged the originality, significance, quality, and clarity of the proposed demonstration.

Swedish Kulning (SweKul): What's so Special about Kulning – the Singing Technique in Traditional Swedish Cattle Calls?

Monday, 21 August, 14:30–16:30, B3

Organizers:
Anita McAllister, Karolinska Institutet, Sweden
Robert Eklund, Linköping University, Sweden
Anne-Maria Laukkanen, University of Tampere, Finland
Ahmed Geneid, Helsinki University Hospital and University of Helsinki, Finland
Fanny Pehrson, Kulning singer
Kajsa Dahlström, Kulning singer

Kulning is a special singing technique traditionally used in parts of Sweden and Norway to call free-grazing cattle or goats back to the homestead for milking. The technique has developed to be heard over large distances and has been reported to carry over 5 to 6 kilometers. The session will focus on different aspects of the singing technique of kulning, which is traditionally taught by imitation rather than formal instruction. The session will deal with amplitude/loudness and associated sound propagation aspects, and will describe and discuss glottal characteristics and vocal tract configuration. Data from videoendoscopy, electroglottography (EGG), high-speed video of the larynx, and even from MRI studies performed during winter 2016/2017, will be presented at the workshop. The session will also include live demonstrations by two experienced kulning singers and will provide an interactive session where participants will be given the possibility to try kulning themselves. Weather allowing, the session will begin with an ecologically valid demonstration in the forest, a mere ten-minute walk from the university.


Speaker Comparison for Forensic and Investigative Applications III

Wednesday, 23 August, 13:30–15:30, B3

Organizers:
Jean-François Bonastre, Université d'Avignon, France
Joseph P. Campbell, MIT Lincoln Laboratory, USA
Anders Eriksson, Stockholm University, Sweden
Reva Schwartz, National Institute of Standards and Technology, USA

The aim of this special event is to hold several structured discussions on speaker comparison for forensic and investigative applications, in which many international experts will present their views and participate in a free exchange of ideas. In speaker comparison, speech samples are compared by humans and/or machines for use in investigations or in court, to address questions that are of interest to the legal system. Speaker comparison is a high-stakes application that can change people's lives, and it demands the best that science has to offer; however, methods, processes, and practices vary widely. These variations are not necessarily for the better and, although recognized, are not generally appreciated and acted upon. Methods, processes, and practices grounded in science are critical for the proper application (and non-application) of speaker comparison to a variety of international investigative and forensic applications. This event follows the successful INTERSPEECH 2015 and 2016 special events of the same name.

Speaker Recognition for the Next Decade

Tuesday, 22 August, 13:30–15:30, B3

Organizer:
Pedro A. Torres-Carrasquillo, MIT Lincoln Laboratory, USA

Panelists:
Douglas Reynolds (Academic)
John Hansen (Academic)
Kevin Farrell (Industry)
UPM Spain (Academic/Government Work)
BUT Czech Republic (Academic/Industry)
Agnitio (Industry)
Forensic applications panelist
Institute for Infocomm Research (Academic/application dev)

The panel will present a discussion of the current state of speaker recognition and focus on where to go from here. It will provide the speaker recognition community with an opportunity to engage with the broader speech community and to discuss the current state of affairs along with the challenges going forward. The panel will benefit the general community by providing a summary of what is happening and what is expected to happen in the speaker recognition area, and it will benefit the speaker recognition community by exposing its ideas and receiving an influx of fresh perspectives.


The Second Workshop for Young Female Researchers in Speech Science & Technology (YFRSW)

https://sites.google.com/site/yfrsw2017/

Sunday, 20 August, 9:00–17:00
Speech, Music and Hearing
KTH Royal Institute of Technology
Lindstedtsvägen 24

Organizers:
Heidi Christensen, University of Sheffield, UK
Abeer Alwan, University of California, USA
Kay Berkling, DHBW, Germany
Catia Cucchiarini, Radboud University, Netherlands
Milica Gašic, University of Cambridge, UK
Julia Hirschberg, Columbia University, USA
Karen Livescu, TTIC, USA
Catharine Oertel, KTH Royal Institute of Technology, Sweden
Odette Scharenborg, Radboud University, Netherlands
Isabel Trancoso, INESC-ID, Portugal

The workshop is the second of its kind, after a successful inaugural event at INTERSPEECH 2016 in San Francisco. It is designed to foster interest in research in our field among women at the undergraduate or master's level who have not yet committed to a PhD in speech science or technology, but who have had some research experience at their colleges and universities via individual or group projects. The workshop will include the following events: a welcome breakfast with introductions; a panel of senior women talking about their own research and experiences as women in the speech community; a panel of senior students who work in the speech area describing how they became interested in speech research; a poster session for the students to present their own research; one-on-one coaching sessions between students and senior women mentors; and a networking lunch for students and senior women.


ISCA-SAC Special Events

3rd Doctoral Consortium

Speech, Music and Hearing
KTH Royal Institute of Technology
Lindstedtsvägen 24
Sunday, 20 August 2017, 10:00–17:00

Following on from the success of the last two years, the Doctoral Consortium at INTERSPEECH will be run again this year in Stockholm, Sweden. The Doctoral Consortium aims to provide students working on speech-related topics with an opportunity to discuss their doctoral research with experts from their fields, and to receive feedback from experts and peers on their PhD projects.

The format of the Doctoral Consortium will be a one-day workshop prior to the main conference. Participants will be asked to make short presentations summarizing their doctoral work, which will then be followed by an intensive discussion with the experts and peers. This event is organized by the Student Advisory Committee of the International Speech Communication Association (ISCA-SAC).

Students Meet Experts

Tuesday, 22 August, 16:00–18:00, B3

The Student Advisory Committee of the International Speech Communication Association (ISCA-SAC) is very proud to announce that this year the Students Meet Experts event will be back at INTERSPEECH in an exciting new format. Instead of splitting the students and experts into several groups, there will first be a panel discussion with and among several experts from both academia and industry. Students will be given the opportunity to hear about the different career paths the experts have taken, and will afterwards have the possibility to talk to the experts directly.

Open Doors Event

Furhat Robotics
Speech, Music and Hearing
KTH – Royal Institute of Technology
Lindstedtsvägen 24

Thursday, 24 August, 13:00–15:30

Tobii Pro
Karlsrovägen 2D
182 53 Danderyd

Thursday, 24 August, 13:00–15:30

ISCA-SAC is organising a company visit event at this year’s INTERSPEECH. Students will be given the opportunity to visit the headquarters of Stockholm companies interested in speech communication. The companies will demonstrate their technologies and products and let students try out different equipment, followed by an open discussion and networking. This year the students will visit Furhat Robotics and Tobii Pro. The event will take place on Thursday, August 24 and aims to bring students and researchers together to discuss potential collaborations or even possible hiring opportunities.


Awards

ISCA Medal for Scientific Achievement

The ISCA Medal for Scientific Achievement 2017 will be awarded to Professor Fumitada Itakura by the President of ISCA during the opening ceremony.

ISCA Best Student Paper Award

Each year, ISCA presents three Best Student Paper awards at INTERSPEECH, based on anonymous reviewing and presentation at the conference. This year, 12 papers have been shortlisted for the Best Student Paper award:

937 Elin Larsen, Alejandrina Cristia and Emmanuel Dupoux
Relating Unsupervised Word Segmentation to Reported Vocabulary Acquisition
Wed-SS-7-11: Computational Models in Child Language Acquisition
Wednesday, 23 August, 13:30–13:50

839 Katharina Zahner, Heather Kember and Bettina Braun
Mind the Peak: When Museum Is Temporarily Understood as Musical in Australian English
Tue-O-4-6: Prosody (Tone and Intonation)
Tuesday, 22 August, 14:30–14:50

1494 Srinivas Parthasarathy and Carlos Busso
Jointly Predicting Arousal, Valence and Dominance with Multi-Task Learning
Tue-O-3-10: Emotion Recognition
Tuesday, 22 August, 11:00–11:20

950 Arsha Nagrani, Joon Son Chung and Andrew Zisserman
VoxCeleb: A Large-scale Speaker Identification Dataset
Wed-O-8-1: Speaker Database and Anti-spoofing
Wednesday, 23 August, 17:20–17:40

1160 Janek Ebbers, Jahn Heymann, Lukas Drude, Thomas Glarner, Reinhold Haeb-Umbach and Bhiksha Raj
Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery
Mon-P-1-2: Speech and Audio Segmentation and Classification 2
Monday, 21 August, 11:00–13:00

187 Lukas Drude and Reinhold Haeb-Umbach
Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings
Wed-O-8-6: Multi-channel Speech Enhancement
Wednesday, 23 August, 16:00–16:20

1410 Rachel Alexander, Tanner Sorensen, Asterios Toutios and Shrikanth Narayanan
VCV Synthesis using Task Dynamics to Animate a Factor-based Articulatory Model
Mon-O-1-10: Multimodal and Articulatory Synthesis
Monday, 21 August, 12:00–12:20

1073 Albert Zeyer, Eugen Beck, Ralf Schlüter and Hermann Ney
CTC in the Context of Generalized Full-Sum HMM Training
Tue-O-3-1: Neural Network Acoustic Models for ASR 1
Tuesday, 22 August, 10:20–10:40


1442 Karel Beneš, Murali Baskar and Lukáš Burget
Residual Memory Networks in Language Modeling: Improving the Reputation of Feed-Forward Networks
Mon-O-2-1: Neural Networks for Language Modeling
Monday, 21 August, 16:10–16:30

1710 William Gale and Sarangarajan Parthasarathy
Experiments in Character-level Neural Network Models for Punctuation
Wed-P-6-1: Speech Recognition: Technologies for New Applications and Paradigms
Wednesday, 23 August, 10:00–12:00

1568 Zahra Rahimi, Anish Kumar, Diane Litman, Susannah Paletz and Mingzhi Yu
Entrainment in Multi-Party Spoken Dialogues at Multiple Linguistic Levels
Tue-P-4-3: Dialog Modelling
Tuesday, 22 August, 13:30–15:30

1093 Chunxi Liu, Jan Trmal, Matthew Wiesner, Craig Harman and Sanjeev Khudanpur
Topic Identification for Speech without ASR
Wed-O-7-4: Topic Spotting, Entity Extraction and Semantic Analysis
Wednesday, 23 August, 15:10–15:30

Travel Grants

A total of 40 ISCA and 20 INTERSPEECH 2017 travel grants have been awarded based on the technical quality of the papers. The travel grant recipients are:

Pablo Brusco, Ting Dang, William Gale, Rakib Hyder, Kiranpreet Nara, José Eduardo Novoa Ilic, Yibin Zheng, Zhang Kaile, Qizheng Huang, Zhaoqiong Huang, Chong Cao, Feng Guo, Ma Xi, Danwei Cai, Xiao Yujia, Bo Chen, Qinyi Luo, Bin Zhao, Xiao Wang, Nicanor Garcia-Ospina, Omnia Abdo, Edwin Simonnet, Adriana Guevara-Rukoz, Rahma Chaabouni, Xiaoyu Shen, Katharina Zahner, Torsten Wörtwein, Anna Mor, Saurabhchand Bhati, Kadiri Sudarsana Reddy, Alluri K N R K Raju, Sishir Kalita, Akshay Kalkunte Suresh, Baby Arun, Madhu Kamble, Karthik Girija Ramesan, Yaniv Sheena, Hidetsugu Uchida, Takamichi Shinnosuke, Judith Peters, Bo Ru Lu, Yu-Hsuan Wang, Ward Lauren, Loweimi Erfan, Khyathi Chandu, Wei Li, Toshniwal Shubham, Shane Settle, Arindam Jati, Tanner Sorensen, Abhishek Avinash Narwekar, Saurabh Sahu, Ganesh Sivaraman, Beiming Cao, Anish Kumar, Mandy Korpusik, Serim Park, Nimisha Patil, Morales Michelle, Titouan Parcollet

In addition, Jinyu Li (Microsoft) and Florian Metze (CMU) set up the “Yajie Miao Memorial Student Travel Fund” to remember and to honor the work of Yajie Miao.

Yajie successfully defended his thesis on “Incorporating Context Information into Deep Neural Network Acoustic Models” at Carnegie Mellon University, and was awarded the PhD degree in August 2016. He had accepted a position at Microsoft in Redmond, and was set to start work there in October 2016. Unfortunately, he died tragically while visiting his family in China, before he was able to do so.


The “Yajie Miao Memorial Student Travel Fund” supports additional ISCA student travel grants to speech conferences. Depending on the availability of funds, one or more recipients will be selected by ISCA and the organizers. More information about the fund and the background can be found at https://www.youcaring.com/iscainternationalspeechcommunicationassociation-815026.

This year, the recipients are:

• Wenpeng Li, Northwestern Polytechnical University, China

• Purvi Agrawal, Indian Institute of Science, India

• Junwei Yue, The University of Tokyo, Japan

EURASIP Best Paper Published in Speech Communication (2012–2015)

The best paper has been selected by the Speech Communication editorial board under the coordination of Bernd Möbius, Editor-in-Chief, and the EUSIPCO Awards Chair, Kostas Berberidis.

The EURASIP best Speech Communication paper award will be presented at EUSIPCO 2017 and announced during INTERSPEECH 2017.

ISCA Best Paper Published in Computer Speech and Language (2012–2016)

The best paper has been selected by the Computer Speech and Language editorial board under the coordination of Roger K. Moore, Editor-in-Chief.

Christian Benoît Award

supported by the International Speech Communication Association, the Association Francophone de la Communication Parlée and GIPSA-lab

The international scientific committee of the “9th Christian Benoît Award” has completed the evaluation of the applications and has reached a decision. The winner is

Marcin Włodarczak
Stockholm University, Department of Linguistics

http://www.su.se/profiles/mwlod
for the research project entitled

“Hidden events in turn-taking”

In his research work, Marcin will explore the respiratory signal as a means to infer turn-taking intentions in conversation. More specifically, Marcin will look for hidden events that are not directly observable in the patterns of speech and silence, but which can predict how the subject will manage the next turn. To this end, Marcin will analyze movements resulting from inhalations and exhalations measured by means of the Respiratory Inductance Plethysmograph, and develop software for the identification and visualisation of these hidden events, which will be released under a free licence as a project deliverable. The ultimate goal of the research project is to shed light on interactionally salient events overlooked by existing frameworks for studying conversation, and to contribute to the growing body of research on the manifest events in turn-taking.

On behalf of the Association Christian Benoît and of the scientific committee, we would like to congratulate Marcin and to thank all the applicants for the projects they submitted.


The Award will be officially presented at the INTERSPEECH 2017 conference in Stockholm, Sweden.

In 2018, 20 years will have elapsed since Christian Benoît passed away. We think that it is now time for the Christian Benoît Award to evolve and possibly gain in visibility throughout the whole speech communication scientific community. ISCA constitutes a natural framework. Hence we have asked the ISCA Board to consider incorporating the award within the association’s activities, possibly as a framework for an award directed towards young scientists. We have also involved the Association Francophone de la Communication Parlée, which has always supported the Christian Benoît Award, in this discussion. We deeply hope that a solution will be found to preserve the memory of Christian Benoît and to keep the spirit of the Award alive, encouraging young researchers in the growth of their projects and the realization of their first research dreams. We sincerely thank GIPSA-lab, ISCA and AFCP for their constant support during these 17 years of existence of the Christian Benoît Award.

On behalf of the Association Christian Benoît,
Pascal Perrier & Jean-Luc Schwartz


Daily Schedules

[Timetable grids for Sunday 20 August through Thursday 24 August 2017 appear here, showing the Sunday tutorials and satellite events (SE1-X: The 2nd Workshop for Young Female Researchers in Speech Science and Technology, and SE2-X: 3rd Doctoral Consortium, both at KTH Royal Institute of Technology), the keynotes by James Allen, Catherine Pelachaud and Björn Lindblom, the oral, poster, special and Show & Tell sessions by room (Aula Magna, A2, B3, B4, C6, D8, E10, F11, E306/E397 posters) and time, and the social programme: the ISCA General Assembly and Welcome Reception (Stockholms stadshus), the Student Reception (Kägelbanan, Södra Teatern), the Standing Banquet (Tekniska museet), the Students Meet Experts and Open Doors special events, and the closing session. Individual sessions are listed in the Session Index that follows.]

Session Index

Monday, 21 August 2017

Mon-K1-1 ISCA Medal 2017 Ceremony . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

Mon-SS-1-8 Special Session: Interspeech 2017 Automatic Speaker Verification Spoofing and Countermeasures Challenge 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

Mon-SS-1-11 Special Session: Speech Technology for Code-Switching in Multilingual Communities . . . . . 77

Mon-SS-2-8 Special Session: Interspeech 2017 Automatic Speaker Verification Spoofing and Countermeasures Challenge 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Mon-O-1-1 Conversational Telephone Speech Recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

Mon-O-1-2 Multimodal Paralinguistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

Mon-O-1-4 Dereverberation, Echo Cancellation and Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

Mon-O-1-6 Acoustic and Articulatory Phonetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

Mon-O-1-10 Multimodal and Articulatory Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Mon-O-2-1 Neural Networks for Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

Mon-O-2-2 Pathological Speech and Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Mon-O-2-4 Speech Analysis and Representation 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

Mon-O-2-6 Perception of Dialects and L2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

Mon-O-2-10 Far-field Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

Mon-P-1-1 Speech Analysis and Representation 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

Mon-P-1-2 Speech and Audio Segmentation and Classification 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

Mon-P-1-4 Search, Computational Strategies and Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

Mon-P-2-1 Speech Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

Mon-P-2-2 Speech Production and Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

Mon-P-2-3 Multi-lingual Models and Adaptation for ASR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

Mon-P-2-4 Prosody and Text Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

Mon-S&T-1/2-A Show & Tell 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

Mon-S&T-1/2-B Show & Tell 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


Tuesday, 22 August 2017

Tue-K2-1 Keynote 1: James Allen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

Tue-SS-3-11 Special Session: Speech and Human-Robot Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

Tue-SS-4-11 Special Session: Incremental Processing and Responsive Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . 112

Tue-SS-5-11 Special Session: Acoustic Manifestations of Social Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

Tue-O-3-1 Neural Network Acoustic Models for ASR 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

Tue-O-3-2 Models of Speech Production. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

Tue-O-3-4 Speaker Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

Tue-O-3-6 Phonation and Voice Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

Tue-O-3-8 Speech Synthesis Prosody . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

Tue-O-3-10 Emotion Recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

Tue-O-4-1 WaveNet and Novel Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

Tue-O-4-2 Models of Speech Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

Tue-O-4-4 Source Separation and Auditory Scene Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

Tue-O-4-6 Prosody: Tone and Intonation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

Tue-O-4-8 Emotion Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

Tue-O-4-10 Voice Conversion 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

Tue-O-5-1 Neural Network Acoustic Models for ASR 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

Tue-O-5-2 Speaker Recognition Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

Tue-O-5-4 Glottal Source Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

Tue-O-5-6 Prosody: Rhythm, Stress, Quantity and Phrasing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

Tue-O-5-8 Speech Recognition for Language Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

Tue-O-5-10 Stance, Credibility, and Deception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

Tue-P-3-1 Short Utterances Speaker Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

Tue-P-3-2 Speaker Characterization and Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

Tue-P-4-1 Acoustic Models for ASR 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

Tue-P-4-2 Acoustic Models for ASR 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

Tue-P-4-3 Dialog Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

Tue-P-5-1 L1 and L2 Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

Tue-P-5-2 Voice, Speech and Hearing Disorders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

Tue-P-5-3 Source Separation and Voice Activity Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

Tue-P-5-4 Speech-enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

Tue-S&T-3/4-A Show & Tell 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

Tue-S&T-3/4-B Show & Tell 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159



Wednesday, 23 August 2017

Wed-K3-1 Keynote 2: Catherine Pelachaud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

Wed-SS-6-2 Special Session: Digital Revolution for Under-resourced Languages 1 . . . . . . . . . . . . . . . . . . . . . . . 160

Wed-SS-6-11 Special Session: Data Collection, Transcription and Annotation Issues in Child Language Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

Wed-SS-7-1 Special Session: Digital Revolution for Under-resourced Languages 2 . . . . . . . . . . . . . . . . . . . . . . . 163

Wed-SS-7-11 Special Session: Computational Models in Child Language Acquisition. . . . . . . . . . . . . . . . . . . . . . 166

Wed-SS-8-11 Special Session: Voice Attractiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

Wed-O-6-1 Speech Production and Physiology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

Wed-O-6-4 Speech and Harmonic Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

Wed-O-6-6 Dialog and Prosody . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

Wed-O-6-8 Social Signals, Styles, and Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

Wed-O-6-10 Acoustic Model Adaptation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

Wed-O-7-1 Cognition and Brain Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

Wed-O-7-2 Noise Robust Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

Wed-O-7-4 Topic Spotting, Entity Extraction and Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

Wed-O-7-6 Dialog Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

Wed-O-7-8 Lexical and Pronunciation Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

Wed-O-7-10 Language Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

Wed-O-8-1 Speaker Database and Anti-spoofing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

Wed-O-8-4 Speech Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

Wed-O-8-6 Multi-channel Speech Enhancement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

Wed-O-8-8 Speech Recognition: Applications in Medical Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

Wed-O-8-10 Language Models for ASR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

Wed-P-6-1 Speech Recognition: Technologies for New Applications and Paradigms . . . . . . . . . . . . . . . . . . 188

Wed-P-6-2 Speaker and Language Recognition Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

Wed-P-6-3 Spoken Document Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

Wed-P-6-4 Speech Intelligibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

Wed-P-7-2 Articulatory and Acoustic Phonetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

Wed-P-7-3 Music and Audio Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

Wed-P-7-4 Disorders Related to Speech and Language. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

Wed-P-8-1 Prosody. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

Wed-P-8-2 Speaker States and Traits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

Wed-P-8-3 Language Understanding and Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212

Wed-P-8-4 Voice Conversion 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

Wed-S&T-6/7-A Show & Tell 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

Wed-S&T-6/7-B Show & Tell 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217



Thursday, 24 August 2017

Thu-K4-1 Keynote 3: Björn Lindblom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

Thu-SS-9-10 Special Session: Interspeech 2017 Computational Paralinguistics ChallengE (ComParE) 1. . 219

Thu-SS-9-11 Special Session: State of the Art in Physics-based Voice Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 220

Thu-SS-10-10 Special Session: Interspeech 2017 Computational Paralinguistics ChallengE (ComParE) 2. . 222

Thu-O-9-1 Discriminative Training for ASR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

Thu-O-9-2 Speaker Diarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

Thu-O-9-4 Spoken Term Detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

Thu-O-9-6 Noise Reduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

Thu-O-9-8 Speech Recognition: Multimodal Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

Thu-O-10-1 Neural Network Acoustic Models for ASR 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

Thu-O-10-2 Robust Speaker Recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

Thu-O-10-4 Multimodal Resources and Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

Thu-O-10-8 Forensic Phonetics and Sociophonetic Varieties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233

Thu-O-10-11 Speech and Audio Segmentation and Classification 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

Thu-P-9-1 Noise Robust and Far-field ASR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

Thu-P-9-3 Styles, Varieties, Forensics and Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

Thu-P-9-4 Speech Synthesis: Data, Evaluation, and Novel Paradigms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240

Thu-S&T-9/10-A Show & Tell 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243


Abstracts

ISCA Medal 2017 Ceremony
Aula Magna, 09:45–10:15, Monday, 21 Aug. 2017
Chair: Haizhou Li

ISCA Medal for Scientific Achievement

Haizhou Li; NUS, Singapore
Mon-K1-1, Time: 09:45–10:15

The ISCA Medal for Scientific Achievement 2017 will be awarded to Professor Fumitada Itakura by the President of ISCA during the opening ceremony.

Mon-SS-1-8: Special Session: Interspeech 2017 Automatic Speaker Verification Spoofing and Countermeasures Challenge 1
D8, 11:00–13:00, Monday, 21 Aug. 2017
Chairs: Tomi Kinnunen, Junichi Yamagishi

The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection

Tomi Kinnunen 1, Md. Sahidullah 1, Héctor Delgado 2, Massimiliano Todisco 2, Nicholas Evans 2, Junichi Yamagishi 3, Kong Aik Lee 4; 1University of Eastern Finland, Finland; 2EURECOM, France; 3NII, Japan; 4A*STAR, Singapore
Mon-SS-1-8-1, Time: 11:00–11:30

The ASVspoof initiative was created to promote the development of countermeasures which aim to protect automatic speaker verification (ASV) from spoofing attacks. The first community-led, common evaluation held in 2015 focused on countermeasures for speech synthesis and voice conversion spoofing attacks. Arguably, however, it is replay attacks which pose the greatest threat. Such attacks involve the replay of recordings collected from enrolled speakers in order to provoke false alarms and can be mounted with greater ease using everyday consumer devices. ASVspoof 2017, the second in the series, hence focused on the development of replay attack countermeasures. This paper describes the database, protocols and initial findings. The evaluation entailed highly heterogeneous acoustic recording and replay conditions which increased the equal error rate (EER) of a baseline ASV system from 1.76% to 31.46%. Submissions were received from 49 research teams, 20 of which improved upon a baseline replay spoofing detector EER of 24.77%, in terms of replay/non-replay discrimination. While largely successful, the evaluation indicates that the quest for countermeasures which are resilient in the face of variable replay attacks remains very much alive.
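As a hedged illustration of the equal error rate (EER) metric quoted throughout these results, the sketch below computes an EER from two score arrays by sweeping a decision threshold; the score distributions are invented for the example and the function name is ours, not part of any ASVspoof toolkit.

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Return the equal error rate (in %); higher scores mean 'more genuine'."""
    thresholds = np.unique(np.concatenate([genuine_scores, spoof_scores]))
    best_gap, eer = np.inf, 0.0
    for t in thresholds:
        far = np.mean(spoof_scores >= t)   # false acceptance rate at threshold t
        frr = np.mean(genuine_scores < t)  # false rejection rate at threshold t
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return 100.0 * eer

# Invented scores, for illustration only: well-separated classes give a low EER.
rng = np.random.default_rng(0)
print(compute_eer(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)))
```

The EER is simply the operating point at which false acceptances and false rejections occur at the same rate, which is why a single number can summarize a detector's accuracy.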

Experimental Analysis of Features for Replay Attack Detection — Results on the ASVspoof 2017 Challenge

Roberto Font, Juan M. Espín, María José Cano; Biometric Vox, Spain
Mon-SS-1-8-2, Time: 11:30–11:45

This paper presents an experimental comparison of different features for the detection of replay spoofing attacks in Automatic Speaker Verification systems. We evaluate the proposed countermeasures using two recently introduced databases, including the dataset provided for the ASVspoof 2017 challenge. This challenge provides researchers with a common framework for the evaluation of replay attack detection systems, with a particular focus on the generalization to new, unknown conditions (for instance, replay devices different from those used during system training). Our cross-database experiments show that, although achieving this level of generalization is indeed a challenging task, it is possible to train classifiers that exhibit stable and consistent results across different experiments. The proposed approach for the ASVspoof 2017 challenge consists in the score-level fusion of several base classifiers using logistic regression. These base classifiers are 2-class Gaussian Mixture Models (GMMs) representing genuine and spoofed speech respectively. Our best system achieves an Equal Error Rate of 10.52% on the challenge evaluation set. As a result of this set of experiments, we provide some general conclusions regarding feature extraction for replay attack detection and identify which features show the most promising results.
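The following sketch illustrates, under stated assumptions, the kind of pipeline the abstract describes: per-feature-type two-class GMMs scored as log-likelihood ratios and fused at score level with logistic regression. Feature extraction, array shapes and all function names are our assumptions; this is not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def train_gmm_pair(genuine_frames, spoof_frames, n_components=64):
    """Train one 2-class GMM pair on frame-level features (rows = frames)."""
    gmm_gen = GaussianMixture(n_components, covariance_type="diag").fit(genuine_frames)
    gmm_spf = GaussianMixture(n_components, covariance_type="diag").fit(spoof_frames)
    return gmm_gen, gmm_spf

def llr_score(gmm_gen, gmm_spf, utterance_frames):
    """Average per-frame log-likelihood ratio: genuine model vs. spoofed model."""
    return gmm_gen.score(utterance_frames) - gmm_spf.score(utterance_frames)

def train_fusion(dev_scores, dev_labels):
    """dev_scores: (n_utterances, n_base_classifiers) matrix of LLRs;
    dev_labels: 1 for genuine, 0 for spoofed.  Returns a logistic-regression fuser."""
    return LogisticRegression().fit(dev_scores, dev_labels)

# At test time, stack the base-classifier LLRs for an utterance into a row
# vector and use fuser.decision_function(row) (or predict_proba) as the
# fused countermeasure score.
```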

Novel Variable Length Teager Energy Separation Based Instantaneous Frequency Features for Replay Detection

Hemant A. Patil, Madhu R. Kamble, Tanvina B. Patel, Meet H. Soni; DA-IICT, India
Mon-SS-1-8-3, Time: 11:45–12:00

Replay attacks present a great risk for Automatic Speaker Verification (ASV) systems. In this paper, we propose a novel replay detector based on Variable length Teager Energy Operator-Energy Separation Algorithm-Instantaneous Frequency Cosine Coefficients (VESA-IFCC) for the ASVspoof 2017 challenge. The key idea here is to exploit the contribution of IF in each subband energy via ESA to capture possible changes in the spectral envelope (due to transmission and channel characteristics of the replay device) of replayed speech. The IF is computed from narrowband components of the speech signal, and DCT is applied to the IF to get the proposed feature set. We compare the performance of the proposed VESA-IFCC feature set with features developed for detecting synthetic and voice-converted speech, including CQCC, CFCCIF and prosody-based features. On the development set, the proposed VESA-IFCC features, when fused at score level with a variant of CFCCIF and prosody-based features, gave the lowest EER of 0.12%. On the evaluation set, this combination gave an EER of 18.33%. However, post-evaluation results of the challenge indicate that the VESA-IFCC features alone gave the lowest EER of 14.06% (i.e., relatively 16.11% less than the baseline CQCC) and hence constitute a very useful countermeasure to detect replay attacks.
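For readers unfamiliar with energy separation, the sketch below shows a generic Teager Energy Operator and the DESA-2 discrete energy separation algorithm for estimating instantaneous amplitude and frequency of a narrowband signal. It is a minimal illustration of ESA-based instantaneous frequency estimation in general, not the authors' variable-length VESA-IFCC front end.

```python
import numpy as np

def teager_energy(x):
    """Psi[x](n) = x(n)^2 - x(n-1)*x(n+1), for the interior samples of x."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    """DESA-2 estimates of instantaneous amplitude and frequency (rad/sample)."""
    psi_x = teager_energy(x)
    y = x[2:] - x[:-2]                 # y(n) = x(n+1) - x(n-1)
    psi_y = teager_energy(y)
    psi_x = psi_x[1:-1]                # align psi_x with psi_y
    ratio = np.clip(1.0 - psi_y / (2.0 * psi_x + 1e-12), -1.0, 1.0)
    omega = 0.5 * np.arccos(ratio)     # instantaneous frequency
    amp = 2.0 * psi_x / (np.sqrt(np.abs(psi_y)) + 1e-12)
    return amp, omega

# Sanity check on a pure tone: a 1 kHz cosine sampled at 16 kHz.
fs, f0 = 16000, 1000.0
x = np.cos(2 * np.pi * f0 * np.arange(2048) / fs)
amp, omega = desa2(x)
print(np.median(omega) * fs / (2 * np.pi))   # close to 1000.0
```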

Countermeasures for Automatic Speaker Verification Replay Spoofing Attack: On Data Augmentation, Feature Representation, Classification and Fusion

Weicheng Cai 1, Danwei Cai 1, Wenbo Liu 1, Gang Li 2, Ming Li 1; 1Sun Yat-sen University, China; 2JSC, China
Mon-SS-1-8-4, Time: 12:00–12:15

The ongoing ASVspoof 2017 challenge aims to detect replay attacks for text-dependent speaker verification. In this paper, we propose multiple replay spoofing countermeasure systems, with some of them boosting the CQCC-GMM baseline system after score-level fusion. We investigate different steps in the system building pipeline, including data augmentation, feature representation, classification and fusion. First, in order to augment training data and simulate the unseen replay conditions, we converted the raw genuine training data into replay spoofing data with a parametric sound reverberator and phase shifter. Second, we employed the original spectrogram rather than CQCC as input to explore end-to-end feature representation learning methods. The spectrogram is randomly cropped into fixed-size segments and then fed into a deep residual network (ResNet). Third, upon the CQCC features, we replaced the subsequent GMM classifier with deep neural networks including a fully-connected deep neural network (FDNN) and a Bidirectional Long Short Term Memory neural network (BLSTM). Experiments showed that the data augmentation strategy can significantly improve system performance. The final fused system achieves 16.39% EER on the test set of ASVspoof 2017 for the common task.
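A minimal sketch of the fixed-size random cropping of spectrograms mentioned above, as one might prepare variable-length utterances for a ResNet; the crop width, padding strategy and function name are our assumptions rather than details taken from the paper.

```python
import numpy as np

def random_crops(spectrogram, crop_frames=400, n_crops=4, rng=None):
    """Randomly crop a (n_bins, n_frames) spectrogram into fixed-width segments."""
    rng = rng or np.random.default_rng()
    n_bins, n_frames = spectrogram.shape
    if n_frames < crop_frames:                       # tile short utterances
        reps = int(np.ceil(crop_frames / n_frames))
        spectrogram = np.tile(spectrogram, (1, reps))
        n_frames = spectrogram.shape[1]
    starts = rng.integers(0, n_frames - crop_frames + 1, size=n_crops)
    return np.stack([spectrogram[:, s:s + crop_frames] for s in starts])

# Each crop can then be fed to the network as a single-channel "image",
# e.g. crops[:, None, :, :] for an (N, 1, n_bins, crop_frames) batch.
```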

Spoof Detection Using Source, Instantaneous Frequency and Cepstral Features

Sarfaraz Jelil, Rohan Kumar Das, S.R. Mahadeva Prasanna, Rohit Sinha; IIT Guwahati, India
Mon-SS-1-8-5, Time: 12:15–12:30

This work describes the techniques used for spoofed speech detection for the ASVspoof 2017 challenge. The main focus of this work is on exploiting the differences in the speech-specific nature of genuine speech signals and spoofed speech signals generated by replay attacks. This is achieved using glottal closure instants, epoch strength, and the peak to side lobe ratio of the Hilbert envelope of the linear prediction residual. Apart from these source features, the instantaneous frequency cosine coefficient feature and two cepstral features, namely constant Q cepstral coefficients and mel frequency cepstral coefficients, are used. A combination of all these features is performed to obtain a high degree of accuracy for spoof detection. Initially, the efficacy of these features is tested on the development set of the ASVspoof 2017 database with Gaussian mixture model based systems. The systems are then fused at score level, which acts as the final combined system for the challenge. The combined system is able to outperform the individual systems by a significant margin. Finally, the experiments are repeated on the evaluation set of the database and the combined system results in an equal error rate of 13.95%.

Audio Replay Attack Detection Using High-Frequency Features

Marcin Witkowski, Stanisław Kacprzak, Piotr Zelasko, Konrad Kowalczyk, Jakub Gałka; AGH UST, Poland
Mon-SS-1-8-6, Time: 12:30–12:45

This paper presents our contribution to the ASVspoof 2017 Challenge. It addresses a replay spoofing attack against a speaker recognition system by detecting that the analysed signal has passed through multiple analogue-to-digital (AD) conversions. Specifically, we show that most of the cues that enable detection of replay attacks can be found in the high-frequency band of the replayed recordings. The described anti-spoofing countermeasures are based on (1) modelling the subband spectrum and (2) using the proposed features derived from linear prediction (LP) analysis. The results of the investigated methods show a significant improvement in comparison to the baseline system of the ASVspoof 2017 Challenge. A relative equal error rate (EER) reduction by 70% was achieved for the development set and a reduction by 30% was obtained for the evaluation set.

Feature Selection Based on CQCCs for Automatic Speaker Verification Spoofing

Xianliang Wang 1, Yanhong Xiao 2, Xuan Zhu 1; 1Beijing Samsung Telecom R&D Center, China; 2Beijing Institute of Technology, China
Mon-SS-1-8-7, Time: 12:45–13:00

The ASVspoof 2017 challenge aims to assess spoofing and countermeasure attack detection accuracy for automatic speaker verification. It has been proven that constant Q cepstral coefficients (CQCCs) process speech in different frequencies with variable resolution and perform much better than traditional features. When coupled with a Gaussian mixture model (GMM), they form a highly effective spoofing countermeasure. The baseline CQCC+GMM system considers short-term impacts while ignoring the whole influence of the channel. Meanwhile, the dimension of the feature is relatively higher than that of traditional features and usually comes with a higher variance. This paper explores different features for the ASVspoof 2017 challenge. The mean and variance of the CQCC features of an utterance are used as the representation of the whole utterance. A feature selection method is introduced to avoid high variance and overfitting for spoofing detection. Experimental results on the ASVspoof 2017 dataset show that feature selection followed by a Support Vector Machine (SVM) obtains an improvement over the baseline. It is also shown that a pitch feature contributes to the performance improvement, obtaining a relative improvement of 37.39% over the baseline CQCC+GMM system.
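A hedged sketch of the utterance-level representation and classifier described: per-utterance mean and variance of frame-level features, a simple feature selection step, and an SVM back end. The selection criterion (a univariate F-test) and all names are assumptions; the paper's exact procedure may differ.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

def utterance_stats(frame_features):
    """(n_frames, n_dims) frame-level features -> (2*n_dims,) mean/variance vector."""
    return np.concatenate([frame_features.mean(axis=0), frame_features.var(axis=0)])

def train_detector(train_utterances, train_labels, k_features=40):
    """train_utterances: list of frame-level feature matrices; labels: 1 genuine, 0 spoof."""
    X = np.stack([utterance_stats(u) for u in train_utterances])
    selector = SelectKBest(f_classif, k=k_features).fit(X, train_labels)
    clf = SVC(kernel="rbf", probability=True).fit(selector.transform(X), train_labels)
    return selector, clf

# Scoring a test utterance u:
#   p_genuine = clf.predict_proba(selector.transform(utterance_stats(u)[None, :]))[0, 1]
```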

Mon-SS-1-11: Special Session: Speech Technology for Code-Switching in Multilingual Communities
F11, 11:00–13:00, Monday, 21 Aug. 2017
Chairs: Kalika Bali, Alan W. Black

Introduction
Mon-SS-1-11-10, Time: 11:00–11:20

(No abstract available at the time of publication)

Longitudinal Speaker Clustering and Verification Corpus with Code-Switching Frisian-Dutch Speech

Emre Yılmaz 1, Jelske Dijkstra 2, Hans Van de Velde 2, Frederik Kampstra 3, Jouke Algra 3, Henk van den Heuvel 1, David Van Leeuwen 1; 1Radboud Universiteit Nijmegen, The Netherlands; 2Fryske Akademy, The Netherlands; 3Omrop Fryslân, The Netherlands
Mon-SS-1-11-1, Time: 11:20–11:40

In this paper, we present a new longitudinal and bilingual broadcast database designed for speaker clustering and text-independent verification research. The broadcast data is extracted from the archives of Omrop Fryslân, which is the regional broadcaster in the province of Fryslân, located in the north of the Netherlands. Two speaker verification tasks are provided in a standard enrollment-test setting with language-consistent trials. The first task contains target trials from all available speakers appearing in at least two different programs, while the second task contains target trials from a subgroup of speakers appearing in programs recorded in multiple years. The second task is designed to investigate the effects of ageing on the accuracy of speaker verification systems. This database also contains unlabeled spoken segments from different radio programs for speaker clustering research. We provide the output of an existing speaker diarization system for baseline verification experiments. Finally, we present the baseline speaker verification results using the Kaldi GMM- and DNN-UBM speaker verification systems. This database will be an extension to the recently presented open source Frisian data collection and it is publicly available for research purposes.

Exploiting Untranscribed Broadcast Data for Improved Code-Switching Detection

Emre Yılmaz, Henk van den Heuvel, David Van Leeuwen; Radboud Universiteit Nijmegen, The Netherlands
Mon-SS-1-11-2, Time: 11:40–12:00

We have recently presented an automatic speech recognition (ASR) system operating on Frisian-Dutch code-switched speech. This type of speech requires careful handling of unexpected language switches that may occur in a single utterance. In this paper, we extend this work by using some raw broadcast data to improve multilingually trained deep neural networks (DNN) that have been trained on 11.5 hours of manually annotated bilingual speech. For this purpose, we apply the initial ASR to the untranscribed broadcast data and automatically create transcriptions based on the recognizer output using different language models for rescoring. Then, we train new acoustic models on the combined data, i.e., the manually and automatically transcribed bilingual broadcast data, and investigate the automatic transcription quality based on the recognition accuracies on a separate set of development and test data. Finally, we report code-switching detection performance, elaborating on the correlation between the ASR and the code-switching detection performance.

Jee haan, I’d like both, por favor: Elicitation of a Code-Switched Corpus of Hindi–English and Spanish–English Human–Machine Dialog

Vikram Ramanarayanan, David Suendermann-Oeft; Educational Testing Service, USA
Mon-SS-1-11-3, Time: 12:00–12:20

We present a database of code-switched conversational human–machine dialog in English–Hindi and English–Spanish. We leveraged HALEF, an open-source standards-compliant cloud-based dialog system, to capture audio and video of bilingual crowd workers as they interacted with the system. We designed conversational items with intra-sentential code-switched machine prompts, and examined their efficacy in eliciting code-switched speech in a total of over 700 dialogs. We analyze various characteristics of the code-switched corpus and discuss some considerations that should be taken into account while collecting and processing such data. Such a database can be leveraged for a wide range of potential applications, including automated processing, recognition and understanding of code-switched speech, and language learning applications for new language learners.

On Building Mixed Lingual Speech Synthesis Systems

SaiKrishna Rallabandi, Alan W. Black; Carnegie Mellon University, USA
Mon-SS-1-11-4, Time: 12:20–12:40

Codemixing — the phenomenon where lexical items from one language are embedded in the utterance of another — is relatively frequent in multilingual communities. However, TTS systems today are not fully capable of effectively handling such mixed content despite achieving high quality in the monolingual case. In this paper, we investigate various mechanisms for building mixed lingual systems which are built using a mixture of monolingual corpora and are capable of synthesizing such content. First, we explore the possibility of manipulating the phoneme representation: using target word to source phone mapping with the aim of emulating native speaker intuition. We then present experiments at the acoustic stage investigating training techniques at both spectral and prosodic levels. Subjective evaluation shows that our systems are capable of generating high quality synthesis in codemixed scenarios.

Speech Synthesis for Mixed-Language Navigation Instructions

Khyathi Raghavi Chandu 1, SaiKrishna Rallabandi 1, Sunayana Sitaram 2, Alan W. Black 1; 1Carnegie Mellon University, USA; 2Microsoft, India
Mon-SS-1-11-5, Time: 12:40–13:00

Text-to-Speech (TTS) systems that can read navigation instructions are one of the most widely used speech interfaces today. Text in the navigation domain may contain named entities such as location names that are not in the language that the TTS database is recorded in. Moreover, named entities can be compound words where individual lexical items belong to different languages. These named entities may be transliterated into the script that the TTS system is trained on. This may result in incorrect pronunciation rules being used for such words. We describe experiments to extend our previous work in generating code-mixed speech to synthesize navigation instructions, with a mixed-lingual TTS system. We conduct subjective listening tests with two sets of users, one being students who are native speakers of an Indian language and very proficient in English, and the other being drivers with low English literacy, but familiarity with location names. We find that in both sets of users, there is a significant preference for our proposed system over a baseline system that synthesizes instructions in English.

Addressing Code-Switching in French/Algerian Arabic Speech

Djegdjiga Amazouz 1, Martine Adda-Decker 1, Lori Lamel 2; 1LPP (UMR 7018), France; 2LIMSI, France
Mon-SS-1-11-6, Time: 14:30–14:50

This study focuses on code-switching (CS) in French/Algerian Arabic bilingual communities and investigates how speech technologies, such as automatic data partitioning, language identification and automatic speech recognition (ASR), can serve to analyze and classify this type of bilingual speech. A preliminary study carried out using a corpus of Maghrebian broadcast data revealed a relatively high presence of CS Algerian Arabic as compared to the neighboring countries Morocco and Tunisia. Therefore this study focuses on code-switching produced by bilingual Algerian speakers who can be considered native speakers of both Algerian Arabic and French. A specific corpus of four hours of speech from 8 bilingual French Algerian speakers was collected. This corpus contains read speech and conversational speech in both languages and includes stretches of code-switching. We provide a linguistic description of the code-switching stretches in terms of intra-sentential and inter-sentential switches and the speech duration in each language. We report on some initial studies to locate French, Arabic and the code-switched stretches, using ASR system word posteriors for this pair of languages.

Metrics for Modeling Code-Switching Across Corpora

Gualberto Guzmán, Joseph Ricard, Jacqueline Serigos, Barbara E. Bullock, Almeida Jacqueline Toribio; University of Texas at Austin, USA
Mon-SS-1-11-7, Time: 14:50–15:10

In developing technologies for code-switched speech, it would be desirable to be able to predict how much language mixing might be expected in the signal and the regularity with which it might occur. In this work, we offer various metrics that allow for the classification and visualization of multilingual corpora according to the ratio of languages represented, the probability of switching between them, and the time-course of switching. Applying these metrics to corpora of different languages and genres, we find that they display distinct probabilities and periodicities of switching, information useful for speech processing of mixed-language data.
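As an illustration of the kind of corpus-level metrics described here (the exact formulations used by the authors are not given in the abstract), the following sketch computes a simple language-ratio and switch-probability estimate from a sequence of per-token language tags; the tag sequence and function names are hypothetical.

from collections import Counter

def mixing_metrics(lang_tags):
    """Estimate how mixed a corpus is from per-token language tags.

    lang_tags: list of language labels, one per token, e.g. ["en", "es", ...]
    Returns the ratio of the less-frequent language and the empirical
    probability that two adjacent tokens differ in language.
    """
    counts = Counter(lang_tags)
    total = sum(counts.values())
    minority_ratio = min(counts.values()) / total if len(counts) > 1 else 0.0

    switches = sum(1 for a, b in zip(lang_tags, lang_tags[1:]) if a != b)
    switch_prob = switches / max(len(lang_tags) - 1, 1)
    return minority_ratio, switch_prob

# Hypothetical usage on a short Spanish-English utterance
tags = ["en", "en", "es", "es", "es", "en", "en"]
print(mixing_metrics(tags))  # (0.428..., 0.333...)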

Synthesising isiZulu-English Code-Switch Bigrams Using Word Embeddings

Ewald van der Westhuizen, Thomas Niesler; Stellenbosch University, South Africa
Mon-SS-1-11-8, Time: 15:10–15:30

Code-switching is prevalent among South African speakers, and presents a challenge to automatic speech recognition systems. It is predominantly a spoken phenomenon, and generally does not occur in textual form. Therefore a particularly serious challenge is the extreme lack of training material for language modelling. We investigate the use of word embeddings to synthesise isiZulu-to-English code-switch bigrams with which to augment such sparse language model training data. A variety of word embeddings are trained on a monolingual English web text corpus, and subsequently queried to synthesise code-switch bigrams. Our evaluation is performed on language models trained on a new, although small, English-isiZulu code-switch corpus compiled from South African soap operas. This data is characterised by fast, spontaneously spoken speech containing frequent code-switching. We show that the augmentation of the training data with code-switched bigrams synthesised in this way leads to a reduction in perplexity.
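The abstract does not spell out how the embeddings are queried, but one plausible reading is that the English word of an observed code-switch bigram is replaced by its embedding-space neighbours to generate additional bigrams. A minimal sketch under that assumption (the embeddings dictionary and all function names are hypothetical, not the authors' implementation):

import numpy as np

def similar_words(word, embeddings, vocab, topn=5):
    """Return the topn nearest neighbours of `word` by cosine similarity.

    embeddings: dict mapping word -> 1-D numpy vector (assumed pre-trained).
    vocab: iterable of candidate English words.
    """
    v = embeddings[word]
    v = v / np.linalg.norm(v)
    scored = []
    for w in vocab:
        if w == word:
            continue
        u = embeddings[w]
        scored.append((float(np.dot(v, u / np.linalg.norm(u))), w))
    return [w for _, w in sorted(scored, reverse=True)[:topn]]

def synthesise_bigrams(observed_bigrams, embeddings, vocab, topn=5):
    """Expand each observed (isiZulu word, English word) bigram with bigrams
    whose English half is an embedding neighbour of the original word."""
    new_bigrams = set()
    for zulu_word, eng_word in observed_bigrams:
        for neighbour in similar_words(eng_word, embeddings, vocab, topn):
            new_bigrams.add((zulu_word, neighbour))
    return new_bigrams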

Crowdsourcing Universal Part-of-Speech Tags for Code-Switching

Victor Soto, Julia Hirschberg; Columbia University, USA
Mon-SS-1-11-9, Time: 15:30–15:50

Code-switching is the phenomenon by which bilingual speakers switch between multiple languages during communication. The importance of developing language technologies for code-switching data is immense, given the large populations that routinely code-switch. High-quality linguistic annotations are extremely valuable for any NLP task, and performance is often limited by the amount of high-quality labeled data. However, little such data exists for code-switching. In this paper, we describe crowd-sourcing universal part-of-speech tags for the Miami Bangor Corpus of Spanish-English code-switched speech. We split the annotation task into three subtasks: one in which a subset of tokens are labeled automatically, one in which questions are specifically designed to disambiguate a subset of high frequency words, and a more general cascaded approach for the remaining data in which questions are displayed to the worker following a decision tree structure. Each subtask is extended and adapted for a multilingual setting and the universal tagset. The quality of the annotation process is measured using hidden check questions annotated with gold labels. The overall agreement between gold standard labels and the majority vote is between 0.95 and 0.96 for just three labels and the average recall across part-of-speech tags is between 0.87 and 0.99, depending on the task.

Discussion
Mon-SS-1-11-11, Time: 15:50–16:30

(No abstract available at the time of publication)

Mon-SS-2-8 : Special Session: Interspeech 2017 Automatic Speaker Verification Spoofing and Countermeasures Challenge 2
D8, 14:30–16:30, Monday, 21 Aug. 2017
Chairs: Nicholas Evans, Kong Aik Lee

Audio Replay Attack Detection with Deep Learning Frameworks

Galina Lavrentyeva 1, Sergey Novoselov 1, Egor Malykh 1, Alexander Kozlov 2, Oleg Kudashev 1, Vadim Shchemelinin 1; 1ITMO University, Russia; 2STC-innovations, Russia
Mon-SS-2-8-1, Time: 14:30–14:45

Nowadays spoofing detection is one of the priority research areas in the field of automatic speaker verification. The success of the Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) Challenge 2015 confirmed the impressive perspective in detection of unforeseen spoofing trials based on speech synthesis and voice conversion techniques. However, there has been little research addressing replay spoofing attacks, which are more likely to be used by non-professional impersonators. This paper describes the Speech Technology Center (STC) anti-spoofing system submitted to ASVspoof 2017, which is focused on replay attack detection. Here we investigate the efficiency of a deep learning approach for the solution of the above-mentioned task. Experimental results obtained on the Challenge corpora demonstrate that the selected approach outperforms current state-of-the-art baseline systems in terms of spoofing detection quality. Our primary system produced an EER of 6.73% on the evaluation part of the corpora, which is a 72% relative improvement over the ASVspoof 2017 baseline system.
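Equal error rate (EER), used throughout this session, is the operating point at which a detector's false acceptance and false rejection rates coincide. A minimal sketch of how it can be estimated from genuine and spoofed detection scores (the score arrays below are hypothetical):

import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Estimate the equal error rate from two score arrays.

    Higher scores are assumed to indicate 'genuine'. The EER is taken at the
    threshold where false rejection and false acceptance rates are closest.
    """
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        frr = np.mean(genuine_scores < t)   # genuine trials rejected
        far = np.mean(spoof_scores >= t)    # spoofed trials accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer

rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)   # hypothetical detection scores
spoof = rng.normal(0.0, 1.0, 1000)
print(compute_eer(genuine, spoof))     # roughly 0.16 for these distributions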

Ensemble Learning for Countermeasure of Audio Replay Spoofing Attack in ASVspoof2017

Zhe Ji 1, Zhi-Yi Li 2, Peng Li 1, Maobo An 1, Shengxiang Gao 1, Dan Wu 1, Faru Zhao 1; 1CNCERT, China; 2CreditEase, China
Mon-SS-2-8-2, Time: 14:45–15:00

To enhance the security and reliability of automatic speaker verification (ASV) systems, the ASVspoof 2017 challenge focuses on the detection of known and unknown audio replay attacks. We propose an ensemble learning classifier for the CNCB team's submitted system scores, which uses a variety of acoustic features and classifiers. An effective post-processing method is studied to improve the performance of Constant Q cepstral coefficients (CQCC) and to form a base feature set together with some other classical acoustic features. We also propose using an ensemble classifier set, which includes multiple Gaussian Mixture Model (GMM) based classifiers and two novel classifiers: GMM mean supervector-Gradient Boosting Decision Tree (GSV-GBDT) and GSV-Random Forest (GSV-RF). Experimental results have shown that the proposed ensemble learning system can provide substantially better performance than the baseline. On the common training condition of the challenge, the Equal Error Rate (EER) of the primary system on the development set is 1.5%, compared to the baseline 10.4%. The EERs of the primary system (S02 on the ASVspoof 2017 board) on the evaluation data set are 12.3% (with only the train dataset) and 10.8% (with the train+dev dataset), which are also much better than the baselines of 30.6% and 24.8% given by the ASVspoof 2017 organizer, corresponding to 59.7% and 56.4% relative performance improvements.

A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification

Lantian Li, Yixiang Chen, Dong Wang, Thomas Fang Zheng; Tsinghua University, China
Mon-SS-2-8-3, Time: 15:00–15:15

For practical automatic speaker verification (ASV) systems, replay attack poses a true risk. By replaying a pre-recorded speech signal of the genuine speaker, ASV systems tend to be easily fooled. An effective replay detection method is therefore highly desirable. In this study, we investigate a major difficulty in replay detection: the over-fitting problem caused by variability factors in the speech signal. An F-ratio probing tool is proposed and three variability factors are investigated using this tool: speaker identity, speech content, and playback & recording device. The analysis shows that device is the most influential factor and contributes the highest over-fitting risk. A frequency warping approach is studied to alleviate the over-fitting problem, as verified on the ASVspoof 2017 database.
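The F-ratio referred to here is, in its standard form, a ratio of between-class to within-class variance computed per feature dimension. A hedged sketch of that generic statistic (not necessarily the exact probing tool used in the paper), with hypothetical feature arrays grouped by class:

import numpy as np

def f_ratio(groups):
    """Between-class over within-class variance for one feature dimension.

    groups: list of 1-D arrays, one array of feature values per class
            (e.g. per device, per speaker, or per speech content).
    """
    means = np.array([g.mean() for g in groups])
    grand_mean = np.concatenate(groups).mean()
    between = np.mean((means - grand_mean) ** 2)
    within = np.mean([g.var() for g in groups])
    return between / within

# Hypothetical example: one spectral feature measured under three devices
rng = np.random.default_rng(1)
devices = [rng.normal(mu, 1.0, 500) for mu in (0.0, 0.5, 2.0)]
print(f_ratio(devices))  # larger values indicate a more device-dependent feature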


Replay Attack Detection Using DNN for Channel Discrimination

Parav Nagarsheth, Elie Khoury, Kailash Patil, Matt Garland; Pindrop, USA
Mon-SS-2-8-4, Time: 15:15–15:30

Voice is projected to be the next input interface for portable devices. The increased use of audio interfaces can be mainly attributed to the success of speech and speaker recognition technologies. With these advances comes the risk of criminal threats where attackers are reportedly trying to access sensitive information using diverse voice spoofing techniques. Among them, replay attacks pose a real challenge to voice biometrics. This paper addresses the problem by proposing a deep learning architecture in tandem with low-level cepstral features. We investigate the use of a deep neural network (DNN) to discriminate between the different channel conditions available in the ASVspoof 2017 dataset, namely recording, playback and session conditions. The high-level feature vectors derived from this network are used to discriminate between genuine and spoofed audio. Two kinds of low-level features are utilized: state-of-the-art constant-Q cepstral coefficients (CQCC), and our proposed high-frequency cepstral coefficients (HFCC) that derive from the high-frequency spectrum of the audio. The fusion of both features proved to be effective in generalizing well across diverse replay attacks seen in the evaluation of the ASVspoof 2017 challenge, with an equal error rate of 11.5%, that is 53% better than the baseline Gaussian Mixture Model (GMM) applied on CQCC.

ResNet and Model Fusion for Automatic Spoofing Detection

Zhuxin Chen, Zhifeng Xie, Weibin Zhang, Xiangmin Xu; SCUT, China
Mon-SS-2-8-5, Time: 15:30–15:45

Speaker verification systems have achieved great progress in recent years. Unfortunately, they are still highly prone to different kinds of spoofing attacks such as speech synthesis, voice conversion, and fake audio recordings. Inspired by the success of ResNet in image recognition, we investigated the effectiveness of using ResNet for automatic spoofing detection. Experimental results on the ASVspoof 2017 data set show that ResNet performs the best among all the single-model systems. Model fusion is a good way to further improve the system performance. Nevertheless, we found that if the same feature is used for different fused models, the resulting system can hardly be improved. By using different features and models, our best fused model further reduced the Equal Error Rate (EER) by 18% relatively, compared with the best single-model system.

SFF Anti-Spoofer: IIIT-H Submission for Automatic Speaker Verification Spoofing and Countermeasures Challenge 2017

K.N.R.K. Raju Alluri, Sivanand Achanta, Sudarsana Reddy Kadiri, Suryakanth V. Gangashetty, Anil Kumar Vuppala; IIIT Hyderabad, India
Mon-SS-2-8-6, Time: 15:45–16:00

The ASVspoof 2017 challenge is about the detection of replayed speech from human speech. The proposed system makes use of the fact that when speech signals are replayed, they pass through multiple channels as opposed to original recordings. This channel information is typically embedded in low signal-to-noise ratio regions. A speech signal processing method with high spectro-temporal resolution is required to extract robust features from such regions. Single frequency filtering (SFF) is one such technique, which we propose to use for replay attack detection. While the SFF-based feature representation was used at the front end, Gaussian mixture model and bi-directional long short-term memory models are investigated at the back end as classifiers. The experimental results on the ASVspoof 2017 dataset reveal that SFF-based representation is very effective in detecting replay attacks. The score-level fusion of the back-end classifiers further improved the performance of the system, which indicates that both classifiers capture complementary information.

Discussion
Mon-SS-2-8-7, Time: 16:00–16:30

(No abstract available at the time of publication)

Mon-O-1-1 : Conversational Telephone Speech Recognition
Aula Magna, 11:00–13:00, Monday, 21 Aug. 2017
Chairs: Penny Karanasou, Ralf Schlüter

Improved Single System Conversational Telephone Speech Recognition with VGG Bottleneck Features

William Hartmann, Roger Hsiao, Tim Ng, Jeff Ma, Francis Keith, Man-Hung Siu; Raytheon BBN Technologies, USA
Mon-O-1-1-1, Time: 11:00–11:20

On small datasets, discriminatively trained bottleneck features from deep networks commonly outperform more traditional spectral or cepstral features. While these features are typically trained with small, fully-connected networks, recent studies have used more sophisticated networks with great success. We use the recent deep CNN (VGG) network for bottleneck feature extraction — previously used only for low-resource tasks — and apply it to the Switchboard English conversational telephone speech task. Unlike features derived from traditional MLP networks, the VGG features outperform cepstral features even when used with BLSTM acoustic models trained on large amounts of data. We achieve the best BBN single system performance when combining the VGG features with a BLSTM acoustic model. When decoding with an n-gram language model, as used for deployable systems, we have a realistic production system with a WER of 7.4%. This result is competitive with the current state-of-the-art in the literature. While our focus is on realistic single system performance, we further reduce the WER to 6.1% through system combination and expensive neural network language model rescoring.

Student-Teacher Training with Diverse Decision Tree Ensembles

Jeremy H.M. Wong, Mark J.F. Gales; University of Cambridge, UK
Mon-O-1-1-2, Time: 11:20–11:40

Student-teacher training allows a large teacher model or ensemble of teachers to be compressed into a single student model, for the purpose of efficient decoding. However, current approaches in automatic speech recognition assume that the state clusters, often defined by Phonetic Decision Trees (PDT), are the same across all models. This limits the diversity that can be captured within the ensemble, and also the flexibility when selecting the complexity of the student model output. This paper examines an extension to student-teacher training that allows for the possibility of having different PDTs between teachers, and also for the student to have a different PDT from the teacher. The proposal is to train the student to emulate the logical context dependent state posteriors of the teacher, instead of the frame posteriors. This leads to a method of mapping frame posteriors from one PDT to another. This approach is evaluated on three speech recognition tasks: the Tok Pisin and Javanese low resource conversational telephone speech tasks from the IARPA Babel programme, and the HUB4 English broadcast news task.
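In its standard form, student-teacher training minimizes the cross-entropy (equivalently, the KL divergence up to a constant) between teacher and student per-frame posteriors; the PDT mapping described above is specific to the paper and is not reproduced here. A minimal sketch of the generic distillation loss in PyTorch, where the tensor shapes and a shared state inventory are assumptions:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between teacher and student posteriors over tied states.

    Both tensors are (num_frames, num_states); a shared state inventory is
    assumed, i.e. the teacher posteriors have already been mapped onto the
    student's decision tree if the trees differ.
    """
    teacher_post = F.softmax(teacher_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_post * student_logp).sum(dim=-1).mean()

# Hypothetical usage with random frame posteriors
student = torch.randn(32, 500)   # 32 frames, 500 tied states
teacher = torch.randn(32, 500)
print(distillation_loss(student, teacher).item())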

Embedding-Based Speaker Adaptive Training of Deep Neural Networks

Xiaodong Cui, Vaibhava Goel, George Saon; IBM, USA
Mon-O-1-1-3, Time: 11:40–12:00

An embedding-based speaker adaptive training (SAT) approach is proposed and investigated in this paper for deep neural network acoustic modeling. In this approach, speaker embedding vectors, which are constant for a given speaker, are mapped through a control network to layer-dependent element-wise affine transformations to canonicalize the internal feature representations at the output of hidden layers of a main network. The control network for generating the speaker-dependent mappings is jointly estimated with the main network for the overall speaker adaptive acoustic modeling. Experiments on large vocabulary continuous speech recognition (LVCSR) tasks show that the proposed SAT scheme can yield superior performance over the widely-used speaker-aware training using i-vectors with speaker-adapted input features.
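A hedged sketch of the general idea: a small control network that maps a fixed speaker embedding to an element-wise scale and shift applied to one hidden layer of the main network. The layer sizes and module names are hypothetical and do not reproduce the paper's exact architecture.

import torch
import torch.nn as nn

class ControlNet(nn.Module):
    """Maps a speaker embedding to per-unit scale and shift for one layer."""
    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        self.scale = nn.Linear(emb_dim, hidden_dim)
        self.shift = nn.Linear(emb_dim, hidden_dim)

    def forward(self, speaker_emb, hidden_activations):
        # Element-wise affine transform of the main network's hidden layer.
        a = 1.0 + self.scale(speaker_emb)        # centred around identity
        b = self.shift(speaker_emb)
        return a * hidden_activations + b

# Hypothetical usage: 100-dim speaker embedding, 1024-unit hidden layer
control = ControlNet(emb_dim=100, hidden_dim=1024)
emb = torch.randn(1, 100)            # one speaker
h = torch.randn(8, 1024)             # 8 frames from that speaker
h_adapted = control(emb, h)          # broadcasts over the frame dimension
print(h_adapted.shape)               # torch.Size([8, 1024])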

Improving Deliverable Speech-to-Text Systems with Multilingual Knowledge Transfer

Jeff Ma, Francis Keith, Tim Ng, Man-Hung Siu, Owen Kimball; Raytheon BBN Technologies, USA
Mon-O-1-1-4, Time: 12:00–12:20

This paper reports our recent progress on using multilingual data for improving speech-to-text (STT) systems that can be easily delivered. We continued the work BBN conducted on the use of multilingual data for improving Babel evaluation systems, but focused on training time-delay neural network (TDNN) based chain models. As done for the Babel evaluations, we used multilingual data in two ways: first, to train multilingual deep neural networks (DNN) for extracting bottle-neck (BN) features, and second, for initializing training on target languages.

Our results show that TDNN chain models trained on multilingual DNN bottleneck features yield significant gains over their counterparts trained on MFCC plus i-vector features. By initializing from models trained on multilingual data, TDNN chain models can achieve great improvements over random initializations of the network weights on target languages. Two other important findings are: 1) initialization with multilingual TDNN chain models produces larger gains on target languages that have less training data; 2) inclusion of target languages in multilingual training for either BN feature extraction or initialization has limited impact on performance measured on the target languages. Our results also reveal that for TDNN chain models, the combination of multilingual BN features and multilingual initialization achieves the best performance on all target languages.

English Conversational Telephone Speech Recognition by Humans and Machines

George Saon 1, Gakuto Kurata 2, Tom Sercu 1, Kartik Audhkhasi 1, Samuel Thomas 1, Dimitrios Dimitriadis 1, Xiaodong Cui 1, Bhuvana Ramabhadran 1, Michael Picheny 1, Lynn-Li Lim 3, Bergul Roomi 3, Phil Hall 3; 1IBM, USA; 2IBM, Japan; 3Appen, Australia
Mon-O-1-1-5, Time: 12:20–12:40

Word error rates on the Switchboard conversational corpus that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This then raises two issues: what is human performance, and how far down can we still drive speech recognition error rates? In trying to assess human performance, we performed an independent set of measurements on the Switchboard and CallHome subsets of the Hub5 2000 evaluation and found that human accuracy may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the WER of our system to 5.5%/10.3% on these subsets, which is a new performance milestone (albeit not at what we measure to be human performance). On the acoustic side, we use a score fusion of one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning and a third convolutional residual net (ResNet). On the language modeling side, we use word and character LSTMs and convolutional WaveNet-style language models.

Comparing Human and Machine Errors in Conversational Speech Transcription

Andreas Stolcke, Jasha Droppo; Microsoft, USA
Mon-O-1-1-6, Time: 12:40–13:00

Recent work in automatic recognition of conversational telephone speech (CTS) has achieved accuracy levels comparable to human transcribers, although there is some debate how to precisely quantify human performance on this task, using the NIST 2000 CTS evaluation set. This raises the question of what systematic differences, if any, may be found differentiating human from machine transcription errors. In this paper we approach this question by comparing the output of our most accurate CTS recognition system to that of a standard speech transcription vendor pipeline. We find that the most frequent substitution, deletion and insertion error types of both outputs show a high degree of overlap. The only notable exception is that the automatic recognizer tends to confuse filled pauses (“uh”) and backchannel acknowledgments (“uhhuh”). Humans tend not to make this error, presumably due to the distinctive and opposing pragmatic functions attached to these words. Furthermore, we quantify the correlation between human and machine errors at the speaker level, and investigate the effect of speaker overlap between training and test data. Finally, we report on an informal “Turing test” asking humans to discriminate between automatic and human transcription error cases.

Mon-O-1-2 : Multimodal Paralinguistics
A2, 11:00–13:00, Monday, 21 Aug. 2017
Chairs: Paula Lopez-Otero, Elizabeth Shriberg

Multimodal Markers of Persuasive Speech: Designing a Virtual Debate Coach

Volha Petukhova 1, Manoj Raju 1, Harry Bunt 2; 1Universität des Saarlandes, Germany; 2Tilburg University, The Netherlands
Mon-O-1-2-1, Time: 11:00–11:20

The study presented in this paper is carried out to support debate performance assessment in the context of debate skills training. The perception of good performance as a debater is influenced by how believable and convincing the debater's argumentation is. We identified a number of features that are useful for explaining perceived properties of persuasive speech and for defining rules and strategies to produce and assess debate performance. We collected and analysed multimodal and multisensory data of the trainees' debate behaviour, and contrasted it with those of skilled professional debaters. Observational, correlation and machine learning studies were performed to identify multimodal markers of persuasive speech and link them to experts' assessments. A combination of multimodal in- and out-of-domain debate data and various non-verbal, prosodic, lexical, linguistic and structural features was computed based on our analysis, and several classification procedures were applied, achieving an accuracy of 0.79 on spoken debate data.

Acoustic-Prosodic and Physiological Response to Stressful Interactions in Children with Autism Spectrum Disorder

Daniel Bone 1, Julia Mertens 2, Emily Zane 2, Sungbok Lee 1, Shrikanth S. Narayanan 1, Ruth Grossman 2; 1University of Southern California, USA; 2Emerson College, USA
Mon-O-1-2-2, Time: 11:20–11:40

Social anxiety is a prevalent condition affecting individuals to varying degrees. Research on autism spectrum disorder (ASD), a group of neurodevelopmental disorders marked by impairments in social communication, has found that social anxiety occurs more frequently in this population. Our study aims to further understand the multimodal manifestation of social stress for adolescents with ASD versus neurotypically developing (TD) peers. We investigate this through objective measures of speech behavior and physiology (mean heart rate) acquired during three tasks: a low-stress conversation, a medium-stress interview, and a high-stress presentation. Measurable differences are found to exist for speech behavior and heart rate in relation to task-induced stress. Additionally, we find the acoustic measures are particularly effective for distinguishing between diagnostic groups. Individuals with ASD produced higher prosodic variability, agreeing with previous reports. Moreover, the most informative features captured an individual's vocal changes between low and high social-stress, suggesting an interaction between vocal production and social stressors in ASD.

A Stepwise Analysis of Aggregated Crowdsourced Labels Describing Multimodal Emotional Behaviors

Alec Burmania, Carlos Busso; University of Texas at Dallas, USA
Mon-O-1-2-3, Time: 11:40–12:00

Affect recognition is a difficult problem that most often relies on human annotated data to train automated systems. As humans perceive emotion differently based on personality, cognitive state and past experiences, it is important to collect rankings from multiple individuals to assess the emotional content in corpora, which are later aggregated with rules such as majority vote. With the increased use of crowdsourcing services for perceptual evaluations, collecting large amounts of data is now feasible. It becomes important to question the amount of data needed to create well-trained classifiers. How different are the aggregated labels collected from five raters compared to the ones obtained from twenty evaluators? Is it worthwhile to spend resources to increase the number of evaluators beyond those used in conventional/laboratory studies? This study evaluates the consensus labels obtained by incrementally adding new evaluators during perceptual evaluations. Using majority vote over categorical emotional labels, we compare the changes in the aggregated labels starting with one rater, and finishing with 20 raters. The large number of evaluators in a subset of the MSP-IMPROV database and the ability to filter annotators by quality allows us to better understand label aggregation as a function of the number of annotators.
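A minimal sketch of the kind of analysis described here, tracking how the majority-vote label evolves as raters are added one at a time; the rating list, labels and function names are hypothetical.

from collections import Counter

def majority_label(labels):
    """Most frequent label; ties broken in favour of the earliest occurrence."""
    counts = Counter(labels)
    return max(counts, key=lambda lab: (counts[lab], -labels.index(lab)))

def consensus_trajectory(ratings):
    """ratings: categorical labels from raters 1..N for one stimulus.
    Returns the majority-vote label after each additional rater."""
    return [majority_label(ratings[:k]) for k in range(1, len(ratings) + 1)]

# Hypothetical ratings of one clip by 8 annotators
ratings = ["happy", "happy", "neutral", "happy", "neutral",
           "neutral", "neutral", "neutral"]
print(consensus_trajectory(ratings))
# ['happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'neutral', 'neutral']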

An Information Theoretic Analysis of the Temporal Synchrony Between Head Gestures and Prosodic Patterns in Spontaneous Speech

Gaurav Fotedar, Prasanta Kumar Ghosh; Indian Institute of Science, India
Mon-O-1-2-4, Time: 12:00–12:20

We analyze the temporal co-ordination between head gestures and prosodic patterns in spontaneous speech in a data-driven manner. For this study, we consider head motion and speech data from 24 subjects while they tell a fixed set of five stories. The head motion, captured using a motion capture system, is converted to Euler angles and translations in X, Y and Z-directions to represent head gestures. Pitch and short-time energy in voiced segments are used to represent the prosodic patterns. To capture the statistical relationship between head gestures and prosodic patterns, mutual information (MI) is computed at various delays between the two using data from 24 subjects in six native languages. The estimated MI, averaged across all subjects, is found to be maximum when the head gestures lag the prosodic patterns by 30 msec. This is found to be true when subjects tell stories in English as well as in their native language. We observe a similar pattern in the root mean squared error of predicting head gestures from prosodic patterns using a Gaussian mixture model. These results indicate that there could be an asynchrony between head gestures and prosody during spontaneous speech where head gestures follow the corresponding prosodic patterns.
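A hedged sketch of lagged mutual information between two signals, estimated by discretizing each stream into bins; the signals, bin count and delay range are hypothetical stand-ins for the prosodic and head-motion streams described above, not the paper's estimator.

import numpy as np
from sklearn.metrics import mutual_info_score

def lagged_mi(x, y, max_lag, n_bins=16):
    """MI between x(t) and y(t + lag) for lags in [-max_lag, max_lag].

    Positive lags mean y follows x. Signals are discretized into equal-width
    bins so that the plug-in MI estimator can be used."""
    xd = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins)[1:-1])
    yd = np.digitize(y, np.histogram_bin_edges(y, bins=n_bins)[1:-1])
    results = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag > 0:
            a, b = xd[:-lag], yd[lag:]
        elif lag < 0:
            a, b = xd[-lag:], yd[:lag]
        else:
            a, b = xd, yd
        results[lag] = mutual_info_score(a, b)
    return results

# Hypothetical example: y is a noisy, delayed copy of x (delay = 3 frames)
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = np.roll(x, 3) + 0.5 * rng.normal(size=2000)
mi = lagged_mi(x, y, max_lag=5)
print(max(mi, key=mi.get))  # expected to be close to 3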

Multimodal Prediction of Affective Dimensions via Fusing Multiple Regression Techniques

D.-Y. Huang 1, Wan Ding 2, Mingyu Xu 1, Huaiping Ming 1, Minghui Dong 1, Xinguo Yu 2, Haizhou Li 3; 1A*STAR, Singapore; 2Central China Normal University, China; 3NUS, Singapore
Mon-O-1-2-5, Time: 12:20–12:40

This paper presents a multimodal approach to predict affective dimensions that makes full use of features from audio, video, Electrodermal Activity (EDA) and Electrocardiogram (ECG), using three regression techniques: support vector regression (SVR), partial least squares regression (PLS), and a deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN). Each of the three regression techniques performs multimodal affective dimension prediction, followed by a fusion of the different models on features of the four modalities using a support vector regression. A support vector regression is also applied for a final fusion of the three regression systems. Experiments show that our proposed approach obtains promising results on the AVEC 2015 benchmark dataset for prediction of multimodal affective dimensions. For the development set, the concordance correlation coefficient (CCC) reaches 0.856 for arousal and 0.720 for valence, an improvement of 3.88% and 4.66% over the top performer of AVEC 2015 in arousal and valence, respectively.

Co-Production of Speech and Pointing Gestures in Clear and Perturbed Interactive Tasks: Multimodal Designation Strategies

Marion Dohen 1, Benjamin Roustan 2; 1GIPSA, France; 2UroMems, France
Mon-O-1-2-6, Time: 12:40–13:00

Designation consists in attracting an interlocutor's attention to a specific object and/or location. It is most often achieved using both speech (e.g., demonstratives) and gestures (e.g., manual pointing). This study aims at analyzing how speech and pointing gestures are co-produced in a semi-directed interactive task involving designation.


20 native speakers of French were involved in a cooperative task in which they provided instructions to a partner for her to reproduce a model she could not see on a grid both of them saw. They had to use only sentences of the form 'The [target word] goes there.' They did this in two conditions: silence and noise. Their speech and articulatory/hand movements (motion capture) were recorded. The analyses show that the participants' speech features were modified in noise (Lombard effect). They also spoke slower and made more pauses and errors. Their pointing gestures lasted longer and started later, showing an adaptation of gesture production to speech. The condition did not influence speech/gesture coordination. The apex (part of the gesture that shows) mainly occurred at the same time as the target word and not as the demonstrative, showing that speakers group speech and gesture carrying complementary rather than redundant information.

Mon-O-1-4 : Dereverberation, Echo Cancellation and Speech Enhancement
B4, 11:00–13:00, Monday, 21 Aug. 2017
Chairs: Stephen Zahorian, Bernd T. Meyer

Improving Speaker Verification for Reverberant Conditions with Deep Neural Network Dereverberation Processing

Peter Guzewich, Stephen A. Zahorian; Binghamton University, USA
Mon-O-1-4-1, Time: 11:00–11:20

We present an improved method for training Deep Neural Networks for dereverberation and show that it can improve performance for the speech processing tasks of speaker verification and speech enhancement. We replicate recently proposed methods for dereverberation using Deep Neural Networks and present our improved method, highlighting important aspects that influence performance. We then experimentally evaluate the capabilities and limitations of the method with respect to speech quality and speaker verification to show that ours achieves better performance than other proposed methods.

Stepsize Control for Acoustic Feedback Cancellation Based on the Detection of Reverberant Signal Periods and the Estimated System Distance

Philipp Bulling 1, Klaus Linhard 1, Arthur Wolf 1, Gerhard Schmidt 2; 1Daimler, Germany; 2Christian-Albrechts-Universität zu Kiel, Germany
Mon-O-1-4-2, Time: 11:20–11:40

A new approach for acoustic feedback cancellation is presented. The challenge in acoustic feedback cancellation is a strong correlation between the local speech and the loudspeaker signal. Due to this correlation, the convergence rate of adaptive algorithms is limited. Therefore, a novel stepsize control of the adaptive filter is presented. The stepsize control exploits reverberant signal periods to update the adaptive filter. As soon as local speech stops, the reverberation energy of the system decays exponentially. This means that during reverberation there is only excitation of the filter but no local speech. Thus, signals are not correlated and the filter can converge without correlation problems. Consequently, the stepsize control accelerates the adaptation process during reverberation and slows it down at the beginning of speech activity. It is shown that, with a particular gain control, the reverb-based stepsize control can be interpreted as the theoretical optimum stepsize. However, for this purpose a precise estimation of the system distance is required. One estimation method is presented. The proposed estimator has a rescue mechanism to detect enclosure dislocations. Both simulations and real-world testing show that the acoustic feedback canceler is capable of improving stability and convergence rate, even at high system gains.
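For readers unfamiliar with the underlying machinery: adaptive feedback and echo cancellers are typically built around an NLMS-style filter update whose stepsize is raised or lowered by a control signal. The sketch below shows a generic NLMS update with an externally supplied per-sample stepsize; it is a simplified stand-in, not the paper's reverberation-based control law, and the toy feedback path is hypothetical.

import numpy as np

def nlms_step(w, x_buf, d, stepsize, eps=1e-8):
    """One NLMS update of an FIR cancellation filter.

    w:        current filter coefficients (length L)
    x_buf:    the last L loudspeaker samples, newest first
    d:        current microphone sample
    stepsize: adaptation stepsize for this sample (kept near 0 during local
              speech, larger during reverberant-only periods)
    Returns the updated coefficients and the error (cancelled) sample."""
    y = np.dot(w, x_buf)            # estimated feedback/echo component
    e = d - y                       # cancellation error
    w = w + stepsize * e * x_buf / (np.dot(x_buf, x_buf) + eps)
    return w, e

# Hypothetical usage with a white-noise loudspeaker signal
rng = np.random.default_rng(0)
w = np.zeros(64)
x = rng.normal(size=1000)
for n in range(64, 1000):
    x_buf = x[n:n - 64:-1]          # newest sample first
    d = 0.1 * x[n - 5]              # toy feedback path: attenuated pure delay
    w, e = nlms_step(w, x_buf, d, stepsize=0.5)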

A Delay-Flexible Stereo Acoustic Echo Cancellation for DFT-Based In-Car Communication (ICC) Systems

Jan Franzen, Tim Fingscheidt; Technische Universität Braunschweig, Germany
Mon-O-1-4-3, Time: 11:40–12:00

In-car communication (ICC) systems, which support speech communication in noise by reproducing amplified speech from the car cabin in the car cabin, call for low-delay acoustic echo cancellation (AEC). In this paper we propose a delay-flexible DFT-based stereo AEC that is also capable of cancelling the echoes stemming from the audio player or FM radio. For the price of a somewhat higher complexity we are able to reduce the 32 ms delay of the baseline down to 4 ms, losing only 1 dB in ERLE while even preserving system distance properties.

Speech Enhancement Based on Harmonic Estimation Combined with MMSE to Improve Speech Intelligibility for Cochlear Implant Recipients

Dongmei Wang, John H.L. Hansen; University of Texas at Dallas, USA
Mon-O-1-4-4, Time: 12:00–12:20

In this paper, a speech enhancement algorithm is proposed to improve speech intelligibility for cochlear implant recipients. Our method is based on a combination of harmonic estimation and a traditional statistical method. Traditional statistical speech enhancement methods are effective only for stationary noise suppression, not for non-stationary noise. To address more complex noise scenarios, we explore the harmonic structure of the target speech to obtain a more accurate noise estimate. The estimated noise is then employed in the MMSE framework to obtain the gain function for recovering the target speech. Listening test experiments show a substantial speech intelligibility improvement for cochlear implant recipients in noisy environments.
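The gain-function step mentioned here follows the usual statistical enhancement pattern: given an estimate of the noise power spectrum, a per-frequency gain is computed and applied to the noisy spectrum. The sketch below uses a simple Wiener-type gain with a decision-directed a priori SNR estimate as a generic stand-in; it is not the harmonic-estimation scheme of the paper, and all array names and values are hypothetical.

import numpy as np

def wiener_gain(noisy_psd, noise_psd, prev_clean_psd, alpha=0.98, floor=0.05):
    """Per-frequency gain from estimated noise power (decision-directed SNR).

    noisy_psd:      |Y(f)|^2 of the current frame
    noise_psd:      estimated noise power spectrum for the frame
    prev_clean_psd: |estimated clean spectrum|^2 from the previous frame"""
    post_snr = np.maximum(noisy_psd / (noise_psd + 1e-12) - 1.0, 0.0)
    prio_snr = alpha * prev_clean_psd / (noise_psd + 1e-12) + (1 - alpha) * post_snr
    gain = prio_snr / (1.0 + prio_snr)          # Wiener gain
    return np.maximum(gain, floor)              # spectral floor to limit distortion

# Hypothetical single frame with four frequency bins
noisy_psd = np.array([4.0, 1.0, 0.2, 9.0])
noise_psd = np.array([1.0, 1.0, 0.2, 1.0])
prev_clean = np.zeros(4)
g = wiener_gain(noisy_psd, noise_psd, prev_clean)
enhanced_psd = (g ** 2) * noisy_psd
print(g)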

Improving Speech Intelligibility in Binaural Hearing Aids by Estimating a Time-Frequency Mask with a Weighted Least Squares Classifier

David Ayllón 1, Roberto Gil-Pita 2, Manuel Rosa-Zurera 2; 1Fonetic, Spain; 2Universidad de Alcalá, Spain
Mon-O-1-4-5, Time: 12:20–12:40

An efficient algorithm for speech enhancement in binaural hearing aids is proposed. The algorithm is based on the estimation of a time-frequency mask using supervised machine learning. The standard least-squares linear classifier is reformulated to optimize a metric related to speech/noise separation. The method is energy-efficient in two ways: the computational complexity is limited and the wireless data transmission is optimized. The ability of the algorithm to enhance speech contaminated with different types of noise at low SNR has been evaluated. Objective measures of speech intelligibility and speech quality demonstrate that the algorithm increases both the hearing comfort and speech understanding of the user. These results are supported by subjective listening tests.


Simulations of High-Frequency Vocoder on Mandarin Speech Recognition for Acoustic Hearing Preserved Cochlear Implant

Tsung-Chen Wu 1, Tai-Shih Chi 1, Chia-Fone Lee 2; 1National Chiao Tung University, Taiwan; 2Hualien Tzu Chi Hospital, Taiwan
Mon-O-1-4-6, Time: 12:40–13:00

Vocoder simulations are generally adopted to simulate the electrical hearing induced by a cochlear implant (CI). Our research group is developing a new four-electrode CI microsystem which induces high-frequency electrical hearing while preserving low-frequency acoustic hearing. To simulate the functionality of this CI, a previously developed hearing-impaired (HI) hearing model is combined with a 4-channel vocoder in this paper to respectively mimic the perceived acoustic hearing and electrical hearing. Psychoacoustic experiments are conducted on Mandarin speech recognition for determining parameters of the electrodes for this CI. Simulation results show that initial consonants of Mandarin are more difficult to recognize than final vowels via the acoustic hearing of HI patients. After electrical hearing is induced through logarithmic-frequency distributed electrodes, the speech intelligibility of HI patients is boosted for all Mandarin phonemes, especially for initial consonants. Similar results are consistently observed in clean and noisy test conditions.

Mon-O-1-6 : Acoustic and Articulatory Phonetics
C6, 11:00–13:00, Monday, 21 Aug. 2017
Chairs: Marzena Zygis, Štefan Benuš

Phonetic Correlates of Pharyngeal and Pharyngealized Consonants in Saudi, Lebanese, and Jordanian Arabic: An rt-MRI Study

Zainab Hermes, Marissa Barlaz, Ryan Shosted, Zhi-Pei Liang, Brad Sutton; University of Illinois at Urbana-Champaign, USA
Mon-O-1-6-1, Time: 11:00–11:20

The phonemic inventory of Arabic includes sounds that involve a pharyngeal constriction. Sounds referred to as 'pharyngeal' (/ʕ/ and /ħ/) are reported to have a primary constriction in the pharynx, while sounds referred to as 'pharyngealized' (/sˤ/, /tˤ/, /dˤ/, and /ðˤ/ or /zˤ/) are reported to have a secondary constriction in the pharynx. Some studies propose grouping both types of sounds together, citing phonetic and phonological evidence. Phonetically, pharyngeal consonants are argued to have a primary constriction below the pharynx, and are thus posited to be pharyngealized laryngeals. Under this view, the pharyngeal constriction is secondary, not primary. Phonologically, it has been established that pharyngealized sounds trigger pharyngealization spread, and proposals for grouping pharyngeal and pharyngealized consonants together cite similar, but not identical, spread patterns triggered by pharyngeals. In this study, Real-time Magnetic Resonance Imaging is employed to investigate the phonetic correlates of the pharyngeal constriction in both pharyngeal and pharyngealized sounds in Saudi, Lebanese, and Jordanian Arabic as exemplified by one speaker from each dialect. Our findings demonstrate a difference in the location of constriction among both types of sounds. These distinctions in place possibly account for the differences in the spread patterns triggered by each type of sound.

Glottal Opening and Strategies of Production of Fricatives

Benjamin Elie, Yves Laprie; INRIA, France
Mon-O-1-6-2, Time: 11:20–11:40

This work investigates the influence of the gradual opening of the glottis along its length during the production of fricatives in intervocalic contexts. Acoustic simulations reveal the existence of a transient zone in the articulatory space where the frication noise level is very sensitive to small perturbations of the glottal opening. This corresponds to the configurations where both frication noise and voiced contributions are present in the speech signal. To avoid this instability, speakers may adopt different strategies to ensure the voiced/voiceless contrast of fricatives. This is evidenced by experimental data of simultaneous glottal opening measurements, performed with ePGG, and audio recordings of vowel-fricative-vowel pseudowords. Voiceless fricatives are usually longer, in order to maximize the number of voiceless time frames over voiced frames due to the crossing of the transient regime. For voiced fricatives, the speaker may avoid the unstable regime by keeping the frication noise level low, and thus favoring the voicing characteristic, or by making very short crossings into the unstable regime. It is also shown that when speakers are asked to sustain voiced fricatives longer than in natural speech, they adopt the strategy of keeping the frication noise level low to avoid the unstable regime.

Acoustics and Articulation of Medial versus Final Coronal Stop Gemination Contrasts in Moroccan Arabic

Mohamed Yassine Frej, Christopher Carignan, Catherine T. Best; Western Sydney University, Australia
Mon-O-1-6-3, Time: 11:40–12:00

This paper presents results of a simultaneous acoustic and articulatory investigation of word-medial and word-final geminate/singleton coronal stop contrasts in Moroccan Arabic (MA). The acoustic analysis revealed that, only for the word-medial contrast, the two MA speakers adopted comparable strategies in contrasting geminates with singletons, mainly by significantly lengthening closure duration in geminates, relative to singletons. In word-final position, two speaker-specific contrasting patterns emerged. While one speaker also lengthened the closure duration for final geminates, the other speaker instead lengthened only the release duration for final geminates, relative to singletons. Consonant closure and preceding vowel were significantly longer for the geminate only in medial position, not in final position. These temporal differences were even more clearly delineated in the articulatory signal, captured via ultrasound, to which we applied the novel approach of using TRACTUS [Temporally Resolved Articulatory Configuration Tracking of UltraSound: 15] to index temporal properties of closure gestures for these geminate/singleton contrasts.

How are Four-Level Length Distinctions Produced? Evidence from Moroccan Arabic

Giuseppina Turco, Karim Shoul, Rachid Ridouane; LPP (UMR 7018), France
Mon-O-1-6-4, Time: 12:00–12:20

We investigate the durational properties of Moroccan Arabic identical consonant sequences contrasting singleton (S) and geminate (G) dental fricatives, in six combinations of four-level length contrasts across word boundaries (#) (one timing slot for #S, two for #G and S#S, three for S#G and G#S, and four for G#G). The aim is to determine the nature of the mapping between discrete phonological timing units and phonetic durations. Acoustic results show that the largest and most systematic jump in duration is displayed between the singleton fricative on the one hand and the other sequences on the other hand. Looking at these sequences, S#S is shown to have the same duration as #G. When a geminate is within the sequence, a temporal reorganization is observed: G#S is not significantly longer than S#S and #G; and G#G is only slightly longer than S#G. Instead of a four-way hierarchy, our data point towards a possible upper limit of three-way length contrasts for consonants: S < G=S#S=G#S < S#G=G#G. The interplay of a number of factors resulting in this mismatch between phonological length and phonetic duration is discussed, and a working hypothesis is provided for why duration contrasts are rarely ternary, and almost never quaternary.

Vowels in the Barunga Variety of North Australian Kriol

Caroline Jones 1, Katherine Demuth 2, Weicong Li 1, Andre Almeida 1; 1Western Sydney University, Australia; 2Macquarie University, Australia
Mon-O-1-6-5, Time: 12:20–12:40

North Australian Kriol is an English-based creole spoken widely by Indigenous people in northern Australia in areas where the traditional languages are endangered or no longer spoken. This paper offers the first acoustic description of the vowel phonology of Roper Kriol, within a variety spoken at Barunga Community, east of the town of Katherine in the Northern Territory.

Drawing on a new corpus for Barunga Kriol, the paper presents analyses of the short and long monophthongs, as well as the diphthongs, in the spontaneous speech of young adults. The results show the durations and spectral characteristics of the vowels, including major patterns of allophony (i.e. coarticulation and context effects). This updates the phonology over the previous description from the 1970s, showing that there is an additional front low vowel phoneme in the speech of young people today, as well as a vowel length contrast. Interestingly, there are points of similarity with the vowel acoustics for traditional Aboriginal languages of the region, for example in a relatively compact vowel space and in the modest trajectories of diphthongs.

Nature of Contrast and Coarticulation: Evidence from Mizo Tones and Assamese Vowel Harmony

Indranil Dutta 1, Irfan S. 2, Pamir Gogoi 3, Priyankoo Sarmah 4; 1EFLU, India; 2University of Illinois at Urbana-Champaign, USA; 3University of Florida, USA; 4IIT Guwahati, India
Mon-O-1-6-6, Time: 12:40–13:00

Tonal coarticulation is universally found to be greater in extent in the carryover direction compared to the anticipatory direction ([1], [2], [3], [4], [5]), leading to assimilatory processes. In general, carryover coarticulation has been understood to be due to inertio-mechanical forces, while anticipatory effects are seen to be a consequence of parallel activation of articulatory plans ([6]). In this paper, we report on results from a set of Artificial Neural Networks (ANN) trained to predict adjacent tones in disyllabic sequences. Our results confirm the universal pattern of greater carryover effects in Mizo leading to tonal assimilation. In addition, we report on results from single-layered ANN models and Support Vector Machines (SVM) that predict the identity of V2 from V1 (anticipatory) consistently better than V1 from V2 (carryover) in Assamese non-harmonic #…V1CV2…# sequences. The directionality in the performance of the V1 and V2 models helps us conclude that the directionality effect of coarticulation in Assamese non-harmonic sequences is greater in the anticipatory direction, which is the same direction as in the harmonic sequences. We argue that coarticulatory propensity exhibits a great deal of sensitivity to the nature of contrast in a language.

Mon-O-1-10 : Multimodal and Articulatory Synthesis
E10, 11:00–13:00, Monday, 21 Aug. 2017
Chairs: Ingmar Steiner, Korin Richmond

The Influence of Synthetic Voice on the Evaluation of a Virtual Character

João Paulo Cabral 1, Benjamin R. Cowan 2, Katja Zibrek 1, Rachel McDonnell 1; 1Trinity College Dublin, Ireland; 2University College Dublin, Ireland
Mon-O-1-10-1, Time: 11:00–11:20

Graphical realism and the naturalness of the voice used are important aspects to consider when designing a virtual agent or character. In this work, we evaluate how synthetic speech impacts people's perceptions of a rendered virtual character. Using a controlled experiment, we focus on the role that speech, in particular voice expressiveness in the form of personality, has on the assessment of voice level and character level perceptions. We found that people rated a real human voice as more expressive, understandable and likeable than the expressive synthetic voice we developed. Contrary to our expectations, we found that the voices did not have a significant impact on the character level judgments; people in the voice conditions did not significantly vary on their ratings of appeal, credibility, human-likeness and voice matching the character. The implications this has for character design and how this compares with previous work are discussed.

Articulatory Text-to-Speech Synthesis Using the Digital Waveguide Mesh Driven by a Deep Neural Network

Amelia J. Gully 1, Takenori Yoshimura 2, Damian T. Murphy 1, Kei Hashimoto 2, Yoshihiko Nankaku 2, Keiichi Tokuda 2; 1University of York, UK; 2Nagoya Institute of Technology, Japan
Mon-O-1-10-2, Time: 11:20–11:40

Following recent advances in direct modeling of the speech waveform using a deep neural network, we propose a novel method that directly estimates a physical model of the vocal tract from the speech waveform, rather than from magnetic resonance imaging data. This provides a clear relationship between the model and the size and shape of the vocal tract, offering considerable flexibility in terms of speech characteristics such as age and gender. Initial tests indicate that despite a highly simplified physical model, intelligible synthesized speech is obtained. This illustrates the potential of the combined technique for the control of physical models in general, and hence the generation of more natural-sounding synthetic speech.

An HMM/DNN Comparison for Synchronized Text-to-Speech and Tongue Motion Synthesis

Sébastien Le Maguer, Ingmar Steiner, Alexander Hewer; Universität des Saarlandes, Germany
Mon-O-1-10-3, Time: 11:40–12:00

We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a statistical shape space model of the tongue surface to an articulatory speech corpus and training a speech synthesis system directly on the tongue model parameter weights. We focus our analysis on the application of two standard methodologies, based on Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs), respectively, to train both acoustic models and the tongue model parameter weights. We evaluate both methodologies at every step by comparing the predicted articulatory movements against the reference data. The results show that even with less than 2 h of data, DNNs already outperform HMMs.

VCV Synthesis Using Task Dynamics to Animate a Factor-Based Articulatory Model

Rachel Alexander, Tanner Sorensen, Asterios Toutios, Shrikanth S. Narayanan; University of Southern California, USA
Mon-O-1-10-4, Time: 12:00–12:20

This paper presents an initial architecture for articulatory synthesis which combines a dynamical system for the control of vocal tract shaping with a novel MATLAB implementation of an articulatory synthesizer. The dynamical system controls a speaker-specific vocal tract model derived by factor analysis of mid-sagittal real-time MRI data and provides input to the articulatory synthesizer, which simulates the propagation of sound waves in the vocal tract. First, parameters of the dynamical system are estimated from real-time MRI data of human speech production. Second, vocal-tract dynamics is simulated for vowel-consonant-vowel utterances using a sequence of two dynamical systems: the first one starts from a vowel vocal-tract configuration and achieves a vocal-tract closure; the second one starts from the closure and achieves the target configuration of the second vowel. Third, vocal-tract dynamics is converted to area function dynamics and is input to the synthesizer to generate the acoustic signal. Synthesized vowel-consonant-vowel examples demonstrate the feasibility of the method.
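Task-dynamic control of the kind described here is commonly modelled as a critically damped second-order point attractor that drives a tract variable from its current value toward a target. A minimal Python sketch of such a system (the stiffness, time step and variable names are hypothetical; this is not the paper's MATLAB implementation):

import numpy as np

def point_attractor(z0, target, stiffness=200.0, duration=0.3, dt=0.001):
    """Simulate z'' = -k*(z - target) - 2*sqrt(k)*z' (critically damped).

    Returns the trajectory of the tract variable z over `duration` seconds."""
    damping = 2.0 * np.sqrt(stiffness)       # critical damping: no overshoot
    z, v = z0, 0.0
    trajectory = []
    for _ in range(int(duration / dt)):
        a = -stiffness * (z - target) - damping * v
        v += a * dt                           # simple Euler integration
        z += v * dt
        trajectory.append(z)
    return np.array(trajectory)

# Hypothetical VCV gesture: drive a constriction degree from 8 mm to closure,
# then back to 6 mm for the second vowel.
closing = point_attractor(z0=8.0, target=0.0)
opening = point_attractor(z0=closing[-1], target=6.0)
gesture = np.concatenate([closing, opening])
print(gesture[::100])   # downsampled trajectory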

Beyond the Listening Test: An Interactive Approach to TTS Evaluation

Joseph Mendelson 1, Matthew P. Aylett 2; 1KTH, Sweden; 2CereProc, UK
Mon-O-1-10-5, Time: 12:20–12:40

Traditionally, subjective text-to-speech (TTS) evaluation is performed through audio-only listening tests, where participants evaluate unrelated, context-free utterances. The ecological validity of these tests is questionable, as they do not represent real-world end-use scenarios. In this paper, we examine a novel approach to TTS evaluation in an imagined end-use, via a complex interaction with an avatar. 6 different voice conditions were tested: Natural speech, Unit Selection and Parametric Synthesis, in neutral and expressive realizations. Results were compared to a traditional audio-only evaluation baseline. Participants in both studies rated the voices for naturalness and expressivity. The baseline study showed canonical results for naturalness: Natural speech scored highest, followed by Unit Selection, then Parametric synthesis. Expressivity was clearly distinguishable in all conditions. In the avatar interaction study, participants rated naturalness in the same order as the baseline, though with smaller effect size; expressivity was not distinguishable. Further, no significant correlations were found between cognitive or affective responses and any voice conditions. This highlights 2 primary challenges in designing more valid TTS evaluations: in real-world use-cases involving interaction, listeners generally interact with a single voice, making comparative analysis unfeasible, and in complex interactions, the context and content may confound perception of voice quality.

Integrating Articulatory Information in Deep Learning-Based Text-to-Speech Synthesis

Beiming Cao 1, Myungjong Kim 1, Jan van Santen 2, Ted Mau 3, Jun Wang 1; 1University of Texas at Dallas, USA; 2Oregon Health & Science University, USA; 3UT Southwestern, USA
Mon-O-1-10-6, Time: 12:40–13:00

Articulatory information has been shown to be effective in improving the performance of hidden Markov model (HMM)-based text-to-speech (TTS) synthesis. Recently, deep learning-based TTS has outperformed HMM-based approaches. However, articulatory information has rarely been integrated in deep learning-based TTS. This paper investigated the effectiveness of integrating articulatory movement data into deep learning-based TTS. The integration of articulatory information was achieved in two ways: (1) direct integration, where articulatory and acoustic features were the output of a deep neural network (DNN), and (2) direct integration plus forward-mapping, where the output articulatory features were mapped to acoustic features by an additional DNN; these forward-mapped acoustic features were then combined with the output acoustic features to produce the final acoustic features. Articulatory (tongue and lip) and acoustic data collected from male and female speakers were used in the experiment. Both objective measures and subjective judgment by human listeners showed that the approaches integrating articulatory information outperformed the baseline approach (without using articulatory information) in terms of naturalness and speaker voice identity (voice similarity).

Mon-O-2-1 : Neural Networks for Language Modeling
Aula Magna, 14:30–16:30, Monday, 21 Aug. 2017
Chairs: Tanel Alumäe, Xunying Liu

Approaches for Neural-Network Language Model Adaptation

Min Ma 1, Michael Nirschl 2, Fadi Biadsy 2, Shankar Kumar 2; 1CUNY Graduate Center, USA; 2Google, USA
Mon-O-2-1-1, Time: 14:30–14:50

Language Models (LMs) for Automatic Speech Recognition (ASR) are typically trained on large text corpora from news articles, books and web documents. These types of corpora, however, are unlikely to match the test distribution of ASR systems, which expect spoken utterances. Therefore, the LM is typically adapted to a smaller held-out in-domain dataset that is drawn from the test distribution. We propose three LM adaptation approaches for Deep NN and Long Short-Term Memory (LSTM) models: (1) adapting the softmax layer in the Neural Network (NN); (2) adding a non-linear adaptation layer before the softmax layer that is trained only in the adaptation phase; (3) training the extra non-linear adaptation layer in both the pre-training and adaptation phases. Aiming to improve upon a hierarchical Maximum Entropy (MaxEnt) second-pass LM baseline, which factors the model into word-cluster and word models, we build an NN LM that predicts only word clusters. Adapting the LSTM LM by training the adaptation layer in both training and adaptation phases (Approach 3), we reduce the cluster perplexity by 30% on a held-out dataset compared to an unadapted LSTM LM. Initial experiments using a state-of-the-art ASR system show a 2.3% relative reduction in WER on top of an adapted MaxEnt LM.
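A hedged sketch of approach (2) as described above: an extra non-linear layer inserted before the softmax, with only that layer updated during adaptation. The model dimensions and class names are hypothetical, and the base LM is reduced to a bare embedding plus LSTM for brevity.

import torch
import torch.nn as nn

class AdaptableLM(nn.Module):
    def __init__(self, vocab, emb=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.adapt = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh())
        self.softmax_layer = nn.Linear(hidden, vocab)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.softmax_layer(self.adapt(h))   # logits over the vocabulary

model = AdaptableLM(vocab=10000)

# Adaptation phase: freeze everything except the adaptation layer.
for p in model.parameters():
    p.requires_grad = False
for p in model.adapt.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)

# One hypothetical adaptation step on in-domain data
tokens = torch.randint(0, 10000, (4, 20))          # batch of in-domain word ids
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()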


A Batch Noise Contrastive Estimation Approach for Training Large Vocabulary Language Models

Youssef Oualil, Dietrich Klakow; Universität des Saarlandes, Germany
Mon-O-2-1-2, Time: 14:50–15:10

Training large vocabulary Neural Network Language Models (NNLMs) is a difficult task due to the explicit requirement of output layer normalization, which typically involves the evaluation of the full softmax function over the complete vocabulary. This paper proposes a Batch Noise Contrastive Estimation (B-NCE) approach to alleviate this problem. This is achieved by reducing the vocabulary, at each time step, to the target words in the batch and then replacing the softmax by the noise contrastive estimation approach, where these words play the role of targets and noise samples at the same time. In doing so, the proposed approach can be fully formulated and implemented using optimal dense matrix operations. Applying B-NCE to train different NNLMs on the Large Text Compression Benchmark (LTCB) and the One Billion Word Benchmark (OBWB) shows a significant reduction of the training time with no noticeable degradation of the models' performance. This paper also presents a new baseline comparative study of different standard NNLMs on the large OBWB on a single Titan-X GPU.
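To make the "targets double as noise samples" idea concrete, the sketch below restricts the output layer to the unique target words of the current batch and scores every position against that reduced set with a single dense matrix product. It is a simplified sampled-softmax-style illustration of the batch-sharing trick, not the exact B-NCE objective; all tensor names and sizes are hypothetical.

import torch
import torch.nn.functional as F

def batch_restricted_loss(hidden, targets, output_emb, output_bias):
    """Score each position only against the unique target words in the batch.

    hidden:      (batch, hidden_dim) final hidden states
    targets:     (batch,) target word ids
    output_emb:  (vocab, hidden_dim) output word embeddings
    output_bias: (vocab,) output biases"""
    batch_vocab, remapped = torch.unique(targets, return_inverse=True)
    w = output_emb[batch_vocab]                    # (k, hidden_dim), k <= batch
    b = output_bias[batch_vocab]                   # (k,)
    logits = hidden @ w.t() + b                    # one dense matmul
    return F.cross_entropy(logits, remapped)       # other batch targets act as "noise"

# Hypothetical usage
vocab, dim, batch = 50000, 256, 128
hidden = torch.randn(batch, dim)
targets = torch.randint(0, vocab, (batch,))
out_emb = torch.randn(vocab, dim)
out_bias = torch.zeros(vocab)
print(batch_restricted_loss(hidden, targets, out_emb, out_bias).item())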

Investigating Bidirectional Recurrent Neural Network Language Models for Speech Recognition

X. Chen 1, A. Ragni 1, X. Liu 2, Mark J.F. Gales 1; 1University of Cambridge, UK; 2Chinese University of Hong Kong, China
Mon-O-2-1-3, Time: 15:10–15:30

Recurrent neural network language models (RNNLMs) are powerful language modeling techniques. Significant performance improvements have been reported in a range of tasks including speech recognition compared to n-gram language models. Conventional n-gram and neural network language models are trained to predict the probability of the next word given its preceding context history. In contrast, bidirectional recurrent neural network based language models consider the context from future words as well. This complicates the inference process, but has theoretical benefits for tasks such as speech recognition as additional context information can be used. However, to date, very limited or no gains in speech recognition performance have been reported with this form of model. This paper examines the issues of training bidirectional recurrent neural network language models (bi-RNNLMs) for speech recognition. A bi-RNNLM probability smoothing technique is proposed that addresses the very sharp posteriors that are often observed in these models. The performance of the bi-RNNLMs is evaluated on three speech recognition tasks: broadcast news; meeting transcription (AMI); and low-resource systems (Babel data). On all tasks gains are observed by applying the smoothing technique to the bi-RNNLM. In addition, consistent performance gains can be obtained by combining bi-RNNLMs with n-gram and uni-directional RNNLMs.

Fast Neural Network Language Model Lookups at N-Gram Speeds

Yinghui Huang, Abhinav Sethy, Bhuvana Ramabhadran;IBM, USAMon-O-2-1-4, Time: 15:30–15:50

Feed forward Neural Network Language Models (NNLM) have shownconsistent gains over backoff word n-gram models in a varietyof tasks. However, backoff n-gram models still remain dominantin applications with real time decoding requirements as wordprobabilities can be computed orders of magnitude faster than theNNLM. In this paper, we present a combination of techniques that

allows us to speed up the probability computation from a neural netlanguage model to make it comparable to the word n-gram modelwithout any approximations. We present results on state of the artsystems for Broadcast news transcription and conversational speechwhich demonstrate the speed improvements in real time factor andprobability computation while retaining the WER gains from NNLM.

Empirical Exploration of Novel Architectures and Objectives for Language Models

Gakuto Kurata 1, Abhinav Sethy 2, Bhuvana Ramabhadran 2, George Saon 2; 1IBM, Japan; 2IBM, USA
Mon-O-2-1-5, Time: 15:50–16:10

While recurrent neural network language models based on Long Short-Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks, Convolutional Neural Network (CNN) language models are relatively new and have not been studied in depth. In this paper we present an empirical comparison of LSTM and CNN language models on English broadcast news and various conversational telephone speech transcription tasks. We also present a new type of CNN language model that leverages dilated causal convolution to efficiently exploit long-range history. We propose a novel criterion for training language models that combines word and class prediction in a multi-task learning framework. We apply this criterion to train word- and character-based LSTM language models and CNN language models and show that it improves performance. Our results also show that CNN and LSTM language models are complementary and can be combined to obtain further gains.
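
A minimal sketch of a dilated causal convolution stack for language modeling is shown below (PyTorch, illustrative sizes only; not the authors' model). Left-only padding keeps each convolution causal, while the doubling dilation widens the receptive field over the word history.

```python
import torch
import torch.nn as nn

class DilatedCausalConvLM(nn.Module):
    """Sketch of a CNN language model built from stacked dilated causal
    convolutions covering a long left context (hypothetical sizes)."""
    def __init__(self, vocab=10000, emb=256, channels=256, layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        convs, in_ch = [], emb
        for i in range(layers):
            d = 2 ** i                          # dilation doubles per layer
            convs.append(nn.Conv1d(in_ch, channels, kernel_size=2, dilation=d))
            in_ch = channels
        self.convs = nn.ModuleList(convs)
        self.out = nn.Linear(channels, vocab)

    def forward(self, tokens):                  # tokens: (B, T)
        x = self.embed(tokens).transpose(1, 2)  # (B, C, T)
        for conv in self.convs:
            pad = conv.dilation[0] * (conv.kernel_size[0] - 1)
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))  # pad left only
        return self.out(x.transpose(1, 2))      # next-word logits (B, T, vocab)
```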

Residual Memory Networks in Language Modeling: Improving the Reputation of Feed-Forward Networks

Karel Beneš, Murali Karthick Baskar, Lukáš Burget; Brno University of Technology, Czech Republic
Mon-O-2-1-6, Time: 16:10–16:30

We introduce the Residual Memory Network (RMN) architecture to language modeling. RMN is an architecture of feed-forward neural networks that incorporates residual connections and time-delay connections, allowing us to naturally incorporate information from a substantial time context. As this is the first time RMNs are applied to language modeling, we thoroughly investigate their behaviour on the well-studied Penn Treebank corpus. We change the model slightly for the needs of language modeling, reducing both its time and memory consumption. Our results show that RMN is a suitable choice for small-sized neural language models: with test perplexity 112.7 and as few as 2.3M parameters, they outperform both a much larger vanilla RNN (PPL 124, 8M parameters) and a similarly sized LSTM (PPL 115, 2.08M parameters), while being less than 3 perplexity points worse than an LSTM twice the size.

Mon-O-2-2 : Pathological Speech and Language
A2, 14:30–16:30, Monday, 21 Aug. 2017
Chairs: Heidi Christensen, Rafa Orozco

Dominant Distortion Classification for Pre-Processing of Vowels in Remote Biomedical Voice Analysis

Amir Hossein Poorjam 1, Jesper Rindom Jensen 1, Max A. Little 2, Mads Græsbøll Christensen 1; 1Aalborg University, Denmark; 2MIT, USA
Mon-O-2-2-1, Time: 14:30–14:50

Advances in speech signal analysis facilitate the development of techniques for remote biomedical voice assessment. However, the performance of these techniques is affected by noise and distortion in the signals. In this paper, we focus on the vowel /a/ as the most widely-used voice signal for pathological voice assessments and investigate the impact of four major types of distortion that are commonly present during recording or transmission in voice analysis, namely background noise, reverberation, clipping and compression, on Mel-frequency cepstral coefficients (MFCCs), the most widely-used features in biomedical voice analysis. Then, we propose a new distortion classification approach to detect the most dominant distortion in such voice signals. The proposed method involves MFCCs as frame-level features and a support vector machine as classifier to detect the presence and type of distortion in frames of a given voice signal. Experimental results obtained from healthy and Parkinson's voices show the effectiveness of the proposed approach in distortion detection and classification.

Automatic Paraphasia Detection from Aphasic Speech: A Preliminary Study

Duc Le, Keli Licata, Emily Mower Provost; University of Michigan, USA
Mon-O-2-2-2, Time: 14:50–15:10

Aphasia is an acquired language disorder resulting from brain damage that can cause significant communication difficulties. Aphasic speech is often characterized by errors known as paraphasias, the analysis of which can be used to determine an appropriate course of treatment and to track an individual's recovery progress. Being able to detect paraphasias automatically has many potential clinical benefits; however, this problem has not previously been investigated in the literature. In this paper, we perform the first study on detecting phonemic and neologistic paraphasias from scripted speech samples in AphasiaBank. We propose a speech recognition system with task-specific language models to transcribe aphasic speech automatically. We investigate features based on speech duration, Goodness of Pronunciation, phone edit distance, and Dynamic Time Warping on phoneme posteriorgrams. Our results demonstrate the feasibility of automatic paraphasia detection and outline the path toward enabling this system in real-world clinical applications.

Evaluation of the Neurological State of People with Parkinson's Disease Using i-Vectors

N. Garcia 1, Juan Rafael Orozco-Arroyave 1, L.F. D'Haro 2, Najim Dehak 3, Elmar Nöth 4; 1Universidad de Antioquia, Colombia; 2A*STAR, Singapore; 3Johns Hopkins University, USA; 4FAU Erlangen-Nürnberg, Germany
Mon-O-2-2-3, Time: 15:10–15:30

The i-vector approach is used to model the speech of PD patients with the aim of assessing their condition. Features related to the articulation, phonation, and prosody dimensions of speech were used to train different i-vector extractors. Each i-vector extractor is trained using utterances from both PD patients and healthy controls. The i-vectors of the healthy control (HC) speakers are averaged to form a single i-vector that represents the HC group, i.e., the reference i-vector. A similar process is done to create a reference for the group of PD patients. Then the i-vectors of test speakers are compared to these reference i-vectors using the cosine distance. Three analyses are performed using this distance: classification between PD patients and HC, prediction of the neurological state of PD patients according to the MDS-UPDRS-III scale, and prediction of a modified version of the Frenchay Dysarthria Assessment. The Spearman's correlation between the cosine distance and the MDS-UPDRS-III scale was 0.63. These results show the suitability of this approach to monitor the neurological state of people with Parkinson's disease.
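
The scoring step described above can be pictured with a few lines of NumPy; the concrete score definition and decision rule below are illustrative assumptions, not the exact procedure of the paper.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two i-vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def score_speaker(test_ivec, ref_pd, ref_hc):
    """Compare a test i-vector to the PD and HC reference i-vectors
    (each reference is the mean i-vector of its group). The returned
    scalar could be thresholded for PD/HC classification or correlated
    with clinical scales such as MDS-UPDRS-III."""
    return cosine_distance(test_ivec, ref_hc) - cosine_distance(test_ivec, ref_pd)

# Hypothetical usage with precomputed i-vector matrices (one row per utterance):
# ref_hc = ivectors_hc.mean(axis=0); ref_pd = ivectors_pd.mean(axis=0)
# label = "PD" if score_speaker(x, ref_pd, ref_hc) > 0 else "HC"
```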

Objective Severity Assessment from Disordered Voice Using Estimated Glottal Airflow

Yu-Ren Chien, Michal Borský, Jón Guðnason; Reykjavik University, Iceland
Mon-O-2-2-4, Time: 15:30–15:50

In clinical practice, the severity of disordered voice is typically rated by a professional with auditory-perceptual judgment. The present study aims to automate this assessment procedure, in an attempt to make the assessment objective and less labor-intensive. In the automated analysis, glottal airflow is estimated from the analyzed voice signal with an inverse filtering algorithm. Automatic assessment is realized by a regressor that predicts from temporal and spectral features of the glottal airflow. A regressor trained on overtone amplitudes and harmonic richness factors extracted from a set of continuous-speech utterances was applied to a set of sustained-vowel utterances, giving severity predictions (on a scale of ratings from 0 to 100) with an average error magnitude of 14.

Earlier Identification of Children with Autism Spectrum Disorder: An Automatic Vocalisation-Based Approach

Florian B. Pokorny 1, Björn Schuller 2, Peter B. Marschik 1, Raymond Brueckner 3, Pär Nyström 4, Nicholas Cummins 2, Sven Bölte 5, Christa Einspieler 1, Terje Falck-Ytter 5; 1Medizinische Universität Graz, Austria; 2Universität Passau, Germany; 3Technische Universität München, Germany; 4Uppsala University, Sweden; 5Karolinska Institute, Sweden
Mon-O-2-2-5, Time: 15:50–16:10

Autism spectrum disorder (ASD) is a neurodevelopmental disorder usually diagnosed in or beyond toddlerhood. ASD is defined by repetitive and restricted behaviours, and deficits in social communication. The early speech-language development of individuals with ASD has been characterised as delayed. However, little is known about ASD-related characteristics of pre-linguistic vocalisations at the feature level. In this study, we examined pre-linguistic vocalisations of 10-month-old individuals later diagnosed with ASD and a matched control group of typically developing individuals (N = 20). We segmented 684 vocalisations from parent-child interaction recordings. All vocalisations were annotated and signal-analytically decomposed. We analysed ASD-related vocalisation specificities on the basis of a standardised set (eGeMAPS) of 88 acoustic features selected for clinical speech analysis applications. 54 features showed evidence for a differentiation between vocalisations of individuals later diagnosed with ASD and controls. In addition, we evaluated the feasibility of automated, vocalisation-based identification of individuals later diagnosed with ASD. We compared linear kernel support vector machines and a 1-layer bidirectional long short-term memory neural network. Both classification approaches achieved an accuracy of 75% for subject-wise identification in a subject-independent 3-fold cross-validation scheme. Our promising results may be an important contribution en route to facilitating earlier identification of ASD.

Convolutional Neural Network to Model Articulation Impairments in Patients with Parkinson's Disease

J.C. Vásquez-Correa 1, Juan Rafael Orozco-Arroyave 1, Elmar Nöth 2; 1Universidad de Antioquia, Colombia; 2FAU Erlangen-Nürnberg, Germany
Mon-O-2-2-6, Time: 16:10–16:30

Speech impairments are one of the earliest manifestations in patients with Parkinson's disease. In particular, articulation deficits related to the capability of the speaker to start/stop the vibration of the vocal folds have been observed in the patients. Those difficulties can be assessed by modeling the transitions between voiced and unvoiced segments of speech. A robust strategy to model the articulatory deficits related to starting or stopping the vibration of the vocal folds is proposed in this study. The transitions between voiced and unvoiced segments are modeled by a convolutional neural network that extracts suitable information from two time-frequency representations: the short-time Fourier transform and the continuous wavelet transform. The proposed approach improves the results previously reported in the literature. Accuracies of up to 89% are obtained for the classification of Parkinson's patients vs. healthy speakers. This study is a step towards the robust modeling of speech impairments in patients with neuro-degenerative disorders.

Mon-O-2-4 : Speech Analysis and Representation 1
B4, 14:30–16:30, Monday, 21 Aug. 2017
Chairs: Hema Murthy, Jon Barker

Phone Classification Using a Non-Linear Manifold with Broad Phone Class Dependent DNNs

Linxue Bai, Peter Jancovic, Martin Russell, Philip Weber, Steve Houghton; University of Birmingham, UK
Mon-O-2-4-1, Time: 14:30–14:50

Most state-of-the-art automatic speech recognition (ASR) systems use a single deep neural network (DNN) to map the acoustic space to the decision space. However, different phonetic classes employ different production mechanisms and are best described by different types of features. Hence it may be advantageous to replace this single DNN with several phone-class-dependent DNNs. The appropriate mathematical formalism for this is a manifold. This paper assesses the use of a non-linear manifold structure with multiple DNNs for phone classification. The system has two levels. The first comprises a set of broad phone class (BPC) dependent DNN-based mappings, and the second level is a fusion network. Various ways of designing and training the networks in both levels are assessed, including varying the size of hidden layers, the use of bottleneck or softmax outputs as input to the fusion network, and the use of different broad class definitions. Phone classification experiments are performed on TIMIT. The results show that using the BPC-dependent DNNs provides small but significant improvements in phone classification accuracy relative to a single global DNN. The paper concludes with visualisations of the structures learned by the local and global DNNs and a discussion of their interpretations.

An Investigation of Crowd Speech for Room Occupancy Estimation

Siyuan Chen, Julien Epps, Eliathamby Ambikairajah, Phu Ngoc Le; University of New South Wales, Australia
Mon-O-2-4-2, Time: 14:50–15:10

Room occupancy estimation technology has been shown to reduce building energy cost significantly. However, speech-based occupancy estimation has not been well explored. In this paper, we investigate energy mode and babble speaker count methods for estimating both small and large crowds in a party-mode room setting. We also examine how distance between speakers and microphone affects their estimation accuracies. Then we propose a novel entropy-based method, which is invariant to different speakers and their different positions in a room. Evaluations on synthetic crowd speech generated using the TIMIT corpus show that acoustic volume features are less affected by distance, and our proposed method outperforms existing methods across a range of different conditions.

Time-Frequency Coherence for Periodic-Aperiodic Decomposition of Speech Signals

Karthika Vijayan 1, Jitendra Kumar Dhiman 2, Chandra Sekhar Seelamantula 2; 1NUS, Singapore; 2Indian Institute of Science, India
Mon-O-2-4-3, Time: 15:10–15:30

Decomposing speech signals into periodic and aperiodic components is an important task, finding applications in speech synthesis, coding, denoising, etc. In this paper, we construct a time-frequency coherence function to analyze spectro-temporal signatures of speech signals for distinguishing between deterministic and stochastic components of speech. The narrowband speech spectrogram is segmented into patches, which are represented as 2-D cosine carriers modulated in amplitude and frequency. Separation of the carrier and the amplitude/frequency modulations is achieved by 2-D demodulation using the Riesz transform, which is the 2-D extension of the Hilbert transform. The demodulated AM component reflects contributions of the vocal tract to the spectrogram. The frequency-modulated carrier (FM-carrier) signal exhibits properties of the excitation. The time-frequency coherence is defined with respect to the FM-carrier, and a coherence map is constructed in which highly coherent regions represent nearly periodic and deterministic components of speech, whereas the incoherent regions correspond to unstructured components. The coherence map shows a clear distinction between deterministic and stochastic components in speech characterized by jitter, shimmer, lip radiation, type of excitation, etc. Binary masks prepared from the time-frequency coherence function are used for periodic-aperiodic decomposition of speech. Experimental results are presented to validate the efficiency of the proposed method.

Musical Speech: A New Methodology for Transcribing Speech Prosody

Alexsandro R. Meireles 1, Antônio R.M. Simões 2, Antonio Celso Ribeiro 1, Beatriz Raposo de Medeiros 3; 1Universidade Federal do Espírito Santo, Brazil; 2University of Kansas, USA; 3Universidade de São Paulo, Brazil
Mon-O-2-4-4, Time: 15:30–15:50

Musical Speech is a new methodology for transcribing speech prosody using musical notation. The methodology presented in this paper is an updated version of our work [12]. Our work is situated in a historical context with a brief survey of the literature on speech melodies, in which we highlight the pioneering works of John Steele, Leoš Janáček, Engelbert Humperdinck, and Arnold Schoenberg, followed by a linguistic view of musical notation in the analysis of speech. Finally, we present the current state of the art of our innovative methodology, which uses a quarter-tone scale for transcribing speech, and show some initial results of the application of this methodology to prosodic transcription.

Estimation of Place of Articulation of Fricatives from Spectral Characteristics for Speech Training

K.S. Nataraj, Prem C. Pandey, Hirak Dasgupta; IIT Bombay, India
Mon-O-2-4-5, Time: 15:50–16:10

Visual feedback of the place of articulation is considered to be useful in speech training aids for hearing-impaired children and for learners of second languages, helping them improve their pronunciation. For such applications, the relation between the place of articulation of fricatives and their spectral characteristics is investigated using the English fricatives available in the XRMB database, which provides simultaneously acquired speech signals and articulograms. The place of articulation is estimated from the articulogram as the position of maximum constriction in the oral cavity, using an automated graphical technique. The magnitude spectrum is smoothed by critical-band based median and mean filters to improve the consistency of the spectral parameters. Out of several spectral parameters investigated, spectral moments and spectral slope appear to be related to the place of articulation of the fricative segment of the utterances as measured from the articulogram. The data are used to train and test a Gaussian mixture model to estimate the place of articulation with spectral parameters as the inputs. The estimated values showed a good match with those obtained from the articulograms.

Estimation of the Probability Distribution of Spectral Fine Structure in the Speech Source

Tom Bäckström; Aalto University, Finland
Mon-O-2-4-6, Time: 16:10–16:30

The efficiency of many speech processing methods relies on accurate modeling of the distribution of the signal spectrum, and a majority of prior works suggest that the spectral components follow the Laplace distribution. To improve the probability distribution models based on our knowledge of speech source modeling, we argue that the model should in fact be a multiplicative mixture model, including terms for voiced and unvoiced utterances. While prior works have applied Gaussian mixture models, we demonstrate that a mixture of generalized Gaussian models more accurately follows the observations. The proposed estimation method is based on measuring the ratio of Lp-norms between spectral bands. Such ratios follow the Beta distribution when the input signal is generalized Gaussian, whereby the estimated parameters can be used to determine the underlying parameters of the mixture of generalized Gaussian distributions.

Mon-O-2-6 : Perception of Dialects and L2
C6, 14:30–16:30, Monday, 21 Aug. 2017
Chairs: Marija Tabain, Felicitas Kleber

End-to-End Acoustic Feedback in Language Learning for Correcting Devoiced French Final-Fricatives

Sucheta Ghosh 1, Camille Fauth 2, Yves Laprie 1, Aghilas Sini 1; 1LORIA, France; 2LiLPa, France
Mon-O-2-6-1, Time: 14:30–14:50

This work aims at providing an end-to-end acoustic feedback framework to help learners of French to pronounce voiced fricatives. A classifier ensemble detects voiced/unvoiced utterances, then a correction method is proposed to improve the perception and production of voiced fricatives in word-final position. Realizations of voiced fricatives contained in French sentences uttered by French and German speakers were analyzed to find the deviations between the acoustic cues realized by the two groups of speakers. The correction method consists in substituting the erroneous devoiced fricative by TD-PSOLA concatenative synthesis that uses exemplars of voiced fricatives chosen from a French speaker corpus. To achieve a seamless concatenation, the energy of the replacement fricative was adjusted with respect to the energy levels of the learner's and the French speaker's preceding vowels. Finally, a perception experiment with the corrected stimuli was carried out with French native speakers to check the appropriateness of the fricative revoicing. The results showed that the proposed revoicing strategy is very efficient and can be used as acoustic feedback.

Dialect Perception by Older Children

Ewa Jacewicz, Robert A. Fox; Ohio State University, USA
Mon-O-2-6-2, Time: 14:50–15:10

The acquisition of regional dialect variation is an inherent part of the language learning process that takes place in the specific environments in which the child participates. This study examined dialect perception by 9–12-year-olds who grew up in two very diverse dialect regions in the United States, Western North Carolina (NC) and Southeastern Wisconsin (WI). In a dialect identification task, each group of children responded to 120 talkers from the same dialects representing three generations, ranging in age from old adults to children. There was a robust discrepancy in the children's dialect identification performance: WI children were able to identify talker dialect quite well (although still not as well as the adults), whereas NC children were at chance level. WI children were also more sensitive to cross-generational changes in both dialects as a function of diachronic sound change. It is concluded that both groups of children demonstrated their sociolinguistic awareness in very different ways, corresponding to the relatively stable (WI) and changing (NC) socio-cultural environments in their respective speech communities.

Perception of Non-Contrastive Variations in American English by Japanese Learners: Flaps are Less Favored Than Stops

Kiyoko Yoneyama 1, Mafuyu Kitahara 2, Keiichi Tajima 3; 1Daito Bunka University, Japan; 2Sophia University, Japan; 3Hosei University, Japan
Mon-O-2-6-3, Time: 15:10–15:30

Alveolar flaps are non-contrastive allophonic variants of alveolar stops in American English. A lexical decision experiment was conducted with Japanese learners of English (JE) to investigate whether second-language (L2) learners are sensitive to such allophonic variations when recognizing words in the L2. The stimuli consisted of 36 isolated bisyllabic English words containing word-medial /t/, half of which were flap-favored words, e.g. city, and the other half were [t]-favored words, e.g. faster. All stimuli were recorded with two surface forms: /t/ as a flap, e.g. city with a flap, or as [t], e.g. city with [t]. The stimuli were counterbalanced so that participants only heard one of the two surface forms of each word. The accuracy data indicated that flap-favored words pronounced with a flap, e.g. city with a flap, were recognized significantly less accurately than flap-favored words with [t], e.g. city with [t], and [t]-favored words with [t], e.g. faster with [t]. These results suggest that JE learners prefer canonical forms over frequent forms produced with context-dependent allophonic variations. These results are inconsistent with previous studies that found native speakers' preference for frequent forms, and highlight differences in the effect of allophonic variations on the perception of native-language and L2 speech.

L1 Perceptions of L2 Prosody: The Interplay Between Intonation, Rhythm, and Speech Rate and Their Contribution to Accentedness and Comprehensibility

Lieke van Maastricht 1, Tim Zee 2, Emiel Krahmer 1, Marc Swerts 1; 1Tilburg University, The Netherlands; 2Radboud Universiteit Nijmegen, The Netherlands
Mon-O-2-6-4, Time: 15:30–15:50

This study investigates the cumulative effect of (non-)native intonation, rhythm, and speech rate in utterances produced by Spanish learners of Dutch on Dutch native listeners' perceptions. In order to assess the relative contribution of these language-specific properties to perceived accentedness and comprehensibility, speech produced by Spanish learners of Dutch was manipulated using transplantation and resynthesis techniques. Thus, eight manipulation conditions reflecting all possible combinations of L1 and L2 intonation, rhythm, and speech rate were created, resulting in 320 utterances that were rated by 50 Dutch natives on their degree of foreign accent and ease of comprehensibility.

Our analyses show that all manipulations result in lower accentedness and higher comprehensibility ratings. Moreover, the two measures are not affected in the same way by different combinations of prosodic features: for accentedness, Dutch listeners appear most influenced by intonation, and by intonation combined with speech rate. This holds for comprehensibility ratings as well, but here the combination of all three properties, including rhythm, also significantly affects ratings by native speakers. Thus, our study reaffirms the importance of differentiating between different aspects of perception and provides insight into the features that are most likely to affect how native speakers perceive second language learners.

Effects of Pitch Fall and L1 on Vowel Length Identification in L2 Japanese

Izumi Takiguchi; Bunkyo Gakuin University, Japan
Mon-O-2-6-5, Time: 15:50–16:10

This study investigated whether and how the role of pitch fall in the first language (L1) interacts with its use as a cue for Japanese phonological vowel length in the second language (L2). Native listeners of Japanese (NJ) and L2 learners of Japanese with L1 backgrounds in Mandarin Chinese (NC), Seoul Korean (NK), American English (NE), and French (NFr) participated in a perception experiment. The results showed that the proportion of “long” responses increased as a function of vowel duration for all groups, giving s-shaped curves. Meanwhile, the presence or absence of a pitch fall within a syllable affected only the perception of the NJ and NC groups. Their category boundary occurred at a shorter duration for vowels with a pitch fall than for vowels without one. Among the four groups of L2 learners, only the NC group uses pitch fall to distinguish words in the L1. Thus, it is plausible that the role of pitch fall as an L1 cue relates to its use as a cue for L2 length identification. L2 learners tend to attend to an important phonetic feature as a cue for perceiving an L1 category differentiating L1 words even in the L2, as implied by the Feature Hypothesis.

A Preliminary Study of Prosodic Disambiguation by Chinese EFL Learners

Yuanyuan Zhang, Hongwei Ding; Shanghai Jiao Tong University, China
Mon-O-2-6-6, Time: 16:10–16:30

This study investigated whether Chinese learners of English as a foreign language (EFL learners hereafter) could use prosodic cues to resolve syntactically ambiguous sentences in English. Eight sentences with three types of syntactic ambiguity were adopted: far/near PP attachment, left/right word attachment, and wide/narrow scope. In the production experiment, 15 Chinese college students who had passed the annual national examination CET (College English Test) Band 4, and 5 native English speakers from America, were recruited. They were asked to read the 8 target sentences after hearing contexts spoken by a native American English speaker, which clarified the intended meaning of the ambiguous sentences. The preliminary results showed that, as the native speakers did, Chinese EFL learners employed different durational patterns to express the alternative meanings of the ambiguous sentences by altering prosodic phrasing: the duration of the pre-boundary items was lengthened and pauses were inserted at the boundary. However, the perception experiment showed that the utterances produced by the Chinese EFL learners could not be effectively perceived by the native speakers, due to their different use of pre-boundary lengthening and pause. The conclusion is that Chinese EFL learners find prosodic disambiguation difficult.

Mon-O-2-10 : Far-field Speech Recognition
E10, 14:30–16:30, Monday, 21 Aug. 2017
Chairs: Thomas Hain, Zheng-Hua Tan

Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home

Chanwoo Kim, Ananya Misra, Kean Chin, Thad Hughes, Arun Narayanan, Tara N. Sainath, Michiel Bacchiani; Google, USA
Mon-O-2-10-1, Time: 14:30–14:50

We describe the structure and application of an acoustic room simulator to generate large-scale simulated data for training deep neural networks for far-field speech recognition. The system simulates millions of different room dimensions, a wide distribution of reverberation times and signal-to-noise ratios, and a range of microphone and sound source locations. We start with a relatively clean training set as the source and artificially create simulated data by randomly sampling a noise configuration for every new training example. As a result, the acoustic model is trained using examples that are virtually never repeated. We evaluate the performance of this approach based on room simulation using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in our earlier work, which uses CFFT layers and LSTM AMs for joint multichannel processing and acoustic modeling. Results show that the simulator-driven approach is quite effective in obtaining large improvements not only in simulated test conditions, but also in real/rerecorded conditions. This room simulation system has been employed in training acoustic models, including the ones for the recently released Google Home.
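
The per-example sampling idea can be sketched as follows (NumPy). The impulse response here is a toy exponentially decaying noise tail standing in for a real image-method room simulator, and all parameter ranges are invented for illustration; none of this reproduces the authors' system.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_rir(rt60, sr=16000):
    """Stand-in for a simulated room impulse response: exponentially
    decaying white noise with roughly the requested reverberation time."""
    n = max(int(rt60 * sr), 1)
    decay = np.exp(-6.9 * np.arange(n) / n)      # ~60 dB decay over rt60
    return rng.standard_normal(n) * decay

def simulate_example(clean, noise, sr=16000):
    """Sample a fresh reverberation time and SNR for every training example,
    so the acoustic model virtually never sees the same condition twice."""
    rt60 = rng.uniform(0.1, 0.9)                 # seconds (assumed range)
    snr_db = rng.uniform(0, 20)                  # dB (assumed range)
    reverberant = np.convolve(clean, toy_rir(rt60, sr))[:len(clean)]
    noise = np.resize(noise, len(reverberant))   # repeat noise if too short
    gain = np.sqrt(np.mean(reverberant**2) /
                   (np.mean(noise**2) * 10 ** (snr_db / 10)))
    return reverberant + gain * noise
```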

Neural Network-Based Spectrum Estimation for Online WPE Dereverberation

Keisuke Kinoshita 1, Marc Delcroix 1, Haeyong Kwon 2, Takuma Mori 1, Tomohiro Nakatani 1; 1NTT, Japan; 2Sogang University, Korea
Mon-O-2-10-2, Time: 14:50–15:10

In this paper, we propose a novel speech dereverberation framework that utilizes deep neural network (DNN)-based spectrum estimation to construct linear inverse filters. The proposed dereverberation framework is based on the state-of-the-art inverse filter estimation algorithm called the weighted prediction error (WPE) algorithm, which is known to effectively reduce reverberation and greatly boost ASR performance in various conditions. In WPE, the accuracy of the inverse filter estimation, and thus the dereverberation performance, is largely dependent on the estimation of the power spectral density (PSD) of the target signal. Therefore, the conventional WPE iteratively performs the inverse filter estimation, the actual dereverberation, and the PSD estimation to gradually improve the PSD estimate. However, while such an iterative procedure works well when sufficiently long acoustically-stationary observed signals are available, WPE's performance degrades when the duration of observed/accessible data is short, which is typically the case for real-time applications using online block-batch processing with small batches. To solve this problem, we incorporate a DNN-based spectrum estimator into the framework of WPE, because a DNN can estimate the PSD robustly even from very short observed data. We experimentally show that the proposed framework outperforms the conventional WPE and improves ASR performance in real noisy reverberant environments in both single-channel and multichannel cases.
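
For orientation, a single WPE filter-estimation step for one channel is sketched below in NumPy, with the target PSD supplied externally (in the framework described above it would come from the DNN rather than from iterating). This is a simplified illustration, not the authors' implementation, and the delay and tap counts are arbitrary.

```python
import numpy as np

def wpe_filter_single_channel(X, psd, delay=3, taps=10):
    """One WPE filter estimation and dereverberation pass per frequency bin.
    X:   (F, T) complex STFT of the observed (reverberant) signal.
    psd: (F, T) estimate of the target power spectral density.
    Returns the dereverberated STFT (F, T)."""
    n_freq, n_frames = X.shape
    D = np.copy(X)
    for f in range(n_freq):
        # Stack delayed observations: Xt[k, t] = X[f, t - delay - k]
        Xt = np.zeros((taps, n_frames), dtype=complex)
        for k in range(taps):
            shift = delay + k
            Xt[k, shift:] = X[f, :n_frames - shift]
        w = 1.0 / np.maximum(psd[f], 1e-10)          # per-frame weights 1/PSD
        R = (Xt * w) @ Xt.conj().T                   # weighted correlation matrix
        r = (Xt * w) @ X[f].conj()                   # weighted cross-correlation
        g = np.linalg.solve(R + 1e-6 * np.eye(taps), r)
        D[f] = X[f] - g.conj() @ Xt                  # subtract predicted late reverb
    return D
```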

Factorial Modeling for Effective Suppression of Directional Noise

Osamu Ichikawa 1, Takashi Fukuda 1, Gakuto Kurata 1, Steven J. Rennie 2; 1IBM, Japan; 2IBM, USA
Mon-O-2-10-3, Time: 15:10–15:30

The assumed scenario is transcription of a face-to-face conversation, such as in the financial industry, where an agent and a customer talk over a desk with microphones placed between the speakers and the conversation is then transcribed. From the automatic speech recognition (ASR) perspective, one of the speakers is the target speaker, and the other speaker is a directional noise source. When the number of microphones is small, we often accept microphone intervals that are larger than the spatial aliasing limit because the performance of the beamformer is better. Unfortunately, such a configuration results in significant leakage of directional noise in certain frequency bands, because spatial aliasing makes the beamformer and post-filter inaccurate there. Thus, we introduce a factorial model to compensate only the degraded bands with information from the reliable bands, in a probabilistic framework integrating our proposed metrics and a speech model. In our experiments, the proposed method reduced the errors from 29.8% to 24.9%.

On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones

Yan-Hui Tu 1, Jun Du 1, Lei Sun 1, Feng Ma 2, Chin-Hui Lee 3; 1USTC, China; 2iFLYTEK, China; 3Georgia Institute of Technology, USA
Mon-O-2-10-4, Time: 15:30–15:50

We design a novel deep learning framework for multi-channel speech recognition in two respects. First, for the front-end, an iterative mask estimation (IME) approach based on deep learning is presented to improve the beamforming approach based on the conventional complex Gaussian mixture model (CGMM). Second, for the back-end, deep convolutional neural networks (DCNNs), with augmentation of both noisy and beamformed training data, are adopted for acoustic modeling, while forward and backward long short-term memory recurrent neural networks (LSTM-RNNs) are used for language modeling. The proposed framework can be quite effective for multi-channel speech recognition with random combinations of fixed microphones. Tested on the CHiME-4 Challenge speech recognition task with a single set of acoustic and language models, our approach achieves the best performance on all three tracks (1-channel, 2-channel, and 6-channel) among submitted systems.

Acoustic Modeling for Google Home

Bo Li, Tara N. Sainath, Arun Narayanan, Joe Caroselli, Michiel Bacchiani, Ananya Misra, Izhak Shafran, Hasim Sak, Golan Pundak, Kean Chin, Khe Chai Sim, Ron J. Weiss, Kevin W. Wilson, Ehsan Variani, Chanwoo Kim, Olivier Siohan, Mitchel Weintraub, Erik McDermott, Richard Rose, Matt Shannon; Google, USA
Mon-O-2-10-5, Time: 15:50–16:10

This paper describes the technical and system-building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that do multichannel processing jointly with acoustic modeling, and Grid-LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home specific data. We present results on a variety of multichannel sets. The combination of technical and system advances results in a reduction of WER of 8–28% relative compared to the current production system.

On Multi-Domain Training and Adaptation of End-to-End RNN Acoustic Models for Distant Speech Recognition

Seyedmahdad Mirsamadi, John H.L. Hansen; University of Texas at Dallas, USA
Mon-O-2-10-6, Time: 16:10–16:30

Recognition of distant (far-field) speech is a challenge for ASR due to mismatch in recording conditions resulting from room reverberation and environment noise. Given the remarkable learning capacity of deep neural networks, there is increasing interest in addressing this problem by using a large corpus of reverberant far-field speech to train robust models. In this study, we explore how an end-to-end RNN acoustic model trained on speech from different rooms and acoustic conditions (different domains) achieves robustness to environmental variations. It is shown that the first hidden layer acts as a domain separator, projecting the data from different domains into different subspaces. The subsequent layers then use this encoded domain knowledge to map these features to final representations that are invariant to domain change. This mechanism is closely related to noise-aware or room-aware approaches which append manually-extracted domain signatures to the input features. Additionally, we demonstrate how this understanding of the learning procedure provides useful guidance for model adaptation to new acoustic conditions. We present results based on the AMI corpus to demonstrate the propagation of domain information in a deep RNN, and perform recognition experiments which indicate the role of encoded domain knowledge in the training and adaptation of RNN acoustic models.

Mon-P-1-1 : Speech Analysis and Representation 2
Poster 1, 11:00–13:00, Monday, 21 Aug. 2017
Chair: Sekhar Seelamantula

Low-Dimensional Representation of Spectral Envelope Without Deterioration for Full-Band Speech Analysis/Synthesis System

Masanori Morise, Genta Miyashita, Kenji Ozawa; University of Yamanashi, Japan
Mon-P-1-1-1, Time: 11:00–13:00

A speech coding for a full-band speech analysis/synthesis system is described. In this work, full-band speech is defined as speech with a sampling frequency above 40 kHz, whose Nyquist frequency covers the audible frequency range. In prior works, speech coding has generally focused on narrow-band speech with a sampling frequency below 16 kHz. On the other hand, statistical parametric speech synthesis currently uses full-band speech, and low-dimensional representation of speech parameters is being used. The purpose of this study is to achieve speech coding without deterioration for full-band speech. We focus on a high-quality speech analysis/synthesis system and mel-cepstral analysis using frequency warping. In the frequency warping function, we directly use three auditory scales. We carried out a subjective evaluation using the WORLD vocoder and found that the optimum number of dimensions was around 50. The kind of frequency warping did not significantly affect the sound quality at these dimensionalities.

Robust Source-Filter Separation of Speech Signal in the Phase Domain

Erfan Loweimi, Jon Barker, Oscar Saz Torralba, Thomas Hain; University of Sheffield, UK
Mon-P-1-1-2, Time: 11:00–13:00

In earlier work we proposed a framework for speech source-filter separation that employs phase-based signal processing. This paper presents a further theoretical investigation of the model and optimisations that make the filter and source representations less sensitive to the effects of noise and better matched to downstream processing. To this end, first, in computing the Hilbert transform, the log function is replaced by the generalised logarithmic function. This introduces a tuning parameter that adjusts both the dynamic range and the distribution of the phase-based representation. Second, when computing the group delay, a more robust estimate for the derivative is formed by applying a regression filter instead of using sample differences. The effectiveness of these modifications is evaluated in clean and noisy conditions by considering the accuracy of the fundamental frequency extracted from the estimated source, and the performance of speech recognition features extracted from the estimated filter. In particular, the proposed filter-based front-end reduces Aurora-2 WERs by 6.3% (average over 0–20 dB) compared with previously reported results. Furthermore, when tested on an LVCSR task (Aurora-4), the new features resulted in a 5.8% absolute WER reduction compared to MFCCs, without performance loss in the clean/matched condition.
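
The two modifications can be sketched compactly in NumPy: a generalised logarithm with a tuning parameter gamma, and a group-delay estimate obtained from a local linear-regression filter rather than sample differences. Parameter values are illustrative, and this is not the authors' implementation.

```python
import numpy as np

def generalized_log(x, gamma=0.1):
    """Generalised logarithmic function: (x**gamma - 1)/gamma, which reduces
    to log(x) as gamma -> 0; gamma tunes dynamic range and distribution."""
    x = np.maximum(x, 1e-12)
    return np.log(x) if gamma == 0 else (x ** gamma - 1.0) / gamma

def group_delay_regression(phase, half_len=3):
    """Group delay as the negative local slope of the unwrapped phase,
    estimated with a least-squares regression filter over 2*half_len+1
    samples instead of a simple first difference."""
    k = np.arange(-half_len, half_len + 1)
    slope_filter = -k / np.sum(k ** 2)          # linear-regression slope taps
    slope = np.convolve(phase, slope_filter, mode="same")
    return -slope
```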

A Time-Warping Pitch Tracking Algorithm Considering Fast f0 Changes

Simon Stone, Peter Steiner, Peter Birkholz; Technische Universität Dresden, Germany
Mon-P-1-1-3, Time: 11:00–13:00

Accurately tracking the fundamental frequency (f0) or pitch in speech data is of great interest in numerous contexts. All currently available pitch tracking algorithms perform a short-term analysis of a speech signal to extract the f0 under the assumption that the pitch does not change within a single analysis frame, a simplification that introduces errors when the f0 changes rather quickly over time. This paper proposes a new algorithm that warps the time axis of an analysis frame to counteract intra-frame f0 changes and thus to improve the overall tracking results. The algorithm was evaluated on a set of 4718 sentences from 20 speakers (10 male, 10 female), with added white and babble noise. It was comparable in performance to the state-of-the-art algorithms RAPT and PRAAT's To Pitch (ac) under clean conditions and outperformed both of them under noisy conditions.

A Modulation Property of Time-Frequency Derivatives of Filtered Phase and its Application to Aperiodicity and fo Estimation

Hideki Kawahara 1, Ken-Ichi Sakakibara 2, Masanori Morise 3, Hideki Banno 4, Tomoki Toda 5; 1Wakayama University, Japan; 2Health Science University of Hokkaido, Japan; 3University of Yamanashi, Japan; 4Meijo University, Japan; 5Nagoya University, Japan
Mon-P-1-1-4, Time: 11:00–13:00

We introduce a simple and linear SNR (strictly speaking, periodic-to-random power ratio) estimator (0 dB to 80 dB without additional calibration/linearization) for providing reliable descriptions of aperiodicity in speech corpora. The main idea of this method is to estimate the background random noise level without directly extracting the background noise. The proposed method is applicable to a wide variety of time windowing functions with very low sidelobe levels. The estimate combines the frequency derivative and the time-frequency derivative of the mapping from filter center frequency to the output instantaneous frequency. This procedure can replace the periodicity detection and aperiodicity estimation subsystems of the recently introduced open source vocoder, YANG vocoder. Source code of a MATLAB implementation of this method will also be open sourced.

Non-Local Estimation of Speech Signal for Vowel Onset Point Detection in Varied Environments

Avinash Kumar, S. Shahnawazuddin, Gayadhar Pradhan; NIT Patna, India
Mon-P-1-1-5, Time: 11:00–13:00

The vowel onset point (VOP) is an important piece of information extensively employed in speech analysis and synthesis. Detecting the VOPs in a given speech sequence, independent of the text contexts and recording environments, is a challenging area of research. The performance of existing VOP detection methods has not yet been extensively studied in varied environmental conditions. In this paper, we exploit non-local means estimation to detect those regions in the speech sequence which have high signal-to-noise ratio and exhibit periodicity. Mostly, those regions happen to be the vowel regions. This helps in overcoming the ill effects of environmental degradation. Next, for each short-time frame of the estimated speech sequence, we cumulatively sum the magnitude of the corresponding Fourier transform spectrum. The cumulative sum is then used as the feature to detect the VOPs. Experiments conducted on the TIMIT database show that the proposed approach provides better results in terms of detection and spurious rates when compared to a few existing methods under clean and noisy test conditions.
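
A toy version of the frame-level evidence and a simple peak-picking stage is sketched below in NumPy; the non-local means pre-estimation stage is omitted, and the threshold and frame sizes are arbitrary choices for illustration rather than the paper's settings.

```python
import numpy as np

def vop_evidence(speech, sr=16000, frame_ms=20, hop_ms=10):
    """Frame-wise VOP evidence: the summed magnitude of the Fourier
    spectrum of each short-time frame, normalised to [0, 1]."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    win = np.hamming(frame)
    ev = []
    for start in range(0, len(speech) - frame, hop):
        spec = np.abs(np.fft.rfft(speech[start:start + frame] * win))
        ev.append(np.sum(spec))          # cumulative sum over all frequency bins
    ev = np.asarray(ev)
    return ev / (ev.max() + 1e-12)

def pick_vops(evidence, threshold=0.4):
    """Candidate VOPs at frames where the evidence first rises above a
    threshold (a simple stand-in for the paper's detection stage)."""
    above = evidence > threshold
    return np.flatnonzero(above[1:] & ~above[:-1]) + 1
```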

Time-Domain Envelope Modulating the Noise Component of Excitation in a Continuous Residual-Based Vocoder for Statistical Parametric Speech Synthesis

Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh; BME, Hungary
Mon-P-1-1-6, Time: 11:00–13:00

In this paper, we present an extension of a novel continuous residual-based vocoder for statistical parametric speech synthesis. Previous work has shown the advantages of adding envelope-modulated noise to the voiced excitation, but this has not yet been investigated in the context of continuous vocoders, i.e. vocoders in which all parameters are continuous. The noise component is often not accurately modeled in modern vocoders (e.g. STRAIGHT). For more natural-sounding speech synthesis, four time-domain envelopes (Amplitude, Hilbert, Triangular and True) are investigated and enhanced, and then applied to the noise component of the excitation in our continuous vocoder. The performance evaluation is based on the study of time envelopes. In an objective experiment, we investigated the Phase Distortion Deviation of the vocoded samples. A MUSHRA-type subjective listening test was also conducted comparing natural and vocoded speech samples. Both experiments show that the proposed framework using the Hilbert and True envelopes provides high-quality vocoding while outperforming the two other envelopes.
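
As an illustration of one of the four envelopes, the sketch below (SciPy) shapes white noise with the Hilbert envelope of the voiced excitation; it is a minimal stand-in, not the vocoder's actual noise model.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope_noise(voiced_excitation, seed=0):
    """Shape unit-variance white noise with the Hilbert envelope of the
    voiced excitation (illustrative sketch of envelope-modulated noise)."""
    rng = np.random.default_rng(seed)
    envelope = np.abs(hilbert(voiced_excitation))
    noise = rng.standard_normal(len(voiced_excitation))
    return envelope * noise

# e.g. excitation = voiced_part + hilbert_envelope_noise(voiced_part)
```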

Wavelet Speech Enhancement Based on Robust Principal Component Analysis

Chia-Lung Wu 1, Hsiang-Ping Hsu 1, Syu-Siang Wang 2, Jeih-Weih Hung 3, Ying-Hui Lai 4, Hsin-Min Wang 2, Yu Tsao 2; 1Ministry of Justice, Taiwan; 2Academia Sinica, Taiwan; 3National Chi Nan University, Taiwan; 4Yuan Ze University, Taiwan
Mon-P-1-1-7, Time: 11:00–13:00

Most state-of-the-art speech enhancement (SE) techniques prefer to enhance utterances in the frequency domain rather than in the time domain. However, the overlap-add (OLA) operation in the short-time Fourier transform (STFT) for speech signal processing can distort the signal and limits the performance of SE techniques. In this study, a novel SE method that integrates the discrete wavelet packet transform (DWPT) and a subspace-based method, robust principal component analysis (RPCA), is proposed to enhance noise-corrupted signals directly in the time domain. We evaluate the proposed SE method on the Mandarin hearing in noise test (MHINT) sentences. The experimental results show that the new method reduces signal distortion dramatically, thereby improving speech quality and intelligibility significantly. In addition, the newly proposed method outperforms the STFT-RPCA-based speech enhancement system.

Vowel Onset Point Detection Using Sonority Information

Bidisha Sharma, S.R. Mahadeva Prasanna; IIT Guwahati, India
Mon-P-1-1-8, Time: 11:00–13:00

The vowel onset point (VOP) refers to the starting event of a vowel, which may be reflected in different aspects of the speech signal. The major issue in VOP detection using existing methods is the confusion between vowels and the other categories of sounds preceding them. This work explores the usefulness of sonority information to reduce this confusion and improve VOP detection. Vowels are the most sonorant sounds, followed by semivowels, nasals, voiced fricatives, and voiced stops. The sonority feature is derived from the vocal-tract system, the excitation source, and suprasegmental aspects. As this feature has the capability to discriminate among different sonorant sound units, it reduces the confusion between the onsets of vowels and those of other sonorant sounds. This results in improved detection and resolution of VOP detection for continuous speech. The performance of the proposed sonority-based VOP detection is found to be 92.4%, compared to 85.2% for the existing method. Also, the resolution of localizing VOPs within 10 ms is significantly enhanced, and a performance of 73.0% is achieved, as opposed to 60.2% by the existing method.

Analytic Filter Bank for Speech Analysis, Feature Extraction and Perceptual Studies

Unto K. Laine; Aalto University, Finland
Mon-P-1-1-9, Time: 11:00–13:00

Speech signals consist of events in time and frequency, and therefore their analysis with high-resolution time-frequency tools is often of importance. An analytic filter bank provides a simple, fast, and flexible method to construct time-frequency representations of signals. Its parameters can be easily adapted to different situations, from a uniform scale to any auditory frequency scale, or even to a focused resolution. Since the Hilbert magnitude values of the channels are obtained at every sample, it provides a practical tool for high-resolution time-frequency analysis.

The present study describes the basic theory of analytic filters and tests their main properties. Applications of the analytic filter bank to different speech analysis tasks, including pitch period estimation and pitch-synchronous analysis of formant frequencies and bandwidths, are demonstrated. In addition, a new feature vector called the group delay vector is introduced. It is shown that this representation provides comparable, or even better, results than those obtained by spectral magnitude feature vectors in the analysis and classification of vowels. The implications of this observation are also discussed from the speech perception point of view.
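
A crude approximation of such a filter bank can be built from FIR band-pass filters followed by the Hilbert transform, as sketched below with SciPy. Uniform channel spacing and bandwidth are assumed for simplicity; the paper's analytic filter design is not reproduced exactly.

```python
import numpy as np
from scipy.signal import firwin, lfilter, hilbert

def analytic_filter_bank(signal, sr, centers, bandwidth=100.0, numtaps=255):
    """Each channel: a linear-phase band-pass FIR filter followed by the
    Hilbert transform, so per-sample magnitude (envelope) and phase are
    available in every channel. Returns complex channel outputs (C, N)."""
    channels = []
    for fc in centers:
        lo = max(fc - bandwidth / 2, 1.0)
        hi = min(fc + bandwidth / 2, sr / 2 - 1.0)
        band = firwin(numtaps, [lo, hi], pass_zero=False, fs=sr)
        channels.append(hilbert(lfilter(band, 1.0, signal)))
    return np.vstack(channels)

# Example: per-sample Hilbert magnitudes on a uniform 100 Hz grid
# mags = np.abs(analytic_filter_bank(x, 16000, np.arange(100, 7900, 100)))
```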

Learning the Mapping Function from Voltage Amplitudes to Sensor Positions in 3D-EMA Using Deep Neural Networks

Christian Kroos, Mark D. Plumbley; University of Surrey, UK
Mon-P-1-1-10, Time: 11:00–13:00

The first generation of three-dimensional Electromagnetic Articulography devices (Carstens AG500) suffered from occasional critical tracking failures. Although now superseded by new devices, the AG500 is still in use in many speech labs and many valuable data sets exist. In this study we investigate whether deep neural networks (DNNs) can learn the mapping function from raw voltage amplitudes to sensor positions based on a comprehensive movement data set. This is compared to arriving sample by sample at individual position values via direct optimisation, as used in previous methods. We found that with appropriate hyperparameter settings a DNN was able to approximate the mapping function with good accuracy, leading to a smaller error than the previous methods, but that the DNN-based approach was not able to solve the tracking problem completely.

Mon-P-1-2 : Speech and Audio Segmentation and Classification 2
Poster 2, 11:00–13:00, Monday, 21 Aug. 2017
Chair: Hugo Van hamme

Multilingual i-Vector Based Statistical Modeling for Music Genre Classification

Jia Dai, Wei Xue, Wenju Liu; Chinese Academy of Sciences, China
Mon-P-1-2-1, Time: 11:00–13:00

For music signal processing, when long-term features are considered, the time-series characteristics of the music signal can be represented better than with a strategy that models each short-time frame independently. As a typical long-term modeling strategy, the identification vector (i-vector) uses statistical modeling to model the audio signal at the segment level. It can better capture the important elements of the music signal, and these important elements may benefit the classification of the music signal. In this paper, i-vector based statistical features for music genre classification are explored. In addition, to learn more of the important elements of the music signal, a new multilingual i-vector feature is proposed based on a multilingual model. The experimental results show that multilingual i-vector based models can achieve better classification performance than conventional short-time modeling based methods.

Indoor/Outdoor Audio Classification Using Foreground Speech Segmentation

Banriskhem K. Khonglah 1, K.T. Deepak 2, S.R. Mahadeva Prasanna 1; 1IIT Guwahati, India; 2IIIT Dharwad, India
Mon-P-1-2-2, Time: 11:00–13:00

The task of indoor/outdoor audio classification using foreground speech segmentation is attempted in this work. Foreground speech segmentation uses features to segment between foreground speech and background interfering sources like noise. Initially, the foreground and background segments are obtained from foreground speech segmentation by using the normalized autocorrelation peak strength (NAPS) of the zero frequency filtered signal (ZFFS) as a feature. The background segments are then considered for determining whether a particular segment is an indoor or outdoor audio sample. Mel frequency cepstral coefficients are obtained from the background segments of both the indoor and outdoor audio samples and are used to train a Support Vector Machine (SVM) classifier. The use of foreground speech segmentation gives a promising performance on the indoor/outdoor audio classification task.

Attention Based CLDNNs for Short-Duration Acoustic Scene Classification

Jinxi Guo 1, Ning Xu 2, Li-Jia Li 3, Abeer Alwan 1; 1University of California at Los Angeles, USA; 2Snap, USA; 3Google, USA
Mon-P-1-2-3, Time: 11:00–13:00

Recently, neural networks with deep architectures have been widely applied to acoustic scene classification. Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory Networks (LSTMs) have shown improvements over fully connected Deep Neural Networks (DNNs). Motivated by the fact that CNNs, LSTMs and DNNs are complementary in their modeling capability, we apply the CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Network) framework to short-duration acoustic scene classification in a unified architecture. The CLDNNs take advantage of frequency modeling with CNNs, temporal modeling with LSTMs, and discriminative training with DNNs. Based on the CLDNN architecture, several novel attention-based mechanisms are proposed and applied on the LSTM layer to predict the importance of each time step. We evaluate the proposed method on the truncated version of the 2016 TUT acoustic scenes dataset, which consists of recordings from 15 different scenes. By using CLDNNs with bidirectional LSTMs, we achieve higher performance compared to conventional neural network architectures. Moreover, by combining the attention-weighted output with the LSTM final time step output, significant further improvement can be achieved.
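
One simple form of attention over LSTM time steps, combined with the final time-step output as mentioned above, could look like the following PyTorch sketch (illustrative only, not the authors' exact mechanisms).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Weight each LSTM time step with a learned scorer and concatenate the
    attention-weighted summary with the final time-step output."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, lstm_out):                                   # (B, T, H)
        weights = F.softmax(self.scorer(lstm_out).squeeze(-1), dim=1)      # (B, T)
        context = torch.bmm(weights.unsqueeze(1), lstm_out).squeeze(1)     # (B, H)
        return torch.cat([context, lstm_out[:, -1]], dim=-1)               # (B, 2H)
```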

Frame-Wise Dynamic Threshold Based Polyphonic Acoustic Event Detection

Xianjun Xia 1, Roberto Togneri 1, Ferdous Sohel 2, David Huang 1; 1University of Western Australia, Australia; 2Murdoch University, Australia
Mon-P-1-2-4, Time: 11:00–13:00

Acoustic event detection, the determination of the acoustic event type and the localisation of the event, has been widely applied in many real-world applications. Many works adopt multi-label classification techniques to perform polyphonic acoustic event detection with a global threshold to detect the active acoustic events. However, the global threshold has to be set manually and is highly dependent on the database being tested. To deal with this, we replace the fixed threshold method with a frame-wise dynamic threshold approach in this paper. Two novel approaches, namely contour-based and regressor-based dynamic threshold approaches, are proposed in this work. Experimental results on the popular TUT Acoustic Scenes 2016 database of polyphonic events demonstrate the superior performance of the proposed approaches.

Enhanced Feature Extraction for Speech Detection in Media Audio

Inseon Jang 1, ChungHyun Ahn 1, Jeongil Seo 1, Younseon Jang 2; 1ETRI, Korea; 2Chungnam National University, Korea
Mon-P-1-2-5, Time: 11:00–13:00

Speech detection is an important first step for audio analysis of media content, whose goal is to discriminate the presence of speech from non-speech. It remains a challenge owing to the various sound sources included in media audio. In this work, we present a novel audio feature extraction method to reflect the acoustic characteristics of media audio in the time-frequency domain. Since the degree of combination of harmonic and percussive components varies depending on the type of sound source, audio features which further distinguish between speech and non-speech can be obtained by decomposing the signal into both components. For the evaluation, we use over 20 hours of drama manually annotated for speech detection, as well as 4 full-length movies with annotations released to the research community, whose total length is over 8 hours. Experimental results with a deep neural network show the superior performance of the proposed method in media audio conditions.

Audio Classification Using Class-Specific Learned Descriptors

Sukanya Sonowal 1, Tushar Sandhan 2, Inkyu Choi 2, Nam Soo Kim 2; 1Samsung Electronics, Korea; 2Seoul National University, Korea
Mon-P-1-2-6, Time: 11:00–13:00

This paper presents a classification scheme for audio signals using high-level feature descriptors. The descriptor is designed to capture the relevance of each acoustic feature group (a feature set such as mel-frequency cepstral coefficients, perceptual features, etc.) in recognizing an audio class. For this, a bank of RVM classifiers is modeled for each ‘audio class’–‘feature group’ pair. The responses of an input signal to this bank of RVM classifiers form the entries of the descriptor. Each entry of the descriptor thus measures the proximity of the input signal to an audio class based on a single feature group. This form of signal representation offers two advantages. First, it helps to determine the effectiveness of each feature group in classifying a specific audio class. Second, the descriptor offers higher discriminability than the low-level feature groups, and a simple SVM classifier trained on the descriptor produces better performance than several state-of-the-art methods.

Hidden Markov Model Variational Autoencoder forAcoustic Unit Discovery

Janek Ebbers 1, Jahn Heymann 1, Lukas Drude 1, ThomasGlarner 1, Reinhold Haeb-Umbach 1, Bhiksha Raj 2;1Universität Paderborn, Germany; 2Carnegie MellonUniversity, USAMon-P-1-2-7, Time: 11:00–13:00

Variational Autoencoders (VAEs) have been shown to provide efficient neural-network-based approximate Bayesian inference for observation models for which exact inference is intractable. Its extension, the so-called Structured VAE (SVAE), allows inference in the presence of both discrete and continuous latent variables. Inspired by this extension, we developed a VAE with Hidden Markov Models (HMMs) as latent models. We applied the resulting HMM-VAE to the task of acoustic unit discovery in a zero resource scenario. Starting from an initial model based on variational inference in an HMM with Gaussian Mixture Model (GMM) emission probabilities, the accuracy of the acoustic unit discovery could be significantly improved by
the HMM-VAE. In doing so we were able to demonstrate for an unsupervised learning task what is well-known in the supervised learning case: neural networks provide superior modeling power compared to GMMs.

Virtual Adversarial Training and Data Augmentation for Acoustic Event Detection with Gated Recurrent Neural Networks

Matthias Zöhrer, Franz Pernkopf; Technische Universität Graz, Austria
Mon-P-1-2-8, Time: 11:00–13:00

In this paper, we use gated recurrent neural networks (GRNNs) for efficiently detecting environmental events of the IEEE Detection and Classification of Acoustic Scenes and Events challenge (DCASE2016). For this acoustic event detection task, data is limited. Therefore, we propose data augmentation such as on-the-fly shuffling and virtual adversarial training for regularization of the GRNNs. Both improve the performance of the GRNNs. We obtain a segment-based error rate of 0.59 and an F-score of 58.6%.

Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi

Michael McAuliffe 1, Michaela Socolof 2, Sarah Mihuc 1, Michael Wagner 1, Morgan Sonderegger 1; 1McGill University, Canada; 2University of Maryland, USA
Mon-P-1-2-9, Time: 11:00–13:00

We present the Montreal Forced Aligner (MFA), a new open-source system for speech-text alignment. MFA is an update to the Prosodylab-Aligner, and maintains its key functionality of trainability on new data, as well as incorporating improved architecture (triphone acoustic models and speaker adaptation), and other features. MFA uses Kaldi instead of HTK, allowing MFA to be distributed as a stand-alone package, and to exploit parallel processing for computationally-intensive training and scaling to larger datasets. We evaluate MFA’s performance on aligning word and phone boundaries in English conversational and laboratory speech, relative to human-annotated boundaries, focusing on the effects of aligner architecture and training on the data to be aligned. MFA performs well relative to two existing open-source aligners with simpler architecture (Prosodylab-Aligner and FAVE), and both its improved architecture and training on the data to be aligned generally result in more accurate boundaries.

A Robust Voiced/Unvoiced Phoneme Classification from Whispered Speech Using the ‘Color’ of Whispered Phonemes and Deep Neural Network

G. Nisha Meenakshi, Prasanta Kumar Ghosh; Indian Institute of Science, India
Mon-P-1-2-10, Time: 11:00–13:00

In this work, we propose a robust method to perform frame-level classification of voiced (V) and unvoiced (UV) phonemes from whispered speech, a challenging task due to its voiceless and noise-like nature. We hypothesize that a whispered speech spectrum can be represented as a linear combination of a set of colored noise spectra. A five-dimensional (5D) feature is computed by employing non-negative matrix factorization with a fixed basis dictionary, constructed using the spectra of five colored noises. A Deep Neural Network (DNN) is used as the classifier. We consider two baseline features: 1) Mel Frequency Cepstral Coefficients (MFCC), and 2) features computed from a data driven dictionary. Experiments reveal that the features from the colored noise dictionary perform better (on average) than those from the data driven dictionary, with a relative improvement in the average V/UV accuracy of 10.30% within, and 10.41% across, data from seven subjects. We also find that the MFCCs and 5D features carry complementary information regarding the nature of voicing decisions in whispered speech. Hence, across all subjects, we obtain a balanced frame-level V/UV classification performance when MFCC and 5D features are combined, compared to a skewed performance when they are considered separately.
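
The 5D feature can be pictured as NMF activations computed against a fixed basis. The sketch below keeps the basis W fixed and updates only the activations H with multiplicative updates; the random stand-ins for the five colored-noise spectra and the KL-style update rule are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def activations_fixed_basis(V, W, n_iter=200, eps=1e-10):
    """NMF activations H for spectrogram V with a fixed basis W.

    V: (n_bins, n_frames) magnitude spectrogram; W: (n_bins, 5) fixed
    colored-noise basis spectra. Only H is updated (multiplicative,
    KL-style updates), so every frame is summarised by a 5D activation vector.
    """
    H = np.random.rand(W.shape[1], V.shape[1]) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return H                                   # (5, n_frames) per-frame 5D feature

# toy example with random stand-ins for the five colored-noise spectra
rng = np.random.default_rng(0)
W = rng.random((257, 5)) + 1e-3
V = rng.random((257, 100))
print(activations_fixed_basis(V, W).shape)     # (5, 100)
```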

Mon-P-1-4 : Search, Computational Strategies and Language Modeling
Poster 4, 11:00–13:00, Monday, 21 Aug. 2017
Chair: György Szaszák

Rescoring-Aware Beam Search for Reduced Search Errors in Contextual Automatic Speech Recognition

Ian Williams, Petar Aleksic; Google, USA
Mon-P-1-4-1, Time: 11:00–13:00

Using context in automatic speech recognition allows the recognition system to dynamically task-adapt and bring gains to a broad variety of use-cases. An important mechanism of context-inclusion is on-the-fly rescoring of hypotheses with contextual language model content available only in real-time.

In systems where rescoring occurs on the lattice during its construction as part of beam search decoding, hypotheses eligible for rescoring may be missed due to pruning. This can happen for many reasons: the language model and rescoring model may assign significantly different scores, there may be a lot of noise in the utterance, or word prefixes with a high out-degree may necessitate aggressive pruning to keep the search tractable. This results in misrecognitions when contextually-relevant hypotheses are pruned before rescoring, even if a contextual rescoring model favors those hypotheses by a large margin.

We present a technique to adapt the beam search algorithm to preserve hypotheses when they may benefit from rescoring. We show that this technique significantly reduces the number of search pruning errors on rescorable hypotheses, without a significant increase in the search space size. This technique makes it feasible to use one base language model, but still achieve high-accuracy speech recognition results in all contexts.

Comparison of Decoding Strategies for CTC Acoustic Models

Thomas Zenkel 1, Ramon Sanabria 1, Florian Metze 1, Jan Niehues 2, Matthias Sperber 2, Sebastian Stüker 2, Alex Waibel 1; 1Carnegie Mellon University, USA; 2KIT, Germany
Mon-P-1-4-2, Time: 11:00–13:00

Connectionist Temporal Classification has recently attracted a lot of interest as it offers an elegant approach to building acoustic models (AMs) for speech recognition. The CTC loss function maps an input sequence of observable feature vectors to an output sequence of symbols. Output symbols are conditionally independent of each other under CTC loss, so a language model (LM) can be incorporated conveniently during decoding, retaining the traditional separation of acoustic and linguistic components in ASR.

For fixed vocabularies, Weighted Finite State Transducers provide a strong baseline for efficient integration of CTC AMs with n-gram LMs. Character-based neural LMs provide a straightforward solution for open vocabulary speech recognition and all-neural models, and can
be decoded with beam search. Finally, sequence-to-sequence models can be used to translate a sequence of individual sounds into a word string.

We compare the performance of these three approaches, and analyze their error patterns, which provides insightful guidance for future research and development in this important area.
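
For contrast with the three decoding strategies compared here, a best-path (greedy) CTC decoder is the simplest possible baseline: take the argmax symbol per frame, collapse repeats, and drop blanks. The sketch below assumes a toy three-symbol inventory.

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Best-path (greedy) CTC decoding.

    log_probs: (n_frames, n_symbols) frame-level log posteriors.
    Pick the argmax symbol per frame, collapse repeats, drop blanks.
    """
    best_path = log_probs.argmax(axis=1)
    decoded, prev = [], blank
    for s in best_path:
        if s != prev and s != blank:
            decoded.append(int(s))
        prev = s
    return decoded

# toy example: 6 frames over the inventory {blank=0, 'a'=1, 'b'=2}
logp = np.log(np.array([[0.6, 0.3, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.7, 0.2, 0.1],
                        [0.1, 0.1, 0.8],
                        [0.8, 0.1, 0.1]]))
print(ctc_greedy_decode(logp))   # [1, 2] -> "ab"
```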

Phone Duration Modeling for LVCSR Using Neural Networks

Hossein Hadian 1, Daniel Povey 2, Hossein Sameti 1, Sanjeev Khudanpur 2; 1Sharif University of Technology, Iran; 2Johns Hopkins University, USA
Mon-P-1-4-3, Time: 11:00–13:00

We describe our work on incorporating probabilities of phone durations, learned by a neural net, into an ASR system. Phone durations are incorporated via lattice rescoring. The input features are derived from the phone identities of a context window of phones, plus the durations of preceding phones within that window. Unlike some previous work, our network outputs the probability of different durations (in frames) directly, up to a fixed limit. We evaluate this method on several large vocabulary tasks, and while we consistently see improvements in Word Error Rates, the improvements are smaller when the lattices are generated with neural net based acoustic models.

Towards Better Decoding and Language Model Integration in Sequence to Sequence Models

Jan Chorowski 1, Navdeep Jaitly 2; 1Google, USA; 2NVIDIA, USA
Mon-P-1-4-4, Time: 11:00–13:00

The recently proposed Sequence-to-Sequence (seq2seq) framework advocates replacing complex data processing pipelines, such as an entire automatic speech recognition system, with a single neural network trained in an end-to-end fashion. In this contribution, we analyse an attention-based seq2seq speech recognition system that directly transcribes recordings into characters. We observe two shortcomings: overconfidence in its predictions and a tendency to produce incomplete transcriptions when language models are used. We propose practical solutions to both problems, achieving competitive speaker independent word error rates on the Wall Street Journal dataset: without separate language models we reach 10.6% WER, while together with a trigram language model, we reach 6.7% WER, a state-of-the-art result for HMM-free methods.

Empirical Evaluation of Parallel Training Algorithms on Acoustic Modeling

Wenpeng Li 1, Binbin Zhang 1, Lei Xie 1, Dong Yu 2; 1Northwestern Polytechnical University, China; 2Tencent AI Lab, USA
Mon-P-1-4-5, Time: 11:00–13:00

Deep learning models (DLMs) are state-of-the-art techniques in speech recognition. However, training good DLMs can be time consuming, especially for production-size models and corpora. Although several parallel training algorithms have been proposed to improve training efficiency, there is no clear guidance on which one to choose for the task at hand due to a lack of systematic and fair comparison among them. In this paper we aim at filling this gap by comparing four popular parallel training algorithms in speech recognition, namely asynchronous stochastic gradient descent (ASGD), blockwise model-update filtering (BMUF), bulk synchronous parallel (BSP) and elastic averaging stochastic gradient descent (EASGD), on the 1000-hour LibriSpeech corpus using feed-forward deep neural networks (DNNs) and convolutional, long short-term memory, DNNs (CLDNNs). Based on our experiments, we recommend using BMUF as the top choice to train acoustic models since it is most stable, scales well with the number of GPUs, can achieve reproducible results, and in many cases even outperforms single-GPU SGD. ASGD can be used as a substitute in some cases.
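
The BMUF update that the paper recommends can be summarised in a few lines: average the workers' models, filter the aggregated update with block momentum, and broadcast a look-ahead model. The sketch below follows the commonly published form of the rule; the hyper-parameter values and the toy "model" are illustrative assumptions.

```python
import numpy as np

def bmuf_update(global_w, prev_delta, worker_ws, block_momentum=0.9, block_lr=1.0):
    """One blockwise model-update filtering (BMUF) step.

    worker_ws: parameter vectors after each worker's local SGD block.
    The aggregated update G is filtered with block momentum before being
    applied to the global model; the broadcast model adds a look-ahead term.
    """
    w_bar = np.mean(worker_ws, axis=0)                   # model averaging
    G = w_bar - global_w                                 # aggregated block update
    delta = block_momentum * prev_delta + block_lr * G   # block-level update filtering
    new_global = global_w + delta
    broadcast = new_global + block_momentum * delta      # model sent back to the workers
    return new_global, delta, broadcast

# toy usage: 4 workers, 3-parameter "model"
g, d = np.zeros(3), np.zeros(3)
workers = [np.array([0.10, 0.20, -0.10]) + 0.01 * i for i in range(4)]
g, d, bcast = bmuf_update(g, d, workers)
print(g, bcast)
```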

Binary Deep Neural Networks for Speech Recognition

Xu Xiang, Yanmin Qian, Kai Yu; Shanghai Jiao Tong University, China
Mon-P-1-4-6, Time: 11:00–13:00

Deep neural networks (DNNs) are widely used in most current automatic speech recognition (ASR) systems. To guarantee good recognition performance, DNNs usually require significant computational resources, which limits their application to low-power devices. Thus, it is appealing to reduce the computational cost while keeping the accuracy. In this work, in light of the success in image recognition, binary DNNs are utilized in speech recognition, which can achieve competitive performance and a substantial speed up. To our knowledge, this is the first time that binary DNNs have been used in speech recognition. For binary DNNs, network weights and activations are constrained to be binary values, which enables faster matrix multiplication based on bit operations. By exploiting hardware population count instructions, the proposed binary matrix multiplication can achieve a 5∼7 times speed up compared with highly optimized floating-point matrix multiplication. This results in much faster DNN inference, since matrix multiplication is the most computationally expensive operation. Experiments on both TIMIT phone recognition and a 50-hour Switchboard speech recognition task show that binary DNNs can run about 4 times faster than standard DNNs during inference, with roughly 10.0% relative accuracy reduction.
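
The bit-operation trick behind binary matrix multiplication rests on the identity dot(x, y) = n − 2·popcount(x XOR y) for vectors with entries in {−1, +1}. A minimal Python illustration follows, where `bin(...).count("1")` stands in for the hardware popcount instruction.

```python
import numpy as np

def pack(signs):
    """Pack a {-1, +1} vector into an integer bit mask (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n_bits):
    """dot(x, y) = n_bits - 2 * popcount(x XOR y) for x, y in {-1, +1}^n."""
    return n_bits - 2 * bin(a_bits ^ b_bits).count("1")

x = np.sign(np.random.randn(64)); x[x == 0] = 1
w = np.sign(np.random.randn(64)); w[w == 0] = 1
assert binary_dot(pack(x), pack(w), 64) == int(np.dot(x, w))
```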

Hierarchical Constrained Bayesian Optimization for Feature, Acoustic Model and Decoder Parameter Optimization

Akshay Chandrashekaran, Ian Lane; Carnegie Mellon University, USA
Mon-P-1-4-7, Time: 11:00–13:00

We describe the implementation of a hierarchical constrained Bayesian Optimization algorithm and its application to joint optimization of features, acoustic model structure and decoding parameters for deep neural network (DNN)-based large vocabulary continuous speech recognition (LVCSR) systems. Within our hierarchical optimization method, we perform constrained Bayesian optimization jointly over feature hyper-parameters and acoustic model structure in the first level, and then perform an iteration of constrained Bayesian optimization for the decoder hyper-parameters in the second. We show that the proposed hierarchical optimization method can generate a model with higher performance than a manually optimized system on a server platform. Furthermore, we demonstrate that the proposed framework can be used to automatically build real-time speech recognition systems for graphics processing unit (GPU)-enabled embedded platforms that retain similar accuracy to a server platform, while running with constrained computing resources.

Use of Global and Acoustic Features Associated with Contextual Factors to Adapt Language Models for Spontaneous Speech Recognition

Shohei Toyama, Daisuke Saito, Nobuaki Minematsu; University of Tokyo, Japan
Mon-P-1-4-8, Time: 11:00–13:00

In this study, we propose a new method of adapting language models for speech recognition using para-linguistic and extra-linguistic
features in speech. When we talk with others, we often change our lexical choices and speaking style according to various contextual factors. This fact indicates that the performance of automatic speech recognition can be improved by taking the contextual factors into account, which can be estimated from speech acoustics. In this study, we attempt to find global and acoustic features that are associated with those contextual factors, then integrate those features into Recurrent Neural Network (RNN) language models for speech recognition. In experiments using Japanese spontaneous speech corpora, we examine how i-vector and openSMILE features are associated with contextual factors. Then, we use those features in the reranking process of RNN-based language models. Results show that perplexity is reduced by 16% relative and word error rate is reduced by 2.1% relative for highly emotional speech.

Joint Learning of Correlated Sequence Labeling Tasks Using Bidirectional Recurrent Neural Networks

Vardaan Pahuja 1, Anirban Laha 1, Shachar Mirkin 2, Vikas Raykar 1, Lili Kotlerman 2, Guy Lev 2; 1IBM, India; 2IBM, Israel
Mon-P-1-4-9, Time: 11:00–13:00

The stream of words produced by Automatic Speech Recognition (ASR) systems is typically devoid of punctuation and formatting. Most natural language processing applications expect segmented and well-formatted texts as input, which is not available in ASR output. This paper proposes a novel technique for jointly modeling multiple correlated tasks such as punctuation and capitalization using bidirectional recurrent neural networks, which leads to improved performance for each of these tasks. This method could be extended for joint modeling of any other correlated sequence labeling tasks.
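
A minimal sketch of joint sequence labeling with a shared bidirectional recurrent encoder and two task-specific heads, written here in PyTorch; the label inventories, layer sizes and unweighted loss sum are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class JointPunctCapModel(nn.Module):
    """Shared BiLSTM encoder with two task-specific heads:
    one for punctuation labels, one for capitalization labels."""

    def __init__(self, vocab_size, emb_dim=128, hidden=256, n_punct=4, n_case=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.punct_head = nn.Linear(2 * hidden, n_punct)   # e.g. none , . ?
        self.case_head = nn.Linear(2 * hidden, n_case)     # e.g. lower Init UPPER

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.punct_head(h), self.case_head(h)

model = JointPunctCapModel(vocab_size=10000)
tokens = torch.randint(0, 10000, (2, 12))                  # batch of 2 word sequences
punct_logits, case_logits = model(tokens)
loss = (nn.functional.cross_entropy(punct_logits.transpose(1, 2),
                                    torch.randint(0, 4, (2, 12)))
        + nn.functional.cross_entropy(case_logits.transpose(1, 2),
                                      torch.randint(0, 3, (2, 12))))
loss.backward()                                            # both tasks train the shared encoder
```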

Estimation of Gap Between Current Language Models and Human Performance

Xiaoyu Shen, Youssef Oualil, Clayton Greenberg, Mittul Singh, Dietrich Klakow; Universität des Saarlandes, Germany
Mon-P-1-4-10, Time: 11:00–13:00

Language models (LMs) have improved dramatically in the past years due to the wide application of neural networks. This raises the question of how far we are from the perfect language model and how much more research is needed in language modelling. With respect to perplexity, giving a value for human perplexity (as an upper bound of what can reasonably be expected from an LM) is difficult. Word error rate (WER) has the disadvantage that it also measures the quality of other components of a speech recognizer, such as the acoustic model and the feature extraction. We therefore suggest evaluating LMs in a generative setting (which has been done before on selected hand-picked examples) and running a human evaluation on the generated sentences. The results imply that LMs need about 10 to 20 more years of research before human performance is reached. Moreover, we show that the human judgement scores on the generated sentences and perplexity are closely correlated. This leads to an estimated perplexity of 12 for an LM that would be able to pass the human judgement test in the setting we suggested.

A Phonological Phrase Sequence Modelling Approach for Resource Efficient and Robust Real-Time Punctuation Recovery

Anna Moró, György Szaszák; BME, Hungary
Mon-P-1-4-11, Time: 11:00–13:00

For the automatic punctuation of Automatic Speech Recognition (ASR) output, both prosodic and text based features are used, often

in combination. Pure prosody based approaches usually have low computation needs, introduce little latency (delay) and are also more robust to ASR errors. Text based approaches usually yield better performance; they are, however, resource demanding (both regarding their training and computational needs), often introduce high time latency and are more sensitive to ASR errors. The present paper proposes a lightweight prosody based punctuation approach following a new paradigm: we argue in favour of an all-inclusive modelling of speech prosody instead of just relying on distinct acoustic markers. First, the entire phonological phrase structure is reconstructed; then its close correlation with punctuation is exploited in a sequence modelling approach with recurrent neural networks. With this tiny and easy to implement model we reach performance in Hungarian punctuation comparable to large, text based models for other languages, while keeping resource requirements minimal and suitable for real-time operation with low latency.

Mon-P-2-1 : Speech Perception
Poster 1, 14:30–16:30, Monday, 21 Aug. 2017
Chair: Louis ten Bosch

Factors Affecting the Intelligibility of Low-Pass Filtered Speech

Lei Wang, Fei Chen; SUSTech, China
Mon-P-2-1-1, Time: 14:30–16:30

Frequency compression is an effective alternative to conventional hearing aid amplification for patients with severe-to-profound middle- and high-frequency hearing loss and with some low-frequency residual hearing. In order to develop novel frequency compression strategies, it is important to first understand the mechanism for recognizing low-pass filtered speech, which simulates high-frequency hearing loss. The present work investigated three factors affecting the intelligibility of low-pass filtered speech, i.e., vowels, temporal fine-structure, and fundamental frequency (F0) contour. Mandarin sentences were processed to generate three types (i.e., vowel-only, fine-structure-only, and F0-contour-flattened) of low-pass filtered stimuli. Listening experiments with normal-hearing listeners showed that among the three factors assessed, the vowel-only low-pass filtered speech was the most intelligible, followed by the fine-structure-based low-pass filtered speech. Flattening the F0 contour significantly deteriorated the intelligibility of low-pass filtered speech.

Phonetic Restoration of Temporally Reversed Speech

Shi-yu Wang, Fei Chen; SUSTech, China
Mon-P-2-1-2, Time: 14:30–16:30

An early study showed that temporally reversed speech may still be very intelligible. The present work further assessed the role of acoustic cues accounting for the intelligibility of temporally reversed speech. Mandarin sentences were edited to be temporally reversed. Experiment 1 preserved the original consonant segments, and experiment 2 only preserved the temporally reversed fine-structure waveform. Experimental results with normal-hearing listeners showed that for Mandarin speech, listeners could still perfectly understand the temporally reversed speech with a reversion duration up to 50 ms. Preserving original consonant segments did not significantly improve the intelligibility of the temporally reversed speech, suggesting that the reversion processing applied to vowels largely affected the intelligibility of temporally reversed speech. When the local short-time envelope waveform was removed, listeners could still understand stimuli with primarily temporally reversed fine-structure waveform, suggesting a perceptual role of the temporally reversed fine-structure in the intelligibility of temporally reversed speech.

Simultaneous Articulatory and Acoustic Distortion in L1 and L2 Listening: Locally Time-Reversed “Fast” Speech

Mako Ishida; Sophia University, Japan
Mon-P-2-1-3, Time: 14:30–16:30

The current study explores how native and non-native speakers cope with simultaneous articulatory and acoustic distortion in speech perception. The articulatory distortion was generated by asking a speaker to articulate target speech as fast as possible (fast speech). The acoustic distortion was created by dividing speech signals into small segments of equal duration (e.g., 50 ms) from the onset of speech, flipping every segment on the temporal axis, and putting them back together (locally time-reversed speech). This study explored how intelligible “locally time-reversed fast speech” was as compared to the “locally time-reversed normal speech” measured in Ishida, Samuel, and Arai (2016). Participants were native English speakers and native Japanese speakers who spoke English as a second language. They listened to English words and pseudowords that contained many stop consonants. These items were spoken fast and locally time-reversed at every 10, 20, 30, 40, 50, or 60 ms. In general, “locally time-reversed fast speech” became gradually unintelligible as the length of reversed segments increased. Native speakers generally understood locally time-reversed fast spoken words well but not pseudowords, while non-native speakers hardly understood either words or pseudowords. Language proficiency strongly supported the perceptual restoration of locally time-reversed fast speech.

Lexically Guided Perceptual Learning in Mandarin Chinese

L. Ann Burchfield 1, San-hei Kenny Luk 2, Mark Antoniou 1, Anne Cutler 1; 1Western Sydney University, Australia; 2McMaster University, Canada
Mon-P-2-1-4, Time: 14:30–16:30

Lexically guided perceptual learning refers to the use of lexical knowledge to retune speech categories and thereby adapt to a novel talker’s pronunciation. This adaptation has been extensively documented, but primarily for segmental-based learning in English and Dutch. In languages with lexical tone, such as Mandarin Chinese, tonal categories can also be retuned in this way, but segmental category retuning had not been studied. We report two experiments in which Mandarin Chinese listeners were exposed to an ambiguous mixture of [f] and [s] in lexical contexts favoring an interpretation as either [f] or [s]. Listeners were subsequently more likely to identify sounds along a continuum between [f] and [s], and to interpret minimal word pairs, in a manner consistent with this exposure. Thus lexically guided perceptual learning of segmental categories had indeed taken place, consistent with suggestions that such learning may be a universally available adaptation process.

The Effect of Spectral Profile on the Intelligibility of Emotional Speech in Noise

Chris Davis, Chee Seng Chong, Jeesun Kim; Western Sydney University, Australia
Mon-P-2-1-5, Time: 14:30–16:30

The current study investigated why the intelligibility of expressive speech in noise varies as a function of the emotion expressed (e.g., happiness being more intelligible than sadness), even though the signal-to-noise ratio is the same. We tested the straightforward proposal that the expression of some emotions affects speech intelligibility by shifting spectral energy above the energy profile of the noise masker. This was done by determining how the spectral profile of speech is affected by different emotional expressions using three different expressive speech databases. We then examined if these changes were correlated with scores produced by an objective intelligibility metric. We found a relatively consistent shift in spectral energy for different emotions across the databases and a high correlation between the extent of these changes and the objective intelligibility scores. Moreover, the pattern of intelligibility scores is consistent with human perception studies (although there was considerable individual variation). We suggest that the intelligibility of emotional speech in noise is simply related to its audibility, as conditioned by the effect that the expression of emotion has on its spectral profile.

Whether Long-Term Tracking of Speech Rate Affects Perception Depends on Who is Talking

Merel Maslowski, Antje S. Meyer, Hans Rutger Bosker; MPI for Psycholinguistics, The Netherlands
Mon-P-2-1-6, Time: 14:30–16:30

Speech rate is known to modulate perception of temporally ambiguous speech sounds. For instance, a vowel may be perceived as short when the immediate speech context is slow, but as long when the context is fast. Yet, effects of long-term tracking of speech rate are largely unexplored. Two experiments tested whether long-term tracking of rate influences perception of the temporal Dutch vowel contrast /A/-/a:/. In Experiment 1, one low-rate group listened to ‘neutral’ rate speech from talker A and to slow speech from talker B. Another high-rate group was exposed to the same neutral speech from A, but to fast speech from B. Between-group comparison of the ‘neutral’ trials revealed that the low-rate group reported a higher proportion of /a:/ in A’s ‘neutral’ speech, indicating that A sounded faster when B was slow. Experiment 2 tested whether one’s own speech rate also contributes to effects of long-term tracking of rate. Here, talker B’s speech was replaced by playback of participants’ own fast or slow speech. No evidence was found that one’s own voice affected perception of talker A in larger speech contexts. These results carry implications for our understanding of the mechanisms involved in rate-dependent speech perception and of dialogue.

Emotional Thin-Slicing: A Proposal for a Short- and Long-Term Division of Emotional Speech

Daniel Oliveira Peres 1, Dominic Watt 2, Waldemar Ferreira Netto 1; 1Universidade de São Paulo, Brazil; 2University of York, UK
Mon-P-2-1-7, Time: 14:30–16:30

Human listeners are adept at successfully recovering linguistically- and socially-relevant information from very brief utterances. Studies using the ‘thin-slicing’ approach show that accurate judgments of the speaker’s emotional state can be made from minimal quantities of speech. The present experiment tested the performance of listeners exposed to thin-sliced samples of spoken Brazilian Portuguese selected to exemplify four emotions (anger, fear, sadness, happiness). Rather than attaching verbal labels to the audio samples, participants were asked to pair the excerpts with averaged facial images illustrating the four emotion categories. Half of the listeners were native speakers of Brazilian Portuguese, while the others were native English speakers who knew no Portuguese. Both groups of participants were found to be accurate and consistent in assigning the audio samples to the expected emotion category, but some emotions were more reliably identified than others. Fear was misidentified most frequently. We conclude that the phonetic cues to speakers’ emotional states are sufficiently salient and differentiated that listeners need only a few syllables upon which to base judgments, and that as a species we owe our perceptual sensitivity in this area to the survival value of being able to make rapid decisions concerning the psychological states of others.

Predicting Epenthetic Vowel Quality from Acoustics

Adriana Guevara-Rukoz 1, Erika Parlato-Oliveira 2, Shi Yu 1, Yuki Hirose 3, Sharon Peperkamp 1, Emmanuel Dupoux 1; 1ENS, France; 2Universidade Federal de Minas Gerais, Brazil; 3University of Tokyo, Japan
Mon-P-2-1-8, Time: 14:30–16:30

Past research has shown that sound sequences not permitted in our native language may be distorted by our perceptual system. A well-documented example is vowel epenthesis, a phenomenon by which listeners hallucinate non-existent vowels within illegal consonantal sequences. As reported in previous work, this occurs for instance in Japanese (JP) and Brazilian Portuguese (BP), languages for which the ‘default’ epenthetic vowels are /u/ and /i/, respectively. In a perceptual experiment, we corroborate the finding that the quality of this illusory vowel is language-dependent, but also that this default choice can be overridden by coarticulatory information present on the consonant cluster. In a second step, we analyse recordings of JP and BP speakers producing ‘epenthesized’ versions of stimuli from the perceptual task. Results reveal that the default vowel corresponds to the vowel with the most reduced acoustic characteristics and whose formants are acoustically closest to the formant transitions present in consonantal clusters. Lastly, we model behavioural responses from the perceptual experiment with an exemplar model using dynamic time warping (DTW)-based similarity measures on MFCCs.
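
A DTW-based similarity of the kind used in the exemplar model can be computed directly on MFCC matrices with librosa; the path-length normalisation and the random toy signals below are illustrative assumptions.

```python
import numpy as np
import librosa

def mfcc_dtw_distance(y1, y2, sr=16000, n_mfcc=13):
    """DTW-based dissimilarity between two utterances, computed on MFCCs.

    Returns the accumulated DTW cost at the end of the optimal path,
    normalised by the path length.
    """
    m1 = librosa.feature.mfcc(y=y1, sr=sr, n_mfcc=n_mfcc)
    m2 = librosa.feature.mfcc(y=y2, sr=sr, n_mfcc=n_mfcc)
    D, wp = librosa.sequence.dtw(X=m1, Y=m2, metric="euclidean")
    return D[-1, -1] / len(wp)

# toy usage with random stand-ins for two recordings
rng = np.random.default_rng(1)
a, b = rng.standard_normal(16000), rng.standard_normal(20000)
print(mfcc_dtw_distance(a, b))
```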

The Effect of Spectral Tilt on Size Discrimination of Voiced Speech Sounds

Toshie Matsui 1, Toshio Irino 1, Kodai Yamamoto 1, Hideki Kawahara 1, Roy D. Patterson 2; 1Wakayama University, Japan; 2University of Cambridge, UK
Mon-P-2-1-9, Time: 14:30–16:30

A number of studies, with either voiced or unvoiced speech, have demonstrated that a speaker’s geometric mean formant frequency (MFF) has a large effect on the perception of the speaker’s size, as would be expected. One study with unvoiced speech showed that lifting the slope of the speech spectrum by 6 dB/octave also led to a reduction in the perceived size of the speaker. This paper reports an analogous experiment to determine whether lifting the slope of the speech spectrum by 6 dB/octave affects the perception of speaker size with voiced speech (words). The results showed that voiced speech with high-frequency enhancement was perceived to arise from smaller speakers. On average, the point of subjective equality in MFF discrimination was reduced by about 5%. However, there were large individual differences; some listeners were effectively insensitive to spectral enhancement of 6 dB/octave; others showed a consistent effect of the same enhancement. The results suggest that models of speaker size perception will need to include a listener-specific parameter for the effect of spectral slope.

Misperceptions of the Emotional Content of Natural and Vocoded Speech in a Car

Jaime Lorenzo-Trueba 1, Cassia Valentini Botinhao 2, Gustav Eje Henter 1, Junichi Yamagishi 1; 1NII, Japan; 2University of Edinburgh, UK
Mon-P-2-1-10, Time: 14:30–16:30

This paper analyzes a) how often listeners interpret the emotional content of an utterance incorrectly when listening to vocoded or natural speech in adverse conditions; b) which noise conditions cause the most misperceptions; and c) which group of listeners misinterpret emotions the most. The long-term goal is to construct new emotional speech synthesizers that adapt to the environment and to the listener. We performed a large-scale listening test where over 400 listeners between the ages of 21 and 72 assessed natural and vocoded acted emotional speech stimuli. The stimuli had been artificially degraded using a room impulse response recorded in a car and various in-car noise types recorded in a real car. Experimental results show that the recognition rates for emotions and perceived emotional strength degrade as signal-to-noise ratio decreases. Interestingly, misperceptions seem to be more pronounced for negative and low-arousal emotions such as calmness or anger, while positive emotions such as happiness appear to be more robust to noise. An ANOVA analysis of listener meta-data further revealed that gender and age also influenced results, with elderly male listeners most likely to incorrectly identify emotions.

The Relative Cueing Power of F0 and Duration in German Prominence Perception

Oliver Niebuhr 1, Jana Winkler 2; 1University of Southern Denmark, Denmark; 2Christian-Albrechts-Universität zu Kiel, Germany
Mon-P-2-1-11, Time: 14:30–16:30

Previous studies have shown for German and other (West) Germanic languages, including English, that perceived syllable prominence is primarily controlled by changes in duration and F0, with the latter cue being more powerful than the former. Our study is an initial approach to developing this prominence hierarchy further by putting numbers on the interplay of duration and F0. German listeners indirectly judged, through lexical identification, the relative prominence levels of two neighboring syllables. Results show that an increase in F0 of between 0.49 and 0.76 st is required to outweigh the prominence effect of a 30% increase in the duration of a neighboring syllable. These numbers are fairly stable across a large range of absolute F0 and duration levels and hence useful in speech technology.

Perception and Acoustics of Vowel Nasality in Brazilian Portuguese

Luciana Marques, Rebecca Scarborough; University of Colorado at Boulder, USA
Mon-P-2-1-12, Time: 14:30–16:30

This study explores the relationship between identification, degree of nasality and vowel quality in oral, nasal and nasalized vowels in Brazilian Portuguese. Despite the common belief that the language possesses contrastive nasal vowels, examination of the literature shows that nasal vowels may be followed by a nasal resonance, while nasalized vowels must be followed by a nasal consonant. It is argued that the nasal resonance may be the remains of a consonant that nasalizes the vowel, making nasal vowels simply coarticulatorily nasalized (e.g. [1]). If so, vowel nasality should not be more informative for the perception of a word containing a nasal vowel than for a word containing a nasalized vowel, as nasality is attributed to coarticulation. To test this hypothesis, randomized stimuli containing the first syllable of words with oral, nasal and nasalized vowels were presented to BP listeners, who had to identify the original word for each stimulus. Preliminary results demonstrate that accuracy decreased for nasal and nasalized stimuli. A comparison between patterns of response and measured degrees of vowel acoustic nasality and formant values demonstrates that vowel quality differences may play a more relevant role in word identification than the type of nasality in a vowel.

Sociophonetic Realizations Guide Subsequent Lexical Access

Jonny Kim, Katie Drager; University of Hawai‘i at Manoa, USA
Mon-P-2-1-13, Time: 14:30–16:30

Previous studies on spoken word recognition suggest that lexical access is facilitated when social information attributed to the voice
is congruent with the social characteristics associated with the word. This paper builds on this work, presenting results from a lexical decision task in which target words associated with different age groups were preceded by sociophonetic primes. No age-related phonetic cues were provided within the target words; instead, the non-related prime words contained a sociophonetic variable involved in ongoing change. We found that age-associated words are recognized faster when preceded by an age-congruent phonetic variant in the prime word. The results demonstrate that lexical access is influenced by sociophonetic variation, a result which we argue arises from experience-based probabilities of covariation between sounds and words.

Mon-P-2-2 : Speech Production and Perception
Poster 2, 14:30–16:30, Monday, 21 Aug. 2017
Chair: Wentao Gu

Critical Articulators Identification from RT-MRI of the Vocal Tract

Samuel Silva, António Teixeira; Universidade de Aveiro, Portugal
Mon-P-2-2-1, Time: 14:30–16:30

Several technologies, such as electromagnetic midsagittal articulography (EMA) or real-time magnetic resonance (RT-MRI), enable studying the static and dynamic aspects of speech production. The resulting knowledge can, in turn, inform the improvement of speech production models, e.g., for articulatory speech synthesis, by enabling the identification of which articulators and gestures are involved in producing specific sounds.

The amount of data available from these technologies, and the need for a systematic quantitative assessment, advise tackling these matters through data-driven approaches, preferably unsupervised, since annotated data is scarce. In this context, a method for the statistical identification of critical articulators has been proposed in the literature and successfully applied to EMA data. However, the many differences regarding the data available from other technologies, such as RT-MRI, and language-specific aspects create a challenging setting for its direct and wider applicability.

In this article, we address the steps needed to extend the applicability of the proposed statistical analyses, initially applied to EMA, to an existing RT-MRI corpus and test it for a different language, European Portuguese. The obtained results, for three speakers, and considering 33 phonemes, provide phonologically meaningful critical articulator outcomes and show evidence of the applicability of the method to RT-MRI.

Semantic Edge Detection for Tracking Vocal Tract Air-Tissue Boundaries in Real-Time Magnetic Resonance Images

Krishna Somandepalli, Asterios Toutios, Shrikanth S. Narayanan; University of Southern California, USA
Mon-P-2-2-2, Time: 14:30–16:30

Recent developments in real-time magnetic resonance imaging (rtMRI) have enabled the study of vocal tract dynamics during production of running speech at high frame rates (e.g., 83 frames per second). Such large amounts of acquired data require scalable automated methods to identify different articulators (e.g., tongue, velum) for further analysis. In this paper, we propose a convolutional neural network with an encoder-decoder architecture to jointly detect the relevant air-tissue boundaries as well as to label them, which we refer to as ‘semantic edge detection’. We pose this as a pixel labeling problem, with the outline contour of each articulator of interest as positive class and the remaining tissue and airway as negative classes. We introduce a loss function modified with additional penalty for misclassification at air-tissue boundaries to account for class imbalance and improve edge localization. We then use a greedy search algorithm to draw contours from the probability maps of the positive classes predicted by the network. The articulator contours obtained by our method are comparable to the true labels generated by iteratively fitting a manually created subject-specific template. Our results generalize well across subjects and different vocal tract postures, demonstrating a significant improvement over the structured regression baseline.
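
The boundary-weighted loss can be illustrated with a pixel-wise cross-entropy whose per-pixel weights are raised on annotated contour pixels; the weighting scheme below (a single scalar `boundary_weight`) is an illustrative assumption, not necessarily the paper's exact penalty.

```python
import torch
import torch.nn as nn

def boundary_weighted_ce(logits, labels, boundary_mask, boundary_weight=5.0):
    """Pixel-wise cross-entropy with extra penalty on air-tissue boundary pixels.

    logits: (N, C, H, W); labels: (N, H, W); boundary_mask: (N, H, W) bool.
    Contour pixels get `boundary_weight`, all others weight 1, which counteracts
    the imbalance between thin contours and the large background class.
    """
    per_pixel = nn.functional.cross_entropy(logits, labels, reduction="none")
    weights = torch.ones_like(per_pixel)
    weights[boundary_mask] = boundary_weight
    return (weights * per_pixel).sum() / weights.sum()

# toy usage: 1 image, 3 classes, 8x8 pixels
logits = torch.randn(1, 3, 8, 8, requires_grad=True)
labels = torch.randint(0, 3, (1, 8, 8))
mask = labels > 0                      # pretend non-background pixels are contours
boundary_weighted_ce(logits, labels, mask).backward()
```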

Vocal Tract Airway Tissue Boundary Tracking for rtMRI Using Shape and Appearance Priors

Sasan Asadiabadi, Engin Erzin; Koç Üniversitesi, Turkey
Mon-P-2-2-3, Time: 14:30–16:30

Knowledge about the dynamic shape of the vocal tract is the basis of many speech production applications, such as articulatory analysis, modeling and synthesis. Vocal tract airway tissue boundary segmentation in the mid-sagittal plane is necessary as an initial step for extraction of the cross-sectional area function. This segmentation problem is however challenging due to the poor resolution of real-time speech MRI, grainy noise and the rapidly varying vocal tract shape. We present a novel approach to vocal tract airway tissue boundary tracking by training a statistical shape and appearance model for the human vocal tract. We manually segment a set of vocal tract profiles and utilize a statistical approach to train a shape and appearance model for the tract. An active contour approach is employed to segment the airway tissue boundaries of the vocal tract while restricting the curve movement to the trained shape and appearance model. Then the contours in subsequent frames are tracked using dense motion estimation methods. Experimental evaluations over the mean square error metric indicate significant improvements compared to the state-of-the-art.

An Objective Critical Distance Measure Based on the Relative Level of Spectral Valley

T.V. Ananthapadmanabha 1, A.G. Ramakrishnan 2, Shubham Sharma 2; 1VSS, India; 2Indian Institute of Science, India
Mon-P-2-2-4, Time: 14:30–16:30

Spectral integration is a subjective phenomenon in which a vowel with two formants, spaced below a critical distance, is perceived to be of the same phonetic quality as that of a vowel with a single formant. It is tedious to conduct perceptual tests to determine the critical distance for various experimental conditions. To alleviate this difficulty, we propose an objective critical distance (OCD) that can be determined from the spectral envelope of a speech signal. OCD is defined as the spacing between adjacent formants when the level of the spectral valley between them reaches the mean spectral value. The measured OCD lies in the same range of 3 to 3.5 Bark as the subjective critical distance for similar experimental conditions, giving credibility to the definition. However, it is noted that the OCD for front vowels is significantly different from that for the back vowels.
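
A sketch of how the valley criterion behind the OCD could be checked on a sampled spectral envelope, assuming the formant frequencies are already known; the Zwicker-style Hz-to-Bark formula and the synthetic two-peak envelope are assumptions made purely for illustration, not the authors' procedure.

```python
import numpy as np

def hz_to_bark(f):
    """Zwicker-style Hz -> Bark conversion."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def ocd_criterion(freqs, env_db, f1, f2):
    """Check the valley criterion for two adjacent formants.

    Returns (formant spacing in Bark, True if the valley between f1 and f2
    has risen to the mean envelope level).
    """
    band = (freqs >= f1) & (freqs <= f2)
    valley_level = env_db[band].min()
    spacing = abs(hz_to_bark(f2) - hz_to_bark(f1))
    return spacing, valley_level >= env_db.mean()

# synthetic two-peak envelope with formants near 500 Hz and 1500 Hz
freqs = np.linspace(0, 4000, 512)
env = (-30 + 25 * np.exp(-((freqs - 500) / 150) ** 2)
            + 25 * np.exp(-((freqs - 1500) / 150) ** 2))
print(ocd_criterion(freqs, env, 500.0, 1500.0))
```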

Database of Volumetric and Real-Time Vocal Tract MRI for Speech Science

Tanner Sorensen 1, Zisis Skordilis 1, Asterios Toutios 1, Yoon-Chul Kim 2, Yinghua Zhu 3, Jangwon Kim 4, Adam Lammert 5, Vikram Ramanarayanan 6, Louis Goldstein 1, Dani Byrd 1, Krishna Nayak 1, Shrikanth S. Narayanan 1; 1University of Southern California, USA; 2Samsung Medical Center, Korea; 3Google, USA; 4Canary Speech, USA; 5MIT Lincoln Laboratory, USA; 6Educational Testing Service, USA
Mon-P-2-2-5, Time: 14:30–16:30

We present the USC Speech and Vocal Tract Morphology MRI Database, a 17-speaker magnetic resonance imaging database for speech research. The database consists of real-time magnetic resonance images (rtMRI) of dynamic vocal tract shaping, denoised audio recorded simultaneously with rtMRI, and 3D volumetric MRI of vocal tract shapes during sustained speech sounds. We acquired 2D real-time MRI of vocal tract shaping during consonant-vowel-consonant sequences, vowel-consonant-vowel sequences, read passages, and spontaneous speech. We acquired 3D volumetric MRI of the full set of vowels and continuant consonants of American English. Each 3D volumetric MRI was acquired in one 7-second scan in which the participant sustained the sound. This is the first database to combine rtMRI of dynamic vocal tract shaping and 3D volumetric MRI of the entire vocal tract. The database provides a unique resource with which to examine the relationship between vocal tract morphology and vocal tract function. The USC Speech and Vocal Tract Morphology MRI Database is provided free for research use at http://sail.usc.edu/span/morphdb.

The Influence on Realization and Perception of Lexical Tones from Affricate’s Aspiration

Chong Cao, Yanlu Xie, Qi Zhang, Jinsong Zhang; BLCU, China
Mon-P-2-2-6, Time: 14:30–16:30

Consonants in /CV/ syllables usually have a potential influence on the onset fundamental frequency (i.e., onset f0) of succeeding vowels. Previous studies showed such an effect with respect to the aspiration of stops, with evidence from Mandarin, a tonal language. However, few studies have investigated the effect of affricate aspiration on onset f0. The differences between stops and affricates in aspiration leave space for further investigation. We examined the effect of affricate aspiration on the realization of the onset f0 of following vowels, in both isolated syllables and continuous speech, by reference to a minimal pair of syllables which differ only in aspiration. In addition, we conducted tone identification tests using two sets of tone continua based on the same minimal pair of syllables. Experimental results showed that the aspirated syllables increased the onset f0 of following vowels compared with unaspirated counterparts in both kinds of contexts, although the magnitude of the differences varied with tone. The perception results showed that aspirated syllables tended to be perceived as tones that have a relatively lower onset f0, which in turn supports the production result. The present study may have applications for speech identification and speech synthesis.

Audiovisual Recalibration of Vowel Categories

Matthias K. Franken, Frank Eisner, Jan-Mathijs Schoffelen, Daniel J. Acheson, Peter Hagoort, James M. McQueen; Radboud Universiteit Nijmegen, The Netherlands
Mon-P-2-2-7, Time: 14:30–16:30

One of the most daunting tasks of a listener is to map a continuous auditory stream onto known speech sound categories and lexical items. A major issue with this mapping problem is the variability in the acoustic realizations of sound categories, both within and across speakers. Past research has suggested that listeners may use visual information (e.g., lip-reading) to calibrate these speech categories to the current speaker. Previous studies have focused on audiovisual recalibration of consonant categories. The present study explores whether vowel categorization, which is known to show less sharply defined category boundaries, also benefits from visual cues.

Participants were exposed to videos of a speaker pronouncing one out of two vowels, paired with audio that was ambiguous between the two vowels. After exposure, it was found that participants had recalibrated their vowel categories. In addition, individual variability in audiovisual recalibration is discussed. It is suggested that listeners’ category sharpness may be related to the weight they assign to visual information in audiovisual speech perception. Specifically, listeners with less sharp categories assign more weight to visual information during audiovisual speech recognition.

The Effect of Gesture on Persuasive Speech

Judith Peters, Marieke Hoetjes; Radboud Universiteit Nijmegen, The Netherlands
Mon-P-2-2-8, Time: 14:30–16:30

Speech perception is multimodal, with not only speech, but also gesture presumably playing a role in how a message is perceived. However, there have not been many studies on the effect that hand gestures may have on speech perception in general, and on persuasive speech in particular. Moreover, we do not yet know whether an effect of gestures may be larger when addressees are not involved in the topic of the discourse, and are therefore more focused on peripheral cues, rather than the content of the message. In the current study participants were shown a speech with or without gestures. Some participants were involved in the topic of the speech, others were not. We studied five measures of persuasiveness. Results showed that for all but one measure, viewing the video with accompanying gestures made the speech more persuasive. In addition, there were several interactions, showing that the performance of the speaker and the factual accuracy of the speech scored high especially for those participants who not only saw gestures but were also not involved in the topic of the speech.

Auditory-Visual Integration of Talker Gender in Cantonese Tone Perception

Wei Lai; University of Pennsylvania, USA
Mon-P-2-2-9, Time: 14:30–16:30

This study investigated the auditory-visual integration of talker gender in the perception of tone variances. Two experiments were conducted to evaluate how listeners use the information of talker gender to adjust their expectation towards speakers’ pitch range and uncover intended tonal targets in Cantonese tone perception. Results from an audio-only tone identification task showed that tone categorization along the same pitch continuum shifted under different conditions of voice gender. Listeners generally heard a tone of lower pitch when the word was produced by a female voice, while they heard a tone of higher pitch when the word was produced at the same pitch level by a male voice. Results from an audio-visual tone
identification task showed that tone categorization along the same pitch continuum shifted under different conditions of face gender, despite the fact that the photos of different genders were disguised for the same set of stimuli in identical voices with identical pitch heights. These findings show that gender normalization plays a role in uncovering linguistic pitch targets, and lend support to a hypothesis according to which listeners make use of socially constructed stereotypes to facilitate their basic phonological categorization in speech perception and processing.

Event-Related Potentials Associated with Somatosensory Effect in Audio-Visual Speech Perception

Takayuki Ito 1, Hiroki Ohashi 2, Eva Montas 2, Vincent L. Gracco 2; 1GIPSA, France; 2Haskins Laboratories, USA
Mon-P-2-2-10, Time: 14:30–16:30

Speech perception often involves multisensory processing. Although previous studies have demonstrated visual [1, 2] and somatosensory interactions [3, 4] with auditory processing, it is not clear whether somatosensory information can contribute to the processing of audio-visual speech perception. This study explored the neural consequence of somatosensory interactions in audio-visual speech processing. We assessed whether somatosensory orofacial stimulation influenced event-related potentials (ERPs) in response to an audio-visual speech illusion (the McGurk Effect [1]). ERPs were recorded at 64 scalp sites in response to audio-visual speech stimulation and somatosensory stimulation. In the audio-visual condition, an auditory stimulus /ba/ was synchronized with the video of congruent facial motion (the production of /ba/) or incongruent facial motion (the production of /da/: the McGurk condition). These two audio-visual stimulations were randomly presented with and without somatosensory stimulation associated with facial skin deformation. We found ERP differences associated with the McGurk effect in the presence of the somatosensory conditions. ERPs for the McGurk effect reliably diverge around 280 ms after auditory onset. The results demonstrate a change in the cortical potentials of audio-visual processing due to somatosensory inputs and suggest that somatosensory information encoding facial motion also influences speech processing.

When a Dog is a Cat and How it Changes Your Pupil Size: Pupil Dilation in Response to Information Mismatch

Lena F. Renner, Marcin Włodarczak; Stockholm University, Sweden
Mon-P-2-2-11, Time: 14:30–16:30

In the present study, we investigate pupil dilation as a measure of lexical retrieval. We captured pupil size changes in reaction to a match or a mismatch between a picture and an auditorily presented word in 120 trials presented to ten native speakers of Swedish. In each trial a picture was displayed for six seconds, and 2.5 seconds into the trial the word was played through loudspeakers. The picture and the word were matching in half of the trials, and all stimuli were common high-frequency monosyllabic Swedish words. The difference in pupil diameter trajectories across the two conditions was analyzed with Functional Data Analysis. In line with the expectations, the results indicate greater dilation in the mismatch condition starting from around 800 ms after the stimulus onset. Given that similar processes were observed in brain imaging studies, pupil dilation measurements seem to provide an appropriate tool to reveal lexical retrieval. The results suggest that pupillometry could be a viable alternative to existing methods in the field of speech and language processing, for instance across different ages and clinical groups.

Cross-Modal Analysis Between Phonation Differences and Texture Images Based on Sentiment Correlations

Win Thuzar Kyaw, Yoshinori Sagisaka; Waseda University, Japan
Mon-P-2-2-12, Time: 14:30–16:30

Motivated by the success of representing speech characteristics by color attributes, we analyzed the cross-modal sentiment correlations between voice source characteristics and textural image characteristics. For the analysis, we employed vowel sounds with three representative phonation differences (modal, creaky and breathy) and 36 texture images with 36 semantic attributes (e.g., banded, cracked and scaly), one semantic attribute annotated for each texture. By asking 40 subjects to select the most fitting textures from the 36 figures with different textures after listening to 30 speech samples with different phonations, we measured the correlations between acoustic parameters showing voice source variations and the parameters of the selected textural image differences showing coarseness, contrast, directionality, busyness, complexity and strength. From the texture classifications, voice characteristics can be roughly characterized by textural differences: modal — gauzy, banded and smeared; creaky — porous, crystalline, cracked and scaly; breathy — smeared, freckled and stained. We have also found significant correlations between voice source acoustic parameters and textural parameters. These correlations suggest the possibility of cross-modal mapping between voice source characteristics and textural parameters, which enables visualization of speech information with source variations reflecting human sentiment perception.

Wireless Neck-Surface Accelerometer and Microphone on Flex Circuit with Application to Noise-Robust Monitoring of Lombard Speech

Daryush D. Mehta 1, Patrick C. Chwalek 2, Thomas F. Quatieri 2, Laura J. Brattain 2; 1Massachusetts General Hospital, USA; 2MIT Lincoln Laboratory, USA
Mon-P-2-2-13, Time: 14:30–16:30

Ambulatory monitoring of real-world voice characteristics and behavior has the potential to provide important assessment of voice and speech disorders and psychological and emotional state. In this paper, we report on the novel development of a lightweight, wireless voice monitor that synchronously records dual-channel data from an acoustic microphone and a neck-surface accelerometer embedded on a flex circuit. Lombard speech effects were investigated in pilot data from four adult speakers with normal vocal function who read a phonetically balanced paragraph in the presence of different ambient acoustic noise levels. Whereas the signal-to-noise ratio (SNR) of the microphone signal decreased with increasing ambient noise level, the SNR of the accelerometer sensor remained high. Lombard speech properties were thus robustly computed from the accelerometer signal and observed in all four speakers, who exhibited increases in average estimates of sound pressure level (+2.3 dB), fundamental frequency (+21.4 Hz), and cepstral peak prominence (+1.3 dB) from quiet to loud ambient conditions. Future work calls for ambulatory data collection in naturalistic environments, where the microphone acts as a sound level meter and the accelerometer functions as a noise-robust voicing sensor to assess voice disorders, neurological conditions, and cognitive load.

Video-Based Tracking of Jaw Movements During Speech: Preliminary Results and Future Directions

Andrea Bandini, Aravind Namasivayam, Yana Yunusova; University Health Network, Canada
Mon-P-2-2-14, Time: 14:30–16:30

Facial (e.g., lips and jaw) movements can provide important information for the assessment, diagnosis and treatment of motor speech disorders. However, due to the high costs of the instrumentation used to record speech movements, such information is typically limited to research studies. With the recent development of depth sensors and efficient algorithms for facial tracking, clinical applications of this technology may be possible. Although lip tracking methods have been validated in the past, jaw tracking remains a challenge. In this study, we assessed the accuracy of tracking jaw movements with a video-based system composed of a face tracker and a depth sensor specifically developed for short range applications (Intel RealSense SR300). The assessment was performed on healthy subjects during speech and non-speech tasks. Preliminary results showed that jaw movements can be tracked with reasonable accuracy (RMSE ≈ 2 mm), with better performance for slow movements. Further tests are needed in order to improve the performance of these systems and develop accurate methodologies that can reveal subtle changes in jaw movements for the assessment and treatment of motor speech disorders.

Accurate Synchronization of Speech and EGG Signal Using Phase Information

Sunil Kumar S.B., K. Sreenivasa Rao, Tanumay Mandal; IIT Kharagpur, India
Mon-P-2-2-15, Time: 14:30–16:30

Synchronization of speech and the corresponding Electroglottographic (EGG) signal is very helpful for speech processing research and development. During simultaneous recording of speech and EGG signals, the speech signal is delayed relative to the EGG signal by the duration corresponding to the propagation of the speech wave from the glottis to the microphone. Even within the same recording session, the delay between the speech and EGG signals varies due to natural movements of the speaker’s head and of the microphone when it is held by hand. To study and model the information within glottal cycles, precise synchronization of speech and EGG signals is of utmost necessity. In this work, we propose a method for synchronization of speech and EGG signals based on the glottal activity information present in the signals. The performance of the proposed method is demonstrated by estimating the delay between the two signals (speech signals and corresponding EGG signals) and synchronizing them by compensating for the estimated delay. The CMU-Arctic database, which consists of simultaneous recordings of speech and EGG signals, is used for the evaluation of the proposed method.
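
As a simplified stand-in for the proposed glottal-activity/phase-based method, the sketch below estimates the speech-vs-EGG lag by plain cross-correlation over a physically plausible range and then trims the leading samples; the toy shifted signal at the end is only an illustration and real recordings would replace it.

```python
import numpy as np

def estimate_delay(speech, egg, sr, max_delay_ms=5.0):
    """Estimate the speech-vs-EGG lag (in samples) by cross-correlation,
    restricted to physically plausible acoustic propagation delays.
    A positive value means the speech signal lags the EGG signal."""
    n = min(len(speech), len(egg))
    s, e = speech[:n], egg[:n]
    max_lag = int(sr * max_delay_ms / 1000.0)
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [np.dot(s[l:], e[:n - l]) if l >= 0 else np.dot(s[:n + l], e[-l:])
              for l in lags]
    return int(lags[int(np.argmax(scores))])

def synchronize(speech, egg, sr):
    """Compensate the estimated delay by trimming the leading samples."""
    d = estimate_delay(speech, egg, sr)
    return (speech[d:], egg) if d > 0 else (speech, egg[-d:])

# toy check: "speech" is the EGG signal delayed by 8 samples
sr = 16000
egg = np.random.default_rng(0).standard_normal(sr)
speech = np.concatenate([np.zeros(8), egg])[:sr]
print(estimate_delay(speech, egg, sr))   # -> 8
```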

The Acquisition of Focal Lengthening in Stockholm Swedish

Anna Sara H. Romøren 1, Aoju Chen 2; 1HiOA, Norway; 2Universiteit Utrecht, The Netherlands
Mon-P-2-2-16, Time: 14:30–16:30

In order to be efficient communicators, children need to adapt their utterances to the common ground shared between themselves and their conversational partners. One way of doing this is by prosodically highlighting focal information. In this paper we look at one specific prosodic manipulation, namely word duration, asking whether Swedish-speaking children lengthen words to mark focus, as compared to adult controls. To the best of our knowledge, this is the first study on the relationship between focus and word duration in Swedish-speaking children.

Mon-P-2-3 : Multi-lingual Models and Adaptation for ASR
Poster 3, 14:30–16:30, Monday, 21 Aug. 2017
Chair: Khe Chai Sim

Multilingual Recurrent Neural Networks with Residual Learning for Low-Resource Speech Recognition

Shiyu Zhou, Yuanyuan Zhao, Shuang Xu, Bo Xu; Chinese Academy of Sciences, China
Mon-P-2-3-1, Time: 14:30–16:30

The shared-hidden-layer multilingual deep neural network (SHL-MDNN), in which the hidden layers of a feed-forward deep neural network (DNN) are shared across multiple languages while the softmax layers are language dependent, has been shown to be effective for acoustic modeling in multilingual low-resource speech recognition. In this paper, we propose that a shared-hidden-layer architecture with Long Short-Term Memory (LSTM) recurrent neural networks can achieve further performance improvement, considering that LSTMs have outperformed DNNs as acoustic models for automatic speech recognition (ASR). Moreover, we show that the shared-hidden-layer multilingual LSTM (SHL-MLSTM) with residual learning can yield an additional moderate but consistent gain on multilingual tasks, given that residual learning can alleviate the degradation problem of deep LSTMs. Experimental results demonstrate that SHL-MLSTM can reduce word error rate (WER) relatively by 2.1–6.8% over an SHL-MDNN trained on six languages and by 2.6–7.3% over monolingual LSTMs trained on the language-specific data of the CALLHOME datasets. An additional relative WER reduction of about 2% over SHL-MLSTM can be obtained through residual learning on the CALLHOME datasets, which demonstrates that residual learning is useful for SHL-MLSTM in multilingual low-resource ASR.
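A minimal sketch (not the authors' code) of the shared-hidden-layer idea, assuming a PyTorch-style API: the LSTM stack is shared across languages, each language gets its own softmax layer, and a residual shortcut connects the stacked LSTM layers. Layer sizes and language names are illustrative.

```python
import torch
import torch.nn as nn

class SHLMultilingualLSTM(nn.Module):
    """Shared LSTM layers with language-dependent output heads and a
    residual connection between the two stacked LSTM layers (illustrative)."""
    def __init__(self, feat_dim, hidden_dim, senones_per_lang):
        super().__init__()
        self.lstm1 = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.lstm2 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.heads = nn.ModuleDict({
            lang: nn.Linear(hidden_dim, n_senones)
            for lang, n_senones in senones_per_lang.items()
        })

    def forward(self, feats, lang):
        h1, _ = self.lstm1(feats)      # shared layer 1
        h2, _ = self.lstm2(h1)         # shared layer 2
        h2 = h2 + h1                   # residual (shortcut) connection
        return self.heads[lang](h2)    # language-specific senone logits

model = SHLMultilingualLSTM(40, 512, {"en": 3000, "zh": 4000})
logits = model(torch.randn(8, 100, 40), lang="en")  # (batch, frames, senones)
```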

CTC Training of Multi-Phone Acoustic Models for Speech Recognition

Olivier Siohan; Google, USA
Mon-P-2-3-2, Time: 14:30–16:30

Phone-sized acoustic units such as triphones cannot properly capture the long-term co-articulation effects that occur in spontaneous speech. For that reason, it is interesting to construct acoustic units covering a longer time span, such as syllables or words. Unfortunately, the frequency distribution of those units is such that a few high-frequency units account for most of the tokens, while many units rarely occur. As a result, those units suffer from data sparsity and can be difficult to train. In this paper we propose a scalable data-driven approach to construct a set of salient units made of sequences of phones, called M-phones. We illustrate that, since the decomposition of a word sequence into a sequence of M-phones is ambiguous, those units are well suited to a connectionist temporal classification (CTC) approach, which does not rely on an explicit frame-level segmentation of the word sequence into a sequence of acoustic units. Experiments are presented on a Voice Search task using 12,500 hours of training data.
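For reference, a minimal sketch of CTC training over multi-phone targets, assuming PyTorch's built-in CTC loss; the unit inventory, model and tensor shapes are illustrative, and the paper's actual M-phone construction procedure is not reproduced here.

```python
import torch
import torch.nn as nn

# Illustrative inventory: index 0 is the CTC blank, the rest are multi-phone units.
num_units = 2048
encoder = nn.LSTM(input_size=40, hidden_size=320, batch_first=True)
proj = nn.Linear(320, num_units)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 200, 40)                    # (batch, frames, features)
targets = torch.randint(1, num_units, (4, 30))     # M-phone label sequences
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

hidden, _ = encoder(feats)
log_probs = proj(hidden).log_softmax(dim=-1)       # (batch, frames, units)
# CTCLoss expects (frames, batch, units); no frame-level segmentation is needed.
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```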

An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation

Sibo Tong, Philip N. Garner, Hervé Bourlard; Idiap Research Institute, Switzerland
Mon-P-2-3-3, Time: 14:30–16:30

Different training and adaptation techniques for multilingual Automatic Speech Recognition (ASR) are explored in the context of hybrid systems, exploiting Deep Neural Networks (DNN) and Hidden Markov Models (HMM). In multilingual DNN training, the hidden layers (possibly extracting bottleneck features) are usually shared across languages, and the output layer can either model multiple sets of language-specific senones or one single universal IPA-based multilingual senone set. Both architectures are investigated, exploiting and comparing different language adaptive training (LAT) techniques originating from successful DNN-based speaker adaptation. More specifically, speaker adaptive training methods such as Cluster Adaptive Training (CAT) and Learning Hidden Unit Contribution (LHUC) are considered. In addition, a language adaptive output architecture for the IPA-based universal DNN is also studied and tested.

Experiments show that LAT improves the performance, and adaptation of the top layer further improves the accuracy. By combining state-level minimum Bayes risk (sMBR) sequence training with LAT, we show that a language adaptively trained IPA-based universal DNN outperforms a monolingually sequence-trained model.
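A minimal sketch (illustrative, not the paper's implementation) of the LHUC idea mentioned above: each language (or speaker) gets a small vector of per-unit amplitude scalers applied to a shared hidden layer, and only those scalers are updated during adaptation. Dimensions and language codes are assumptions.

```python
import torch
import torch.nn as nn

class LHUCLayer(nn.Module):
    """Shared affine layer whose hidden-unit amplitudes are re-scaled by a
    language-dependent LHUC parameter vector (2*sigmoid keeps scales in (0, 2))."""
    def __init__(self, in_dim, out_dim, languages):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.lhuc = nn.ParameterDict({
            lang: nn.Parameter(torch.zeros(out_dim)) for lang in languages
        })

    def forward(self, x, lang):
        h = torch.relu(self.linear(x))
        return 2.0 * torch.sigmoid(self.lhuc[lang]) * h

layer = LHUCLayer(440, 1024, ["en", "de", "sw"])
out = layer(torch.randn(32, 440), lang="de")
# During adaptation only layer.lhuc["de"] would be optimized; shared weights stay fixed.
```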

2016 BUT Babel System: Multilingual BLSTM Acoustic Model with i-Vector Based Adaptation

Martin Karafiát, Murali Karthick Baskar, Pavel Matejka, Karel Veselý, František Grézl, Lukáš Burget, Jan Cernocký; Brno University of Technology, Czech Republic
Mon-P-2-3-4, Time: 14:30–16:30

The paper provides an analysis of the BUT automatic speech recognition (ASR) systems built for the 2016 IARPA Babel evaluation. The IARPA Babel program concentrates on building ASR systems for many low-resource languages, where only a limited amount of transcribed speech is available for each language. In such a scenario, we found it essential to train the ASR systems in a multilingual fashion. In this work, we report superior results obtained with pre-trained multilingual BLSTM acoustic models, where we used multi-task training with a separate classification layer for each language. The results reported on three Babel Year 4 languages show over 3% absolute WER reductions obtained from such multilingual pre-training. Experiments with different input features show that the multilingual BLSTM performs best with simple log-Mel-filter-bank outputs, which makes our previously successful multilingual stack bottleneck features with CMLLR adaptation obsolete. Finally, we experiment with different configurations of i-vector based speaker adaptation in the mono- and multilingual BLSTM architectures. This results in additional WER reductions of over 1% absolute.

Optimizing DNN Adaptation for Recognition of Enhanced Speech

Marco Matassoni, Alessio Brutti, Daniele Falavigna; FBK, Italy
Mon-P-2-3-5, Time: 14:30–16:30

Speech enhancement directly using deep neural networks (DNNs) is of major interest due to the capability of DNNs to tangibly reduce the impact of noisy conditions in speech recognition tasks. Similarly, DNN-based acoustic model adaptation to new environmental conditions is another challenging topic. In this paper we present an analysis of acoustic model adaptation in the presence of a disjoint speech enhancement component, identifying an optimal setting for improving the speech recognition performance. Adaptation is derived from a consolidated technique that introduces a regularization term into the training process to prevent overfitting. We propose to optimize the adaptation of the clean acoustic models towards the enhanced speech by tuning the regularization term based on the degree of enhancement. Experiments on a popular noisy dataset (AURORA-4) demonstrate the validity of the proposed approach.
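A minimal sketch of one common realization of such regularized adaptation, assuming (the abstract does not name the regularizer) a KL-divergence penalty toward the unadapted clean model; the weight rho stands in for the tuning by degree of enhancement.

```python
import torch.nn.functional as F

def adaptation_loss(adapted_logits, labels, seed_logits, rho):
    """Illustrative regularized adaptation objective: cross-entropy on the
    enhanced data plus a penalty keeping the adapted model close to the
    original clean-trained model. rho would be tuned per enhancement level."""
    ce = F.cross_entropy(adapted_logits, labels)
    kl = F.kl_div(F.log_softmax(adapted_logits, dim=-1),
                  F.softmax(seed_logits, dim=-1),
                  reduction="batchmean")
    return (1.0 - rho) * ce + rho * kl
```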

Deep Least Squares Regression for Speaker Adaptation

Younggwan Kim, Hyungjun Lim, Jahyun Goo, Hoirin Kim; KAIST, Korea
Mon-P-2-3-6, Time: 14:30–16:30

Recently, speaker adaptation methods for deep neural networks (DNNs) have been widely studied for automatic speech recognition. However, almost all adaptation methods for DNNs have to consider various heuristic conditions such as mini-batch size, learning rate scheduling, stopping criteria, and initialization, because of the inherent properties of stochastic gradient descent (SGD) based training. Unfortunately, these heuristic conditions are hard to tune properly. To alleviate these difficulties, in this paper we propose a least squares regression-based speaker adaptation method in a DNN framework that utilizes the posterior mean of each class. We also show how the proposed method provides a unique solution which is quite easy and fast to calculate without SGD. The proposed method was evaluated on the TED-LIUM corpus. Experimental results showed that the proposed method achieved up to a 4.6% relative improvement over a speaker-independent DNN. In addition, we report further performance improvements of the proposed method with speaker-adapted features.
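A minimal sketch of the closed-form flavor of such an approach: a speaker-dependent linear transform estimated by (ridge-regularized) least squares from a speaker's hidden activations toward per-class target vectors. The exact regression targets and dimensions are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def least_squares_transform(H, T, l2=1e-3):
    """Estimate W minimizing ||H W - T||^2 + l2 ||W||^2 in closed form.

    H: (n_frames, d) speaker's hidden-layer activations
    T: (n_frames, d) regression targets (e.g. class posterior-mean vectors)
    Returns W: (d, d) speaker-dependent linear transform."""
    d = H.shape[1]
    A = H.T @ H + l2 * np.eye(d)
    B = H.T @ T
    return np.linalg.solve(A, B)

# Usage: adapted activations are H_new @ W; no SGD or learning-rate tuning involved.
H = np.random.randn(5000, 256)
T = np.random.randn(5000, 256)
W = least_squares_transform(H, T)
```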

Multi-Task Learning Using Mismatched Transcription for Under-Resourced Speech Recognition

Van Hai Do 1, Nancy F. Chen 2, Boon Pang Lim 2, Mark Hasegawa-Johnson 1; 1Viettel Group, Vietnam; 2A*STAR, Singapore
Mon-P-2-3-7, Time: 14:30–16:30

It is challenging to obtain large amounts of native (matched) labels for audio in under-resourced languages. This could be due to a lack of literate speakers of the language or a lack of a universally acknowledged orthography. One solution is to increase the amount of labeled data by using mismatched transcription, which employs transcribers who do not speak the language (in place of native speakers) to transcribe what they hear as nonsense speech in their own language (e.g., Mandarin). This paper presents a multi-task learning framework where the DNN acoustic model is simultaneously trained using both a limited amount of native (matched) transcription and a larger set of mismatched transcription. We find that by using a multi-task learning framework, we achieve improvements over monolingual baselines and previously proposed mismatched-transcription adaptation techniques. In addition, we show that using alignments provided by a GMM adapted by mismatched transcription further improves acoustic modeling performance. Our experiments on Georgian data from the IARPA Babel program show the effectiveness of the proposed method.

Generalized Distillation Framework for Speaker Normalization

Neethu Mariam Joy, Sandeep Reddy Kothinti, S. Umesh, Basil Abraham; IIT Madras, India
Mon-P-2-3-8, Time: 14:30–16:30

The generalized distillation framework has been shown to be effective for speech enhancement in the past. In this paper we extend the idea to speaker normalization without any explicit adaptation data. In the generalized distillation framework, we assume the presence of some "privileged" information to guide the training process in addition to the training data. In the proposed approach, the privileged information is obtained from a "teacher" model trained on speaker-normalized FMLLR features. The "student" model is trained on un-normalized filterbank features and uses the teacher's supervision for cross-entropy training. The proposed distillation method does not need first-pass decode information during testing and, unlike FMLLR or i-vectors, imposes no constraints on the duration of the test data for computing speaker-specific transforms. Experiments on the Switchboard and AMI corpora show that the generalized distillation framework yields improvements over un-normalized features with or without i-vectors.
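A minimal teacher-student sketch of the distillation setup described above, assuming a PyTorch-style API; the interpolation weight and temperature are illustrative choices, not values from the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Interpolate hard-label cross-entropy with soft targets from a teacher
    trained on speaker-normalized (FMLLR) features. The student sees only
    un-normalized filterbank features at train and test time."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# teacher_logits come from the frozen teacher; gradients flow only to the student.
```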

Learning Factorized Transforms for Unsupervised Adaptation of LSTM-RNN Acoustic Models

Lahiru Samarakoon 1, Brian Mak 1, Khe Chai Sim 2; 1HKUST, China; 2Google, USA
Mon-P-2-3-9, Time: 14:30–16:30

Factorized Hidden Layer (FHL) adaptation has been proposed for speaker adaptation of deep neural network (DNN) based acoustic models. In FHL adaptation, a speaker-dependent (SD) transformation matrix and an SD bias are included in addition to the standard affine transformation. The SD transformation is a linear combination of rank-1 matrices, whereas the SD bias is a linear combination of vectors. Recently, Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) have been shown to outperform DNN acoustic models in many Automatic Speech Recognition (ASR) tasks. In this work, we investigate the effectiveness of SD transformations for LSTM-RNN acoustic models. Experimental results show that, when combined with scaling of the LSTM cell states' outputs, SD transformations achieve 2.3% and 2.1% absolute improvements over the baseline LSTM systems for the AMI IHM and AMI SDM tasks, respectively.
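In equation form, a sketch of the FHL parameterization as described in the abstract (symbols are illustrative): the speaker-dependent affine transform of a hidden-layer input x can be written as

```latex
h_s = \sigma\!\Big( \big(W + \textstyle\sum_{i=1}^{k} d_i^{(s)}\, u_i v_i^{\top}\big)\, x
      \;+\; b + \textstyle\sum_{i=1}^{k} e_i^{(s)}\, c_i \Big)
```

where W and b are the shared weights and bias, the rank-1 matrices u_i v_i^T and the vectors c_i form shared bases, and d^(s), e^(s) are the speaker-dependent combination weights estimated per speaker.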

Factorised Representations for Neural Network Adaptation to Diverse Acoustic Environments

Joachim Fainberg, Steve Renals, Peter Bell; University of Edinburgh, UK
Mon-P-2-3-10, Time: 14:30–16:30

Adapting acoustic models jointly to both speaker and environment has been shown to be effective. In many realistic scenarios, however, either the speaker or environment at test time might be unknown, or there may be insufficient data to learn a joint transform. Generating independent speaker and environment transforms improves the match of an acoustic model to unseen combinations. Using i-vectors, we demonstrate that it is possible to factorise speaker or environment information using multi-condition training with neural networks. Specifically, we extract bottleneck features from networks trained to classify either speakers or environments. We perform experiments on the Wall Street Journal corpus combined with environment noise from the Diverse Environments Multichannel Acoustic Noise Database. Using the factorised i-vectors we show improvements in word error rates on perturbed versions of the eval92 and dev93 test sets, both when one factor is missing and when the factors are seen but not in the desired combination.

Mon-P-2-4 : Prosody and Text Processing
Poster 4, 14:30–16:30, Monday, 21 Aug. 2017
Chair: Zofia Malisz

An RNN Model of Text Normalization

Richard Sproat 1, Navdeep Jaitly 2; 1Google, USA; 2NVIDIA, USA
Mon-P-2-4-1, Time: 14:30–16:30

We present a recurrent neural net (RNN) model of text normalization — defined as the mapping of written text to its spoken form — and a description of the open-source dataset that we used in our experiments. We show that while the RNN model achieves very high overall accuracies, there remain errors that would be unacceptable in a speech application like TTS. We then show that a simple FST-based filter can help mitigate those errors. Even with that mitigation, challenges remain, and we end the paper by outlining some possible solutions. In releasing our data we are thereby inviting others to help solve this problem.

Weakly-Supervised Phrase Assignment from Text in a Speech-Synthesis System Using Noisy Labels

Asaf Rendel 1, Raul Fernandez 2, Zvi Kons 1, Andrew Rosenberg 2, Ron Hoory 1, Bhuvana Ramabhadran 2; 1IBM, Israel; 2IBM, USA
Mon-P-2-4-2, Time: 14:30–16:30

The proper segmentation of an input text string into meaningful intonational phrase units is a fundamental task in the text-processing component of a text-to-speech (TTS) system that generates intelligible and natural synthesis. In this work we look at the creation of a symbolic phrase-assignment model within the front end (FE) of a North American English TTS system when high-quality labels for supervised learning are unavailable and/or potentially mismatched to the target corpus and domain. We explore a labeling scheme that merges heuristics derived from (i) automatic high-quality phonetic alignments, (ii) linguistic rules, and (iii) a legacy acoustic phrase-labeling system to arrive at a ground truth that can be used to train a bidirectional recurrent neural network model. We evaluate the performance of this model in terms of objective metrics describing categorical phrase assignment within the FE proper, as well as the effect that these intermediate labels have on the TTS back end for the task of continuous prosody prediction (i.e., intonation and duration contours, and pausing). For this second task, we rely on subjective listening tests and demonstrate that the proposed system significantly outperforms a linguistic rules-based baseline for two different synthetic voices.

Prosody Aware Word-Level Encoder Based on BLSTM-RNNs for DNN-Based Speech Synthesis

Yusuke Ijima, Nobukatsu Hojo, Ryo Masumura, Taichi Asami; NTT, Japan
Mon-P-2-4-3, Time: 14:30–16:30

Recent studies have shown the effectiveness of using word vectors in DNN-based speech synthesis. However, word vectors trained from large amounts of text generally carry semantic rather than prosodic information, and prosodic information is important for speech synthesis. Therefore, if word vectors that take prosodic information into account can be obtained, they would be expected to improve the quality of synthesized speech. In this paper, to obtain word-level vectors that take prosodic information into account, we propose a novel prosody-aware word-level encoder. A novel point of the proposed technique is to train the word-level encoder using a large speech corpus constructed for automatic speech recognition. A word-level encoder that estimates the F0 contour of each word from the input word sequence is trained, and the outputs of the bottleneck layer in the trained encoder are used as the word-level vectors. By learning the relationship between words and their prosodic realization from a large speech corpus, the bottleneck-layer outputs can be expected to contain prosodic information. The results of objective and subjective experiments indicate that the proposed technique can synthesize speech with improved naturalness.


Global Syllable Vectors for Building TTS Front-End with Deep Learning

Jinfu Ni, Yoshinori Shiga, Hisashi Kawai; NICT, Japan
Mon-P-2-4-4, Time: 14:30–16:30

Recent vector space representations of words have succeeded in capturing syntactic and semantic regularities. In the context of text-to-speech (TTS) synthesis, the front-end is a key component for extracting multi-level linguistic features from text, where the syllable acts as a link between low- and high-level features. This paper describes the use of global syllable vectors as features to build a front-end, evaluated in particular for Chinese. The global syllable vectors directly capture global statistics of syllable-syllable co-occurrences in a large-scale text corpus. They are learned by a global log-bilinear regression model in an unsupervised manner, whilst the front-end is built using deep bidirectional recurrent neural networks in a supervised fashion. Experiments are conducted on large-scale Chinese speech and treebank text corpora, evaluating grapheme-to-phoneme (G2P) conversion, word segmentation, part-of-speech (POS) tagging, phrasal chunking, and pause break prediction. Results show that the proposed method is efficient for building a compact and robust front-end with high performance. The global syllable vectors can be acquired relatively cheaply from plain text resources; they are therefore vital for developing multilingual speech synthesis, especially for under-resourced languages.
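A minimal sketch of the first stage only, i.e. collecting the global syllable-syllable co-occurrence statistics from a syllabified corpus; the windowed, distance-weighted counting shown here is a common GloVe-style convention and an assumption, and the log-bilinear training itself is not reproduced.

```python
from collections import defaultdict

def syllable_cooccurrence(corpus, window=5):
    """Count weighted syllable-syllable co-occurrences over a syllabified corpus
    (a list of syllable lists), as input to GloVe-style log-bilinear training."""
    counts = defaultdict(float)
    for sentence in corpus:
        for i, syl in enumerate(sentence):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)          # nearer syllables contribute more
                counts[(syl, sentence[j])] += weight
                counts[(sentence[j], syl)] += weight
    return counts

counts = syllable_cooccurrence([["ni3", "hao3", "ma5"], ["xie4", "xie4"]])
```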

Prosody Control of Utterance Sequence for Information Delivering

Ishin Fukuoka, Kazuhiko Iwata, Tetsunori Kobayashi; Waseda University, Japan
Mon-P-2-4-5, Time: 14:30–16:30

We propose a conversational speech synthesis system in which the prosodic features of each utterance are controlled throughout the entire input text. We have developed a "news-telling system," which delivers news articles through spoken language. A speech synthesis system for news-telling should be able to highlight utterances containing noteworthy information in the article with a particular way of speaking, so as to impress them on the users. To achieve this, we introduced role and position features of the individual utterances in the article into the control parameters for prosody generation throughout the text. We defined three categories for the role feature: a nucleus (assigned to the utterance containing the noteworthy information), a front satellite (which precedes the nucleus), and a rear satellite (which follows the nucleus). We investigated how the prosodic features differ depending on the role and position features through an analysis of news-telling speech data uttered by a voice actress. We designed the speech synthesis system on the basis of a deep neural network with the role and position features added to its input layer. Objective and subjective evaluation results showed that introducing those features was effective for speech synthesis for information delivery.

Multi-Task Learning for Prosodic Structure Generation Using BLSTM RNN with Structured Output Layer

Yuchen Huang, Zhiyong Wu, Runnan Li, Helen Meng, Lianhong Cai; Tsinghua University, China
Mon-P-2-4-6, Time: 14:30–16:30

Prosodic structure generation from text plays an important role in Chinese text-to-speech (TTS) synthesis, as it greatly influences the naturalness and intelligibility of the synthesized speech. This paper proposes a multi-task learning method for prosodic structure generation using a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) and a structured output layer (SOL). Unlike traditional methods, where prerequisites such as lexicon words or even syntactic trees are usually required as input, the proposed method predicts prosodic boundary labels directly from Chinese characters. The BLSTM RNN captures the bidirectional contextual dependencies of prosodic boundary labels. The SOL further models correlations between prosodic structures, lexicon words and part-of-speech (POS) tags, so that the prediction of prosodic boundary labels is conditioned upon word tokenization and POS tagging results. Experimental results demonstrate the effectiveness of the proposed method.

Investigating Efficient Feature Representation Methods and Training Objective for BLSTM-Based Phone Duration Prediction

Yibin Zheng, Jianhua Tao, Zhengqi Wen, Ya Li, Bin Liu; Chinese Academy of Sciences, China
Mon-P-2-4-7, Time: 14:30–16:30

Accurate modeling and prediction of speech-sound durations are important for generating natural synthetic speech. This paper addresses both the feature and the training-objective aspects to improve the performance of the phone duration model in a speech synthesis system. On the feature side, we combine the feature representation from a gradient boosting decision tree (GBDT) and a phoneme identity embedding model (realized by jointly training a phoneme embedded vector (PEV) and a word embedded vector (WEV)) for the BLSTM that predicts phone duration. The PEV replaces the one-hot phoneme identity, and the GBDT transforms the traditional contextual features. On the training-objective side, we propose a new objective function that takes into account the correlation and consistency between the predicted utterance and the natural utterance. Perceptual tests indicate that the proposed methods improve the naturalness of the synthetic speech, which benefits from the proposed feature representations capturing more precise contextual information and from the proposed training objective tackling the over-averaging problem of generated phone durations.

Discrete Duration Model for Speech Synthesis

Bo Chen, Tianling Bian, Kai Yu; Shanghai Jiao Tong University, China
Mon-P-2-4-8, Time: 14:30–16:30

The acoustic model and the duration model are the two major components in statistical parametric speech synthesis (SPSS) systems. Neural network based acoustic models make it possible to model phoneme duration at the phone level instead of at the state level, as in conventional hidden Markov model (HMM) based SPSS systems. Since the duration of a phoneme is a countable value, the distribution of phone-level duration given the linguistic features is discrete, which means the Gaussian hypothesis is no longer necessary. This paper investigates the performance of an LSTM-RNN duration model that directly models the probability of the countable duration values given linguistic features, using cross entropy as the training criterion. Multi-task learning is also examined, with a comparison to the standard LSTM-RNN duration model in objective and subjective measures. The results show that directly modeling the discrete distribution has benefits and that the multi-task model achieves better performance in phone-level duration modeling.
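A minimal sketch of the discrete-duration idea (shapes, layer sizes and the clipping threshold are illustrative, not the authors' configuration): instead of regressing a continuous duration, the network outputs a categorical distribution over countable frame counts and is trained with cross entropy.

```python
import torch
import torch.nn as nn

MAX_FRAMES = 100  # durations above this could be clipped into the last class

class DiscreteDurationModel(nn.Module):
    def __init__(self, ling_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(ling_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, MAX_FRAMES + 1)  # one class per frame count

    def forward(self, ling_feats):
        h, _ = self.lstm(ling_feats)
        return self.out(h)  # (batch, phones, MAX_FRAMES + 1) logits

model = DiscreteDurationModel(ling_dim=300)
logits = model(torch.randn(2, 20, 300))
durations = torch.randint(0, MAX_FRAMES + 1, (2, 20))   # frames per phone
loss = nn.functional.cross_entropy(logits.reshape(-1, MAX_FRAMES + 1),
                                   durations.reshape(-1))
```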

Comparison of Modeling Target in LSTM-RNN Duration Model

Bo Chen, Jiahao Lai, Kai Yu; Shanghai Jiao Tong University, China
Mon-P-2-4-9, Time: 14:30–16:30

Speech duration is an important component in statistical parametric speech synthesis (SPSS). In an LSTM-RNN based SPSS system, speech duration affects the quality of the synthesized speech in two ways: the prosody of the speech and the position features in the acoustic model. This paper investigates the effects of duration in an LSTM-RNN based SPSS system. The performance of acoustic models with position features at different levels is compared, and duration models with different network architectures are presented. A method to exploit the prior knowledge that the state durations of a phoneme should sum to the phone duration is proposed and shown to perform better in both state-level and phone-level duration modeling. The results show that the acoustic model with state-level position features performs better in acoustic modeling (especially in voiced/unvoiced classification), which means the state-level duration model still has its advantages, and that duration models using the prior knowledge can result in better speech quality.

Learning Word Vector Representations Based on Acoustic Counts

M. Sam Ribeiro, Oliver Watts, Junichi Yamagishi; University of Edinburgh, UK
Mon-P-2-4-10, Time: 14:30–16:30

This paper presents a simple count-based approach to learning word vector representations by leveraging statistics of co-occurrences between text and speech. This type of representation requires two discrete sequences of units defined across modalities. Two possible methods for the discretization of an acoustic signal are presented, which are then applied to the fundamental frequency and energy contours of a transcribed corpus of speech, yielding a sequence of textual objects (e.g. words, syllables) aligned with a sequence of discrete acoustic events. Constructing a matrix recording the co-occurrence of textual objects with acoustic events and reducing its dimensionality with matrix decomposition results in a set of context-independent representations of word types. These are applied to the task of acoustic modelling for speech synthesis; objective and subjective results indicate that these representations are useful for the generation of acoustic parameters in a text-to-speech (TTS) system. In general, we observe that the more discretization approaches, acoustic signals, and levels of linguistic analysis are incorporated into a TTS system via these count-based representations, the better that TTS system performs.
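A minimal sketch of the count-and-decompose recipe described above, assuming word/acoustic-event pairs have already been aligned; scikit-learn's TruncatedSVD stands in for the matrix decomposition, and the vocabulary handling is illustrative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def word_vectors_from_counts(aligned_pairs, n_dims=50):
    """aligned_pairs: iterable of (word, acoustic_event) tuples, e.g. a word
    paired with the discrete F0/energy event overlapping it in time."""
    words = sorted({w for w, _ in aligned_pairs})
    events = sorted({e for _, e in aligned_pairs})
    w_idx = {w: i for i, w in enumerate(words)}
    e_idx = {e: i for i, e in enumerate(events)}

    counts = np.zeros((len(words), len(events)))
    for w, e in aligned_pairs:
        counts[w_idx[w], e_idx[e]] += 1.0

    svd = TruncatedSVD(n_components=min(n_dims, len(events) - 1))
    return dict(zip(words, svd.fit_transform(counts)))
```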

Synthesising Uncertainty: The Interplay of Vocal Effort and Hesitation Disfluencies

Éva Székely, Joseph Mendelson, Joakim Gustafson; KTH, Sweden
Mon-P-2-4-11, Time: 14:30–16:30

As synthetic voices become more flexible and conversational systems gain more potential to adapt to the environmental and social situation, the question needs to be examined how different modifications to the synthetic speech interact with each other and how their specific combinations influence perception. This work investigates how the vocal effort of the synthetic speech, together with added disfluencies, affects listeners' perception of the degree of uncertainty in an utterance. We introduce a DNN voice built entirely from spontaneous conversational speech data and capable of producing a continuum of vocal efforts, prolongations and filled pauses with a corpus-based method. Results of a listener evaluation indicate that decreased vocal effort, filled pauses and prolongation of function words increase the degree of perceived uncertainty of conversational utterances expressing the speaker's beliefs. We demonstrate that the effects of these three cues are not merely additive, but that interaction effects, in particular between the two types of disfluencies and between vocal effort and prolongations, need to be considered when aiming to communicate a specific level of uncertainty. The implications of these findings are relevant for adaptive and incremental conversational systems using expressive speech synthesis and aspiring to communicate the attitude of uncertainty.

Mon-S&T-1/2-A : Show & Tell 1
E306, 11:00–13:00, 14:30–16:30, Monday, 21 Aug. 2017

Prosograph: A Tool for Prosody Visualisation of Large Speech Corpora

Alp Öktem, Mireia Farrús, Leo Wanner; Universitat Pompeu Fabra, Spain
Mon-S&T-2-A-1, Time: 14:30–16:30

This paper presents an open-source tool that has been developed to visualize a speech corpus with its transcript and prosodic features aligned at the word level. In particular, the tool is aimed at providing a simple and clear way to visualize prosodic patterns along large segments of speech corpora, and it can be applied in any research that involves prosody analysis.

ChunkitApp: Investigating the Relevant Units of Online Speech Processing

Svetlana Vetchinnikova, Anna Mauranen, Nina Mikušová; University of Helsinki, Finland
Mon-S&T-2-A-2, Time: 14:30–16:30

This paper presents 'ChunkitApp', a web-based application for tablets developed to investigate chunking in online speech processing. The design of the app is based on recent theoretical developments in linguistics and cognitive science, in particular on the suggestions of Linear Unit Grammar [1]. The data collected using the app provide evidence for the reality of online chunking in language processing and the validity of the construct. In addition to experimental uses, the app has potential applications in language education and speech recognition.

Extending the EMU Speech Database Management System: Cloud Hosting, Team Collaboration, Automatic Revision Control

Markus Jochim; LMU München, Germany
Mon-S&T-2-A-3, Time: 14:30–16:30

In this paper, we introduce a new component of the EMU Speech Database Management System [1, 2] to improve the team workflow of handling production data (both acoustic and physiological) in phonetics and the speech sciences. It is named emuDB Manager, and it facilitates the coordination of team efforts, possibly distributed over several nations, by introducing automatic revision control (based on Git) and cloud hosting (in private clouds provided by the researchers themselves or a third party), by keeping track of which parts of the database have already been edited (and by whom), and by centrally collecting and making searchable the notes made during the editing process.

HomeBank: A Repository for Long-Form Real-World Audio Recordings of Children

Anne S. Warlaumont 1, Mark VanDam 2, Elika Bergelson 3, Alejandrina Cristia 4; 1University of California at Merced, USA; 2Washington State University, USA; 3Duke University, USA; 4LSCP (UMR 8554), France
Mon-S&T-2-A-4, Time: 14:30–16:30

HomeBank is a new component of the TalkBank system, focused on long-form (i.e., multi-hour, typically daylong) real-world recordings of children's language experiences, and it is linked to a GitHub repository in which tools for analyzing those recordings can be shared. HomeBank constitutes a rich resource not only for researchers interested specifically in early language acquisition, but also for those seeking to study spontaneous speech, media exposure, and audio environments more generally. This Show and Tell describes the procedures for accessing and contributing HomeBank data and code. It also gives an overview of the current contents of the repositories and provides some examples of audio recordings, available transcriptions, and currently available analysis tools.

A System for Real Time Collaborative Transcription Correction

Peter Bell, Joachim Fainberg, Catherine Lai, Mark Sinclair; University of Edinburgh, UK
Mon-S&T-2-A-5, Time: 14:30–16:30

We present a system to enable efficient, collaborative human correction of ASR transcripts, designed to operate in real-time situations, for example when post-editing live captions generated for news broadcasts. In the system, confusion networks derived from ASR lattices are used to highlight low-confidence words and present alternatives to the user for quick correction. The system uses a client-server architecture, whereby information about each manual edit is posted to the server. Such information can be used to dynamically update the one-best ASR output for all utterances currently in the editing pipeline. We propose to make updates in three different ways: by finding a new one-best path through an existing ASR lattice consistent with the correction received; by identifying further instances of out-of-vocabulary terms entered by the user; and by adapting the language model on the fly. Updates are received asynchronously by the client.

MoPAReST — Mobile Phone Assisted Remote Speech Therapy Platform

Chitralekha Bhat 1, Anjali Kant 2, Bhavik Vachhani 1, Sarita Rautara 2, Ashok Kumar Sinha 2, Sunil Kumar Kopparapu 1; 1TCS Innovation Labs Mumbai, India; 2AYJNISHD, India
Mon-S&T-2-A-6, Time: 14:30–16:30

In this paper, we present the Mobile Phone Assisted Remote Speech Therapy Platform, which enables individuals with speech disabilities to benefit from therapy remotely, with minimal face-to-face sessions with a Speech Language Pathologist (SLP). The objective is to address the skewed ratio of SLPs to patients as well as to increase the efficacy of the therapy by keeping the patient engaged more frequently, albeit asynchronously and remotely. The platform comprises (1) a web interface used by the SLP to monitor the progress of their patients at a time convenient to them and (2) a mobile application, together with speech processing algorithms, that provides instant feedback to the patient. We envision this platform cutting down therapy time, especially for rural Indian patients. Evaluation of the platform is being conducted with five patients with mis-articulation in the Marathi language.

Mon-S&T-1/2-B : Show & Tell 2
E397, 11:00–13:00, 14:30–16:30, Monday, 21 Aug. 2017

An Apparatus to Investigate Western Opera Singing Skill Learning Using Performance and Result Biofeedback, and Measuring its Neural Correlates

Aurore Jaumard-Hakoun 1, Samy Chikhi 1, Takfarinas Medani 1, Angelika Nair 2, Gérard Dreyfus 1, François-Benoît Vialatte 1; 1ESPCI Paris, France; 2Drew University, USA
Mon-S&T-2-B-1, Time: 14:30–16:30

We present our preliminary developments of a biofeedback interface for Western operatic style training, combining performance and result biofeedback. Electromyographic performance feedback, as well as formant-tuning result feedback, is displayed visually, using continuously scrolling displays or discrete post-trial evaluations. Our final aim is to investigate electroencephalographic (EEG) measurements in order to identify neural correlates of feedback-based skill learning.

PercyConfigurator — Perception Experiments as a Service

Christoph Draxler; LMU München, Germany
Mon-S&T-2-B-2, Time: 14:30–16:30

PercyConfigurator is an experiment editor that eliminates the need for programming; the experiment definition and content are simply dropped onto the PercyConfigurator web page for interactive editing and testing. When the editing is done, the experiment definition and content are uploaded to the server. The server returns a link to the experiment, which is then distributed to potential participants.

The Bavarian Archive for Speech Signals (BAS) hosts PercyConfigurator as a free service to the academic community.

System for Speech Transcription and Post-Editing in Microsoft Word

Askars Salimbajevs, Indra Ikauniece; Tilde, Latvia
Mon-S&T-2-B-3, Time: 14:30–16:30

In this demonstration paper, we introduce a transcription service that can be used for transcribing meetings, sessions and similar events. The service performs speaker diarization, automatic speech recognition and punctuation restoration, and produces human-readable transcripts as special Microsoft Word documents that have audio and word alignments embedded. Thereby, a widely used word processor is transformed into a transcription post-editing tool. Currently, Latvian and Lithuanian are supported, but other languages can easily be added.

Emojive! Collecting Emotion Data from Speech and Facial Expression Using Mobile Game App

Ji Ho Park, Nayeon Lee, Dario Bertero, Anik Dey, Pascale Fung; HKUST, China
Mon-S&T-2-B-4, Time: 14:30–16:30

We developed Emojive!, a mobile game app that makes emotion recognition from audio and images interactive and fun, motivating users to play with the app. The game is to act out a specific emotion, given by the system, from among six emotion labels (happy, sad, anger, anxiety, loneliness, criticism). A double-player mode lets two people compete with their acting skills. The more users play the game, the more emotion-labelled data will be acquired. We use deep Convolutional Neural Network (CNN) models to recognize emotion from audio and facial images in real time, with a mobile front-end client that includes an intuitive user interface and simple data visualization.

Mylly — The Mill: A New Platform for Processing Speech and Text Corpora Easily and Efficiently

Mietta Lennes 1, Jussi Piitulainen 1, Martin Matthiesen 2; 1University of Helsinki, Finland; 2CSC, Finland
Mon-S&T-2-B-5, Time: 14:30–16:30

Speech and language researchers need to manage and analyze increasing quantities of material. Various tools are available for the various stages of the work, but they often require the researcher to use different interfaces and to convert the output from each tool into suitable input for the next one.

The Language Bank of Finland (Kielipankki) is developing an online platform called Mylly for processing speech and language data in a graphical user interface that integrates different tools into a single workflow. Mylly provides tools and computational resources for processing material and for inspecting the results. The tools plugged into Mylly include a parser, morphological analyzers, generic finite-state technology, and a speech recognizer. Users can upload data and download any intermediate results in the tool chain. Mylly runs on CSC's Taito cluster and is an instance of the Chipster platform. Access rights to Mylly are given for academic use.

The Language Bank of Finland is a collection of corpora, tools and other services maintained by FIN-CLARIN, a consortium of Finnish universities and research organizations coordinated by the University of Helsinki. The technological infrastructure for the Language Bank of Finland is provided by CSC – IT Center for Science.

Visual Learning 2: Pronunciation App Using Ultrasound, Video, and MRI

Kyori Suzuki, Ian Wilson, Hayato Watanabe; University of Aizu, Japan
Mon-S&T-2-B-6, Time: 14:30–16:30

We demonstrate Visual Learning 2, an English pronunciation app for second-language (L2) learners and phonetics students. This iOS app links together audio, front and side video, MRI and ultrasound movies of a native speaker reading a phonetically balanced text. Users can watch and shadow front and side video overlaid with an ultrasound tongue movie. They are able to play the video at three speeds and start the video from any word by tapping on it, with a choice of display in either English or IPA. Users can record their own audio/video and play it back in sync with the model for comparison.

Keynote 1: James Allen
Aula Magna, 08:30–09:30, Tuesday, 22 Aug. 2017
Chair: Joakim Gustafson

Dialogue as Collaborative Problem Solving

James Allen; University of Rochester, USA
Tue-K2-1, Time: 08:30–09:30

I will describe the current status of a long-term effort at developing dialogue systems that go beyond simple task-execution models to systems that engage in collaborative problem solving. Such systems involve open-ended discussion, and the tasks cannot be accomplished without extensive interaction (e.g., 10 turns or more). The key idea is that dialogue itself arises from an agent's capability for collaborative problem solving (CPS). In such dialogues, agents may introduce, modify and negotiate goals; propose and discuss the merits of possible paths to solutions; explicitly discuss progress as the two agents work towards the goals; and evaluate how well a goal was accomplished. To complicate matters, user utterances in such settings are much more complex than those seen in simple task-execution dialogues and require full semantic parsing. A key question we have been exploring in the past few years is how much of dialogue can be accounted for by domain-independent mechanisms. I will discuss these issues and draw examples from a dialogue system we have built that, except for the specialized domain reasoning required in each case, uses the same architecture to perform three different tasks: collaborative blocks-world planning, where the system and user build structures and may have differing goals; biocuration, in which a biologist and the system interact in order to build executable causal models of biological pathways; and collaborative composition, where the user and system collaborate to compose simple pieces of music.

Tue-SS-3-11 : Special Session: Speech and Human-Robot Interaction
F11, 10:00–12:00, Tuesday, 22 Aug. 2017
Chairs: Gérard Bailly, Gabriel Skantze

Introduction
Tue-SS-3-11-8, Time: 10:00–10:15

(No abstract available at the time of publication)

Elicitation Design for Acoustic Depression Classification: An Investigation of Articulation Effort, Linguistic Complexity, and Word Affect

Brian Stasak 1, Julien Epps 1, Roland Goecke 2; 1University of New South Wales, Australia; 2University of Canberra, Australia
Tue-SS-3-11-1, Time: 10:15–10:30

The assessment of neurological and psychiatric disorders like depression is unusual from a speech processing perspective, in that speakers can be prompted or instructed in what they should say (e.g., as part of a clinical assessment). Despite prior speech-based depression studies having used a variety of speech elicitation methods, there has been little evaluation of the best elicitation mode. One approach to understanding this better is to analyze an existing database from the perspective of articulation effort, word affect, and linguistic complexity measures as proxies for depression sub-symptoms (e.g., psychomotor retardation, negative stimulus suppression, cognitive impairment). Here a novel measure for quantifying articulation effort is introduced; when applied experimentally to the DAIC corpus, it shows promise for identifying speech data that are more discriminative of depression. Interestingly, the experimental results demonstrate that by selecting speech with higher articulation effort, linguistic complexity, or word-based arousal/valence, improvements in acoustic speech-based depression classification performance can be achieved, serving as a guide for future elicitation design.

Robustness Over Time-Varying Channels in DNN-HMM ASR Based Human-Robot Interaction

José Novoa 1, Jorge Wuth 1, Juan Pablo Escudero 1, Josué Fredes 1, Rodrigo Mahu 1, Richard M. Stern 2, Nestor Becerra Yoma 1; 1Universidad de Chile, Chile; 2Carnegie Mellon University, USA
Tue-SS-3-11-2, Time: 10:30–10:45

This paper addresses the problem of time-varying channels in speech-recognition-based human-robot interaction using Locally-Normalized Filter-Bank features (LNFB) and training strategies that compensate for microphone response and room acoustics. Testing utterances were generated by re-recording the Aurora-4 testing database using a PR2 mobile robot, equipped with a Kinect audio interface, while it performed head rotations and movements toward and away from a fixed source. Three training conditions were evaluated, called Clean, 1-IR and 33-IR. With Clean training, the DNN-HMM system was trained using the Aurora-4 clean training database. With 1-IR training, the same training data were convolved with an impulse response estimated at one meter from the source with no rotation of the robot head. With 33-IR training, the Aurora-4 training data were convolved with impulse responses estimated at one, two and three meters from the source and 11 angular positions of the robot head. The 33-IR training method produced reductions in WER greater than 50% when compared with Clean training, using both LNFB and conventional Mel filterbank (MelFB) features. Nevertheless, LNFB features provided a WER 23% lower than MelFB with 33-IR training. The use of 33-IR training and LNFB features reduced WER by 64% compared to Clean training with MelFB features.

Analysis of Engagement and User Experience with a Laughter Responsive Social Robot

Bekir Berker Türker, Zana Buçinca, Engin Erzin, Yücel Yemez, Metin Sezgin; Koç Üniversitesi, Turkey
Tue-SS-3-11-3, Time: 10:45–11:00

We explore the effect of laughter perception and response in terms of engagement in human-robot interaction. We designed two distinct experiments in which the robot has two modes: laughter responsive and laughter non-responsive. In responsive mode, the robot detects laughter using a multimodal real-time laughter detection module and invokes laughter as a backchannel to users accordingly. In non-responsive mode, the robot makes no use of detection and thus provides no feedback. In the experimental design, we use a straightforward question-answer based interaction scenario with a back-projected robot head. We evaluate the interactions with objective and subjective measurements of engagement and user experience.

Automatic Classification of Autistic Child Vocalisations: A Novel Database and Results

Alice Baird 1, Shahin Amiriparian 1, Nicholas Cummins 1, Alyssa M. Alcorn 2, Anton Batliner 1, Sergey Pugachevskiy 1, Michael Freitag 1, Maurice Gerczuk 1, Björn Schuller 1; 1Universität Passau, Germany; 2University College London, UK
Tue-SS-3-11-4, Time: 11:00–11:15

Humanoid robots have in recent years shown great promise for supporting the educational needs of children on the autism spectrum. To further improve the efficacy of such interactions, user-adaptation strategies based on the individual needs of a child are required. In this regard, the present study assesses the suitability of a range of speech-based classification approaches for automatic detection of autism severity according to the commonly used Social Responsiveness Scale, second edition (SRS-2). Autism is characterised by socialisation limitations, including in child language and communication ability; when compared to neurotypical children of the same age, these can be a strong indication of severity. This study introduces a novel dataset of 803 utterances recorded from 14 autistic children aged 4–10 years during Wizard-of-Oz interactions with a humanoid robot. Our results demonstrate the suitability of support vector machines (SVMs) using acoustic feature sets from multiple Interspeech COMPARE challenges. We also evaluate deep spectrum features, extracted via an image classification convolutional neural network (CNN) from the spectrograms of autistic speech instances. At best, using SVMs on the acoustic feature sets, we achieved a UAR of 73.7% for the proposed 3-class task.

Crowd-Sourced Design of Artificial Attentive Listeners

Catharine Oertel, Patrik Jonell, Dimosthenis Kontogiorgos, Joseph Mendelson, Jonas Beskow, Joakim Gustafson; KTH, Sweden
Tue-SS-3-11-5, Time: 11:15–11:30

Feedback generation is an important component of human-human communication. Humans can choose to signal support, understanding, agreement or scepticism by means of feedback tokens. Many studies have focused on the timing of feedback behaviours. In the current study, however, we keep the timing constant and instead focus on the lexical form and prosody of feedback tokens as well as their sequential patterns.

To this end, we crowdsourced participants' feedback behaviour in identical interactional contexts in order to model a virtual agent that is able to provide feedback as an attentive/supportive as well as an attentive/sceptical listener. The resulting models were realised in a robot, which was evaluated by third-party observers.

Studying the Link Between Inter-Speaker Coordination and Speech Imitation Through Human-Machine Interactions

Leonardo Lancia 1, Thierry Chaminade 2, Noël Nguyen 3, Laurent Prévot 3; 1LPP (UMR 7018), France; 2Institut de Neuroscience de la Timone, France; 3LPL (UMR 7309), France
Tue-SS-3-11-6, Time: 11:30–11:45

According to accounts of inter-speaker coordination based on internal predictive models, speakers tend to imitate each other whenever they need to coordinate their behavior. According to accounts based on the notion of dynamical coupling, imitation should be observed only if it helps stabilize the specific coordinative pattern produced by the interlocutors or if it is a direct consequence of inter-speaker coordination. To compare these accounts, we implemented an artificial agent designed to repeat a speech utterance while coordinating its behavior with that of a human speaker performing the same task. We asked 10 Italian speakers to repeat the utterance /topkop/ simultaneously with the agent during short time intervals. In some interactions, the agent was parameterized to cooperate with the speakers (by producing its syllables simultaneously with those of the human), while in others it was parameterized to compete with them (by producing its syllables in between those of the human). A positive correlation between the stability of inter-speaker coordination and the degree of f0 imitation was observed only in cooperative interactions. However, in line with accounts based on prediction, speakers imitated the f0 of the agent regardless of whether it was parameterized to cooperate or to compete with them.

Discussion
Tue-SS-3-11-7, Time: 11:45–12:00

(No abstract available at the time of publication)

Tue-SS-4-11 : Special Session: Incremental Processing and Responsive Behaviour
F11, 13:30–15:30, Tuesday, 22 Aug. 2017
Chairs: Timo Baumann, Ingmar Steiner

Introduction
Tue-SS-4-11-7, Time: 13:30–13:45

(No abstract available at the time of publication)

Adjusting the Frame: Biphasic Performative Control of Speech Rhythm

Samuel Delalez 1, Christophe d'Alessandro 2; 1LIMSI, France; 2∂'Alembert (UMR 7190), France
Tue-SS-4-11-1, Time: 13:45–14:00

Performative time and pitch scaling is a new research paradigm for prosodic analysis by synthesis. In this paper, a system for real-time time and pitch scaling of recorded speech by means of hand or foot gestures is designed and evaluated. Pitch is controlled with the preferred hand, using a stylus on a graphic tablet. Time is controlled using rhythmic frames, or constriction gestures, defined by pairs of control points. The "Arsis" corresponds to the constriction (weak beat of the syllable) and the "Thesis" corresponds to the vocalic nucleus (strong beat of the syllable). This biphasic control of rhythmic units is performed by the non-preferred hand using a button. Pitch and time scales are modified according to these gestural controls with the help of a real-time pitch-synchronous overlap-add technique (RT-PSOLA). Rhythm and pitch control accuracy are assessed in a prosodic imitation experiment: the task is to reproduce the intonation and rhythm of various sentences. The results show that inter-vocalic durations differ on average by only 20 ms. The system appears to be a new and effective tool for performative speech and singing synthesis. Consequences and applications in speech prosody research are discussed.

Attentional Factors in Listeners' Uptake of Gesture Cues During Speech Processing

Raheleh Saryazdi, Craig G. Chambers; University of Toronto, Canada
Tue-SS-4-11-2, Time: 14:00–14:15

In conversation, speakers spontaneously produce manual gestures that can facilitate listeners' comprehension of speech. However, various factors may affect listeners' ability to use gesture cues. Here we examine a situation where a speaker is referring to physical objects in the contextual here-and-now. In this situation, objects for potential reference will compete with gestures for visual attention. In two experiments, a speaker provided instructions to pick up objects in the visual environment ("Pick up the candy"). On some trials, the speaker produced a "pick up" gesture that reflected the size/shape of the target object. Gaze position was recorded to evaluate how listeners allocated attention to scene elements. Experiment 1 showed that, although iconic gestures (when present) were rarely fixated directly, peripheral uptake of these cues speeded listeners' visual identification of intended referents as the instruction unfolded. However, the benefit was mild and occurred primarily for small/hard-to-identify objects. In Experiment 2, background noise was added to reveal whether challenging auditory environments lead listeners to allocate additional visual attention to gesture cues in a compensatory manner. Interestingly, background noise actually reduced listeners' use of gesture cues. Together the findings highlight how situational factors govern the use of visual cues during multimodal communication.

Motion Analysis in Vocalized Surprise Expressions

Carlos Ishi, Takashi Minato, Hiroshi Ishiguro; ATR HIL, Japan
Tue-SS-4-11-3, Time: 14:15–14:30

The background of our research is the generation of natural human-like motions during speech in android robots that have a highly human-like appearance. Mismatches in speech and motion are sources of unnaturalness, especially when emotion expressions are involved. Surprise expressions often occur in dialogue interactions, and they are often accompanied by verbal interjectional utterances. In this study, we analyze facial, head and body motions during several types of vocalized surprise expressions appearing in human-human dialogue interactions. The analysis results indicate an inter-dependence between motion types and different types of surprise expression (such as emotional, social or quoted) as well as different degrees of surprise expression. The synchronization between motion and surprise utterances is also analyzed.

Enhancing Backchannel Prediction Using Word Embeddings

Robin Ruede, Markus Müller, Sebastian Stüker, Alex Waibel; KIT, Germany
Tue-SS-4-11-4, Time: 14:30–14:45

Backchannel responses like "uh-huh", "yeah", "right" are used by the listener in a social dialog as a way to provide feedback to the speaker. In the context of human-computer interaction, these responses can be used by an artificial agent to build rapport in conversations with users. In the past, multiple approaches have been proposed to detect backchannel cues and to predict the most natural timing at which to place backchannel utterances. Most of these are based on manually optimized fixed rules, which may fail to generalize. Many systems rely on the location and duration of pauses and on pitch slopes of specific lengths. Previously, we proposed an approach based on training artificial neural networks on acoustic features such as pitch and power, and we also attempted to add word embeddings via word2vec. In this work, we refine this approach by evaluating different methods of adding timed word embeddings via word2vec. Comparing the performance of various feature combinations, we show that adding linguistic features improves the performance over a prediction system that uses only acoustic features.

A Computational Model for Phonetically Responsive Spoken Dialogue Systems

Eran Raveh, Ingmar Steiner, Bernd Möbius; Universität des Saarlandes, Germany
Tue-SS-4-11-5, Time: 14:45–15:00

This paper introduces a model for segment-level phonetic responsiveness. It is based on behavior observed in human-human interaction, and is designed to be integrated into spoken dialogue systems to capture potential phonetic variation and simulate convergence capabilities. Each step in the process is responsible for an aspect of the interaction, including monitoring the input speech and appropriately analyzing it. Various parameters can be tuned to configure the speech handling and adjust the response style. Evaluation was performed by simulating simple end-to-end dialogue scenarios, including analyzing the synthesized output of the model. The results show promising ground for further extensions.

Incremental Dialogue Act Recognition: Token- vs. Chunk-Based Classification

Eustace Ebhotemhen, Volha Petukhova, Dietrich Klakow; Universität des Saarlandes, Germany
Tue-SS-4-11-6, Time: 15:00–15:15

This paper presents a machine learning based approach to incremental dialogue act classification, with a focus on the recognition of communicative functions associated with dialogue segments in a multidimensional space, as defined in the ISO 24617-2 dialogue act annotation standard. The main goal is to establish the nature of an increment whose processing will result in reliable overall system performance. We explore scenarios where increments are tokens or syntactically, semantically or prosodically motivated chunks. Combining local classification with meta-classifiers at a late-fusion decision level, we obtained state-of-the-art classification performance. Experiments were carried out on manually corrected transcriptions and on potentially erroneous ASR output. Chunk-based classification yields better results on the manual transcriptions, whereas token-based classification shows more robust performance on the ASR output. It is also demonstrated that layered hierarchical and cascade training procedures result in better classification performance than a single-layered approach based on a joint classification predicting complex class labels.

Discussion
Tue-SS-4-11-8, Time: 15:15–15:30

(No abstract available at the time of publication)

Tue-SS-5-11 : Special Session: Acoustic Manifestations of Social Characteristics
F11, 16:00–18:00, Tuesday, 22 Aug. 2017
Chairs: Stefanie Jannedy, Melanie Weirich

Introduction
Tue-SS-5-11-10, Time: 16:00–16:05

(No abstract available at the time of publication)

Clear Speech — Mere Speech? How Segmental and Prosodic Speech Reduction Shape the Impression That Speakers Create on Listeners

Oliver Niebuhr; University of Southern Denmark, Denmark
Tue-SS-5-11-1, Time: 16:05–16:25

Research on speech reduction is primarily concerned with analyzing, modeling, explaining, and, ultimately, predicting phonetic variation. That is, the focus is on the speech signal itself. The present paper adds a little side note to this fundamental line of research by addressing the question whether variation in the degree of reduction also has a systematic effect on the attributes we ascribe to the speaker who produces the speech signal. A perception experiment was carried out for German in which 46 listeners judged whether or not speakers showing 3 different combinations of segmental and prosodic reduction levels (unreduced, moderately reduced, strongly reduced) are appropriately described by 13 physical, social, and cognitive attributes. The experiment shows that clear speech is not mere speech, and less clear speech is not just reduced either. Rather, results revealed a complex interplay of reduction levels and perceived speaker attributes in which moderate reduction can make a better impression on listeners than no reduction. In addition to its relevance in reduction models and theories, this interplay is instructive for various fields of speech application from social robotics to charisma coaching.

Relationships Between Speech Timing and Perceived Hostility in a French Corpus of Political Debates

Charlotte Kouklia, Nicolas Audibert; LPP (UMR 7018), France
Tue-SS-5-11-2, Time: 17:25–18:00

This study investigates the relationship between perceived hostility and speech timing features within extracts from Montreuil's City Council sessions in 2013, marked by a tense political context at this time. A dataset of 118 speech extracts from the mayor (Dominique Voynet) and four of her political opponents during the City Council has been analyzed through the combination of perception tests and speech timing phenomena, estimated from classical timing-related measurements and custom metrics. We also develop a methodological framework for the phonetic analysis of nonscripted speech: a double perceptive evaluation of the original dataset (22 participants) allowed us to measure the difference in perceived hostility (dHost) between the original audio extracts and their read transcriptions, and the five speakers produced the same utterances in a controlled reading task to make a direct comparison with the original extracts possible. Correlations between dHost and the differences in speech timing features between each original utterance and its control counterpart show that perceived hostility is mainly influenced by local deviations from the expected accentuation pattern in French combined with the insertion of silent pauses. Moreover, a finer-grained analysis of rhythmic features reveals different strategies amongst speakers, especially regarding the realization of interpausal speech rate variation and final syllable lengthening.

Towards Speaker Characterization: Identifying and Predicting Dimensions of Person Attribution

Laura Fernández Gallardo, Benjamin Weiss; T-Labs, Germany
Tue-SS-5-11-3, Time: 17:25–18:00

A great number of investigations on person characterization rely on the assessment of the Big-Five personality traits, a prevalent and widely accepted model with a strong psychological foundation. However, in the context of characterizing unfamiliar individuals from their voices only, it may be hard for assessors to determine the Big-Five traits based on their first impression. In this study, a 28-item semantic differential rating scale has been completed by a total of 33 listeners who were presented with 15 male voice stimuli. A factor analysis on their responses enabled us to identify five perceptual factors of person attribution: (social and physical) attractiveness, confidence, apathy, serenity, and incompetence. A discussion of the relations of these dimensions of speaker attribution to the Big-Five factors is provided, and speech features relevant to the automatic prediction of our dimensions are analyzed, together with SVM regression performance. Although more data are needed to validate our findings, we believe that our approach can lead to establishing a space of person attributions with dimensions that can easily be detected from utterances in zero-acquaintance scenarios.


Prosodic Analysis of Attention-Drawing Speech

Carlos Ishi 1, Jun Arai 1, Norihiro Hagita 2; 1ATR HIL, Japan; 2ATR IRC, Japan
Tue-SS-5-11-4, Time: 17:25–18:00

The term “attention drawing” refers to the action of sellers who call out to get the attention of people passing by in front of their stores or shops to invite them inside to buy or sample products. Since the speaking styles exhibited in such attention-drawing speech are clearly different from conversational speech, in this study, we focused on prosodic analyses of attention-drawing speech and collected the speech data of multiple people with previous attention-drawing experience by simulating several situations. We then investigated the effects of several factors, including background noise, interaction phases, and shop categories on the prosodic features of attention-drawing utterances. Analysis results indicate that compared to dialogue interaction utterances, attention-drawing utterances usually have higher power, higher mean F0s, smaller F0 ranges, and do not drop at the end of sentences, regardless of the presence or absence of background noise. Analysis of sentence-final syllable intonation indicates the presence of lengthened flat or rising tones in attention-drawing utterances.

Perceptual and Acoustic Correlates of Gender in the Prepubertal Voice

Adrian P. Simpson 1, Riccarda Funk 2, Frederik Palmer 1; 1FSU Jena, Germany; 2MLU Halle-Wittenberg, Germany
Tue-SS-5-11-5, Time: 17:25–18:00

This study investigates the perceptual and acoustic correlates of gender in the prepubertal voice. 23 German-speaking primary school pupils (13 female, 10 male) aged 8–9 years were recorded producing 10 sentences each. Two sentences from each speaker were presented in random order to a group of listeners who were asked to assign a gender to each stimulus. Single utterances from each of the three male and three female speakers whose gender was identified most reliably were played in a second experiment to two further groups of listeners who judged each stimulus against seven perceptual attribute pairs. Acoustic analysis of those parameters corresponding most directly to the perceptual attributes revealed a number of highly significant correlations, indicating some aspects of the voice and speech (f0, harmonics-to-noise ratio, tempo) that children use to construct and adults use to identify gender in the prepubertal voice.

To See or not to See: Interlocutor Visibility and Likeability Influence Convergence in Intonation

Katrin Schweitzer, Michael Walsh, Antje Schweitzer; Universität Stuttgart, Germany
Tue-SS-5-11-6, Time: 16:25–16:45

In this paper we look at convergence and divergence in intonation in the context of social qualities. Specifically, we examine pitch accent realisations in the GECO corpus of German conversations. Pitch accents are represented as 6-dimensional vectors where each dimension corresponds to a characteristic of the accent's shape. Convergence/divergence is then measured by calculating the distance between pitch accent realisations of conversational partners. A decrease of distance values over time indicates convergence, an increase divergence. The corpus comprises dialogue sessions in two modalities: partners either saw each other during the conversation or not. Linear mixed model analyses show convergence as well as divergence effects in the realisations of H*L accents. This convergence/divergence is strongly related to the modality and to how much speakers like their partners: generally, seeing the partner comes with divergence, whereas when the dialogue partners cannot see each other, there is convergence. The effect varies, however, depending on the extent to which a speaker likes their partner. Less liking entails a greater change in the realisations over time — stronger divergence when partners could see each other, and stronger convergence when they could not.
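A minimal sketch of the distance-over-time idea, under the assumption of synthetic 6-dimensional accent-shape vectors for two interlocutors; a negative slope of the distance series would be read as convergence, a positive one as divergence:

    import numpy as np

    rng = np.random.default_rng(1)
    T = 50                                   # paired accent realisations over the dialogue
    a = rng.normal(size=(T, 6))              # speaker A's accent-shape vectors (synthetic)
    b = a + rng.normal(scale=np.linspace(2.0, 0.5, T)[:, None], size=(T, 6))  # B drifts closer

    dist = np.linalg.norm(a - b, axis=1)     # Euclidean distance per accent pair
    slope = np.polyfit(np.arange(T), dist, 1)[0]
    print("converging" if slope < 0 else "diverging", round(slope, 3))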

Acoustic Correlates of Parental Role and Gender Identity in the Speech of Expecting Parents

Melanie Weirich, Adrian P. Simpson; FSU Jena, Germany
Tue-SS-5-11-7, Time: 16:45–17:05

Differences between male and female speakers have been explained in terms of biological inevitabilities but also in terms of behavioral and socially motivated factors. The aim of this study is to investigate the latter by examining gender-specific variability within the same gender.

The speech of 29 German men and women — all of them expecting their first child but varying in the time they plan to stay at home during their child's first year (parental role) — is analyzed. Acoustic analyses comprise the vowel space size and the realization of the inter-sibilant contrast.

While the data is part of a larger longitudinal project investigating adult- and infant-directed speech during the infant's first year of life, this study concentrates on the recordings made before the birth of the child. Inter-speaker variability is investigated in relation to 1) the chosen parental role and 2) self-ascribed ratings on positive feminine attributes (gender identity).

Results show that both factors (planned duration of parental leave and the femininity ratings) contribute to the variability found between, but also within, the same gender. In particular, the vowel space size was found to be positively correlated with self-ascribed femininity ratings in male speakers.

A Semi-Supervised Learning Approach for Acoustic-Prosodic Personality Perception in Under-Resourced Domains

Rubén Solera-Ureña 1, Helena Moniz 1, Fernando Batista 1, Vera Cabarrão 1, Anna Pompili 1, Ramon Fernandez Astudillo 1, Joana Campos 2, Ana Paiva 2, Isabel Trancoso 1; 1INESC-ID Lisboa, Portugal; 2Universidade de Lisboa, Portugal
Tue-SS-5-11-8, Time: 17:25–18:00

Automatic personality analysis has gained attention in the last years as a fundamental dimension in human-to-human and human-to-machine interaction. However, it still suffers from the limited number and size of speech corpora for specific domains, such as the assessment of children's personality. This paper investigates a semi-supervised training approach to tackle this scenario. We devise an experimental setup with age and language mismatch and two training sets: a small labeled training set from the Interspeech 2012 Personality Sub-challenge, containing French adult speech labeled with personality OCEAN traits, and a large unlabeled training set of Portuguese children's speech. As test set, a corpus of Portuguese children's speech labeled with OCEAN traits is used. Based on this setting, we investigate a weak supervision approach that iteratively refines an initial model trained with the labeled dataset using the unlabeled dataset. We also investigate knowledge-based features, which leverage expert knowledge in acoustic-prosodic cues and thus need no extra data. Results show that, despite the large mismatch imposed by language and age differences, it is possible to attain improvements with these techniques, pointing to the benefits of using both weak supervision and expert-based acoustic-prosodic features across age and language.


Effects of Talker Dialect, Gender & Race on Accuracy of Bing Speech and YouTube Automatic Captions

Rachael Tatman 1, Conner Kasten 2; 1University of Washington, USA; 2Zonar Systems, USA
Tue-SS-5-11-9, Time: 17:05–17:25

This project compares the accuracy of two automatic speech recognition (ASR) systems — Bing Speech and YouTube's automatic captions — across gender, race and four dialects of American English. The dialects included were chosen for their acoustic dissimilarity. Bing Speech had differences in word error rate (WER) between dialects and ethnicities, but they were not statistically reliable. YouTube's automatic captions, however, did have statistically different WERs between dialects and races. The lowest average error rates were for General American and white talkers, respectively. Neither system had a reliably different WER between genders, which had been previously reported for YouTube's automatic captions [1]. However, the higher error rate for non-white talkers is worrying, as it may reduce the utility of these systems for talkers of color.

Tue-O-3-1 : Neural Network Acoustic Models for ASR 1
Aula Magna, 10:00–12:00, Tuesday, 22 Aug. 2017

Chairs: Herve Bourlard, Jan Cernocký

A Comparison of Sequence-to-Sequence Models for Speech Recognition

Rohit Prabhavalkar 1, Kanishka Rao 1, Tara N. Sainath 1, Bo Li 1, Leif Johnson 1, Navdeep Jaitly 2; 1Google, USA; 2NVIDIA, USA
Tue-O-3-1-1, Time: 10:00–10:20

In this work, we conduct a detailed evaluation of various all-neural, end-to-end trained, sequence-to-sequence models applied to the task of speech recognition. Notably, each of these systems directly predicts graphemes in the written domain, without using an external pronunciation lexicon, or a separate language model. We examine several sequence-to-sequence models including connectionist temporal classification (CTC), the recurrent neural network (RNN) transducer, an attention-based model, and a model which augments the RNN transducer with an attention mechanism.

We find that the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline, which uses a separate pronunciation and language model, outperforms these models on voice-search test sets.

CTC in the Context of Generalized Full-Sum HMM Training

Albert Zeyer, Eugen Beck, Ralf Schlüter, Hermann Ney; RWTH Aachen University, Germany
Tue-O-3-1-2, Time: 10:20–10:40

We formulate a generalized hybrid HMM-NN training procedure using the full-sum over the hidden state-sequence and identify CTC as a special case of it. We present an analysis of the alignment behavior of such a training procedure and explain the strong localization of label output behavior of full-sum training (also referred to as peaky or spiky behavior). We show how to avoid that behavior by using a state prior. We discuss the temporal decoupling between output label position/time-frame and the corresponding evidence in the input observations when this is trained with BLSTM models. We also show a way to overcome this by jointly training a FFNN. We implemented the Baum-Welch alignment algorithm in CUDA to be able to do fast soft realignments on GPU. We have published this code along with some of our experiments as part of RETURNN, RWTH's extensible training framework for universal recurrent neural networks. We finish with experimental validation of our study on WSJ and Switchboard.
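A minimal sketch of the state-prior idea (not the authors' RETURNN/CUDA implementation; the scale value and the uniform prior below are assumptions): the softmax posteriors are divided by a scaled label prior in log space before the soft alignment, which counteracts peaky alignments:

    import numpy as np

    def apply_state_prior(log_posteriors, log_prior, scale=0.3):
        """log_posteriors: (frames, labels); log_prior: (labels,), e.g. estimated from
        label frequencies of a previous soft alignment (an assumption here)."""
        scores = log_posteriors - scale * log_prior   # pseudo-likelihood p(s|x) / p(s)^scale
        return scores - np.logaddexp.reduce(scores, axis=1, keepdims=True)  # renormalize

    frames, labels = 10, 5
    log_post = np.log(np.random.dirichlet(np.ones(labels), size=frames))
    log_prior = np.log(np.full(labels, 1.0 / labels))
    print(apply_state_prior(log_post, log_prior).shape)   # (10, 5)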

Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

Takaaki Hori 1, Shinji Watanabe 1, Yu Zhang 2, William Chan 3; 1MERL, USA; 2MIT, USA; 3Carnegie Mellon University, USA
Tue-O-3-1-3, Time: 10:40–11:00

We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5–10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
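As an illustration of the decoding-time combination only (the interpolation weights and scores below are invented, not the paper's values), a beam-search hypothesis score can interpolate the CTC score, the attention decoder score and an external LM score:

    def hypothesis_score(log_p_ctc, log_p_att, log_p_lm, ctc_weight=0.3, lm_weight=0.1):
        # Weighted combination of the three partial-hypothesis log-scores.
        return (ctc_weight * log_p_ctc
                + (1.0 - ctc_weight) * log_p_att
                + lm_weight * log_p_lm)

    print(hypothesis_score(-4.2, -3.7, -5.0))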

Multitask Learning with CTC and Segmental CRF for Speech Recognition

Liang Lu 1, Lingpeng Kong 2, Chris Dyer 3, Noah A. Smith 4; 1TTIC, USA; 2Carnegie Mellon University, USA; 3DeepMind, UK; 4University of Washington, USA
Tue-O-3-1-4, Time: 11:00–11:20

Segmental conditional random fields (SCRFs) and connectionist temporal classification (CTC) are two sequence labeling methods used for end-to-end training of speech recognition models. Both models define a transcription probability by marginalizing decisions about latent segmentation alternatives to derive a sequence probability: the former uses a globally normalized joint model of segment labels and durations, and the latter classifies each frame as either an output symbol or a “continuation” of the previous label. In this paper, we train a recognition model by optimizing an interpolation between the SCRF and CTC losses, where the same recurrent neural network (RNN) encoder is used for feature extraction for both outputs. We find that this multitask objective improves recognition accuracy when decoding with either the SCRF or CTC models. Additionally, we show that CTC can also be used to pretrain the RNN encoder, which improves the convergence rate when learning the joint model.

Direct Acoustics-to-Word Models for English Conversational Speech Recognition

Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, David Nahamoo; IBM, USA
Tue-O-3-1-5, Time: 11:20–11:40

Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and separately-trained Language Model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder, compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also present rescoring results on CTC word model lattices to quantify the performance benefits of a LM, and contrast the performance of word and phone CTC models.

Reducing the Computational Complexity of Two-Dimensional LSTMs

Bo Li, Tara N. Sainath; Google, USA
Tue-O-3-1-6, Time: 11:40–12:00

Long Short-Term Memory Recurrent Neural Networks (LSTMs) are good at modeling temporal variations in speech recognition tasks, and have become an integral component of many state-of-the-art ASR systems. More recently, LSTMs have been extended to model variations in the speech signal in two dimensions, namely time and frequency [1, 2]. However, one of the problems with two-dimensional LSTMs, such as Grid-LSTMs, is that the processing in both time and frequency occurs sequentially, thus increasing computational complexity. In this work, we look at minimizing the dependence of the Grid-LSTM with respect to previous time and frequency points in the sequence, thus reducing computational complexity. Specifically, we compare reducing computation using a bidirectional Grid-LSTM (biGrid-LSTM) with non-overlapping frequency sub-band processing, a PyraMiD-LSTM [3] and a frequency-block Grid-LSTM (fbGrid-LSTM) for parallel time-frequency processing. We find that the fbGrid-LSTM can reduce computation costs by a factor of four with no loss in accuracy, on a 12,500 hour Voice Search task.

Tue-O-3-2 : Models of Speech Production
A2, 10:00–12:00, Tuesday, 22 Aug. 2017
Chairs: Marcin Wlodarczak, Daryush Mehta

Functional Principal Component Analysis of Vocal Tract Area Functions

Jorge C. Lucero; Universidade de Brasília, Brazil
Tue-O-3-2-1, Time: 10:00–10:20

This paper shows the application of a functional version of principal component analysis to build a parametrization of vocal tract area functions for vowel production. Sets of measured area values for ten vowels are expressed as smooth functional data and next decomposed into a mean area function and a basis of orthogonal eigenfunctions. Interpretations of the first four eigenfunctions are provided in terms of tongue movements and vocal tract length variations. Also, an alternative set of eigenfunctions with closer association to specific regions of the vocal tract is obtained via a varimax rotation. The general intention of the paper is to show the benefits of a functional approach to analyze vocal tract shapes and motivate further applications.
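A minimal sketch of the decomposition using ordinary PCA as a stand-in for the functional version in the paper (the area data are synthetic and the number of tract sections is an assumption):

    import numpy as np

    rng = np.random.default_rng(2)
    n_vowels, n_sections = 10, 44              # area samples along the vocal tract (assumed)
    areas = np.abs(rng.normal(loc=2.0, scale=0.8, size=(n_vowels, n_sections)))

    mean_area = areas.mean(axis=0)
    centred = areas - mean_area
    U, S, Vt = np.linalg.svd(centred, full_matrices=False)
    eigenfunctions = Vt                        # rows: orthogonal "eigen area functions"
    scores = centred @ eigenfunctions.T        # per-vowel weights

    # Reconstruct one vowel's area function from the first four components.
    approx = mean_area + scores[0, :4] @ eigenfunctions[:4]
    print(np.abs(approx - areas[0]).max())     # residual of the 4-component reconstruction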

Analysis of Acoustic-to-Articulatory Speech Inversion Across Different Accents and Languages

Ganesh Sivaraman 1, Carol Espy-Wilson 1, Martijn Wieling 2; 1University of Maryland, USA; 2Rijksuniversiteit Groningen, The Netherlands
Tue-O-3-2-2, Time: 10:20–10:40

The focus of this paper is estimating articulatory movements of the tongue and lips from acoustic speech data. While there are several potential applications of such a method in speech therapy and pronunciation training, performance of such acoustic-to-articulatory inversion systems is not very high due to limited availability of simultaneous acoustic and articulatory data, substantial speaker variability, and variable methods of data collection. This paper therefore evaluates the impact of speaker, language and accent variability on the performance of an acoustic-to-articulatory speech inversion system. The articulatory dataset used in this study consists of 21 Dutch speakers reading Dutch and English words and sentences, and 22 UK English speakers reading English words and sentences. We trained several acoustic-to-articulatory speech inversion systems based on both deep and shallow neural network architectures in order to estimate electromagnetic articulography (EMA) sensor positions, as well as vocal tract variables (TVs). Our results show that with appropriate feature and target normalization, a speaker-independent speech inversion system trained on data from one language is able to estimate sensor positions (or TVs) for the same language correlating at about r = 0.53 with the actual sensor positions (or TVs). Cross-language results show a reduced performance of r = 0.47.

Integrated Mechanical Model of [r]-[l] and [b]-[m]-[w] Producing Consonant Cluster [br]

Takayuki Arai; Sophia University, Japan
Tue-O-3-2-3, Time: 10:40–11:00

We have developed two types of mechanical models of the human vocal tract. The first model was designed for the retroflex approximant [r] and the alveolar lateral approximant [l]. It consisted of the main vocal tract and a flapping tongue, where the front half of the tongue can be rotated against the palate. When the tongue is short and rotated approximately 90 degrees, the retroflex approximant [r] is produced. The second model was designed for [b], [m], and [w]. Besides the main vocal tract, this model contains a movable lower lip for lip closure and a nasal cavity with a controllable velopharyngeal port. In the present study, we joined these two mechanical models to form a new model containing the main vocal tract, the flapping tongue, the movable lower lip, and the nasal cavity with the controllable velopharyngeal port. This integrated model now makes it possible to produce consonant sequences. Therefore, we examined the sequence [br], in particular, adjusting the timing of the lip and lingual gestures to produce the best sound. Because the gestures are visually observable from the outside of this model, the timing of the gestures was examined with the use of a high-speed video camera.

A Speaker Adaptive DNN Training Approach for Speaker-Independent Acoustic Inversion

Leonardo Badino 1, Luca Franceschi 1, Raman Arora 2, Michele Donini 1, Massimiliano Pontil 1; 1Istituto Italiano di Tecnologia, Italy; 2Johns Hopkins University, USA
Tue-O-3-2-4, Time: 11:00–11:20

We address the speaker-independent acoustic inversion (AI) problem, also referred to as acoustic-to-articulatory mapping. The scarce availability of multi-speaker articulatory data makes it difficult to learn a mapping which generalizes from a limited number of training speakers and reliably reconstructs the articulatory movements of unseen speakers. In this paper, we propose a Multi-task Learning (MTL)-based approach that explicitly separates the modeling of each training speaker's AI peculiarities from the modeling of AI characteristics that are shared by all speakers. Our approach stems from the well known Regularized MTL approach and extends it to feed-forward deep neural networks (DNNs). Given multiple training speakers, we learn for each an acoustic-to-articulatory mapping represented by a DNN. Then, through an iterative procedure, we search for a canonical speaker-independent DNN that is “similar” to all speaker-dependent DNNs. The degree of similarity is controlled by a regularization parameter. We report experiments on the University of Wisconsin X-ray Microbeam Database under different training/testing experimental settings. The results obtained indicate that our MTL-trained canonical DNN largely outperforms a standardly trained (i.e., single task learning-based) speaker-independent DNN.
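A minimal sketch of the regularization idea (not the paper's training recipe): each training speaker gets its own small network, and a penalty term pulls all speaker-dependent parameters towards a shared canonical network; the data, dimensions and regularization weight below are illustrative assumptions:

    import torch
    import torch.nn as nn

    def make_net():
        return nn.Sequential(nn.Linear(40, 64), nn.Tanh(), nn.Linear(64, 12))

    speakers = [make_net() for _ in range(3)]   # speaker-dependent acoustic-to-articulatory nets
    canonical = make_net()                      # shared canonical net
    params = [p for net in speakers + [canonical] for p in net.parameters()]
    opt = torch.optim.Adam(params, lr=1e-3)
    lam = 0.1                                   # regularization strength (assumed)

    x = torch.randn(3, 32, 40)                  # acoustic features per speaker (synthetic)
    y = torch.randn(3, 32, 12)                  # articulatory targets per speaker (synthetic)

    for step in range(5):
        opt.zero_grad()
        fit = sum(nn.functional.mse_loss(net(x[s]), y[s]) for s, net in enumerate(speakers))
        tie = sum((ps - pc).pow(2).sum()
                  for net in speakers
                  for ps, pc in zip(net.parameters(), canonical.parameters()))
        loss = fit + lam * tie                  # data fit plus similarity-to-canonical penalty
        loss.backward()
        opt.step()
    print(float(loss))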

Acoustic-to-Articulatory Mapping Based on Mixture of Probabilistic Canonical Correlation Analysis

Hidetsugu Uchida, Daisuke Saito, Nobuaki Minematsu; University of Tokyo, Japan
Tue-O-3-2-5, Time: 11:20–11:40

In this paper, we propose a novel acoustic-to-articulatory mapping model based on mixture of probabilistic canonical correlation analysis (mPCCA). In PCCA, it is assumed that two different kinds of data are observed as results from different linear transforms of a common latent variable. It is expected that this variable represents a common factor which is inherent in the different domains, such as acoustic and articulatory feature spaces. mPCCA is an expansion of PCCA and it can model a much more complex structure. In mPCCA, covariance matrices of a joint probabilistic distribution of acoustic-articulatory data are structuralized reasonably by using transformation coefficients of the linear transforms. Even if the number of components in mPCCA increases, the structuralized covariance matrices can be expected to avoid over-fitting. Training and mapping processes of the mPCCA-based mapping model are reasonably derived by using the EM algorithm. Experiments using MOCHA-TIMIT show that the proposed mapping method has achieved better mapping performance than the conventional GMM-based mapping.

Test-Retest Repeatability of Articulatory Strategies Using Real-Time Magnetic Resonance Imaging

Tanner Sorensen 1, Asterios Toutios 1, Johannes Töger 2, Louis Goldstein 1, Shrikanth S. Narayanan 1; 1University of Southern California, USA; 2Lund University, Sweden
Tue-O-3-2-6, Time: 11:40–12:00

Real-time magnetic resonance imaging (rtMRI) provides information about the dynamic shaping of the vocal tract during speech production. This paper introduces and evaluates a method for quantifying articulatory strategies using rtMRI. The method decomposes the formation and release of a constriction in the vocal tract into the contributions of individual articulators such as the jaw, tongue, lips, and velum. The method uses an anatomically guided factor analysis and dynamical principles from the framework of Task Dynamics. We evaluated the method within a test-retest repeatability framework. We imaged healthy volunteers (n = 8, 4 females, 4 males) in two scans on the same day and quantified inter-study agreement with the intraclass correlation coefficient and mean within-subject standard deviation. The evaluation established a limit on effect size and intra-group differences in articulatory strategy which can be studied using the method.

Tue-O-3-4 : Speaker Recognition
B4, 10:00–12:00, Tuesday, 22 Aug. 2017
Chairs: Jean-Francois Bonastre, Kornel Laskowski

Deep Neural Network Embeddings for Text-Independent Speaker Verification

David Snyder, Daniel Garcia-Romero, Daniel Povey, Sanjeev Khudanpur; Johns Hopkins University, USA
Tue-O-3-4-1, Time: 10:00–10:20

This paper investigates replacing i-vectors for text-independent speaker verification with embeddings extracted from a feed-forward deep neural network. Long-term speaker characteristics are captured in the network by a temporal pooling layer that aggregates over the input speech. This enables the network to be trained to discriminate between speakers from variable-length speech segments. After training, utterances are mapped directly to fixed-dimensional speaker embeddings and pairs of embeddings are scored using a PLDA-based backend. We compare performance with a traditional i-vector baseline on NIST SRE 2010 and 2016. We find that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions. Moreover, the two representations are complementary, and their fusion improves on the baseline at all operating points. Similar systems have recently shown promising results when trained on very large proprietary datasets, but to the best of our knowledge, these are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
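A minimal sketch of a temporal statistics-pooling block of the general kind described (the frame-level layer, embedding layer and all dimensions are placeholders, not the paper's architecture):

    import torch
    import torch.nn as nn

    class StatsPoolingEmbedder(nn.Module):
        def __init__(self, feat_dim=24, frame_dim=256, embed_dim=128):
            super().__init__()
            self.frame_net = nn.Sequential(nn.Linear(feat_dim, frame_dim), nn.ReLU())
            self.embed = nn.Linear(2 * frame_dim, embed_dim)   # after mean+std pooling

        def forward(self, feats):                  # feats: (batch, frames, feat_dim)
            h = self.frame_net(feats)
            stats = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)  # pool over time
            return self.embed(stats)               # fixed-size embedding for any length

    emb = StatsPoolingEmbedder()(torch.randn(2, 300, 24))
    print(emb.shape)                               # torch.Size([2, 128])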

Tied Variational Autoencoder Backends for i-Vector Speaker Recognition

Jesús Villalba 1, Niko Brümmer 2, Najim Dehak 1; 1Johns Hopkins University, USA; 2Nuance Communications, South Africa
Tue-O-3-4-2, Time: 10:20–10:40

Probabilistic linear discriminant analysis (PLDA) is the de facto standard for backends in i-vector speaker recognition. If we try to extend the PLDA paradigm using non-linear models, e.g., deep neural networks, the posterior distributions of the latent variables and the marginal likelihood become intractable. In this paper, we propose to approach this problem using stochastic gradient variational Bayes. We generalize the PLDA model to let i-vectors depend non-linearly on the latent factors. We approximate the evidence lower bound (ELBO) by Monte Carlo sampling using the reparametrization trick. This enables us to optimize the ELBO using backpropagation to jointly estimate the parameters that define the model and the approximate posteriors of the latent factors. We also present a reformulation of the likelihood ratio, which we call Q-scoring. Q-scoring makes it possible to efficiently score the speaker verification trials for this model. Experimental results on NIST SRE10 suggest that more data might be required to exploit the potential of this method.
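A minimal sketch of the reparametrization trick that makes the Monte Carlo ELBO estimate differentiable (the encoder, decoder and dimensions below are placeholders, not the tied-VAE backend of the paper):

    import torch
    import torch.nn as nn

    encoder = nn.Linear(400, 2 * 16)    # i-vector -> (mu, log_var) of the latent factor z
    decoder = nn.Linear(16, 400)        # latent z -> reconstructed i-vector

    x = torch.randn(8, 400)             # synthetic i-vectors
    mu, log_var = encoder(x).chunk(2, dim=-1)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps              # z = mu + sigma * eps (reparametrized)

    recon = nn.functional.mse_loss(decoder(z), x, reduction="sum")  # stand-in log-likelihood
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    elbo = -(recon + kl)                # maximize ELBO = minimize reconstruction + KL
    (-elbo).backward()                  # gradients flow into mu and log_var through z
    print(float(elbo))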

Improved Gender Independent Speaker Recognition Using Convolutional Neural Network Based Bottleneck Features

Shivesh Ranjan, John H.L. Hansen; University of Texas at Dallas, USA
Tue-O-3-4-3, Time: 10:40–11:00

This paper proposes a novel framework to improve performance of gender-independent i-Vector PLDA based speaker recognition using a convolutional neural network (CNN). Convolutional layers of a CNN offer robustness to variations in input features, including those due to gender. A CNN is trained for ASR with a linear bottleneck layer. Bottleneck features extracted using the CNN are then used to train a gender-independent UBM to obtain frame posteriors for training an i-Vector extractor matrix. To preserve speaker-specific information, a hybrid approach to training the i-Vector extractor matrix using MFCC features with corresponding frame posteriors derived from bottleneck features is proposed. On the NIST SRE10 C5 condition pooled trials, our approach reduces the EER and minDCF 2010 by +14.62% and +14.42% respectively compared to a standard MFCC-based gender-independent speaker recognition system.

Autoencoder Based Domain Adaptation for Speaker Recognition Under Insufficient Channel Information

Suwon Shon 1, Seongkyu Mun 1, Wooil Kim 2, Hanseok Ko 1; 1Korea University, Korea; 2Incheon National University, Korea
Tue-O-3-4-4, Time: 11:00–11:20

In real-life conditions, mismatch between the development and test domains degrades speaker recognition performance. To solve the issue, many researchers have explored domain adaptation approaches using a matched in-domain dataset. However, adaptation would not be effective if the dataset is insufficient to estimate the channel variability of the domain. In this paper, we explore the problem of performance degradation under such a situation of insufficient channel information. In order to exploit a limited in-domain dataset effectively, we propose an unsupervised domain adaptation approach using Autoencoder based Domain Adaptation (AEDA). The proposed approach combines an autoencoder with a denoising autoencoder to adapt a resource-rich development dataset to the test domain. The proposed technique is evaluated on the Domain Adaptation Challenge 13 experimental protocols that are widely used in speaker recognition for domain-mismatched conditions. The results show significant improvements over baselines and results from other prior studies.

Nonparametrically Trained Probabilistic Linear Discriminant Analysis for i-Vector Speaker Verification

Abbas Khosravani, Mohammad Mehdi Homayounpour; Amirkabir University of Technology, Iran
Tue-O-3-4-5, Time: 11:20–11:40

In this paper we propose to estimate the parameters of probabilistic linear discriminant analysis (PLDA) in a text-independent i-vector speaker verification framework using a nonparametric form rather than the maximum likelihood estimation (MLE) obtained by an EM algorithm. In this approach the between-speaker covariance matrix that represents global information about the speaker variability is replaced with a local estimation computed on a nearest-neighbor basis for each target speaker. The nonparametric between- and within-speaker scatter matrices can better exploit the discriminant information in training data and are more adapted to the sample distribution, especially when it does not satisfy the Gaussian assumption, as in i-vectors without length-normalization. We evaluated this approach on the recent NIST 2016 speaker recognition evaluation (SRE) as well as the NIST 2010 core condition and found significant performance improvement compared with a generatively trained PLDA model.

DNN Bottleneck Features for Speaker Clustering

Jesús Jorrín, Paola García, Luis Buera; Nuance Communications, Spain
Tue-O-3-4-6, Time: 11:40–12:00

In this work, we explore deep neural network bottleneck features (BNF) in the context of speaker clustering. A straightforward manner to deal with speaker clustering is to reuse the bottleneck features extracted for speaker recognition. However, the selection of a bottleneck architecture or nonlinearity impacts the performance of both systems. In this work, we analyze the bottleneck features obtained for speaker recognition and test them in a speaker clustering scenario. We observe that there are deep neural network topologies that work better for both cases, even when their classification criterion (senone classification) is loosely met. We present results that outperform a traditional MFCC system by 21% for speaker recognition and between 20% and 37% in clustering using the same topology.

Tue-O-3-6 : Phonation and Voice Quality
C6, 10:00–12:00, Tuesday, 22 Aug. 2017
Chairs: Peter Birkholz, Kikuo Maekawa

Creak as a Feature of Lexical Stress in Estonian

Kätlin Aare 1, Pärtel Lippus 1, Juraj Šimko 2; 1University of Tartu, Estonia; 2University of Helsinki, Finland
Tue-O-3-6-1, Time: 10:00–10:20

In addition to typological, turn-taking or sociolinguistic factors, presence of creaky voice in spontaneous interaction is also influenced by the syntactic and phonological properties of speech. For example, creaky voice is reportedly more frequent in function words than content words, has been observed to accompany unstressed syllables and ends of phrases, and is associated with relaxation and reduced speech.

In Estonian, creaky voice is frequently used by all speakers. In this paper, we observe the use of creaky voice in spontaneous Estonian in connection to syllabic properties of words, lexical stress, word class, lengthening, and timing in phrases.

The results indicate that creak occurs less in syllables with primary stress than in unstressed syllables. However, syllables with secondary stress are most frequently creaky. In content words, the primary stressed syllables creak less frequently and unstressed syllables more frequently compared to function words. The stress-related pattern is similar in both function and content words, but more contrastive in content words. The probability of creakiness increases considerably with non-final lengthening within words, and for all syllables towards the end of the intonational phrase.

Cross-Speaker Variation in Voice Source Correlates of Focus and Deaccentuation

Irena Yanushevskaya, Ailbhe Ní Chasaide, Christer Gobl; Trinity College Dublin, Ireland
Tue-O-3-6-2, Time: 10:20–10:40

This paper describes cross-speaker variation in the voice source correlates of focal accentuation and deaccentuation. A set of utterances with varied narrow focus placement as well as broad focus and deaccented renditions were produced by six speakers of English. These were manually inverse filtered and parameterized on a pulse-by-pulse basis using the LF source model. Z-normalized F0, EE, OQ and RD parameters (selected through correlation and factor analysis) were used to generate speaker-specific baseline voice profiles and to explore cross-speaker variation in focal and non-focal (post- and prefocal) syllables. As expected, source parameter values were found to differ in the focal and postfocal portions of the utterance. For four of the six speakers the measures revealed a trend of tenser phonation on the focal syllable (an increase in EE and F0 and, typically, a decrease in OQ and RD) as well as increased laxness in the postfocal part of the utterance. For two of the speakers, however, the measurements showed a different trend. These speakers had very high F0 and often high EE on the focal accent. In these cases, RD and OQ values tended to be raised rather than lowered. The possible reasons for these differences are discussed.


Acoustic Characterization of Word-Final Glottal Stops in Mizo and Assam Sora

Sishir Kalita, Wendy Lalhminghlui, Luke Horo, Priyankoo Sarmah, S.R. Mahadeva Prasanna, Samarendra Dandapat; IIT Guwahati, India
Tue-O-3-6-3, Time: 10:40–11:00

The present work proposes an approach to characterize word-final glottal stops in the Mizo and Assam Sora languages. Generally, glottal stops have stronger glottal and ventricular constriction at the coda position than at the onset. However, the primary source characteristics of glottal stops are irregular glottal cycles, abrupt glottal closing, and a reduced open cycle. These changes will not only affect the vocal quality parameters but may also significantly affect the vocal tract characteristics due to changes in the subglottal coupling behavior. This motivates us to analyze the dynamic vocal tract characteristics in terms of source behavior, apart from the excitation source features computed from the Linear Prediction (LP) residual, for the acoustic characterization of the word-final glottal stops. The dominant resonance frequency (DRF) of the vocal tract, computed using the Hilbert Envelope of Numerator Group Delay (HNGD), is extracted at every sample instant as a cue to study this deviation. A gradual increase in the DRF and a significantly shorter duration of subglottal coupling are observed in the glottal stop region for both languages.

Iterative Optimal Preemphasis for Improved Glottal-Flow Estimation by Iterative Adaptive Inverse Filtering

Parham Mokhtari, Hiroshi Ando; NICT, Japan
Tue-O-3-6-4, Time: 11:00–11:20

Iterative adaptive inverse filtering (IAIF) [1] remains among the state-of-the-art algorithms for estimating glottal flow from the recorded speech signal. Here, we re-examine IAIF in light of its foundational, classical model of voiced (non-nasalized) speech, wherein the overall spectral tilt is caused only by lip-radiation and glottal effects, while the vocal-tract transfer function contains formant peaks but is otherwise not tilted. In contrast, IAIF initially models and cancels the formants after only a first-order preemphasis of the speech signal, which is generally not enough to completely remove spectral tilt.

Iterative optimal preemphasis (IOP) is therefore proposed to replace IAIF's initial step. IOP is a rapidly converging algorithm that models a signal (then inverse-filters it) with one real pole (zero) at a time, until spectral tilt is flattened. IOP-IAIF is evaluated on sustained /a/ in a range of voice qualities from weak-breathy to shouted-tense. Compared with standard IAIF, IOP-IAIF yields: (i) an acceptable glottal flow even for a weak breathy voice that the standard algorithm failed to handle; (ii) generally smoother glottal flows that nevertheless retain pulse shape and closed phase; and (iii) enhanced separation of voice qualities in both normalized amplitude quotient (NAQ) and glottal harmonic spectra.

Automatic Measurement of Pre-Aspiration

Yaniv Sheena 1, Míša Hejná 2, Yossi Adi 1, Joseph Keshet 1; 1Bar-Ilan University, Israel; 2Aarhus University, Denmark
Tue-O-3-6-5, Time: 11:20–11:40

Pre-aspiration is defined as the period of glottal friction occurring in sequences of vocalic/consonantal sonorants and phonetically voiceless obstruents. We propose two machine learning methods for automatic measurement of pre-aspiration duration: a feedforward neural network, which works at the frame level; and a structured prediction model, which relies on manually designed feature functions, and works at the segment level. The input for both algorithms is a speech signal of an arbitrary length containing a single obstruent, and the output is a pair of times which constitutes the pre-aspiration boundaries. We train both models on a set of manually annotated examples. Results suggest that the structured model is superior to the frame-based model as it yields higher accuracy in predicting the boundaries and generalizes to new speakers and new languages. Finally, we demonstrate the applicability of our structured prediction algorithm by replicating linguistic analysis of pre-aspiration in Aberystwyth English with high correlation.

Acoustic and Electroglottographic Study of Breathy and Modal Vowels as Produced by Heritage and Native Gujarati Speakers

Kiranpreet Nara; University of Toronto, Canada
Tue-O-3-6-6, Time: 11:40–12:00

While all languages of the world use modal phonation, many also rely on other phonation types such as breathy or creaky voice. For example, Gujarati, an Indo-Aryan language, makes a distinction between breathy and modal phonation among consonants and vowels: /bʱaR/ 'burden', /baR/ 'twelve', and /ba̤R/ 'outside' [1, 2]. This study, which is a replication and an extension of Khan [3], aims to determine the acoustic and articulatory parameters that distinguish breathy and modal vowels. The participants of this study are heritage and native Gujarati speakers.

The materials consisted of 40 target words with the modal and breathy pairs of the three vowel qualities: /a/ vs /a̤/, /e/ vs /e̤/, and /o/ vs /o̤/. The participants uttered the words in the context of a sentence. Acoustic measurements such as H1-H2, H1-A1, harmonic-to-noise ratio and articulatory measurements such as contact quotient were calculated throughout the vowel duration.

The results of the Smoothing Spline ANOVA analyses indicated that measures such as H1-A1, harmonic-to-noise ratio, and contact quotient distinguished modal and breathy vowels for native speakers. Heritage speakers also had a contrast between breathy and modal vowels; however, the contrast is not as robust as that of native speakers.

Tue-O-3-8 : Speech Synthesis Prosody
D8, 10:00–12:00, Tuesday, 22 Aug. 2017
Chairs: Mirjam Wester, Prasanta Ghosh

An RNN-Based Quantized F0 Model with Multi-Tier Feedback Links for Text-to-Speech Synthesis

Xin Wang, Shinji Takaki, Junichi Yamagishi; NII, Japan
Tue-O-3-8-1, Time: 10:00–10:20

A recurrent-neural-network-based F0 model for text-to-speech (TTS) synthesis that generates F0 contours given textual features is proposed. In contrast to related F0 models, the proposed one is designed to learn the temporal correlation of F0 contours at multiple levels. The frame-level correlation is covered by feeding back the F0 output of the previous frame as the additional input of the current frame; meanwhile, the correlation over long time spans is similarly modeled but by using F0 features aggregated over the phoneme and syllable. Another difference is that the output of the proposed model is not the interpolated continuous-valued F0 contour but rather a sequence of discrete symbols, including quantized F0 levels and a symbol for the unvoiced condition. By using the discrete F0 symbols, the proposed model avoids the influence of artificially interpolated F0 curves. Experiments demonstrated that the proposed F0 model, which was trained using a dropout strategy, generated smooth F0 contours with relatively better perceived quality than those from baseline RNN models.


Phrase Break Prediction for Long-Form Reading TTS: Exploiting Text Structure Information

Viacheslav Klimkov 1, Adam Nadolski 1, Alexis Moinet 2, Bartosz Putrycz 1, Roberto Barra-Chicote 2, Thomas Merritt 2, Thomas Drugman 3; 1Amazon.com, Poland; 2Amazon.com, UK; 3Amazon.com, Belgium
Tue-O-3-8-2, Time: 10:20–10:40

Phrasing structure is one of the most important factors in increasing the naturalness of text-to-speech (TTS) systems, in particular for long-form reading. Most existing TTS systems are optimized for isolated short sentences, and completely discard the larger context or structure of the text.

This paper presents how we have built phrasing models based on data extracted from audiobooks. We investigate how various types of textual features can improve phrase break prediction: part-of-speech (POS), guess POS (GPOS), dependency tree features and word embeddings. These features are fed into a bidirectional LSTM or a CART baseline. The resulting systems are compared using both objective and subjective evaluations. Using BiLSTM and word embeddings proves to be beneficial.
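A minimal sketch of a phrase-break tagger of this general kind (a bidirectional LSTM over per-word features; all dimensions, the POS inventory size and the random inputs are illustrative assumptions, not the authors' system):

    import torch
    import torch.nn as nn

    class BreakTagger(nn.Module):
        def __init__(self, word_dim=100, pos_vocab=40, pos_dim=16, hidden=64):
            super().__init__()
            self.pos_emb = nn.Embedding(pos_vocab, pos_dim)
            self.lstm = nn.LSTM(word_dim + pos_dim, hidden,
                                batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, 2)     # break vs. no break after each word

        def forward(self, word_vecs, pos_ids):      # (batch, words, word_dim), (batch, words)
            x = torch.cat([word_vecs, self.pos_emb(pos_ids)], dim=-1)
            h, _ = self.lstm(x)
            return self.out(h)                      # per-word break logits

    logits = BreakTagger()(torch.randn(2, 12, 100), torch.randint(0, 40, (2, 12)))
    print(logits.shape)                             # torch.Size([2, 12, 2])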

Physically Constrained Statistical F0 Prediction for Electrolaryngeal Speech Enhancement

Kou Tanaka 1, Hirokazu Kameoka 2, Tomoki Toda 3, Satoshi Nakamura 1; 1NAIST, Japan; 2NTT, Japan; 3Nagoya University, Japan
Tue-O-3-8-3, Time: 10:40–11:00

Electrolaryngeal (EL) speech produced by a laryngectomee using an electrolarynx to mechanically generate artificial excitation sounds severely suffers from unnatural fundamental frequency (F0) patterns caused by monotonic excitation sounds. To address this issue, we have previously proposed EL speech enhancement systems using statistical F0 pattern prediction methods based on a Gaussian Mixture Model (GMM), making it possible to predict the underlying F0 pattern of EL speech from its spectral feature sequence. Our previous work revealed that the naturalness of the predicted F0 pattern can be improved by incorporating a physically based generative model of F0 patterns into the GMM-based statistical F0 prediction system within a Product-of-Experts framework. However, one drawback of this method is that it requires an iterative procedure to obtain a predicted F0 pattern, making it difficult to realize a real-time system. In this paper, we propose yet another approach to physically based statistical F0 pattern prediction by using an HMM-GMM framework. This approach is noteworthy in that it allows us to generate an F0 pattern that is both statistically likely and physically natural without iterative procedures. Experimental results demonstrated that the proposed method was capable of generating F0 patterns more similar to those in normal speech than the conventional GMM-based method.

DNN-SPACE: DNN-HMM-Based Generative Model of Voice F0 Contours for Statistical Phrase/Accent Command Estimation

Nobukatsu Hojo 1, Yasuhito Ohsugi 2, Yusuke Ijima 1, Hirokazu Kameoka 1; 1NTT, Japan; 2University of Tokyo, Japan
Tue-O-3-8-4, Time: 11:00–11:20

This paper proposes a method to extract prosodic features from a speech signal by leveraging auxiliary linguistic information. A prosodic feature extractor called the statistical phrase/accent command estimation (SPACE) has recently been proposed. This extractor is based on a statistical model formulated as a stochastic counterpart of the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration. The key idea of this approach is that a phrase/accent command pair sequence is modeled as an output sequence of a path-restricted hidden Markov model (HMM) so that estimating the state transition amounts to estimating the phrase/accent commands. Since the phrase and accent commands are related to linguistic information, we may expect to improve the command estimation accuracy by using them as auxiliary information for the inference. To model the relationship between the phrase/accent commands and linguistic information, we construct a deep neural network (DNN) that maps the linguistic feature vectors to the state posterior probabilities of the HMM. Thus, given a pitch contour and linguistic information, we can estimate phrase/accent commands via state decoding. We call this method “DNN-SPACE.” Experimental results revealed that using linguistic information was effective in improving the command estimation accuracy.

Controlling Prominence Realisation in Parametric DNN-Based Speech Synthesis

Zofia Malisz 1, Harald Berthelsen 2, Jonas Beskow 1, Joakim Gustafson 1; 1KTH, Sweden; 2STTS, Sweden
Tue-O-3-8-5, Time: 11:20–11:40

This work aims to improve text-to-speech synthesis for Wikipedia by advancing and implementing models of prosodic prominence. We propose a new system architecture with explicit prominence modeling and test the first component of the architecture. We automatically extract a phonetic feature related to prominence from the speech signal in the ARCTIC corpus. We then modify the label files and train an experimental TTS system based on the feature using Merlin, a statistical-parametric DNN-based engine. Test sentences with contrastive prominence on the word level are synthesised, and separate listening tests evaluating a) the level of prominence control in generated speech and b) naturalness are conducted. Our results show that the prominence feature-enhanced system successfully places prominence on the appropriate words and increases perceived naturalness relative to the baseline.

Increasing Recall of Lengthening Detection via Semi-Automatic Classification

Simon Betz, Jana Voße, Sina Zarrieß, Petra Wagner; Universität Bielefeld, Germany
Tue-O-3-8-6, Time: 11:40–12:00

Lengthening is the ideal hesitation strategy for synthetic speech and dialogue systems: it is unobtrusive and hard to notice, because it occurs frequently in everyday speech before phrase boundaries, in accentuation, and in hesitation. Despite its elusiveness, it allows valuable extra time for computing or information highlighting in incremental spoken dialogue systems. The elusiveness of the matter, however, poses a challenge for extracting lengthening instances from corpus data: we suspect a recall problem, as human annotators might not be able to consistently label lengthening instances. We address this issue by filtering corpus data for instances of lengthening, using a simple classification method, based on a threshold for normalized phone duration. The output is then manually labeled for disfluency. This is compared to an existing, fully manual disfluency annotation, showing that recall is significantly higher with semi-automatic pre-classification. This shows that it is inevitable to use semi-automatic pre-selection to gather enough candidate data points for manual annotation and subsequent lengthening analyses. Also, it is desirable to further increase the performance of the automatic classification. We evaluate in detail human versus semi-automatic annotation and train another classifier on the resulting dataset to check the integrity of the disfluent – non-disfluent distinction.
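A minimal sketch of the pre-classification step, assuming a z-score threshold on phone duration normalized per phone identity; the durations and the threshold value below are invented for illustration:

    import numpy as np

    phones = np.array(["a", "t", "a", "s", "a", "t", "s", "a"])
    durations = np.array([0.09, 0.06, 0.21, 0.08, 0.10, 0.05, 0.19, 0.08])  # seconds
    threshold = 1.5                                      # z-score cut-off (assumed)

    candidates = []
    for p in np.unique(phones):
        idx = np.where(phones == p)[0]
        z = (durations[idx] - durations[idx].mean()) / (durations[idx].std() + 1e-9)
        candidates.extend(idx[z > threshold].tolist())   # flag unusually long realizations

    print(sorted(candidates))    # indices passed on to manual disfluency labeling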


Tue-O-3-10 : Emotion Recognition
E10, 10:00–12:00, Tuesday, 22 Aug. 2017
Chairs: Elmar Nöth, Shrikanth Narayanan

Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms

Aharon Satt, Shai Rozenberg, Ron Hoory; IBM, Israel
Tue-O-3-10-1, Time: 10:00–10:20

We present a new implementation of emotion recognition from the para-lingual information in the speech, based on a deep neural network, applied directly to spectrograms. This new method achieves higher recognition accuracy compared to previously published results, while also limiting the latency. It processes the speech input in smaller segments — up to 3 seconds, and splits a longer input into non-overlapping parts to reduce the prediction latency.

The deep network comprises common neural network tools — convolutional and recurrent networks — which are shown to effectively learn the information that represents emotions directly from spectrograms. A convolution-only lower-complexity deep network achieves a prediction accuracy of 66% over four emotions (tested on IEMOCAP — a common evaluation corpus), while a combined convolution-LSTM higher-complexity model achieves 68%.

The use of spectrograms in the role of speech-representing features enables effective handling of background non-speech signals such as music (excl. singing) and crowd noise, even at noise levels comparable with the speech signal levels. Using harmonic modeling to remove non-speech components from the spectrogram, we demonstrate significant improvement of the emotion recognition accuracy in the presence of unknown background non-speech signals.

Interaction and Transition Model for Speech Emotion Recognition in Dialogue

Ruo Zhang, Ando Atsushi, Satoshi Kobashikawa, Yushi Aono; NTT, Japan
Tue-O-3-10-2, Time: 10:20–10:40

In this paper we propose a novel emotion recognition method modeling interaction and transition in dialogue. Conventional emotion recognition utilizes intra-features such as MFCCs or F0s within an individual utterance. However, humans perceive emotions not only through individual utterances but also through contextual information. The proposed method takes into account the contextual effect of utterances in dialogue, which the conventional method fails to capture. The proposed method introduces Emotion Interaction and Transition (EIT) models, which are constructed from end-to-end LSTMs. The inputs of the EIT model are the previous emotions of both the target and the opponent speaker, estimated by a state-of-the-art utterance emotion recognition model. The experimental results show that the proposed method improves overall accuracy and average precision by a relative error reduction of 18.8% and 22.6%, respectively.

Progressive Neural Networks for Transfer Learning in Emotion Recognition

John Gideon 1, Soheil Khorram 1, Zakaria Aldeneh 1, Dimitrios Dimitriadis 2, Emily Mower Provost 1; 1University of Michigan, USA; 2IBM, USA
Tue-O-3-10-3, Time: 10:40–11:00

Many paralinguistic tasks are closely related and thus representations learned in one domain can be leveraged for another. In this paper, we investigate how knowledge can be transferred between three paralinguistic tasks: speaker, emotion, and gender recognition. Further, we extend this problem to cross-dataset tasks, asking how knowledge captured in one emotion dataset can be transferred to another. We focus on progressive neural networks and compare these networks to the conventional deep learning method of pre-training and fine-tuning. Progressive neural networks provide a way to transfer knowledge and avoid the forgetting effect present when pre-training neural networks on different tasks. Our experiments demonstrate that: (1) emotion recognition can benefit from using representations originally learned for different paralinguistic tasks and (2) transfer learning can effectively leverage additional datasets to improve the performance of emotion recognition systems.

Jointly Predicting Arousal, Valence and Dominance with Multi-Task Learning

Srinivas Parthasarathy, Carlos Busso; University of Texas at Dallas, USA
Tue-O-3-10-4, Time: 11:00–11:20

An appealing representation of emotions is the use of emotional attributes such as arousal (passive versus active), valence (negative versus positive) and dominance (weak versus strong). While previous studies have considered these dimensions as orthogonal descriptors to represent emotions, there is strong theoretical and practical evidence showing the interrelation between these emotional attributes. This observation suggests that predicting emotional attributes with a unified framework should outperform machine learning algorithms that separately predict each attribute. This study presents methods to jointly learn emotional attributes by exploiting their interdependencies. The framework relies on multi-task learning (MTL) implemented with deep neural networks (DNN) with shared hidden layers. The framework provides a principled approach to learn shared feature representations that maximize the performance of regression models. The results of within-corpus and cross-corpora evaluation show the benefits of MTL over single task learning (STL). MTL achieves gains on concordance correlation coefficient (CCC) as high as 4.7% for within-corpus evaluations, and 14.0% for cross-corpora evaluations. The visualization of the activations of the last hidden layers illustrates that MTL creates better feature representations. The best structure has shared layers followed by attribute-dependent layers, capturing better the relation between attributes.
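For reference, the concordance correlation coefficient used as the evaluation metric can be computed as in the small self-contained sketch below (the example values are invented):

    import numpy as np

    def ccc(x, y):
        # CCC = 2*cov(x,y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
        x, y = np.asarray(x, float), np.asarray(y, float)
        cov = ((x - x.mean()) * (y - y.mean())).mean()
        return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

    pred = [0.1, 0.4, 0.35, 0.8]      # e.g. predicted arousal values
    gold = [0.0, 0.5, 0.30, 0.9]      # reference ratings
    print(round(ccc(pred, gold), 3))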

Discretized Continuous Speech Emotion Recognition with Multi-Task Deep Recurrent Neural Network

Duc Le, Zakaria Aldeneh, Emily Mower Provost; University of Michigan, USA
Tue-O-3-10-5, Time: 11:20–11:40

Estimating continuous emotional states from speech as a function of time has traditionally been framed as a regression problem. In this paper, we present a novel approach that moves the problem into the classification domain by discretizing the training labels at different resolutions. We employ a multi-task deep bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) trained with cost-sensitive cross-entropy loss to model these labels jointly. We introduce an emotion decoding algorithm that incorporates long- and short-term temporal properties of the signal to produce more robust time series estimates. We show that our proposed approach achieves competitive audio-only performance on the RECOLA dataset, relative to previously published works as well as other strong regression baselines. This work provides a link between regression and classification, and contributes an alternative approach for continuous emotion recognition.
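A sketch of the two ingredients named above, label discretization and a cost-sensitive cross-entropy loss, is given below. The label range, number of bins and inverse-frequency weighting are assumptions for illustration.

```python
# Sketch: discretize continuous labels (assumed range [-1, 1]) and weight the
# cross-entropy by inverse class frequency so rare bins are not ignored.
import torch
import torch.nn as nn

def discretize(labels, n_bins):
    """Map continuous labels in [-1, 1] to integer class indices 0..n_bins-1."""
    edges = torch.linspace(-1.0, 1.0, n_bins + 1)
    return torch.bucketize(labels, edges[1:-1])      # interior bin edges only

y_cont = torch.rand(1000) * 2 - 1                    # frame-level continuous labels
y_cls = discretize(y_cont, n_bins=8)

counts = torch.bincount(y_cls, minlength=8).float()
weights = counts.sum() / (counts + 1e-6)             # cost-sensitive class weights
criterion = nn.CrossEntropyLoss(weight=weights / weights.sum())

logits = torch.randn(1000, 8, requires_grad=True)    # stand-in for BLSTM outputs
loss = criterion(logits, y_cls)
loss.backward()
```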


Towards Speech Emotion Recognition “in the Wild” Using Aggregated Corpora and Deep Multi-Task Learning

Jaebok Kim, Gwenn Englebienne, Khiet P. Truong, Vanessa Evers; University of Twente, The Netherlands
Tue-O-3-10-6, Time: 11:40–12:00

One of the challenges in Speech Emotion Recognition (SER) “in the wild” is the large mismatch between training and test data (e.g. speakers and tasks). In order to improve the generalisation capabilities of the emotion models, we propose to use Multi-Task Learning (MTL) with gender and naturalness as auxiliary tasks in deep neural networks. This method was evaluated in within-corpus and various cross-corpus classification experiments that simulate conditions “in the wild”. In comparison to state-of-the-art Single-Task Learning (STL) based methods, our proposed MTL method improved performance significantly. In particular, models using both gender and naturalness achieved larger gains than those using either gender or naturalness separately. This benefit was also found in the high-level representations of the feature space obtained from our proposed method, where discriminative emotional clusters could be observed.

Tue-O-4-1 : WaveNet and Novel Paradigms
Aula Magna, 13:30–15:30, Tuesday, 22 Aug. 2017
Chairs: Peter Cahill, Rob Clark

Speaker-Dependent WaveNet Vocoder

Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, Tomoki Toda; Nagoya University, Japan
Tue-O-4-1-1, Time: 13:30–13:50

In this study, we propose a speaker-dependent WaveNet vocoder, a method of synthesizing speech waveforms with WaveNet, by utilizing acoustic features from an existing vocoder as auxiliary features of WaveNet. It is expected that WaveNet can learn a sample-by-sample correspondence between the speech waveform and acoustic features. The advantage of the proposed method is that it requires neither (1) explicit modeling of excitation signals nor (2) various assumptions based on prior knowledge specific to speech. We conducted both subjective and objective evaluation experiments on the CMU-ARCTIC database. The objective evaluation demonstrated that the proposed method can generate high-quality speech with phase information recovered, which is lost by a mel-cepstrum vocoder. The subjective evaluation demonstrated that the sound quality of the proposed method is significantly better than that of the mel-cepstrum vocoder, and that the proposed method captures source excitation information more accurately.
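The conditioning idea can be illustrated with a single WaveNet-style gated residual block that receives frame-level acoustic features upsampled to the sample rate; channel counts, kernel size and the nearest-neighbour upsampling are assumptions, not the authors' configuration.

```python
# Sketch of one gated, causal, dilated residual block with local conditioning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    def __init__(self, channels, cond_channels, dilation):
        super().__init__()
        self.dilation = dilation
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=2, dilation=dilation)
        self.cond = nn.Conv1d(cond_channels, 2 * channels, kernel_size=1)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, c):
        # left-pad so the convolution never sees future samples (causality)
        h = self.conv(F.pad(x, (self.dilation, 0))) + self.cond(c)
        filt, gate = h.chunk(2, dim=1)
        z = torch.tanh(filt) * torch.sigmoid(gate)     # gated activation unit
        return x + self.res(z)                         # residual connection

wav = torch.randn(1, 64, 16000)        # (batch, channels, samples)
feats = torch.randn(1, 25, 100)        # frame-level acoustic features
cond = F.interpolate(feats, size=16000, mode="nearest")   # upsample to sample rate
block = GatedResidualBlock(channels=64, cond_channels=25, dilation=2)
out = block(wav, cond)
```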

Waveform Modeling Using Stacked Dilated Convolutional Neural Networks for Speech Bandwidth Extension

Yu Gu, Zhen-Hua Ling; USTC, China
Tue-O-4-1-2, Time: 13:50–14:10

This paper presents a waveform modeling and generation method for speech bandwidth extension (BWE) using stacked dilated convolutional neural networks (CNNs) with causal or non-causal convolutional layers. Such dilated CNNs describe the predictive distribution for each wideband or high-frequency speech sample conditioned on the input narrowband speech samples. Distinguished from conventional frame-based BWE approaches, the proposed methods can model the speech waveforms directly and therefore avert the spectral conversion and phase estimation problems. Experimental results show that, in subjective preference tests, the BWE methods proposed in this paper achieve better performance than a state-of-the-art frame-based approach built on recurrent neural networks (RNNs) with long short-term memory (LSTM) cells.

Direct Modeling of Frequency Spectra and Waveform Generation Based on Phase Recovery for DNN-Based Speech Synthesis

Shinji Takaki 1, Hirokazu Kameoka 2, Junichi Yamagishi 1; 1NII, Japan; 2NTT, Japan
Tue-O-4-1-3, Time: 14:10–14:30

In statistical parametric speech synthesis (SPSS) systems using a high-quality vocoder, acoustic features such as mel-cepstrum coefficients and F0 are predicted from linguistic features so that the vocoder can generate speech waveforms. However, the generated speech waveform generally suffers from quality deterioration, such as buzziness, caused by the vocoder. Although several attempts, such as improving the excitation model, have been investigated to alleviate the problem, it is difficult to avoid it completely as long as the SPSS system is based on a vocoder. To overcome this problem, there have recently been attempts to model waveform samples directly. Superior performance has been demonstrated, but computation time and latency remain issues. With the aim of constructing another type of DNN-based speech synthesizer requiring neither a vocoder nor excessive computation, we investigated direct modeling of frequency spectra and waveform generation based on phase recovery. In this framework, STFT spectral amplitudes that include harmonic information derived from F0 are directly predicted by a DNN-based acoustic model, and Griffin and Lim’s approach is used to recover phase and generate waveforms. The experimental results showed that the proposed system synthesized speech without buzziness and outperformed speech generated by a conventional vocoder-based system.
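The phase-recovery step is the classic Griffin-Lim iteration, sketched below; the STFT parameters and iteration count are illustrative, and the magnitudes would in practice come from the acoustic model rather than random data.

```python
# Minimal Griffin-Lim sketch: recover a waveform from predicted STFT magnitudes
# by alternating between the magnitude constraint and STFT consistency.
import numpy as np
import librosa

def griffin_lim(magnitude, n_fft=1024, hop_length=256, n_iter=60):
    """magnitude: (1 + n_fft // 2, frames) array of linear STFT amplitudes."""
    angles = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))  # random phase
    stft = magnitude * angles
    for _ in range(n_iter):
        y = librosa.istft(stft, hop_length=hop_length, win_length=n_fft)
        reproj = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
        # keep the estimated phase, re-impose the predicted magnitude
        stft = magnitude * np.exp(1j * np.angle(reproj))
    return librosa.istft(stft, hop_length=hop_length, win_length=n_fft)

fake_mag = np.abs(np.random.randn(513, 200)).astype(np.float32)  # stand-in for DNN output
waveform = griffin_lim(fake_mag)
```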

A Hierarchical Encoder-Decoder Model for Statistical Parametric Speech Synthesis

Srikanth Ronanki, Oliver Watts, Simon King; University of Edinburgh, UK
Tue-O-4-1-4, Time: 14:30–14:50

Current approaches to statistical parametric speech synthesis using neural networks generally require input at the same temporal resolution as the output, typically a frame every 5 ms, or in some cases at the waveform sampling rate. It is therefore necessary to fabricate highly redundant frame-level (or sample-level) linguistic features at the input. This paper proposes the use of a hierarchical encoder-decoder model to perform the sequence-to-sequence regression in a way that takes the input linguistic features at their original timescales, and preserves the relationships between words, syllables and phones. The proposed model is designed to make more effective use of suprasegmental features than conventional architectures, as well as being computationally efficient. Experiments were conducted on prosodically varied audiobook material because the use of suprasegmental features is thought to be particularly important in this case. Both objective measures and results from subjective listening tests, which asked listeners to focus on prosody, show that the proposed method performs significantly better than a conventional architecture that requires the linguistic input to be at the acoustic frame rate.

We provide code and a recipe to enable our system to be reproduced using the Merlin toolkit.


Statistical Voice Conversion with WaveNet-Based Waveform Generation

Kazuhiro Kobayashi, Tomoki Hayashi, Akira Tamamori, Tomoki Toda; Nagoya University, Japan
Tue-O-4-1-5, Time: 14:50–15:10

This paper presents a statistical voice conversion (VC) technique with WaveNet-based waveform generation. VC based on a Gaussian mixture model (GMM) makes it possible to convert the speaker identity of a source speaker into that of a target speaker. However, in the conventional vocoding process, various factors such as F0 extraction errors, parameterization errors and over-smoothing of the converted feature trajectory cause modeling errors in the speech waveform, which usually degrade the sound quality of the converted voice. To address this issue, we apply a direct waveform generation technique based on a WaveNet vocoder to VC. In the proposed method, the acoustic features of the source speaker are first converted into those of the target speaker based on the GMM. Then, the waveform samples of the converted voice are generated by the WaveNet vocoder conditioned on the converted acoustic features. To investigate the modeling accuracy of the converted speech waveform, we compare several types of acoustic features for training and synthesis with the WaveNet vocoder. The experimental results confirmed that the proposed VC technique achieves higher conversion accuracy for speaker individuality with comparable sound quality compared to the conventional VC technique.

Google’s Next-Generation Real-Time Unit-Selection Synthesizer Using Sequence-to-Sequence LSTM-Based Autoencoders

Vincent Wan 1, Yannis Agiomyrgiannakis 1, Hanna Silen 1, Jakub Vít 2; 1Google, UK; 2University of West Bohemia, Czech Republic
Tue-O-4-1-6, Time: 15:10–15:30

A neural network model that significantly improves unit-selection-based text-to-speech synthesis is presented. The model employs a sequence-to-sequence LSTM-based autoencoder that compresses the acoustic and linguistic features of each unit into a fixed-size vector referred to as an embedding. Unit selection is facilitated by formulating the target cost as an L2 distance in the embedding space. In open-domain speech synthesis the method achieves a 0.2 improvement in MOS, while for limited domains it reaches the cap of 4.5 MOS. Furthermore, the new TTS system halves the gap between the previous unit-selection system and WaveNet in terms of quality while retaining low computational cost and latency.
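The target-cost formulation can be illustrated with a few lines of numpy; embedding dimensions and database size are made up, and a real system would combine this target cost with a join cost in a lattice search rather than picking the nearest unit independently.

```python
# Sketch: unit selection with the target cost defined as an L2 distance between
# predicted target embeddings and database unit embeddings (illustrative only).
import numpy as np

def select_units(target_embeddings, database_embeddings):
    """For each target embedding, return the index of the closest database unit."""
    selected = []
    for t in target_embeddings:
        dists = np.linalg.norm(database_embeddings - t, axis=1)   # L2 target cost
        selected.append(int(np.argmin(dists)))
    return selected

rng = np.random.default_rng(0)
db = rng.normal(size=(5000, 64))       # embeddings of all units in the database
targets = rng.normal(size=(20, 64))    # embeddings predicted for the utterance
unit_indices = select_units(targets, db)
```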

Tue-O-4-2 : Models of Speech Perception
A2, 13:30–15:30, Tuesday, 22 Aug. 2017
Chairs: Chris Davis, Frank Zimmerer

A Comparison of Sentence-Level Speech Intelligibility Metrics

Alexander Kain 1, Max Del Giudice 2, Kris Tjaden 3; 1Oregon Health & Science University, USA; 2Independent Researcher, USA; 3University at Buffalo, USA
Tue-O-4-2-1, Time: 13:30–13:50

We examine existing and novel automatically-derived acoustic metrics that are predictive of speech intelligibility. We hypothesize that the degree of variability in feature space is correlated with the extent of a speaker’s phonemic inventory, their degree of articulatory displacements, and thus with their degree of perceived speech intelligibility. We begin by using fully-automatic F1/F2 formant frequency trajectories both for vowel space area calculation and as input to a proposed class-separability metric. We then switch to representing vowels by means of short-term spectral features, and measure vowel separability in that space. Finally, we consider the case where phonetic labeling is unavailable; here we calculate short-term spectral features for the entire speech utterance and then estimate their entropy based on the length of a minimum spanning tree. In an alternative approach, we propose to first segment the speech signal using a hidden Markov model, and then calculate spectral feature separability based on the automatically-derived classes. We apply all approaches to a database with healthy controls as well as speakers with mild dysarthria, and report the resulting coefficients of determination.
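The minimum-spanning-tree step can be sketched as follows: the total MST edge length over short-term feature vectors grows with how spread out the features are, and serves as a proxy for their differential entropy. The normalization shown is a hypothetical simplification, not the exact estimator used by the authors.

```python
# Sketch: total MST edge length over spectral feature frames as an entropy proxy.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_length(features):
    """features: (n_frames, n_dims) array of short-term spectral feature vectors."""
    dists = squareform(pdist(features))         # pairwise Euclidean distances
    mst = minimum_spanning_tree(dists)          # sparse matrix holding MST edges
    return mst.sum()

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 13))             # e.g. MFCC frames of one utterance
total_edge_length = mst_length(frames)
# hypothetical length normalization for comparing utterances of different duration
normalized = total_edge_length / len(frames)
```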

An Auditory Model of Speaker Size Perception for Voiced Speech Sounds

Toshio Irino 1, Eri Takimoto 1, Toshie Matsui 1, Roy D. Patterson 2; 1Wakayama University, Japan; 2University of Cambridge, UK
Tue-O-4-2-2, Time: 13:50–14:10

An auditory model was developed to explain the results of behavioral experiments on the perception of speaker size with voiced speech sounds. It is based on the dynamic, compressive gammachirp (dcGC) filterbank and a weighting function (SSI weight) derived from a theory of size-shape segregation in the auditory system. Voiced words with and without high-frequency emphasis (+6 dB/octave) were produced using a speech vocoder (STRAIGHT). The SSI weighting function reduces the effect of glottal pulse excitation in voiced speech, which, in turn, makes it possible for the model to explain the individual subject variability in the data.

The Recognition of Compounds: A Computational Account

L. ten Bosch, L. Boves, M. Ernestus; Radboud Universiteit Nijmegen, The Netherlands
Tue-O-4-2-3, Time: 14:10–14:30

This paper investigates the processes in comprehending spoken noun-noun compounds, using data from the BALDEY database. BALDEY contains lexicality judgments and reaction times (RTs) for Dutch stimuli for which linguistic information is also included. Two different approaches are combined. The first is based on regression by Dynamic Survival Analysis, which models decisions and RTs as a consequence of the fact that a cumulative density function exceeds some threshold. The parameters of that function are estimated from the observed RT data. The second approach is based on DIANA, a process-oriented computational model of human word comprehension, which simulates the comprehension process with the acoustic stimulus as input. DIANA gives the identity and the number of the word candidates that are activated at each 10 ms time step.

Both approaches show how the processes involved in comprehending compounds change during a stimulus. Survival Analysis shows that the impact of word duration varies during the course of a stimulus. The density of word and non-word hypotheses in DIANA shows a corresponding pattern with different regimes. We show how the approaches complement each other, and discuss additional ways in which data and process models can be combined.

Humans do not Maximize the Probability of Correct Decision When Recognizing DANTALE Words in Noise

Mohsen Zareian Jahromi, Jan Østergaard, Jesper Jensen; Aalborg University, Denmark
Tue-O-4-2-4, Time: 14:30–14:50

Inspired by the DANTALE II listening test paradigm, which is used for determining the intelligibility of noisy speech, we assess the hypothesis that humans maximize the probability of correct decision when recognizing words contaminated by additive Gaussian, speech-shaped noise. We first propose a statistical Gaussian communication and classification scenario, where word models are built from short-term spectra of human speech, and optimal classifiers in the sense of maximum a posteriori estimation are derived. Then, we perform a listening test, where the participants are instructed to make their best guess of words contaminated with speech-shaped Gaussian noise. Comparing the humans’ performance to that of the optimal classifier reveals that at high SNR, humans perform comparably to the optimal classifier. However, at low SNR, human performance is inferior to that of the optimal classifier. This shows that, at least in this specialized task, humans are generally not able to maximize the probability of correct decision when recognizing words.
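A MAP classifier of the kind referred to above can be sketched with a diagonal Gaussian word model plus additive Gaussian noise; the diagonal covariance, dimensionality and equal priors are assumptions for the sketch, not necessarily the paper's model.

```python
# Sketch of a MAP word classifier under additive Gaussian noise.
import numpy as np

def map_classify(x, word_means, word_vars, noise_var, priors):
    """Return the index of the word maximizing the posterior p(word | x)."""
    log_posts = []
    for mu, var, prior in zip(word_means, word_vars, priors):
        total_var = var + noise_var                  # speech variance + noise variance
        ll = -0.5 * np.sum(np.log(2 * np.pi * total_var) + (x - mu) ** 2 / total_var)
        log_posts.append(ll + np.log(prior))
    return int(np.argmax(log_posts))

rng = np.random.default_rng(0)
n_words, dim = 10, 40
means = rng.normal(size=(n_words, dim))              # word templates (short-term spectra)
variances = np.full((n_words, dim), 0.5)
priors = np.full(n_words, 1.0 / n_words)             # equiprobable words
noise_var = 2.0                                      # noise power (low-SNR setting)

truth = 3
observation = means[truth] + rng.normal(scale=np.sqrt(variances[truth] + noise_var))
decision = map_classify(observation, means, variances, noise_var, priors)
```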

Single-Ended Prediction of Listening Effort Based on Automatic Speech Recognition

Rainer Huber, Constantin Spille, Bernd T. Meyer; Carl von Ossietzky Universität Oldenburg, Germany
Tue-O-4-2-5, Time: 14:50–15:10

A new, single-ended, i.e. reference-free, measure for the prediction of perceived listening effort of noisy speech is presented. It is based on phoneme posterior probabilities (or posteriorgrams) obtained from the deep neural network of an automatic speech recognition system. Additive noise or other distortions of speech tend to smear the posteriorgrams. The smearing is quantified by a performance measure, which is used as a predictor of the perceived listening effort required to understand the noisy speech. The proposed measure was evaluated using a database obtained from the subjective evaluation of noise reduction algorithms of commercial hearing aids. Listening effort ratings of processed noisy speech samples were gathered from 20 hearing-impaired subjects. Averaged subjective ratings were compared with corresponding predictions computed by the proposed new method, the ITU-T standard P.563 for single-ended speech quality assessment, the American National Standard ANIQUE+ for single-ended speech quality assessment, and a single-ended SNR estimator. The proposed method achieved a good correlation with mean subjective ratings and clearly outperformed the standard speech quality measures and the SNR estimator.

Modeling Categorical Perception with the Receptive Fields of Auditory Neurons

Chris Neufeld; University of Maryland, USA
Tue-O-4-2-6, Time: 15:10–15:30

This paper demonstrates that a low-level, linear description of the response properties of auditory neurons can exhibit some of the high-level properties of the categorical perception of human speech. In particular, it is shown that the non-linearities observed in the human perception of speech sounds which span a categorical boundary can be understood as arising rather naturally from a low-level statistical description of phonemic contrasts in the time-frequency plane, understood here as the receptive field of auditory neurons. The TIMIT database was used to train a model auditory neuron which discriminates between /s/ and /sh/, and a computer simulation was conducted which demonstrates that the neuron responds categorically to a linear continuum of synthetic fricative sounds which span the /s/-/sh/ boundary. The response of the model provides a good fit to human labeling behavior and, in addition, is able to account for asymmetries in reaction time across the two categories.

Tue-O-4-4 : Source Separation and Auditory Scene Analysis
B4, 13:30–15:30, Tuesday, 22 Aug. 2017
Chairs: Mahadeva Prasanna, Géza Németh

A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation

Yannan Wang 1, Jun Du 1, Li-Rong Dai 1, Chin-Hui Lee 2; 1USTC, China; 2Georgia Institute of Technology, USA
Tue-O-4-4-1, Time: 13:30–13:50

In contrast to the conventional minimum mean squared error (MMSE) training criterion for nonlinear spectral mapping based on deep neural networks (DNNs), we propose a probabilistic learning framework to estimate the DNN parameters for single-channel speech separation. A statistical analysis of the prediction error vector at the DNN output reveals that it follows a unimodal density for each log power spectral component. By characterizing the prediction error vector as a multivariate Gaussian density with zero mean vector and an unknown covariance matrix, we present a maximum likelihood (ML) approach to DNN parameter learning. Our experiments on the Speech Separation Challenge (SSC) corpus show that the proposed learning approach can achieve better generalization capability and faster convergence than MMSE-based DNN learning. Furthermore, we demonstrate that the ML-trained DNN consistently outperforms the MMSE-trained DNN in all the objective measures of speech quality and intelligibility in single-channel speech separation.

Deep Clustering-Based Beamforming for Separation with Unknown Number of Sources

Takuya Higuchi, Keisuke Kinoshita, Marc Delcroix, Katerina Žmolíková, Tomohiro Nakatani; NTT, Japan
Tue-O-4-4-2, Time: 13:50–14:10

This paper extends a deep clustering algorithm for use with time-frequency masking-based beamforming and performs separation with an unknown number of sources. Deep clustering is a recently proposed single-channel source separation algorithm, which projects inputs into an embedding space and performs clustering in the embedding domain. In deep clustering, bi-directional long short-term memory (BLSTM) recurrent neural networks are trained to make embedding vectors orthogonal for different speakers and concurrent for the same speaker. Then, by clustering the embedding vectors at test time, we can estimate time-frequency masks for separation. In this paper, we extend the deep clustering algorithm to a multiple-microphone setup and incorporate deep clustering-based time-frequency mask estimation into masking-based beamforming, which has been shown to be more effective than masking for automatic speech recognition. Moreover, we perform source counting by computing the rank of the covariance matrix of the embedding vectors. With our proposed approach, we can perform masking-based beamforming in a multiple-speaker case without knowing the number of speakers. Experimental results show that our proposed deep clustering-based beamformer achieves source separation performance comparable to that obtained with a complex Gaussian mixture model-based beamformer, which requires the number of sources in advance for mask estimation.
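The source-counting idea can be sketched as estimating the effective rank of the embedding second-moment matrix by an eigenvalue energy threshold; both the uncentered "covariance" and the 95% threshold are simplifying assumptions for illustration, not the paper's exact criterion.

```python
# Sketch: count sources as the number of dominant eigen-directions of the
# matrix formed from deep-clustering embedding vectors.
import numpy as np

def count_sources(embeddings, energy_threshold=0.95):
    """embeddings: (n_tf_bins, embed_dim) array of unit-norm embedding vectors."""
    cov = embeddings.T @ embeddings / len(embeddings)   # second-moment matrix
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]    # descending eigenvalues
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # smallest number of directions explaining most of the energy
    return int(np.searchsorted(cumulative, energy_threshold) + 1)

rng = np.random.default_rng(0)
# toy data: T-F bins scattered around two well-separated directions (two sources)
dirs = rng.normal(size=(2, 20))
emb = np.vstack([d + 0.05 * rng.normal(size=(500, 20)) for d in dirs])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
n_speakers = count_sources(emb)
```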

Time-Frequency Masking for Blind Source Separation with Preserved Spatial Cues

Shadi Pirhosseinloo, Kostas Kokkinakis; University of Kansas, USA
Tue-O-4-4-3, Time: 14:10–14:30

In this paper, we address the problem of speech source separation by relying on time-frequency binary masks to segregate binaural mixtures. We describe an algorithm which can tackle reverberant mixtures and can extract the original sources while preserving their original spatial locations. The performance of the proposed algorithm is evaluated objectively and subjectively, by assessing the estimated interaural time differences versus their theoretical values and by testing for localization acuity in normal-hearing listeners for different spatial locations in a reverberant room. Experimental results indicate that the proposed algorithm is capable of preserving the spatial information of the recovered source signals while keeping the signal-to-distortion and signal-to-interference ratios high.

Variational Recurrent Neural Networks for Speech Separation

Jen-Tzung Chien, Kuan-Ting Kuo; National Chiao Tung University, Taiwan
Tue-O-4-4-4, Time: 14:30–14:50

We present a new stochastic learning machine for speech separation based on the variational recurrent neural network (VRNN). This VRNN is constructed from the perspectives of the generative stochastic network and the variational auto-encoder. The idea is to faithfully characterize the randomness of the hidden state of a recurrent neural network through variational learning. The neural parameters under this latent variable model are estimated by maximizing the variational lower bound of the log marginal likelihood. An inference network driven by the variational distribution is trained from a set of mixed signals and the associated source targets. A novel supervised VRNN is developed for speech separation. The proposed VRNN provides a stochastic point of view which accommodates the uncertainty in hidden states and facilitates the analysis of model construction. The masking function is further employed in network outputs for speech separation. The benefit of using VRNN is demonstrated by experiments on monaural speech separation.

Detecting Overlapped Speech on Short Timeframes Using Deep Learning

Valentin Andrei, Horia Cucu, Corneliu Burileanu; UPB, Romania
Tue-O-4-4-5, Time: 14:50–15:10

The intent of this work is to demonstrate how deep learning techniques can be successfully used to detect overlapped speech on independent short timeframes. A secondary objective is to provide an understanding of how the duration of the signal frame influences the accuracy of the method. We trained a deep neural network with heterogeneous layers and obtained close to 80% inference accuracy on frames as short as 25 milliseconds. The proposed system provides higher detection quality than existing work and can predict overlapped speech with up to 3 simultaneous speakers. The method exhibits low response latency and does not require a large amount of computing power.

Ideal Ratio Mask Estimation Using Deep Neural Networks for Monaural Speech Segregation in Noisy Reverberant Conditions

Xu Li, Junfeng Li, Yonghong Yan; Chinese Academy of Sciences, China
Tue-O-4-4-6, Time: 15:10–15:30

Monaural speech segregation is an important problem in robust speech processing and has been formulated as a supervised learning problem. In supervised learning methods, the ideal binary mask (IBM) is usually used as the target because of its simplicity and large speech intelligibility gains. Recently, the ideal ratio mask (IRM) has been found to improve speech quality over the IBM. However, the IRM was originally defined in anechoic conditions and did not consider the effect of reverberation. In this paper, the IRM is extended to reverberant conditions where the direct sound and early reflections of target speech are regarded as the desired signal. Deep neural networks (DNNs) are employed to estimate the extended IRM in noisy reverberant conditions. The estimated IRM is then applied to the noisy reverberant mixture for speech segregation. Experimental results show that the estimated IRM provides substantial improvements in speech intelligibility and speech quality over the unprocessed mixture signals under various noisy and reverberant conditions.
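The extended mask definition can be sketched as a ratio of energies where the desired component is the direct sound plus early reflections and everything else (late reverberation and noise) is interference; the STFT parameters and the exponent are illustrative choices, not necessarily those of the paper.

```python
# Sketch of an extended ideal ratio mask in noisy reverberant conditions.
import numpy as np
import librosa

def extended_irm(direct_early, late_plus_noise, n_fft=512, hop=128, beta=0.5):
    """Both inputs are equal-length time-domain signals (separated offline)."""
    S = np.abs(librosa.stft(direct_early, n_fft=n_fft, hop_length=hop)) ** 2
    N = np.abs(librosa.stft(late_plus_noise, n_fft=n_fft, hop_length=hop)) ** 2
    return (S / (S + N + 1e-10)) ** beta            # mask values in [0, 1]

rng = np.random.default_rng(0)
direct_early = rng.normal(size=16000)               # stand-in for target component
late_and_noise = rng.normal(size=16000)             # stand-in for interference
irm = extended_irm(direct_early, late_and_noise)
mixture = direct_early + late_and_noise
enhanced_mag = irm * np.abs(librosa.stft(mixture, n_fft=512, hop_length=128))
```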

Tue-O-4-6 : Prosody: Tone and Intonation
C6, 13:30–15:30, Tuesday, 22 Aug. 2017
Chairs: Mariapaola D’Imperio, Oliver Niebuhr

The Vocative Chant and Beyond: German Calling Melodies Under Routine and Urgent Contexts

Sergio I. Quiroz, Marzena Zygis; Leibniz-ZAS, Germany
Tue-O-4-6-1, Time: 13:30–13:50

This paper investigates calling melodies produced by 21 Standard German native speakers in a discourse completion task across two contexts: (i) routine context — calling a child from afar to come in for dinner; (ii) urgent context — calling a child from afar for a chastising. The intent of this investigation is to bring attention to calling melodies found in German besides the vocative chant and to give an insight into their acoustic profile.

Three major melodies were identified in the two contexts: the vocative chant (100% of routine-context productions), the urgent call (100% of male urgent-context productions, 52.2% of female productions), and the stern call (47.8% of female urgent-context productions). A subsequent quantitative analysis was carried out on these calls across the following parameters: (i) tonal scaling at tonal landmarks; (ii) proportional alignment of selected tonal landmarks with respect to the stressed or last vowel; (iii) amplitude (integral and RMS); and (iv) duration of the stressed vowel, stressed syllable, and word. The resulting data were analyzed using a linear mixed model approach.

The results point to significant differences in the contours produced across the aforementioned parameters. We also propose a phonological description of the contours in the framework of Autosegmental-Metrical Phonology.


Comparing Languages Using Hierarchical Prosodic Analysis

Juraj Šimko, Antti Suni, Katri Hiovain, Martti Vainio; University of Helsinki, Finland
Tue-O-4-6-2, Time: 13:50–14:10

We present a novel, data-driven approach to assessing mutual similarities and differences among a group of languages, based on purely prosodic characteristics, namely f0 and energy envelope signals. These signals are decomposed using the continuous wavelet transform; the components represent f0 and energy patterns on three levels of the prosodic hierarchy, roughly corresponding to syllables, words and phrases. Unigram language models with states derived from a combination of Δ-features obtained from these components are trained and compared using a mutual perplexity measure. In this pilot study we apply this approach to a small corpus of spoken material from seven languages (Estonian, Finnish, Hungarian, German, Swedish, Russian and Slovak) with a rich history of mutual language contacts. We present similarity trees (dendrograms) derived from the models using the hierarchically decomposed prosodic signals separately as well as combined, and compare them with patterns obtained from non-decomposed signals. We show that (1) plausible similarity patterns, reflecting language family relationships and the known contact history, can be obtained even from a relatively small data set, and (2) the hierarchical decomposition approach using both f0 and energy provides the most comprehensive results.

Intonation Facilitates Prediction of Focus Even in the Presence of Lexical Tones

Martin Ho Kwan Ip, Anne Cutler; Western Sydney University, Australia
Tue-O-4-6-3, Time: 14:10–14:30

In English and Dutch, listeners entrain to prosodic contours to predict where focus will fall in an utterance. However, is this strategy universally available, even in languages with different phonological systems? In a phoneme detection experiment, we examined whether prosodic entrainment is also found in Mandarin Chinese, a tone language, where in principle the use of pitch for lexical identity may take precedence over the use of pitch cues to salience. Consistent with the results from Germanic languages, response times were facilitated when preceding intonation predicted accent on the target-bearing word. Acoustic analyses revealed greater F0 range in the preceding intonation of the predicted-accent sentences. These findings have implications for how universal and language-specific mechanisms interact in the processing of salience.

Mind the Peak: When Museum is Temporarily Understood as Musical in Australian English

Katharina Zahner 1, Heather Kember 2, Bettina Braun 1; 1Universität Konstanz, Germany; 2Western Sydney University, Australia
Tue-O-4-6-4, Time: 14:30–14:50

Intonation languages signal pragmatic functions (e.g. information structure) by means of different pitch accent types. Acoustically, pitch accent types differ in the alignment of pitch peaks (and valleys) in regard to stressed syllables, which makes the position of pitch peaks an unreliable cue to lexical stress (even though pitch peaks and lexical stress often coincide in intonation languages). We here investigate the effect of pitch accent type on lexical activation in English. Results of a visual-world eye-tracking study show that Australian English listeners temporarily activate SWW-words (musical) if presented with WSW-words (museum) with early-peak accents (H+!H*), compared to medial-peak accents (L+H*). Thus, in addition to signalling pragmatic functions, the alignment of tonal targets immediately affects lexical activation in English.

Pashto Intonation Patterns

Luca Rognoni, Judith Bishop, Miriam Corris; Appen, Australia
Tue-O-4-6-5, Time: 14:50–15:10

A hand-labelled Pashto speech data set containing spontaneous conversations is analysed in order to propose an intonational inventory of Pashto. Basic intonation patterns observed in the language are summarised. The relationship between pitch accent and part of speech (PoS), which was also annotated for each word in the data set, is briefly addressed.

The results are compared with the intonational literature on Persian, a better-described and closely related language. The results show that Pashto intonation patterns are similar to Persian, as well as reflecting common intonation patterns such as falling tone for statements and WH-questions, and yes/no questions ending in a rising tone. The data also show that the most frequently used intonation pattern in Pashto is the so-called hat pattern. The distribution of pitch accent is quite free both in Persian and Pashto, but there is a stronger association of pitch accent with content words than with function words, as is typical of stress-accent languages.

The phonetic realisation of focus appears to be conveyed with the same acoustic cues as in Persian, with a higher pitch excursion and longer duration of the stressed syllable of the word in focus. The data also suggest that post-focus compression (PFC) is present in Pashto.

A New Model of Final Lowering in Spontaneous Monologue

Kikuo Maekawa; NINJAL, Japan
Tue-O-4-6-6, Time: 15:10–15:30

The F0 downtrend observed in spontaneous monologues in the Corpus of Spontaneous Japanese was analyzed with special attention to the modeling of final lowering. In addition to the previous finding that the domain of final lowering covers all tones in the final accentual phrase, it turned out that the last L tone in the penultimate accentual phrase plays an important role in the control of final lowering. It is this tone that first reaches the bottom of the speaker’s pitch range in the time course of the utterance; it also turned out that the phonetic realization of this tone is the most stable of all tones in terms of F0 variability. A regression model of F0 downtrends was generated by generalized linear mixed-effects modeling and evaluated by cross-validation. The mean prediction error of z-normalized F0 values in the best model was 0.25 standard deviations.

Tue-O-4-8 : Emotion Modeling
D8, 13:30–15:30, Tuesday, 22 Aug. 2017
Chairs: Koichi Shinoda, Anton Batliner

Speech Emotion Recognition with Emotion-Pair Based Framework Considering Emotion Distribution Information in Dimensional Emotion Space

Xi Ma, Zhiyong Wu, Jia Jia, Mingxing Xu, Helen Meng, Lianhong Cai; Tsinghua University, China
Tue-O-4-8-1, Time: 13:30–13:50

In this work, an emotion-pair based framework is proposed for speech emotion recognition, which constructs more discriminative feature subspaces for every two different emotions (emotion pairs) to generate more precise emotion bi-classification results. Furthermore, it is found that in the dimensional emotion space, the distances between some of the archetypal emotions are smaller than others. Motivated by this, a Naive Bayes classifier based decision fusion strategy is proposed, which aims at capturing such emotion distribution information when deciding the final emotion category. We evaluated the classification framework on the USC IEMOCAP database. Experimental results demonstrate that the proposed method outperforms the hierarchical binary decision tree approach on both weighted accuracy (WA) and unweighted accuracy (UA). Moreover, our framework has the advantages that it can be generated fully automatically without empirical guidance and is easier to parallelize.

Adversarial Auto-Encoders for Speech Based Emotion Recognition

Saurabh Sahu 1, Rahul Gupta 2, Ganesh Sivaraman 1, Wael AbdAlmageed 3, Carol Espy-Wilson 1; 1University of Maryland, USA; 2Amazon.com, USA; 3University of Southern California, USA
Tue-O-4-8-2, Time: 13:50–14:10

Recently, generative adversarial networks and adversarial auto-encoders have gained a lot of attention in the machine learning community due to their exceptional performance in tasks such as digit classification and face recognition. They map the auto-encoder’s bottleneck layer output (termed code vectors) to different noise probability distribution functions (PDFs), which can be further regularized to cluster based on class information. In addition, they also allow the generation of synthetic samples by sampling the code vectors from the mapped PDFs. Inspired by these properties, we investigate the application of adversarial auto-encoders to the domain of emotion recognition. Specifically, we conduct experiments on the following two aspects: (i) their ability to encode high-dimensional feature vector representations for emotional utterances into a compressed space (with a minimal loss of emotion class discriminability in the compressed space), and (ii) their ability to regenerate synthetic samples in the original feature space, to be later used for purposes such as training emotion recognition classifiers. We demonstrate the promise of adversarial auto-encoders with regard to these aspects on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and present our analysis.

An Investigation of Emotion Prediction Uncertainty Using Gaussian Mixture Regression

Ting Dang, Vidhyasaharan Sethu, Julien Epps, Eliathamby Ambikairajah; University of New South Wales, Australia
Tue-O-4-8-3, Time: 14:10–14:30

Existing continuous emotion prediction systems implicitly assume that prediction certainty does not vary with time. However, perception differences among raters and other possible sources of variability suggest that prediction certainty varies with time, which warrants deeper consideration. In this paper, the correlation between inter-rater variability and the uncertainty of predicted emotion is first studied. A new paradigm that estimates the uncertainty in prediction is proposed based on the strong correlation uncovered in the RECOLA database. This is implemented by including the inter-rater variability as a representation of the uncertainty information in a probabilistic Gaussian Mixture Regression (GMR) model. In addition, we investigate the correlation between the uncertainty and the performance of a typical emotion prediction system utilizing the average rating as the ground truth, by comparing the prediction performance in the lower and higher uncertainty regions. As expected, performance in lower uncertainty regions is observed to be better than in higher uncertainty regions, providing a path for improving emotion prediction systems.

Capturing Long-Term Temporal Dependencies with Convolutional Networks for Continuous Emotion Recognition

Soheil Khorram 1, Zakaria Aldeneh 1, Dimitrios Dimitriadis 2, Melvin McInnis 1, Emily Mower Provost 1; 1University of Michigan, USA; 2IBM, USA
Tue-O-4-8-4, Time: 14:30–14:50

The goal of continuous emotion recognition is to assign an emotion value to every frame in a sequence of acoustic features. We show that incorporating long-term temporal dependencies is critical for continuous emotion recognition tasks. To this end, we first investigate architectures that use dilated convolutions. We show that even though such architectures outperform previously reported systems, the output signals produced from such architectures undergo erratic changes between consecutive time steps. This is inconsistent with the slowly moving ground-truth emotion labels that are obtained from human annotators. To deal with this problem, we model a downsampled version of the input signal and then generate the output signal through upsampling. Not only does the resulting downsampling/upsampling network achieve good performance, it also generates smooth output trajectories. Our method yields the best known audio-only performance on the RECOLA dataset.
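The downsampling/upsampling idea can be sketched with a small convolutional network: convolutions run at a coarse temporal rate and a linear upsampling stage maps the coarse trajectory back to the frame rate, which naturally smooths the output. Layer sizes and the pooling factor are illustrative assumptions.

```python
# Sketch of a downsampling/upsampling convolutional network for frame-level
# continuous emotion prediction (assumed shapes, not the authors' exact model).
import torch
import torch.nn as nn

class DownUpNet(nn.Module):
    def __init__(self, in_dim=40, hidden=64, factor=8):
        super().__init__()
        self.down = nn.Sequential(                       # operate at a coarse rate
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.AvgPool1d(kernel_size=factor),            # temporal downsampling
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )
        # linear upsampling back to the frame rate yields smooth trajectories
        self.up = nn.Upsample(scale_factor=factor, mode="linear", align_corners=False)

    def forward(self, x):                # x: (batch, in_dim, frames)
        return self.up(self.down(x)).squeeze(1)

model = DownUpNet()
features = torch.randn(4, 40, 800)       # 800 acoustic frames per sequence
trajectory = model(features)             # (4, 800) predicted emotion values
```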

Voice-to-Affect Mapping: Inferences on Language Voice Baseline Settings

Ailbhe Ní Chasaide, Irena Yanushevskaya, Christer Gobl; Trinity College Dublin, Ireland
Tue-O-4-8-5, Time: 14:50–15:10

Modulations of the voice convey affect, and the precise mapping of voice-to-affect may vary for different languages. However, affect-related modulations occur relative to the baseline affect-neutral voice, which tends to differ from language to language. Little is known about the characteristic long-term voice settings for different languages, and how they influence the use of voice quality to signal affect. In this paper, data from a voice-to-affect perception test involving Russian, English, Spanish and Japanese subjects are re-examined to glean insights concerning likely baseline settings in these languages. The test used synthetic stimuli with different voice qualities (modelled on a male voice), with or without extreme f0 contours as might be associated with affect. Cross-language differences in affect ratings for modal and tense voice suggest that the baseline in Spanish and Japanese is inherently tenser than in Russian and English, and that, as a corollary, tense voice serves as a more potent cue to high-activation affects in the latter languages. A relatively tenser baseline in Japanese and Spanish is further suggested by the fact that tense voice can be associated with intimate, a low-activation state, just as readily as with the high-activation state interested.

Attentive Convolutional Neural Network Based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech

Michael Neumann, Ngoc Thang Vu; Universität Stuttgart, Germany
Tue-O-4-8-6, Time: 15:10–15:30

Speech emotion recognition is an important and challenging task in the realm of human-computer interaction. Prior work proposed a variety of models and feature sets for training a system. In this work, we conduct extensive experiments using an attentive convolutional neural network with a multi-view learning objective function. We compare system performance using different lengths of the input signal, different types of acoustic features and different types of emotional speech (improvised/scripted). Our experimental results on the Interactive Emotional Motion Capture (IEMOCAP) database reveal that the recognition performance strongly depends on the type of speech data, independent of the choice of input features. Furthermore, we achieved state-of-the-art results on the improvised speech data of IEMOCAP.

Tue-O-4-10 : Voice Conversion 1
E10, 13:30–15:30, Tuesday, 22 Aug. 2017
Chairs: Hema Murthy, S.R.M. Prasanna

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari; University of Tokyo, Japan
Tue-O-4-10-1, Time: 13:30–13:50

Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality such as phonetic property and speaking rate contained in the posterior probabilities because the source posterior probabilities are directly used for predicting target speech parameters. In this work, we assume that the training data partly include parallel speech data and propose sequence-to-sequence learning between the source and target posterior probabilities. The conversion models perform non-linear and variable-length transformation from the source probability sequence to the target one. Further, we propose a joint training algorithm for the modules. In contrast to conventional VC, which separately trains the speech recognition module that estimates posterior probabilities and the speech synthesis module that predicts target speech parameters, our proposed method jointly trains these modules along with the proposed probability conversion modules. Experimental results demonstrate that our approach outperforms conventional VC.

Learning Latent Representations for Speech Generation and Transformation

Wei-Ning Hsu, Yu Zhang, James Glass; MIT, USA
Tue-O-4-10-2, Time: 13:50–14:10

An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to processing vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous success in modeling natural images. In this paper, we apply a convolutional VAE to model the generative process of natural speech. We derive latent space arithmetic operations to disentangle learned latent representations. We demonstrate the capability of our model to modify the phonetic content or the speaker identity of speech segments using the derived operations, without the need for parallel supervisory data.

Parallel-Data-Free Many-to-Many Voice Conversion Based on DNN Integrated with Eigenspace Using a Non-Parallel Speech Corpus

Tetsuya Hashimoto, Hidetsugu Uchida, Daisuke Saito, Nobuaki Minematsu; University of Tokyo, Japan
Tue-O-4-10-3, Time: 14:10–14:30

This paper proposes a novel approach to parallel-data-free and many-to-many voice conversion (VC). As 1-to-1 conversion has less flexibility, researchers have focused on many-to-many conversion, where speaker identity is often represented using speaker space bases. In this case, utterances of the same sentences have to be collected from many speakers. This study aims at overcoming this constraint to realize parallel-data-free and many-to-many conversion. This is made possible by integrating deep neural networks (DNNs) with an eigenspace constructed from a non-parallel speech corpus. In our previous study, many-to-many conversion was implemented using a DNN whose training was assisted by EVGMM conversion. By realizing the function of EVGMM equivalently through constructing the eigenspace with a non-parallel speech corpus, the desired conversion becomes possible. A key technique here is to estimate covariance terms without parallel data between source and target speakers. Experiments show that objective assessment scores are comparable to those of the baseline system trained with parallel data.

Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks

Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, Kunio Kashino; NTT, Japan
Tue-O-4-10-4, Time: 14:30–14:50

We propose a training framework for sequence-to-sequence voice conversion (SVC). A well-known problem with the conventional VC framework is that acoustic-feature sequences generated by a converter tend to be over-smoothed, resulting in buzzy-sounding speech. This is because a particular form of similarity metric or distribution is assumed for parameter training of the acoustic model, so that the generated feature sequence that fits the training target example on average is considered optimal. This over-smoothing occurs as long as a manually constructed similarity metric is used. To overcome this limitation, our proposed SVC framework uses a similarity metric implicitly derived from a generative adversarial network, enabling the measurement of the distance in a high-level abstract space. This enables the model to mitigate the over-smoothing problem arising in the low-level data space. Furthermore, we use convolutional neural networks to model the long-range context dependencies. This also gives the similarity metric a shift-invariant property, making the model robust against misalignment errors in the parallel data. We tested our framework on a non-native-to-native VC task. The experimental results revealed that the use of the proposed framework had a certain effect in improving naturalness, clarity, and speaker individuality.

A Mouth Opening Effect Based on Pole Modification for Expressive Singing Voice Transformation

Luc Ardaillon, Axel Roebel; STMS (UMR 9912), France
Tue-O-4-10-5, Time: 14:50–15:10

Improving expressiveness in singing voice synthesis systems requires performing realistic timbre transformations, e.g. for varying voice intensity. In order to sing louder, singers tend to open their mouths more widely, which changes the vocal tract’s shape and resonances. This study shows, by means of signal analysis and simulations, that the main effect of mouth opening is an increase of the first formant’s frequency (F1) and a decrease of its bandwidth (BW1). From these observations, we then propose a rule for producing a mouth opening effect by modifying F1 and BW1, and an approach to apply this effect to real voice sounds. This approach is based on pole modification, changing the AR coefficients of an estimated all-pole model of the spectral envelope. Finally, listening tests were conducted to evaluate the effectiveness of the proposed effect.

Siamese Autoencoders for Speech Style Extraction and Switching Applied to Voice Identification and Conversion

Seyed Hamidreza Mohammadi, Alexander Kain; Oregon Health & Science University, USA
Tue-O-4-10-6, Time: 15:10–15:30

We propose an architecture called siamese autoencoders for extracting and switching pre-determined styles of speech signals while retaining the content. We apply this architecture to a voice conversion task in which we define the content to be the linguistic message and the style to be the speaker’s voice. We assume two or more data streams with the same content but unique styles. The architecture is composed of two or more separate but shared-weight autoencoders that are joined by loss functions at the hidden layers. A hidden vector is composed of style and content sub-vectors, and the loss functions constrain the encodings to decompose style and content. We can select an intended target speaker either by supplying the associated style vector, or by extracting a new style vector from a new utterance using a proposed style extraction algorithm. We focus on in-training speakers but perform some initial experiments for out-of-training speakers as well. We propose and study several types of loss functions. The experimental results show that the proposed many-to-many model is able to convert voices successfully; however, its performance does not surpass that of the state-of-the-art one-to-one model.

Tue-O-5-1 : Neural Network Acoustic Models for ASR 2
Aula Magna, 16:00–18:00, Tuesday, 22 Aug. 2017
Chairs: Mark Gales, Tara Sainath

Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping

Hasim Sak, Matt Shannon, Kanishka Rao, Françoise Beaufays; Google, USA
Tue-O-5-1-1, Time: 16:00–16:20

We introduce an encoder-decoder recurrent neural network model called Recurrent Neural Aligner (RNA) that can be used for sequence-to-sequence mapping tasks. Like connectionist temporal classification (CTC) models, RNA defines a probability distribution over target label sequences, including blank labels corresponding to each time step in the input. The probability of a label sequence is calculated by marginalizing over all possible blank label positions. Unlike CTC, RNA does not make a conditional independence assumption for label predictions; it uses the predicted label at time t-1 as an additional input to the recurrent model when predicting the label at time t. We apply this model to end-to-end speech recognition. RNA is capable of streaming recognition since the decoder does not employ an attention mechanism. The model is trained on transcribed acoustic data to predict graphemes, and no external language or pronunciation models are used for decoding. We employ an approximate dynamic programming method to optimize the negative log-likelihood, and a sampling-based sequence discriminative training technique to fine-tune the model to minimize the expected word error rate. We show that the model achieves competitive accuracy without using an external language model or doing beam search decoding.

Highway-LSTM and Recurrent Highway Networks for Speech Recognition

Golan Pundak, Tara N. Sainath; Google, USA
Tue-O-5-1-2, Time: 16:20–16:40

Recently, very deep networks, with as many as hundreds of layers, have shown great success in image classification tasks. One key component that has enabled such deep models is the use of “skip connections”, including either residual or highway connections, to alleviate the vanishing and exploding gradient problems. While these connections have been explored for speech, they have mainly been explored for feed-forward networks. Since recurrent structures, such as LSTMs, have produced state-of-the-art results on many of our Voice Search tasks, the goal of this work is to thoroughly investigate different approaches to adding depth to recurrent structures. Specifically, we experiment with novel Highway-LSTM models with bottleneck skip connections and show that a 10-layer model can outperform a state-of-the-art 5-layer LSTM model with the same number of parameters by 2% relative WER. In addition, we experiment with Recurrent Highway layers and find these to be on par with Highway-LSTM models, when given sufficient depth.
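As a simplified reading of highway skip connections between stacked recurrent layers, the sketch below gates each LSTM layer's output against its input; the gating placement, dimensions and depth are assumptions and do not reproduce the exact Highway-LSTM variant of the paper.

```python
# Sketch: highway-style gated skip connection around each LSTM layer in a deep stack.
import torch
import torch.nn as nn

class HighwayLSTMLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.gate = nn.Linear(dim, dim)               # transform gate T(x)

    def forward(self, x):                              # x: (batch, time, dim)
        h, _ = self.lstm(x)
        t = torch.sigmoid(self.gate(x))
        return t * h + (1.0 - t) * x                   # gated mix of output and input

layers = nn.Sequential(*[HighwayLSTMLayer(128) for _ in range(10)])
out = layers(torch.randn(8, 50, 128))                  # a 10-layer recurrent stack
```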

Improving Speech Recognition by Revising Gated Recurrent Units

Mirco Ravanelli 1, Philemon Brakel 2, Maurizio Omologo 1, Yoshua Bengio 2; 1FBK, Italy; 2Université de Montréal, Canada
Tue-O-5-1-3, Time: 16:40–17:00

Speech recognition is largely taking advantage of deep learning, showing that substantial benefits can be obtained by modern Recurrent Neural Networks (RNNs). The most popular RNNs are Long Short-Term Memory (LSTM) networks, which typically reach state-of-the-art performance in many tasks thanks to their ability to learn long-term dependencies and their robustness to vanishing gradients. Nevertheless, LSTMs have a rather complex design with three multiplicative gates, which might impair their efficient implementation. An attempt to simplify LSTMs has recently led to Gated Recurrent Units (GRUs), which are based on just two multiplicative gates.

This paper builds on these efforts by further revising GRUs and proposing a simplified architecture potentially more suitable for speech recognition. The contribution of this work is two-fold. First, we suggest removing the reset gate from the GRU design, resulting in a more efficient single-gate architecture. Second, we propose to replace tanh with ReLU activations in the state update equations. Results show that, in our implementation, the revised architecture reduces the per-epoch training time by more than 30% and consistently improves recognition performance across different tasks, input features, and noisy conditions when compared to a standard GRU.
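The two revisions named above (no reset gate, ReLU instead of tanh) can be written down directly as a recurrent cell; this is a minimal reimplementation for illustration, omitting details such as any normalization the authors may use.

```python
# Sketch of a single-gate GRU variant: the reset gate is removed and ReLU
# replaces tanh in the candidate-state computation.
import torch
import torch.nn as nn

class RevisedGRUCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.wz = nn.Linear(input_size, hidden_size)              # update gate (input)
        self.uz = nn.Linear(hidden_size, hidden_size, bias=False) # update gate (recurrent)
        self.wh = nn.Linear(input_size, hidden_size)              # candidate (input)
        self.uh = nn.Linear(hidden_size, hidden_size, bias=False) # candidate (recurrent)

    def forward(self, x_t, h_prev):
        z = torch.sigmoid(self.wz(x_t) + self.uz(h_prev))         # single gate
        h_cand = torch.relu(self.wh(x_t) + self.uh(h_prev))       # ReLU, no reset gate
        return z * h_prev + (1.0 - z) * h_cand

cell = RevisedGRUCell(input_size=40, hidden_size=128)
h = torch.zeros(8, 128)
for t in range(20):                          # unroll over a toy 20-frame sequence
    h = cell(torch.randn(8, 40), h)
```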

Stochastic Recurrent Neural Network for Speech Recognition

Jen-Tzung Chien, Chen Shen; National Chiao Tung University, Taiwan
Tue-O-5-1-4, Time: 17:00–17:20

This paper presents a new stochastic learning approach to constructing a latent variable model for recurrent neural network (RNN) based speech recognition. A hybrid generative and discriminative stochastic network is implemented to build a deep classification model. In the implementation, we conduct stochastic modeling of the hidden states of the recurrent neural network based on the variational auto-encoder. The randomness of hidden neurons is represented by a Gaussian distribution with mean and variance parameters driven by neural weights and learned from variational inference. Importantly, the class labels of input speech frames are incorporated to regularize this deep model to sample informative and discriminative features for reconstruction of classification outputs. We accordingly propose the stochastic RNN (SRNN) to reflect the probabilistic property in an RNN classification system. A stochastic error backpropagation algorithm is implemented. Experiments on speech recognition using TIMIT and Aurora4 show the merit of the proposed SRNN.

Frame and Segment Level Recurrent Neural Networks for Phone Classification

Martin Ratajczak 1, Sebastian Tschiatschek 2, Franz Pernkopf 1; 1Technische Universität Graz, Austria; 2ETH Zürich, Switzerland
Tue-O-5-1-5, Time: 17:20–17:40

We introduce a simple and efficient frame and segment level RNN model (FS-RNN) for phone classification. It processes the input at frame level and segment level by bidirectional gated RNNs. This type of processing is important to exploit the (temporal) information more effectively compared to (i) models which solely process the input at frame level and (ii) models which process the input at segment level using features obtained by heuristic aggregation of frame level features. Furthermore, we incorporated the activations of the last hidden layer of the FS-RNN as an additional feature type in a neural higher-order CRF (NHO-CRF). In experiments, we demonstrated excellent performance on the TIMIT phone classification task, reporting a phone error rate of 13.8% for the FS-RNN model and 11.9% when combined with the NHO-CRF. In both cases we significantly exceeded the state-of-the-art performance.

Deep Learning-Based Telephony Speech Recognition in the Wild

Kyu J. Han, Seongjun Hahm, Byung-Hak Kim, Jungsuk Kim, Ian Lane; Capio, USA
Tue-O-5-1-6, Time: 17:40–18:00

In this paper, we explore the effectiveness of a variety of deep learning-based acoustic models for conversational telephony speech, specifically TDNN, bLSTM and CNN-bLSTM models. We evaluated these models on research test sets, such as Switchboard and CallHome, as well as on recordings from a real-world call-center application. Our best single system, consisting of a single CNN-bLSTM acoustic model, obtained a WER of 5.7% on the Switchboard test set, and in combination with other models a WER of 5.3% was obtained. On the CallHome test set a WER of 10.1% was achieved with model combination. On the test data collected from real-world call-centers, even with model adaptation using application-specific data, the WER was significantly higher at 15.0%. We performed an error analysis on the real-world data and highlight the areas where speech recognition still has challenges.

Tue-O-5-2 : Speaker Recognition Evaluation
A2, 16:00–18:00, Tuesday, 22 Aug. 2017
Chairs: Kong Aik Lee, Rahim Saeidi

The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016

Kong Aik Lee 1, SRE’16 I4U Group; 1A*STAR, Singapore
Tue-O-5-2-1, Time: 16:00–16:20

The 2016 speaker recognition evaluation (SRE’16) is the latest edition in the series of benchmarking events conducted by the National Institute of Standards and Technology (NIST). I4U is a joint entry to SRE’16 resulting from the collaboration and active exchange of information among researchers from sixteen institutes and universities across four continents. The joint submission and several of its 32 sub-systems were among the top-performing systems. A lot of effort has been devoted to two major challenges, namely, unlabeled training data and the dataset shift from Switchboard-Mixer to the new Call My Net dataset. This paper summarizes the lessons learned and presents our shared view from the sixteen research groups on recent advances, major paradigm shifts, and the common tool chain used in speaker recognition as we have witnessed in SRE’16. More importantly, we look into the intriguing question of fusing a large ensemble of sub-systems and the potential benefit of large-scale collaboration.

The MIT-LL, JHU and LRDE NIST 2016 Speaker Recognition Evaluation System

Pedro A. Torres-Carrasquillo 1, Fred Richardson 1, Shahan Nercessian 1, Douglas Sturim 1, William Campbell 1, Youngjune Gwon 1, Swaroop Vattam 1, Najim Dehak 2, Harish Mallidi 2, Phani Sankar Nidadavolu 2, Ruizhi Li 2, Reda Dehak 3; 1MIT Lincoln Laboratory, USA; 2Johns Hopkins University, USA; 3EPITA LRDE, France
Tue-O-5-2-2, Time: 16:20–16:40

In this paper, the NIST 2016 SRE system that resulted from the collaboration between MIT Lincoln Laboratory and the team at Johns Hopkins University is presented. The submissions for the 2016 evaluation consisted of three fixed condition submissions and a single system open condition submission. The primary submission on the fixed (and core) condition resulted in an actual DCF of 0.618. Details of the submissions are presented, along with observations from the 2016 evaluation campaign.

Nuance - Politecnico di Torino’s 2016 NIST Speaker Recognition Evaluation System

Daniele Colibro 1, Claudio Vair 1, Emanuele Dalmasso 1, Kevin Farrell 2, Gennady Karvitsky 3, Sandro Cumani 4, Pietro Laface 4; 1Nuance Communications, Italy; 2Nuance Communications, USA; 3Nuance Communications, Israel; 4Politecnico di Torino, Italy
Tue-O-5-2-3, Time: 16:40–17:00

This paper describes the Nuance–Politecnico di Torino (NPT) speaker recognition system submitted to the NIST SRE16 evaluation campaign. Included are the results of post-evaluation tests, focusing on the analysis of the performance of generative and discriminative classifiers, and of score normalization. The submitted system combines the results of four GMM-IVector models, two DNN-IVector models and a GMM-SVM acoustic system. Each system exploits acoustic front-end parameters that differ by feature type and dimension. We analyze the main components of our submission, which contributed to obtaining 8.1% EER and 0.532 actual Cprimary in the challenging SRE16 Fixed condition.

UTD-CRSS Systems for 2016 NIST Speaker Recognition Evaluation

Chunlei Zhang, Fahimeh Bahmaninezhad, Shivesh Ranjan, Chengzhu Yu, Navid Shokouhi, John H.L. Hansen; University of Texas at Dallas, USA
Tue-O-5-2-4, Time: 17:00–17:20

This study describes systems submitted by the Center for Robust Speech Systems (CRSS) from the University of Texas at Dallas (UTD) to the 2016 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE). We developed 4 UBM and DNN i-vector based speaker recognition systems with alternate data sets and feature representations. Given that the emphasis of the NIST SRE 2016 is on language mismatch between training and enrollment/test data, so-called domain mismatch, in our system development we focused on: (i) utilizing unlabeled in-domain data for centralizing i-vectors to alleviate the domain mismatch; (ii) selecting the proper data sets and optimizing configurations for training LDA/PLDA; (iii) introducing a newly proposed dimension reduction technique which incorporates unlabeled in-domain data before PLDA training; (iv) unsupervised speaker clustering of unlabeled data and using them alone or with previous SREs for PLDA training, and finally (v) score calibration using unlabeled data with “pseudo” speaker labels generated from speaker clustering. NIST evaluations show that our proposed methods were very successful for the given task.
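
Item (i) above, centering i-vectors with unlabeled in-domain data, is essentially mean subtraction (often followed by length normalization); the snippet below is a generic sketch of that step, not the CRSS implementation.

import numpy as np

def center_and_length_normalize(ivectors, in_domain_ivectors):
    """Shift i-vectors by the mean of unlabeled in-domain data,
    then project each one onto the unit sphere."""
    centered = ivectors - in_domain_ivectors.mean(axis=0)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.maximum(norms, 1e-12)

# toy usage with 400-dimensional i-vectors
rng = np.random.default_rng(1)
eval_iv = rng.normal(size=(50, 400))
unlabeled_iv = rng.normal(loc=0.3, size=(2000, 400))   # in-domain pool
eval_iv_adapted = center_and_length_normalize(eval_iv, unlabeled_iv)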

Analysis and Description of ABC Submission to NIST SRE 2016

Oldrich Plchot 1, Pavel Matejka 1, Anna Silnova 1, Ondrej Novotný 1, Mireia Diez Sánchez 1, Johan Rohdin 1, Ondrej Glembek 1, Niko Brümmer 2, Albert Swart 2, Jesús Jorrín-Prieto 3, Paola García 3, Luis Buera 3, Patrick Kenny 4, Jahangir Alam 4, Gautam Bhattacharya 4; 1Brno University of Technology, Czech Republic; 2Nuance Communications, South Africa; 3Nuance Communications, Spain; 4CRIM, Canada
Tue-O-5-2-5, Time: 17:20–17:40

We present a condensed description and analysis of the joint submission for NIST SRE 2016, by Agnitio, BUT and CRIM (ABC). We concentrate on challenges that arose during development and we analyze the results obtained on the evaluation data and on our development sets. We show that testing on mismatched, non-English and short duration data introduced in NIST SRE 2016 is a difficult problem for current state-of-the-art systems. Testing on this data brought back the issue of score normalization and it also revealed that the bottleneck features (BN), which are superior when used for telephone English, are lacking in performance against the standard acoustic features like Mel Frequency Cepstral Coefficients (MFCCs). We offer ABC’s insights, findings and suggestions for building a robust system suitable for mismatched, non-English and relatively noisy data such as those in NIST SRE 2016.

The 2016 NIST Speaker Recognition Evaluation

Seyed Omid Sadjadi 1, Timothée Kheyrkhah 1, Audrey Tong 1, Craig Greenberg 1, Douglas Reynolds 2, Elliot Singer 2, Lisa Mason 3, Jaime Hernandez-Cordero 3; 1NIST, USA; 2MIT Lincoln Laboratory, USA; 3DoD, USA
Tue-O-5-2-6, Time: 17:40–18:00

In 2016, the National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE) to foster research in robust text-independent speaker recognition, as well as to measure the performance of current state-of-the-art systems. Compared to previous NIST SREs, SRE16 introduced several new aspects including: an entirely online evaluation platform, a fixed training data condition, more variability in test segment duration (uniformly distributed between 10s and 60s), the use of non-English (Cantonese, Cebuano, Mandarin and Tagalog) conversational telephone speech (CTS) collected outside North America, and labeled and unlabeled development (a.k.a. validation) sets for system hyperparameter tuning and adaptation. The introduction of the new non-English CTS data made SRE16 more challenging due to domain/channel and language mismatches as compared to previous SREs. A total of 66 research organizations from industry and academia registered for SRE16, out of which 43 teams submitted 121 valid system outputs that produced scores. This paper presents an overview of the evaluation and an analysis of system performance over all primary evaluation conditions. Initial results indicate that effective use of the development data was essential for the top performing systems, and that domain/channel, language, and duration mismatch had an adverse impact on system performance.

Tue-O-5-4 : Glottal Source Modeling
B4, 16:00–18:00, Tuesday, 22 Aug. 2017
Chairs: João Cabral, Thomas Drugman

A New Cosine Series Antialiasing Function and its Application to Aliasing-Free Glottal Source Models for Speech and Singing Synthesis

Hideki Kawahara 1, Ken-Ichi Sakakibara 2, Masanori Morise 3, Hideki Banno 4, Tomoki Toda 5, Toshio Irino 1; 1Wakayama University, Japan; 2Health Science University of Hokkaido, Japan; 3University of Yamanashi, Japan; 4Meijo University, Japan; 5Nagoya University, Japan
Tue-O-5-4-1, Time: 16:00–16:20

We formulated and implemented a procedure to generate aliasing-free excitation source signals. It uses a new antialiasing filter in the continuous time domain followed by an IIR digital filter for response equalization. We introduced a cosine-series-based general design procedure for the new antialiasing function. We applied this new procedure to implement the antialiased Fujisaki-Ljungqvist model. We also applied it to revise our previous implementation of the antialiased Fant-Liljencrants model. A combination of these signals and a lattice implementation of the time-varying vocal tract model provides a reliable and flexible basis to test fo extractors and source aperiodicity analysis methods. MATLAB implementations of these antialiased excitation source models are available as part of our open source tools for speech science.


Speaking Style Conversion from Normal to Lombard Speech Using a Glottal Vocoder and Bayesian GMMs

Ana Ramírez López, Shreyas Seshadri, Lauri Juvela, Okko Räsänen, Paavo Alku; Aalto University, Finland
Tue-O-5-4-2, Time: 16:20–16:40

Speaking style conversion is the technology of converting natural speech signals from one style to another. In this study, we focus on normal-to-Lombard conversion. This can be used, for example, to enhance the intelligibility of speech in noisy environments. We propose a parametric approach that uses a vocoder to extract speech features. These features are mapped using Bayesian GMMs from utterances spoken in normal style to the corresponding features of Lombard speech. Finally, the mapped features are converted to a Lombard speech waveform with the vocoder. Two vocoders were compared in the proposed normal-to-Lombard conversion: a recently developed glottal vocoder that decomposes speech into glottal flow excitation and vocal tract, and the widely used STRAIGHT vocoder. The conversion quality was evaluated in two subjective listening tests measuring subjective similarity and naturalness. The similarity test results show that the system is able to convert normal speech into Lombard speech for the two vocoders. However, the subjective naturalness of the converted Lombard speech was clearly better using the glottal vocoder in comparison to STRAIGHT.

Reducing Mismatch in Training of DNN-Based Glottal Excitation Models in a Statistical Parametric Text-to-Speech System

Lauri Juvela 1, Bajibabu Bollepalli 1, Junichi Yamagishi 2, Paavo Alku 1; 1Aalto University, Finland; 2NII, Japan
Tue-O-5-4-3, Time: 16:40–17:00

Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to give improved quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates a mismatch between training and synthesis, as the synthesized acoustic features used as the excitation model input differ from the original inputs with which the model was trained. Furthermore, due to errors in predicting the vocal tract filter, the original excitation waveforms do not provide perfect reconstruction of the speech waveform even if predicted without error. To address these issues and to make the excitation model more robust against errors in acoustic modeling, this paper proposes two modifications to the excitation model training scheme. First, the excitation model is trained in a connected manner, with inputs generated by the acoustic model. Second, the target glottal waveforms are re-estimated by performing glottal inverse filtering with the predicted vocal tract filters. The results show that both of these modifications improve performance measured in MSE and MFCC distortion, and slightly improve the subjective quality of the synthetic speech.

Semi Parametric Concatenative TTS with Instant Voice Modification Capabilities

Alexander Sorin, Slava Shechtman, Asaf Rendel; IBM, Israel
Tue-O-5-4-4, Time: 17:00–17:20

Recently, a glottal vocoder has been integrated into the IBM concatenative TTS system and certain configurable global voice transformations were defined in the vocoder parameter space. The vocoder analysis employs a novel robust glottal source parameter estimation strategy. The vocoder is applied to voiced speech only, while unvoiced speech is kept unparameterized, thus contributing to the perceived naturalness of the synthesized speech.

The semi-parametric system enables independent modification of the glottal source and vocal tract components on-the-fly by embedding the voice transformations in the synthesis process. The effect of the transformations ranges from slight voice altering to a complete change of the perceived speaker personality. Pitch modifications enhance these changes. At the same time, the voice transformations are simple enough to be easily controlled externally to the system. This allows users either to fine-tune the voice sound or to instantly create multiple distinct virtual voices. In both cases, the synthesis is based on a large and meticulously cleaned concatenative TTS voice with broad phonetic coverage. In this paper we present the system and provide subjective evaluations of its voice modification capabilities.

The technology presented in this paper is implemented in the IBM Watson TTS service.

Modeling Laryngeal Muscle Activation Noise for Low-Order Physiological Based Speech Synthesis

Rodrigo Manríquez 1, Sean D. Peterson 2, Pavel Prado 1, Patricio Orio 3, Matías Zañartu 1; 1Universidad Técnica Federico Santa María, Chile; 2University of Waterloo, Canada; 3Universidad de Valparaíso, Chile
Tue-O-5-4-5, Time: 17:20–17:40

Physiologically based synthesis using low-order lumped-mass models of phonation has been shown to mimic and predict complex physical phenomena observed in normal and pathological speech production, and has received significant attention due to its ability to efficiently perform comprehensive parametric investigations that are cost prohibitive with more advanced computational tools. Even though these numerical models have been shown to be useful research and clinical tools, several of their physiological aspects remain to be explored. One key component that has been neglected is the natural fluctuation of laryngeal muscle activity, which affects the configuration of the model parameters. In this study, a physiologically based laryngeal muscle activation model that accounts for random fluctuations is proposed. The method is expected to improve the ability to model muscle-related pathologies, such as muscle tension dysphonia and Parkinson’s disease. The mathematical framework and underlying assumptions are described, and the effects of the added random muscle activity are tested in a well-known body-cover model of the vocal folds with acoustic propagation and interaction. Initial simulations illustrate that the random fluctuations in the muscle activity impact the resulting kinematics to varying degrees depending on the laryngeal configuration.

Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis

Felipe Espic, Cassia Valentini Botinhao, Simon King; University of Edinburgh, UK
Tue-O-5-4-6, Time: 17:40–18:00

We propose a simple new representation for the FFT spectrum tailored to statistical parametric speech synthesis. It consists of four feature streams that describe magnitude, phase and fundamental frequency using real numbers. The proposed feature extraction method does not attempt to decompose the speech structure (e.g., into source+filter or harmonics+noise). By avoiding the simplifications inherent in decomposition, we can dramatically reduce the “phasiness” and “buzziness” typical of most vocoders. The method uses simple and computationally cheap operations and can operate at a lower frame rate than the 200 frames per second typical in many systems. It avoids heuristics and methods requiring approximate or iterative solutions, including phase unwrapping.

Two DNN-based acoustic models were built — from male and female speech data — using the Merlin toolkit. Subjective comparisons were made with a state-of-the-art baseline, using the STRAIGHT vocoder. In all variants tested, and for both male and female voices, the proposed method substantially outperformed the baseline. We provide source code to enable our complete system to be replicated.

Tue-O-5-6 : Prosody: Rhythm, Stress, Quantity and Phrasing
C6, 16:00–18:00, Tuesday, 22 Aug. 2017
Chairs: Plinio Barbosa, Pärtel Lippus

Similar Prosodic Structure Perceived Differently in German and English

Heather Kember 1, Ann-Kathrin Grohe 2, Katharina Zahner 3, Bettina Braun 3, Andrea Weber 2, Anne Cutler 1; 1Western Sydney University, Australia; 2Universität Tübingen, Germany; 3Universität Konstanz, Germany
Tue-O-5-6-1, Time: 16:00–16:20

English and German have similar prosody, but their speakers realize some pitch falls (not rises) in subtly different ways. We here test for asymmetry in perception. An ABX discrimination task requiring F0 slope or duration judgements on isolated vowels revealed no cross-language difference in duration or F0 fall discrimination, but discrimination of rises (realized similarly in each language) was less accurate for English than for German listeners. This unexpected finding may reflect greater sensitivity to rising patterns by German listeners, or reduced sensitivity by English listeners as a result of extensive exposure to phrase-final rises (“uptalk”) in their language.

Disambiguate or not? — The Role of Prosody in Unambiguous and Potentially Ambiguous Anaphora Production in Strictly Mandarin Parallel Structures

Luying Hou, Bert Le Bruyn, René Kager; Universiteit Utrecht, The Netherlands
Tue-O-5-6-2, Time: 16:20–16:40

It has been observed that the interpretation of pronouns can depend on their accentuation patterns in parallel sentences such as “John hit Bill and then George hit him”, in which ‘him’ refers to Bill when unaccented but shifts to John when accented. While accentuation is widely regarded as a means of disambiguation, some studies have noticed that it also extends to unambiguous anaphors [7–10]. From the perspective of production, however, no strong experimental confirmation has been found for the ‘shift’ function of accented pronouns, because production research has mainly focused on corpora [5, 6]. Hence, the nature of the accent on anaphors remains obscure. By manipulating referential shift and ambiguity, this study explores the role of prosody in anaphora production in strictly Mandarin parallel structures. The results reveal a significantly higher F0 and longer duration for anaphors in referentially shifted conditions, suggesting that anaphoric accentuation signals a referential change in strictly parallel structures in Mandarin. No evidence was found that ambiguity plays a role in anaphoric accentuation. This finding challenges the general view on accented pronouns and deepens our understanding of the semantics-prosody relationship.

Acoustic Properties of Canonical and Non-Canonical Stress in French, Turkish, Armenian and Brazilian Portuguese

Angeliki Athanasopoulou 1, Irene Vogel 2, Hossep Dolatian 2; 1University of California at San Diego, USA; 2University of Delaware, USA
Tue-O-5-6-3, Time: 16:40–17:00

Languages are often categorized as having either predictable (fixed or quantity-sensitive) or non-predictable stress. Despite their name, fixed stress languages may have exceptions, so in fact, their stress does not always appear in the same position. Since predictability has been shown to affect certain speech phenomena, with additional or redundant acoustic cues being provided when the linguistic content is less predictable (e.g., the Smooth Signal Redundancy Hypothesis), we investigate whether, and to what extent, the predictability of stress position affects the manifestation of stress in different languages. We examine the acoustic properties of stress in three languages classified as having fixed stress with exceptions (Turkish, French, Armenian), and in one language with non-predictable stress, Brazilian Portuguese. Specifically, we compare the manifestation of stress in the canonical stress (typically “fixed”) position with its manifestation in the non-canonical (exceptional) position, where it would potentially be less predictable. We also compare these patterns with the manifestation of stress in Portuguese, in both the “default” penultimate and the less common final position. Our results show that stress is manifested quite similarly in canonical and non-canonical positions in the “fixed” stress languages and that stress is most clearly produced when it is least predictable.

Phonological Complexity, Segment Rate and Speech Tempo Perception

Leendert Plug 1, Rachel Smith 2; 1University of Leeds, UK; 2University of Glasgow, UK
Tue-O-5-6-4, Time: 17:00–17:20

Studies of speech tempo commonly use syllable or segment rate as a proxy measure for perceived tempo. In languages whose phonologies allow substantial syllable complexity these measures can produce figures on quite different scales; however, little is known about the correlation between syllable and segment rate measurements on the one hand and naïve listeners’ tempo judgements on the other.

We follow up on the findings of one relevant study on German [1], which suggest that listeners attend to both syllable and segment rates in making tempo estimates, through a weighted average of the rates in which syllable rate carries more weight. We report on an experiment in which we manipulate phonological complexity in English utterance pairs that are constant in syllable rate. Listeners decide for each pair which utterance sounds faster. Our results suggest that differences in segment rate that do not correspond to differences in syllable rate have little impact on perceived speech tempo in English.

On the Duration of Mandarin Tones

Jing Yang 1, Yu Zhang 2, Aijun Li 3, Li Xu 2; 1University of Central Arkansas, USA; 2Ohio University, USA; 3Chinese Academy of Social Sciences, China
Tue-O-5-6-5, Time: 17:20–17:40

The present study compared the duration of Mandarin tones in three types of speech contexts: isolated monosyllables, formal text-reading passages, and casual conversations. A total of 156 adult speakers were recruited. The speech materials included 44 monosyllables recorded from each of 121 participants, 18 passages read by 2 participants, and 20 conversations conducted by 33 participants. The duration pattern of the four lexical tones in the isolated monosyllables was consistent with the pattern described in previous literature. However, the duration of the four lexical tones became much shorter and tended to converge to that of the neutral tone (i.e., tone 0) in the text-reading and conversational speech. A maximum-likelihood estimator revealed that the durational cue contributed to tone recognition in the isolated monosyllables. With a single speaker, the average tone recognition based on duration alone could reach approximately 65% correct. As the number of speakers increased (e.g., ≥ 4), tone recognition performance dropped to approximately 45% correct. In conversational speech, the maximum likelihood estimation of tones based on duration cues was only 23% correct. Tone duration thus provides little useful cue for differentiating Mandarin tonal identity in everyday situations.

The Formant Dynamics of Long Close Vowels in Three Varieties of Swedish

Otto Ewald 1, Eva Liina Asu 2, Susanne Schötz 1; 1Lund University, Sweden; 2University of Tartu, Estonia
Tue-O-5-6-6, Time: 17:40–18:00

This study compares the acoustic realisation of /i: y: 0: u:/ in three varieties of Swedish: Central Swedish, Estonian Swedish, and Finland Swedish. Vowel tokens were extracted from isolated words produced by six elderly female speakers from each variety. Trajectories of the first three formants were modelled with discrete cosine transform (DCT) coefficients, enabling the comparison of the formant means as well as the direction and magnitude of the formant movement. Cross-dialectal differences were found in all measures and in all vowels. The most noteworthy feature of the Estonian Swedish long close vowel inventory is the lack of /y:/. For Finland Swedish it was shown that /i:/ and /y:/ are more close than in Central Swedish. The realisation of /0:/ varies from front in Central Swedish, to central in Estonian Swedish, and back in Finland Swedish. On average, the Central Swedish vowels exhibited a higher degree of formant movement than the vowels in the other two varieties. In the present study, regional variation in Swedish vowels was for the first time investigated using DCT coefficients. The results stress the importance of taking formant dynamics into account even in the analysis of nominal monophthongs.
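
For readers unfamiliar with the DCT parameterization mentioned above, the sketch below shows the generic idea of summarizing a formant track with its first few DCT coefficients (the coefficient count and the toy track are arbitrary choices, not the study's settings).

import numpy as np
from scipy.fft import dct

def dct_coefficients(formant_track_hz, n_coeffs=4):
    """Summarize a formant trajectory by its first DCT-II coefficients:
    coefficient 0 relates to the mean, higher ones to slope and curvature."""
    coeffs = dct(np.asarray(formant_track_hz, dtype=float), type=2, norm='ortho')
    return coeffs[:n_coeffs]

# toy F2 trajectory sampled at 10 points across a vowel
f2_track = np.linspace(2100, 2300, 10) + 20 * np.random.randn(10)
print(dct_coefficients(f2_track))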

Tue-O-5-8 : Speech Recognition for Learning
D8, 16:00–18:00, Tuesday, 22 Aug. 2017
Chairs: Tatsuya Kawahara, Martin Russell

Bidirectional LSTM-RNN for Improving Automated Assessment of Non-Native Children’s Speech

Yao Qian, Keelan Evanini, Xinhao Wang, Chong Min Lee, Matthew Mulholland; Educational Testing Service, USA
Tue-O-5-8-1, Time: 16:00–16:20

Recent advances in ASR and spoken language processing have led to improved systems for automated assessment of spoken language. However, it is still challenging for automated scoring systems to achieve high performance in terms of agreement with human experts when applied to non-native children’s spontaneous speech. The subpar performance is mainly caused by the relatively low recognition rate on non-native children’s speech. In this paper, we investigate different neural network architectures for improving non-native children’s speech recognition and the impact of the features extracted from the corresponding ASR output on the automated assessment of speaking proficiency. Experimental results show that a bidirectional LSTM-RNN can outperform a feed-forward DNN in ASR, with an overall relative WER reduction of 13.4%. The improved speech recognition can then boost the language proficiency assessment performance. Correlations between the rounded automated scores and expert scores range from 0.66 to 0.70 for the three speaking tasks studied, similar to the human-human agreement levels for these tasks.

Automatic Scoring of Shadowing Speech Based on DNN Posteriors and Their DTW

Junwei Yue 1, Fumiya Shiozawa 1, Shohei Toyama 1, Yutaka Yamauchi 2, Kayoko Ito 3, Daisuke Saito 1, Nobuaki Minematsu 1; 1University of Tokyo, Japan; 2Tokyo International University, Japan; 3Kyoto University, Japan
Tue-O-5-8-2, Time: 16:20–16:40

Shadowing has become a well-known method to improve learners’ overall proficiency. Our previous studies realized automatic scoring of shadowing speech using HMM phoneme posteriors, called GOP (Goodness of Pronunciation), and learners’ TOEIC scores were predicted adequately. In this study, we enhance our studies from multiple angles: 1) a much larger amount of shadowing speech is collected, 2) manual scoring of these utterances is done by two native teachers, 3) DNN posteriors are introduced instead of HMM ones, and 4) language-independent shadowing assessment based on posteriors-based DTW (Dynamic Time Warping) is examined. Experiments suggest that, compared to HMM, DNN can improve the teacher-machine correlation by as much as 0.37, and that DTW based on DNN posteriors shows a correlation as high as 0.74 even when posterior calculation is done using a different language from the target language of learning.
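
A generic sketch of DTW between two posteriorgrams (frame-wise phoneme posterior sequences), roughly the kind of alignment cost the abstract refers to; the frame distance and the toy data are assumptions, not the paper's exact setup.

import numpy as np

def dtw_distance(post_a, post_b):
    """Dynamic time warping cost between two posteriorgrams,
    each of shape (frames, n_phones), using Euclidean frame distance."""
    T, U = len(post_a), len(post_b)
    cost = np.full((T + 1, U + 1), np.inf)
    cost[0, 0] = 0.0
    for t in range(1, T + 1):
        for u in range(1, U + 1):
            d = np.linalg.norm(post_a[t - 1] - post_b[u - 1])
            cost[t, u] = d + min(cost[t - 1, u], cost[t, u - 1], cost[t - 1, u - 1])
    return cost[T, U] / (T + U)   # length-normalized alignment cost

# toy posteriorgrams: 120 and 140 frames over 40 phone classes
a = np.random.dirichlet(np.ones(40), size=120)
b = np.random.dirichlet(np.ones(40), size=140)
print(dtw_distance(a, b))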

Off-Topic Spoken Response Detection Using Siamese Convolutional Neural Networks

Chong Min Lee, Su-Youn Yoon, Xihao Wang, Matthew Mulholland, Ikkyu Choi, Keelan Evanini; Educational Testing Service, USA
Tue-O-5-8-3, Time: 16:40–17:00

In this study, we developed an off-topic response detection system to be used in the context of the automated scoring of non-native English speakers’ spontaneous speech. Based on transcriptions generated from an ASR system trained on non-native speakers’ speech and various semantic similarity features, the system classified each test response as an on-topic or off-topic response. The recent success of deep neural networks (DNN) in text similarity detection led us to explore DNN-based document similarity features. Specifically, we used a siamese adaptation of the convolutional network, due to its efficiency in learning similarity patterns simultaneously from both responses and questions used to elicit responses. In addition, a baseline system was developed using a standard vector space model (VSM) trained on sample responses for each question. The accuracy of the siamese CNN-based system was 0.97 and there was a 50% relative error reduction compared to the standard VSM-based system. Furthermore, the accuracy of the siamese CNN-based system was consistent across different questions.

Phonological Feature Based Mispronunciation Detection and Diagnosis Using Multi-Task DNNs and Active Learning

Vipul Arora 1, Aditi Lahiri 1, Henning Reetz 2; 1University of Oxford, UK; 2Goethe-Universität Frankfurt, Germany
Tue-O-5-8-4, Time: 17:00–17:20

This paper presents a phonological feature based computer-aided pronunciation training system for learners of a new language (L2). Phonological features allow analysing the learners’ mispronunciations systematically and rendering the feedback more effectively. The proposed acoustic model consists of a multi-task deep neural network, which uses a shared representation for estimating the phonological features and HMM state probabilities. Moreover, an active learning based scheme is proposed to efficiently deal with the cost of annotation, which is done by expert teachers, by selecting the most informative samples for annotation. Experimental evaluations are carried out for German and Italian native speakers speaking English. For mispronunciation detection, the proposed feature-based system outperforms the conventional GOP measure and classifier based methods, while providing more detailed diagnosis. Evaluations also demonstrate the advantage of active learning based sampling over random sampling.
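
A minimal sketch of a multi-task acoustic model with a shared trunk and two output heads (phonological features and HMM states), in the spirit of the architecture described above; layer sizes, counts and names are invented, not the paper's configuration.

import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    """Shared hidden layers feeding two heads: phonological features
    (multi-label) and HMM state posteriors (multi-class)."""
    def __init__(self, feat_dim=40, hidden=512, n_phon_feats=20, n_states=2000):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.phon_head = nn.Linear(hidden, n_phon_feats)   # train with BCEWithLogitsLoss
        self.state_head = nn.Linear(hidden, n_states)      # train with CrossEntropyLoss

    def forward(self, x):
        h = self.shared(x)
        return self.phon_head(h), self.state_head(h)

model = MultiTaskAcousticModel()
phon_logits, state_logits = model(torch.randn(16, 40))     # a batch of 16 frames
# the total training loss would be a weighted sum of the two task losses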

Detection of Mispronunciations and Disfluencies in Children Reading Aloud

Jorge Proença 1, Carla Lopes 1, Michael Tjalve 2, Andreas Stolcke 2, Sara Candeias 3, Fernando Perdigão 1; 1Instituto de Telecomunicações, Portugal; 2Microsoft, USA; 3Microsoft, Portugal
Tue-O-5-8-5, Time: 17:20–17:40

To automatically evaluate the performance of children reading aloud or to follow a child’s reading in reading tutor applications, different types of reading disfluencies and mispronunciations must be accounted for. In this work, we aim to detect most of these disfluencies in sentence and pseudoword reading. Detecting incorrectly pronounced words, and quantifying the quality of word pronunciations, is arguably the hardest task. We approach the challenge as a two-step process. First, a segmentation using task-specific lattices is performed, while detecting repetitions and false starts and providing candidate segments for words. Then, candidates are classified as mispronounced or not, using multiple features derived from likelihood ratios based on phone decoding and forced alignment, as well as additional meta-information about the word. Several classifiers were explored (linear fit, neural networks, support vector machines) and trained after a feature selection stage to avoid overfitting. Improved results are obtained using feature combination compared to using only the log likelihood ratio of the reference word (22% versus 27% miss rate at a constant 5% false alarm rate).

Automatic Assessment of Non-Native Prosody by Measuring Distances on Prosodic Label Sequences

David Escudero-Mancebo 1, César González-Ferreras 1, Lourdes Aguilar 2, Eva Estebas-Vilaplana 3; 1Universidad de Valladolid, Spain; 2Universidad Autónoma de Barcelona, Spain; 3UNED, Spain
Tue-O-5-8-6, Time: 17:40–18:00

The aim of this paper is to investigate how automatic prosodic labeling systems contribute to the evaluation of non-native pronunciation. In particular, it examines the efficiency of a group of metrics to evaluate the prosodic competence of non-native speakers, based on the information provided by sequences of labels in the analysis of both native and non-native speech. A group of Sp_ToBI labels was obtained by means of an automatic labeling system for the speech of native and non-native speakers who read the same texts. The metrics assessed the differences in the prosodic labels for both speech samples. The results showed the efficiency of the metrics in setting apart both groups of speakers. Furthermore, they exhibited how non-native speakers (American and Japanese speakers) improved their Spanish productions after doing a set of listening and repeating activities. Finally, this study also shows that the results provided by the metrics are correlated with the scores given by human evaluators on the productions of the different speakers.
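
One simple distance over label sequences of the kind discussed above is a plain Levenshtein (edit) distance; the sketch below is a generic illustration with made-up Sp_ToBI-like labels, not the paper's actual metric set.

def edit_distance(labels_a, labels_b):
    """Levenshtein distance between two prosodic label sequences."""
    m, n = len(labels_a), len(labels_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if labels_a[i - 1] == labels_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[m][n]

native = ["L*+H", "L-", "H*", "L%"]
non_native = ["H*", "L-", "H*", "H%"]
print(edit_distance(native, non_native))   # 2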

Tue-O-5-10 : Stance, Credibility, and Deception
E10, 16:00–18:00, Tuesday, 22 Aug. 2017
Chairs: Julien Epps, Carlos Busso

Inferring Stance from Prosody

Nigel G. Ward 1, Jason C. Carlson 1, Olac Fuentes 1, Diego Castan 2, Elizabeth E. Shriberg 2, Andreas Tsiartas 2; 1University of Texas at El Paso, USA; 2SRI International, USA
Tue-O-5-10-1, Time: 16:00–16:20

Speech conveys many things beyond content, including aspects of stance and attitude that have not been much studied. Considering 14 aspects of stance as they occur in radio news stories, we investigated the extent to which they could be inferred from prosody. By using time-spread prosodic features and by aggregating local estimates, many aspects of stance were at least somewhat predictable, with results significantly better than chance for many stance aspects, including, across English, Mandarin and Turkish, good, typical, local, background, new information, and relevant to a large group.

Exploring Dynamic Measures of Stance in Spoken Interaction

Gina-Anne Levow, Richard A. Wright; University of Washington, USA
Tue-O-5-10-2, Time: 16:20–16:40

Stance-taking, the expression of opinions or attitudes, informs the process of negotiation, argumentation, and decision-making. While receiving significant attention in text materials in work on the related areas of subjectivity and sentiment analysis, the expression of stance in speech remains less explored. Prior analysis of the acoustics of stance expression in conversational speech has identified some significant differences across dimensions of stance-related behavior. However, that analysis, as in much prior work, relied on simple functionals of pitch, energy, and duration, including maxima, minima, means, and ranges. In contrast, the current work focuses on exploiting measures that capture the dynamics of the pitch and energy contour. We employ features based on subband autocorrelation measures of pitch change and variants of the modulation spectrum. Using a corpus of conversational speech manually annotated for dimensions of stance-taking, we demonstrate that these measures of pitch and energy dynamics can help to characterize and distinguish among stance-related behaviors in speech.

Opinion Dynamics Modeling for Movie Review Transcripts Classification with Hidden Conditional Random Fields

Valentin Barriere, Chloé Clavel, Slim Essid; LTCI, France
Tue-O-5-10-3, Time: 16:40–17:00

In this paper, the main goal is to detect a movie reviewer’s opinion using hidden conditional random fields. This model allows us to capture the dynamics of the reviewer’s opinion in the transcripts of long unsegmented audio reviews that are analyzed by our system. High-level linguistic features are computed at the level of inter-pausal segments. The features include syntactic features, a statistical word embedding model and subjectivity lexicons. The proposed system is evaluated on the ICT-MMMO corpus. We obtain an F1-score of 82%, which is better than logistic regression and recurrent neural network approaches. We also offer a discussion that sheds some light on the capacity of our system to adapt the word embedding model, learned from general written text data, to spoken movie reviews and thus to model the dynamics of the opinion.

Transfer Learning Between Concepts for Human Behavior Modeling: An Application to Sincerity and Deception Prediction

Qinyi Luo 1, Rahul Gupta 2, Shrikanth S. Narayanan 2; 1Tsinghua University, China; 2University of Southern California, USA
Tue-O-5-10-4, Time: 17:00–17:20

Transfer learning (TL) involves leveraging information from sources outside the domain at hand for enhancing model performance. Popular TL methods either directly use the data or adapt the models learned on out-of-domain resources and incorporate them within in-domain models. TL methods have shown promise in several applications such as text classification, cross-domain language classification and emotion recognition. In this paper, we apply TL methods to computational human behavioral trait modeling. Many behavioral traits are abstract constructs (e.g., the sincerity of an individual) and are often conceptually related to other constructs (e.g., level of deception), making TL methods an attractive option for their modeling. We consider the problem of automatically predicting human sincerity and deception from behavioral data while leveraging transfer of knowledge from each other. We compare our methods against baseline models trained only on in-domain data. Our best models achieve an Unweighted Average Recall (UAR) of 72.02% in classifying deception (baseline: 69.64%). Similarly, the applied methods achieve Spearman’s/Pearson’s correlation values of 49.37%/48.52% between true and predicted sincerity scores (baseline: 46.51%/41.58%), indicating the success and the potential of TL for such human behavior tasks.

The Sound of Deception — What Makes a Speaker Credible?

Anne Schröder, Simon Stone, Peter Birkholz; Technische Universität Dresden, Germany
Tue-O-5-10-5, Time: 17:20–17:40

The detection of deception in human speech is a difficult task but can be performed above chance level by human listeners even when only audio data is provided. Still, it remains highly contested which speech features could be used to help identify lies. In this study, we examined a set of phonetic and paralinguistic cues and their influence on the credibility of speech using an analysis-by-synthesis approach. 33 linguistically neutral utterances with different manipulated cues (unfilled pauses, phonation type, higher speech rate, tremolo and raised F0) were synthesized using articulatory synthesis. These utterances were presented to 50 subjects who were asked to choose the more credible utterance. From those choices, a credibility score was calculated for each cue. The results show a significant increase in credibility when a tremolo is inserted or the breathiness is increased, and a decrease in credibility when a pause is inserted or the F0 is raised. Other cues also had a significant, but less pronounced, influence on credibility, while some only showed trends. In summary, the study showed that the credibility of a factually unverifiable utterance is in part controlled by the presented paralinguistic cues.

Hybrid Acoustic-Lexical Deep Learning Approach for Deception Detection

Gideon Mendels, Sarah Ita Levitan, Kai-Zhan Lee, Julia Hirschberg; Columbia University, USA
Tue-O-5-10-6, Time: 17:40–18:00

Automatic deception detection is an important problem with far-reaching implications for many disciplines. We present a series of experiments aimed at automatically detecting deception from speech. We use the Columbia X-Cultural Deception (CXD) Corpus, a large-scale corpus of within-subject deceptive and non-deceptive speech, for training and evaluating our models. We compare the use of spectral, acoustic-prosodic, and lexical feature sets, using different machine learning models. Finally, we design a single hybrid deep model with both acoustic and lexical features trained jointly that achieves state-of-the-art results on the CXD corpus.

Tue-P-3-1 : Short Utterances Speaker Recognition
Poster 1, 10:00–12:00, Tuesday, 22 Aug. 2017
Chair: Anthony Larcher

A Generative Model for Score Normalization in Speaker Recognition

Albert Swart, Niko Brümmer; Nuance Communications, South Africa
Tue-P-3-1-1, Time: 10:00–12:00

We propose a theoretical framework for thinking about score normalization, which confirms that normalization is not needed under (admittedly fragile) ideal conditions. If, however, these conditions are not met, e.g. under data-set shift between training and runtime, our theory reveals dependencies between scores that could be exploited by strategies such as score normalization. Indeed, it has been demonstrated over and over experimentally that various ad hoc score normalization recipes do work. We present a first attempt at using probability theory to design a generative score-space normalization model, which gives improvements similar to ZT-norm on the text-dependent RSR 2015 database.
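
For context, a classic ad hoc recipe of the kind alluded to above is Z-norm, which standardizes each trial score against a cohort of impostor scores for the same enrollment model; the snippet below is a generic illustration, not the generative model proposed in the paper.

import numpy as np

def z_norm(raw_score, cohort_scores):
    """Z-norm: standardize a trial score using impostor scores obtained
    by scoring the enrollment model against a cohort of other speakers."""
    mu = np.mean(cohort_scores)
    sigma = np.std(cohort_scores) + 1e-12
    return (raw_score - mu) / sigma

cohort = np.random.normal(loc=-2.0, scale=1.5, size=200)   # toy impostor cohort scores
print(z_norm(raw_score=1.3, cohort_scores=cohort))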

Content Normalization for Text-Dependent Speaker Verification

Subhadeep Dey, Srikanth Madikeri, Petr Motlicek, Marc Ferras; Idiap Research Institute, Switzerland
Tue-P-3-1-2, Time: 10:00–12:00

Subspace based techniques, such as i-vectors and Joint Factor Analysis (JFA), have been shown to provide state-of-the-art performance for fixed-phrase text-dependent speaker verification. However, the error rates of such systems on the random digit task of the RSR dataset are higher than those of the Gaussian Mixture Model-Universal Background Model (GMM-UBM). In this paper, we aim at improving the i-vector system by normalizing the content of the enrollment data to match the test data. We estimate i-vectors for each frame of a speech utterance (also called online i-vectors). The largest similarity scores across frames between enrollment and test are taken using these online i-vectors to obtain speaker verification scores. Experiments on Part 3 of the RSR corpus show that the proposed approach achieves a 12% relative improvement in equal error rate over a GMM-UBM based baseline system.


End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances

Chunlei Zhang 1, Kazuhito Koishida 2; 1University of Texas at Dallas, USA; 2Microsoft, USA
Tue-P-3-1-3, Time: 10:00–12:00

Text-independent speaker verification against short utterances is still challenging despite recent advances in the field of speaker recognition with the i-vector framework. In general, to get a robust i-vector representation, a satisfying amount of data is needed in the MAP adaptation step, which is hard to meet under a short duration constraint. To overcome this, we present an end-to-end system which directly learns a mapping from speech features to a compact fixed-length speaker-discriminative embedding where the Euclidean distance is employed for measuring similarity within trials. To learn the feature mapping, a modified Inception Net with residual blocks is proposed to optimize the triplet loss function. The input of our end-to-end system is a fixed-length spectrogram converted from an arbitrary-length utterance. Experiments show that our system consistently outperforms a conventional i-vector system on short duration speaker verification tasks. To test the limit under various duration conditions, we also demonstrate how our end-to-end system behaves with different durations from 2s to 4s.
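
As a reminder of what the triplet loss mentioned above optimizes (pull an anchor embedding toward a same-speaker positive and push it away from a different-speaker negative by a margin), here is a generic sketch; the margin value and the toy embeddings are arbitrary choices.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Euclidean-distance triplet loss on L2-normalized embeddings."""
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    d_pos = (anchor - positive).pow(2).sum(dim=-1)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

emb = lambda: torch.randn(32, 256)           # toy 256-dim embeddings, batch of 32
loss = triplet_loss(emb(), emb(), emb())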

Adversarial Network Bottleneck Features for Noise Robust Speaker Verification

Hong Yu 1, Zheng-Hua Tan 2, Zhanyu Ma 1, Jun Guo 1; 1BUPT, China; 2Aalborg University, Denmark
Tue-P-3-1-4, Time: 10:00–12:00

In this paper, we propose a noise robust bottleneck feature representation which is generated by an adversarial network (AN). The AN includes two cascade-connected networks, an encoding network (EN) and a discriminative network (DN). Mel-frequency cepstral coefficients (MFCCs) of clean and noisy speech are used as input to the EN, and the output of the EN is used as the noise robust feature. The EN and DN are trained in turn: when training the DN, noise types are selected as the training labels, and when training the EN, all labels are set to the same value, i.e., the clean speech label, which aims to make the AN features invariant to noise and thus achieve noise robustness. We evaluate the performance of the proposed feature on a Gaussian Mixture Model-Universal Background Model based speaker verification system, and compare it with MFCC features of speech enhanced by short-time spectral amplitude minimum mean square error (STSA-MMSE) and deep neural network-based speech enhancement (DNN-SE) methods. Experimental results on the RSR2015 database show that the proposed AN bottleneck feature (AN-BN) dramatically outperforms the STSA-MMSE and DNN-SE based MFCCs for different noise types and signal-to-noise ratios. Furthermore, the AN-BN feature is able to improve speaker verification performance under the clean condition.

What Does the Speaker Embedding Encode?

Shuai Wang, Yanmin Qian, Kai Yu; Shanghai Jiao Tong University, China
Tue-P-3-1-5, Time: 10:00–12:00

Developing a good speaker embedding has received tremendous interest in the speech community. Speaker representations such as the i-vector and d-vector have shown their superiority in speaker recognition, speaker adaptation and other related tasks. However, not much is known about which properties are exactly encoded in these speaker embeddings. In this work, we make an in-depth investigation of three kinds of speaker embeddings, i.e. the i-vector, d-vector and RNN/LSTM based sequence-vector (s-vector). Classification tasks are carefully designed to facilitate better understanding of these encoded speaker representations. Their abilities to encode different properties are revealed and compared, such as speaker identity, gender, speaking rate, text content and channel information. Moreover, a new architecture is proposed to integrate different speaker embeddings, so that their advantages can be combined. The new advanced speaker embedding (i-s-vector) outperforms the others, and shows a more than 50% EER reduction compared to the i-vector baseline on the RSR2015 content mismatch trials.
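
The "carefully designed classification tasks" above are essentially probing classifiers: train a simple classifier on the embeddings to predict a property such as gender or text content and see how well it does. A generic sketch of that probing setup follows (toy data and a scikit-learn logistic regression; not the paper's classifiers).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# toy data: 1000 utterance embeddings (e.g. i-vectors) with a property label
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 400))
gender = rng.integers(0, 2, size=1000)          # property to probe for

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, gender, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# high held-out accuracy would suggest the property is encoded in the embedding
print("probing accuracy:", probe.score(X_te, y_te))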

Incorporating Local Acoustic Variability Information into Short Duration Speaker Verification

Jianbo Ma 1, Vidhyasaharan Sethu 1, Eliathamby Ambikairajah 1, Kong Aik Lee 2; 1University of New South Wales, Australia; 2A*STAR, Singapore
Tue-P-3-1-6, Time: 10:00–12:00

State-of-the-art speaker verification systems are based on the total variability model to compactly represent the acoustic space. However, short duration utterances only contain limited phonetic content, potentially resulting in an incomplete representation being captured by the total variability model, thus leading to poor speaker verification performance. In this paper, a technique to incorporate component-wise local acoustic variability information into the speaker verification framework is proposed. Specifically, Gaussian Probabilistic Linear Discriminant Analysis (G-PLDA) of the supervector space, with a block-diagonal covariance assumption, is used in conjunction with the traditional total variability model. Experimental results obtained using the NIST SRE 2010 dataset show that incorporating the proposed method leads to relative improvements of 20.48% and 18.99% in the 3-second condition for male and female speech respectively.

DNN i-Vector Speaker Verification with Short, Text-Constrained Test Utterances

Jinghua Zhong 1, Wenping Hu 2, Frank K. Soong 2, Helen Meng 1; 1Chinese University of Hong Kong, China; 2Microsoft, China
Tue-P-3-1-7, Time: 10:00–12:00

We investigate how to improve the performance of DNN i-vector based speaker verification for short, text-constrained test utterances, e.g. connected digit strings. A text-constrained verification, due to its smaller, limited vocabulary, can deliver better performance than a text-independent one for a short utterance. We study the capability of a “phonetically aware” Deep Neural Net (DNN) to perform “stochastic phonetic alignment” for constructing supervectors and estimating the corresponding i-vectors on two speech databases: a large vocabulary, conversational, speaker independent database (Fisher) and a small vocabulary, continuous digit database (RSR2015 Part III). The phonetic alignment efficiency and the resultant speaker verification performance are compared across differently sized senone sets which characterize the phonetic pronunciations of utterances in the two databases. Performance on the RSR2015 Part III evaluation shows a relative improvement of EER, i.e., 7.89% for male speakers and 3.54% for female speakers, with only digit-related senones. DNN bottleneck features were also studied to investigate their capability of extracting phonetically sensitive information, which is useful for text-independent or text-constrained speaker verification. We found that by combining MFCCs with bottleneck features in tandem, EERs can be further reduced.


Time-Varying Autoregressions for Speaker Verification in Reverberant Conditions

Ville Vestman 1, Dhananjaya Gowda 2, Md. Sahidullah 1, Paavo Alku 3, Tomi Kinnunen 1; 1University of Eastern Finland, Finland; 2Samsung Electronics, Korea; 3Aalto University, Finland
Tue-P-3-1-8, Time: 10:00–12:00

In poor room acoustic conditions, speech signals received by a microphone might become corrupted by delayed versions of themselves reflected from the room surfaces (e.g. walls, floor). This phenomenon, reverberation, reduces the accuracy of automatic speaker verification systems by causing mismatch between training and testing. Since reverberation causes temporal smearing of the signal, one way to tackle its effects is to study robust feature extraction, particularly long-time temporal feature extraction. This approach has previously been adopted in the form of the 2-dimensional autoregressive (2DAR) feature extraction scheme, which uses frequency domain linear prediction (FDLP). In 2DAR, FDLP processing is followed by time domain linear prediction (TDLP). In the current study, we propose modifying the latter part of the 2DAR feature extraction scheme by replacing TDLP with time-varying linear prediction (TVLP) to add an extra layer of temporal processing. Our speaker verification experiments using the proposed features on the text-dependent RedDots corpus show small but consistent improvements in clean and reverberant conditions (up to 6.5%) over the 2DAR features, and large improvements over MFCC features in reverberant conditions (up to 46.5%).

Deep Speaker Embeddings for Short-Duration Speaker Verification

Gautam Bhattacharya 1, Jahangir Alam 2, Patrick Kenny 2; 1McGill University, Canada; 2CRIM, Canada
Tue-P-3-1-9, Time: 10:00–12:00

The performance of a state-of-the-art speaker verification system is severely degraded when it is presented with trial recordings of short duration. In this work we propose to use deep neural networks to learn short-duration speaker embeddings. We focus on the 5s-5s condition, wherein both sides of a verification trial are 5 seconds long. In our previous work we established that learning a non-linear mapping from i-vectors to speaker labels is beneficial for speaker verification [1]. In this work we take the idea of learning a speaker classifier one step further — we apply deep neural networks directly to time-frequency speech representations. We propose two feed-forward network architectures for this task. Our best model is based on a deep convolutional architecture wherein recordings are treated as images. From our experimental findings we advocate treating utterances as images or ‘speaker snapshots’, much like in face recognition. Our convolutional speaker embeddings perform significantly better than i-vectors when scoring is done using cosine distance, where the relative improvement is 23.5%. The proposed deep embeddings combined with cosine distance also outperform a state-of-the-art i-vector verification system by 1%, providing further empirical evidence in favor of our learned speaker features.
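
Cosine scoring of a verification trial, as used above, is simply the normalized dot product between the enrollment and test embeddings compared to a threshold; the snippet below is a generic illustration with toy vectors and an arbitrary threshold.

import numpy as np

def cosine_score(enroll_embedding, test_embedding):
    """Cosine similarity between two speaker embeddings."""
    a = enroll_embedding / np.linalg.norm(enroll_embedding)
    b = test_embedding / np.linalg.norm(test_embedding)
    return float(np.dot(a, b))

rng = np.random.default_rng(0)
enroll, test = rng.normal(size=512), rng.normal(size=512)
accept = cosine_score(enroll, test) > 0.5      # threshold would be tuned on dev data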

Using Voice Quality Features to Improve Short-Utterance, Text-Independent Speaker Verification Systems

Soo Jin Park, Gary Yeung, Jody Kreiman, Patricia A. Keating, Abeer Alwan; University of California at Los Angeles, USA
Tue-P-3-1-10, Time: 10:00–12:00

Due to within-speaker variability in phonetic content and/or speaking style, the performance of automatic speaker verification (ASV) systems degrades especially when the enrollment and test utterances are short. This study examines how different types of variability influence the performance of ASV systems. Speech samples (< 2 sec) from the UCLA Speaker Variability Database containing 5 different read sentences by 200 speakers were used to study content variability. Other samples (about 5 sec) that contained speech directed towards pets, characterized by exaggerated prosody, were used to analyze style variability. Using the i-vector/PLDA framework, the ASV system error rate with MFCCs had a relative increase of at least 265% and 730% in content-mismatched and style-mismatched trials, respectively. A set of features that represents voice quality (F0, F1, F2, F3, H1-H2, H2-H4, H4-H2k, A1, A2, A3, and CPP) was also used. Using score fusion with MFCCs, all conditions saw decreases in error rates. In addition, using the NIST SRE10 database, score fusion provided relative improvements of 11.78% for 5-second utterances, 12.41% for 10-second utterances, and a small improvement for long utterances (about 5 min). These results suggest that voice quality features can improve short-utterance, text-independent ASV system performance.

Gain Compensation for Fast i-Vector Extraction Over Short Duration

Kong Aik Lee 1, Haizhou Li 2; 1A*STAR, Singapore; 2NUS, Singapore
Tue-P-3-1-11, Time: 10:00–12:00

The i-vector is widely described as a compact and effective representation of speech utterances for speaker recognition. Standard i-vector extraction can be an expensive task for applications where computing resources are limited, for instance, on handheld devices. Fast approximate inference of i-vectors aims to reduce the computational cost of i-vector extraction where the run-time requirement is critical. Most fast approaches hinge on certain assumptions to approximate the i-vector inference formulae with little loss of accuracy. In this paper, we analyze the uniform assumption that we had proposed earlier. We show that the assumption generally holds for long utterances but is inadequate for utterances of short duration. We then propose to compensate for the negative effects by applying a simple gain factor to the i-vectors estimated from short utterances. The assertion is confirmed through analysis and experiments conducted on the NIST SRE’08 and SRE’10 datasets.

Joint Training of Expanded End-to-End DNN for Text-Dependent Speaker Verification

Hee-soo Heo, Jee-weon Jung, IL-ho Yang, Sung-hyun Yoon, Ha-jin Yu; University of Seoul, Korea
Tue-P-3-1-12, Time: 10:00–12:00

We propose an expanded end-to-end DNN architecture for speaker verification based on b-vectors as well as d-vectors. We embedded the components of a speaker verification system, such as modeling frame-level features, extracting utterance-level features, dimensionality reduction of utterance-level features, and trial-level scoring, in an expanded end-to-end DNN architecture. The main contribution of this paper is that, instead of using DNNs as parts of the system trained independently, we train the whole system jointly with a fine-tune cost after pre-training each part. The experimental results show that the proposed system outperforms the baseline d-vector system and the i-vector PLDA system.


Tue-P-3-2 : Speaker Characterization and Recognition
Poster 2, 10:00–12:00, Tuesday, 22 Aug. 2017
Chair: Michael Wagner

Speaker Verification via Estimating Total Variability Space Using Probabilistic Partial Least Squares

Chen Chen, Jiqing Han, Yilin Pan; Harbin Institute of Technology, China
Tue-P-3-2-1, Time: 10:00–12:00

The i-vector framework is one of the most popular methods in speaker verification, and estimating a total variability space (TVS) is a key part of the i-vector framework. Current estimation methods pay little attention to the discrimination of the TVS, yet this discrimination strongly influences performance. We therefore focus on the discrimination of the TVS to achieve better performance. In this paper, a discriminative method for estimating the TVS based on probabilistic partial least squares (PPLS) is proposed. In this method, discrimination is improved by using prior information (speaker labels), so that both intra-class correlation and inter-class discrimination are fully exploited. The method also introduces a probabilistic view of the partial least squares (PLS) method to overcome its high computational complexity and its inability to perform channel compensation. The proposed method achieves better performance than both the traditional TVS estimation method and the PLS-based method.

Deep Speaker Feature Learning for Text-Independent Speaker Verification

Lantian Li, Yixiang Chen, Ying Shi, Zhiyuan Tang, Dong Wang; Tsinghua University, China
Tue-P-3-2-2, Time: 10:00–12:00

Recently, deep neural networks (DNNs) have been used to learn speaker features. However, the quality of the learned features is not sufficiently good, so a complex back-end model, either neural or probabilistic, has to be used to address the residual uncertainty when applied to speaker verification. This paper presents a convolutional time-delay deep neural network structure (CT-DNN) for speaker feature learning. Our experimental results on the Fisher database demonstrated that this CT-DNN can produce high-quality speaker features: even with a single feature (0.3 seconds including the context), the EER can be as low as 7.68%. This effectively confirmed that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and therefore can be extracted from just dozens of frames.

Duration Mismatch Compensation UsingFour-Covariance Model and Deep Neural Network forSpeaker Verification

Pierre-Michel Bousquet, Mickael Rouvier; LIA (EA 4128),FranceTue-P-3-2-3, Time: 10:00–12:00

Duration mismatch between enrollment and test utterances still remains a major concern for the reliability of real-life speaker recognition applications. Two approaches are proposed here to deal with this case when using the i-vector representation. The first one is an adaptation of Gaussian Probabilistic Linear Discriminant Analysis (PLDA) modeling, which can be extended to the case of any shift between i-vectors drawn from two distinct distributions. The second one attempts to map i-vectors of truncated segments of an utterance to the i-vector of the full segment, by the use of deep neural networks (DNN). Our results show that both new approaches outperform the standard PLDA by about 10% relative, noting that these back-end methods could complement those quantifying the i-vector uncertainty during its extraction process, in the case of a duration gap.

Extended Variability Modeling and Unsupervised Adaptation for PLDA Speaker Recognition

Alan McCree, Gregory Sell, Daniel Garcia-Romero; Johns Hopkins University, USA
Tue-P-3-2-4, Time: 10:00–12:00

Probabilistic Linear Discriminant Analysis (PLDA) continues to be the most effective approach for speaker recognition in the i-vector space. This paper extends the PLDA model to include both enrollment and test cut duration as well as to distinguish between session and channel variability. In addition, we address the task of unsupervised adaptation to unknown new domains in two ways: speaker-dependent PLDA parameters and cohort score normalization using Bayes rule. Experimental results on the NIST SRE16 task show that these principled techniques provide state-of-the-art performance with negligible increase in complexity over a PLDA baseline.

Improving the Effectiveness of Speaker Verification Domain Adaptation with Inadequate In-Domain Data

Bengt J. Borgström 1, Elliot Singer 1, Douglas Reynolds 1, Seyed Omid Sadjadi 2; 1MIT Lincoln Laboratory, USA; 2NIST, USA
Tue-P-3-2-5, Time: 10:00–12:00

This paper addresses speaker verification domain adaptation with inadequate in-domain data. Specifically, we explore the cases where in-domain data sets do not include speaker labels, contain speakers with few samples, or contain speakers with low channel diversity. Existing domain adaptation methods are reviewed, and their shortcomings are discussed. We derive an unsupervised version of fully Bayesian adaptation which reduces the reliance on rich in-domain data. When applied to domain adaptation with inadequate in-domain data, the proposed approach yields competitive results when the samples per speaker are reduced, and outperforms existing supervised methods when the channel diversity is low, even without requiring speaker labels. These results are validated on the NIST SRE16, which uses a highly inadequate in-domain data set.

i-Vector DNN Scoring and Calibration for Noise Robust Speaker Verification

Zhili Tan, Man-Wai Mak; Hong Kong Polytechnic University, China
Tue-P-3-2-6, Time: 10:00–12:00

This paper proposes applying multi-task learning to train deep neural networks (DNNs) for calibrating the PLDA scores of speaker verification systems under noisy environments. To facilitate the DNNs to learn the main task (calibration), several auxiliary tasks were introduced, including the prediction of SNR and duration from i-vectors and classifying whether an i-vector pair belongs to the same speaker or not. The possibility of replacing the PLDA model by a DNN during the scoring stage is also explored. Evaluations on noise-contaminated speech suggest that the auxiliary tasks are important for the DNNs to learn the main calibration task and that the uncalibrated PLDA scores are an essential input to the DNNs. Without this input, the DNNs can only predict the score shifts accurately, suggesting that the PLDA model is indispensable.


Analysis of Score Normalization in Multilingual Speaker Recognition

Pavel Matejka, Ondrej Novotný, Oldrich Plchot, Lukáš Burget, Mireia Diez Sánchez, Jan Cernocký; Brno University of Technology, Czech Republic
Tue-P-3-2-7, Time: 10:00–12:00

NIST Speaker Recognition Evaluation 2016 has revealed the importance of score normalization for mismatched data conditions. This paper analyzes several score normalization techniques for test conditions with multiple languages. The best performing one for a PLDA classifier is an adaptive s-norm, with a 30% relative improvement over the system without any score normalization. The analysis shows that the adaptive score normalization (using the top-scoring files per trial) selects cohorts that in 68% of cases contain recordings from the same language, and in 92% of cases from the same gender, as the enrollment and test recordings. Our results suggest that the data used to select score normalization cohorts should be a pool of several languages and channels and, if possible, a subset of it should contain data from the target domain.
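
For reference, a minimal NumPy sketch of adaptive s-norm in its commonly used form: the cohort statistics are taken over the top-scoring cohort files for each side of the trial. The function and variable names are illustrative, and the cohort size is not the one used in the paper.

```python
import numpy as np

def adaptive_s_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_n=200):
    """Adaptive symmetric score normalization (s-norm) sketch.

    raw_score            : trial score s(enroll, test)
    enroll_cohort_scores : scores of the enrollment model against a cohort
    test_cohort_scores   : scores of the test segment against the same cohort
    top_n                : number of top-scoring cohort files used per side
    """
    # Keep only the top-scoring cohort entries for each side of the trial.
    e_top = np.sort(enroll_cohort_scores)[-top_n:]
    t_top = np.sort(test_cohort_scores)[-top_n:]

    # Symmetric normalization: average of the two z-normalized scores.
    z_e = (raw_score - e_top.mean()) / (e_top.std() + 1e-12)
    z_t = (raw_score - t_top.mean()) / (t_top.std() + 1e-12)
    return 0.5 * (z_e + z_t)

# Example with random cohort scores.
rng = np.random.default_rng(0)
print(adaptive_s_norm(2.3, rng.normal(size=1000), rng.normal(size=1000)))
```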

Alternative Approaches to Neural Network Based Speaker Verification

Anna Silnova, Lukáš Burget, Jan Cernocký; Brno University of Technology, Czech Republic
Tue-P-3-2-8, Time: 10:00–12:00

Just like in other areas of automatic speech processing, feature extraction based on bottleneck neural networks was recently found very effective for the speaker verification task. However, better results are usually reported with more complex neural network architectures (e.g. stacked bottlenecks), which are difficult to reproduce. In this work, we experiment with the so-called deep features, which are based on a simple feed-forward neural network architecture. We study various forms of applying deep features to i-vector/PLDA based speaker verification. With proper settings, better verification performance can be obtained by means of this simple architecture as compared to the more elaborate bottleneck features. Also, we further experiment with multi-task training, where the neural network is trained for both speaker recognition and senone recognition objectives. Results indicate that, with a careful weighting of the two objectives, multi-task training can result in significantly better performing deep features.

A Distribution Free Formulation of the Total Variability Model

Ruchir Travadi, Shrikanth S. Narayanan; University of Southern California, USA
Tue-P-3-2-9, Time: 10:00–12:00

The Total Variability Model (TVM) [1] has been widely used in audio signal processing as a framework for capturing differences in feature space distributions across variable length sequences by mapping them into a fixed-dimensional representation. Its formulation requires making an assumption about the source data distribution being a Gaussian Mixture Model (GMM). In this paper, we show that it is possible to arrive at the same model formulation without requiring such an assumption about the distribution of the data, by showing asymptotic normality of the statistics used to estimate the model. We highlight some connections between TVM and heteroscedastic Principal Component Analysis (PCA), as well as the matrix completion problem, which lead to a computationally efficient formulation of the Maximum Likelihood estimation problem for the model.

Domain Mismatch Modeling of Out-Domain i-Vectors for PLDA Speaker Verification

Md. Hafizur Rahman, Ivan Himawan, David Dean, Sridha Sridharan; Queensland University of Technology, Australia
Tue-P-3-2-10, Time: 10:00–12:00

State-of-the-art i-vector based probabilistic linear discriminant analysis (PLDA) trained on non-target (or out-domain) data suffers a significant loss in speaker verification performance due to the domain mismatch between training and evaluation data. To improve speaker verification performance, a sufficient amount of domain-mismatch-compensated out-domain data must be used to train the PLDA models successfully. In this paper, we propose a domain mismatch modeling (DMM) technique using maximum-a-posteriori (MAP) estimation to model and compensate the domain variability from the out-domain training i-vectors. From our experimental results, we found that the DMM technique can achieve at least a 24% improvement in EER over an out-domain-only baseline when speaker labels are available. A further improvement of 3% is obtained when combining DMM with the domain-invariant covariance normalization (DICN) approach. The combined DMM/DICN technique is shown to perform better than an in-domain PLDA system with only 200 labeled speakers or 2,000 unlabeled i-vectors.

Tue-P-4-1 : Acoustic Models for ASR 1
Poster 1, 13:30–15:30, Tuesday, 22 Aug. 2017
Chair: Michiel Bacchiani

An Exploration of Dropout with LSTMs

Gaofeng Cheng 1, Vijayaditya Peddinti 2, Daniel Povey 2, Vimal Manohar 2, Sanjeev Khudanpur 2, Yonghong Yan 1; 1Chinese Academy of Sciences, China; 2Johns Hopkins University, USA
Tue-P-4-1-1, Time: 13:30–15:30

Long Short-Term Memory networks (LSTMs) are a component of many state-of-the-art DNN-based speech recognition systems. Dropout is a popular method to improve generalization in DNN training. In this paper we describe extensive experiments in which we investigated the best way to combine dropout with LSTMs — specifically, projected LSTMs (LSTMP). We investigated various locations in the LSTM to place the dropout (and various combinations of locations), and a variety of dropout schedules. Our optimized recipe gives consistent improvements in WER across a range of datasets, including Switchboard, TED-LIUM and AMI.

Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition

Jaeyoung Kim, Mostafa El-Khamy, Jungwon Lee; Samsung Semiconductor, USA
Tue-P-4-1-2, Time: 13:30–15:30

In this paper, a novel architecture for a deep recurrent neural network, residual LSTM, is introduced. A plain LSTM has an internal memory cell that can learn long term dependencies of sequential data. It also provides a temporal shortcut path to avoid vanishing or exploding gradients in the temporal domain. The residual LSTM provides an additional spatial shortcut path from lower layers for efficient training of deep networks with multiple LSTM layers. Compared with the previous work, highway LSTM, residual LSTM separates the spatial shortcut path from the temporal one by using output layers, which can help to avoid a conflict between spatial and temporal-domain gradient flows. Furthermore, residual LSTM reuses the output projection matrix and the output gate of the LSTM to control the spatial information flow instead of additional gate networks, which effectively reduces network parameters by more than 10%. An experiment on distant speech recognition on the AMI SDM corpus shows that 10-layer plain and highway LSTM networks presented 13.7% and 6.2% increases in WER over 3-layer baselines, respectively. On the contrary, 10-layer residual LSTM networks provided the lowest WER of 41.0%, which corresponds to 3.3% and 2.8% WER reductions over the plain and highway LSTM networks, respectively.

Unfolded Deep Recurrent Convolutional Neural Network with Jump Ahead Connections for Acoustic Modeling

Dung T. Tran, Marc Delcroix, Shigeki Karita, Michael Hentschel, Atsunori Ogawa, Tomohiro Nakatani; NTT, Japan
Tue-P-4-1-3, Time: 13:30–15:30

Recurrent neural networks (RNNs) with jump ahead connections have been used in computer vision tasks, but they have not been investigated well for automatic speech recognition (ASR) tasks. On the other hand, the unfolded RNN has been shown to be an effective model for acoustic modeling tasks. This paper investigates how to elaborate a sophisticated unfolded deep RNN architecture in which the recurrent connections use a convolutional neural network (CNN) to model a short-term dependence between hidden states. In this study, our unfolded RNN architecture is a CNN that processes a sequence of input features sequentially. At each time step, the CNN takes a small block of the input features and the output of the hidden layer from the preceding block in order to compute the output of its hidden layer. In addition, by exploiting either one or multiple jump ahead connections between time steps, our network can learn long-term dependencies more effectively. We carried out experiments on the CHiME 3 task showing the effectiveness of our proposed approach.

Forward-Backward Convolutional LSTM for Acoustic Modeling

Shigeki Karita, Atsunori Ogawa, Marc Delcroix, Tomohiro Nakatani; NTT, Japan
Tue-P-4-1-4, Time: 13:30–15:30

Automatic speech recognition (ASR) performance has greatly improved with the introduction of convolutional neural networks (CNN) or long short-term memory (LSTM) for acoustic modeling. Recently, a convolutional LSTM (CLSTM) has been proposed to directly use convolution operations within the LSTM blocks and combine the advantages of both CNN and LSTM structures into a single architecture. This paper presents the first attempt to use CLSTMs for acoustic modeling. In addition, we propose a new forward-backward architecture to exploit long-term left/right context efficiently. The proposed scheme combines forward and backward LSTMs at different time points of an utterance with the aim of modeling long-term frame-invariant information such as speaker characteristics, channel, etc. Furthermore, the proposed forward-backward architecture can be trained with truncated back-propagation-through-time, unlike conventional bidirectional LSTM (BLSTM) architectures. Therefore, we are able to train deeply stacked CLSTM acoustic models, which is practically challenging with conventional BLSTMs. Experimental results show that both CLSTM and forward-backward LSTM improve word error rates significantly compared to standard CNN and LSTM architectures.

Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting

Sercan Ö. Arık, Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Chris Fougner, Ryan Prenger, Adam Coates; Baidu Research, USA
Tue-P-4-1-5, Time: 13:30–15:30

Keyword spotting (KWS) constitutes a major component of human-technology interfaces. Maximizing the detection accuracy at a low false alarm (FA) rate, while minimizing the footprint size, latency and complexity, are the goals for KWS. Towards achieving them, we study Convolutional Recurrent Neural Networks (CRNNs). Inspired by large-scale state-of-the-art speech recognition systems, we combine the strengths of convolutional layers and recurrent layers to exploit local structure and long-range context. We analyze the effect of architecture parameters, and propose training strategies to improve performance. With only ∼230k parameters, our CRNN model yields acceptably low latency, and achieves 97.71% accuracy at 0.5 FA/hour for 5 dB signal-to-noise ratio.
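
A minimal PyTorch sketch of the convolutional-recurrent idea (convolution for local time-frequency structure, recurrence for long-range context); the layer sizes and input shape here are illustrative and are not the authors' configuration.

```python
import torch
import torch.nn as nn

class SmallCRNN(nn.Module):
    """Toy convolutional-recurrent keyword spotter (illustrative sizes only)."""
    def __init__(self, n_mels=40, n_classes=2):
        super().__init__()
        # Convolution captures local time-frequency structure.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(20, 5), stride=(8, 2)),
            nn.ReLU(),
        )
        # Recurrent layer summarizes long-range temporal context.
        conv_freq = (n_mels - 20) // 8 + 1          # frequency bins after conv
        self.gru = nn.GRU(16 * conv_freq, 32, batch_first=True, bidirectional=True)
        self.out = nn.Linear(64, n_classes)

    def forward(self, x):                           # x: (batch, 1, n_mels, frames)
        h = self.conv(x)                            # (batch, 16, freq, time)
        h = h.permute(0, 3, 1, 2).flatten(2)        # (batch, time, 16 * freq)
        h, _ = self.gru(h)
        return self.out(h[:, -1])                   # classify from final time step

logits = SmallCRNN()(torch.randn(4, 1, 40, 101))    # 4 one-second log-mel examples
print(logits.shape)                                 # torch.Size([4, 2])
```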

Deep Activation Mixture Model for Speech Recognition

Chunyang Wu, Mark J.F. Gales; University of Cambridge, UK
Tue-P-4-1-6, Time: 13:30–15:30

Deep learning approaches achieve state-of-the-art performance in a range of applications, including speech recognition. However, the parameters of the deep neural network (DNN) are hard to interpret, which makes regularisation and adaptation to speaker or acoustic conditions challenging. This paper proposes the deep activation mixture model (DAMM) to address these problems. The output of one hidden layer is modelled as the sum of a mixture model and a residual model. The mixture model forms an activation function contour while the residual one models fluctuations around the contour. The use of the mixture model gives two advantages: first, it introduces a novel regularisation on the DNN; second, it allows novel adaptation schemes. The proposed approach is evaluated on a large-vocabulary U.S. English broadcast news task. It yields slightly better performance than the DNN baselines, and with utterance-level unsupervised adaptation, the adapted DAMM acquires further performance gains.

Ensembles of Multi-Scale VGG Acoustic Models

Michael Heck 1, Masayuki Suzuki 1, Takashi Fukuda 1, Gakuto Kurata 1, Satoshi Nakamura 2; 1IBM, Japan; 2NAIST, Japan
Tue-P-4-1-7, Time: 13:30–15:30

We present our work on constructing multi-scale deep convolutional neural networks for automatic speech recognition. Several VGG nets have been trained that differ solely in the kernel size of the convolutional layers. The general idea is that receptive fields of varying sizes match structures of different scales, thus supporting more robust recognition when combined appropriately. We construct a large multi-scale system by means of system combination. We use ROVER and the fusion of posterior predictions as examples of late combination, and knowledge distillation using soft labels from a model ensemble as a way of early combination. In this work, distillation is approached from the perspective of knowledge transfer pre-training, which is followed by fine-tuning on the original hard labels. Our results show that it is possible to bundle the individual recognition strengths of the VGGs in a much simpler CNN architecture that yields equal performance with the best late combination.


Training Context-Dependent DNN Acoustic Models Using Probabilistic Sampling

Tamás Grósz 1, Gábor Gosztolya 1, László Tóth 2; 1University of Szeged, Hungary; 2MTA-SZTE RGAI, Hungary
Tue-P-4-1-8, Time: 13:30–15:30

In current HMM/DNN speech recognition systems, the purpose of the DNN component is to estimate the posterior probabilities of tied triphone states. In most cases the distribution of these states is uneven, meaning that we have a markedly different number of training samples for the various states. This imbalance of the training data is a source of suboptimality for most machine learning algorithms, and DNNs are no exception. A straightforward solution is to re-sample the data, either by upsampling the rarer classes or by downsampling the more common classes. Here, we experiment with the so-called probabilistic sampling method that applies downsampling and upsampling at the same time. For this, it defines a new class distribution for the training data, which is a linear combination of the original and the uniform class distributions. As an extension to previous studies, we propose a new method to re-estimate the class priors, which is required to remedy the mismatch between the training and the test data distributions introduced by re-sampling. Using probabilistic sampling and the proposed modification we report 5% and 6% relative error rate reductions on the TED-LIUM and on the AMI corpora, respectively.
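
A minimal sketch of the interpolated sampling distribution described above (a linear combination of the original and uniform class distributions). The interpolation weight and the prior compensation shown are illustrative; the compensation here is the standard posterior-divided-by-prior rescaling, not necessarily the re-estimation method the authors propose.

```python
import numpy as np

def probabilistic_sampling_dist(class_counts, lam=0.5):
    """Sampling distribution: lam * uniform + (1 - lam) * original class priors."""
    counts = np.asarray(class_counts, dtype=float)
    original = counts / counts.sum()                 # empirical class distribution
    uniform = np.full_like(original, 1.0 / len(original))
    return lam * uniform + (1.0 - lam) * original

# Tied-state counts are typically very imbalanced.
counts = np.array([50000, 3000, 120, 7])
sampling = probabilistic_sampling_dist(counts, lam=0.7)

# At test time the shifted priors must be compensated: divide the DNN posteriors
# by the class priors actually seen during re-sampled training.
posteriors = np.array([0.80, 0.15, 0.04, 0.01])
scaled_likelihoods = posteriors / sampling
print(sampling, scaled_likelihoods / scaled_likelihoods.sum())
```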

A Comparative Evaluation of GMM-Free State Tying Methods for ASR

Tamás Grósz 1, Gábor Gosztolya 1, László Tóth 2; 1University of Szeged, Hungary; 2MTA-SZTE RGAI, Hungary
Tue-P-4-1-9, Time: 13:30–15:30

Deep neural network (DNN) based speech recognizers have recently replaced Gaussian mixture (GMM) based systems as the state-of-the-art. While some of the modeling techniques developed for the GMM based framework may directly be applied to HMM/DNN systems, others may be inappropriate. One such example is the creation of context-dependent tied states, for which an efficient decision tree state tying method exists. The tied states used to train DNNs are usually obtained using the same tying algorithm, even though it is based on likelihoods of Gaussians, hence it is more appropriate for HMM/GMMs. Recently, however, several refinements have been published which seek to adapt the state tying algorithm to the HMM/DNN hybrid architecture. Unfortunately, these studies reported results on different (and sometimes very small) datasets, which does not allow their direct comparison. Here, we tested four of these methods on the same LVCSR task, and compared their performance under the same circumstances. We found that, besides changing the input of the context-dependent state tying algorithm, it is worth adjusting the tying criterion as well. The methods which utilized a decision criterion designed directly for neural networks consistently, and significantly, outperformed those which employed the standard Gaussian-based algorithm.

Tue-P-4-2 : Acoustic Models for ASR 2
Poster 2, 13:30–15:30, Tuesday, 22 Aug. 2017
Chair: Karen Livescu

Backstitch: Counteracting Finite-Sample Bias via Negative Steps

Yiming Wang, Vijayaditya Peddinti, Hainan Xu, Xiaohui Zhang, Daniel Povey, Sanjeev Khudanpur; Johns Hopkins University, USA
Tue-P-4-2-1, Time: 13:30–15:30

In this paper we describe a modification to Stochastic Gradient Descent (SGD) that improves generalization to unseen data. It consists of doing two steps for each minibatch: a backward step with a small negative learning rate, followed by a forward step with a larger learning rate. The idea was initially inspired by ideas from adversarial training, but we show that it can be viewed as a crude way of canceling out certain systematic biases that come from training on finite data sets. The method gives ∼10% relative improvement over our best acoustic models based on lattice-free MMI, across multiple datasets with 100–300 hours of data.
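
A minimal NumPy sketch of the two-step update as described above (a small step against the gradient, then a larger step along a freshly computed gradient). The scale factor alpha and the quadratic toy objective are illustrative choices, not the paper's settings.

```python
import numpy as np

def backstitch_step(w, grad_fn, lr=0.1, alpha=0.3):
    """One backstitch update: a small negative step, then a larger positive step."""
    w = w + alpha * lr * grad_fn(w)          # backward step (negative learning rate)
    w = w - (1.0 + alpha) * lr * grad_fn(w)  # forward step with a larger learning rate
    return w

# Toy objective: f(w) = 0.5 * ||w||^2, so grad f(w) = w.
grad_fn = lambda w: w
w = np.array([1.0, -2.0])
for _ in range(20):
    w = backstitch_step(w, grad_fn)
print(w)  # converges toward the minimum at the origin
```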

Node Pruning Based on Entropy of Weights and Node Activity for Small-Footprint Acoustic Model Based on Deep Neural Networks

Ryu Takeda 1, Kazuhiro Nakadai 2, Kazunori Komatani 1; 1Osaka University, Japan; 2Honda Research Institute Japan, Japan
Tue-P-4-2-2, Time: 13:30–15:30

This paper describes a node-pruning method for an acoustic model based on deep neural networks (DNNs). Node pruning is a promising method to reduce the memory usage and computational cost of DNNs. A score function is defined to measure the importance of each node, and less important nodes are pruned. The entropy of the activity of each node has been used as a score function to find nodes whose outputs do not change at all. We introduce the entropy of the weights of each node to take into account the number of weights and their patterns at each node. Because the number of weights and their patterns differ at each layer, the importance of a node should also be measured using the weights related to the target node. We then propose a score function that integrates the entropy of weights and node activity, which prunes less important nodes more efficiently. Experimental results showed that the proposed pruning method successfully reduced the number of parameters by about 6% without any accuracy loss compared with a score function based only on the entropy of node activity.
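
A small sketch of the general idea: an activity-entropy term and a weight-entropy term are combined into one node score, and low-scoring nodes become pruning candidates. The histogram binning and the simple sum used to combine the two entropies are assumptions for illustration, not the authors' exact score function.

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    p = p / (p.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

def node_score(activations, weights, n_bins=20):
    """Higher score = more informative node; low-scoring nodes are pruning candidates."""
    # Entropy of the node's output distribution over the training data.
    hist, _ = np.histogram(activations, bins=n_bins)
    h_activity = entropy(hist)
    # Entropy of the magnitude pattern of the node's weights.
    h_weights = entropy(np.abs(weights))
    return h_activity + h_weights            # illustrative combination

rng = np.random.default_rng(0)
dead_node = node_score(np.zeros(1000), rng.normal(size=256))
live_node = node_score(rng.normal(size=1000), rng.normal(size=256))
print(dead_node < live_node)                  # the constant-output node scores lower
```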

End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow

Ehsan Variani, Tom Bagby, Erik McDermott, Michiel Bacchiani; Google, USA
Tue-P-4-2-3, Time: 13:30–15:30

This article discusses strategies for end-to-end training of state-of-the-art acoustic models for Large Vocabulary Continuous Speech Recognition (LVCSR), with the goal of leveraging TensorFlow components so as to make efficient use of large-scale training sets, large model sizes, and high-speed computation units such as Graphical Processing Units (GPUs). Benchmarks are presented that evaluate the efficiency of different approaches to batching of training data, unrolling of recurrent acoustic models, and device placement of TensorFlow variables and operations. An overall training architecture developed in light of those findings is then described. The approach makes it possible to take advantage of both data parallelism and high-speed computation on GPUs for state-of-the-art sequence training of acoustic models. The effectiveness of the design is evaluated for different training schemes and model sizes, on a 15,000-hour Voice Search task.

An Efficient Phone N-Gram Forward-Backward Computation Using Dense Matrix Multiplication

Khe Chai Sim, Arun Narayanan; Google, USA
Tue-P-4-2-4, Time: 13:30–15:30

The forward-backward algorithm is commonly used to train neural network acoustic models when optimizing a sequence objective like MMI and sMBR. Recent work on lattice-free MMI training of neural network acoustic models shows that the forward-backward algorithm can be computed efficiently in the probability domain as a series of sparse matrix multiplications using GPUs. In this paper, we present a more efficient way of computing forward-backward using a dense matrix multiplication approach. We do this by exploiting the block-diagonal structure of the n-gram state transition matrix; instead of multiplying large sparse matrices, the proposed method involves a series of smaller dense matrix multiplications, which can be computed in parallel. Efficient implementation can be easily achieved by leveraging the optimized matrix multiplication routines provided by standard libraries, such as NumPy and TensorFlow. Runtime benchmarks show that the dense multiplication method is consistently faster than the sparse multiplication method (on both CPUs and GPUs), when applied to a 4-gram phone language model. This is still the case even when the sparse multiplication method uses a more compact finite state model representation by excluding unseen n-grams.
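
A tiny NumPy illustration of the forward recursion in the probability domain expressed as matrix multiplications. The transition matrix here is a small generic dense example; it does not show the block-diagonal phone n-gram structure the paper exploits, which is what would let the product factor into smaller parallel multiplications.

```python
import numpy as np

def forward_probs(init, trans, obs_likes):
    """Forward algorithm in the probability domain via matrix multiplications.

    init      : (S,)   initial state distribution
    trans     : (S, S) transition matrix, trans[i, j] = P(j | i)
    obs_likes : (T, S) per-frame observation likelihoods
    Returns alpha of shape (T, S) and the total sequence likelihood.
    """
    alpha = init * obs_likes[0]
    alphas = [alpha]
    for t in range(1, len(obs_likes)):
        # One dense matrix-vector product per frame; with a block-diagonal
        # transition matrix this factors into smaller independent products.
        alpha = (alpha @ trans) * obs_likes[t]
        alphas.append(alpha)
    return np.stack(alphas), alphas[-1].sum()

rng = np.random.default_rng(0)
S, T = 4, 6
trans = rng.random((S, S)); trans /= trans.sum(axis=1, keepdims=True)
init = np.full(S, 1.0 / S)
obs = rng.random((T, S))
alphas, like = forward_probs(init, trans, obs)
print(like)
```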

Parallel Neural Network Features for Improved Tandem Acoustic Modeling

Zoltán Tüske, Wilfried Michel, Ralf Schlüter, Hermann Ney; RWTH Aachen University, Germany
Tue-P-4-2-5, Time: 13:30–15:30

The combination of acoustic models or features is a standard approach to exploit various knowledge sources. This paper investigates the concatenation of different bottleneck (BN) neural network (NN) outputs for tandem acoustic modeling. Thus, combination of NN features is performed via Gaussian mixture models (GMM). Complementarity between the NN feature representations is attained by using various network topologies: LSTM recurrent, feed-forward, and hierarchical, as well as different non-linearities: hyperbolic tangent, sigmoid, and rectified linear units. Speech recognition experiments are carried out on various tasks: telephone conversations, Skype calls, as well as broadcast news and conversations. Results indicate that the LSTM-based tandem approach is still competitive, and such a tandem model can challenge comparable hybrid systems. The traditional steps of tandem modeling, speaker adaptive and sequence discriminative GMM training, improve the tandem results further. Furthermore, these “old-fashioned” steps remain applicable after the concatenation of multiple neural network feature streams. Exploiting the parallel processing of input feature streams, it is shown that a 2–5% relative improvement could be achieved over the single best BN feature set. Finally, we also report results after neural network based language model rescoring and examine the system combination possibilities using such complex tandem models.

Acoustic Feature Learning via Deep Variational Canonical Correlation Analysis

Qingming Tang, Weiran Wang, Karen Livescu; TTIC, USA
Tue-P-4-2-6, Time: 13:30–15:30

We study the problem of acoustic feature learning in the setting where we have access to another (non-acoustic) modality for feature learning but not at test time. We use deep variational canonical correlation analysis (VCCA), a recently proposed deep generative method for multi-view representation learning. We also extend VCCA with improved latent variable priors and with adversarial learning. Compared to other techniques for multi-view feature learning, VCCA’s advantages include an intuitive latent variable interpretation and a variational lower bound objective that can be trained end-to-end efficiently. We compare VCCA and its extensions with previous feature learning methods on the University of Wisconsin X-ray Microbeam Database, and show that VCCA-based feature learning improves over previous methods for speaker-independent phonetic recognition.

Tue-P-4-3 : Dialog Modeling
Poster 3, 13:30–15:30, Tuesday, 22 Aug. 2017
Chair: Kristiina Jokinen

Online End-of-Turn Detection from Speech Based on Stacked Time-Asynchronous Sequential Networks

Ryo Masumura, Taichi Asami, Hirokazu Masataki, Ryo Ishii, Ryuichiro Higashinaka; NTT, Japan
Tue-P-4-3-1, Time: 13:30–15:30

This paper presents a novel model called stacked time-asynchronous sequential networks (STASNs) for online end-of-turn detection. Online end-of-turn detection, which determines turn-taking points in a real-time manner, is an essential component of human-computer interaction systems. In this study, we use long-range sequential information from multiple time-asynchronous sequential features, such as prosodic, phonetic, and lexical sequential features, to enhance online end-of-turn detection performance. Our key idea is to embed individual sequential features in a fixed-length continuous representation by using sequential networks. This enables us to simultaneously handle multiple time-asynchronous sequential features for end-of-turn detection. By stacking multiple sequential networks, STASNs can embed all of the sequential information between the start of a conversation and the current end-of-utterance in a fixed-length continuous representation that can be directly used for classification. Experiments show that STASNs outperform conventional models with limited sequential information. Furthermore, STASNs with senone bottleneck features extracted using senone-based deep neural networks have superior performance without requiring lexical features decoded by an automatic speech recognition process.

Improving Prediction of Speech Activity Using Multi-Participant Respiratory State

Marcin Włodarczak 1, Kornel Laskowski 2, Mattias Heldner 1, Kätlin Aare 1; 1Stockholm University, Sweden; 2Carnegie Mellon University, USA
Tue-P-4-3-2, Time: 13:30–15:30

One consequence of situated face-to-face conversation is the co-observability of participants’ respiratory movements and sounds. We explore whether this information can be exploited in predicting incipient speech activity. Using a methodology called stochastic turn-taking modeling, we compare the performance of a model trained on speech activity alone to one additionally trained on static and dynamic lung volume features. The methodology permits automatic discovery of temporal dependencies across participants and feature types. Our experiments show that respiratory information substantially lowers cross-entropy rates, and that this generalizes to unseen data.

Turn-Taking Offsets and Dialogue Context

Peter A. Heeman, Rebecca Lunsford; Oregon Health & Science University, USA
Tue-P-4-3-3, Time: 13:30–15:30

A number of researchers have studied turn-taking offsets in human-human dialogues. However, that work collapses over a wide number of different turn-taking contexts. In this work, we delve into the turn-taking delays based on different contexts. We show that turn-taking behavior, both who tends to take the turn next and the turn-taking delays, depends on the previous speech act type, the upcoming speech act, and the nature of the dialogue. This strongly suggests that in studying turn-taking, all turn-taking events should not be grouped together. This also suggests that delays are due to cognitive processing of what to say, rather than whether a speaker should take the turn.

Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems

Angelika Maier, Julian Hough, David Schlangen; Universität Bielefeld, Germany
Tue-P-4-3-4, Time: 13:30–15:30

We address the challenge of improving live end-of-turn detection for situated spoken dialogue systems. While traditionally silence thresholds have been used to detect the user’s end-of-turn, such an approach limits the system’s potential fluidity in interaction, restricting it to a purely reactive paradigm. By contrast, here we present a system which takes a predictive approach. The user’s end-of-turn is predicted live as acoustic features and words are consumed by the system. We compare the benefits of live lexical and acoustic information by feature analysis and testing equivalent models with different feature sets with a common deep learning architecture, a Long Short-Term Memory (LSTM) network. We show the usefulness of incremental enriched language model features in particular. Training and testing on Wizard-of-Oz data collected to train an agent in a simple virtual world, we are successful in improving over a reactive baseline in terms of reducing latency whilst minimising the cut-in rate.

End-of-Utterance Prediction by Prosodic Features and Phrase-Dependency Structure in Spontaneous Japanese Speech

Yuichi Ishimoto 1, Takehiro Teraoka 2, Mika Enomoto 2; 1NINJAL, Japan; 2Tokyo University of Technology, Japan
Tue-P-4-3-5, Time: 13:30–15:30

This study is aimed at uncovering how participants in conversation predict the end of utterances in spontaneous Japanese speech. In spontaneous everyday conversation, the participants must predict the ends of a speaker’s utterances to perform smooth turn-taking without too much gap. We consider that they utilize not only syntactic factors but also prosodic factors for end-of-utterance prediction because of the difficulty of predicting a syntactic completion point in spontaneous Japanese. In previous studies, we found that prosodic features changed significantly in the final accentual phrase. However, it is not clear which prosodic features support the prediction. In this paper, we focused on the dependency structure among bunsetsu-phrases as the syntactic factor, and investigated the relation between phrase dependency and prosodic features. The results showed that the average fundamental frequency and the average intensity of accentual phrases did not decline until the modified phrase appeared. Next, to predict the end of an utterance from the syntactic and prosodic features, we constructed a generalized linear mixed model. The model provided higher accuracy than using the prosodic features only. These results suggest that prosodic changes and phrase-dependency relations inform the hearer that the utterance is approaching its end.

Turn-Taking Estimation Model Based on Joint Embedding of Lexical and Prosodic Contents

Chaoran Liu, Carlos Ishi, Hiroshi Ishiguro; ATR HIL, Japan
Tue-P-4-3-6, Time: 13:30–15:30

A natural conversation involves rapid exchanges of turns while talking. Taking turns at appropriate timing or intervals is a requisite feature for a dialog system as a conversation partner. This paper proposes a model that estimates the timing of turn-taking during verbal interactions. Unlike previous studies, our proposed model does not rely on a silence region between sentences, since a dialog system must respond without large gaps or overlaps. We propose a Recurrent Neural Network (RNN) based model that takes the joint embedding of lexical and prosodic contents as its input to classify utterances into turn-taking related classes and estimate the turn-taking timing. To this end, we trained a neural network to embed the lexical contents, the fundamental frequencies, and the speech power into a joint embedding space. To learn meaningful embedding spaces, the prosodic features from each single utterance are pre-trained using an RNN and combined with the utterance lexical embedding as the input of our proposed model. We tested this model on a spontaneous conversation dataset and confirmed that it outperformed the use of word embedding-based features.

Social Signal Detection in Spontaneous Dialogue Using Bidirectional LSTM-CTC

Hirofumi Inaguma, Koji Inoue, Masato Mimura, Tatsuya Kawahara; Kyoto University, Japan
Tue-P-4-3-7, Time: 13:30–15:30

Non-verbal speech cues such as laughter and fillers, which arecollectively called social signals, play an important role in humancommunication. Therefore, detection of them would be usefulfor dialogue systems to infer speaker’s intentions, emotions andengagements. The conventional approaches are based on frame-wiseclassifiers, which require precise time-alignment of these events fortraining. This work investigates the Connectionist Temporal Classi-fication (CTC) approach which can learn an alignment between theinput and its target label sequence. This allows for robust detectionof the events and efficient training without precise time information.Experimental evaluations with various settings demonstrate thatCTC based on bidirectional LSTM outperforms the conventional DNNand HMM based methods.
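
A minimal PyTorch sketch of a bidirectional-LSTM model trained with a CTC loss over event labels, which is the general setup the abstract describes. The feature dimension, hidden size and the three-symbol label set (blank, laughter, filler) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Labels: 0 = CTC blank, 1 = laughter, 2 = filler (illustrative label set).
N_CLASSES, FEAT_DIM, HIDDEN = 3, 40, 64

blstm = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True, bidirectional=True)
proj = nn.Linear(2 * HIDDEN, N_CLASSES)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# One batch: 2 utterances of 100 frames of acoustic features.
x = torch.randn(2, 100, FEAT_DIM)
h, _ = blstm(x)
log_probs = proj(h).log_softmax(dim=-1).transpose(0, 1)   # (T, batch, classes)

# Target label sequences without any frame-level alignment.
targets = torch.tensor([1, 2, 1, 2, 2])                   # concatenated targets
target_lengths = torch.tensor([2, 3])                     # per-utterance lengths
input_lengths = torch.full((2,), 100, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                            # trainable end-to-end
print(float(loss))
```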

Entrainment in Multi-Party Spoken Dialogues at Multiple Linguistic Levels

Zahra Rahimi 1, Anish Kumar 1, Diane Litman 1, Susannah Paletz 2, Mingzhi Yu 1; 1University of Pittsburgh, USA; 2University of Maryland, USA
Tue-P-4-3-8, Time: 13:30–15:30

Linguistic entrainment, the phenomenon whereby dialogue partners speak more similarly to each other in a variety of dimensions, is key to the success and naturalness of interactions. While there is considerable evidence for both lexical and acoustic-prosodic entrainment, little work has been conducted to investigate the relationship between these two different modalities using the same measures in the same dialogues, specifically in multi-party dialogue. In this paper, we measure lexical and acoustic-prosodic entrainment for multi-party teams to explore whether entrainment occurs at multiple levels during conversation and to understand the relationship between these two modalities.

Measuring Synchrony in Task-Based Dialogues

Justine Reverdy, Carl Vogel; Trinity College Dublin, Ireland
Tue-P-4-3-9, Time: 13:30–15:30

In many contexts, from casual everyday conversations to formal discussions, people tend to repeat their interlocutors, and themselves. This phenomenon not only yields the random repetitions one might expect from a natural Zipfian distribution of linguistic forms, but also projects underlying discourse mechanisms and rhythms that researchers have suggested establish conversational involvement and may support communicative progress towards mutual understanding. In this paper, advances in an automated method for assessing interlocutor synchrony in task-based human-to-human interactions are reported. The method focuses on dialogue structure rather than temporal distance, measuring repetition between speakers and their interlocutors’ last n turns (n = 1, however far back in the conversation that might have been) rather than utterances during a prior window fixed by duration. The significance of distinct linguistic levels of repetition is assessed by observing contrasts between actual and randomized dialogues, in order to provide a quantifying measure of communicative success. Definite patterns of repetition were identified, notably in contrasting the roles of participants (as information giver or follower). The extent to which those patterns interacted, sometimes surprisingly, with gender, eye contact and familiarity is the principal contribution of this work.

Sequence to Sequence Modeling for User Simulation in Dialog Systems

Paul Crook, Alex Marin; Microsoft, USA
Tue-P-4-3-10, Time: 13:30–15:30

User simulators are a principal offline method for training and evaluating human-computer dialog systems. In this paper, we examine simple sequence-to-sequence neural network architectures for training end-to-end, natural language to natural language, user simulators, using only raw logs of previous interactions without any additional human labelling. We compare the neural network-based simulators with a language model (LM)-based approach for creating natural language user simulators. Using both an automatic evaluation based on LM perplexity and a human evaluation, we demonstrate that the sequence-to-sequence approaches outperform the LM-based method. We show correlation between LM perplexity and the human evaluation on this task, and discuss the benefits of different neural network architecture variations.

Human and Automated Scoring of Fluency, Pronunciation and Intonation During Human–Machine Spoken Dialog Interactions

Vikram Ramanarayanan, Patrick L. Lange, Keelan Evanini, Hillary R. Molloy, David Suendermann-Oeft; Educational Testing Service, USA
Tue-P-4-3-11, Time: 13:30–15:30

We present a spoken dialog-based framework for the computer-assisted language learning (CALL) of conversational English. In particular, we leveraged the open-source HALEF dialog framework to develop a job interview conversational application. We then used crowdsourcing to collect multiple interactions with the system from non-native English speakers. We analyzed human-rated scores of the recorded dialog data on three different scoring dimensions critical to the delivery of conversational English — fluency, pronunciation and intonation/stress — and further examined the efficacy of automatically-extracted, hand-curated speech features in predicting each of these sub-scores. Machine learning experiments showed that trained scoring models generally perform at par with the human inter-rater agreement baseline in predicting human-rated scores of conversational proficiency.

Hierarchical LSTMs with Joint Learning for Estimating Customer Satisfaction from Contact Center Calls

Atsushi Ando, Ryo Masumura, Hosana Kamiyama, Satoshi Kobashikawa, Yushi Aono; NTT, Japan
Tue-P-4-3-12, Time: 13:30–15:30

This paper presents joint modeling of both turn-level and call-level customer satisfaction in contact center dialogue. Our key idea is to directly apply turn-level estimation results to call-level estimation and optimize them jointly; previous work treated both estimations as being independent. The proposed joint modeling is achieved by stacking two types of long short-term memory recurrent neural networks (LSTM-RNNs). The lower layer employs an LSTM-RNN for sequential labeling of turn-level customer satisfaction, in which each label is estimated from context information extracted from not only the target turn but also the surrounding turns. The upper layer uses another LSTM-RNN to estimate call-level customer satisfaction labels from all of the estimated turn-level customer satisfaction information. These two networks can be efficiently optimized by joint learning of both types of labels. Experiments show that the proposed method outperforms a conventional support vector machine based method in terms of both turn-level and call-level customer satisfaction, with relative error reductions of over 20%.
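
A minimal PyTorch sketch of the two-level stacking idea: a lower LSTM produces per-turn labels, an upper LSTM maps those turn-level outputs to a single call-level label, and both are trained with a joint loss. The feature dimension, hidden size, label sets, and the equal 1:1 loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalSatisfaction(nn.Module):
    """Lower LSTM: per-turn labels; upper LSTM: one call-level label."""
    def __init__(self, feat_dim=32, hidden=64, n_turn_cls=3, n_call_cls=3):
        super().__init__()
        self.turn_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.turn_out = nn.Linear(hidden, n_turn_cls)
        self.call_lstm = nn.LSTM(n_turn_cls, hidden, batch_first=True)
        self.call_out = nn.Linear(hidden, n_call_cls)

    def forward(self, turns):                        # turns: (batch, n_turns, feat_dim)
        h, _ = self.turn_lstm(turns)
        turn_logits = self.turn_out(h)               # (batch, n_turns, n_turn_cls)
        # Feed the turn-level estimates to the call-level network.
        g, _ = self.call_lstm(turn_logits.softmax(dim=-1))
        call_logits = self.call_out(g[:, -1])        # (batch, n_call_cls)
        return turn_logits, call_logits

model = HierarchicalSatisfaction()
turns = torch.randn(4, 10, 32)                       # 4 calls, 10 turns each
turn_y = torch.randint(0, 3, (4, 10))
call_y = torch.randint(0, 3, (4,))
turn_logits, call_logits = model(turns)

ce = nn.CrossEntropyLoss()
loss = ce(turn_logits.reshape(-1, 3), turn_y.reshape(-1)) + ce(call_logits, call_y)
loss.backward()                                      # both levels optimized jointly
print(float(loss))
```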

Domain-Independent User Satisfaction Reward Estimation for Dialogue Policy Learning

Stefan Ultes, Paweł Budzianowski, Iñigo Casanueva, Nikola Mrkšic, Lina Rojas-Barahona, Pei-Hao Su, Tsung-Hsien Wen, Milica Gašic, Steve Young; University of Cambridge, UK
Tue-P-4-3-13, Time: 13:30–15:30

Learning suitable and well-performing dialogue behaviour in statistical spoken dialogue systems has been the focus of research for many years. While most work based on reinforcement learning employs an objective measure like task success for modelling the reward signal, we propose to use a reward based on user satisfaction. We show in simulated experiments that a live user satisfaction estimation model may be applied, resulting in higher estimated satisfaction whilst achieving similar success rates. Moreover, we show that a satisfaction estimation model trained on one domain may be applied in many other domains which cover a similar task. We verify our findings by employing the model in one of the domains for learning a policy from real users and compare its performance to policies using the user satisfaction and task success acquired directly from the users as reward.


Analysis of the Relationship Between Prosodic Features of Fillers and its Forms or Occurrence Positions

Shizuka Nakamura, Ryosuke Nakanishi, Katsuya Takanashi, Tatsuya Kawahara; Kyoto University, Japan
Tue-P-4-3-14, Time: 13:30–15:30

Fillers are involved in the ease of understanding by listeners and in turn-taking. However, knowledge about their prosodic features is insufficient, and their modeling has not been done either. For these reasons, there is insufficient knowledge at present to generate natural and appropriate fillers in a dialog system. Therefore, for the purpose of clarifying the prosodic features of fillers, their relationship with occurrence positions or forms was analyzed in this research. ‘Ano’ and ‘Eto’ were used as forms, and non-/boundary of Dialog Act and non-/turn-taking as occurrence positions. Duration, F0, and intensity were utilized as prosodic features. As a result, the following was found: the prosodic features differ depending on the occurrence position even for fillers of the same form, and similar prosodic features are found for the same occurrence positions even across different forms.

Cross-Subject Continuous Emotion Recognition Using Speech and Body Motion in Dyadic Interactions

Syeda Narjis Fatima, Engin Erzin; Koç Üniversitesi, Turkey
Tue-P-4-3-15, Time: 13:30–15:30

Dyadic interactions encapsulate rich emotional exchange between interlocutors, suggesting a multimodal, cross-speaker and cross-dimensional continuous emotion dependency. This study explores the dynamic inter-attribute emotional dependency at the cross-subject level with implications for continuous emotion recognition based on speech and body motion cues. We propose a novel two-stage Gaussian Mixture Model mapping framework for the continuous emotion recognition problem. In the first stage, we perform continuous emotion recognition (CER) of both speakers from speech and body motion modalities to estimate activation, valence and dominance (AVD) attributes. In the second stage, we improve the first stage estimates by performing CER of the selected speaker using her/his speech and body motion modalities as well as using the estimated affective attribute(s) of the other speaker. Our experimental evaluations indicate that the second stage, cross-subject continuous emotion recognition (CSCER), provides complementary information to recognize the affective state, and delivers promising improvements for the continuous emotion recognition problem.

Tue-P-5-1 : L1 and L2 Acquisition
Poster 1, 16:00–18:00, Tuesday, 22 Aug. 2017
Chair: Aoju Chen

An Automatically Aligned Corpus of Child-Directed Speech

Micha Elsner, Kiwako Ito; Ohio State University, USA
Tue-P-5-1-1, Time: 16:00–18:00

Forced alignment would enable phonetic analyses of child-directed speech (CDS) corpora which have existing transcriptions. But existing alignment systems are inaccurate due to the atypical phonetics of CDS. We adapt a Kaldi forced alignment system to CDS by extending the dictionary and providing it with heuristically-derived hints for vowel locations. Using this system, we present a new time-aligned CDS corpus with a million aligned segments. We manually correct a subset of the corpus and demonstrate that our system is 70% accurate. Both our automatic and manually corrected alignments are publicly available at osf.io/ke44q.

A Comparison of Danish Listeners’ Processing Cost in Judging the Truth Value of Norwegian, Swedish, and English Sentences

Ocke-Schwen Bohn, Trine Askjær-Jørgensen; Aarhus University, Denmark
Tue-P-5-1-2, Time: 16:00–18:00

The present study used a sentence verification task to assess the processing cost involved in native Danish listeners’ attempts to comprehend true/false statements spoken in Danish, Norwegian, Swedish, and English. Three groups of native Danish listeners heard 40 sentences each which were translation equivalents, and assessed the truth value of these statements. Group 1 heard sentences in Danish and Norwegian, Group 2 in Danish and Swedish, and Group 3 in Danish and English. Response time and proportion of correct responses were used as indices of processing cost. Both measures indicate that the processing cost for native Danish listeners in comprehending Danish and English statements is equivalent, whereas Norwegian and Swedish statements incur a much higher cost, both in terms of response time and correct assessments. The results are discussed with regard to the costs of inter-Scandinavian and English lingua franca communication.

On the Role of Temporal Variability in the Acquisition of the German Vowel Length Contrast

Felicitas Kleber; LMU München, Germany
Tue-P-5-1-3, Time: 16:00–18:00

This study is part of a larger project investigating the acquisition of stable vowel-plus-consonant timing patterns needed to convey the phonemic vowel length and the voicing contrast in German. The research is motivated by findings showing greater temporal variability in children until the age of 12. The specific aims of the current study were to test (1) whether temporal variability in the production of the vowel length contrast decreases with increasing age (in general and more so when the variability is speech rate induced) and (2) whether duration cues are perceived more categorically with increasing age. Production and perception data were obtained from eleven preschool children, five school children and eleven adults. Results revealed that children produce the quantity contrast with temporal patterns that are similar to adults’ patterns, although vowel duration was overall longer and variability slightly higher in faster speech and younger children. Apart from that, the two groups of children did not differ in production. In perception, however, school children’s response patterns to a continuum from a long vowel to a short vowel word were in between those of adults and preschool children. Findings are discussed with respect to motor control and phonemic abstraction.

A Data-Driven Approach for Perceptually Validated Acoustic Features for Children’s Sibilant Fricative Productions

Patrick F. Reidy 1, Mary E. Beckman 2, Jan Edwards 3, Benjamin Munson 4; 1University of Texas at Dallas, USA; 2Ohio State University, USA; 3University of Maryland, USA; 4University of Minnesota, USA
Tue-P-5-1-4, Time: 16:00–18:00

Both perceptual and acoustic studies of children’s speech independently suggest that phonological contrasts are continuously refined during acquisition. This paper considers two traditional acoustic features for the ‘s’-vs.-‘sh’ contrast (centroid and peak frequencies) and a novel feature learned from data, evaluating these features relative to perceptual ratings of children’s productions.

Productions of sibilant fricatives were elicited from 16 adults and 69 preschool children. A second group of adults rated the children’s productions on a visual analog scale (VAS). Each production was rated by multiple listeners; the mean VAS score for each production was used as its perceptual goodness rating. For each production from the repetition task, a psychoacoustic spectrum was estimated by passing it through a filter bank that modeled the auditory periphery. From these spectra, centroid and peak frequencies were computed, two traditional features for a sibilant fricative’s place of articulation. A novel acoustic measure was derived by inputting the spectra to a graph-based dimensionality-reduction algorithm.

Simple regression analyses indicated that a greater amount of variance in the VAS scores was explained by the novel feature (adjusted R² = 0.569) than by either centroid (adjusted R² = 0.468) or peak frequency (adjusted R² = 0.254).
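
For readers unfamiliar with the metric used in this comparison, a short sketch of adjusted R² under its usual definition, applied to a simple regression on synthetic data (the data and variable names are made up for illustration).

```python
import numpy as np

def adjusted_r2(y, y_pred, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    ss_res = ((y - y_pred) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    n = len(y)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# Synthetic example: one acoustic feature predicting perceptual goodness ratings.
rng = np.random.default_rng(0)
feature = rng.normal(size=200)
ratings = 0.8 * feature + rng.normal(scale=0.5, size=200)
slope, intercept = np.polyfit(feature, ratings, 1)
print(adjusted_r2(ratings, slope * feature + intercept, n_predictors=1))
```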

Proficiency Assessment of ESL Learner’s Sentence Prosody with TTS Synthesized Voice as Reference

Yujia Xiao 1, Frank K. Soong 2; 1SCUT, China; 2Microsoft, China
Tue-P-5-1-5, Time: 16:00–18:00

We investigate how to assess the prosody quality of an ESL learner’s spoken sentence against a native speaker’s natural recording or TTS synthesized voice. A spoken English utterance read by an ESL learner is compared with the recording of a native speaker, or TTS voice. The corresponding F0 contours (with voicings) and breaks are compared at the mapped syllable level via DTW. The correlations between the prosody patterns of the learner and the native speaker (or TTS voice) of the same sentence are computed after the speech rates and F0 distributions between speakers are equalized. Based upon collected native and non-native speakers’ databases and correlation coefficients, we use Gaussian mixtures to model them as continuous distributions for training a two-class (native vs non-native) neural net classifier. We found that the classification accuracy using a native speaker’s reference and using a TTS reference is close, i.e., 91.2% vs 88.1%. To assess the prosody proficiency of an ESL learner with one sentence input, the prosody patterns of our high-quality TTS are almost as effective as those of native speakers’ recordings, which are more expensive and inconvenient to collect.
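
A minimal NumPy sketch of the core comparison step: aligning two F0 contours with DTW and correlating the aligned values. This is a generic sample-level DTW on made-up contours, not the paper's syllable-level mapping or its rate/F0 equalization.

```python
import numpy as np

def dtw_path(a, b):
    """Classic DTW over two 1-D sequences; returns the alignment path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the aligned index pairs.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else ((i - 1, j) if step == 1 else (i, j - 1))
    return path[::-1]

# Two F0 contours (in semitones) with different speech rates.
learner = np.array([0.0, 1.0, 2.5, 2.0, 1.0, 0.5])
reference = np.array([0.0, 0.5, 1.2, 2.4, 2.6, 1.8, 0.9, 0.4])
path = dtw_path(learner, reference)
aligned = np.array([(learner[i], reference[j]) for i, j in path])
print(np.corrcoef(aligned[:, 0], aligned[:, 1])[0, 1])   # prosody similarity score
```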

Mechanisms of Tone Sandhi Rule Application by Non-Native Speakers

Si Chen 1, Yunjuan He 2, Chun Wah Yuen 1, Bei Li 1, Yike Yang 1; 1Hong Kong Polytechnic University, China; 2University of North Georgia, USA
Tue-P-5-1-6, Time: 16:00–18:00

This study is the first to examine the acquisition of two Mandarin tone sandhi rules by Cantonese speakers. It uses both real words and different types of wug words to test whether learners may exploit a lexical or a computation mechanism in tone sandhi rule application. We also statistically compared their speech production with that of Beijing Mandarin speakers. The results of functional data analysis showed that non-native speakers applied tone sandhi rules to both real and wug words in a similar manner, indicating that they might utilize a computation mechanism and compute the rules under phonological conditions. The absence of significant differences in applying these two phonological rules when reading wug words also suggests no bias in the application of these two rules. However, their speech production differed from that of native speakers. The application of the third tone sandhi rule was more categorical than for native speakers, in that Cantonese speakers tended to neutralize the sandhi Tone 3 more with Tone 2 produced in isolation compared to native speakers. Also, Cantonese speakers might not have applied the half-third tone sandhi rule fully, since they tended to raise f0 values more at the end of vowels.

Changes in Early L2 Cue-Weighting of Non-Native Speech: Evidence from Learners of Mandarin Chinese

Seth Wiener; Carnegie Mellon University, USA
Tue-P-5-1-7, Time: 16:00–18:00

This study examined how cue-weighting of a non-native speech cue changes during early adult second language (L2) acquisition. Ten native English speaking learners of Mandarin Chinese performed a speeded AX-discrimination task during months 1, 2, and 3 of a first-year Chinese course. Results were compared to ten native Mandarin speakers. Learners’ reaction time and d-prime results became more native-like after two months of classroom study but plateaued thereafter. Multidimensional scaling results showed a similar shift to more native-like cue-weighting as learners attended more to pitch direction and less to pitch height. Despite the improvements, learners’ month 3 configuration of cue-weighting differed from that of native speakers; learners appeared to weight pitch end points rather than overall pitch directions. These results suggest that learners’ warping of the weights of dimensions underlying the perceptual space changes rapidly during early acquisition and can plateau like other measures of L2 acquisition. Previous perceptual learning studies may have only captured initial L2 perception gains, not the learning plateau that often follows. New methods of perceptual learning, especially for tonal languages, are needed to advance learners off the plateau.

Directing Attention During Perceptual Training: A Preliminary Study of Phonetic Learning in Southern Min by Mandarin Speakers

Ying Chen 1, Eric Pederson 2; 1NUST, China; 2University of Oregon, USA
Tue-P-5-1-8, Time: 16:00–18:00

Previous studies have shown that directing learners’ attention during perceptual training facilitates detection and learning of unfamiliar consonant categories [1, 2]. The current study asks whether this attentional directing can also facilitate other types of phonetic learning. Monolingual Mandarin speakers were divided into two groups directed to learn either 1) the consonants or 2) the tones in an identification training task with the same set of Southern Min monosyllabic words containing the consonants /ph, p, b, kh, k, g, tCh, tC, C/ and the tones (55, 33, 22, 24, 41). All subjects were also tested with an AXB discrimination task (with a distinct set of Southern Min words) before and after the training. Unsurprisingly, both groups improved accuracy in the sound type to which they attended. However, the consonant-attending group did not improve in discriminating tones after training and neither did the tone-attending group in discriminating consonants — despite both groups having equal exposure to the same training stimuli. When combined with previous results for consonant and vowel training, these results suggest that explicitly directing learners’ attention has a broadly facilitative effect on phonetic learning, including of tonal contrasts.

Prosody Analysis of L2 English for Naturalness Evaluation Through Speech Modification

Dean Luo 1, Ruxin Luo 2, Lixin Wang 3; 1Shenzhen Institute of Information Technology, China; 2Shenzhen Polytechnic, China; 3Shenzhen Seaskyland Technologies, China
Tue-P-5-1-9, Time: 16:00–18:00

This study investigates how different prosodic features affect native speakers’ naturalness judgement of L2 English speech by Chinese students. Through subjective judgment by native speakers and objectively measured prosodic features, timing and pitch related prosodic features, as well as segmental goodness of pronunciation, have been found to play key roles in native speakers’ perception of naturalness. In order to eliminate segmental factors, we used accent conversion techniques that modify native reference speech with learners’ erroneous prosodic cues without altering segmental properties. Experimental results show that, without interference from segmental factors, both timing and pitch features affect the naturalness of L2 speech. Timing plays a more crucial role in naturalness than pitch. Accent modification that corrects timing or pitch errors can improve the naturalness of the speech.

Measuring Encoding Efficiency in Swedish and English Language Learner Speech Production

Gintare Grigonyte 1, Gerold Schneider 2; 1Stockholm University, Sweden; 2Universität Zürich, Switzerland
Tue-P-5-1-10, Time: 16:00–18:00

We use n-gram language models to investigate how far language approximates an optimal code for human communication in terms of Information Theory [1], and what differences there are between learner proficiency levels. Although the language of lower level learners is simpler, it is less optimal in terms of information theory, and as a consequence more difficult to process.
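
As a rough illustration of the information-theoretic notion of encoding efficiency invoked here, the per-token cross-entropy of a text under an n-gram model can be estimated as below (a minimal bigram sketch with add-one smoothing; the corpus, tokens and function name are invented for illustration and are not from the paper):

    import math
    from collections import Counter

    def bigram_cross_entropy(train_tokens, test_tokens):
        """Per-token cross-entropy (bits) of a test text under a bigram
        model with add-one smoothing trained on train_tokens."""
        vocab = set(train_tokens) | set(test_tokens)
        unigrams = Counter(train_tokens)
        bigrams = Counter(zip(train_tokens, train_tokens[1:]))
        log_prob = 0.0
        for prev, cur in zip(test_tokens, test_tokens[1:]):
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab))
            log_prob += math.log2(p)
        return -log_prob / (len(test_tokens) - 1)

    # Lower cross-entropy means the text is closer to an optimal code for the data.
    train = "the learner writes a short text about the topic".split()
    test = "the learner writes about the text".split()
    print(bigram_cross_entropy(train, test))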

Lexical Adaptation to a Novel Accent in German: A Comparison Between German, Swedish, and Finnish Listeners

Adriana Hanulíková 1, Jenny Ekström 2; 1Albert-Ludwigs-Universität Freiburg, Germany; 2University of Stockholm, Sweden
Tue-P-5-1-11, Time: 16:00–18:00

Listeners usually adjust rapidly to unfamiliar regional and foreign accents in their native (L1) language. Non-native (L2) listeners, however, usually struggle when confronted with unfamiliar accents in their non-native language. The present study asks how native language background of L2 speakers influences lexical adjustments in a novel accent of German, in which several vowels were systematically lowered. We measured word judgments on a lexical decision task before and after exposure to a 15-min story in the novel dialect, and compared German, Swedish and Finnish listeners' performance. Swedish is a Germanic language and shares with German a number of lexical roots and a relatively large vowel inventory. Finnish is a Finno-Ugric language and differs substantially from Germanic languages in both lexicon and phonology. The results were as predicted: descriptively, all groups showed a similar pattern of adaptation to the accented speech, but only German and Swedish participants showed a significant effect. Lexical and phonological relatedness between the native and non-native languages may thus positively influence lexical adaptation in an unfamiliar accent.

Qualitative Differences in L3 Learners' Neurophysiological Response to L1 versus L2 Transfer

Alejandra Keidel Fernández, Thomas Hörberg; Stockholm University, Sweden
Tue-P-5-1-12, Time: 16:00–18:00

Third language (L3) acquisition differs from first language (L1) and second language (L2) acquisition. There are different views on whether L1 or L2 is of primary influence on L3 acquisition in terms of transfer. This study examines differences in the event-related brain potentials (ERP) response to agreement incongruencies between L1 Spanish speakers and L3 Spanish learners, comparing response differences to incongruencies that are transferrable from the learners' L1 (Swedish), or their L2 (English). Whereas verb incongruencies, available in L3 learners' L2 but not their L1, engendered a similar response for L1 speakers and L3 learners, adjective incongruencies, available in L3 learners' L1 but not their L2, elicited responses that differed between groups: Adjective incongruencies engendered a negativity in the 450–550 ms time window for L1 speakers only. Both congruent and incongruent adjectives also engendered an enhanced P3 wave in L3 learners compared to L1 speakers. Since the P300 correlates with task-related, strategic processing, this indicates that L3 learners process grammatical features that are transferrable from their L1 in a less automatic mode than features that are transferrable from their L2. L3 learners therefore seem to benefit more from their knowledge of their L2 than their knowledge of their L1.

Articulation Rate in Swedish Child-Directed Speech Increases as a Function of the Age of the Child Even When Surprisal is Controlled for

Johan Sjons 1, Thomas Hörberg 1, Robert Östling 1, Johannes Bjerva 2; 1Stockholm University, Sweden; 2Rijksuniversiteit Groningen, The Netherlands
Tue-P-5-1-13, Time: 16:00–18:00

In earlier work, we have shown that articulation rate in Swedish child-directed speech (CDS) increases as a function of the age of the child, even when utterance length and differences in articulation rate between subjects are controlled for. In this paper we show on utterance level in spontaneous Swedish speech that i) for the youngest children, articulation rate in CDS is lower than in adult-directed speech (ADS), ii) there is a significant negative correlation between articulation rate and surprisal (the negative log probability) in ADS, and iii) the increase in articulation rate in Swedish CDS as a function of the age of the child holds, even when surprisal along with utterance length and differences in articulation rate between speakers are controlled for. These results indicate that adults adjust their articulation rate to make it fit the linguistic capacity of the child.

The Relationship Between the Perception and Production of Non-Native Tones

Kaile Zhang, Gang Peng; Hong Kong Polytechnic University, China
Tue-P-5-1-14, Time: 16:00–18:00

To further investigate the relationship between non-native tone perception and production, the present study trained Mandarin speakers to learn Cantonese lexical tones with a speech shadowing paradigm. After two weeks' training, both Mandarin speakers' Cantonese tone perception and their production had improved significantly. The overall performances in Cantonese tone perception and production are moderately correlated, but the degree of performance change after training in the two modalities shows no correlation, suggesting that non-native tone perception and production might be partially correlated, but that the improvement of the two modalities is not synchronous. A comparison between the present study and previous studies on non-native tone learning indicates that experience in lexical tone processing might be important in forming the correlation between tone perception and production. Mandarin speakers showed greater improvement in Cantonese tone perception than in production after training, indicating that second language (L2) perception might precede production. Besides, both the first language (L1) and L2 tonal systems showed an influence on Mandarin speakers' learning of Cantonese tones.

MMN Responses in Adults After Exposure to Bimodal and Unimodal Frequency Distributions of Rotated Speech

Ellen Marklund, Elísabet Eir Cortes, Johan Sjons; Stockholm University, Sweden
Tue-P-5-1-15, Time: 16:00–18:00

The aim of the present study is to further the understanding of the relationship between perceptual categorization and exposure to different frequency distributions of sounds. Previous studies have shown that speech sound discrimination proficiency is influenced by exposure to different distributions of speech sound continua varying along one or several acoustic dimensions, both in adults and in infants. In the current study, adults were presented with either a bimodal or a unimodal frequency distribution of spectrally rotated sounds along a continuum (a vowel continuum before rotation). Categorization of the sounds, quantified as amplitude of the event-related potential (ERP) component mismatch negativity (MMN) in response to two of the sounds, was measured before and after exposure. It was expected that the bimodal group would have a larger MMN amplitude after exposure whereas the unimodal group would have a smaller MMN amplitude after exposure. Contrary to expectations, the MMN amplitude was smaller overall after exposure, and no difference was found between groups. This suggests that either the previously reported sensitivity to frequency distributions of speech sounds is not present for non-speech sounds, or the MMN amplitude is not a sensitive enough measure of categorization to detect an influence from passive exposure, or both.

Tue-P-5-2 : Voice, Speech and Hearing Disorders
Poster 2, 16:00–18:00, Tuesday, 22 Aug. 2017
Chair: Timothy Bunnell

Float Like a Butterfly Sting Like a Bee: Changes in Speech Preceded Parkinsonism Diagnosis for Muhammad Ali

Visar Berisha 1, Julie Liss 1, Timothy Huston 1, Alan Wisler 1, Yishan Jiao 1, Jonathan Eig 2; 1Arizona State University, USA; 2Independent Author, USA
Tue-P-5-2-1, Time: 16:00–18:00

Early identification of the onset of neurological disease is critical for testing drugs or interventions to halt or slow progression. Speech production has been proposed as an early indicator of neurological impairment. However, for speech to be useful for early detection, speech changes should be measurable from uncontrolled conversational speech collected passively in natural recording environments over extended periods of time. Such longitudinal speech data sets for testing the robustness of algorithms are difficult to acquire. In this paper, we exploit YouTube interviews from Muhammad Ali from 1968 to 1981, before his 1984 diagnosis of parkinsonism. The interviews are unscripted, conversational in nature, and of varying fidelity. We measured changes in speech production from the Ali interviews and analyzed these changes relative to a coded registry of blows Mr. Ali received in each of his boxing matches over time. This provided a rich and unique opportunity to evaluate speech change as both a function of disease progression and as a function of fight history. Multivariate analyses revealed changes in prosody and articulation consistent with hypokinetic dysarthria over time, and a relationship between reduced speech intonation and the amount of time elapsed since the most recent fight preceding the interview.

Cepstral and Entropy Analyses in Vowels Excerpted from Continuous Speech of Dysphonic and Control Speakers

Antonella Castellana 1, Andreas Selamtzis 2, Giampiero Salvi 2, Alessio Carullo 1, Arianna Astolfi 1; 1Politecnico di Torino, Italy; 2KTH, Sweden
Tue-P-5-2-2, Time: 16:00–18:00

There is a growing interest in Cepstral and Entropy analyses of voice samples for defining a vocal health indicator, due to their reliability in investigating both regular and irregular voice signals. The purpose of this study is to determine whether the Cepstral Peak Prominence Smoothed (CPPS) and Sample Entropy (SampEn) could differentiate dysphonic speakers from normal speakers in vowels excerpted from readings and to compare their discrimination power. Results are reported for 33 patients and 31 controls, who read a standardized phonetically balanced passage while wearing a head-mounted microphone. Vowels were excerpted from recordings using Automatic Speech Recognition and, after obtaining a measure for each vowel, individual distributions and their descriptive statistics were considered for CPPS and SampEn. The Receiver Operating Curve analysis revealed that the mean of the distributions was the parameter with the highest discrimination power for both CPPS and SampEn. CPPS showed a higher diagnostic precision than SampEn, exhibiting an Area Under Curve (AUC) of 0.85 compared to 0.72. A negative correlation between the parameters was found (Spearman; ρ = -0.61), with higher SampEn corresponding to lower CPPS. The automatic method used in this study could provide support to voice monitoring in clinics and during individuals' daily activities.
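
Sample Entropy, one of the two measures compared above, has a standard definition (template length m, tolerance r) that can be sketched in a few lines; the implementation below follows that general definition and is only illustrative, not the authors' code:

    import numpy as np

    def sample_entropy(x, m=2, r=None):
        """Sample Entropy: -ln(A/B), where B counts pairs of templates of
        length m within Chebyshev distance r, and A the same for length m+1.
        Self-matches are excluded."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        if r is None:
            r = 0.2 * np.std(x)

        def count_matches(length):
            # Use the same number of templates (n - m) for both lengths.
            templates = np.array([x[i:i + length] for i in range(n - m)])
            count = 0
            for i in range(len(templates)):
                dist = np.max(np.abs(templates - templates[i]), axis=1)
                count += np.sum(dist <= r) - 1   # exclude the self-match
            return count

        B = count_matches(m)
        A = count_matches(m + 1)
        return -np.log(A / B) if A > 0 and B > 0 else np.inf

    # Toy example: a noisy sinusoid (lower SampEn = more regular signal).
    rng = np.random.default_rng(0)
    signal = np.sin(np.linspace(0, 20, 400)) + 0.1 * rng.standard_normal(400)
    print(sample_entropy(signal))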

Classification of Bulbar ALS from Kinematic Features of the Jaw and Lips: Towards Computer-Mediated Assessment

Andrea Bandini 1, Jordan R. Green 2, Lorne Zinman 3, Yana Yunusova 1; 1University Health Network, Canada; 2MGH Institute of Health Professions, USA; 3Sunnybrook Health Sciences Centre, Canada
Tue-P-5-2-3, Time: 16:00–18:00

Recent studies demonstrated that lip and jaw movements during speech may provide important information for the diagnosis of amyotrophic lateral sclerosis (ALS) and for understanding its progression. A thorough investigation of these movements is essential for the development of intelligent video- or optically-based facial tracking systems that could assist with early diagnosis and progress monitoring. In this paper, we investigated the potential for a novel and expanded set of kinematic features obtained from lips and jaw to classify articulatory data into three stages of bulbar disease progression (i.e., pre-symptomatic, early symptomatic, and late symptomatic). Feature selection methods (Relief-F and mRMR) and classification algorithm (SVM) were used for this purpose. Results showed that even with a limited number of kinematic features it was possible to obtain good classification accuracy (nearly 80%). Given the recent development of video-based markerless methods for tracking speech movements, these results provide strong rationale for supporting the development of portable and cheap systems for monitoring the orofacial function in ALS.
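
A generic version of the feature-selection-plus-SVM pipeline described above can be sketched as follows; note that Relief-F and mRMR are not available in scikit-learn, so a mutual-information filter stands in for them here, and all data, dimensions and parameter values are invented placeholders:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score

    # Toy stand-ins for kinematic features and three bulbar-stage labels.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((60, 20))   # 60 recordings x 20 kinematic features
    y = rng.integers(0, 3, size=60)     # 0 = pre-, 1 = early, 2 = late symptomatic

    # Filter-style feature selection followed by an SVM classifier.
    clf = make_pipeline(StandardScaler(),
                        SelectKBest(mutual_info_classif, k=8),
                        SVC(kernel="rbf", C=1.0))
    print(cross_val_score(clf, X, y, cv=5).mean())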

Notes

149

Zero Frequency Filter Based Analysis of Voice Disorders

Nagaraj Adiga 1, Vikram C.M. 1, Keerthi Pullela 2, S.R. Mahadeva Prasanna 1; 1IIT Guwahati, India; 2VIT University, India
Tue-P-5-2-4, Time: 16:00–18:00

Pitch period and amplitude perturbations are widely used parameters to discriminate normal and voice disorder speech. Instantaneous pitch period and amplitude of glottal vibrations derived directly from the speech waveform may not give an accurate estimation of jitter and shimmer. In this paper, the significance of epochs (glottal closure instants) and strength of excitation (SoE) derived from the zero-frequency filter (ZFF) are exploited to discriminate voice disorder and normal speech. The pitch epochs derived from the ZFF are used to compute the jitter, and the SoE derived around each epoch is used to compute the shimmer. The derived epoch-based features are analyzed on some of the voice disorders like Parkinson's disease, vocal fold paralysis, cyst, and gastroesophageal reflux disease. The significance of the proposed epoch-based features for discriminating normal and pathological voices is analyzed and compared with state-of-the-art methods using a support vector machine classifier. The results show that the epoch-based features performed significantly better than the other methods in both clean and noisy conditions.
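
The epoch-based jitter and shimmer described above reduce, in their basic local form, to cycle-to-cycle period and amplitude perturbation measures; a small illustrative sketch (standard textbook definitions, not the paper's exact procedure) is:

    import numpy as np

    def jitter_shimmer(epochs, amplitudes):
        """Local jitter and shimmer (in percent) from glottal closure
        instants (epochs, in seconds) and a per-epoch amplitude measure
        such as the strength of excitation."""
        periods = np.diff(np.asarray(epochs, dtype=float))
        amps = np.asarray(amplitudes, dtype=float)
        jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods) * 100
        shimmer = np.mean(np.abs(np.diff(amps))) / np.mean(amps) * 100
        return jitter, shimmer

    # Example: slightly perturbed 100 Hz voicing with mild amplitude variation.
    rng = np.random.default_rng(1)
    epochs = np.cumsum(0.010 + 0.0002 * rng.standard_normal(50))
    amps = 1.0 + 0.05 * rng.standard_normal(50)
    print(jitter_shimmer(epochs, amps))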

Hypernasality Severity Analysis in Cleft Lip and Palate Speech Using Vowel Space Area

Nikitha K. 1, Sishir Kalita 2, C.M. Vikram 2, M. Pushpavathi 1, S.R. Mahadeva Prasanna 2; 1AIISH, India; 2IIT Guwahati, India
Tue-P-5-2-5, Time: 16:00–18:00

Vowel space area (VSA) refers to a two-dimensional area, which is bounded by lines joining the F1 and F2 coordinates of vowels. In the speech of individuals with cleft lip and palate (CLP), the effect of hypernasality introduces pole-zero pairs in the speech spectrum, which shift the formants of a target sound. As a result, the vowel space in hypernasal speech gets affected. In this work, the vowel space area in normal, mild and moderate-severe hypernasality groups is analyzed and compared across the three groups. Also, the effect of hypernasality severity ratings across different phonetic contexts, i.e., /p/, /t/, and /k/, is studied. The results revealed that VSA is reduced in CLP children, compared to control participants, across sustained vowels and different phonetic contexts. Compared to normal, the reduction in the vowel space is greater for the moderate-severe hypernasality group than for the mild group. The CLP group exhibited a trend of having larger VSA for /p/, followed by /t/, and lastly by /k/. The statistical analysis revealed an overall significant difference among the three groups (p < 0.05).
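
Since VSA is simply the area of the polygon spanned by the corner vowels in the F1–F2 plane, it can be computed with the shoelace formula; a minimal sketch (the formant values below are invented for illustration):

    import numpy as np

    def vowel_space_area(formants):
        """Area (Hz^2) of the polygon whose vertices are the (F1, F2)
        coordinates of the corner vowels, via the shoelace formula."""
        pts = np.asarray(formants, dtype=float)
        x, y = pts[:, 0], pts[:, 1]
        return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

    # Illustrative mean formants (Hz) for /a/, /i/ and /u/ of one speaker.
    print(vowel_space_area([(850, 1220), (280, 2250), (310, 870)]))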

Automatic Prediction of Speech Evaluation Metrics for Dysarthric Speech

Imed Laaridh 1, Waad Ben Kheder 1, Corinne Fredouille 1, Christine Meunier 2; 1LIA (EA 4128), France; 2LPL (UMR 7309), France
Tue-P-5-2-6, Time: 16:00–18:00

During the last decades, automatic speech processing systems witnessed important progress and achieved remarkable reliability. As a result, such technologies have been exploited in new areas and applications including medical practice. In the disordered speech evaluation context, perceptual evaluation is still the most common method used in clinical practice for diagnosing and following the condition progression of patients, despite its well documented limits (such as subjectivity).

In this paper, we propose an automatic approach for the prediction of dysarthric speech evaluation metrics (intelligibility, severity, articulation impairment) based on the representation of the speech acoustics in the total variability subspace, following the i-vectors paradigm. The proposed approach, evaluated on 129 French dysarthric speakers from the DesPhoAPady and VML databases, is proven to be efficient for the modeling of patients' production and capable of detecting the evolution of speech quality. Also, low RMSE and high correlation measures are obtained between automatically predicted metrics and perceptual evaluations.

Apkinson — A Mobile Monitoring Solution for Parkinson's Disease

Philipp Klumpp 1, Thomas Janu 1, Tomás Arias-Vergara 2, J.C. Vásquez-Correa 2, Juan Rafael Orozco-Arroyave 1, Elmar Nöth 1; 1FAU Erlangen-Nürnberg, Germany; 2Universidad de Antioquia, Colombia
Tue-P-5-2-7, Time: 16:00–18:00

In this paper we want to present our work on a smartphone application which aims to provide a mobile monitoring solution for patients suffering from Parkinson's disease. By unobtrusively analyzing the speech signal during phone calls and with a dedicated speech test, we want to be able to determine the severity and the progression of Parkinson's disease for a patient much more frequently than it would be possible with regular check-ups.

The application consists of four major parts. There is a phone call detection which triggers the whole processing chain. Secondly, there is the phone call recording which has proven to be more challenging than expected. The signal analysis, another crucial component, is still in development for the phone call analysis. Additionally, the application collects several pieces of meta information about the calls to put the results into deeper context.

After describing how the speech signal is affected by Parkinson's disease, we sketch the overall application architecture and explain the four major parts of the current implementation in further detail. We then present the promising results achieved with the first version of a dedicated speech test. In the end, we outline how the project could receive further improvements in the future.

Dysprosody Differentiate Between Parkinson's Disease, Progressive Supranuclear Palsy, and Multiple System Atrophy

Jan Hlavnicka 1, Tereza Tykalová 1, Roman Cmejla 1, Jirí Klempír 2, Evžen Ružicka 2, Jan Rusz 1; 1CTU, Czech Republic; 2Charles University, Czech Republic
Tue-P-5-2-8, Time: 16:00–18:00

Parkinson's disease (PD), progressive supranuclear palsy (PSP), and multiple system atrophy (MSA) are distinctive neurodegenerative disorders, which manifest similar motor features. Their differentiation is crucial but difficult. Dysfunctional speech, especially dysprosody, is a common symptom accompanying PD, PSP, and MSA from early stages. We hypothesized that automated analysis of monologue could provide speech patterns distinguishing PD, PSP, and MSA. We analyzed speech recordings of 16 patients with PSP, 20 patients with MSA, and 23 patients with PD. Our findings revealed that deviant pause production differentiated between PSP, MSA, and PD. In addition, PSP showed greater deficits in speech respiration when compared to MSA and PD. Automated analysis of connected speech is easy to administer and could provide valuable information about underlying pathology for differentiation between PSP, MSA, and PD.

Interpretable Objective Assessment of Dysarthric Speech Based on Deep Neural Networks

Ming Tu, Visar Berisha, Julie Liss; Arizona State University, USA
Tue-P-5-2-9, Time: 16:00–18:00

Improved performance in speech applications using deep neural networks (DNNs) has come at the expense of reduced model interpretability. For consumer applications this is not a problem; however, for health applications, clinicians must be able to interpret why a predictive model made the decision that it did. In this paper, we propose an interpretable model for objective assessment of dysarthric speech for speech therapy applications based on DNNs. Our model aims to predict a general impression of the severity of the speech disorder; however, instead of directly generating a severity prediction from a high-dimensional input acoustic feature space, we add an intermediate interpretable layer that acts as a bottleneck feature extractor and constrains the solution space of the DNNs. During inference, the model provides an estimate of severity at the output of the network and a set of explanatory features from the intermediate layer of the network that explain the final decision. We evaluate the performance of the model on a dysarthric speech dataset and show that the proposed model provides an interpretable output that is highly correlated with the subjective evaluation of Speech-Language Pathologists (SLPs).

Deep Autoencoder Based Speech Features for Improved Dysarthric Speech Recognition

Bhavik Vachhani, Chitralekha Bhat, Biswajit Das, Sunil Kumar Kopparapu; TCS Innovation Labs Mumbai, India
Tue-P-5-2-10, Time: 16:00–18:00

Dysarthria is a motor speech disorder, resulting in mumbled, slurred or slow speech that is generally difficult to understand by both humans and machines. Traditional Automatic Speech Recognizers (ASR) perform poorly on dysarthric speech recognition tasks. In this paper, we propose the use of deep autoencoders to enhance the Mel Frequency Cepstral Coefficients (MFCC) based features in order to improve dysarthric speech recognition. Speech from healthy control speakers is used to train an autoencoder which is in turn used to obtain an improved feature representation for dysarthric speech. Additionally, we analyze the use of severity-based tempo adaptation followed by autoencoder-based speech feature enhancement. All evaluations were carried out on the Universal Access dysarthric speech corpus. An overall absolute improvement of 16% was achieved using tempo adaptation followed by autoencoder-based speech front-end representation for DNN-HMM based dysarthric speech recognition.
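
The general idea of training an autoencoder on healthy-speech features and then passing dysarthric features through it can be sketched as below; this is an illustrative PyTorch toy (layer sizes, data and training loop are invented), not the authors' architecture:

    import torch
    import torch.nn as nn

    class MFCCAutoencoder(nn.Module):
        """Small fully connected autoencoder over MFCC vectors (sizes invented)."""
        def __init__(self, dim=39, bottleneck=16):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                         nn.Linear(128, bottleneck), nn.ReLU())
            self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(),
                                         nn.Linear(128, dim))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = MFCCAutoencoder()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    healthy_mfcc = torch.randn(1024, 39)      # stand-in for healthy-speech features
    for _ in range(10):                       # a few toy training epochs
        optimizer.zero_grad()
        loss = loss_fn(model(healthy_mfcc), healthy_mfcc)
        loss.backward()
        optimizer.step()

    # Dysarthric MFCCs would then be passed through the trained autoencoder.
    with torch.no_grad():
        enhanced = model(torch.randn(200, 39))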

Prediction of Speech Delay from Acoustic Measurements

Jason Lilley, Madhavi Ratnagiri, H. Timothy Bunnell; Nemours Biomedical Research, USA
Tue-P-5-2-11, Time: 16:00–18:00

Speech delay is characterized by a difficulty with producing or perceiving the sounds of language in comparison to one's peers. It is a common problem in young children, occurring at a rate of about 5%. There are high rates of co-occurring problems with language, reading, learning, and social interactions, so intervention is needed for most. The Goldman-Fristoe Test of Articulation (GFTA) is a standardized tool for the assessment of consonant articulation in American English children. GFTA scores are normalized for age and can be used to help diagnose and assess speech delay. The GFTA was administered to 65 young children, a mixture of delayed children and controls. Their productions of the 39 GFTA words spoken in isolation were recorded and aligned to 3-state hidden Markov models. Seven measurements (state log likelihoods, state durations, and total duration) were extracted from each target segment in each word. From a subset of these measures, cross-validated statistical models were used to predict the children's GFTA scores and whether they were delayed. The measurements most useful for prediction came primarily from approximants /r, l/. An analysis of the predictors and discussion of the implications will be provided.

The Frequency Range of "The Ling Six Sounds" in Standard Chinese

Aijun Li 1, Hua Zhang 2, Wen Sun 2; 1Chinese Academy of Social Sciences, China; 2Beijing Tongren Hospital, China
Tue-P-5-2-12, Time: 16:00–18:00

"The Ling Six Sounds" are a range of speech sounds encompassing the speech frequencies that are widely used clinically to verify the effectiveness of hearing aid fitting in children. This study focused on the spectral features of the six sounds in Standard Chinese. We examined the frequency range of /m, u, a, i, ʂ, s/ as well as three consonants in syllables, i.e., /m(o)/, /ʂ(ʅ)/, and /s(ɿ)/. We presented the frequency distribution of these sounds. Based on this, we further proposed guidelines to improve "the Ling Six-Sound Test" regarding tones in Standard Chinese. We also suggested further studies in other dialects/languages spoken in China with regard to their phonological specifics.

Production of Sustained Vowels and Categorical Perception of Tones in Mandarin Among Cochlear-Implanted Children

Wentao Gu 1, Jiao Yin 1, James Mahshie 2; 1Nanjing Normal University, China; 2George Washington University, USA
Tue-P-5-2-13, Time: 16:00–18:00

This study investigated both production and perception of Mandarin speech, comparing two groups of 4-to-5-year-old children, a normal-hearing (NH) group and a cochlear-implanted (CI) hearing-impaired group; the perception ability of the CI group was tested under two conditions, with and without hearing aids. In the production study, the participants were asked to produce the sustained vowels /a/, /i/ and /u/, on which a set of acoustic parameters were then measured. In comparison to the NH group, the CI group showed a higher F0, a higher H1-H2, and a smaller acoustic space for vowels, demonstrating both phonatory and articulatory impairments. In the perception study, identification tests of two tone-pairs in Mandarin (T1-T2 and T1-T4) were conducted, using two sets of synthetic speech stimuli varying only along F0 continua. All groups/conditions showed categorical effects in perception. The CI group in the unimodal condition showed little difference from normal, while in the bimodal condition the categorical effect became weaker in identifying the T1-T4 continuum, with the category boundary more biased to T4. This suggests that bimodal CI children may need more fine-grained adjustments of hearing aids to take full advantage of the bimodal technology.

Tue-P-5-3 : Source Separation and Voice Activity Detection
Poster 3, 16:00–18:00, Tuesday, 22 Aug. 2017
Chair: Tom Bäckström

Audio Content Based Geotagging in Multimedia

Anurag Kumar, Benjamin Elizalde, Bhiksha Raj; Carnegie Mellon University, USA
Tue-P-5-3-1, Time: 16:00–18:00

In this paper we propose methods to extract geographically relevant information in a multimedia recording using its audio content. Our method is primarily based on the fact that the urban acoustic environment consists of a variety of sounds. Hence, location information can be inferred from the composition of sound events/classes present in the audio. More specifically, we adopt matrix factorization techniques to obtain the semantic content of a recording in terms of different sound classes. We use semi-NMF to perform audio semantic content analysis using MFCCs. This semantic information is then combined to identify the location of the recording. We show that this semantic content based geotagging can perform significantly better than state-of-the-art methods.
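
Semi-NMF is typically chosen in settings like this because MFCCs take both signs, so only the activation matrix is constrained to be nonnegative. The sketch below uses the standard multiplicative updates of Ding et al. (2010); it is an illustration under those assumptions, not the authors' implementation, and the toy data are random:

    import numpy as np

    def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
        """Semi-NMF (Ding et al., 2010): X ~ F @ G.T with F unconstrained and
        G >= 0, so it accepts mixed-sign features such as MFCCs; the rows of G
        act as nonnegative activations of k latent sound classes."""
        rng = np.random.default_rng(seed)
        d, n = X.shape
        G = np.abs(rng.standard_normal((n, k)))
        pos = lambda A: (np.abs(A) + A) / 2
        neg = lambda A: (np.abs(A) - A) / 2
        for _ in range(n_iter):
            F = X @ G @ np.linalg.pinv(G.T @ G)
            XtF, FtF = X.T @ F, F.T @ F
            G *= np.sqrt((pos(XtF) + G @ neg(FtF)) /
                         (neg(XtF) + G @ pos(FtF) + eps))
        return F, G

    # Toy example: 20-dimensional MFCC-like frames, 500 frames, 8 latent classes.
    X = np.random.default_rng(1).standard_normal((20, 500))
    F, G = semi_nmf(X, k=8)
    print(F.shape, G.shape)   # (20, 8) (500, 8)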

Time Delay Histogram Based Speech Source Separation Using a Planar Array

Zhaoqiong Huang, Zhanzhong Cao, Dongwen Ying, Jielin Pan, Yonghong Yan; Chinese Academy of Sciences, China
Tue-P-5-3-2, Time: 16:00–18:00

Bin-wise time delay is a valuable clue to form the time-frequency (TF) mask for speech source separation on a two-microphone array. On widely spaced microphones, however, the time delay estimation suffers from spatial aliasing. Although the histogram is a simple and effective method to tackle the problem of spatial aliasing, it cannot be directly applied on planar arrays. This paper proposes a histogram-based method to separate multiple speech sources on an arbitrary-size planar array, where the spatial aliasing is resisted. A time delay histogram is firstly utilized to estimate the delays of multiple sources on each microphone pair. The estimated delays on all pairs are then incorporated into an azimuth histogram by means of the pairwise combination test. From the azimuth histogram, the directions of arrival (DOAs) and the number of sources are obtained. Eventually, the TF mask is determined based on the estimated DOAs. Experiments were conducted under various conditions, confirming the superiority of the proposed method.

Excitation Source Features for Improving the Detection of Vowel Onset and Offset Points in a Speech Sequence

Gayadhar Pradhan, Avinash Kumar, S. Shahnawazuddin; NIT Patna, India
Tue-P-5-3-3, Time: 16:00–18:00

The task of detecting the vowel regions in a given speech signal is a challenging problem. Over the years, several works on accurate detection of vowel regions and the corresponding vowel onset points (VOPs) and vowel end points (VEPs) have been reported. A novel front-end feature extraction technique exploiting the temporal and spectral characteristics of the excitation source information in the speech signal is proposed in this paper to improve the detection of vowel regions, VOPs and VEPs. To do the same, a three-class classifier (vowel, non-vowel and silence) is developed on the TIMIT database using the proposed features as well as mel-frequency cepstral coefficients (MFCC). Statistical modeling based on a deep neural network has been employed for learning the parameters. Using the developed three-class classifier, a given speech sample is then force-aligned against the trained acoustic models to detect the vowel regions. The use of the proposed features results in detection of vowel regions quite different from those obtained through the MFCC. Exploiting the differences in the evidences obtained by using the two kinds of features, a technique to combine the evidences is also proposed in order to get a better estimate of the VOPs and VEPs.

A Contrast Function and Algorithm for Blind Separation of Audio Signals

Wei Gao, Roberto Togneri, Victor Sreeram; University of Western Australia, Australia
Tue-P-5-3-4, Time: 16:00–18:00

This paper presents a contrast function and associated algorithm for blind separation of audio signals. The contrast function is based on second-order statistics to minimize the ratio between the product of the diagonal entries and the determinant of the covariance matrix. The contrast function can be minimized by a batch and adaptive gradient descent method to formulate a blind source separation algorithm. Experimental results on realistic audio signals show that the proposed algorithm yielded comparable separation performance with benchmark algorithms for speech signals, and outperformed benchmark algorithms for music signals.
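
Written out in standard notation (a reconstruction from the abstract's wording, not a formula quoted from the paper), the contrast for a demixing matrix W applied to the observations x is

    \[
    J(\mathbf{W}) \;=\; \frac{\prod_{i} [\mathbf{R}_{\mathbf{y}}]_{ii}}{\det \mathbf{R}_{\mathbf{y}}},
    \qquad
    \mathbf{R}_{\mathbf{y}} = \mathrm{E}\!\left[\mathbf{y}\mathbf{y}^{\mathsf T}\right],
    \qquad
    \mathbf{y} = \mathbf{W}\mathbf{x},
    \]

where Hadamard's inequality guarantees J(W) >= 1, with equality exactly when the output covariance is diagonal, i.e. when the separated signals are uncorrelated; gradient descent on J (or on log J) therefore drives the outputs toward decorrelation.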

Weighted Spatial Covariance Matrix Estimation for MUSIC Based TDOA Estimation of Speech Source

Chenglin Xu 1, Xiong Xiao 2, Sining Sun 3, Wei Rao 2, Eng Siong Chng 1, Haizhou Li 2; 1NTU, Singapore; 2TL@NTU, Singapore; 3Northwestern Polytechnical University, China
Tue-P-5-3-5, Time: 16:00–18:00

We study the estimation of time difference of arrival (TDOA) under noisy and reverberant conditions. Conventional TDOA estimation methods such as MUltiple SIgnal Classification (MUSIC) are not robust to noise and reverberation due to the distortion in the spatial covariance matrix (SCM). To address this issue, this paper proposes a robust SCM estimation method, called weighted SCM (WSCM). In the WSCM estimation, each time-frequency (TF) bin of the input signal is weighted by a TF mask which is 0 for non-speech TF bins and 1 for speech TF bins in the ideal case. In practice, the TF mask takes values between 0 and 1 that are predicted by a long short-term memory (LSTM) network trained on a large amount of simulated noisy and reverberant data. The use of mask weights significantly reduces the contribution of low-SNR TF bins to the SCM estimation, hence improving the robustness of MUSIC. Experimental results on both simulated and real data show that we have significantly improved the robustness of MUSIC by using the weighted SCM.
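
The mask-weighted covariance itself is a small computation: for each frequency, the outer products of the multichannel STFT vectors are averaged with the mask values as weights. A sketch (the array layout and variable names are assumptions made for illustration):

    import numpy as np

    def weighted_scm(stft, mask, eps=1e-8):
        """Mask-weighted spatial covariance per frequency.
        stft: complex array (channels, frames, freqs); mask: (frames, freqs)
        with values in [0, 1]. Returns an array (freqs, channels, channels)."""
        C, T, F = stft.shape
        R = np.zeros((F, C, C), dtype=complex)
        for f in range(F):
            X = stft[:, :, f]                       # (C, T)
            w = mask[:, f]                          # (T,)
            R[f] = (X * w) @ X.conj().T / (w.sum() + eps)
        return R

    # Toy example: 4 microphones, 100 frames, 257 frequency bins, random mask.
    rng = np.random.default_rng(0)
    stft = rng.standard_normal((4, 100, 257)) + 1j * rng.standard_normal((4, 100, 257))
    mask = rng.random((100, 257))
    print(weighted_scm(stft, mask).shape)           # (257, 4, 4)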

Speaker Direction-of-Arrival Estimation Based on Frequency-Independent Beampattern

Feng Guo 1, Yuhang Cao 2, Zheng Liu 3, Jiaen Liang 2, Baoqing Li 1, Xiaobing Yuan 1; 1Chinese Academy of Sciences, China; 2Beijing Unisound Information Technology, China; 3Huawei Technologies, China
Tue-P-5-3-6, Time: 16:00–18:00

The differential microphone array (DMA) has become more and more popular recently. In this paper, we derive the relationship between the direction-of-arrival (DoA) and the DMA's frequency-independent beampatterns. The derivation demonstrates that the DoA can be obtained by solving a trigonometric polynomial. Taking the dipoles as a special case of this relationship, we propose three methods to estimate the DoA based on the dipoles. However, we find these methods are vulnerable to the axial directions under reverberant environments. Fortunately, they can complement each other owing to their robustness to different angles. Hence, to increase the robustness to reverberation, we propose another new approach by combining the advantages of these three dipole-based methods for speaker DoA estimation. Both simulations and experiments show that the proposed method not only outperforms the traditional methods for small aperture arrays but also is much more computationally efficient by avoiding the spatial spectrum search.

A Mask Estimation Method Integrating Data Field Model for Speech Enhancement

Xianyun Wang 1, Changchun Bao 1, Feng Bao 2; 1Beijing University of Technology, China; 2University of Auckland, New Zealand
Tue-P-5-3-7, Time: 16:00–18:00

In most approaches based on computational auditory scene analysis (CASA), the ideal binary mask (IBM) is often used for noise reduction. However, it is almost impossible to obtain the IBM result. The error in IBM estimation may greatly violate the smooth evolution nature of speech because of the energy absence in many speech-dominated time-frequency (T-F) units. To reduce the error, the ideal ratio mask (IRM), via modeling the spatial dependencies of the speech spectrum, is used as an optimal target mask because the predictive ratio mask is less sensitive to the error than the predictive binary mask. In this paper, we introduce a data field (DF) to model the spatial dependencies of the cochleagram for obtaining the ratio mask. Firstly, initial T-F units of noise and speech are obtained from noisy speech. Then we can calculate the forms of the potentials of noise and speech. Subsequently, their optimal potentials, which reflect their respective distributions of the potential field, are obtained by the optimal influence factors of speech and noise. Finally, we exploit the potentials of speech and noise to obtain the ratio mask. Experimental results show that the proposed method can obtain a better performance than the reference methods in speech quality.

Improved End-of-Query Detection for Streaming Speech Recognition

Matt Shannon, Gabor Simko, Shuo-Yiin Chang, Carolina Parada; Google, USA
Tue-P-5-3-8, Time: 16:00–18:00

In many streaming speech recognition applications such as voice search it is important to determine quickly and accurately when the user has finished speaking their query. A conventional approach to this task is to declare end-of-query whenever a fixed interval of silence is detected by a voice activity detector (VAD) trained to classify each frame as speech or silence. However silence detection and end-of-query detection are fundamentally different tasks, and the criterion used during VAD training may not be optimal. In particular the conventional approach ignores potential acoustic cues such as filler sounds and past speaking rate which may indicate whether a given pause is temporary or query-final. In this paper we present a simple modification to make the conventional VAD training criterion more closely related to end-of-query detection. A unidirectional long short-term memory architecture allows the system to remember past acoustic events, and the training criterion incentivizes the system to learn to use any acoustic cues relevant to predicting future user intent. We show experimentally that this approach improves latency at a given accuracy by around 100 ms for end-of-query detection for voice search.

Using Approximated Auditory Roughness as a Pre-Filtering Feature for Human Screaming and Affective Speech AED

Di He 1, Zuofu Cheng 2, Mark Hasegawa-Johnson 1, Deming Chen 1; 1University of Illinois at Urbana-Champaign, USA; 2Inspirit IoT, USA
Tue-P-5-3-9, Time: 16:00–18:00

Detecting human screaming, shouting, and other verbal manifestations of fear and anger is of great interest to security Audio Event Detection (AED) systems. The Internet of Things (IoT) approach allows wide-covering, powerful AED systems to be distributed across the Internet. But a good feature to pre-filter the audio is critical to these systems. This work evaluates the potential of detecting screaming and affective speech using Auditory Roughness and proposes a very light-weight approximation method. Our approximation uses a similar amount of Multiple Add Accumulate (MAA) compared to short-term energy (STE), and at least 10× less MAA than MFCC. We evaluated the performance of our approximated roughness on the Mandarin Affective Speech corpus and a subset of the YouTube AudioSet for screaming against other low-complexity features. We show that our approximated roughness returns higher accuracy.

Improving Source Separation via Multi-Speaker Representations

Jeroen Zegers, Hugo Van hamme; Katholieke Universiteit Leuven, Belgium
Tue-P-5-3-10, Time: 16:00–18:00

Lately there have been novel developments in deep learning towards solving the cocktail party problem. Initial results are very promising and allow for more research in the domain. One technique that has not yet been explored in the neural network approach to this task is speaker adaptation. Intuitively, information on the speakers that we are trying to separate seems fundamentally important for the speaker separation task. However, retrieving this speaker information is challenging since the speaker identities are not known a priori and multiple speakers are simultaneously active. There is thus some sort of chicken-and-egg problem. To tackle this, source signals and i-vectors are estimated alternately. We show that blind multi-speaker adaptation improves the results of the network and that (in our case) the network is not capable of adequately retrieving this useful speaker information itself.

Multiple Sound Source Counting and Localization Based on Spatial Principal Eigenvector

Bing Yang, Hong Liu, Cheng Pang; Peking University, China
Tue-P-5-3-11, Time: 16:00–18:00

Multiple sound source localization remains a challenging issue due to the interaction between sources. Although traditional approaches can locate multiple sources effectively, most of them require the number of sound sources as a priori knowledge. However, the number of sound sources is generally unknown in practical applications. To overcome this problem, a spatial principal eigenvector based approach is proposed to estimate the number and the directions of arrival (DOAs) of multiple speech sources. Firstly, a time-frequency (TF) bin weighting scheme is utilized to select the TF bins dominated by a single source. Then, for these selected bins, the spatial principal eigenvectors are extracted to construct a contribution function which is used to simultaneously estimate the number of sources and the corresponding coarse DOAs. Finally, the coarse DOA estimations are refined by iteratively optimizing the assignment of selected TF bins to each source. Experimental results validate that the proposed approach yields favorable performance for multiple sound source counting and localization in environments with different levels of noise and reverberation.

Subband Selection for Binaural Speech Source Localization

Girija Ramesan Karthik, Prasanta Kumar Ghosh; Indian Institute of Science, India
Tue-P-5-3-12, Time: 16:00–18:00

We consider the task of speech source localization using binaural cues, namely interaural time and level difference (ITD & ILD). A typical approach is to process binaural speech using gammatone filters and calculate frame-level ITD and ILD in each subband. The ITD, ILD and their combination (ITLD) in each subband are statistically modelled using Gaussian mixture models for every direction during training. Given a binaural test speech, the source is localized using the maximum likelihood criterion assuming that the binaural cues in each subband are independent. We, in this work, investigate the robustness of each subband for localization and compare their performance against the full-band scheme with 32 gammatone filters. We propose a subband selection procedure using the training data where subbands are rank-ordered based on their localization performance. Experiments on Subject 003 from the CIPIC database reveal that, for high SNRs, the ITD and ITLD of just one subband centered at 296 Hz are sufficient to yield localization accuracy identical to that of the full-band scheme with a test speech of duration 1 sec. At low SNRs, in the case of ITD, the selected subbands are found to perform better than the full-band scheme.

Unmixing Convolutive Mixtures by Exploiting Amplitude Co-Modulation: Methods and Evaluation on Mandarin Speech Recordings

Bo-Rui Chen, Huang-Yi Lee, Yi-Wen Liu; National Tsing Hua University, Taiwan
Tue-P-5-3-13, Time: 16:00–18:00

This paper presents and evaluates two frequency-domain methods for multi-channel sound source separation. The sources are assumed to couple to the microphones with unknown room responses. Independent component analysis (ICA) is applied in the frequency domain to obtain maximally independent amplitude envelopes (AEs) at every frequency. Due to the nature of ICA, the AEs across frequencies need to be de-permuted. To this end, we seek to assign AEs to the same source solely based on the correlation in their magnitude variation against time. The resulting time-varying spectra are inverse Fourier transformed to synthesize separated signals. Objective evaluation showed that both methods achieve a signal-to-interference ratio (SIR) that is comparable to Mazur et al. (2013). In addition, we created spoken Mandarin materials and recruited age-matched subjects to perform word-by-word transcription. Results showed that, first, speech intelligibility significantly improved after unmixing. Secondly, while both methods achieved similar SIR, the subjects preferred to listen to the results that were post-processed to ensure a speech-like spectral shape; the mean opinion scores were 2.9 vs. 4.3 (out of 5) between the two methods. The present results may provide suggestions regarding deployment of the correlation-based source separation algorithms into devices with limited computational resources.
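
The de-permutation step, grouping per-frequency ICA outputs by the correlation of their amplitude envelopes, can be illustrated with a simple greedy alignment; the routine below is a generic sketch of that idea (the array shapes and the running-reference heuristic are assumptions, not the paper's algorithm):

    import numpy as np

    def align_permutations(aes):
        """Greedy de-permutation of per-frequency amplitude envelopes.
        aes: (freqs, sources, frames) envelopes from frequency-domain ICA.
        Each frequency's components are re-ordered to maximise correlation
        with a running reference, so that index k refers to the same source
        at every frequency."""
        F, K, T = aes.shape
        aligned = aes.copy()
        reference = aligned[0].astype(float)
        for f in range(1, F):
            # Correlation between reference envelopes and this bin's envelopes.
            corr = np.corrcoef(reference, aligned[f])[:K, K:]
            order, used = [-1] * K, set()
            for k in np.argsort(-corr.max(axis=1)):     # most confident first
                for j in np.argsort(-corr[k]):
                    if j not in used:
                        order[k] = j
                        used.add(j)
                        break
            aligned[f] = aligned[f][order]
            reference = 0.9 * reference + 0.1 * aligned[f]
        return aligned

    # Toy example: 257 bins, 2 sources, 400 frames of random envelopes.
    aes = np.abs(np.random.default_rng(0).standard_normal((257, 2, 400)))
    print(align_permutations(aes).shape)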

Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection

Fei Tao, Carlos Busso; University of Texas at Dallas, USA
Tue-P-5-3-14, Time: 16:00–18:00

Voice activity detection (VAD) is an important preprocessing step in speech-based systems, especially for emerging hands-free intelligent assistants. Conventional VAD systems relying on audio-only features are normally impaired by noise in the environment. An alternative approach to address this problem is audiovisual VAD (AV-VAD) systems. Modeling timing dependencies between acoustic and visual features is a challenge in AV-VAD. This study proposes a bimodal recurrent neural network (RNN) which combines audiovisual features in a principled, unified framework, capturing the timing dependency within modalities and across modalities. Each modality is modeled with separate bidirectional long short-term memory (BLSTM) networks. The output layers are used as input of another BLSTM network. The experimental evaluation considers a large audiovisual corpus with clean and noisy recordings to assess the robustness of the approach. The proposed approach outperforms audio-only VAD by 7.9% (absolute) under clean/ideal conditions (i.e., high definition (HD) camera, close-talk microphone). The proposed solution outperforms the audio-only VAD system by 18.5% (absolute) when the conditions are more challenging (i.e., camera and microphone from a tablet with noise in the environment). The proposed approach shows the best performance and robustness across a variety of conditions, demonstrating its potential for real-world applications.

Domain-Specific Utterance End-Point Detection for Speech Recognition

Roland Maas, Ariya Rastrow, Kyle Goehner, Gautam Tiwari, Shaun Joseph, Björn Hoffmeister; Amazon.com, USA
Tue-P-5-3-15, Time: 16:00–18:00

The task of automatically detecting the end of a device-directed user request is particularly challenging in the case of switching between short command and long free-form utterances. While low-latency end-pointing configurations typically lead to good user experiences in the case of short requests, such as "play music", they can be too aggressive in domains with longer free-form queries, where users tend to pause noticeably between words and hence are easily cut off prematurely. We previously proposed an approach for accurate end-pointing by continuously estimating pause duration features over all active recognition hypotheses. In this paper, we study the behavior of these pause duration features and infer domain-dependent parametrizations. We furthermore propose to adapt the end-pointer aggressiveness on-the-fly by comparing the Viterbi scores of active short command vs. long free-form decoding hypotheses. The experimental evaluation evidences an 18% relative reduction in word error rate on free-form requests while maintaining low latency on short queries.

Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments

Vinay Kothapally, John H.L. Hansen; University of Texas at Dallas, USA
Tue-P-5-3-16, Time: 16:00–18:00

It is well known that in reverberant environments, the human auditory system has the ability to pre-process reverberant signals to compensate for reflections and obtain effective cues for improved recognition. In this study, we propose such a preprocessing technique for combined detection and enhancement of speech using a single microphone in reverberant environments for distant speech applications. The proposed system employs a framework where the target speech is synthesized using continuous auditory masks estimated from sub-band signals. Linear gammatone analysis/synthesis filter banks are used as an auditory model for sub-band processing. The performance of the proposed system is evaluated on the UT-DistantReverb corpus, which consists of speech recorded in a reverberant racquetball court (T60 ∼ 9000 ms). The current system shows an average improvement of 15% STNR over an existing single-channel dereverberation algorithm and a 17% improvement in detecting speech frames over the G729B, SOHN & Combo-SAD unsupervised speech activity detectors in actual reverberant and noisy environments.

Tue-P-5-4 : Speech Enhancement
Poster 4, 16:00–18:00, Tuesday, 22 Aug. 2017
Chair: Timo Gerkmann

A Post-Filtering Approach Based on Locally Linear Embedding Difference Compensation for Speech Enhancement

Yi-Chiao Wu, Hsin-Te Hwang, Syu-Siang Wang, Chin-Cheng Hsu, Yu Tsao, Hsin-Min Wang; Academia Sinica, Taiwan
Tue-P-5-4-1, Time: 16:00–18:00

This paper presents a novel difference compensation post-filtering approach based on the locally linear embedding (LLE) algorithm for speech enhancement (SE). The main goal of the proposed post-filtering approach is to further suppress residual noises in SE-processed signals to attain improved speech quality and intelligibility. The proposed system can be divided into offline and online stages. In the offline stage, we prepare paired differences: the estimated difference of {SE-processed speech; noisy speech} and the ground-truth difference of {clean speech; noisy speech}. In the online stage, on the basis of the estimated difference of a test utterance, we first predict the corresponding ground-truth difference based on the LLE algorithm, and then compensate the noisy speech with the predicted difference. In this study, we integrate a deep denoising autoencoder (DDAE) SE method with the proposed LLE-based difference compensation post-filtering approach. The experimental results reveal that the proposed post-filtering approach obviously enhanced the speech quality and intelligibility of the DDAE-based SE-processed speech in different noise types and signal-to-noise-ratio levels.

Multi-Target Ensemble Learning for Monaural Speech Separation

Hui Zhang, Xueliang Zhang, Guanglai Gao; Inner Mongolia University, China
Tue-P-5-4-2, Time: 16:00–18:00

Speech separation can be formulated as a supervised learning problem where a machine is trained to cast the acoustic features of the noisy speech to a time-frequency mask, or the spectrum of the clean speech. These two categories of speech separation methods can be generally referred to as the masking-based and the mapping-based methods, but none of them can perfectly estimate the clean speech, since any target can only describe a part of the characteristics of the speech. However, the estimated masks and speech spectrum can, sometimes, be complementary as the speech is described from different perspectives. In this paper, by adopting an ensemble framework, a multi-target deep neural network (DNN) based method is proposed, which combines the masking-based and the mapping-based strategies, and the DNN is trained to jointly estimate the time-frequency masks and the clean spectrum. We show that, as expected, the mask and speech spectrum based targets yield partly complementary estimates, and the separation performance can be improved by merging these estimates. Furthermore, a merging model trained jointly with the multi-target DNN is developed. Experimental results indicate that the proposed multi-target DNN based method outperforms the DNN based algorithm which optimizes a single target.

Improved Example-Based Speech Enhancement by Using Deep Neural Network Acoustic Model for Noise Robust Example Search

Atsunori Ogawa, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani; NTT, Japan
Tue-P-5-4-3, Time: 16:00–18:00

Example-based speech enhancement is a promising single-channel approach for coping with highly nonstationary noise. Given a noisy speech input, it first searches in a noisy speech corpus for the noisy speech examples that best match the input. Then, it concatenates the clean speech examples that are paired with the matched noisy examples to obtain an estimate of the underlying clean speech component in the input. The quality of the enhanced speech depends on how accurately the example search can be performed given a noisy speech input. The example search is conventionally performed using a Gaussian mixture model (GMM) with mel-frequency cepstral coefficient features (MFCCs). To improve the noise robustness of the GMM-based example search, instead of using noise-sensitive MFCCs, we have proposed using bottleneck features (BNFs), which are extracted from a deep neural network-based acoustic model (DNN-AM) built for automatic speech recognition. In this paper, instead of using a GMM with noise-robust BNFs, we propose the direct use of a DNN-AM in the example search to further improve its noise robustness. Experimental results on the Aurora4 corpus show that the DNN-AM-based example search steadily improves the enhanced speech quality compared with the GMM-based example search using BNFs.

Subjective Intelligibility of Deep Neural Network-Based Speech Enhancement

Femke B. Gelderblom, Tron V. Tronstad, Erlend Magnus Viggen; SINTEF, Norway
Tue-P-5-4-4, Time: 16:00–18:00

Recent literature indicates increasing interest in deep neural networks for use in speech enhancement systems. Currently, these systems are mostly evaluated through objective measures of speech quality and/or intelligibility. Subjective intelligibility evaluations of these systems have so far not been reported. In this paper we report the results of a speech recognition test with 15 participants, where the participants were asked to pick out words in background noise before and after enhancement using a common deep neural network approach. We found that, although the objective measure STOI predicts that intelligibility should improve or at the very least stay the same, the speech recognition threshold, which is a measure of intelligibility, deteriorated by 4 dB. These results indicate that STOI is not a good predictor for the subjective intelligibility of deep neural network-based speech enhancement systems. We also found that the postprocessing technique of global variance normalisation does not significantly affect subjective intelligibility.

Real-Time Modulation Enhancement of Temporal Envelopes for Increasing Speech Intelligibility

Maria Koutsogiannaki 1, Holly Francois 2, Kihyun Choo 3, Eunmi Oh 3; 1BCBL, Spain; 2Samsung Electronics, UK; 3Samsung Electronics, Korea
Tue-P-5-4-5, Time: 16:00–18:00

In this paper, a novel approach is introduced for performing real-time speech modulation enhancement to increase speech intelligibility in noise. The proposed modulation enhancement technique operates independently in the frequency and time domains. In the frequency domain, a compression function is used to perform energy reallocation within a frame. This compression function contains novel scaling operations to ensure speech quality. In the time domain, a mathematical equation is introduced to reallocate energy from the louder to the quieter parts of the speech. This proposed mathematical equation ensures that the long-term energy of the speech is preserved independently of the amount of compression, hence gaining full control of the time-energy reallocation in real time. Evaluations on intelligibility and quality show that the suggested approach increases the intelligibility of speech while maintaining the overall energy and quality of the speech signal.

On the Influence of Modifying Magnitude and Phase Spectrum to Enhance Noisy Speech Signals

Hans-Günter Hirsch, Michael Gref; Hochschule Niederrhein, Germany
Tue-P-5-4-6, Time: 16:00–18:00

Neural networks have proven their ability to be usefully applied as a component of a speech enhancement system. This is based on the known ability of neural nets to map regions inside a feature space to other regions. It can be used to map noisy magnitude spectra to clean spectra. This way the net can serve as a substitute for adaptive filtering in the spectral domain. We set up such a system and compared its performance against a known adaptive filtering approach in terms of speech quality and in terms of recognition rate. It is still a not fully answered question how far the speech quality can be enhanced by modifying not only the magnitude but also the spectral phase, and how this phase modification could be realized. Before trying to use a neural network for a possible modification of the phase spectrum, we ran a set of oracle experiments to find out how far the quality can be improved by modifying the magnitude and/or the phase spectrum in voiced segments. It turns out that the simultaneous modification of magnitude and phase spectrum has the potential for a considerable improvement of the speech quality in comparison to modifying the magnitude or the phase only.

MixMax Approximation as a Super-Gaussian Log-Spectral Amplitude Estimator for Speech Enhancement

Robert Rehr, Timo Gerkmann; Technische Universität Hamburg-Harburg, Germany
Tue-P-5-4-7, Time: 16:00–18:00

For single-channel speech enhancement, most commonly, the noisy observation is described as the sum of the clean speech signal and the noise signal. For machine learning based enhancement schemes where speech and noise are modeled in the log-spectral domain, however, the log-spectrum of the noisy observation can be described as the maximum of the speech and noise log-spectra to simplify statistical inference. This approximation is referred to as the MixMax model or log-max approximation. In this paper, we show how this approximation can be used in combination with non-trained, blind speech and noise power estimators derived in the spectral domain. Our findings allow us to interpret the MixMax based clean speech estimator as a super-Gaussian log-spectral amplitude estimator. This MixMax based estimator is embedded in a pre-trained speech enhancement scheme and compared to a log-spectral amplitude estimator based on an additive mixing model. Instrumental measures indicate that the MixMax based estimator causes less musical tones while it yields virtually the same quality for the enhanced speech signal.
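
In the notation commonly used for this approximation (not quoted from the paper), with y, x and n denoting the log-spectra of the noisy observation, the speech and the noise in one time-frequency bin, the additive mixing model and its log-max simplification read

    \[
    \exp(y) \;=\; \exp(x) + \exp(n)
    \qquad\Longrightarrow\qquad
    y \;=\; \log\!\left(e^{x} + e^{n}\right) \;\approx\; \max(x,\, n),
    \]

which is accurate whenever one of the two components clearly dominates the bin, and this is what makes statistical inference in the log-spectral domain tractable.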

Binary Mask Estimation Strategies for Constrained Imputation-Based Speech Enhancement

Ricard Marxer, Jon Barker; University of Sheffield, UK
Tue-P-5-4-8, Time: 16:00–18:00

In recent years, speech enhancement by analysis-resynthesis has emerged as an alternative to conventional noise filtering approaches. Analysis-resynthesis replaces noisy speech with a signal that has been reconstructed from a clean speech model. It can deliver high-quality signals with no residual noise, but at the expense of losing information from the original signal that is not well-represented by the model. A recent compromise solution, called constrained resynthesis, solves this problem by only resynthesising spectro-temporal regions that are estimated to be masked by noise (conditioned on the evidence in the unmasked regions). In this paper we first extend the approach by: i) introducing multi-condition training and a deep discriminative model for the analysis stage; ii) introducing an improved resynthesis model that captures within-state cross-frequency dependencies. We then extend the previous stationary-noise evaluation by using real domestic audio noise from the CHiME-2 evaluation. We compare various mask estimation strategies while varying the degree of constraint by tuning the threshold for reliable speech detection. PESQ and log-spectral distance measures show that although mask estimation remains a challenge, it is only necessary to estimate a few reliable signal regions in order to achieve performance close to that achieved with an optimal oracle mask.

A Fully Convolutional Neural Network for Speech Enhancement

Se Rim Park 1, Jin Won Lee 2; 1Carnegie Mellon University, USA; 2Qualcomm, USA
Tue-P-5-4-9, Time: 16:00–18:00

The presence of babble noise greatly degrades the intelligibility of human speech. However, removing the babble without creating artifacts in human speech is a challenging task in a low SNR environment. Here, we sought to solve the problem by finding a ‘mapping’ between noisy speech spectra and clean speech spectra via supervised learning. Specifically, we propose using fully convolutional neural networks, which have far fewer parameters than fully connected networks. The proposed network, Redundant Convolutional Encoder Decoder (R-CED), demonstrates that a convolutional network can be 12 times smaller than a recurrent network and yet achieve better performance, which shows its applicability for an embedded system.
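
The abstract does not give the exact R-CED topology; the sketch below (PyTorch, with hypothetical layer sizes) only illustrates the general idea of a fully convolutional encoder-decoder that maps noisy magnitude spectra to clean ones without any fully connected or recurrent layers:

    import torch
    import torch.nn as nn

    # Hypothetical fully convolutional encoder-decoder over 257 frequency bins
    model = nn.Sequential(
        nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
        nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
        nn.Conv1d(32, 16, kernel_size=5, padding=2), nn.ReLU(),
        nn.Conv1d(16, 1, kernel_size=9, padding=4),
    )

    noisy = torch.randn(8, 1, 257)        # a batch of 8 noisy magnitude spectra
    clean_estimate = model(noisy)         # same shape as the input
    print(sum(p.numel() for p in model.parameters()))  # far fewer weights than an RNN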

Speech Enhancement Using Non-Negative Spectrogram Models with Mel-Generalized Cepstral Regularization

Li Li 1, Hirokazu Kameoka 2, Tomoki Toda 3, Shoji Makino 1; 1University of Tsukuba, Japan; 2NTT, Japan; 3Nagoya University, Japan
Tue-P-5-4-10, Time: 16:00–18:00

Spectral domain speech enhancement algorithms based on non-negative spectrogram models such as non-negative matrix factorization (NMF) and non-negative matrix factor deconvolution are powerful in terms of signal recovery accuracy; however, they do not directly lead to an enhancement in the feature domain (e.g., the cepstral domain) or in terms of perceived quality. We have previously proposed a method that makes it possible to enhance speech in the spectral and cepstral domains simultaneously. Although this method was shown to be effective, the devised algorithm was computationally demanding. This paper proposes yet another formulation that allows for a fast implementation by replacing the regularization term with a divergence measure between the NMF model and the mel-generalized cepstral (MGC) representation of the target spectrum. Since the MGC is an auditory-motivated representation of an audio signal widely used in parametric speech synthesis, we also expect the proposed method to have an effect in enhancing the perceived quality. Experimental results revealed the effectiveness of the proposed method in terms of both the signal-to-distortion ratio and the cepstral distance.

A Comparison of Perceptually Motivated Loss Functions for Binary Mask Estimation in Speech Separation

Danny Websdale, Ben Milner; University of East Anglia, UK
Tue-P-5-4-11, Time: 16:00–18:00

This work proposes and compares perceptually motivated loss functions for deep learning based binary mask estimation for speech separation. Previous loss functions have focused on maximising classification accuracy of mask estimation, but we now propose loss functions that aim to maximise the hit minus false-alarm (HIT-FA) rate, which is known to correlate more closely with speech intelligibility. The baseline loss function is binary cross-entropy (CE), a standard loss function used in binary mask estimation, which maximises classification accuracy. We first propose a loss function that maximises the HIT-FA rate instead of classification accuracy. We then propose a second loss function that is a hybrid between CE and HIT-FA, providing a balance between classification accuracy and HIT-FA rate. Evaluations of the perceptually motivated loss functions with the GRID database show improvements to HIT-FA rate and ESTOI across babble and factory noises. Further tests then explore application of the perceptually motivated loss functions to a larger vocabulary dataset.
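
For reference, the HIT-FA criterion targeted by these loss functions can be computed from an ideal and an estimated binary mask as in the minimal sketch below (hypothetical toy masks):

    import numpy as np

    def hit_minus_fa(ideal_mask, estimated_mask):
        ideal = ideal_mask.astype(bool)
        est = estimated_mask.astype(bool)
        hit = (est & ideal).sum() / max(ideal.sum(), 1)      # speech units correctly kept
        fa = (est & ~ideal).sum() / max((~ideal).sum(), 1)   # noise units wrongly kept
        return hit - fa

    ideal = np.array([[1, 1, 0, 0], [1, 0, 0, 1]])
    estimated = np.array([[1, 0, 0, 1], [1, 0, 0, 1]])
    print(hit_minus_fa(ideal, estimated))   # 0.75 - 0.25 = 0.5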

Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification

Daniel Michelsanti, Zheng-Hua Tan; Aalborg University, Denmark
Tue-P-5-4-12, Time: 16:00–18:00

Improving speech system performance in noisy environments remains a challenging task, and speech enhancement (SE) is one of the effective techniques to solve the problem. Motivated by the promising results of generative adversarial networks (GANs) in a variety of image processing tasks, we explore the potential of conditional GANs (cGANs) for SE, and in particular, we make use of the image processing framework proposed by Isola et al. [1] to learn a mapping from the spectrogram of noisy speech to an enhanced counterpart. The SE cGAN consists of two networks, trained in an adversarial manner: a generator that tries to enhance the input noisy spectrogram, and a discriminator that tries to distinguish between enhanced spectrograms provided by the generator and clean ones from the database, using the noisy spectrogram as a condition. We evaluate the performance of the cGAN method in terms of perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and equal error rate (EER) of speaker verification (an example application). Experimental results show that the cGAN method overall outperforms the classical short-time spectral amplitude minimum mean square error (STSA-MMSE) SE algorithm, and is comparable to a deep neural network-based SE approach (DNN-SE).

Speech Enhancement Using Bayesian Wavenet

Kaizhi Qian 1, Yang Zhang 1, Shiyu Chang 2, Xuesong Yang 1, Dinei Florêncio 3, Mark Hasegawa-Johnson 1; 1University of Illinois at Urbana-Champaign, USA; 2IBM, USA; 3Microsoft, USA
Tue-P-5-4-13, Time: 16:00–18:00

In recent years, deep learning has achieved great success in speech enhancement. However, there are two major limitations of existing works. First, the Bayesian framework is not adopted in many such deep-learning-based algorithms. In particular, the prior distribution for speech in the Bayesian framework has been shown to be useful by regularizing the output to be in the speech space, and thus improving the performance. Second, the majority of the existing methods operate on the frequency domain of the noisy speech, such as the spectrogram and its variations. The clean speech is then reconstructed using the overlap-add approach, which is limited by its inherent performance upper bound. This paper presents a Bayesian speech enhancement framework, called BaWN (Bayesian WaveNet), which directly operates on raw audio samples. It adopts the recently announced WaveNet, which is shown to be effective in modeling conditional distributions of speech samples while generating natural speech. Experiments show that BaWN is able to recover clean and natural speech.

Binaural Reverberant Speech Separation Based on Deep Neural Networks

Xueliang Zhang 1, DeLiang Wang 2; 1Inner Mongolia University, China; 2Ohio State University, USA
Tue-P-5-4-14, Time: 16:00–18:00

Supervised learning has exhibited great potential for speech separation in recent years. In this paper, we focus on separating target speech in reverberant conditions from binaural inputs using supervised learning. Specifically, a deep neural network (DNN) is constructed to map from both spectral and spatial features to a training target. For spectral feature extraction, we first convert the binaural inputs into a single signal by applying a fixed beamformer. A new spatial feature is proposed and extracted to complement the spectral features. The training target is the recently suggested ideal ratio mask (IRM). Systematic evaluations and comparisons show that the proposed system achieves good separation performance and substantially outperforms existing algorithms under challenging multi-source and reverberant environments.
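
For reference, the ideal ratio mask used as the training target is commonly defined from the speech and noise energies in each time-frequency unit; a minimal sketch under that common definition (which may differ in detail from the exact variant used in the paper):

    import numpy as np

    def ideal_ratio_mask(speech_pow, noise_pow, beta=0.5):
        # IRM = (S^2 / (S^2 + N^2))^beta per time-frequency unit
        return (speech_pow / (speech_pow + noise_pow + 1e-12)) ** beta

    speech_pow = np.random.rand(257, 100)    # hypothetical |S|^2 spectrogram
    noise_pow = np.random.rand(257, 100)     # hypothetical |N|^2 spectrogram
    irm = ideal_ratio_mask(speech_pow, noise_pow)
    enhanced_pow = irm * (speech_pow + noise_pow)   # mask applied to the mixture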

On the Quality and Intelligibility of Noisy Speech Processed for Near-End Listening Enhancement

Tudor-Catalin Zorila, Yannis Stylianou; Toshiba Research Europe, UK
Tue-P-5-4-15, Time: 16:00–18:00

Most current techniques for near-end speech intelligibility enhancement have focused on processing clean input signals; however, in realistic environments, the input is often noisy. Processing noisy speech for intelligibility enhancement using algorithms developed for clean signals can lower the perceptual quality of the samples when they are listened to in quiet. Here we address the quality loss in these conditions by combining noise reduction with a multi-band version of a state-of-the-art intelligibility enhancer for clean speech that is based on spectral shaping and dynamic range compression (SSDRC). Subjective quality and intelligibility assessments with noisy input speech showed that: (a) in quiet near-end conditions, the proposed system outperformed the baseline SSDRC in terms of Mean Opinion Score (MOS); (b) in speech-shaped near-end noise, the proposed system improved the intelligibility of unprocessed speech by a factor larger than three at the lowest tested signal-to-noise ratio (SNR); however, overall, it yielded lower recognition scores than the standard SSDRC.

Tue-S&T-3/4-A : Show & Tell 3
E306, 10:00–12:00, 13:30–15:30, Tuesday, 22 Aug. 2017

Applications of the BBN Sage Speech Processing Platform

Ralf Meermeier, Sean Colbath; Raytheon BBN Technologies, USA
Tue-S&T-3-A-1, Time: 10:00–12:00

As a follow-up to our paper at Interspeech 2016 [1], we propose to showcase various applications that now all use BBN’s Sage Speech Processing Platform, demonstrating the platform’s versatility and ease of integration.

In particular, we will showcase: 1) BBN TransTalk, a turn-based speech-to-speech translation program running entirely on an Android smartphone, alongside a custom 3D-printed peripheral for it; 2) a continuous transcription and translation application running on a Raspberry Pi; and 3) an offline OCR application utilizing Sage, running on a COTS Windows laptop.

Bob Speaks Kaldi

Milos Cernak, Alain Komaty, Amir Mohammadi, André Anjos, Sébastien Marcel; Idiap Research Institute, Switzerland
Tue-S&T-3-A-2, Time: 10:00–12:00

This paper introduces and demonstrates the integration of Kaldi into the Bob signal-processing and machine learning toolbox. The motivation for this integration is two-fold. Firstly, Bob benefits from using advanced speech processing tools developed in Kaldi. Secondly, Kaldi benefits from using complementary Bob modules, such as modulation-based VAD with adaptive thresholding. In addition, Bob is designed as an open science tool, and this integration might offer the Kaldi speech community a framework for better reproducibility of state-of-the-art research results.

Real Time Pitch Shifting with Formant Structure Preservation Using the Phase Vocoder

Michał Lenarczyk; Polish Academy of Sciences, Poland
Tue-S&T-3-A-3, Time: 10:00–12:00

Pitch shifting in speech is presented based on the use of the phase vocoder in combination with spectral whitening and envelope reconstruction, applied respectively before and after the transformation. A band preservation technique is introduced to contain quality degradation when downscaling the pitch. The transposition ratio is fixed in advance by selecting analysis and synthesis window sizes. Real time performance is demonstrated for window sizes having the adequate factorization required by fast Fourier transformation.

A Signal Processing Approach for Speaker Separation Using SFF Analysis

Nivedita Chennupati, B.H.V.S. Narayana Murthy, B. Yegnanarayana; IIIT Hyderabad, India
Tue-S&T-3-A-4, Time: 10:00–12:00

Multi-speaker separation is necessary to increase the intelligibility of speech signals or to improve the accuracy of speech recognition systems. The ideal binary mask (IBM) has set a gold standard for speech separation by suppressing the undesired speakers and increasing the intelligibility of the desired speech. In this work, single frequency filtering (SFF) analysis is used to estimate a mask close to the IBM for speaker separation. The SFF analysis gives good temporal resolution for extracting features such as glottal closure instants (GCIs), and high spectral resolution for resolving harmonics. The temporal resolution in SFF gives impulse locations, which are used to calculate the time delay. The delay compensation between two microphone signals reinforces the impulses corresponding to one of the speakers. The spectral resolution of the SFF is exploited to estimate the masks using the SFF magnitude spectra on the enhanced impulse-like sequence corresponding to one of the speakers. The estimated mask is used to refine the SFF magnitude. The refined SFF magnitude, along with the phase of the mixed microphone signal, is used to obtain speaker separation. The performance of the proposed algorithm is demonstrated using multi-speaker data collected in a real room environment.

Speech Recognition and Understanding on Hardware-Accelerated DSP

Georg Stemmer 1, Munir Georges 1, Joachim Hofer 1, Piotr Rozen 2, Josef Bauer 1, Jakub Nowicki 2, Tobias Bocklet 1, Hannah R. Colett 3, Ohad Falik 4, Michael Deisher 3, Sylvia J. Downing 3; 1Intel, Germany; 2Intel, Poland; 3Intel, USA; 4Intel, Israel
Tue-S&T-3-A-5, Time: 10:00–12:00

A smart home controller that responds to natural language input is demonstrated on an Intel embedded processor. This device contains two DSP cores and a neural network co-processor which share 4 MB of SRAM. An embedded configuration of the Intel RealSpeech speech recognizer and intent extraction engine runs on the DSP cores, with neural network operations offloaded to the co-processor. The prototype demonstrates that continuous speech recognition and understanding is possible on hardware with very low power consumption. As an example application, control of lights in a home via natural language is shown. An Intel development kit is demonstrated together with a set of tools. Conference attendees are encouraged to interact with the demo and development system.

MetaLab: A Repository for Meta-Analyses on Language Development, and More

Sho Tsuji 1, Christina Bergmann 2, Molly Lewis 3, Mika Braginsky 4, Page Piccinini 5, Michael C. Frank 6, Alejandrina Cristia 2; 1University of Pennsylvania, USA; 2LSCP (UMR 8554), France; 3University of Chicago, USA; 4MIT, USA; 5NPI (U955 E01), France; 6Stanford University, USA
Tue-S&T-3-A-6, Time: 10:00–12:00

MetaLab is a growing database of meta-analyses, shared in a GitHub repository and via an interactive website. This website contains interactive tools for community-augmented meta-analyses, power analyses, and experimental planning. It currently contains a dozen meta-analyses spanning a number of phenomena in early language acquisition research, including infants’ vowel discrimination, acoustic wordform segmentation, and distributional learning in the laboratory. During the Show and Tell, we will demonstrate how to use the online visualization tools, download data, and re-use our analysis scripts for other research purposes. We expect MetaLab data to be particularly useful to researchers interested in early speech perception. Additionally, the infrastructure and tools can be adopted by speech scientists seeking to perform and utilize (meta-)meta-analyses in other fields.

Tue-S&T-3/4-B : Show & Tell 4
E397, 10:00–12:00, 13:30–15:30, Tuesday, 22 Aug. 2017

Evolving Recurrent Neural Networks That Process and Classify Raw Audio in a Streaming Fashion

Adrien Daniel; NXP Semiconductors, France
Tue-S&T-3-B-1, Time: 10:00–12:00

The paper describes a novel neuroevolution-based approach to train recurrent neural networks that can process and classify audio directly from the raw waveform signal, without any assumption on the signal itself, on the features that should be extracted, or on the required network topology to perform the task. The resulting networks are relatively small in memory size, and their usage in a streaming fashion makes them particularly suited to embedded real-time applications.

Combining Gaussian Mixture Models and Segmental Feature Models for Speaker Recognition

Milana Miloševic 1, Ulrike Glavitsch 2; 1University of Belgrade, Serbia; 2EMPA, Switzerland
Tue-S&T-3-B-2, Time: 10:00–12:00

In most speaker recognition systems, speech utterances are not constrained in content or language. In a text-dependent speaker recognition system, the lexical content of the speech and the language are known in advance. The goal of this paper is to show that this information can be used by a segmental features (SF) approach to improve a standard Gaussian mixture model with MFCC features (GMM-MFCC). Speech features such as mean energy, delta energy, pitch, delta pitch, the formants F1–F4 and their bandwidths B1–B4, and the difference between F2 and F1 are calculated on segments and are associated with phonemes and phoneme groups for each speaker. The SF and GMM-MFCC approaches are combined by multiplying the outputs of the two classifiers. All experiments are performed on the two versions of TEVOID: TEVOID16 with 16 speakers and the upgraded TEVOID50 with 50 speakers. On TEVOID16, SF achieves an 84.23% recognition rate, GMM-MFCC 91.75%, and the combined approach 95.12%. On TEVOID50, the SF approach gives 68.69%, while both GMM-MFCC and the combined model achieve a 95.84% recognition rate. On both databases, the number of male/female confusions decreased for the combined model. These results are promising for using segmental features to improve the recognition rate of text-dependent systems.
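
A minimal sketch of the score-level combination described above, with hypothetical per-speaker scores from the two classifiers:

    import numpy as np

    gmm_mfcc_scores = np.array([0.10, 0.60, 0.30])   # hypothetical GMM-MFCC outputs
    sf_scores = np.array([0.20, 0.50, 0.30])         # hypothetical segmental-feature outputs

    combined = gmm_mfcc_scores * sf_scores           # multiply the two classifier outputs
    print(int(np.argmax(combined)))                  # index of the recognised speaker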

“Did you laugh enough today?” — Deep Neural Networks for Mobile and Wearable Laughter Trackers

Gerhard Hagerer 1, Nicholas Cummins 2, Florian Eyben 1, Björn Schuller 1; 1audEERING, Germany; 2Universität Passau, Germany
Tue-S&T-3-B-3, Time: 10:00–12:00

In this paper we describe an app for mobile and wearable devices that recognises laughter from speech in real time. The laughter detection is based on a deep neural network architecture, which runs smoothly and robustly, even natively on a smartwatch. Further, this paper presents results demonstrating that our approach achieves state-of-the-art laughter detection performance on the SSPNet Vocalization Corpus (SVC) from the 2013 Interspeech Computational Paralinguistics Challenge Social Signals Sub-Challenge. As this technology is tailored for mobile and wearable devices, it enables and motivates many new use cases, for example, deployment in health care settings such as laughter tracking for psychological coaching, depression monitoring, and therapies.

Low-Frequency Ultrasonic Communication for Speech Broadcasting in Public Transportation

Kwang Myung Jeon, Nam Kyun Kim, Chan Woong Kwak, Jung Min Moon, Hong Kook Kim; GIST, Korea
Tue-S&T-3-B-4, Time: 10:00–12:00

Speech broadcasting via loudspeakers is widely used in public transportation to send broadcast notifications. However, listeners often fail to catch spoken context from speech broadcasts due to excessive environmental noise. We propose an ultrasonic communication method that can be applied to loudspeaker-based speech broadcasting to cope with this issue. In other words, text notifications are modulated and carried over low-frequency ultrasonic waves through loudspeakers to the microphones of each potential listener’s mobile device. Then, the received ultrasonic stream is demodulated back into text and the listener hears the notification context via a text-to-speech engine embedded in each mobile device. Such a transmission system is realized with a 20 kHz carrier frequency because it is inaudible to most listeners but capable of being used in communication between a loudspeaker and a microphone. In addition, the performance of the proposed ultrasonic communication method is evaluated by measuring the success rate of transmitted words under various signal-to-noise ratio conditions.
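
The abstract does not state which modulation scheme is used; purely as an illustration, the sketch below generates an on-off keyed bit stream on a 20 kHz carrier at a 48 kHz sampling rate (all parameter values hypothetical):

    import numpy as np

    fs = 48000           # loudspeaker sampling rate
    fc = 20000           # near-inaudible carrier frequency
    bit_rate = 100       # illustrative bit rate
    bits = np.array([1, 0, 1, 1, 0, 1])

    samples_per_bit = fs // bit_rate
    envelope = np.repeat(bits, samples_per_bit).astype(float)   # on-off keying
    t = np.arange(envelope.size) / fs
    ultrasonic = envelope * np.sin(2 * np.pi * fc * t)          # 20 kHz carrier
    # 'ultrasonic' could be mixed into the broadcast and recovered on the mobile
    # device by band-pass filtering around 20 kHz and envelope detection.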

Real-Time Speech Enhancement with GCC-NMF: Demonstration on the Raspberry Pi and NVIDIA Jetson

Sean U.N. Wood, Jean Rouat; Université de Sherbrooke, Canada
Tue-S&T-3-B-5, Time: 10:00–12:00

We demonstrate a real-time, open source implementation of the online GCC-NMF stereo speech enhancement algorithm. While the system runs on a variety of operating systems and hardware platforms, we highlight its potential for real-world mobile use by presenting it on two embedded systems: the Raspberry Pi 3 and the NVIDIA Jetson TX1. The effect of various algorithm parameters on subjective enhancement quality may be explored interactively via a graphical user interface, with the results heard in real time. The trade-off between interference suppression and target fidelity is controlled by manipulating the parameters of the coefficient masking function. Increasing the pre-learned dictionary size improves overall speech enhancement quality at increased computational cost. We show that real-time GCC-NMF has potential for real-world application, remaining purely unsupervised and retaining the simplicity and flexibility of offline GCC-NMF.

Reading Validation for Pronunciation Evaluation in the Digitala Project

Aku Rouhe, Reima Karhila, Peter Smit, Mikko Kurimo; Aalto University, Finland
Tue-S&T-3-B-6, Time: 10:00–12:00

We describe a recognition, validation and segmentation system as an intelligent preprocessor for automatic pronunciation evaluation. The system is developed for large-scale, high-stakes foreign language tests, where it is necessary to reduce human workload and ensure fair evaluation.

Keynote 2: Catherine Pelachaud
Aula Magna, 08:30–09:30, Wednesday, 23 Aug. 2017
Chair: Björn Granström

Conversing with Social Agents That Smile and Laugh

Catherine Pelachaud; ISIR (UMR 7222), France
Wed-K3-1, Time: 08:30–09:30

Our aim is to create virtual conversational partners. As such, we have developed computational models to enrich virtual characters with socio-emotional capabilities that are communicated through multimodal behaviors. The approach we follow to build interactive and expressive interactants relies on theories from the human and social sciences as well as data analysis and user-perception-based design. We have explored specific social signals such as smile and laughter, capturing their variation in production but also their different communicative functions and their impact in human-agent interaction. Lately we have been interested in modeling agents with social attitudes. Our aim is to model how social attitudes color the multimodal behaviors of the agents. We have gathered a corpus of dyads that was annotated along two layers: social attitudes and nonverbal behaviors. By applying sequence mining methods we have extracted behavior patterns involved in the change of perception of an attitude. We are particularly interested in capturing the behaviors that correspond to a change of perception of an attitude. In this talk I will present the GRETA/VIB platform where our research is implemented.

Wed-SS-6-2 : Special Session: Digital Revolution for Under-resourced Languages 1
A2, 10:00–12:00, Wednesday, 23 Aug. 2017
Chairs: Alexey Karpov, Kristiina Jokinen

Team ELISA System for DARPA LORELEI Speech Evaluation 2016

Pavlos Papadopoulos 1, Ruchir Travadi 1, Colin Vaz 1, Nikolaos Malandrakis 1, Ulf Hermjakob 1, Nima Pourdamghani 1, Michael Pust 1, Boliang Zhang 2, Xiaoman Pan 2, Di Lu 2, Ying Lin 2, Ondrej Glembek 3, Murali Karthick Baskar 3, Martin Karafiát 3, Lukáš Burget 3, Mark Hasegawa-Johnson 4, Heng Ji 2, Jonathan May 1, Kevin Knight 1, Shrikanth S. Narayanan 1; 1University of Southern California, USA; 2Rensselaer Polytechnic Institute, USA; 3Brno University of Technology, Czech Republic; 4University of Illinois at Urbana-Champaign, USA
Wed-SS-6-2-1, Time: 10:00–10:20

In this paper, we describe the system designed and developed by team ELISA for DARPA’s LORELEI (Low Resource Languages for Emergent Incidents) pilot speech evaluation. The goal of the LORELEI program is to guide rapid resource deployment for humanitarian relief (e.g. for natural disasters), with a focus on “low-resource” language locations, where the cost of developing technologies for automated human language tools can be prohibitive both in monetary terms and timewise. In this phase of the program, the speech evaluation consisted of three separate tasks: detecting the presence of an incident, classifying the incident type, and classifying the incident type along with identifying the location where it occurs. The performance metric was the area under the curve of precision-recall curves. Team ELISA competed against five other teams and won all the subtasks.
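
A minimal sketch of the performance metric (area under the precision-recall curve) using scikit-learn, with hypothetical incident labels and system scores:

    import numpy as np
    from sklearn.metrics import precision_recall_curve, auc

    y_true = np.array([0, 0, 1, 1, 1, 0, 1])                    # hypothetical labels
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9])    # hypothetical confidences

    precision, recall, _ = precision_recall_curve(y_true, y_score)
    print(auc(recall, precision))    # area under the precision-recall curve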

First Results in Developing a Medieval Latin Language Charter Dictation System for the East-Central Europe Region

Péter Mihajlik 1, Lili Szabó 2, Balázs Tarján 1, András Balog 2, Krisztina Rábai 3; 1BME, Hungary; 2THINKTech Research Center, Hungary; 3University of Hradec Králové, Czech Republic
Wed-SS-6-2-2, Time: 10:20–10:40

Latin served as an official language across Europe from the Roman Empire until the 19th century. As a result, a vast amount of Latin-language historical documents (charters, account books) survived from the Middle Ages, waiting for recovery. In the digitization process, tremendous human effort is needed for the transliteration of textual content, as the applicability of optical character recognition techniques is often limited. In the era of Digital Humanities, our aim is to accelerate the transcription by using automatic speech recognition technology. We introduce the challenges and our initial results in developing a real-time, medieval Latin language LVCSR dictation system for East-Central Europe (ECE). In this region, the pronunciation and usage of medieval Latin is considered to be roughly uniform. At this phase of the research, therefore, Latin speech data was not collected for acoustic model training but only for test purposes, from a selection of ECE countries. Our experimental results, however, suggest that ECE Latin varies significantly depending on the primary national language, on both the acoustic-phonetic and grammatical levels. On the other hand, unexpectedly low word error rates are obtained for several speakers whose native language is completely uncovered by the applied training data.

The Motivation and Development of MPAi, a Maori Pronunciation Aid

C.I. Watson 1, P.J. Keegan 1, M.A. Maclagan 2, R. Harlow 3, J. King 2; 1University of Auckland, New Zealand; 2University of Canterbury, New Zealand; 3University of Waikato, New Zealand
Wed-SS-6-2-3, Time: 10:40–11:00

This paper outlines the motivation and development of a pronunciation aid (MPAi) for the Maori language, the language of the indigenous people of New Zealand. Maori is threatened and, after a break in transmission, the language is currently undergoing revitalization. The data for the aid has come from a corpus of 60 speakers (men and women). The language aid allows users to model their speech against exemplars from young speakers or older speakers of Maori. This is important because of the status of the elders in the Maori-speaking community, but it also recognizes that Maori is undergoing substantial vowel change. The pronunciation aid gives feedback on vowel production via formant analysis, and on selected words via speech recognition. The evaluation of the aid by 22 language teachers is presented and the resulting changes are discussed.

On the Linguistic Relevance of Speech Units Learned by Unsupervised Acoustic Modeling

Siyuan Feng, Tan Lee; Chinese University of Hong Kong, China
Wed-SS-6-2-4, Time: 11:00–11:20

Unsupervised acoustic modeling is an important and challenging problem in spoken language technology development for low-resource languages. It aims at automatically learning a set of speech units from un-transcribed data. These learned units are expected to be related to fundamental linguistic units that constitute the concerned language. Formulated as a clustering problem, unsupervised acoustic modeling methods are often evaluated in terms of average purity or similar types of performance measures. These do not provide detailed insights on the fitness of individual learned units and the relation between them. This paper presents an investigation of the linguistic relevance of learned speech units based on Kullback-Leibler (KL) divergence. A symmetric KL divergence metric is used to measure the distance between each pair of learned unit and ground-truth phoneme of the target language. Experimental analysis on a multilingual database shows that KL divergence is consistent with purity in evaluating clustering results. The deviation between a learned unit and its closest ground-truth phoneme is comparable to the inherent variability of the phoneme. The learned speech units have a good coverage of linguistically defined phonemes. However, there are certain phonemes that cannot be covered, for example, the retroflex final /er/ in Mandarin.
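
A minimal sketch of the symmetric KL divergence used as the distance metric, assuming each learned unit and each ground-truth phoneme is represented by a discrete probability distribution (hypothetical vectors):

    import numpy as np
    from scipy.stats import entropy

    def symmetric_kl(p, q):
        # entropy(p, q) is KL(p || q); symmetrise by averaging the two directions
        return 0.5 * (entropy(p, q) + entropy(q, p))

    learned_unit = np.array([0.70, 0.20, 0.10])   # hypothetical distribution
    phoneme = np.array([0.60, 0.30, 0.10])        # hypothetical distribution
    print(symmetric_kl(learned_unit, phoneme))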

Deep Auto-Encoder Based Multi-Task Learning Using Probabilistic Transcriptions

Amit Das 1, Mark Hasegawa-Johnson 1, Karel Veselý 2; 1University of Illinois at Urbana-Champaign, USA; 2Brno University of Technology, Czech Republic
Wed-SS-6-2-5, Time: 11:20–11:40

We examine a scenario where we have no access to native transcribers in the target language. This is typical of language communities that are under-resourced. However, turkers (online crowd workers) available in online marketplaces can serve as valuable alternative resources for providing transcripts in the target language. We assume that the turkers neither speak nor have any familiarity with the target language. Thus, they are unable to distinguish all phone pairs in the target language; their transcripts therefore specify, at best, a probability distribution called a probabilistic transcript (PT). Standard deep neural network (DNN) training using PTs does not necessarily improve error rates. Previously reported results have demonstrated some success by adopting the multi-task learning (MTL) approach. In this study, we report further improvements by introducing a deep auto-encoder based MTL. This method leverages large amounts of untranscribed data in the target language in addition to the PTs obtained from turkers. Furthermore, to encourage transfer learning in the feature space, we also examine the effect of using monophones from transcripts in well-resourced languages. We report consistent improvements in phone error rates (PER) for Swahili, Amharic, Dinka, and Mandarin.

Areal and Phylogenetic Features for Multilingual Speech Synthesis

Alexander Gutkin 1, Richard Sproat 2; 1Google, UK; 2Google, USA
Wed-SS-6-2-6, Time: 11:40–12:00

We introduce phylogenetic and areal language features to the domain of multilingual text-to-speech synthesis. Intuitively, enriching the existing universal phonetic features with cross-lingual shared representations should benefit the multilingual acoustic models and help to address issues like data scarcity for low-resource languages. We investigate these representations using acoustic models based on long short-term memory recurrent neural networks. Subjective evaluations conducted on eight languages from diverse language families show that sometimes phylogenetic and areal representations lead to significant multilingual synthesis quality improvements. To help better leverage these novel features, improving the baseline phonetic representation may be necessary.

Wed-SS-6-11 : Special Session: Data Collection, Transcription and Annotation Issues in Child Language Acquisition
F11, 10:00–12:00, Wednesday, 23 Aug. 2017
Chairs: Elika Bergelson, Sho Tsuji

SLPAnnotator: Tools for Implementing Sign Language Phonetic Annotation

Kathleen Currie Hall, Scott Mackie, Michael Fry, Oksana Tkachman; University of British Columbia, Canada
Wed-SS-6-11-1, Time: 11:40–12:00

This paper introduces a new resource for building phonetically transcribed corpora of signed languages. The free, open-source software tool, SLPAnnotator, is designed to facilitate the transcription of hand configurations using a slightly modified version of the Sign Language Phonetic Annotation (SLPA) system ([1], [2], [3], [4]; see also [5]).

While the SLPA system is extremely phonetically detailed, it can be seen as cumbersome and, perhaps, harder for humans to use and interpret than other transcription systems (e.g. Prosodic Model Handshape Coding, [6]). SLPAnnotator is designed to bridge the gap between such systems by automating some of the transcription process, providing users with informative references about possible configurations as they are coding, giving continuously updatable access to a visual model of the transcribed handshape, and allowing users to verify that transcribed handshapes are both phonologically and anatomically plausible. Finally, SLPAnnotator is designed to interface with other analysis tools, such as Phonological CorpusTools ([7], [8]), to allow for subsequent phonological analysis of the resulting sign language corpora.

The LENA System Applied to Swedish: Reliability of the Adult Word Count Estimate

Iris-Corinna Schwarz 1, Noor Botros 2, Alekzandra Lord 2, Amelie Marcusson 2, Henrik Tidelius 2, Ellen Marklund 1; 1Stockholm University, Sweden; 2Karolinska Institute, Sweden
Wed-SS-6-11-2, Time: 10:40–11:00

The Language Environment Analysis system LENA is used to capture day-long recordings of children’s natural audio environment. The system performs automated segmentation of the recordings and provides estimates for various measures. One of those measures is Adult Word Count (AWC), an approximation of the number of words spoken by adults in close proximity to the child. The LENA system was developed for and trained on American English, but it has also been evaluated on its performance when applied to Spanish, Mandarin and French. The present study is the first evaluation of the LENA system applied to Swedish, and focuses on the AWC estimate. Twelve five-minute segments were selected at random from each of four day-long recordings of 30-month-old children. Each of these 48 segments was transcribed by two transcribers, and both the number of words and the number of vowels were calculated (inter-transcriber reliability for words: r = .95, vowels: r = .93). Both counts correlated with the LENA system’s AWC estimate for the same segments (words: r = .67, vowels: r = .66). The reliability of the AWC as estimated by the LENA system when applied to Swedish is therefore comparable to its reliability for Spanish, Mandarin and French.
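
A minimal sketch of how such reliability correlations can be computed, with hypothetical per-segment word counts (the values below are not from the study):

    import numpy as np

    manual_words = np.array([210, 180, 95, 340, 150])   # hypothetical manual counts
    lena_awc = np.array([230, 160, 120, 300, 170])      # hypothetical AWC estimates

    r = np.corrcoef(manual_words, lena_awc)[0, 1]       # Pearson correlation
    print(round(r, 2))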

What do Babies Hear? Analyses of Child- and Adult-Directed Speech

Marisa Casillas 1, Andrei Amatuni 2, Amanda Seidl 3, Melanie Soderstrom 4, Anne S. Warlaumont 5, Elika Bergelson 2; 1MPI for Psycholinguistics, The Netherlands; 2Duke University, USA; 3Purdue University, USA; 4University of Manitoba, Canada; 5University of California at Merced, USA
Wed-SS-6-11-3, Time: 10:20–10:40

Child-directed speech is argued to facilitate language development, and is found cross-linguistically and cross-culturally to varying degrees. However, previous research has generally focused on short samples of child-caregiver interaction, often in the lab or with experimenters present. We test the generalizability of this phenomenon with an initial descriptive analysis of the speech heard by young children in a large, unique collection of naturalistic, daylong home recordings. Trained annotators coded automatically-detected adult speech ‘utterances’ from 61 homes across 4 North American cities, gathered from children (age 2–24 months) wearing audio recorders during a typical day. Coders marked the speaker gender (male/female) and intended addressee (child/adult), yielding 10,886 addressee and gender tags from 2,523 minutes of audio (cf. HB-CHAAC Interspeech ComParE challenge; Schuller et al., in press). Automated speaker-diarization (LENA) incorrectly gender-tagged 30% of male adult utterances, compared to manually-coded consensus. Furthermore, we find effects of SES and gender on child-directed and overall speech, increasing child-directed speech with child age, and interactions of speaker gender, child gender, and child age: female caretakers increased their child-directed speech more with age than male caretakers did, but only for male infants. Implications for language acquisition and existing classification algorithms are discussed.

A New Workflow for Semi-Automatized Annotations: Tests with Long-Form Naturalistic Recordings of Children’s Language Environments

Marisa Casillas 1, Elika Bergelson 2, Anne S. Warlaumont 3, Alejandrina Cristia 4, Melanie Soderstrom 5, Mark VanDam 6, Han Sloetjes 1; 1MPI for Psycholinguistics, The Netherlands; 2Duke University, USA; 3University of California at Merced, USA; 4LSCP (UMR 8554), France; 5University of Manitoba, Canada; 6Washington State University, USA
Wed-SS-6-11-4, Time: 11:20–11:40

Interoperable annotation formats are fundamental to the utility, expansion, and sustainability of collective data repositories. In language development research, shared annotation schemes have been critical to facilitating the transition from raw acoustic data to searchable, structured corpora. Current schemes typically require comprehensive and manual annotation of utterance boundaries and orthographic speech content, with an additional, optional range of tags of interest. These schemes have been enormously successful for datasets on the scale of dozens of recording hours but are untenable for long-format recording corpora, which routinely contain hundreds to thousands of audio hours. Long-format corpora would benefit greatly from (semi-)automated analyses, both on the earliest steps of annotation — voice activity detection, utterance segmentation, and speaker diarization — as well as later steps — e.g., classification-based codes such as child-vs-adult-directed speech, and speech recognition to produce phonetic/orthographic representations. We present an annotation workflow specifically designed for long-format corpora which can be tailored by individual researchers and which interfaces with the current dominant scheme for short-format recordings. The workflow allows semi-automated annotation and analyses at higher linguistic levels. We give one example of how the workflow has been successfully implemented in a large cross-database project.

Top-Down versus Bottom-Up Theories of Phonological Acquisition: A Big Data Approach

Christina Bergmann, Sho Tsuji, Alejandrina Cristia; LSCP (UMR 8554), France
Wed-SS-6-11-5, Time: 10:00–10:20

Recent work has made available a number of standardized meta-analyses bearing on various aspects of infant language processing. We utilize data from two such meta-analyses (discrimination of vowel contrasts and word segmentation, i.e., recognition of word forms extracted from running speech) to assess whether the published body of empirical evidence supports a bottom-up versus a top-down theory of early phonological development by leveraging the power of results from thousands of infants. We predicted that if infants can rely purely on auditory experience to develop their phonological categories, then vowel discrimination and word segmentation should develop in parallel, with the latter potentially lagging behind the former. However, if infants crucially rely on word form information to build their phonological categories, then development at the word level must precede the acquisition of native sound categories. Our results do not support the latter prediction. We discuss potential implications and limitations, most saliently that word forms are only one top-down level proposed to affect phonological development, with other proposals suggesting that top-down pressures emerge from lexical (i.e., word-meaning pairs) development. This investigation also highlights general procedures by which standardized meta-analyses may be reused to answer theoretical questions spanning across phenomena.

Which Acoustic and Phonological Factors Shape Infants’ Vowel Discrimination? Exploiting Natural Variation in InPhonDB

Sho Tsuji 1, Alejandrina Cristia 2; 1University of Pennsylvania, USA; 2LSCP (UMR 8554), France
Wed-SS-6-11-6, Time: 11:00–11:20

A key research question in early language acquisition concerns the development of infants’ ability to discriminate sounds, and the factors structuring discrimination abilities. Vowel discrimination, in particular, has been studied using a range of tasks, experimental paradigms, and stimuli over the past 40 years, work recently compiled in a meta-analysis. We use this meta-analysis to assess whether there is statistical evidence for the following factors affecting effect sizes across studies: (1) the order in which the two vowel stimuli are presented; and (2) the distance between the vowels, measured acoustically in terms of spectral and quantity differences. The analysis of effect size magnitudes revealed order effects consistent with the Natural Referent Vowels framework, with greater effect sizes when the second vowel was more peripheral than the first. Additionally, we find that spectral acoustic distinctiveness is a consistent predictor of studies’ effect sizes, while temporal distinctiveness did not predict effect size magnitude. None of these factors interacted significantly with age. We discuss implications of these results for language acquisition and, more generally, developmental psychology research.

Wed-SS-7-1 : Special Session: Digital Revolution for Under-resourced Languages 2
Poster 1, 13:30–15:30, Wednesday, 23 Aug. 2017
Chairs: Shyam Agrawal, Oddur Kjartansson

The ABAIR Initiative: Bringing Spoken Irish into the Digital Space

Ailbhe Ní Chasaide, Neasa Ní Chiaráin, Christoph Wendler, Harald Berthelsen, Andy Murphy, Christer Gobl; Trinity College Dublin, Ireland
Wed-SS-7-1-1, Time: 13:30–15:30

The processes of language demise take hold when a language ceases to belong to the mainstream of life’s activities. Digital communication technology increasingly pervades all aspects of modern life. Languages not digitally ‘available’ are ever more marginalised, whereas a digital presence often yields unexpected opportunities to integrate the language into the mainstream. The ABAIR initiative embraces three central aspects of speech technology development for Irish (Gaelic): the provision of technology-oriented linguistic-phonetic resources; the building and perfecting of core speech technologies; and the development of technology applications, which exploit both the technologies and the linguistic resources. The latter enable the public, learners, and those with disabilities to integrate Irish into their day-to-day usage. This paper outlines some of the specific linguistic and sociolinguistic challenges and the approaches adopted to address them. Although machine-learning approaches are helping to speed up the process of technology provision, the ABAIR experience highlights how phonetic-linguistic resources are also crucial to the development process. For the endangered language, linguistic resources are central to many applications that impact on language usage. The sociolinguistic context and the needs of potential end users should be central considerations in setting research priorities and deciding on methods.

Very Low Resource Radio Browsing for Agile Developmental and Humanitarian Monitoring

Armin Saeb 1, Raghav Menon 1, Hugh Cameron 2, William Kibira 2, John Quinn 2, Thomas Niesler 1; 1Stellenbosch University, South Africa; 2UN Global Pulse, Uganda
Wed-SS-7-1-2, Time: 13:30–15:30

We present a radio browsing system developed on a very small corpus of annotated speech by using semi-supervised training of multilingual DNN/HMM acoustic models. This system is intended to support relief and developmental programmes by the United Nations (UN) in parts of Africa where the spoken languages are extremely under-resourced. We assume the availability of 12 minutes of annotated speech in the target language, and show how this can best be used to develop an acoustic model. First, a multilingual DNN/HMM is trained using Acholi as the target language and Luganda, Ugandan English and South African English as source languages. We show that the lowest word error rates are achieved by using this model to label further untranscribed target language data and then developing an SGMM acoustic model from the extended dataset. The performance of an ASR system trained in this way is sufficient for keyword detection that yields useful and actionable near real-time information to developmental organisations.

Extracting Situation Frames from Non-English Speech: Evaluation Framework and Pilot Results

Nikolaos Malandrakis 1, Ondrej Glembek 2, Shrikanth S. Narayanan 1; 1University of Southern California, USA; 2Brno University of Technology, Czech Republic
Wed-SS-7-1-3, Time: 13:30–15:30

This paper describes the first evaluation framework for the extraction of Situation Frames — structures describing humanitarian assistance needs — from non-English speech audio, conducted for the DARPA LORELEI (Low Resource Languages for Emergent Incidents) program. Participants in LORELEI had to process audio from a variety of sources, in non-English languages, and extract the information required to populate Situation Frames describing whether any need is mentioned, the type of need present, and where the need exists. The evaluation was conducted over a period of 10 days and attracted submissions from 6 teams, each team spanning multiple organizations. Performance was evaluated using precision-recall curves. The results are encouraging, with most teams showing some capability to detect the type of situation discussed, but more work will be required to connect needs to specific locations.

Eliciting Meaningful Units from Speech

Daniil Kocharov, Tatiana Kachkovskaia, Pavel Skrelin; Saint Petersburg State University, Russia
Wed-SS-7-1-4, Time: 13:30–15:30

Elicitation of information structure from speech is a crucial step in automatic speech understanding. In terms of both production and perception, we consider the intonational phrase to be the basic meaningful unit of information structure in speech. The current paper presents a method of detecting these units in speech by processing both the recorded speech and its textual representation. Using syntactic information, we split text into small groups of words closely connected with each other. Assuming that intonational phrases are built from these small groups, we use acoustic information to reveal their actual boundaries. The procedure was initially developed for processing Russian speech, and we have achieved the best published results for this language with F1 equal to 0.91. We assume that it may be adapted for other languages that have some amount of read speech resources, including under-resourced languages. For comparison we have evaluated it on English material (Boston University Radio Speech Corpus). Our results, F1 of 0.76, are comparable with the top systems designed for English.
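
For reference, the reported F1 combines the precision and recall of the detected intonational-phrase boundaries; a minimal sketch with hypothetical counts:

    # Hypothetical boundary-detection counts
    true_positives = 91
    false_positives = 10
    false_negatives = 8

    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    print(round(f1, 2))   # approximately 0.91 for these toy counts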

Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Applications

Saurabhchand Bhati, Shekhar Nayak, K. Sri Rama Murty; IIT Hyderabad, India
Wed-SS-7-1-5, Time: 13:30–15:30

Zero resource speech processing refers to a scenario where no or minimal transcribed data is available. In this paper, we propose a three-step unsupervised approach to zero resource speech processing, which does not require any other information/dataset. In the first step, we segment the speech signal into phoneme-like units, resulting in a large number of varying-length segments. The second step involves clustering the varying-length segments into a finite number of clusters so that each segment can be labeled with a cluster index. The unsupervised transcriptions, thus obtained, can be thought of as a sequence of virtual phone labels. In the third step, a deep neural network classifier is trained to map the feature vectors extracted from the signal to its corresponding virtual phone label. The virtual phone posteriors extracted from the DNN are used as features in the zero resource speech processing. The effectiveness of the proposed approach is evaluated on both ABX and spoken term discovery (STD) tasks, using spontaneous American English and Tsonga language datasets provided as part of the Zero Resource 2015 challenge. It is observed that the proposed system outperforms the baselines supplied with the datasets in both tasks without any task-specific modifications.
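
A minimal sketch of the second step (labelling segments with cluster indices), assuming each variable-length segment has already been reduced to a fixed-length embedding (hypothetical data); the actual segment representation and clustering method in the paper may differ:

    import numpy as np
    from sklearn.cluster import KMeans

    segment_embeddings = np.random.rand(200, 39)   # hypothetical embeddings of 200 segments

    kmeans = KMeans(n_clusters=50, n_init=10, random_state=0)
    virtual_phone_labels = kmeans.fit_predict(segment_embeddings)
    # Each segment now carries a "virtual phone" index in [0, 50), usable as a
    # training target for the DNN classifier in the third step.
    print(virtual_phone_labels[:10])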

Machine Assisted Analysis of Vowel Length Contrasts in Wolof

Elodie Gauthier 1, Laurent Besacier 1, Sylvie Voisin 2; 1LIG (UMR 5217), France; 2DDL (UMR 5596), France
Wed-SS-7-1-6, Time: 13:30–15:30

Growing digital archives and improving algorithms for automatic analysis of text and speech create new opportunities for fundamental research in phonetics. Such empirical approaches allow statistical evaluation of a much larger set of hypotheses about phonetic variation and its conditioning factors (among them geographical/dialectal variants). This paper illustrates this vision and proposes to challenge automatic methods for the analysis of a not easily observable phenomenon: vowel length contrast. We focus on Wolof, an under-resourced language from Sub-Saharan Africa. In particular, we propose multiple features to make a fine evaluation of the degree of length contrast under different factors such as read vs semi-spontaneous speech and standard vs dialectal Wolof. Our measures, made fully automatically on more than 20k vowel tokens, show that the proposed features can highlight different degrees of contrast for each vowel considered. We notably show that the contrast is weaker in semi-spontaneous speech and in a non-standard semi-spontaneous dialect.

Leveraging Text Data for Word Segmentation for Underresourced Languages

Thomas Glarner 1, Benedikt Boenninghoff 2, Oliver Walter 1, Reinhold Haeb-Umbach 1; 1Universität Paderborn, Germany; 2Ruhr-Universität Bochum, Germany
Wed-SS-7-1-7, Time: 13:30–15:30

In this contribution we show how to exploit text data to support word discovery from audio input in an underresourced target language. Given audio, of which a certain amount is transcribed at the word level, and additional unrelated text data, the approach is able to learn a probabilistic mapping from acoustic units to characters and utilize it to segment the audio data into words without the need for a pronunciation dictionary. This is achieved by three components: an unsupervised acoustic unit discovery system, a supervisedly trained acoustic unit-to-grapheme converter, and a word discovery system, which is initialized with a language model trained on the text data. Experiments for multiple setups show that initializing the language model with text data improves word segmentation performance by a large margin.

Improving DNN Bluetooth Narrowband Acoustic Models by Cross-Bandwidth and Cross-Lingual Initialization

Xiaodan Zhuang, Arnab Ghoshal, Antti-Veikko Rosti, Matthias Paulik, Daben Liu; Apple, USA
Wed-SS-7-1-8, Time: 13:30–15:30

The success of deep neural network (DNN) acoustic models is partly owed to the large amounts of training data available for different applications. This work investigates ways to improve DNN acoustic models for Bluetooth narrowband mobile applications when relatively small amounts of in-domain training data are available. To address the challenge of limited in-domain data, we use cross-bandwidth and cross-lingual transfer learning methods to leverage knowledge from other domains with more training data (different bandwidth and/or languages). Specifically, narrowband DNNs in a target language are initialized using the weights of DNNs trained on band-limited wide-band data in the same language or those trained on a different (resource-rich) language. We investigate multiple recipes involving such methods with different data resources. For all languages in our experiments, these recipes achieve up to 45% relative WER reduction, compared to training solely on the Bluetooth narrowband data in the target language. Furthermore, these recipes are very beneficial even when over two hundred hours of manually transcribed in-domain data are available, and we can achieve better accuracy than the baselines with as little as 20 hours of in-domain data.
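
A minimal PyTorch sketch of the cross-bandwidth (or cross-lingual) initialization idea, with hypothetical layer sizes rather than the authors' actual topology: the narrowband model is built with the same shape as a previously trained source model and starts from its weights before fine-tuning on the in-domain data:

    import torch.nn as nn

    def make_dnn(input_dim=40, hidden_dim=512, num_targets=3000):
        # Identical topology for the source and target acoustic models
        return nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_targets),
        )

    source_dnn = make_dnn()   # stands in for a model trained on band-limited wide-band data
    target_dnn = make_dnn()   # Bluetooth narrowband model to be trained
    target_dnn.load_state_dict(source_dnn.state_dict())   # cross-bandwidth initialization
    # target_dnn would then be fine-tuned on the small narrowband in-domain set.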

Joint Estimation of Articulatory Features and Acoustic Models for Low-Resource Languages

Basil Abraham, S. Umesh, Neethu Mariam Joy; IIT Madras, India
Wed-SS-7-1-9, Time: 13:30–15:30

Using articulatory features for speech recognition improves the performance of low-resource languages. One way to obtain articulatory features is by using an articulatory classifier (yielding pseudo-articulatory features). The performance of the articulatory features depends on the efficacy of this classifier, but training such a robust classifier for a low-resource language is constrained by the limited amount of training data. We can overcome this by training the articulatory classifier using a high-resource language. This classifier can then be used to generate articulatory features for the low-resource language. However, this technique fails when the high- and low-resource languages have mismatches in their environmental conditions. In this paper, we address both of the aforementioned problems by jointly estimating the articulatory features and the low-resource acoustic model. The experiments were performed on two low-resource Indian languages, namely Hindi and Tamil. English was used as the high-resource language. Relative improvements of 23% and 10% were obtained for Hindi and Tamil, respectively.

Transfer Learning and Distillation Techniques to Improve the Acoustic Modeling of Low Resource Languages

Basil Abraham, Tejaswi Seeram, S. Umesh; IIT Madras, India
Wed-SS-7-1-10, Time: 13:30–15:30

Deep neural networks (DNN) require large amounts of training data to build robust acoustic models for speech recognition tasks. Our work aims to improve low-resource language acoustic models to reach a performance comparable to that of a high-resource scenario with the help of data and model parameters from other high-resource languages. We explore transfer learning and distillation methods, where a complex high-resource model guides or supervises the training of the low-resource model. The techniques include (i) a multilingual framework that borrows data from a high-resource language while training the low-resource acoustic model, with KL-divergence-based constraints added to bias the model towards the low-resource language, and (ii) distilling knowledge from the complex high-resource model to improve the low-resource acoustic model. The experiments were performed on three Indian languages, namely Hindi, Tamil and Kannada. All the techniques gave improved performance, with the multilingual framework with KL-divergence regularization giving the best results. In all three languages, performance close to or better than the high-resource scenario was obtained.

Building an ASR Corpus Using Althingi’s Parliamentary Speeches

Inga Rún Helgadóttir, Róbert Kjaran, Anna Björk Nikulásdóttir, Jón Guðnason; Reykjavik University, Iceland
Wed-SS-7-1-11, Time: 13:30–15:30

Acoustic data acquisition for under-resourced languages is an important and challenging task. In the Icelandic parliament, Althingi, all performed speeches are transcribed manually and published as text on Althingi’s web page. To reduce the manual work involved, an automatic speech recognition system is being developed for Althingi. In this paper the development of a speech corpus suitable for training a parliamentary ASR system is described. Text and audio data of manually transcribed speeches were processed to build an aligned, segmented corpus, whereby language-specific tasks had to be developed specially for Icelandic. The resulting corpus of 542 hours of speech is freely available on http://www.malfong.is. First experiments with an ASR system trained on the Althingi corpus have been conducted, showing promising results. A word error rate of 16.38% was obtained using a time-delay deep neural network (TD-DNN) and 14.76% using a long short-term memory recurrent neural network (LSTM-RNN) architecture. The Althingi corpus is, to our knowledge, the largest speech corpus currently available in Icelandic. The corpus as well as the developed methods for corpus creation constitute a valuable resource for further developments within Icelandic language technology.

Implementation of a Radiology Speech Recognition System for Estonian Using Open Source Software

Tanel Alumäe, Andrus Paats, Ivo Fridolin, Einar Meister; Tallinn University of Technology, Estonia
Wed-SS-7-1-12, Time: 13:30–15:30

Speech recognition has become increasingly popular in radiology reporting in the last decade. However, developing a speech recognition system for a new language in a highly specific domain requires a lot of resources, expert knowledge and skills. Therefore, commercial vendors do not offer ready-made radiology speech recognition systems for less-resourced languages.

This paper describes the implementation of a radiology speech recognition system for Estonian, a language with less than one million native speakers. The system was developed in partnership with a hospital that provided a corpus of written reports for language modeling purposes. Rewrite rules for pre-processing training texts and post-processing recognition results were created manually, based on a small parallel corpus created by the hospital’s radiologists, using the Thrax toolkit. Deep neural network based acoustic models were trained on 216 hours of out-of-domain data and adapted on 14 hours of spoken radiology data, using the Kaldi toolkit. The current word error rate of the system is 5.4%. The system is in active use in a real clinical environment.

Building ASR Corpora Using Eyra

Jón Guðnason, Matthías Pétursson, Róbert Kjaran, Simon Klüpfel, Anna Björk Nikulásdóttir; Reykjavik University, Iceland
Wed-SS-7-1-13, Time: 13:30–15:30

Building acoustic databases for speech recognition is very important for under-resourced languages. To build a speech recognition system, a large amount of speech data from a considerable number of participants needs to be collected. Eyra is a toolkit that can be used to gather acoustic data from a large number of participants in a relatively straightforward fashion. Predetermined prompts are downloaded onto a client, typically run on a smartphone, where the participant reads them aloud so that the recording and its corresponding prompt can be uploaded. This paper presents the Eyra toolkit, its quality control routines and annotation mechanism. The quality control relies on a forced-alignment module, which gives feedback to the participant, and an annotation module which allows data collectors to rate the read prompts after they are uploaded to the system. The paper presents an analysis of the performance of the quality control and describes two data collections for Icelandic and Javanese.
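
A minimal sketch of the quality-control idea, assuming the forced aligner returns per-word scores: a recording is accepted only if every prompt word is found and no word scores suspiciously low. The threshold and the shape of the aligner output are assumptions for illustration, not Eyra internals.

```python
# Hedged sketch of forced-alignment-based quality control; the score threshold
# and the (word, score) output format are illustrative assumptions.
def check_recording(prompt_words, aligned_words, score_threshold=-8.0):
    """aligned_words: list of (word, average_log_likelihood) from a forced aligner."""
    found = {word for word, _ in aligned_words}
    missing = [w for w in prompt_words if w not in found]
    unclear = [w for w, score in aligned_words if score < score_threshold]
    ok = not missing and not unclear
    feedback = "ok" if ok else f"please re-record (missing: {missing}, unclear: {unclear})"
    return ok, feedback
```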

Rapid Development of TTS Corpora for Four South African Languages

Daniel van Niekerk 1, Charl van Heerden 1, Marelie Davel 1, Neil Kleynhans 1, Oddur Kjartansson 2, Martin Jansche 2, Linne Ha 3; 1North-West University, South Africa; 2Google, UK; 3Google, USA
Wed-SS-7-1-14, Time: 13:30–15:30

This paper describes the development of text-to-speech corpora for four South African languages. The approach followed investigated the possibility of using low-cost methods, including informal recording environments and untrained volunteer speakers. This objective, and the additional future goal of expanding the corpus to increase coverage of South Africa's 11 official languages, necessitated experimenting with multi-speaker and code-switched data. The process and relevant observations are detailed throughout. The latest version of the corpora is available for download under an open-source licence and will likely see further development and refinement in future.

Uniform Multilingual Multi-Speaker Acoustic Model for Statistical Parametric Speech Synthesis of Low-Resourced Languages

Alexander Gutkin; Google, UK
Wed-SS-7-1-15, Time: 13:30–15:30

Acquiring data for text-to-speech (TTS) systems is expensive. This typically requires large amounts of training data, which is not available for low-resourced languages. Sometimes small amounts of data can be collected, while often no data may be available at all. This paper presents an acoustic modeling approach utilizing long short-term memory (LSTM) recurrent neural networks (RNN) aimed at partially addressing the language data scarcity problem. Unlike speaker-adaptation systems that aim to preserve speaker similarity across languages, the salient feature of the proposed approach is that, once constructed, the resulting system does not need retraining to cope with previously unseen languages. This is due to a language- and speaker-agnostic model topology and a universal linguistic feature set. Experiments on twelve languages show that the system is able to produce intelligible and sometimes natural output when a language is unseen. We also show that, when small amounts of training data are available, pooling the data sometimes improves the overall intelligibility and naturalness. Finally, we show that sometimes having a multilingual system with no prior exposure to the language is better than building a single-speaker system from small amounts of data for that language.

Nativization of Foreign Names in TTS for Automatic Reading of World News in Swahili

Joseph Mendelson 1, Pilar Oplustil 2, Oliver Watts 2, Simon King 2; 1KTH, Sweden; 2University of Edinburgh, UK
Wed-SS-7-1-16, Time: 13:30–15:30

When a text-to-speech (TTS) system is required to speak world news, a large fraction of the words to be spoken will be proper names originating in a wide variety of languages. Phonetization of these names based on target-language letter-to-sound rules will typically be inadequate. This is detrimental not only during synthesis, when inappropriate phone sequences are produced, but also during training, if the system is trained on data from the same domain. This is because poor phonetization during forced alignment based on hidden Markov models can pollute the whole model set, resulting in degraded alignment even of normal target-language words. This paper presents four techniques designed to address this issue in the context of a Swahili TTS system: automatic transcription of proper names based on a lexicon from a better-resourced language; the addition of a parallel phone set and a special part-of-speech tag exclusively dedicated to proper names; a manually-crafted phone mapping which allows substitutions for potentially more accurate phones in proper names during forced alignment; and the addition in proper names of a grapheme-derived frame-level feature, supplementing the standard phonetic inputs to the acoustic model. We present results from objective and subjective evaluations of systems built using these four techniques.

Wed-SS-7-11 : Special Session: Computational Models in Child Language Acquisition
F11, 13:30–15:30, Wednesday, 23 Aug. 2017
Chairs: Alejandrina Cristia, Kristina Nilsson Björkenstam

Multi-Task Learning for Mispronunciation Detection on Singapore Children's Mandarin Speech

Rong Tong, Nancy F. Chen, Bin Ma; A*STAR, Singapore
Wed-SS-7-11-1, Time: 15:10–15:30

Speech technology for children is more challenging than for adults, because there is a lack of children's speech corpora. Moreover, there is higher heterogeneity in children's speech due to variability in anatomy across age and gender, larger variance in speaking rate and vocal effort, and immature command of word usage, grammar, and linguistic structure. Speech productions from Singapore children possess even more variability due to the multilingual environment in the city-state, causing inter-influences from Chinese languages (e.g., Hokkien and Mandarin), English dialects (e.g., American and British), and Indian languages (e.g., Hindi and Tamil). In this paper, we show that acoustic modeling of children's speech can leverage a larger set of adult data. We compare two data augmentation approaches for children's acoustic modeling. The first approach disregards the child and adult categories and consolidates the two datasets together as one entire set. The second approach is multi-task learning: during training, the acoustic characteristics of adults and children are jointly learned through shared hidden layers of the deep neural network, yet they still retain their respective targets using two distinct softmax layers. We empirically show that the multi-task learning approach outperforms the baseline in both speech recognition and computer-assisted pronunciation training.
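
A minimal PyTorch sketch of the multi-task idea described above: shared hidden layers trained on both adult and child data, with a separate softmax output layer per population. Layer sizes and senone counts are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: shared hidden layers with two population-specific output heads.
# Dimensions are illustrative assumptions, not the paper's configuration.
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    def __init__(self, feat_dim=440, hidden=1024, n_senones=3000):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.adult_head = nn.Linear(hidden, n_senones)   # adult softmax targets
        self.child_head = nn.Linear(hidden, n_senones)   # child softmax targets

    def forward(self, feats, is_child):
        h = self.shared(feats)
        return self.child_head(h) if is_child else self.adult_head(h)
```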

Relating Unsupervised Word Segmentation to Reported Vocabulary Acquisition

Elin Larsen, Alejandrina Cristia, Emmanuel Dupoux; LSCP (UMR 8554), France
Wed-SS-7-11-2, Time: 13:30–13:50

A range of computational approaches have been used to model the discovery of word forms from continuous speech by infants. Typically, these algorithms are evaluated with respect to the ideal 'gold standard' word segmentation and lexicon. These metrics assess how well an algorithm matches the adult state, but may not reflect the intermediate states of the child's lexical development. We set up a new evaluation method based on the correlation between word frequency counts derived from the application of an algorithm onto a corpus of child-directed speech, and the proportion of infants knowing those words, according to parental reports. We evaluate a representative set of 4 algorithms, applied to transcriptions of the Brent corpus, which have been phonologized using either phonemes or syllables as basic units. Results show remarkable variation in the extent to which these 8 algorithm-unit combinations predicted infant vocabulary, with some of these predictions surpassing those derived from the adult gold standard segmentation. We argue that infant vocabulary prediction provides a useful complement to traditional evaluation; for example, the best predictor model was also one of the worst in terms of segmentation score, and there was no clear relationship between token or boundary F-score and vocabulary prediction.
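
A minimal sketch of the evaluation idea, assuming word frequencies from a segmentation algorithm and parental-report proportions are available as dictionaries; the specific correlation measure used in the study is not stated in the abstract, so Spearman correlation here is only an assumption.

```python
# Hedged sketch: correlate algorithm-derived word frequencies with the proportion
# of infants reported to know each word. Data structures and the choice of
# Spearman correlation are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr

def vocabulary_prediction_score(segmented_freq, prop_infants_knowing):
    """Both arguments map word -> value; returns a rank correlation."""
    shared = sorted(set(segmented_freq) & set(prop_infants_knowing))
    freqs = np.log1p([segmented_freq[w] for w in shared])
    known = [prop_infants_knowing[w] for w in shared]
    rho, _ = spearmanr(freqs, known)
    return rho
```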

Modelling the Informativeness of Non-Verbal Cues in Parent-Child Interaction

Mats Wirén, Kristina N. Björkenstam, Robert Östling; Stockholm University, Sweden
Wed-SS-7-11-3, Time: 14:30–14:50

Non-verbal cues from speakers, such as eye gaze and hand positions, play an important role in word learning [1]. This is consistent with the notion that for meaning to be reconstructed, acoustic patterns need to be linked to time-synchronous patterns from at least one other modality [2]. In previous studies of a multimodally annotated corpus of parent-child interaction, we have shown that parents interacting with infants at the early word-learning stage (7–9 months) display a large amount of time-synchronous patterns, but that this behaviour tails off with increasing age of the children [3]. Furthermore, we have attempted to quantify the informativeness of the different non-verbal cues, that is, to what extent they actually help to discriminate between different possible referents, and how critical the timing of the cues is [4]. The purpose of this paper is to generalise our earlier model by quantifying informativeness resulting from non-verbal cues occurring both before and after their associated verbal references.

Computational Simulations of Temporal Vocalization Behavior in Adult-Child Interaction

Ellen Marklund, David Pagmar, Tove Gerholm, Lisa Gustavsson; Stockholm University, Sweden
Wed-SS-7-11-4, Time: 14:10–14:30

The purpose of the present study was to introduce a computational simulation of timing in child-adult interaction. The simulation uses temporal information from real adult-child interactions as the default temporal behavior of two simulated agents. Dependencies between the agents' behavior are added, and how the resulting simulated interactions compare to real interaction data is investigated. In the present study, the real data consisted of transcriptions of a mother interacting with her 12-month-old child, and the simulated data consisted of vocalizations. The first experiment shows that although the two agents generate vocalizations according to the temporal characteristics of the interlocutors in the real data, simulated interaction with no contingencies between the two agents' behavior differs from real interaction data. In the second experiment, a contingency was introduced to the simulation: the likelihood that the adult agent initiated a vocalization if the child agent was already vocalizing. Overall, the simulated data is more similar to the real interaction data when the adult agent is less likely to start speaking while the child agent vocalizes. The results are in line with previous studies on turn-taking in parent-child interaction at comparable ages. This illustrates that computational simulations are useful tools when investigating parent-child interactions.


Approximating Phonotactic Input in Children's Linguistic Environments from Orthographic Transcripts

Sofia Strömbergsson 1, Jens Edlund 2, Jana Götze 1, Kristina Nilsson Björkenstam 3; 1Karolinska Institute, Sweden; 2KTH, Sweden; 3Stockholm University, Sweden
Wed-SS-7-11-5, Time: 13:50–14:10

Child-directed spoken data is the ideal source of support for claims about children's linguistic environments. However, phonological transcriptions of child-directed speech are scarce, compared to sources like adult-directed speech or text data. Acquiring reliable descriptions of children's phonological environments from more readily accessible sources would mean considerable savings of time and money. The first step towards this goal is to quantify the reliability of descriptions derived from such secondary sources.

We investigate how phonological distributions vary across different modalities (spoken vs. written), and across the age of the intended audience (children vs. adults). Using a previously unseen collection of Swedish adult- and child-directed spoken and written data, we combine lexicon look-up and grapheme-to-phoneme conversion to approximate phonological characteristics. The analysis shows distributional differences across datasets both for single phonemes and for longer phoneme sequences. Some of these are predictably attributed to lexical and contextual characteristics of text vs. speech.

The generated phonological transcriptions are remarkably reliable. The differences in phonological distributions between child-directed speech and secondary sources highlight a need for compensatory measures when relying on written data or on adult-directed spoken data, and/or for continued collection of actual child-directed speech in research on children's language environments.

Learning Weakly Supervised Multimodal Phoneme Embeddings

Rahma Chaabouni, Ewan Dunbar, Neil Zeghidour, Emmanuel Dupoux; ENS, France
Wed-SS-7-11-6, Time: 14:50–15:10

Recent works have explored deep architectures for learning multi-modal speech representation (e.g. audio and images, articulation and audio) in a supervised way. Here we investigate the role of combining different speech modalities, i.e. audio and visual information representing the lips' movements, in a weakly supervised way using Siamese networks and lexical same-different side information. In particular, we ask whether one modality can benefit from the other to provide a richer representation for phone recognition in a weakly supervised setting. We introduce mono-task and multi-task methods for merging speech and visual modalities for phone recognition. The mono-task learning consists in applying a Siamese network on the concatenation of the two modalities, while the multi-task learning receives several different combinations of modalities at train time. We show that multi-task learning enhances discriminability for visual and multimodal inputs while minimally impacting auditory inputs. Furthermore, we present a qualitative analysis of the obtained phone embeddings, and show that cross-modal visual input can improve the discriminability of phonological features which are visually discernable (rounding, open/close, labial place of articulation), resulting in representations that are closer to abstract linguistic features than those based on audio only.
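
A minimal PyTorch sketch of the weakly supervised setup: a shared encoder applied to two inputs (e.g. concatenated audio and visual frames), trained with a contrastive loss driven only by lexical same/different side information. The dimensions and the margin are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch of a Siamese encoder with a same/different contrastive loss;
# input dimension, embedding size and margin are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, in_dim=120, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, x_a, x_b):
        return self.net(x_a), self.net(x_b)       # both branches share weights

def same_different_loss(emb_a, emb_b, same, margin=0.5):
    """same: tensor of 1s (same word) and 0s (different word)."""
    dist = 1.0 - F.cosine_similarity(emb_a, emb_b)
    return (same * dist + (1 - same) * F.relu(margin - dist)).mean()
```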

Wed-SS-8-11 : Special Session: Voice Attractiveness
F11, 16:00–18:00, Wednesday, 23 Aug. 2017
Chairs: Melissa Barkat-Defradas, Benjamin Weiss

Introduction
Wed-SS-8-11-10, Time: 16:00–16:10

(No abstract available at the time of publication)

Personalized Quantification of Voice Attractiveness in Multidimensional Merit Space

Yasunari Obuchi; Tokyo University of Technology, Japan
Wed-SS-8-11-1, Time: 16:10–17:40

Voice attractiveness is an indicator that is partly objective and partly subjective. It would be helpful to assume that each voice has its own attractiveness. However, the paired comparison results of human listeners sometimes include inconsistencies. In this paper, we propose a multidimensional mapping scheme of voice attractiveness, which explains the existence of objective merit values of voices and subjective preferences of listeners. Paired comparison is modeled in a probabilistic framework, and the optimal mapping is obtained from the paired comparison results on the maximum likelihood criterion.

The merit values can be estimated from acoustic features using a machine learning framework. We show how the estimation process works using a real database consisting of common Japanese greeting utterances. Experiments using 1- and 2-dimensional merit spaces confirm that the prediction of comparison results from acoustic features becomes more accurate in the 2-dimensional case.
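
As a simplified, one-dimensional illustration of fitting merit values to paired-comparison outcomes by maximum likelihood, the sketch below uses a Bradley-Terry-style model; the paper's multidimensional, listener-dependent mapping is not reproduced here, and the regularization weight is an assumption.

```python
# Hedged sketch: one-dimensional maximum-likelihood merit values from paired
# comparisons (Bradley-Terry style); the ridge weight is an assumption.
import numpy as np
from scipy.optimize import minimize

def fit_merits(n_voices, comparisons):
    """comparisons: list of (winner_index, loser_index) pairs from listening tests."""
    def neg_log_likelihood(m):
        nll = sum(np.log1p(np.exp(-(m[w] - m[l]))) for w, l in comparisons)
        return nll + 1e-3 * np.sum(m ** 2)        # small ridge for identifiability
    result = minimize(neg_log_likelihood, np.zeros(n_voices), method="L-BFGS-B")
    return result.x
```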

The Role of Temporal Amplitude Modulations in the Political Arena: Hillary Clinton vs. Donald Trump

Hans Rutger Bosker; MPI for Psycholinguistics, The Netherlands
Wed-SS-8-11-2, Time: 16:10–17:40

Speech is an acoustic signal with inherent amplitude modulations in the 1–9 Hz range. Recent models of speech perception propose that this rhythmic nature of speech is central to speech recognition. Moreover, rhythmic amplitude modulations have been shown to have beneficial effects on language processing and the subjective impression listeners have of the speaker. This study investigated the role of amplitude modulations in the political arena by comparing the speech produced by Hillary Clinton and Donald Trump in the three presidential debates of 2016.

Inspection of the modulation spectra, revealing the spectral content of the two speakers' amplitude envelopes after matching for overall intensity, showed considerably greater power in Clinton's modulation spectra (compared to Trump's) across the three debates, particularly in the 1–9 Hz range. The findings suggest that Clinton's speech had a more pronounced temporal envelope with rhythmic amplitude modulations below 9 Hz, with a preference for modulations around 3 Hz. This may be taken as evidence for a more structured temporal organization of syllables in Clinton's speech, potentially due to more frequent use of preplanned utterances. Outcomes are interpreted in light of the potential beneficial effects of a rhythmic temporal envelope on intelligibility and speaker perception.
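
A minimal sketch of how such a modulation spectrum can be computed: take the amplitude envelope of the signal, downsample it, and inspect the spectrum of the envelope in the 1–9 Hz band. The envelope extraction and windowing choices below are illustrative assumptions, not the study's exact procedure.

```python
# Hedged sketch: modulation spectrum as the spectrum of the (downsampled)
# amplitude envelope; envelope rate and windowing are illustrative choices.
import numpy as np
from scipy.signal import hilbert, resample

def modulation_spectrum(speech, fs, env_fs=100):
    envelope = np.abs(hilbert(speech))                          # amplitude envelope
    envelope = resample(envelope, int(len(speech) * env_fs / fs))
    envelope = envelope - np.mean(envelope)
    spectrum = np.abs(np.fft.rfft(envelope * np.hanning(len(envelope))))
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / env_fs)
    return freqs, spectrum          # compare power in the 1-9 Hz band across speakers
```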


Perceptual Ratings of Voice Likability Collected Through In-Lab Listening Tests vs. Mobile-Based Crowdsourcing

Laura Fernández Gallardo, Rafael Zequeira Jiménez, Sebastian Möller; T-Labs, Germany
Wed-SS-8-11-3, Time: 16:10–17:40

Human perceptions of speaker characteristics, needed to perform automatic predictions from speech features, have generally been collected by conducting demanding in-lab listening tests under controlled conditions. Concurrently, crowdsourcing has emerged as a valuable approach for running user studies through surveys or quantitative ratings. Micro-task crowdsourcing markets enable the completion of small tasks (commonly of minutes or seconds), rewarding users with micro-payments. This paradigm permits effortless collection of user input from a large and diverse pool of participants at low cost. This paper presents different auditory tests for collecting perceptual voice likability ratings employing a common set of 30 male and female voices. These tests are based on direct scaling and on paired comparisons, and were conducted in the laboratory and via crowdsourcing using micro-tasks. Design considerations are proposed for adapting the laboratory listening tests to a mobile-based crowdsourcing platform to obtain trustworthy listeners' answers. Our likability scores obtained by the different test approaches are highly correlated. This outcome motivates the use of crowdsourcing for future listening tests investigating e.g. speaker characterization, reducing the efforts involved in engaging participants and administering the tests on-site.

Attractiveness of French Voices for German Listeners — Results from Native and Non-Native Read Speech

Jürgen Trouvain, Frank Zimmerer; Universität des Saarlandes, Germany
Wed-SS-8-11-4, Time: 16:10–17:40

This study investigated how the perceived attractiveness of voices was influenced by a foreign language, a foreign accent, and the level of fluency in the foreign language. Stimuli were taken from a French-German corpus of read speech, with German native speakers as raters. Additional factors were stimulus length (syllable or entire sentence) and sex (of the raters and speakers). Results with German native raters reveal that stimuli spanning just a syllable were judged significantly less attractive than those containing a sentence, and that stimuli from French speakers were assessed as more attractive than those of German speakers. This backs the cliché that French has an attractive image for German listeners. An analysis of the best vs. the worst rated sentences suggests that an individual mix of voice quality, disfluency management, prosodic behaviour and pronunciation precision is responsible for the results.

Social Attractiveness in Dialogs

Antje Schweitzer, Natalie Lewandowski, Daniel Duran; Universität Stuttgart, Germany
Wed-SS-8-11-5, Time: 16:10–17:40

This study investigates how acoustic and lexical properties of spontaneous speech in dialogs affect perceived social attractiveness in terms of speaker likeability, friendliness, competence, and self-confidence. We analyze a database of longer spontaneous dialogs between German female speakers and the mutual ratings that dialog partners assigned to one another after every conversation. Thus the ratings reflect long-term impressions based on dialog behavior. Using linear mixed models, we investigate both classical acoustic-prosodic and lexical parameters as well as parameters that capture the degree of speakers' adaptation, or "convergence", of these parameters to each other. Specifically we find that likeability is correlated with the speaker's lexical convergence as well as with her convergence in f0 peak height. Friendliness is significantly related to variation in intensity. For competence, the proportion of positive words in the dialog, variation in shimmer, and overall phonetic convergence are significant correlates. Self-confidence finally is related to several prosodic, phonetic, and lexical adaptation parameters. In some cases, the effect depends on whether interlocutors also had eye contact during their conversation. Taken together, these findings provide evidence that in addition to classical parameters, convergence parameters play an important role in the mutual perception of social attractiveness.

A Gender Bias in the Acoustic-Melodic Features of Charismatic Speech?

Eszter Novák-Tót 1, Oliver Niebuhr 1, Aoju Chen 2; 1University of Southern Denmark, Denmark; 2Universiteit Utrecht, The Netherlands
Wed-SS-8-11-6, Time: 16:10–17:40

Previous studies proved the immense importance of nonverbal skills when it comes to being persuasive and coming across as charismatic. It was also found that men sound more convincing and persuasive (i.e. altogether more charismatic) than women under otherwise comparable conditions. This gender bias is investigated in the present study by analyzing and comparing acoustic-melodic charisma features of male and female business executives. In line with the gender bias in perception, our results show that female CEOs who are judged to be similarly charismatic as their male counterpart(s) produce more and stronger acoustic charisma cues. This suggests that there is a gender bias which is compensated for by making a greater effort on the part of the female speakers.

Pitch Convergence as an Effect of Perceived Attractiveness and Likability

Jan Michalsky, Heike Schoormann; Carl von Ossietzky Universität Oldenburg, Germany
Wed-SS-8-11-7, Time: 16:10–17:40

While there is a growing body of research on which and how pitch features are perceived as attractive or likable, there are few studies investigating how the impression of a speaker as attractive or likable affects the speech behavior of his/her interlocutor. Recent studies have shown that perceived attractiveness and likability may not only have an effect on a speaker's pitch features in isolation but also on prosodic entrainment. It has been shown that how speakers synchronize their pitch features relative to their interlocutor is affected by such impressions. This study investigates pitch convergence, examining whether speakers become more similar over the course of a conversation depending on perceived attractiveness and/or likability. The expected pitch convergence is thereby investigated on two levels, over the entire conversation (globally) as well as turn-wise (locally). The results from a speed dating experiment with 98 mixed-sex dialogues of heterosexual singles show that speakers become more similar globally and locally over time both in register and range. Furthermore, the degree of pitch convergence is greatly affected by perceived attractiveness and likability, with effects differing between attractiveness and likability as well as between the global and the local level.

Does Posh English Sound Attractive?

Li Jiao 1, Chengxia Wang 2, Cristiane Hsu 2, Peter Birkholz 3, Yi Xu 2; 1Tongji University, China; 2University College London, UK; 3Technische Universität Dresden, Germany
Wed-SS-8-11-8, Time: 16:10–17:40

Poshness refers to how much a British English speaker sounds upper class when they talk. Popular descriptions of posh English mostly focus on vocabulary, accent and phonology. This study tests the hypothesis that, as a social index, poshness is also manifested via phonetic properties known to encode vocal attractiveness. Specifically, posh English, because of its impression of being detached, authoritative and condescending, would more closely resemble an attractive male voice than an attractive female voice. In four experiments, we tested this hypothesis by acoustically manipulating Cambridge-accented English utterances by a male and a female speaker through PSOLA resynthesis, and having native speakers of British English judge how posh or attractive each utterance sounds. The manipulated acoustic dimensions are formant dispersion, pitch shift and speech rate. Initial results from the first two experiments showed a trend in the hypothesized direction for the male speaker's utterances. But for the female utterances there was a ceiling effect due to the frequent alternation of speaker gender within the same test session. When the two speakers' utterances were separated by blocks in the third and fourth experiments, clearer support for the main hypothesis was found.

Large-Scale Speaker Ranking from Crowdsourced Pairwise Listener Ratings

Timo Baumann; Carnegie Mellon University, USA
Wed-SS-8-11-9, Time: 16:10–17:40

Speech quality and likability is a multi-faceted phenomenon consisting of a combination of perceptory features that cannot easily be computed or weighted automatically. Yet, it is often easy to decide which of two voices one likes better, even though it would be hard to describe why, or to name the underlying basic perceptory features. Although likability is inherently subjective and individual preferences differ frequently, generalizations are useful and there is often a broad intersubjective consensus about whether one speaker is more likable than another. However, breaking down likability rankings into pairwise comparisons leads to a quadratic explosion of rating pairs. We present a methodology and software to efficiently create a likability ranking for many speakers from crowdsourced pairwise likability ratings. We collected pairwise likability ratings for many (>220) speakers from many raters (>160) and turned these ratings into one likability ranking. We investigate the stability of the resulting speaker ranking under different conditions: limiting the number of ratings and the dependence on rater and speaker characteristics. We also analyze the ranking with respect to acoustic correlates to find out what factors influence likability. We publish our ranking and the underlying ratings in order to facilitate further research.

Discussion
Wed-SS-8-11-11, Time: 17:40–18:00

(No abstract available at the time of publication)

Wed-O-6-1 : Speech Production and Physiology
Aula Magna, 10:00–12:00, Wednesday, 23 Aug. 2017
Chairs: Felicitas Kleber, Elizabeth Godoy

Aerodynamic Features of French Fricatives

Rosario Signorello 1, Sergio Hassid 2, Didier Demolin 1; 1LPP (UMR 7018), France; 2Hôpital Erasme, Belgium
Wed-O-6-1-1, Time: 10:00–10:20

The present research investigates the aerodynamic features of French fricative consonants using direct measurement of subglottal air pressure by tracheal puncture (Ps) synchronized with intraoral air pressure (Po), oral airflow (Oaf) and acoustic measurements. Data were collected from four Belgian French speakers' productions of CVCV pseudowords including the voiceless and voiced fricatives [f, v, s, z, S, Z]. The goals of this study are: (i) to predict the starting, central, and releasing points of frication based on the measurements of Ps, Po, and Oaf; (ii) to compare voiceless and voiced fricatives and their places of articulation; and (iii) to provide reference values for the aerodynamic features of fricatives for further linguistic, clinical, physical and computational modeling research.

Inter-Speaker Variability: Speaker Normalisation and Quantitative Estimation of Articulatory Invariants in Speech Production for French

Antoine Serrurier 1, Pierre Badin 2, Louis-Jean Boë 2, Laurent Lamalle 3, Christiane Neuschaefer-Rube 1; 1Uniklinik RWTH Aachen, Germany; 2GIPSA, France; 3IRMaGe, France
Wed-O-6-1-2, Time: 10:20–10:40

Speech production can be analysed in terms of universal articulatory-acoustic phonemic units shared between speakers. However, morphological differences between speakers and idiosyncratic articulatory strategies lead to large inter-speaker articulatory variability. Relationships between strategy and morphology have already been pinpointed in the literature. This study thus aims at generalising existing results on a larger database for the entire vocal tract (VT) and at quantifying phoneme-specific inter-speaker articulatory invariants. Midsagittal MRI of 11 French speakers for 62 vowels and consonants were recorded and VT contours manually edited. A procedure of normalisation of VT contours between speakers, based on the use of mean VT contours, led to an overall reduction of inter-speaker VT contour variance of 88%. By contrast, the sagittal function (i.e. the transverse sagittal distance along the VT midline), which is the main determinant of the acoustic output, had an overall amplitude variance decrease of only 37%, suggesting that the speakers adapt their strategy to their morphology to achieve proper acoustic goals. Moreover, articulatory invariants were identified on the sagittal variance distribution along the VT as the regions with lower variability. These regions correspond to the classical places of articulation and are associated with higher acoustic sensitivity function levels.

Comparison of Basic Beatboxing Articulations Between Expert and Novice Artists Using Real-Time Magnetic Resonance Imaging

Nimisha Patil, Timothy Greer, Reed Blaylock, Shrikanth S. Narayanan; University of Southern California, USA
Wed-O-6-1-3, Time: 10:40–11:00

Real-time Magnetic Resonance Imaging (rtMRI) was used to examine mechanisms of sound production in five beatboxers. rtMRI was found to be an effective tool with which to study the articulatory dynamics of this form of human vocal production; it provides a dynamic view of the entire midsagittal vocal tract at a frame rate (83 fps) sufficient to observe the movement and coordination of critical articulators. The artists' repertoires included percussion elements generated using a wide range of articulatory and airstream mechanisms. Analysis of three common beatboxing sounds resulted in the finding that advanced beatboxers produce stronger ejectives and have greater control over different airstreams than novice beatboxers, to enhance the quality of their sounds. No difference in production mechanisms between males and females was observed. These data offer insights into the ways in which articulators can be trained and used to achieve specific acoustic goals.


Speaker-Specific Biomechanical Model-Based Investigation of a Simple Speech Task Based on Tagged-MRI

Keyi Tang 1, Negar M. Harandi 1, Jonghye Woo 2, Georges El Fakhri 2, Maureen Stone 3, Sidney Fels 1; 1University of British Columbia, Canada; 2Massachusetts General Hospital, USA; 3University of Maryland, USA
Wed-O-6-1-4, Time: 11:00–11:20

We create two 3D biomechanical speaker models matched to medical image data of two healthy English speakers. We use a new, hybrid registration technique that morphs a generic 3D biomechanical model to medical images. The generic model of the head and neck includes jaw, tongue, soft palate, epiglottis, lips and face, and is capable of simulating upper-airway biomechanics. We use cine and tagged magnetic resonance (MR) images captured while our volunteers repeated a simple utterance (/@-gis/) synchronized to a metronome. We simulate our models based on internal tongue tissue trajectories that we extract from tagged MR images and use in an inverse solver. For areas without tracked data points, the registered generic model moves based on the computed muscle activations. Our modeling efforts include a wide range of speech organs, illustrating the coupling complexity of the oral anatomy during simple speech utterances.

Sounds of the Human Vocal Tract

Reed Blaylock, Nimisha Patil, Timothy Greer, Shrikanth S. Narayanan; University of Southern California, USA
Wed-O-6-1-5, Time: 11:20–11:40

Previous research suggests that beatboxers only use sounds that exist in the world's languages. This paper provides evidence to the contrary, showing that beatboxers use non-linguistic articulations and airstream mechanisms to produce many sound effects that have not been attested in any language. An analysis of real-time magnetic resonance videos of beatboxing reveals that beatboxers produce non-linguistic articulations such as ingressive retroflex trills and ingressive lateral bilabial trills. In addition, beatboxers can use both lingual egressive and pulmonic ingressive airstreams, neither of which have been reported in any language.

The results of this study affect our understanding of the limits of the human vocal tract, and address questions about the mental units that encode music and phonological grammar.

A Simulation Study on the Effect of Glottal Boundary Conditions on Vocal Tract Formants

Yasufumi Uezu, Tokihiko Kaburagi; Kyushu University, Japan
Wed-O-6-1-6, Time: 11:40–12:00

In the source-filter theory, complete closure of the glottis is assumed as a glottal boundary condition. However, this assumption of glottal closure is not strictly satisfied in actual utterances. Therefore, it is considered that acoustic features of the glottis and the subglottal region may affect vocal tract formants. In this study, we investigated how differences in the glottal boundary conditions affect vocal tract formants through speech synthesis simulation using a speech production model. We synthesized five Japanese vowels using the speech production model, taking the source-filter interaction into consideration. This model consisted of a polynomial glottal area model and an acoustic tube model formed by the concatenation of the vocal tract, the glottis, and the subglottis. From the results, it was found that the first formant frequency was affected more strongly by the boundary conditions, and also that the open quotient may have a stronger effect on the formants than the maximum glottal width. In addition, formant frequencies were also affected more strongly by subglottal impedance when the maximum glottal area was wider.

Wed-O-6-4 : Speech and Harmonic Analysis
B4, 10:00–12:00, Wednesday, 23 Aug. 2017
Chairs: Abeer Alwan, Franz Pernkopf

A Robust and Alternative Approach to Zero Frequency Filtering Method for Epoch Extraction

P. Gangamohan, B. Yegnanarayana; IIIT Hyderabad, India
Wed-O-6-4-1, Time: 10:00–10:20

During production of voiced speech, there exist impulse-like excitations due to abrupt closure of the vocal folds. These impulse-like excitations are often referred to as epochs or glottal closure instants (GCIs). The zero frequency filtering (ZFF) method exploits the properties of impulse-like excitation by passing a speech signal through a resonator whose pole pair is located at 0 Hz. As the resonator is unstable, polynomial growth/decay is observed in the filtered signal, thus requiring a trend removal operation. It is observed that the length of the window for the trend removal operation is critical in speech signals where there are more fluctuations in the fundamental frequency (F0). In this paper, a simple finite impulse response (FIR) implementation is proposed. The FIR filter is designed by placing a large number of zeros at fs/2 Hz (fs represents the sampling frequency), close to the unit circle, in the z-plane. Experimental results show that the proposed method is robust and computationally less complex when compared to the ZFF method.
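
A minimal numerical sketch of the idea, assuming the filter is built by stacking many zeros just inside the unit circle at z = -1 (i.e. at fs/2) and applying it to the differenced speech signal. The number of zeros and their radius are illustrative assumptions, not the paper's design values.

```python
# Hedged sketch: FIR filter with many zeros near z = -1 (fs/2) as an alternative
# to the unstable 0 Hz resonator; n_zeros and radius are illustrative assumptions.
import numpy as np
from scipy.signal import lfilter

def fir_zero_frequency_filter(speech, n_zeros=256, radius=0.999):
    x = np.diff(speech, prepend=speech[0])      # difference to suppress slow drift
    h = np.array([1.0])
    for _ in range(n_zeros):                    # expand (1 + radius * z^-1)^n_zeros
        h = np.convolve(h, [1.0, radius])
    y = lfilter(h, [1.0], x)
    return y / (np.max(np.abs(y)) + 1e-12)      # normalized filtered signal
```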

Improving YANGsaf F0 Estimator with Adaptive Kalman Filter

Kanru Hua; University of Illinois at Urbana-Champaign, USA
Wed-O-6-4-2, Time: 10:20–10:40

We present improvements to the refinement stage of YANGsaf [1] (Yet Another Glottal source analysis framework), a recently published F0 estimation algorithm by Kawahara et al., for noisy/breathy speech signals. The baseline system, based on time-warping and a weighted average of multi-band instantaneous frequency estimates, is still sensitive to additive noise when none of the harmonics provide reliable frequency estimates at low SNR. We alleviate this problem by calibrating the weighted averaging process based on statistics gathered from a Monte-Carlo simulation, and applying Kalman filtering to the refined F0 trajectory with time-varying measurement and process distributions. The improved algorithm, adYANGsaf (adaptive Yet Another Glottal source analysis framework), achieves significantly higher accuracy and a smoother F0 trajectory on noisy speech while retaining its accuracy on clean speech, with little computational overhead introduced.
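
A minimal scalar Kalman-filter sketch of the smoothing step, assuming a random-walk F0 state and a per-frame measurement variance (larger where the harmonics are judged unreliable); the paper's Monte-Carlo-calibrated, time-varying distributions are not reproduced here, and the process variance is an assumption.

```python
# Hedged sketch: scalar Kalman filter over an F0 track with per-frame measurement
# variance; the process variance value is an illustrative assumption.
import numpy as np

def kalman_smooth_f0(f0_raw, meas_var, process_var=4.0):
    """Random-walk state model: f0[t] = f0[t-1] + noise, observed with meas_var[t]."""
    f0 = np.zeros_like(f0_raw)
    x, p = f0_raw[0], meas_var[0]
    for t, (z, r) in enumerate(zip(f0_raw, meas_var)):
        p = p + process_var              # predict
        k = p / (p + r)                  # Kalman gain
        x = x + k * (z - x)              # update with the measurement z
        p = (1.0 - k) * p
        f0[t] = x
    return f0
```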

A Spectro-Temporal Demodulation Technique for Pitch Estimation

Jitendra Kumar Dhiman, Nagaraj Adiga, Chandra Sekhar Seelamantula; Indian Institute of Science, India
Wed-O-6-4-3, Time: 10:40–11:00

We consider a two-dimensional demodulation framework for spectro-temporal analysis of the speech signal. We construct narrowband (NB) speech spectrograms, and demodulate them using the Riesz transform, which is a two-dimensional extension of the Hilbert transform. The demodulation results in a time-frequency envelope (amplitude modulation or AM) and a time-frequency carrier (frequency modulation or FM). The AM corresponds to the vocal tract and is referred to as the vocal tract spectrogram. The FM corresponds to the underlying excitation and is referred to as the carrier spectrogram. The carrier spectrogram exhibits a high degree of time-frequency consistency for voiced sounds. For unvoiced sounds, such a structure is lacking. In addition, the carrier spectrogram reflects the fundamental frequency (F0) variation of the speech signal. We develop a technique to determine the F0 from the carrier spectrogram. The time-frequency consistency is used to determine which time-frequency regions correspond to voiced segments. Comparisons with state-of-the-art F0 estimation algorithms show that the proposed F0 estimator has high accuracy for telephone channel speech and is robust to noise.

Robust Method for Estimating F0 of Complex Tone Based on Pitch Perception of Amplitude Modulated Signal

Kenichiro Miwa, Masashi Unoki; JAIST, Japan
Wed-O-6-4-4, Time: 11:00–11:20

Estimating the fundamental frequency (F0) of a target sound in noisy reverberant environments is a challenging issue not only in sound analysis/synthesis but also in sound enhancement. This paper proposes a method for robustly and accurately estimating the F0 of a time-variant complex tone on the basis of an amplitude modulation/demodulation technique. It is based on the mechanism of pitch perception of amplitude-modulated signals and the framework of power envelope restoration based on the concept of the modulation transfer function. Computer simulations were carried out to investigate the accuracy and robustness of the proposed method for estimating the F0 in heavily noisy reverberant environments. The comparative results revealed that the percentage correct rates of the F0s estimated using five recent methods (TEMPO2, YIN, PHIA, CmpCep, and SWIPE') were drastically reduced as the SNR decreased and the reverberation time increased. The results also demonstrated that the proposed method robustly and accurately estimated the F0 in both heavily noisy and reverberant environments.

Low-Complexity Pitch Estimation Based on Phase Differences Between Low-Resolution Spectra

Simon Graf 1, Tobias Herbig 1, Markus Buck 1, Gerhard Schmidt 2; 1Nuance Communications, Germany; 2Christian-Albrechts-Universität zu Kiel, Germany
Wed-O-6-4-5, Time: 11:20–11:40

Detection of voiced speech and estimation of the pitch frequency are important tasks for many speech processing algorithms. Pitch information can be used, e.g., to reconstruct voiced speech corrupted by noise.

In automotive environments, driving noise especially affects voiced speech portions in the lower frequencies. Pitch estimation is therefore important, e.g., for in-car-communication systems. Such systems amplify the driver's voice and allow for convenient conversations with backseat passengers. Low latency is required for this application, which requires the use of short window lengths and short frame shifts between consecutive frames. Conventional pitch estimation techniques, however, rely on long windows that exceed the pitch period of human speech. In particular, male speakers' low pitch frequencies are difficult to resolve.

In this publication, we introduce a technique that approaches pitch estimation from a different perspective. The pitch information is extracted based on phase differences between multiple low-resolution spectra instead of a single long window. The technique benefits from the high temporal resolution provided by the short frame shift and is capable of dealing with the low spectral resolution caused by short window lengths. Using the new approach, even very low pitch frequencies can be estimated very efficiently.
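
A minimal sketch of the underlying phase-difference principle: the frequency of a spectral bin is refined from the phase advance between two short frames a few milliseconds apart. This only illustrates the principle; the paper's full estimator combining multiple low-resolution spectra is not reproduced here.

```python
# Hedged sketch: refine the frequency of one FFT bin from the phase advance
# between two short frames separated by `hop` samples (phase-vocoder style).
import numpy as np

def refine_bin_frequency(frame_a, frame_b, bin_idx, fs, hop):
    n = len(frame_a)
    spec_a, spec_b = np.fft.rfft(frame_a), np.fft.rfft(frame_b)
    expected = 2.0 * np.pi * bin_idx * hop / n                # nominal phase advance
    delta = np.angle(spec_b[bin_idx]) - np.angle(spec_a[bin_idx]) - expected
    delta = np.angle(np.exp(1j * delta))                      # wrap to [-pi, pi]
    return (bin_idx / n + delta / (2.0 * np.pi * hop)) * fs   # refined frequency in Hz
```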

Harvest: A High-Performance Fundamental Frequency Estimator from Speech Signals

Masanori Morise; University of Yamanashi, Japan
Wed-O-6-4-6, Time: 11:40–12:00

A fundamental frequency (F0) estimator named Harvest is described. The unique points of Harvest are that it can obtain a reliable F0 contour and reduce errors in which voiced sections are wrongly identified as unvoiced. It consists of two steps: estimation of F0 candidates and generation of a reliable F0 contour on the basis of these candidates. In the first step, the algorithm uses fundamental component extraction by many band-pass filters with different center frequencies and obtains the basic F0 candidates from the filtered signals. After that, the basic F0 candidates are refined and scored by using the instantaneous frequency, and then several F0 candidates in each frame are estimated. Since the frame-by-frame processing based on fundamental component extraction is not robust against temporally local noise, a connection algorithm using neighboring F0s is used in the second step. The connection takes advantage of the fact that the F0 contour does not change precipitously in a short interval. We carried out an evaluation using two speech databases with electroglottograph (EGG) signals to compare Harvest with several state-of-the-art algorithms. Results showed that Harvest achieved the best performance of all algorithms.

Wed-O-6-6 : Dialog and Prosody
C6, 10:00–12:00, Wednesday, 23 Aug. 2017
Chairs: Julia Hirschberg, Rolf Carlson

Prosodic Event Recognition Using Convolutional Neural Networks with Context Information

Sabrina Stehwien, Ngoc Thang Vu; Universität Stuttgart, Germany
Wed-O-6-6-1, Time: 10:00–10:20

This paper demonstrates the potential of convolutional neural networks (CNN) for detecting and classifying prosodic events on words, specifically pitch accents and phrase boundary tones, from frame-based acoustic features. Typical approaches use not only feature representations of the word in question but also its surrounding context. We show that adding position features indicating the current word benefits the CNN. In addition, this paper discusses the generalization from a speaker-dependent modelling approach to a speaker-independent setup. The proposed method is simple and efficient and yields strong results not only in speaker-dependent but also speaker-independent cases.
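
A minimal PyTorch sketch of such a model: a 1-D CNN over frame-based acoustic features spanning the current word and its neighbours, with an extra binary channel marking the frames of the current word (the position feature). Feature dimensions and layer sizes are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch: CNN over an acoustic feature window plus a binary position
# channel marking the current word; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ProsodicEventCNN(nn.Module):
    def __init__(self, n_acoustic=6, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_acoustic + 1, 100, kernel_size=5), nn.ReLU(),
            nn.Conv1d(100, 100, kernel_size=5), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                 # pool over time
        )
        self.out = nn.Linear(100, n_classes)

    def forward(self, feats, position_mask):
        # feats: (batch, n_acoustic, frames); position_mask: (batch, frames) of 0/1
        x = torch.cat([feats, position_mask.unsqueeze(1)], dim=1)
        return self.out(self.conv(x).squeeze(-1))
```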

Prosodic Facilitation and Interference While Judging on the Veracity of Synthesized Statements

Ramiro H. Gálvez 1, Štefan Benuš 2, Agustín Gravano 1, Marian Trnka 3; 1Universidad de Buenos Aires, Argentina; 2UKF, Slovak Republic; 3Slovak Academy of Sciences, Slovak Republic
Wed-O-6-6-2, Time: 10:20–10:40

Two primary sources of information are provided in human speech. On the one hand, the verbal channel encodes linguistic content, while on the other hand, the vocal channel transmits paralinguistic information, mainly through prosody. In line with several studies that induce a conflict between these two channels to better understand the role of prosody, we conducted an experiment in which subjects had to listen to a series of statements synthesized with varying prosody and indicate if they believed them to be true or false. We find evidence suggesting that acoustic/prosodic (a/p) features of the synthesized statements affect response times (a well-known proxy for cognitive load). Our results suggest that prosody in synthesized speech may play a role of either facilitation or interference when subjects judge the truthfulness of a statement. Furthermore, we find that this pattern is amplified when the a/p features of the synthesized statements are analyzed relative to the subjects' own a/p features. This suggests that the entrainment of TTS voices has serious implications for the perceived trustworthiness of the system's skills.

An Investigation of Pitch Matching Across Adjacent Turns in a Corpus of Spontaneous German

Margaret Zellers, Antje Schweitzer; Universität Stuttgart, Germany
Wed-O-6-6-3, Time: 10:40–11:00

Speakers in conversations may adapt their turn pitch relative to that of preceding turns to signal alignment with their interlocutor. However, the reference frame for pitch matching across turns is still unclear. Researchers studying pitch in the context of conversation have argued for an initializing approach, in which turn pitch must be judged relative to pitch in preceding turns. However, perceptual studies have indicated that listeners are able to reliably identify the location of pitch values within an individual speaker's range; that is, even without conversational context, they are able to normalize to speakers. This would imply that speakers might match normalized pitch instead of absolute pitch. Using a combined quantitative-qualitative approach, we investigate the relationship between pitch in adjacent turns in spontaneous German conversation. We use two different methods of evaluating pitch in adjacent turns, reflecting normalizing and initializing approaches respectively. We find that the results are well correlated with conversational participants' evaluation of the conversation. Furthermore, evaluating locations with matched or mismatched pitch can help distinguish between blind and face-to-face conversational situations, as well as identify locations where specific discourse strategies (such as tag questions) have been deployed.

The Relationship Between F0 Synchrony and Speech Convergence in Dyadic Interaction

Sankar Mukherjee 1, Alessandro D'Ausilio 1, Noël Nguyen 2, Luciano Fadiga 1, Leonardo Badino 1; 1Istituto Italiano di Tecnologia, Italy; 2LPL (UMR 7309), France
Wed-O-6-6-4, Time: 11:00–11:20

Speech accommodation happens when two people engage in verbal conversation. In this paper two types of accommodation are investigated: one dependent on cognitive, physiological, functional and social constraints (Convergence), the other dependent on linguistic and paralinguistic factors (Synchrony). Convergence refers to the situation when two speakers' speech characteristics move towards a common point. Synchrony happens if speakers' prosodic features become correlated over time. Here we analyze relations between the two phenomena at the single-word level. Although the calculation of Synchrony is fairly straightforward, measuring Convergence is more problematic, as shown by a long history of debates on how to define it. In this paper we consider Convergence as an emergent behavior and investigate it by developing a robust and automatic method based on a Gaussian Mixture Model (GMM). Our results show that high Synchrony of F0 between two speakers leads to a greater amount of Convergence. This provides robust support for the idea that Synchrony and Convergence are interrelated processes, particularly in female participants.
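
One simple way to operationalize a GMM-based convergence measure, shown below as a hedged sketch and not necessarily the paper's exact procedure: fit a GMM to one speaker's acoustic frames and track the average log-likelihood of the partner's word-level features over the conversation; an upward trend suggests movement towards that speaker.

```python
# Hedged sketch: GMM fitted to one speaker's frames; the partner's per-word
# average log-likelihood under that GMM is tracked over time.
import numpy as np
from sklearn.mixture import GaussianMixture

def convergence_trajectory(ref_frames, partner_word_frames, n_components=16):
    """ref_frames: (N, d) array; partner_word_frames: list of (n_i, d) arrays in time order."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(ref_frames)
    return np.array([gmm.score(words) for words in partner_word_frames])
```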

The Role of Linguistic and Prosodic Cues on the Prediction of Self-Reported Satisfaction in Contact Centre Phone Calls

Jordi Luque, Carlos Segura, Ariadna Sánchez, Martí Umbert, Luis Angel Galindo; Telefónica I+D, Spain
Wed-O-6-6-5, Time: 11:20–11:40

Call centre data is typically collected by organizations and corporations in order to ensure the quality of service, supporting for example mining capabilities for monitoring customer satisfaction. In this work, we analyze the significance of various acoustic features extracted from customer-agent spoken interactions in predicting self-reported satisfaction by the customer. We also investigate whether speech prosodic features can deliver complementary information to speech transcriptions provided by an ASR. We explore the possibility of using a deep neural architecture to perform early feature fusion on both prosodic and linguistic information. Convolutional Neural Networks are trained on a combination of word embedding and acoustic features for the binary classification task of "low" and "high" satisfaction prediction. We conducted our experiments analysing real call-centre interactions of a large corporation in a Spanish-speaking country. Our experiments show that linguistic features can predict self-reported satisfaction more accurately than those based on prosodic and conversational descriptors. We also find that dialog turn-level conversational features generally outperform frame-level signal descriptors. Finally, the fusion of linguistic and prosodic features achieves the best performance in our experiments, suggesting the complementarity of the information conveyed by each set of behavioral representations.

Cross-Linguistic Study of the Production of Turn-Taking Cues in American English and Argentine Spanish

Pablo Brusco, Juan Manuel Pérez, Agustín Gravano; Universidad de Buenos Aires, Argentina
Wed-O-6-6-6, Time: 11:40–12:00

We present the results of a series of machine learning experiments aimed at exploring the differences and similarities in the production of turn-taking cues in American English and Argentine Spanish. An analysis of prosodic features automatically extracted from 21 dyadic conversations (12 En, 9 Sp) revealed that, when signaling Holds, speakers of both languages tend to use roughly the same combination of cues, characterized by a sustained final intonation, a shorter duration of turn-final inter-pausal units, and a distinct voice quality. However, in speech preceding Smooth Switches or Backchannels, we observe the existence of the same set of prosodic turn-taking cues in both languages, although the ways in which these cues are combined together to form complex signals differ. Still, we find that these differences do not degrade the performance of cross-linguistic systems for automatically detecting turn-taking signals below chance. These results are relevant to the construction of multilingual spoken dialogue systems, which need to adapt not only their ASR modules but also the way prosodic turn-taking cues are synthesized and recognized.


Wed-O-6-8 : Social Signals, Styles, and Interaction
D8, 10:00–12:00, Wednesday, 23 Aug. 2017
Chairs: Khiet Truong, Nigel Ward

Emotional Features for Speech Overlaps Classification

Olga Egorow, Andreas Wendemuth; Otto-von-Guericke-Universität Magdeburg, Germany
Wed-O-6-8-1, Time: 10:00–10:20

One interesting phenomenon of natural conversation is overlapping speech. Besides causing difficulties in automatic speech processing, such overlaps carry information on the state of the overlapper: competitive overlaps (i.e. "interruptions") can signal disagreement or the feeling of being overlooked, and cooperative overlaps (i.e. supportive interjections) can signal agreement and interest. These hints can be used to improve human-machine interaction. In this paper we present an approach for automatic classification of competitive and cooperative overlaps using the emotional content of the speakers' utterances before and after the overlap. For these experiments, we use real-world data from human-human interactions in call centres. We also compare our approach to standard acoustic classification on the same data and come to the conclusion that emotional features are clearly superior to acoustic features for this task, resulting in an unweighted average f-measure of 71.9%. But we also find that acoustic features should not be entirely neglected: using a late fusion procedure, we can further improve the unweighted average f-measure by 2.6%.

Computing Multimodal Dyadic Behaviors During Spontaneous Diagnosis Interviews Toward Automatic Categorization of Autism Spectrum Disorder

Chin-Po Chen 1, Xian-Hong Tseng 1, Susan Shur-Fen Gau 2, Chi-Chun Lee 1; 1National Tsing Hua University, Taiwan; 2National Taiwan University, Taiwan
Wed-O-6-8-2, Time: 10:20–10:40

Autism spectrum disorder (ASD) is a highly prevalent neurodevelopmental disorder often characterized by social communicative deficits and restricted repetitive interests. The heterogeneous nature of ASD in its behavior manifestations encompasses broad syndromes such as Classical Autism (AD), High-functioning Autism (HFA), and Asperger syndrome (AS). In this work, we compute a variety of multimodal behavior features, including body movements, acoustic characteristics, and turn-taking event dynamics, of the participant, the investigator and the interaction between the two, directly from audio-video recordings, by leveraging the Autism Diagnostic Observation Schedule (ADOS) as a clinically valid behavior data elicitation technique. Several of these signal-derived behavioral measures show statistically significant differences among the three syndromes. Our analyses indicate that these features may be pointing to underlying differences in the behavior characterizations of social functioning between AD, AS, and HFA, corroborating some of the previous literature. Further, our signal-derived behavior measures achieve competitive, sometimes higher, recognition accuracies in discriminating between the three syndromes of ASD when compared to the investigator's clinical ratings of the participant's social and communicative behaviors during ADOS.

Deriving Dyad-Level Interaction Representation Using Interlocutors Structural and Expressive Multimodal Behavior Features

Yun-Shao Lin, Chi-Chun Lee; National Tsing Hua University, Taiwan
Wed-O-6-8-3, Time: 10:40–11:00

The overall interaction atmosphere is often a result of complex interplay between individual interlocutors' behavior expressions and the joint manifestation of dyadic interaction dynamics. There is very limited work, if any, that has computationally analyzed a human interaction at the dyad level. Hence, in this work, we propose to compute an extensive novel set of features representing multi-faceted aspects of a dyadic interaction. These features are grouped into two broad categories: expressive and structural behavior dynamics, where each captures information about within-speaker behavior manifestation, inter-speaker behavior dynamics, and durational and transitional statistics, providing holistic behavior quantifications at the dyad level. We carry out an experiment of recognizing targeted affective atmosphere using the proposed expressive and structural behavior dynamics features derived from audio and video modalities. Our experiment shows that the inclusion of both expressive and structural behavior dynamics is essential in achieving promising recognition accuracies across six different classes (72.5%), where structural-based features improve the recognition rates on the classes of sad and surprise. Further analyses reveal important aspects of multimodal behavior dynamics within dyadic interactions that are related to the affective atmospheric scene.

Spotting Social Signals in Conversational Speech over IP: A Deep Learning Perspective

Raymond Brueckner 1, Maximilian Schmitt 2, Maja Pantic 3, Björn Schuller 2; 1Technische Universität München, Germany; 2Universität Passau, Germany; 3Imperial College London, UK
Wed-O-6-8-4, Time: 11:00–11:20

The automatic detection and classification of social signals is an important task, given the fundamental role nonverbal behavioral cues play in human communication. We present the first cross-lingual study on the detection of laughter and fillers in conversational and spontaneous speech collected 'in the wild' over IP (internet protocol). Further, this is the first comparison of LSTM and GRU networks to shed light on their performance differences. We report frame-based results in terms of the unweighted-average area-under-the-curve (UAAUC) measure and will shortly discuss its suitability for this task. In the mono-lingual setup our best deep BLSTM system achieves 87.0% and 86.3% UAAUC for English and German, respectively. Interestingly, the cross-lingual results are only slightly lower, yielding 83.7% for a system trained on English, but tested on German, and 85.0% in the opposite case. We show that LSTM and GRU architectures are valid alternatives for, e.g., on-line and compute-sensitive applications, since their application incurs a relative UAAUC decrease of only approximately 5% with respect to our best systems. Finally, we apply additional smoothing to correct for erroneous spikes and drops in the posterior trajectories to obtain an additional gain in all setups.
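
A minimal sketch of the UAAUC measure as the unweighted mean of the per-class frame-level AUCs for the two target classes; the data layout assumed here (a label array and per-class score arrays) is an illustrative assumption.

```python
# Hedged sketch: unweighted-average AUC over the laughter and filler classes.
import numpy as np
from sklearn.metrics import roc_auc_score

def uaauc(frame_labels, frame_posteriors, classes=("laughter", "filler")):
    """frame_labels: numpy array of class names per frame; frame_posteriors: dict class -> scores."""
    aucs = [roc_auc_score(frame_labels == c, frame_posteriors[c]) for c in classes]
    return float(np.mean(aucs))
```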

Optimized Time Series Filters for Detecting Laughter and Filler Events

Gábor Gosztolya; MTA-SZTE RGAI, Hungary
Wed-O-6-8-5, Time: 11:20–11:40

Social signal detection, that is, the task of identifying vocalizations like laughter and filler events, is a popular task within computational paralinguistics. Recent studies have shown that besides applying state-of-the-art machine learning methods, it is worth making use of the contextual information and adjusting the frame-level scores based on the local neighbourhood. In this study we apply a weighted average time series smoothing filter for laughter and filler event identification, and set the weights using a state-of-the-art optimization method, namely the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Our results indicate that this is a viable way of improving the Area Under the Curve (AUC) scores: our resulting scores are much better than the accuracy scores of the raw likelihoods produced by Deep Neural Networks trained on three different feature sets, and we also significantly outperform standard time series filters as well as DNNs used for smoothing. Our score achieved on the test set of a public English database containing spontaneous mobile phone conversations is the highest one published so far that was realized by feed-forward techniques.
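As a rough illustration of the smoothing idea described above (not the authors' exact setup), the sketch below tunes the weights of a frame-level moving-average filter with CMA-ES so that the smoothed laughter/filler posteriors maximise AUC on a development set; the `cma` and `scikit-learn` packages, the filter length and all array contents are assumptions.

```python
# Hypothetical sketch: tune a frame-level smoothing filter with CMA-ES to maximise AUC.
import numpy as np
import cma                                   # third-party package: pip install cma
from sklearn.metrics import roc_auc_score

def smooth(posteriors, weights):
    """Weighted moving average over the local neighbourhood of each frame."""
    w = np.asarray(weights)
    w = w / (np.abs(w).sum() + 1e-8)         # keep the filter normalised
    return np.convolve(posteriors, w, mode="same")

def objective(weights, posteriors, labels):
    # CMA-ES minimises, so return the negative AUC of the smoothed scores.
    return -roc_auc_score(labels, smooth(posteriors, weights))

# Placeholder development data: raw DNN posteriors and 0/1 frame labels.
dev_post = np.random.rand(10000)
dev_lab = (np.random.rand(10000) > 0.9).astype(int)

filter_len = 11                              # frames in the local neighbourhood (assumed)
x0 = np.ones(filter_len) / filter_len        # start from a plain moving average
es = cma.CMAEvolutionStrategy(x0, 0.1, {"maxiter": 50})
while not es.stop():
    candidates = es.ask()
    es.tell(candidates, [objective(w, dev_post, dev_lab) for w in candidates])
best_w = es.result.xbest
print("dev AUC with tuned filter:", roc_auc_score(dev_lab, smooth(dev_post, best_w)))
```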

Visual, Laughter, Applause and Spoken Expression Features for Predicting Engagement Within TED Talks

Fasih Haider 1, Fahim A. Salim 1, Saturnino Luz 2, Carl Vogel 1, Owen Conlan 1, Nick Campbell 1; 1Trinity College Dublin, Ireland; 2University of Edinburgh, UK
Wed-O-6-8-6, Time: 11:40–12:00

There is an enormous amount of audio-visual content availableon-line in the form of talks and presentations. The prospectiveusers of the content face difficulties in finding the right content forthem. However, automatic detection of interesting (engaging vs.non-engaging) content can help users to find the videos accordingto their preferences. It can also be helpful for a recommendationand personalised video segmentation system. This paper presentsa study of engagement based on TED talks (1338 videos) which arerated by on-line viewers (users). It proposes novel models to predictthe user’s (on-line viewers) engagement using high-level visual fea-tures (camera angles), the audience’s laughter and applause, and thepresenter’s speech expressions. The results show that these featurescontribute towards the prediction of user engagement in these talks.However, finding the engaging speech expressions can also help asystem in making summaries of TED Talks (video summarization)and creating feedback to presenters about their speech expressionsduring talks.

Wed-O-6-10 : Acoustic Model Adaptation
E10, 10:00–12:00, Wednesday, 23 Aug. 2017
Chairs: Catherine Breslin, George Saon

Large-Scale Domain Adaptation via Teacher-Student Learning

Jinyu Li, Michael L. Seltzer, Xi Wang, Rui Zhao, Yifan Gong; Microsoft, USA
Wed-O-6-10-1, Time: 10:00–10:20

High accuracy speech recognition requires a large amount of transcribed data for supervised training. In the absence of such data, domain adaptation of a well-trained acoustic model can be performed, but even here, high accuracy usually requires significant labeled data from the target domain. In this work, we propose an approach to domain adaptation that does not require transcriptions but instead uses a corpus of unlabeled parallel data, consisting of pairs of samples from the source domain of the well-trained model and the desired target domain. To perform adaptation, we employ teacher/student (T/S) learning, in which the posterior probabilities generated by the source-domain model can be used in lieu of labels to train the target-domain model. We evaluate the proposed approach in two scenarios, adapting a clean acoustic model to noisy speech and adapting an adults' speech acoustic model to children's speech. Significant improvements in accuracy are obtained, with reductions in word error rate of up to 44% over the original source model without the need for transcribed data in the target domain. Moreover, we show that increasing the amount of unlabeled data results in additional model robustness, which is particularly beneficial when using simulated training data in the target domain.
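The core of the teacher/student recipe can be written down compactly. The numpy sketch below (with placeholder models and shapes) shows the frame-averaged soft-target cross-entropy in which the teacher's senone posteriors on the source-domain half of a parallel pair replace hard labels for the student trained on the target-domain half; `teacher_forward` and `student_forward` stand in for real acoustic models and are assumptions.

```python
# Minimal numpy sketch of teacher/student (T/S) adaptation with parallel unlabeled data.
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ts_loss(student_logits, teacher_posteriors):
    """Frame-averaged cross-entropy of the student against the teacher's soft labels."""
    log_p = np.log(softmax(student_logits) + 1e-10)
    return -np.mean(np.sum(teacher_posteriors * log_p, axis=-1))

# Parallel pair: the same utterance in source-domain (e.g. clean/adult) and
# target-domain (e.g. noisy/child) form; T frames, D features, S senones (placeholders).
T, D, S = 300, 40, 9000
src_feats = np.random.randn(T, D)
tgt_feats = np.random.randn(T, D)

teacher_post = softmax(np.random.randn(T, S))    # teacher_forward(src_feats) in practice
student_logits = np.random.randn(T, S)           # student_forward(tgt_feats) in practice
print("T/S loss:", ts_loss(student_logits, teacher_post))  # its gradient trains the student
```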

Improving Children's Speech Recognition Through Explicit Pitch Scaling Based on Iterative Spectrogram Inversion

W. Ahmad 1, S. Shahnawazuddin 2, H.K. Kathania 1, Gayadhar Pradhan 2, A.B. Samaddar 1; 1NIT Sikkim, India; 2NIT Patna, India
Wed-O-6-10-2, Time: 10:20–10:40

The task of transcribing children's speech using statistical models trained on adults' speech is very challenging. A large mismatch in the acoustic and linguistic attributes of the training and test data is reported to degrade the performance. In such speech recognition tasks, the difference in pitch (or fundamental frequency) between the two groups of speakers is one among several mismatch factors. To overcome the pitch mismatch, an existing pitch scaling technique based on iterative spectrogram inversion is explored in this work. Explicit pitch scaling is found to improve the recognition of children's speech under a mismatched setup. In addition, we have also studied the effect of discarding the phase information during spectrum reconstruction. This is motivated by the fact that the dominant acoustic feature extraction techniques make use of the magnitude spectrum only. On evaluating the effectiveness under the mismatched testing scenario, the existing as well as the modified pitch scaling techniques result in very similar recognition performances. Furthermore, we have explored the role of pitch scaling in another speech recognition system which is trained on speech data from both adult and child speakers. Pitch scaling is noted to be effective for children's speech recognition in this case as well.
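A hedged sketch of the general idea, not the paper's exact algorithm: warp the magnitude spectrogram along frequency to scale a child's pitch towards the adult training data, then reconstruct the waveform by iterative spectrogram inversion (Griffin-Lim), i.e. discarding the original phase. The use of librosa, the warping factor `alpha` and the file name are assumptions.

```python
# Rough illustration of pitch scaling via frequency warping + iterative spectrogram inversion.
import numpy as np
import librosa

def pitch_scale(y, sr, alpha=0.75, n_fft=1024, hop=256):
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))    # magnitude only, phase dropped
    freqs = np.arange(S.shape[0])
    # alpha < 1 pulls energy from higher frequencies downwards (lowers pitch), alpha > 1 raises it.
    warped = np.stack([np.interp(freqs, alpha * freqs, col) for col in S.T], axis=1)
    # Iterative spectrogram inversion estimates a consistent phase from the magnitude alone.
    return librosa.griffinlim(warped, n_iter=60, hop_length=hop)

y, sr = librosa.load("child_utterance.wav", sr=16000)           # hypothetical input file
y_scaled = pitch_scale(y, sr, alpha=0.75)                       # shifted towards adult speech
```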

RNN-LDA Clustering for Feature Based DNN Adaptation

Xurong Xie 1, Xunying Liu 1, Tan Lee 2, Lan Wang 1; 1Chinese Academy of Sciences, China; 2Chinese University of Hong Kong, China
Wed-O-6-10-3, Time: 10:40–11:00

Model based deep neural network (DNN) adaptation approachesoften require multi-pass decoding in test time. Input feature basedDNN adaptation, for example, based on latent Dirichlet allocation(LDA) clustering, provide a more efficient alternative. In conventionalLDA clustering, the transition and correlation between neighboringclusters is ignored. In order to address this issue, a recurrent neuralnetwork (RNN) based clustering scheme is proposed to learn both thestandard LDA cluster labels and their natural correlation over timein this paper. In addition to directly using the resulting RNN-LDAas input features during DNN adaptation, a range of techniqueswere investigated to condition the DNN hidden layer parametersor activation outputs on the RNN-LDA features. On a DARPA GaleMandarin Chinese broadcast speech transcription task, the proposedRNN-LDA cluster features adapted DNN system outperformed boththe baseline un-adapted DNN system and conventional LDA featuresadapted DNN system by 8% relative on the most difficult Phoenix TVsubset. Consistent improvements were also obtained after furthercombination with model based adaptation approaches.

Robust Online i-Vectors for Unsupervised Adaptation of DNN Acoustic Models: A Study in the Context of Digital Voice Assistants

Harish Arsikere, Sri Garimella; Amazon.com, India
Wed-O-6-10-4, Time: 11:00–11:20

Supplementing log filter-bank energies with i-vectors is a popularmethod for adaptive training of deep neural network acousticmodels. While offline i-vectors (the target utterance or other relevantadaptation material is available for i-vector extraction prior to decod-ing) have been well studied, there is little analysis of online i-vectorsand their robustness in multi-user scenarios where speaker changescan be frequent and unpredictable. The authors of [1] showed thatonline adaptation could be achieved through segmental i-vectorscomputed using the hidden Markov model (HMM) state alignmentsof utterances decoded in the recent past. While this approach workswell in general, it could be rendered ineffective by speaker changes.In this paper, we study robust extensions of the ideas proposedin [1] by: (a) updating i-vectors on a per-frame basis based on theincoming target utterance, and (b) using lattice posteriors insteadof one-best HMM state alignments. Experiments with differenti-vector implementations show that: (a) when speaker changes occur,lattice-based frame-level i-vectors provide up to 6% word error ratereduction relative to the baseline [1], and (b) online i-vectors aremore effective, in general, when the microphone characteristics oftest utterances are not seen in training.

Semi-Supervised Learning with Semantic Knowledge Extraction for Improved Speech Recognition in Air Traffic Control

Ajay Srinivasamurthy 1, Petr Motlicek 1, Ivan Himawan 1, György Szaszák 2, Youssef Oualil 2, Hartmut Helmke 3; 1Idiap Research Institute, Switzerland; 2Universität des Saarlandes, Germany; 3DLR, Germany
Wed-O-6-10-5, Time: 11:20–11:40

Automatic Speech Recognition (ASR) can introduce higher levels ofautomation into Air Traffic Control (ATC), where spoken languageis still the predominant form of communication. While ATC usesstandard phraseology and a limited vocabulary, we need to adaptthe speech recognition systems to local acoustic conditions andvocabularies at each airport to reach optimal performance. Dueto continuous operation of ATC systems, a large and increasingamount of untranscribed speech data is available, allowing forsemi-supervised learning methods to build and adapt ASR models.In this paper, we first identify the challenges in building ASR systemsfor specific ATC areas and propose to utilize out-of-domain datato build baseline ASR models. Then we explore different methodsof data selection for adapting baseline models by exploiting thecontinuously increasing untranscribed data. We develop a basicapproach capable of exploiting semantic representations of ATCcommands. We achieve relative improvement in both word error rate(23.5%) and concept error rates (7%) when adapting ASR models todifferent ATC conditions in a semi-supervised manner.

Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition

Taesup Kim 1, Inchul Song 2, Yoshua Bengio 1; 1Université de Montréal, Canada; 2SAIT, Korea
Wed-O-6-10-6, Time: 11:40–12:00

Layer normalization is a recently introduced technique for normalizing the activities of neurons in deep neural networks to improve the training speed and stability. In this paper, we introduce a new layer normalization technique called Dynamic Layer Normalization (DLN) for adaptive neural acoustic modeling in speech recognition. By dynamically generating the scaling and shifting parameters in layer normalization, DLN adapts neural acoustic models to the acoustic variability arising from various factors such as speakers, channel noises, and environments. Unlike other adaptive acoustic models, our proposed approach does not require additional adaptation data or speaker information such as i-vectors. Moreover, the model size is fixed as it dynamically generates adaptation parameters. We apply our proposed DLN to deep bidirectional LSTM acoustic models and evaluate them on two benchmark datasets for large vocabulary ASR experiments: WSJ and TED-LIUM release 2. The experimental results show that our DLN improves neural acoustic models in terms of transcription accuracy by dynamically adapting to various speakers and environments.
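The following numpy sketch illustrates the idea of generating the layer-normalization scale and shift dynamically from an utterance-level summary; the tiny generator networks, the tanh non-linearity and all shapes are illustrative assumptions rather than the paper's exact design.

```python
# Conceptual sketch of dynamic layer normalization: gamma and beta are produced on the fly.
import numpy as np

def dynamic_layer_norm(h, W_g, b_g, W_b, b_b, eps=1e-5):
    """
    h:   (T, H) hidden activations of one layer for one utterance
    W_*: parameters of small generator networks for gamma (scale) and beta (shift)
    """
    summary = h.mean(axis=0)                     # utterance-level summary vector, shape (H,)
    gamma = np.tanh(summary @ W_g + b_g)         # dynamically generated scale, shape (H,)
    beta = np.tanh(summary @ W_b + b_b)          # dynamically generated shift, shape (H,)
    mu = h.mean(axis=1, keepdims=True)           # standard layer-norm statistics per frame
    sigma = h.std(axis=1, keepdims=True)
    return gamma * (h - mu) / (sigma + eps) + beta

T, H = 200, 512
h = np.random.randn(T, H)
W_g, b_g = 0.01 * np.random.randn(H, H), np.zeros(H)
W_b, b_b = 0.01 * np.random.randn(H, H), np.zeros(H)
out = dynamic_layer_norm(h, W_g, b_g, W_b, b_b)  # adapts per utterance without i-vectors
```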

Wed-O-7-1 : Cognition and Brain Studies
Aula Magna, 13:30–15:30, Wednesday, 23 Aug. 2017
Chairs: Odette Scharenborg, Tanja Schultz

An Entrained Rhythm's Frequency, Not Phase, Influences Temporal Sampling of Speech

Hans Rutger Bosker, Anne Kösem; MPI for Psycholinguistics, The Netherlands
Wed-O-7-1-1, Time: 13:30–13:50

Brain oscillations have been shown to track the slow amplitudefluctuations in speech during comprehension. Moreover, there isevidence that these stimulus-induced cortical rhythms may persisteven after the driving stimulus has ceased. However, how exactlythis neural entrainment shapes speech perception remains debated.This behavioral study investigated whether and how the frequencyand phase of an entrained rhythm would influence the temporalsampling of subsequent speech.

In two behavioral experiments, participants were presented withslow and fast isochronous tone sequences, followed by Dutch targetwords ambiguous between as /As/ “ash” (with a short vowel) andaas /a:s/ “bait” (with a long vowel). Target words were presented atvarious phases of the entrained rhythm. Both experiments revealedeffects of the frequency of the tone sequence on target word percep-tion: fast sequences biased listeners to more long /a:s/ responses.However, no evidence for phase effects could be discerned.

These findings show that an entrained rhythm’s frequency, butnot phase, influences the temporal sampling of subsequent speech.These outcomes are compatible with theories suggesting that sensorytiming is evaluated relative to entrained frequency. Furthermore,they suggest that phase tracking of (syllabic) rhythms by thetaoscillations plays a limited role in speech parsing.

Context Regularity Indexed by Auditory N1 and P2 Event-Related Potentials

Xiao Wang 1, Yanhui Zhang 1, Gang Peng 2; 1Chinese University of Hong Kong, China; 2Hong Kong Polytechnic University, China
Wed-O-7-1-2, Time: 13:50–14:10

It is still a question of debate whether the N1-P2 complex is an index of low-level auditory processes or whether it can capture higher-order information encoded in the immediate context. To address this issue, the current study examined the morphology of the N1-P2 complex as a function of context regularities instantiated at the sublexical level. We presented two types of speech targets in isolation and in contexts comprising sequences of Cantonese words sharing either the entire rime units or just the rime segments (thus lacking lexical tone consistency). Results revealed a pervasive yet unequal attenuation of the N1 and P2 components: the degree of N1 attenuation tended to decrease while that of P2 increased due to enhanced detectability of more regular speech patterns, as well as their enhanced predictability in the immediate context. The distinct behaviors of the N1 and P2 event-related potentials could be explained by the influence of perceptual experience and the hierarchical encoding of context regularities.

Discovering Language in Marmoset Vocalization

Sakshi Verma 1, K.L. Prateek 1, Karthik Pandia 1, Nauman Dawalatabad 1, Rogier Landman 2, Jitendra Sharma 2, Mriganka Sur 2, Hema A. Murthy 1; 1IIT Madras, India; 2MIT, USA
Wed-O-7-1-3, Time: 14:10–14:30

Various studies suggest that marmosets (Callithrix jacchus) showbehavior similar to that of humans in many aspects. Analyzing theircalls would not only enable us to better understand these speciesbut would also give insights into the evolution of human languagesand vocal tract. This paper describes a technique to discover thepatterns in marmoset vocalization in an unsupervised fashion. Theproposed unsupervised clustering approach operates in two stages.Initially, voice activity detection (VAD) is applied to remove silencesand non-voiced regions from the audio. This is followed by a group-delay based segmentation on the voiced regions to obtain smallersegments. In the second stage, a two-tier clustering is performed onthe segments obtained. Individual hidden Markov models (HMMs)are built for each of the segments using a multiple frame size andmultiple frame rate. The HMMs are then clustered until each clusteris made up of a large number of segments. Once all the clusters getenough number of segments, one Gaussian mixture model (GMM)is built for each of the clusters. These clusters are then mergedusing Kullback-Leibler (KL) divergence. The algorithm converges tothe total number of distinct sounds in the audio, as evidenced bylistening tests.

Subject-Independent Classification of Japanese Spoken Sentences by Multiple Frequency Bands Phase Pattern of EEG Response During Speech Perception

Hiroki Watanabe, Hiroki Tanaka, Sakriani Sakti, Satoshi Nakamura; NAIST, Japan
Wed-O-7-1-4, Time: 14:30–14:50

Recent speech perception models propose that neural oscillations in the theta band show phase locking to the speech envelope to extract syllabic information, while rapid temporal information is processed by the corresponding higher frequency band (e.g., low gamma). It is suggested that phase-locked responses to acoustic features show consistent patterns across subjects. A previous magnetoencephalographic (MEG) experiment showed that subject-dependent template matching classification by theta phase patterns could discriminate three English spoken sentences. In this paper, we apply electroencephalography (EEG) to spoken sentence discrimination in Japanese, and we investigate the performance in several different settings by using: (1) template matching and support vector machine (SVM) classifiers; (2) subject-dependent and subject-independent models; (3) multiple frequency bands including theta, alpha, beta, low gamma, and the combination of all frequency bands. The performance in almost all settings was higher than the chance level. While the performances of SVM and template matching did not differ, the combination of multiple frequency bands outperformed models trained only on single frequency bands. The best accuracies in the subject-dependent and subject-independent models reached 55.2% by SVM on the combination of all frequency bands and 44.0% by template matching on the combination of all frequency bands, respectively.

The Phonological Status of the French Initial Accent and its Role in Semantic Processing: An Event-Related Potentials Study

Noémie te Rietmolen 1, Radouane El Yagoubi 2, Alain Ghio 3, Corine Astésano 1; 1URI OCTOGONE-LORDAT (EA 4156), France; 2CLLE (UMR 5263), France; 3LPL (UMR 7309), France
Wed-O-7-1-5, Time: 14:50–15:10

French accentuation is held to belong to the level of the phrase.Consequently French is considered ‘a language without accent’ withspeakers that are ‘deaf to stress’. Recent ERP-studies investigatingthe French initial accent (IA) however demonstrate listeners not onlydiscriminate between different stress patterns, but also prefer wordsto be marked with IA early in the process of speech comprehension.Still, as words were presented in isolation, it remains unclear whetherthe preference applied to the lexical or to the phrasal level. In thecurrent ERP-study, we address this ambiguity and manipulate IAon words embedded in a sentence. Furthermore, we orthogonallymanipulate semantic congruity to investigate the interplay betweenaccentuation and later speech processing stages. Preliminary re-sults on 14 participants reveal a significant interaction effect: thecentro-frontally located N400 was larger for words without IA,with a bigger effect for semantically incongruent sentences. Thisindicates that IA is encoded at a lexical level and facilitates semanticprocessing. Furthermore, as participants attended to the semanticcontent of the sentences, the finding underlines the automaticityof stress processing. In sum, we demonstrate accentuation playsan important role in French speech comprehension and call for thetraditional view to be reconsidered.

A Neuro-Experimental Evidence for the Motor Theory of Speech Perception

Bin Zhao, Jianwu Dang, Gaoyan Zhang; Tianjin University, China
Wed-O-7-1-6, Time: 15:10–15:30

The somatotopic activation in the sensorimotor cortex during speechcomprehension has been redundantly documented and largelyexplained by the notion of embodied semantics, which suggests thatprocessing auditory words referring to body movements recruits thesame somatotopic regions for that action execution. For this issue,the motor theory of speech perception provided another explana-tion, suggesting that the perception of speech sounds produced by aspecific articulator movement may recruit the motor representationof that articulator in the precentral gyrus. To examine the lattertheory, we used a set of Chinese synonyms with different articulatoryfeatures, involving lip gestures (LipR) or not (LipN), and recordedthe electroencephalographic (EEG) signals while subjects passivelylistened to them. It was found that at about 200 ms post-onset,the event-related potential of LipR and LipN showed a significantpolarity reversal near the precentral lip motor areas. EEG sourcereconstruction results also showed more obvious somatotopicactivation in the lip region for the LipR than the LipN. Our resultsprovide a positive support for the effect of articulatory simulationon speech comprehension and basically agree with the motor theoryof speech perception.

Wed-O-7-2 : Noise Robust Speech Recognition
A2, 13:30–15:30, Wednesday, 23 Aug. 2017
Chairs: Yifan Gong, Izhak Shafran

Speech Representation Learning Using Unsupervised Data-Driven Modulation Filtering for Robust ASR

Purvi Agrawal, Sriram Ganapathy; Indian Institute of Science, India
Wed-O-7-2-1, Time: 13:30–13:50

The performance of an automatic speech recognition (ASR) systemdegrades severely in noisy and reverberant environments in part dueto the lack of robustness in the underlying representations used inthe ASR system. On the other hand, the auditory processing studieshave shown the importance of modulation filtered spectrogramrepresentations in robust human speech recognition. Inspiredby these evidences, we propose a speech representation learningparadigm using data-driven 2-D spectro-temporal modulation filterlearning. In particular, multiple representations are derived usingthe convolutional restricted Boltzmann machine (CRBM) model inan unsupervised manner from the input speech spectrogram. Afilter selection criteria based on average number of active hiddenunits is also employed to select the representations for ASR. Theexperiments are performed on Wall Street Journal (WSJ) Aurora-4database with clean and multi condition training setup. In theseexperiments, the ASR results obtained from the proposed modula-tion filtering approach shows significant robustness to noise andchannel distortions compared to other feature extraction methods(average relative improvements of 19% over baseline features inclean training). Furthermore, the ASR experiments performed onreverberant speech data from the REVERB challenge corpus highlightthe benefits of the proposed representation learning scheme for farfield speech recognition.

Combined Multi-Channel NMF-Based Robust Beamforming for Noisy Speech Recognition

Masato Mimura, Yoshiaki Bando, Kazuki Shimada, Shinsuke Sakai, Kazuyoshi Yoshii, Tatsuya Kawahara; Kyoto University, Japan
Wed-O-7-2-2, Time: 13:50–14:10

We propose a novel acoustic beamforming method using blindsource separation (BSS) techniques based on non-negative matrixfactorization (NMF). In conventional mask-based approaches, hardor soft masks are estimated and beamforming is performed us-ing speech and noise spatial covariance matrices calculated frommasked noisy observations, but the phase information of the targetspeech is not adequately preserved. In the proposed method, weperform complex-domain source separation based on multi-channelNMF with rank-1 spatial model (rank-1 MNMF) to obtain a speechspatial covariance matrix for estimating a steering vector for thetarget speech utilizing the separated speech observation in eachtime-frequency bin. This accurate steering vector estimation iseffectively combined with our novel noise mask prediction methodusing multi-channel robust NMF (MRNMF) to construct a MaximumLikelihood (ML) beamformer that achieved a better speech recog-nition performance than a state-of-the-art DNN-based beamformerwith no environment-specific training. Superiority of the phasepreserving source separation to real-valued masks in beamformingis also confirmed through ASR experiments.

Recognizing Multi-Talker Speech with Permutation Invariant Training

Dong Yu 1, Xuankai Chang 2, Yanmin Qian 2; 1Tencent AI Lab, USA; 2Shanghai Jiao Tong University, China
Wed-O-7-2-3, Time: 14:10–14:30

In this paper, we propose a novel technique for direct recognition of multiple speech streams given the single channel of mixed speech, without first separating them. Our technique is based on permutation invariant training (PIT) for automatic speech recognition (ASR). In PIT-ASR, we compute the average cross entropy (CE) over all frames in the whole utterance for each possible output-target assignment, pick the one with the minimum CE, and optimize for that assignment. PIT-ASR forces all the frames of the same speaker to be aligned with the same output layer. This strategy elegantly solves the label permutation problem and the speaker tracing problem in one shot. Our experiments on artificially mixed AMI data showed that the proposed approach is very promising.
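For a two-talker mixture, the PIT criterion described above can be written in a few lines. The numpy example below enumerates both output-to-speaker assignments, computes the utterance-averaged cross-entropy for each, and keeps the minimum; shapes and the random label streams are placeholders.

```python
# Minimal numpy sketch of the permutation invariant training (PIT) criterion for ASR.
import numpy as np
from itertools import permutations

def ce(log_post, labels):
    """Average cross-entropy of one output stream against one senone label stream."""
    return -np.mean(log_post[np.arange(len(labels)), labels])

def pit_loss(log_posts, label_streams):
    """
    log_posts:     list of (T, S) log-posterior matrices, one per output layer
    label_streams: list of (T,) senone label sequences, one per speaker
    """
    best = np.inf
    for perm in permutations(range(len(label_streams))):
        cost = np.mean([ce(log_posts[i], label_streams[p]) for i, p in enumerate(perm)])
        best = min(best, cost)               # keep the assignment with minimum CE
    return best

T, S = 400, 3000
log_posts = [np.log(np.random.dirichlet(np.ones(S), size=T)) for _ in range(2)]
labels = [np.random.randint(S, size=T) for _ in range(2)]
print("PIT loss:", pit_loss(log_posts, labels))
```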

Coupled Initialization of Multi-Channel Non-Negative Matrix Factorization Based on Spatial and Spectral Information

Yuuki Tachioka 1, Tomohiro Narita 1, Iori Miura 2, Takanobu Uramoto 2, Natsuki Monta 2, Shingo Uenohara 2, Ken’ichi Furuya 2, Shinji Watanabe 3, Jonathan Le Roux 3; 1Mitsubishi Electric, Japan; 2Oita University, Japan; 3MERL, USA
Wed-O-7-2-4, Time: 14:30–14:50

Multi-channel non-negative matrix factorization (MNMF) is a multi-channel extension of NMF and often outperforms NMF because itcan deal with spatial and spectral information simultaneously. Onthe other hand, MNMF has a larger number of parameters and itsperformance heavily depends on the initial values. MNMF factorizesan observation matrix into four matrices: spatial correlation, basis,cluster-indicator latent variables, and activation matrices. This paperproposes effective initialization methods for these matrices. First,the spatial correlation matrix, which shows the largest initial valuedependencies, is initialized using the cross-spectrum method fromenhanced speech by binary masking. Second, when the target isspeech, constructing bases from phonemes existing in an utterancecan improve the performance: this paper proposes a speech basesselection by using automatic speech recognition (ASR). Third, wealso propose an initialization method for the cluster-indicator latentvariables that couple the spatial and spectral information, which canachieve the simultaneous optimization of above two matrices. Ex-periments on a noisy ASR task show that the proposed initializationsignificantly improves the performance of MNMF by reducing theinitial value dependencies.

Channel Compensation in the Generalised Vector Taylor Series Approach to Robust ASR

Erfan Loweimi, Jon Barker, Thomas Hain; University of Sheffield, UK
Wed-O-7-2-5, Time: 14:50–15:10

Vector Taylor Series (VTS) is a powerful technique for robust ASR but, in its standard form, it can only be applied to log-filter bank and MFCC features. In earlier work, we presented a generalised VTS (gVTS) that extends the applicability of VTS to front-ends which employ a power transformation non-linearity. gVTS was shown to provide performance improvements in both clean and additive noise conditions. This paper makes two novel contributions. Firstly, while the previous gVTS formulation assumed that noise was purely additive, we now derive gVTS formulae for the case of speech in the presence of both additive noise and channel distortion. Second, we propose a novel iterative method for estimating the channel distortion which utilises gVTS itself and converges after a few iterations. Since the new gVTS blindly assumes the existence of both additive noise and channel effects, it is important not to introduce extra distortion when either is absent. Experimental results conducted on the LVCSR Aurora-4 database show that the new formulation passes this test. In the presence of channel noise only, it provides relative WER reductions of up to 30% and 26%, compared with the previous gVTS and multi-style training with cepstral mean normalisation, respectively.

Robust Speech Recognition via Anchor Word Representations

Brian King 1, I-Fan Chen 1, Yonatan Vaizman 2, Yuzong Liu 1, Roland Maas 1, Sree Hari Krishnan Parthasarathi 1, Björn Hoffmeister 1; 1Amazon.com, USA; 2University of California at San Diego, USA
Wed-O-7-2-6, Time: 15:10–15:30

A challenge for speech recognition for voice-controlled household devices, like the Amazon Echo or Google Home, is robustness against interfering background speech. Formulated as a far-field speech recognition problem, another person or media device in proximity can produce background speech that can interfere with the device-directed speech. We expand on our previous work on device-directed speech detection in the far-field speech setting and introduce two approaches for robust acoustic modeling. Both methods are based on the idea of using an anchor word taken from the device-directed speech. Our first method employs a simple yet effective normalization of the acoustic features by subtracting the mean derived over the anchor word. The second method utilizes an encoder network projecting the anchor word onto a fixed-size embedding, which serves as an additional input to the acoustic model. The encoder network and acoustic model are jointly trained. Results on an in-house dataset reveal that, in the presence of background speech, the proposed approaches can achieve up to 35% relative word error rate reduction.
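The first method reduces to a one-line feature normalization. A minimal numpy sketch, assuming (T, D) log filter-bank features and known anchor-word frame indices:

```python
# Tiny sketch: subtract the mean over the anchor-word frames (e.g. the wake word)
# from all acoustic features of the utterance, so the device-directed talker's
# speaker/channel offset is removed before acoustic modeling.
import numpy as np

def anchor_normalise(feats, anchor_start, anchor_end):
    """feats: (T, D) log filter-bank features; anchor_*: frame indices of the anchor word."""
    anchor_mean = feats[anchor_start:anchor_end].mean(axis=0)
    return feats - anchor_mean

T, D = 500, 64
feats = np.random.randn(T, D)                           # placeholder features
normalised = anchor_normalise(feats, anchor_start=10, anchor_end=60)
```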

Wed-O-7-4 : Topic Spotting, Entity Extraction and Semantic Analysis
B4, 13:30–15:30, Wednesday, 23 Aug. 2017
Chairs: Ville Hautamaki, Lin-shan Lee

Towards Zero-Shot Frame Semantic Parsing for Domain Scaling

Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür, Larry Heck; Google, USA
Wed-O-7-4-1, Time: 13:30–13:50

State-of-the-art slot filling models for goal-oriented human/machine conversational language understanding systems rely on deep learning methods. While multi-task training of such models alleviates the need for large in-domain annotated datasets, bootstrapping a semantic parsing model for a new domain using only the semantic frame, such as the back-end API or knowledge graph schema, is still one of the holy grail tasks of language understanding for dialogue systems. This paper proposes a deep learning based approach that can utilize only the slot description in context without the need for any labeled or unlabeled in-domain examples, to quickly bootstrap a new domain. The main idea of this paper is to leverage the encoding of the slot names and descriptions within a multi-task deep learned slot filling model, to implicitly align slots across domains. The proposed approach is promising for solving the domain scaling problem and eliminating the need for any manually annotated data or explicit schema alignment. Furthermore, our experiments on multiple domains show that this approach results in significantly better slot-filling performance when compared to using only in-domain data, especially in the low data regime.

ClockWork-RNN Based Architectures for Slot Filling

Despoina Georgiadou, Vassilios Diakoloukas, Vassilios Tsiaras, Vassilios Digalakis; Technical University of Crete, Greece
Wed-O-7-4-2, Time: 13:50–14:10

A prevalent and challenging task in spoken language understanding is slot filling. Currently, the best approaches in this domain are based on recurrent neural networks (RNNs). However, in their simplest form, RNNs cannot learn long-term dependencies in the data. In this paper, we propose the use of ClockWork recurrent neural network (CW-RNN) architectures in the slot-filling domain. CW-RNN is a multi-timescale implementation of the simple RNN architecture, which has proven to be powerful since it maintains relatively small model complexity. In addition, CW-RNN exhibits a great ability to model long-term memory inherently. In our experiments on the ATIS benchmark data set, we also evaluate several novel variants of CW-RNN and we find that they significantly outperform simple RNNs and achieve results among the state of the art, while retaining smaller complexity.
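As a reminder of how a ClockWork-RNN differs from a simple RNN, here is a simplified numpy sketch (after Koutník et al.'s original formulation, not necessarily the variants evaluated in this paper): the hidden state is partitioned into modules with increasing clock periods, and a module is updated only at time steps divisible by its period.

```python
# Simplified ClockWork-RNN forward pass; the block structure of the recurrence matrix
# used in the original paper is omitted here for brevity.
import numpy as np

def cw_rnn(x_seq, W_in, W_hh, periods, module_size):
    T, H = len(x_seq), W_hh.shape[0]
    h = np.zeros(H)
    outputs = []
    for t in range(1, T + 1):
        h_new = np.tanh(x_seq[t - 1] @ W_in + h @ W_hh)
        for i, p in enumerate(periods):
            sl = slice(i * module_size, (i + 1) * module_size)
            if t % p == 0:                   # active module: take the freshly computed value
                h[sl] = h_new[sl]
            # inactive modules keep their previous state, providing long-term memory
        outputs.append(h.copy())
    return np.stack(outputs)

D, module_size, periods = 40, 32, [1, 2, 4, 8]        # exponentially increasing clock periods
H = module_size * len(periods)
W_in, W_hh = 0.1 * np.random.randn(D, H), 0.1 * np.random.randn(H, H)
states = cw_rnn(np.random.randn(100, D), W_in, W_hh, periods, module_size)
```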

Investigating the Effect of ASR Tuning on Named Entity Recognition

Mohamed Ameur Ben Jannet 1, Olivier Galibert 2, Martine Adda-Decker 3, Sophie Rosset 1; 1LIMSI, France; 2LNE, France; 3LPP (UMR 7018), France
Wed-O-7-4-3, Time: 14:10–14:30

Information retrieval from speech is a key technology for manyapplications, as it allows access to large amounts of audio data. Thistechnology requires two major components: an automatic speechrecognizer (ASR) and a text-based information retrieval module suchas a key word extractor or a named entity recognizer (NER). Whencombining the two components, the resulting final application needsto be globally optimized. However, ASR and information retrievalare usually developed and optimized separately. The ASR tends tobe optimized to reduce the word error rate (WER), a metric whichdoes not take into account the contextual and syntactic roles ofthe words, which are valuable information for information retrievalsystems. In this paper we investigate different ways to tune the ASRfor a speech-based NER system. In an end-to-end configuration wealso tested several ASR metrics, including WER, NE-WER and ATENE,as well as the use of an oracle during the development step. Ourresults show that using a NER oracle to tune the system reduces thenamed entity recognition error rate by more than 1% absolute, andusing the ATENE metric allows us to reduce it by more than 0.75%.We also show that these optimization approaches favor a higherASR language model weight which entails an overall gain in NERperformance, despite a local increase of the WER.

Label-Dependency Coding in Simple Recurrent Networks for Spoken Language Understanding

Marco Dinarelli 1, Vedran Vukotic 2, Christian Raymond 2; 1Lattice (UMR 8094), France; 2INSA, France
Wed-O-7-4-4, Time: 14:30–14:50

Modeling target label dependencies is important for sequence labeling tasks. This may become crucial in the case of Spoken Language Understanding (SLU) applications, especially for the slot-filling task where models often have to deal with a high number of target labels. Conditional Random Fields (CRF) were previously considered the most efficient algorithm in these conditions. More recently, different architectures of Recurrent Neural Networks (RNNs) have been proposed for the SLU slot-filling task. Most of them, however, have been successfully evaluated on the simple ATIS database, on which it is difficult to draw significant conclusions. In this paper we propose new variants of RNNs able to learn label dependencies efficiently and effectively by integrating label embeddings. We show first that modeling label dependencies is useless on the (simple) ATIS database and that unstructured models can produce state-of-the-art results on this benchmark. On ATIS our new variants achieve the same results as state-of-the-art models, while being much simpler. On the other hand, on the MEDIA benchmark, we show that the modification introduced in the proposed RNN outperforms traditional RNNs and CRF models.

Minimum Semantic Error Cost Training of Deep Long Short-Term Memory Networks for Topic Spotting on Conversational Speech

Zhong Meng, Biing-Hwang Juang; Georgia Institute of Technology, USA
Wed-O-7-4-5, Time: 14:50–15:10

The topic spotting performance on spontaneous conversational speech can be significantly improved by operating a support vector machine with a latent semantic rational kernel (LSRK) on the decoded word lattices (i.e., weighted finite-state transducers) of the speech [1]. In this work, we propose the minimum semantic error cost (MSEC) training of a deep bidirectional long short-term memory (BLSTM)-hidden Markov model acoustic model for generating lattices that are semantically accurate and are better suited for topic spotting with LSRK. With the MSEC training, the expected semantic error cost of all possible word sequences on the lattices is minimized given the reference. The word-word semantic error cost is first computed from either the latent semantic analysis or distributed vector-space word representations learned from the recurrent neural networks and is then accumulated to form the expected semantic error cost of the hypothesized word sequences. The proposed method achieves 3.5%–4.5% absolute topic classification accuracy improvement over the baseline BLSTM trained with cross-entropy on the Switchboard-1 Release 2 dataset.
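A hedged sketch of the expected semantic error cost, approximated here over an N-best list rather than a full lattice and with simple positional word alignment; the embedding lookup `word_vec`, the hypothesis posteriors and all words are placeholders, not the paper's actual setup.

```python
# N-best approximation of an expected semantic error cost built from word-word cosine distances.
import numpy as np

def word_cost(w_hyp, w_ref, word_vec):
    v1, v2 = word_vec[w_hyp], word_vec[w_ref]
    return 1.0 - v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))   # cosine distance

def expected_semantic_cost(nbest, posteriors, reference, word_vec):
    costs = []
    for hyp in nbest:
        # align by position for simplicity; a real system accumulates costs along lattice arcs
        costs.append(sum(word_cost(h, r, word_vec) for h, r in zip(hyp, reference)))
    return float(np.dot(posteriors, costs))    # quantity to be minimised during training

word_vec = {w: np.random.randn(50) for w in ["book", "flight", "a", "the", "reserve"]}
nbest = [["book", "a", "flight"], ["book", "the", "flight"]]
posteriors = np.array([0.7, 0.3])
print(expected_semantic_cost(nbest, posteriors, ["reserve", "a", "flight"], word_vec))
```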

Topic Identification for Speech Without ASR

Chunxi Liu, Jan Trmal, Matthew Wiesner, Craig Harman, Sanjeev Khudanpur; Johns Hopkins University, USA
Wed-O-7-4-6, Time: 15:10–15:30

Modern topic identification (topic ID) systems for speech use auto-matic speech recognition (ASR) to produce speech transcripts, andperform supervised classification on such ASR outputs. However,under resource-limited conditions, the manually transcribed speechrequired to develop standard ASR systems can be severely limited orunavailable. In this paper, we investigate alternative unsupervisedsolutions to obtaining tokenizations of speech in terms of a vocab-ulary of automatically discovered word-like or phoneme-like units,without depending on the supervised training of ASR systems. More-over, using automatic phoneme-like tokenizations, we demonstratethat a convolutional neural network based framework for learningspoken document representations provides competitive performancecompared to a standard bag-of-words representation, as evidencedby comprehensive topic ID evaluations on both single-label andmulti-label classification tasks.

Wed-O-7-6 : Dialog Systems
C6, 13:30–15:30, Wednesday, 23 Aug. 2017
Chairs: Gabriel Skantze, Timo Baumann

An End-to-End Trainable Neural Network Model with Belief Tracking for Task-Oriented Dialog

Bing Liu, Ian Lane; Carnegie Mellon University, USA
Wed-O-7-6-1, Time: 13:30–13:50

We present a novel end-to-end trainable neural network model fortask-oriented dialog systems. The model is able to track dialogstate, issue API calls to knowledge base (KB), and incorporatestructured KB query results into system responses to successfullycomplete task-oriented dialogs. The proposed model produceswell-structured system responses by jointly learning belief trackingand KB result processing conditioning on the dialog history. Weevaluate the model in a restaurant search domain using a datasetthat is converted from the second Dialog State Tracking Challenge(DSTC2) corpus. Experiment results show that the proposed modelcan robustly track dialog state given the dialog history. Moreover,our model demonstrates promising results in producing appropriatesystem responses, outperforming prior end-to-end trainable neuralnetwork models using per-response accuracy evaluation metrics.

Deep Reinforcement Learning of Dialogue Policies with Less Weight Updates

Heriberto Cuayáhuitl 1, Seunghak Yu 2; 1University of Lincoln, UK; 2Samsung Electronics, Korea
Wed-O-7-6-2, Time: 13:50–14:10

Deep reinforcement learning dialogue systems are attractive becausethey can jointly learn their feature representations and policieswithout manual feature engineering. But its application is chal-lenging due to slow learning. We propose a two-stage methodfor accelerating the induction of single or multi-domain dialoguepolicies. While the first stage reduces the amount of weight updatesover time, the second stage uses very limited minibatches (of asmuch as two learning experiences) sampled from experience replaymemories. The former frequently updates the weights of the neuralnets at early stages of training, and decreases the amount of updatesas training progresses by performing updates during explorationand by skipping updates during exploitation. The learning processis thus accelerated through less weight updates in both stages. Anempirical evaluation in three domains (restaurants, hotels and tvguide) confirms that the proposed method trains policies 5 timesfaster than a baseline without the proposed method. Our findingsare useful for training larger-scale neural-based spoken dialoguesystems.

Towards End-to-End Spoken Dialogue Systems with Turn Embeddings

Ali Orkan Bayer, Evgeny A. Stepanov, Giuseppe Riccardi; Università di Trento, Italy
Wed-O-7-6-3, Time: 14:10–14:30

Training task-oriented dialogue systems requires a significant amount of manual effort and the integration of many independently built components; moreover, the pipeline is prone to error propagation. End-to-end training has been proposed to overcome these problems by training the whole system over the utterances of both dialogue parties. In this paper we present an end-to-end spoken dialogue system architecture that is based on turn embeddings. Turn embeddings encode a robust representation of user turns with a local dialogue history and they are trained using sequence-to-sequence models. Turn embeddings are trained by generating the previous and the next turns of the dialogue and additionally performing spoken language understanding. The end-to-end spoken dialogue system is trained using the pre-trained turn embeddings in a stateful architecture that considers the whole dialogue history. We observe that the proposed spoken dialogue system architecture outperforms models based on a local-only dialogue history and is robust to automatic speech recognition errors.

Speech and Text Analysis for Multimodal Addressee Detection in Human-Human-Computer Interaction

Oleg Akhtiamov 1, Maxim Sidorov 1, Alexey A. Karpov 2, Wolfgang Minker 1; 1Universität Ulm, Germany; 2ITMO University, Russia
Wed-O-7-6-4, Time: 14:30–14:50

The necessity of addressee detection arises in multiparty spokendialogue systems which deal with human-human-computer interac-tion. In order to cope with this kind of interaction, such a system issupposed to determine whether the user is addressing the systemor another human. The present study is focused on multimodaladdressee detection and describes three levels of speech and textanalysis: acoustical, syntactical, and lexical. We define the con-nection between different levels of analysis and the classificationperformance for different categories of speech and determine thedependence of addressee detection performance on speech recogni-tion accuracy. We also compare the obtained results with the resultsof the original research performed by the authors of the Smart VideoCorpus which we use in our computations. Our most effective meta-classifier working with acoustical, syntactical, and lexical featuresreaches an unweighted average recall equal to 0.917 showing almosta nine percent advantage over the best baseline model, though thisbaseline classifier additionally uses head orientation data. We alsopropose a universal meta-model based on acoustical and syntacticalanalysis, which may theoretically be applied in different domains.

Rushing to Judgement: How do Laypeople Rate Caller Engagement in Thin-Slice Videos of Human–Machine Dialog?

Vikram Ramanarayanan, Chee Wee Leong, David Suendermann-Oeft; Educational Testing Service, USA
Wed-O-7-6-5, Time: 14:50–15:10

We analyze the efficacy of a small crowd of naïve human raters in rating engagement during human–machine dialog interactions. Each rater viewed multiple 10 second, thin-slice videos of non-native English speakers interacting with a computer-assisted language learning (CALL) system and rated how engaged and disengaged those callers were while interacting with the automated agent. We observe how the crowd's ratings compared to callers' self-ratings of engagement, and further study how the distribution of these rating assignments varies as a function of whether the automated system or the caller was speaking. Finally, we discuss the potential applications and pitfalls of such a crowdsourced paradigm in designing, developing and analyzing engagement-aware dialog systems.

Hyperarticulation of Corrections in Multilingual Dialogue Systems

Ivan Kraljevski, Diane Hirschfeld; voice INTER connect, Germany
Wed-O-7-6-6, Time: 15:10–15:30

This paper aims at answering the question of whether there are distinctive cross-linguistic differences associated with hyperarticulated speech in correction dialogue acts. The objective is to assess the effort required to adapt a multilingual dialogue system to 9 different languages with regard to recovery strategies, particularly corrections. If the presence of hyperarticulation significantly differs across languages, it will have a significant impact on the dialogue design and recovery strategies.

Wed-O-7-8 : Lexical and Pronunciation Modeling
D8, 13:30–15:30, Wednesday, 23 Aug. 2017
Chairs: Izhak Shafran, Helen Meng

Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion

Benjamin Milde, Christoph Schmidt, Joachim Köhler; Fraunhofer IAIS, Germany
Wed-O-7-8-1, Time: 13:30–13:50

Recently, neural sequence-to-sequence (Seq2Seq) models have beenapplied to the problem of grapheme-to-phoneme (G2P) conversion.These models offer a straightforward way of modeling the conver-sion by jointly learning the alignment and translation of input tooutput tokens in an end-to-end fashion. However, until now thisapproach did not show improved error rates on its own comparedto traditional joint-sequence based n-gram models for G2P. In thispaper, we investigate how multitask learning can improve the per-formance of Seq2Seq G2P models. A single Seq2Seq model is trainedon multiple phoneme lexicon datasets containing multiple languagesand phonetic alphabets. Although multi-language learning does notshow improved error rates, combining standard datasets and crawleddata with different phonetic alphabets of the same language showspromising error reductions on English and German Seq2Seq G2Pconversion. Finally, combining Seq2seq G2P models with standardn-grams based models yields significant improvements over usingeither model alone.

Acoustic Data-Driven Lexicon Learning Based on a Greedy Pronunciation Selection Framework

Xiaohui Zhang, Vimal Manohar, Daniel Povey, Sanjeev Khudanpur; Johns Hopkins University, USA
Wed-O-7-8-2, Time: 13:50–14:10

Speech recognition systems for irregularly-spelled languages likeEnglish normally require hand-written pronunciations. In this paper,we describe a system for automatically obtaining pronunciationsof words for which pronunciations are not available, but for whichtranscribed data exists. Our method integrates information from theletter sequence and from the acoustic evidence. The novel aspect ofthe problem that we address is the problem of how to prune entriesfrom such a lexicon (since, empirically, lexicons with too many en-tries do not tend to be good for ASR performance). Experiments onvarious ASR tasks show that, with the proposed framework, startingwith an initial lexicon of several thousand words, we are able tolearn a lexicon which performs close to a full expert lexicon in termsof WER performance on test data, and is better than lexicons builtusing G2P alone or with a pruning criterion based on pronunciationprobability.

Semi-Supervised Learning of a Pronunciation Dictionary from Disjoint Phonemic Transcripts and Text

Takahiro Shinozaki 1, Shinji Watanabe 2, Daichi Mochihashi 3, Graham Neubig 4; 1Tokyo Institute of Technology, Japan; 2MERL, USA; 3ISM, Japan; 4Carnegie Mellon University, USA
Wed-O-7-8-3, Time: 14:10–14:30

While the performance of automatic speech recognition systems hasrecently approached human levels in some tasks, the application isstill limited to specific domains. This is because system developmentrelies on extensive supervised training and expert tuning in thetarget domain. To solve this problem, systems must become moreself-sufficient, having the ability to learn directly from speech andadapt to new tasks. One open question in this area is how to learna pronunciation dictionary containing the appropriate vocabulary.Humans can recognize words, even ones they have never heardbefore, by reading text and understanding the context in which aword is used. However, this ability is missing in current speechrecognition systems. In this work, we propose a new frameworkthat automatically expands an initial pronunciation dictionary usingindependently sampled acoustic and textual data. While the task isvery challenging and in its initial stage, we demonstrate that a modelbased on Bayesian learning of Dirichlet processes can acquire wordpronunciations from phone transcripts and text of the WSJ data set.

Improved Subword Modeling for WFST-Based Speech Recognition

Peter Smit, Sami Virpioja, Mikko Kurimo; Aalto University, Finland
Wed-O-7-8-4, Time: 14:30–14:50

Because the number of observed word forms in agglutinative languages is very high, subword units are often utilized in speech recognition. However, the proper use of subword units requires careful consideration of details such as silence modeling, position-dependent phones, and the combination of the units. In this paper, we implement subword modeling in the Kaldi toolkit by creating a modified lexicon with finite-state transducers to represent the subword units correctly. We experiment with multiple types of word boundary markers and achieve the best results by adding a marker to the left or right side of a subword unit whenever it is not preceded or followed by a word boundary, respectively. We also compare three different toolkits that provide data-driven subword segmentations. In our experiments on a variety of Finnish and Estonian datasets, the best subword models outperform word-based models and naive subword implementations. The largest relative reduction in WER is 23% over word-based models for a Finnish read speech dataset. The results are also better than any previously published ones for the same datasets, and the improvement on all datasets is more than 5%.
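The best-performing marking scheme is easy to state in code. In the sketch below a "+" marker (the symbol and the toy segmenter are illustrative assumptions) is attached to whichever side of a subword unit does not touch a word boundary.

```python
# Sketch of boundary marking for subword units: mark a side only when it is word-internal.
def mark_subwords(words, segment):
    """`segment` maps a word to its list of subword units (e.g. a data-driven segmenter)."""
    marked = []
    for word in words:
        units = segment(word)
        for i, unit in enumerate(units):
            left = "+" if i > 0 else ""                # not preceded by a word boundary
            right = "+" if i < len(units) - 1 else ""  # not followed by a word boundary
            marked.append(left + unit + right)
    return marked

toy_segmenter = lambda w: [w[:3], w[3:]] if len(w) > 4 else [w]   # placeholder segmentation
print(mark_subwords(["puhekieli", "on", "kivaa"], toy_segmenter))
# -> ['puh+', '+ekieli', 'on', 'kiv+', '+aa']
```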

Pronunciation Learning with RNN-Transducers

Antoine Bruguier, Danushen Gnanapragasam, Leif Johnson, Kanishka Rao, Françoise Beaufays; Google, USA
Wed-O-7-8-5, Time: 14:50–15:10

Most speech recognition systems rely on pronunciation dictionaries to provide accurate transcriptions. Typically, some pronunciations are carved manually, but many are produced using pronunciation learning algorithms. Successful algorithms must have the ability to generate rich pronunciation variants, e.g. to accommodate words of foreign origin, while being robust to artifacts of the training data, e.g. noise in the acoustic segments from which the pronunciations are learned if the method uses acoustic signals. We propose a general finite-state transducer (FST) framework to describe such algorithms. This representation is flexible enough to accommodate a wide variety of pronunciation learning algorithms, including approaches that rely on the availability of acoustic data, and methods that only rely on the spelling of the target words. In particular, we show that the pronunciation FST can be built from a recurrent neural network (RNN) and tuned to provide rich yet constrained pronunciations. This new approach reduces the number of incorrect pronunciations learned from Google Voice traffic by up to 25% relative.

Learning Similarity Functions for Pronunciation Variations

Einat Naaman, Yossi Adi, Joseph Keshet; Bar-Ilan University, Israel
Wed-O-7-8-6, Time: 15:10–15:30

A significant source of errors in Automatic Speech Recognition (ASR)systems is due to pronunciation variations which occur in sponta-neous and conversational speech. Usually ASR systems use a finitelexicon that provides one or more pronunciations for each word. Inthis paper, we focus on learning a similarity function between twopronunciations. The pronunciations can be the canonical and thesurface pronunciations of the same word or they can be two surfacepronunciations of different words. This task generalizes problemssuch as lexical access (the problem of learning the mapping betweenwords and their possible pronunciations), and defining word neigh-borhoods. It can also be used to dynamically increase the size of thepronunciation lexicon, or in predicting ASR errors. We propose twomethods, which are based on recurrent neural networks, to learn thesimilarity function. The first is based on binary classification, andthe second is based on learning the ranking of the pronunciations.We demonstrate the efficiency of our approach on the task of lexicalaccess using a subset of the Switchboard conversational speechcorpus. Results suggest that on this task our methods are superiorto previous methods which are based on graphical Bayesian methods.

Wed-O-7-10 : Language Recognition
E10, 13:30–15:30, Wednesday, 23 Aug. 2017
Chairs: Yao Qian, Vidhyasaharan Sethu

Spoken Language Identification Using LSTM-Based Angular Proximity

G. Gelly, J.L. Gauvain; LIMSI, France
Wed-O-7-10-1, Time: 13:30–13:50

This paper describes the design of an acoustic language identification (LID) system based on LSTMs that directly maps a sequence of acoustic features to a vector in a vector space where the angular proximity corresponds to a measure of language/dialect similarity. A specific architecture for the LSTM-based language vector extractor is introduced along with the angular proximity loss function to train it. This new LSTM-based LID system is quicker to train than a standard RNN topology using stacked layers trained with the cross-entropy loss function and obtains significantly lower language error rates. Experiments compare this approach to our previous developments on the subject, as well as to two widely used LID techniques: a phonotactic system using DNN acoustic models and an i-vector system. Results are reported on two different data sets: the 14 languages of NIST LRE07 and the 20 closely related languages and dialects of NIST LRE15. In addition to reporting the NIST Cavg metric, which served as the primary metric for the LRE07 and LRE15 evaluations, the average LER is provided.
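At test time, scoring by angular proximity amounts to a cosine comparison between the utterance's language vector and per-language anchor vectors. The sketch below assumes a placeholder extractor and randomly initialised anchors, so it only illustrates the scoring step, not the paper's training procedure.

```python
# Sketch of language identification by angular (cosine) proximity in an embedding space.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def identify(utt_vector, language_anchors):
    """Return the language whose anchor vector is angularly closest to the utterance vector."""
    scores = {lang: cosine(utt_vector, anchor) for lang, anchor in language_anchors.items()}
    return max(scores, key=scores.get), scores

dim = 64
anchors = {lang: np.random.randn(dim) for lang in ["eng", "fra", "cmn"]}  # learned in practice
utt_vec = np.random.randn(dim)      # would come from the LSTM language-vector extractor
print(identify(utt_vec, anchors))
```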

End-to-End Language Identification Using High-Order Utterance Representation with Bilinear Pooling

Ma Jin 1, Yan Song 1, Ian McLoughlin 2, Wu Guo 1, Li-Rong Dai 1; 1USTC, China; 2University of Kent, UK
Wed-O-7-10-2, Time: 13:50–14:10

A key problem in spoken language identification (LID) is how to design effective representations which are specific to language information. Recent advances in deep neural networks have led to significant improvements in results, with deep end-to-end methods proving effective. This paper proposes a novel network which aims to model an effective representation for high (first and second)-order statistics of LID-senones, defined as being LID analogues of senones in speech recognition. The high-order information extracted through bilinear pooling is robust to speakers, channels and background noise. Evaluation with NIST LRE 2009 shows improved performance compared to current state-of-the-art DBF/i-vector systems, achieving over 33% and 20% relative equal error rate (EER) improvement for 3s and 10s utterances and over 40% relative Cavg improvement for all durations.
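A minimal numpy sketch of bilinear pooling for second-order statistics: the outer product of two frame-level feature streams is averaged over time, then passed through the commonly used signed-square-root and L2 normalisation. Dimensions and the two feature streams are illustrative, not the paper's exact configuration.

```python
# Bilinear pooling of two frame-level feature streams into a fixed-size utterance vector.
import numpy as np

def bilinear_pool(feats_a, feats_b):
    """feats_a: (T, Da), feats_b: (T, Db) frame-level features (e.g. LID-senone posteriors)."""
    outer = np.einsum("td,te->de", feats_a, feats_b) / len(feats_a)  # time-averaged outer product
    v = outer.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))              # signed square root
    return v / (np.linalg.norm(v) + 1e-12)           # L2 normalisation

T, Da, Db = 300, 128, 128
utterance_repr = bilinear_pool(np.random.randn(T, Da), np.random.randn(T, Db))
print(utterance_repr.shape)                          # (Da * Db,)
```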

Dialect Recognition Based on Unsupervised Bottleneck Features

Qian Zhang, John H.L. Hansen; University of Texas at Dallas, USA
Wed-O-7-10-3, Time: 14:10–14:30

Recently, bottleneck features (BNF) with an i-Vector strategy has beenused for state-of-the-art language/dialect identification. However,traditional bottleneck extraction requires an additional transcribedcorpus which is used for acoustic modeling. Alternatively, an unsu-pervised BNF extraction diagram is proposed in our study, which isderived from the traditional structure but trained with an estimatedphonetic label. The proposed method is evaluated on a 4-way Chi-nese dialect dataset and a 5-way closely spaced Pan-Arabic corpus.Compared to a baseline i-Vector system based on acoustic featuresMFCCs, the proposed unsupervised BNF consistently achieves betterperformance across two corpora. Specifically, the EER and overallperformance Cavg ∗ 100 are improved by a relative +48% and +52%,respectively. Even under the condition with limited training data,the proposed feature still achieves up to 24% relative improvementcompared to baseline, all without the need of a secondary transcribedcorpus.

Investigating Scalability in Hierarchical Language Identification System

Saad Irtza 1, Vidhyasaharan Sethu 1, Eliathamby Ambikairajah 1, Haizhou Li 2; 1University of New South Wales, Australia; 2NUS, Singapore
Wed-O-7-10-4, Time: 14:30–14:50

State-of-the-art language identification (LID) systems are not easilyscalable to accommodate new languages. Specifically, as the numberof target languages grows the error rate of these LID systems in-creases rapidly. This paper addresses such a challenge by adoptinga hierarchical language identification (HLID) framework. We demon-strate the superior scalability of the HLID framework. In particular,HLID only requires the training of relevant nodes in a hierarchicalstructure instead of re-training the entire tree. Experiments con-ducted on a dataset that combined languages from the NIST LRE2007, 2009, 2011 and 2015 databases show that as the number oftarget languages grows from 28 to 42, the performance of a singlelevel (non-hierarchical) system deteriorates by around 11% while thatof the hierarchical system only deteriorates by about 3.4% in termsof Cavg. Finally, experiments also suggest that SVM based systemsare more scalable than GPLDA based systems.

Improving Sub-Phone Modeling for Better Native Language Identification with Non-Native English Speech

Yao Qian 1, Keelan Evanini 1, Xinhao Wang 1, David Suendermann-Oeft 1, Robert A. Pugh 1, Patrick L. Lange 1, Hillary R. Molloy 1, Frank K. Soong 2; 1Educational Testing Service, USA; 2Microsoft, China
Wed-O-7-10-5, Time: 14:50–15:10

Identifying a speaker’s native language with his speech in a second language is useful for many human-machine voice interface applications. In this paper, we use a sub-phone-based i-vector approach to identify non-native English speakers’ native languages by their English speech input. Time delay deep neural networks (TDNN) are trained on LVCSR corpora for improving the alignment of speech utterances with their corresponding sub-phonemic “senone” sequences. The phonetic variability caused by a speaker’s native language can be better modeled with the sub-phone models than the conventional phone model based approach. Experimental results on the database released for the 2016 Interspeech ComParE Native Language challenge with 11 different L1s show that our system outperforms the best system by a large margin (87.2% UAR compared to 81.3% UAR for the best system from the 2016 ComParE challenge).

QMDIS: QCRI-MIT Advanced Dialect Identification System

Sameer Khurana 1, Maryam Najafian 2, Ahmed Ali 1, Tuka Al Hanai 2, Yonatan Belinkov 2, James Glass 2; 1HBKU, Qatar; 2MIT, USA
Wed-O-7-10-6, Time: 15:10–15:30

As a continuation of our efforts towards tackling the problem of spoken Dialect Identification (DID) for Arabic languages, we present the QCRI-MIT Advanced Dialect Identification System (QMDIS). QMDIS is an automatic spoken DID system for Dialectal Arabic (DA). In this paper, we report a comprehensive study of the three main components used in the spoken DID task: phonotactic, lexical and acoustic. We use Support Vector Machines (SVMs), Logistic Regression (LR) and Convolutional Neural Networks (CNNs) as backend classifiers throughout the study. We perform all our experiments on a publicly available dataset and present new state-of-the-art results. QMDIS discriminates between the five most widely used dialects of Arabic: namely Egyptian, Gulf, Levantine, North African, and Modern Standard Arabic (MSA). We report ≈73% accuracy for system combination. All the data and the code used in our experiments are publicly available for research.

Wed-O-8-1 : Speaker Database and Anti-spoofing
Aula Magna, 16:00–18:00, Wednesday, 23 Aug. 2017
Chairs: Nicholas Evans, Karthika Vajayan

Detection of Replay Attacks Using Single Frequency Filtering Cepstral Coefficients

K.N.R.K. Raju Alluri, Sivanand Achanta, Sudarsana Reddy Kadiri, Suryakanth V. Gangashetty, Anil Kumar Vuppala; IIIT Hyderabad, India
Wed-O-8-1-1, Time: 16:00–16:20

Automatic speaker verification systems are vulnerable to spoofing attacks. Recently, various countermeasures have been developed for detecting high technology attacks such as speech synthesis and


voice conversion. However, there is a wide gap in dealing with replay attacks. In this paper, we propose a new feature for replay attack detection based on single frequency filtering (SFF), which provides high temporal and spectral resolution at each instant. Single frequency filtering cepstral coefficients (SFFCC) with a Gaussian mixture model classifier are used for the experimentation on the standard BTAS-2016 corpus. The previously reported best result, which is based on constant Q cepstral coefficients (CQCC), achieved a half total error rate of 0.67% on this data-set. Our proposed method outperforms the state of the art (CQCC) with a half total error rate of 0.0002%.

Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection

Hardik B. Sailor, Madhu R. Kamble, Hemant A. Patil; DA-IICT, India
Wed-O-8-1-2, Time: 16:20–16:40

Speech Synthesis (SS) and Voice Conversion (VC) present a genuine risk of attacks for Automatic Speaker Verification (ASV) technology. In this paper, we use our recently proposed unsupervised filterbank learning technique using a Convolutional Restricted Boltzmann Machine (ConvRBM) as a front-end feature representation. ConvRBM is trained on the training subset of the ASV spoof 2015 challenge database. Analyzing the filterbank trained on this dataset shows that ConvRBM learned more low-frequency subband filters compared to training on a natural speech database such as TIMIT. The spoofing detection experiments were performed using Gaussian Mixture Models (GMM) as a back-end classifier. ConvRBM-based cepstral coefficients (ConvRBM-CC) perform better than hand-crafted Mel Frequency Cepstral Coefficients (MFCC). On the evaluation set, ConvRBM-CC features give an absolute reduction of 4.76% in Equal Error Rate (EER) compared to MFCC features. Specifically, ConvRBM-CC features perform significantly better on both known attacks (1.93%) and unknown attacks (5.87%) compared to MFCC features.

Independent Modelling of High and Low Energy Speech Frames for Spoofing Detection

Gajan Suthokumar, Kaavya Sriskandaraja, Vidhyasaharan Sethu, Chamith Wijenayake, Eliathamby Ambikairajah; University of New South Wales, Australia
Wed-O-8-1-3, Time: 16:40–17:00

Spoofing detection systems for automatic speaker verification have moved from only modelling voiced frames to modelling all speech frames. Unvoiced speech has been shown to carry information about spoofing attacks and anti-spoofing systems may further benefit by treating voiced and unvoiced speech differently. In this paper, we separate speech into low and high energy frames and independently model the distributions of both to form two spoofing detection systems that are then fused at the score level. Experiments conducted on the ASVspoof 2015, BTAS 2016 and Spoofing and Anti-Spoofing (SAS) corpora demonstrate that the proposed approach of fusing two independent high and low energy spoofing detection systems consistently outperforms the standard approach that does not distinguish between high and low energy frames.
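
The general idea of splitting frames by energy and fusing two detectors at the score level can be sketched as follows. The threshold choice, fusion weight and the dummy scoring functions are assumptions for illustration, not the paper's configuration.

# Rough sketch: split frames by an energy threshold, score each subset with
# its own model, then fuse at the score level. All values are illustrative.
import numpy as np

def frame_energy(frames):
    return 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)

def fused_score(frames, high_model_score, low_model_score, alpha=0.5):
    e = frame_energy(frames)
    threshold = np.median(e)                    # simple data-driven split
    high, low = frames[e >= threshold], frames[e < threshold]
    return alpha * high_model_score(high) + (1 - alpha) * low_model_score(low)

frames = np.random.randn(500, 400)              # 500 frames of 400 samples each
dummy = lambda x: float(len(x))                 # stand-ins for per-model log-likelihood ratios
print(fused_score(frames, dummy, dummy))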

Improving Speaker Verification Performance in Presence of Spoofing Attacks Using Out-of-Domain Spoofed Data

Achintya Kr. Sarkar 1, Md. Sahidullah 2, Zheng-Hua Tan 1, Tomi Kinnunen 2; 1Aalborg University, Denmark; 2University of Eastern Finland, Finland
Wed-O-8-1-4, Time: 17:00–17:20

Automatic speaker verification (ASV) systems are vulnerable to spoofing attacks using speech generated by voice conversion and speech synthesis techniques. Commonly, a countermeasure (CM) system is integrated with an ASV system for improved protection against spoofing attacks. But integration of the two systems is challenging and often leads to increased false rejection rates. Furthermore, the performance of CM severely degrades if in-domain development data are unavailable. In this study, therefore, we propose a solution that uses two separate background models — one from human speech and another from spoofed data. During test, the ASV score for an input utterance is computed as the difference of the log-likelihood against the target model and the combination of the log-likelihoods against two background models. Evaluation experiments are conducted using the joint ASV and CM protocol of ASVspoof 2015 corpus consisting of text-independent ASV tasks with short utterances. Our proposed system reduces error rates in the presence of spoofing attacks by using out-of-domain spoofed data for system development, while maintaining the performance for zero-effort imposter attacks compared to the baseline system.
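
A minimal sketch of the described scoring rule is given below. The paper does not specify the combination used here; the log-sum-exp average of the two background log-likelihoods is an assumption for illustration.

# Score = target-model log-likelihood minus a combination of two background
# log-likelihoods (one human-speech model, one spoofed-data model).
import numpy as np

def asv_score(ll_target, ll_ubm_human, ll_ubm_spoof):
    ll_background = np.logaddexp(ll_ubm_human, ll_ubm_spoof) - np.log(2.0)
    return ll_target - ll_background

print(asv_score(-120.3, -131.0, -128.5))   # made-up log-likelihood values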

VoxCeleb: A Large-Scale Speaker Identification Dataset

Arsha Nagrani, Joon Son Chung, Andrew Zisserman; University of Oxford, UK
Wed-O-8-1-5, Time: 17:20–17:40

Most existing datasets for speaker identification contain samples obtained under quite constrained conditions, and are usually hand-annotated, hence limited in size. The goal of this paper is to generate a large scale text-independent speaker identification dataset collected ‘in the wild’.

We make two contributions. First, we propose a fully automated pipeline based on computer vision techniques to create the dataset from open-source media. Our pipeline involves obtaining videos from YouTube; performing active speaker verification using a two-stream synchronization Convolutional Neural Network (CNN), and confirming the identity of the speaker using CNN based facial recognition. We use this pipeline to curate VoxCeleb which contains hundreds of thousands of ‘real world’ utterances for over 1,000 celebrities.

Our second contribution is to apply and compare various state of the art speaker identification techniques on our dataset to establish baseline performance. We show that a CNN based architecture obtains the best performance for both identification and verification.

Call My Net Corpus: A Multilingual Corpus for Evaluation of Speaker Recognition Technology

Karen Jones, Stephanie Strassel, Kevin Walker, David Graff, Jonathan Wright; University of Pennsylvania, USA
Wed-O-8-1-6, Time: 17:40–18:00

The Call My Net 2015 (CMN15) corpus presents a new resource for Speaker Recognition Evaluation and related technologies. The corpus includes conversational telephone speech recordings for a total of 220 speakers spanning 4 languages: Tagalog, Cantonese, Mandarin and Cebuano. The corpus includes 10 calls per speaker made under


a variety of noise conditions. Calls were manually audited for language, speaker identity and overall quality. The resulting data has been used in the NIST 2016 SRE Evaluation and will be published in the Linguistic Data Consortium catalog. We describe the goals of the CMN15 corpus, including details of the collection protocol and auditing procedure and discussion of the unique properties of this corpus compared to prior NIST SRE evaluation corpora.

Wed-O-8-4 : Speech Translation
B4, 16:00–18:00, Wednesday, 23 Aug. 2017
Chairs: Isabel Trancoso, Nicholas Ruiz

Sequence-to-Sequence Models Can Directly Translate Foreign Speech

Ron J. Weiss 1, Jan Chorowski 1, Navdeep Jaitly 2, Yonghui Wu 1, Zhifeng Chen 1; 1Google, USA; 2NVIDIA, USA
Wed-O-8-4-1, Time: 16:00–16:20

We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another. The model does not explicitly transcribe the speech into text in the source language, nor does it require supervision from the ground truth source language transcription during training. We apply a slightly modified sequence-to-sequence with attention architecture that has previously been used for speech recognition and show that it can be repurposed for this more complex task, illustrating the power of attention-based models.

A single model trained end-to-end obtains state-of-the-art performance on the Fisher Callhome Spanish-English speech translation task, outperforming a cascade of independently trained sequence-to-sequence speech recognition and machine translation models by 1.8 BLEU points on the Fisher test set. In addition, we find that making use of the training data in both languages by multi-task training sequence-to-sequence speech translation and recognition models with a shared encoder network can improve performance by a further 1.4 BLEU points.

Structured-Based Curriculum Learning for End-to-End English-Japanese Speech Translation

Takatomo Kano, Sakriani Sakti, Satoshi Nakamura; NAIST, Japan
Wed-O-8-4-2, Time: 16:20–16:40

Sequence-to-sequence attentional-based neural network architectures have been shown to provide a powerful model for machine translation and speech recognition. Recently, several works have attempted to extend the models for the end-to-end speech translation task. However, the usefulness of these models was only investigated on language pairs with similar syntax and word order (e.g., English-French or English-Spanish). In this work, we focus on end-to-end speech translation tasks on syntactically distant language pairs (e.g., English-Japanese) that require distant word reordering. To guide the encoder-decoder attentional model to learn this difficult problem, we propose a structured-based curriculum learning strategy. Unlike conventional curriculum learning that gradually emphasizes difficult data examples, we formalize learning strategies from easier network structures to more difficult network structures. Here, we start the training with an end-to-end encoder-decoder for speech recognition or text-based machine translation and then gradually move to the end-to-end speech translation task. The experiment results show that the proposed approach could provide significant improvements in comparison with the one without curriculum learning.

Assessing the Tolerance of Neural Machine Translation Systems Against Speech Recognition Errors

Nicholas Ruiz, Mattia Antonino Di Gangi, Nicola Bertoldi, Marcello Federico; FBK, Italy
Wed-O-8-4-3, Time: 16:40–17:00

Machine translation systems are conventionally trained on textual resources that do not model phenomena that occur in spoken language. While the evaluation of neural machine translation systems on textual inputs is actively researched in the literature, little has been discovered about the complexities of translating spoken language data with neural models. We introduce and motivate interesting problems one faces when considering the translation of automatic speech recognition (ASR) outputs on neural machine translation (NMT) systems. We test the robustness of sentence encoding approaches for NMT encoder-decoder modeling, focusing on word-based over byte-pair encoding. We compare the translation of utterances containing ASR errors in state-of-the-art NMT encoder-decoder systems against a strong phrase-based machine translation baseline in order to better understand which phenomena present in ASR outputs are better represented under the NMT framework than approaches that represent translation as a linear model.

Toward Expressive Speech Translation: A Unified Sequence-to-Sequence LSTMs Approach for Translating Words and Emphasis

Quoc Truong Do, Sakriani Sakti, Satoshi Nakamura; NAIST, Japan
Wed-O-8-4-4, Time: 17:00–17:20

Emphasis is an important piece of paralinguistic information that is used to express different intentions, attitudes, or convey emotion. Recent works have tried to translate emphasis by developing additional emphasis estimation and translation components apart from an existing speech-to-speech translation (S2ST) system. Although these approaches can preserve emphasis, they introduce more complexity to the translation pipeline. The emphasis translation component has to wait for the target language sentence and word alignments derived from a machine translation system, resulting in a significant translation delay. In this paper, we proposed an approach that jointly trains and predicts words and emphasis in a unified architecture based on sequence-to-sequence models. The proposed model not only speeds up the translation pipeline but also allows us to perform joint training. Our experiments on the emphasis and word translation tasks showed that we could achieve comparable performance for both tasks compared with previous approaches while eliminating complex dependencies.

NMT-Based Segmentation and Punctuation Insertion for Real-Time Spoken Language Translation

Eunah Cho, Jan Niehues, Alex Waibel; KIT, Germany
Wed-O-8-4-5, Time: 17:20–17:40

Insertion of proper segmentation and punctuation into an ASR transcript is crucial not only for the performance of subsequent applications but also for the readability of the text. In a simultaneous spoken language translation system, the segmentation model has to fulfill real-time constraints and minimize latency as well.

In this paper, we show the successful integration of an attentional encoder-decoder-based segmentation and punctuation insertion model into a real-time spoken language translation system. The proposed technique can be easily integrated into the real-time framework and improve the punctuation performance on reference transcripts as well as on ASR outputs. Compared to the conventional


language model and prosody-based model, our experiments on end-to-end spoken language translation show that translation performance is improved by 1.3 BLEU points by adopting the NMT-based punctuation model, maintaining low latency.

Wed-O-8-6 : Multi-channel Speech Enhancement
C6, 16:00–18:00, Wednesday, 23 Aug. 2017
Chairs: Hynek Boril, Reinhold Haeb-Umbach

Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings

Lukas Drude, Reinhold Haeb-Umbach; Universität Paderborn, Germany
Wed-O-8-6-1, Time: 16:00–16:20

Recent advances in discriminatively trained mask estimation networks to extract a single source utilizing beamforming techniques demonstrate that the integration of statistical models and deep neural networks (DNNs) is a promising approach for robust automatic speech recognition (ASR) applications. In this contribution we demonstrate how discriminatively trained embeddings on spectral features can be tightly integrated into statistical model-based source separation to separate and transcribe overlapping speech. Good generalization to unseen spatial configurations is achieved by estimating a statistical model at test time, while still leveraging discriminative training of deep clustering embeddings on a separate training set. We formulate an expectation maximization (EM) algorithm which jointly estimates a model for deep clustering embeddings and complex-valued spatial observations in the short time Fourier transform (STFT) domain at test time. Extensive simulations confirm that the integrated model outperforms (a) a deep clustering model with a subsequent beamforming step and (b) an EM-based model with a beamforming step alone in terms of signal to distortion ratio (SDR) and perceptually motivated metric (PESQ) gains. ASR results on a reverberated dataset further show that the aforementioned gains translate to reduced word error rates (WERs) even in reverberant environments.

Speaker-Aware Neural Network Based Beamformer for Speaker Extraction in Speech Mixtures

Katerina Žmolíková, Marc Delcroix, Keisuke Kinoshita, Takuya Higuchi, Atsunori Ogawa, Tomohiro Nakatani; NTT, Japan
Wed-O-8-6-2, Time: 16:20–16:40

In this work, we address the problem of extracting one target speaker from a multichannel mixture of speech. We use a neural network to estimate masks to extract the target speaker and derive beamformer filters using these masks, in a similar way as the recently proposed approach for extraction of speech in presence of noise. To overcome the permutation ambiguity of neural network mask estimation, which arises in presence of multiple speakers, we propose to inform the neural network about the target speaker so that it learns to follow the speaker characteristics through the utterance. We investigate and compare different methods of passing the speaker information to the network such as making one layer of the network dependent on speaker characteristics. Experiments on mixture of two speakers demonstrate that the proposed scheme can track and extract a target speaker for both closed and open speaker set cases.

Eigenvector-Based Speech Mask Estimation Using Logistic Regression

Lukas Pfeifenberger, Matthias Zöhrer, Franz Pernkopf; Technische Universität Graz, Austria
Wed-O-8-6-3, Time: 16:40–17:00

In this paper, we use a logistic regression to learn a speech mask from the dominant eigenvector of the Power Spectral Density (PSD) matrix of a multi-channel speech signal corrupted by ambient noise. We employ this speech mask to construct the Generalized Eigenvalue (GEV) beamformer and a Wiener postfilter. Further, we extend the beamformer to compensate for speech distortions. We do not make any assumptions about the array geometry or the characteristics of the speech and noise sources. Those parameters are learned from training data. Our assumptions are that the speaker may move slowly in the near-field of the array, and that the noise is in the far-field. We compare our speech enhancement system against recent contributions using the CHiME4 corpus. We show that our approach yields superior results, both in terms of perceptual speech quality and speech mask estimation error.
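
The front end described above can be approximated with a short sketch: take the dominant eigenvector of a spatial PSD matrix per time-frequency bin and feed a real-valued encoding of it to a logistic-regression speech/noise classifier. The random data, shapes, labels and encoding are all placeholders, not the paper's setup.

# Sketch: dominant eigenvector of a (randomly generated) Hermitian PSD matrix
# per bin, used as input to a logistic-regression mask estimator.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_bins, n_mics = 200, 6
X, y = [], rng.integers(0, 2, n_bins)            # fake speech/noise labels
for _ in range(n_bins):
    A = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
    psd = A @ A.conj().T                         # Hermitian, PSD-like matrix
    w, v = np.linalg.eigh(psd)
    dominant = v[:, -1]                          # eigenvector of the largest eigenvalue
    X.append(np.concatenate([dominant.real, dominant.imag]))
clf = LogisticRegression(max_iter=1000).fit(np.array(X), y)
print(clf.predict_proba(np.array(X[:3]))[:, 1])  # per-bin speech mask values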

Real-Time Speech Enhancement with GCC-NMF

Sean U.N. Wood, Jean Rouat; Université de Sherbrooke, Canada
Wed-O-8-6-4, Time: 17:00–17:20

We develop an online variant of the GCC-NMF blind speech enhancement algorithm and study its performance on two-channel mixtures of speech and real-world noise from the SiSEC separation challenge. While GCC-NMF performs enhancement independently for each time frame, the NMF dictionary, its activation coefficients, and the target TDOA are derived using the entire mixture signal, thus precluding its use online. Pre-learning the NMF dictionary using the CHiME dataset and inferring its activation coefficients online yields similar overall PEASS scores to the mixture-learned method, thus generalizing to new speakers, acoustic environments, and noise conditions. Surprisingly, if we forgo coefficient inference altogether, this approach outperforms both the mixture-learned method and most algorithms from the SiSEC challenge to date. Furthermore, the trade-off between interference suppression and target fidelity may be controlled online by adjusting the target TDOA window width. Finally, integrating online target localization with max-pooled GCC-PHAT yields only somewhat decreased performance compared to offline localization. We test a real-time implementation of the online GCC-NMF blind speech enhancement system on a variety of hardware platforms, with performance made to degrade smoothly with decreasing computational power using smaller pre-learned dictionaries.

Coherence-Based Dual-Channel Noise Reduction Algorithm in a Complex Noisy Environment

Youna Ji, Jun Byun, Young-cheol Park; Yonsei University, Korea
Wed-O-8-6-5, Time: 17:20–17:40

In this paper, a coherence-based noise reduction algorithm is proposed for a dual-channel speech enhancement system operating in a complex noise environment. The spatial coherence between two omnidirectional microphones is one of the crucial pieces of information for the dual-channel speech enhancement system. In this paper, we introduce a new model of coherence function for the complex noise environment in which a target speech coexists with a coherent interference and diffuse noise around. From the coherence model, three numerical methods of computing the normalized signal to interference plus diffuse noise ratio (SINR), which is related to the Wiener filter gain, are derived. Objective parameters measured


from the enhanced speech demonstrate superior performance of the proposed algorithm in terms of speech quality and intelligibility, over the conventional coherence-based noise reduction algorithm.
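
The kind of dual-channel coherence such a system builds on can be illustrated briefly. The simulated signals, delay and STFT settings below are assumptions; the paper's coherence model and the mapping to a Wiener gain are not reproduced here.

# Magnitude-squared coherence between two microphone signals: close to 1 for
# a coherent (directional) source, lower for diffuse noise.
import numpy as np
from scipy.signal import coherence

fs = 16000
source = np.random.randn(fs)                             # common (coherent) component
mic1 = source + 0.1 * np.random.randn(fs)                # independent sensor noise
mic2 = np.roll(source, 8) + 0.1 * np.random.randn(fs)    # small inter-mic delay
f, msc = coherence(mic1, mic2, fs=fs, nperseg=512)
print(f[:4], msc[:4])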

Glottal Model Based Speech Beamforming for ad-hoc Microphone Arrays

Yang Zhang 1, Dinei Florêncio 2, Mark Hasegawa-Johnson 1; 1University of Illinois at Urbana-Champaign, USA; 2Microsoft, USA
Wed-O-8-6-6, Time: 17:40–18:00

We are interested in the task of speech beamforming in conference room meetings, with microphones built in the electronic devices brought and casually placed by meeting participants. This task is challenging because of the inaccuracy in position and interference calibration due to random microphone configuration, variance of microphone quality, reverberation, etc. As a result, not many beamforming algorithms perform better than simply picking the closest microphone in this setting. We propose a beamforming algorithm called Glottal Residual Assisted Beamforming (GRAB). It does not rely on any position or interference calibration. Instead, it incorporates a source-filter speech model and minimizes the energy that cannot be accounted for by the model. Objective and subjective evaluations on both simulation and real-world data show that GRAB is able to suppress noise effectively while keeping the speech natural and dry. Further analyses reveal that GRAB can distinguish contaminated or reverberant channels and take appropriate action accordingly.

Wed-O-8-8 : Speech Recognition: Applications in Medical Practice
D8, 16:00–18:00, Wednesday, 23 Aug. 2017
Chairs: Phil Green, Torbjørn Svendsen

Acoustic Assessment of Disordered Voice with Continuous Speech Based on Utterance-Level ASR Posterior Features

Yuanyuan Liu, Tan Lee, P.C. Ching, Thomas K.T. Law, Kathy Y.S. Lee; Chinese University of Hong Kong, China
Wed-O-8-8-1, Time: 16:00–16:20

Most previous studies on acoustic assessment of disordered voice were focused on extracting perturbation features from isolated vowels produced with steady-state phonation. Natural speech, however, is considered to be more preferable in the aspects of flexibility, effectiveness and reliability for clinical practice. This paper presents an investigation on applying automatic speech recognition (ASR) technology to disordered voice assessment of Cantonese speakers. A DNN-based ASR system is trained using phonetically-rich continuous utterances from normal speakers. It was found that frame-level phone posteriors obtained from the ASR system are strongly correlated with the severity level of voice disorder. Phone posteriors in utterances with severe disorder exhibit significantly larger variation than those with mild disorder. A set of utterance-level posterior features are computed to quantify such variation for pattern recognition purpose. An SVM based classifier is used to classify an input utterance into the categories of mild, moderate and severe disorder. The two-class classification accuracy for mild and severe disorders is 90.3%, and significant confusion between mild and moderate disorders is observed. For some of the subjects with severe voice disorder, the classification results are highly inconsistent among individual utterances. Furthermore, short utterances tend to have more classification errors.

Multi-Stage DNN Training for Automatic Recognition of Dysarthric Speech

Emre Yılmaz, Mario Ganzeboom, Catia Cucchiarini, Helmer Strik; Radboud Universiteit Nijmegen, The Netherlands
Wed-O-8-8-2, Time: 16:20–16:40

Incorporating automatic speech recognition (ASR) in individualized speech training applications is becoming more viable thanks to the improved generalization capabilities of neural network-based acoustic models. The main problem in developing applications for dysarthric speech is the relative in-domain data scarcity. Collecting representative amounts of dysarthric speech data is difficult due to rigorous ethical and medical permission requirements, problems in accessing patients who are generally vulnerable and often subject to altering health conditions and, last but not least, the high variability in speech resulting from different pathological conditions. Developing such applications is even more challenging for languages which in general have fewer resources, fewer speakers and, consequently, also fewer patients than English, as in the case of a mid-sized language like Dutch. In this paper, we investigate a multi-stage deep neural network (DNN) training scheme aimed at obtaining better modeling of dysarthric speech by using only a small amount of in-domain training data. The results show that the system employing the proposed training scheme considerably improves the recognition of Dutch dysarthric speech compared to a baseline system with single-stage training only on a large amount of normal speech or a small amount of in-domain data.

Improving Child Speech Disorder Assessment by Incorporating Out-of-Domain Adult Speech

Daniel Smith 1, Alex Sneddon 2, Lauren Ward 3, Andreas Duenser 1, Jill Freyne 1, David Silvera-Tawil 1, Angela Morgan 4; 1CSIRO, Australia; 2University of Sydney, Australia; 3University of Salford, UK; 4MCRI, Australia
Wed-O-8-8-3, Time: 16:40–17:00

This paper describes the continued development of a system to provide early assessment of speech development issues in children and better triaging to professional services. Whilst corpora of children’s speech are increasingly available, recognition of disordered children’s speech is still a data-scarce task. Transfer learning methods have been shown to be effective at leveraging out-of-domain data to improve ASR performance in similar data-scarce applications. This paper combines transfer learning with previously developed methods for constrained decoding based on expert speech pathology knowledge and knowledge of the target text. Results of this study show that transfer learning with out-of-domain adult speech can improve phoneme recognition for disordered children’s speech. Specifically, a Deep Neural Network (DNN) trained on adult speech and fine-tuned on a corpus of disordered children’s speech reduced the phoneme error rate (PER) of a DNN trained on a children’s corpus from 16.3% to 14.2%. Furthermore, this fine-tuned DNN also improved the performance of a Hierarchal Neural Network based acoustic model previously used by the system with a PER of 19.3%. We close with a discussion of our planned future developments of the system.
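
The generic fine-tuning recipe behind this kind of transfer learning can be sketched as below. The architecture, layer freezing, learning rate and synthetic minibatch are assumptions for illustration and are not taken from the paper.

# PyTorch sketch: take a DNN acoustic model (assumed already trained on adult
# speech) and fine-tune it on a small in-domain batch with a reduced learning rate.
import torch
import torch.nn as nn

adult_dnn = nn.Sequential(
    nn.Linear(440, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 42),                  # 42 hypothetical phoneme targets
)
# ... assume adult_dnn has already been trained on adult speech ...

for p in adult_dnn[0].parameters():       # optionally freeze the lowest layer
    p.requires_grad = False

optim = torch.optim.SGD((p for p in adult_dnn.parameters() if p.requires_grad), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

child_feats = torch.randn(32, 440)        # stand-in fine-tuning minibatch
child_labels = torch.randint(0, 42, (32,))
optim.zero_grad()
loss = loss_fn(adult_dnn(child_feats), child_labels)
loss.backward()
optim.step()
print(float(loss))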

On Improving Acoustic Models for TORGO Dysarthric Speech Database

Neethu Mariam Joy, S. Umesh, Basil Abraham; IIT Madras, India
Wed-O-8-8-4, Time: 17:00–17:20

Assistive technologies based on speech have been shown to improve the quality of life of people affected with dysarthria, a motor speech disorder. Multiple ways to improve Gaussian mixture model-hidden


Markov model (GMM-HMM) and deep neural network (DNN) based automatic speech recognition (ASR) systems for TORGO database for dysarthric speech are explored in this paper. Past attempts in developing ASR systems for TORGO database were limited to training just monophone models and doing speaker adaptation over them. Although a recent work attempted training triphone and neural network models, parameters like the number of context dependent states, dimensionality of the principal component features etc. were not properly tuned. This paper develops speaker-specific ASR models for each dysarthric speaker in TORGO database by tuning parameters of the GMM-HMM model, and the number of layers and hidden nodes in the DNN. Employing a dropout scheme and sequence discriminative training in the DNN also gave significant gains. Speaker adapted features like feature-space maximum likelihood linear regression (FMLLR) are used to pass the speaker information to DNNs. To the best of our knowledge, this paper presents the best recognition accuracies for TORGO database till date.

Glottal Source Features for Automatic Speech-Based Depression Assessment

Olympia Simantiraki 1, Paulos Charonyktakis 2, Anastasia Pampouchidou 3, Manolis Tsiknakis 4, Martin Cooke 1; 1Universidad del País Vasco, Spain; 2Gnosis Data Analysis, Greece; 3Le2i, France; 4TEI Crete, Greece
Wed-O-8-8-5, Time: 17:20–17:40

Depression is one of the most prominent mental disorders, with an increasing rate that makes it the fourth cause of disability worldwide. The field of automated depression assessment has emerged to aid clinicians in the form of a decision support system. Such a system could assist as a pre-screening tool, or even for monitoring high risk populations. Related work most commonly involves multimodal approaches, typically combining audio and visual signals to identify depression presence and/or severity. The current study explores categorical assessment of depression using audio features alone. Specifically, since depression-related vocal characteristics impact the glottal source signal, we examine Phase Distortion Deviation which has previously been applied to the recognition of voice qualities such as hoarseness, breathiness and creakiness, some of which are thought to be features of depressed speech. The proposed method uses as features DCT-coefficients of the Phase Distortion Deviation for each frequency band. An automated machine learning tool, Just Add Data, is used to classify speech samples. The method is evaluated on a benchmark dataset (AVEC2014), in two conditions: read-speech and spontaneous-speech. Our findings indicate that Phase Distortion Deviation is a promising audio-only feature for automated detection and assessment of depressed speech.

Speech Processing Approach for Diagnosing Dementia in an Early Stage

Roozbeh Sadeghian 1, J. David Schaffer 2, Stephen A. Zahorian 2; 1Harrisburg University of Science & Technology, USA; 2Binghamton University, USA
Wed-O-8-8-6, Time: 17:40–18:00

The clinical diagnosis of Alzheimer’s disease and other dementias is very challenging, especially in the early stages. Our hypothesis is that any disease that affects particular brain regions involved in speech production and processing will also leave detectable fingerprints in the speech. Computerized analysis of speech signals and computational linguistics have progressed to the point where an automatic speech analysis system is a promising approach for a low-cost non-invasive diagnostic tool for early detection of Alzheimer’s disease.

We present empirical evidence that strong discrimination between subjects with a diagnosis of probable Alzheimer’s versus matched normal controls can be achieved with a combination of acoustic features from speech, linguistic features extracted from an automatically determined transcription of the speech including punctuation, and results of a mini mental state exam (MMSE). We also show that discrimination is nearly as strong even if the MMSE is not used, which implies that a fully automated system is feasible. Since commercial automatic speech recognition (ASR) tools were unable to provide transcripts for about half of our speech samples, a customized ASR system was developed.

Wed-O-8-10 : Language models for ASR
E10, 16:00–18:00, Wednesday, 23 Aug. 2017
Chairs: Yannick Estève, Dilek Hakkani-Tür

Effectively Building Tera Scale MaxEnt Language Models Incorporating Non-Linguistic Signals

Fadi Biadsy, Mohammadreza Ghodsi, Diamantino Caseiro; Google, USA
Wed-O-8-10-1, Time: 16:00–16:20

Maximum Entropy (MaxEnt) language models are powerful models that can incorporate linguistic and non-linguistic contextual signals in a unified framework with a convex loss. MaxEnt models also have the advantage of scaling to large model and training data sizes. We present the following two contributions to MaxEnt training: (1) By leveraging smaller amounts of transcribed data, we demonstrate that a MaxEnt LM trained on various types of corpora can be easily adapted to better match the test distribution of Automatic Speech Recognition (ASR); (2) A novel adaptive-training approach that efficiently models multiple types of non-linguistic features in a universal model. We evaluate the impact of these approaches on Google’s state-of-the-art ASR for the task of voice-search transcription and dictation. Training 10B parameter models utilizing a corpus of up to 1T words, we show large reductions in word error rate from adaptation across multiple languages. Also, human evaluations show significant improvements on a wide range of domains from using non-linguistic features. For example, adapting to geographical domains (e.g., US States and cities) affects about 4% of test utterances, with a 2:1 win to loss ratio.

Semi-Supervised Adaptation of RNNLMs by Fine-Tuning with Domain-Specific Auxiliary Features

Salil Deena, Raymond W.M. Ng, Pranava Madhyastha, Lucia Specia, Thomas Hain; University of Sheffield, UK
Wed-O-8-10-2, Time: 16:20–16:40

Recurrent neural network language models (RNNLMs) can be augmented with auxiliary features, which can provide an extra modality on top of the words. It has been found that RNNLMs perform best when trained on a large corpus of generic text and then fine-tuned on text corresponding to the sub-domain for which it is to be applied. However, in many cases the auxiliary features are available for the sub-domain text but not for the generic text. In such cases, semi-supervised techniques can be used to infer such features for the generic text data such that the RNNLM can be trained and then fine-tuned on the available in-domain data with corresponding auxiliary features.

In this paper, several novel approaches are investigated for dealing with the semi-supervised adaptation of RNNLMs with auxiliary features as input. These approaches include: using zero features during training to mask the weights of the feature sub-network; adding the feature sub-network only at the time of fine-tuning; deriving the features using a parametric model; and back-propagating to infer the features on the generic text. These approaches are investigated and


results are reported both in terms of PPL and WER on a multi-genre broadcast ASR task.

Approximated and Domain-Adapted LSTM Language Models for First-Pass Decoding in Speech Recognition

Mittul Singh, Youssef Oualil, Dietrich Klakow; Universität des Saarlandes, Germany
Wed-O-8-10-3, Time: 16:40–17:00

Traditionally, short-range Language Models (LMs) like the conventional n-gram models have been used for language model adaptation. Recent work has improved performance for such tasks using adapted long-span models like Recurrent Neural Network LMs (RNNLMs). With the first pass performed using a large background n-gram LM, the adapted RNNLMs are mostly used to rescore lattices or N-best lists, as a second step in the decoding process. Ideally, these adapted RNNLMs should be applied for first-pass decoding. Thus, we introduce two ways of applying adapted long-short-term-memory (LSTM) based RNNLMs for first-pass decoding. Using available techniques to convert LSTMs to approximated versions for first-pass decoding, we compare approximated LSTMs adapted in a Fast Marginal Adaptation framework (FMA) and an approximated version of architecture-based-adaptation of LSTM. On a conversational speech recognition task, these differently approximated and adapted LSTMs combined with a trigram LM outperform other adapted and unadapted LMs. Here, the architecture-adapted LSTM combination obtains a 35.9% word error rate (WER) and is outperformed by FMA-based LSTM combination obtaining the overall lowest WER of 34.4%.

Sparse Non-Negative Matrix Language Modeling: Maximum Entropy Flexibility on the Cheap

Ciprian Chelba, Diamantino Caseiro, Fadi Biadsy; Google, USA
Wed-O-8-10-4, Time: 17:00–17:20

We present a new method for estimating the sparse non-negative model (SNM) by using a small amount of held-out data and the multinomial loss that is natural for language modeling; we validate it experimentally against the previous estimation method which uses leave-one-out on training data and a binary loss function and show that it performs equally well. Being able to train on held-out data is very important in practical situations where training data is mismatched from held-out/test data. We find that fairly small amounts of held-out data (on the order of 30–70 thousand words) are sufficient for training the adjustment model, which is the only model component estimated using gradient descent; the bulk of model parameters are relative frequencies counted on training data.

A second contribution is a comparison between SNM and the related class of Maximum Entropy language models. While much cheaper computationally, we show that SNM achieves slightly better perplexity results for the same feature set and same speech recognition accuracy on voice search and short message dictation.

Multi-Scale Context Adaptation for Improving Child Automatic Speech Recognition in Child-Adult Spoken Interactions

Manoj Kumar, Daniel Bone, Kelly McWilliams, Shanna Williams, Thomas D. Lyon, Shrikanth S. Narayanan; University of Southern California, USA
Wed-O-8-10-5, Time: 17:20–17:40

The mutual influence of participant behavior in a dyadic interaction has been studied for different modalities and quantified by computational models. In this paper, we consider the task of automatic recognition for children’s speech, in the context of child-adult spoken interactions during interviews of children suspected to have been maltreated. Our long-term goal is to provide insights within this immensely important, sensitive domain through large-scale lexical and paralinguistic analysis. We demonstrate improvement in child speech recognition accuracy by conditioning on both the domain and the interlocutor’s (adult) speech. Specifically, we use information from the automatic speech recognizer outputs of the adult’s speech, for which we have more reliable estimates, to modify the recognition system of child’s speech in an unsupervised manner. By learning first at session level, and then at the utterance level, we demonstrate an absolute improvement of up to 28% WER and 55% perplexity over the baseline results. We also report results of a parallel human speech recognition (HSR) experiment where annotators are asked to transcribe child’s speech under two conditions: with and without contextual speech information. Demonstrated ASR improvements and the HSR experiment illustrate the importance of context in aiding child speech recognition, whether by humans or computers.

Using Knowledge Graph and Search Query Click Logs in Statistical Language Model for Speech Recognition

Weiwu Zhu; Microsoft, USA
Wed-O-8-10-6, Time: 17:40–18:00

This paper demonstrates how Knowledge Graph (KG) and Search Query Click Logs (SQCL) can be leveraged in statistical language models to improve named entity recognition for online speech recognition systems. Because they are missing from the training data, some named entities may be recognized as other common words that have similar pronunciations. KG and SQCL cover comprehensive and fresh named entities and queries that can be used to mitigate the wrong recognition. First, all the entities located in the same area in KG are clustered together, and the queries that contain the entity names are selected from SQCL as the training data of a geographical statistical language model for each entity cluster. These geographical language models make the unseen named entities less likely to occur during the model training, and can be dynamically switched according to the user location in the recognition phase. Second, if any named entities are identified in the previous utterances within a conversational dialog, the probability of the n-best word sequence paths that contain their related entities will be increased for the current utterance by utilizing the entity relationships from KG and SQCL. This way, the system can leverage the long-term contexts within the dialog. Experiments for the proposed approach on voice queries from a spoken dialog system yielded a 12.5% relative perplexity reduction in the language model measurement, and a 1.1% absolute word error rate reduction in the speech recognition measurement.

Wed-P-6-1 : Speech Recognition: Technologies and Applications for New Paradigms
Poster 1, 10:00–12:00, Wednesday, 23 Aug. 2017
Chair: Kris Demuynck

Developing On-Line Speaker Diarization System

Dimitrios Dimitriadis 1, Petr Fousek 2; 1IBM, USA; 2IBM, Czech Republic
Wed-P-6-1-1, Time: 10:00–12:00

In this paper we describe the process of converting a research prototype system for Speaker Diarization into a fully deployed product running in real time and with low latency. The deployment is a part of the IBM Cloud Speech-to-Text (STT) Service. First, the prototype system is described and the requirements for the on-line,


deployable system are introduced. Then we describe the technical approaches we took to satisfy these requirements and discuss some of the challenges we have faced. In particular, we present novel ideas for speeding up the system by using Automatic Speech Recognition (ASR) transcripts as an input to diarization, we introduce a concept of active window to keep the computational complexity linear, we improve the speaker model using a new speaker-clustering algorithm, we automatically keep track of the number of active speakers and we enable the users to set an operating point on a continuous scale between low latency and optimal accuracy. The deployed system has been tuned on real-life data reaching average Speaker Error Rates around 3% and improving over the prototype system by about 10% relative.

Comparison of Non-Parametric Bayesian Mixture Models for Syllable Clustering and Zero-Resource Speech Processing

Shreyas Seshadri 1, Ulpu Remes 2, Okko Räsänen 1; 1Aalto University, Finland; 2University of Helsinki, Finland
Wed-P-6-1-2, Time: 10:00–12:00

Zero-resource speech processing (ZS) systems aim to learn structural representations of speech without access to labeled data. A starting point for these systems is the extraction of syllable tokens utilizing the rhythmic structure of a speech signal. Several recent ZS systems have therefore focused on clustering such syllable tokens into linguistically meaningful units. These systems have so far used a heuristically set number of clusters, which can, however, be highly dataset dependent and cannot be optimized in actual unsupervised settings. This paper focuses on improving the flexibility of ZS systems using Bayesian non-parametric (BNP) mixture models that are capable of simultaneously learning the cluster models as well as their number based on the properties of the dataset. We also compare different model design choices, namely priors over the weights and the cluster component models, as the impact of these choices is rarely reported in the previous studies. Experiments are conducted using conversational speech from several languages. The models are first evaluated in a separate syllable clustering task and then as a part of a full ZS system in order to examine the potential of BNP methods and illuminate the relative importance of different model design choices.
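
A truncated Dirichlet-process Gaussian mixture gives a quick feel for how such models infer the number of clusters from data. This is a generic scikit-learn illustration with random stand-in features, not the specific BNP models compared in the paper.

# Bayesian non-parametric style clustering: surplus mixture components get
# near-zero weight, so the effective cluster count is learned from the data.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
syllable_feats = np.vstack([rng.normal(m, 0.3, size=(100, 10)) for m in (-2, 0, 2)])
dpgmm = BayesianGaussianMixture(
    n_components=20,                               # truncation level, not the final count
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500, random_state=0,
).fit(syllable_feats)
print(np.sum(dpgmm.weights_ > 0.01), "clusters effectively used")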

Automatic Evaluation of Children Reading Aloud on Sentences and Pseudowords

Jorge Proença 1, Carla Lopes 1, Michael Tjalve 2, Andreas Stolcke 2, Sara Candeias 3, Fernando Perdigão 1; 1Instituto de Telecomunicações, Portugal; 2Microsoft, USA; 3Microsoft, Portugal
Wed-P-6-1-3, Time: 10:00–12:00

Reading aloud performance in children is typically assessed by teachers on an individual basis, manually marking reading time and incorrectly read words. A computational tool that assists with recording reading tasks, automatically analyzing them and providing performance metrics could be a significant help. Towards that goal, this work presents an approach to automatically predicting the overall reading aloud ability of primary school children (6–10 years old), based on the reading of sentences and pseudowords. The opinions of primary school teachers, who provided 0–5 scores closely related to the expectations at the end of each grade, were gathered as ground truth of performance. To predict these scores automatically, features based on reading speed and number of disfluencies were extracted, after an automatic disfluency detection. Various regression models were trained, with Gaussian process regression giving best results for automatic features. Feature selection from both sentence and pseudoword reading tasks gave the closest predictions, with a correlation of 0.944. Compared to the use of manual annotation with the best correlation being 0.952, automatic annotation was only 0.8% worse. Furthermore, the error rate of predicted scores relative to ground truth was found to be smaller than the deviation of evaluators’ opinion per child.
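
The regression step can be sketched in a few lines. The feature values, kernel choice and scores below are synthetic placeholders rather than the paper's features or data.

# Predict a 0-5 overall reading score from reading-speed and disfluency
# features with Gaussian process regression (scikit-learn).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

X = np.array([[2.1, 0.05], [1.4, 0.20], [3.0, 0.02], [1.0, 0.35]])  # [words/s, disfluency rate]
y = np.array([3.5, 2.0, 4.8, 1.0])                                  # teacher scores (0-5)
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True).fit(X, y)
pred, std = gpr.predict(np.array([[2.5, 0.10]]), return_std=True)
print(pred, std)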

Off-Topic Spoken Response Detection with Word Embeddings

Su-Youn Yoon, Chong Min Lee, Ikkyu Choi, Xinhao Wang, Matthew Mulholland, Keelan Evanini; Educational Testing Service, USA
Wed-P-6-1-4, Time: 10:00–12:00

In this study, we developed an automated off-topic response detection system as a supplementary module for an automated proficiency scoring system for non-native English speakers’ spontaneous speech. Given a spoken response, the system first generates an automated transcription using an ASR system trained on non-native speech, and then generates a set of features to assess similarity to the question. In contrast to previous studies which required a large set of training responses for each question, the proposed system only requires the question text, thus increasing the practical impact of the system, since new questions can be added to a test dynamically. However, questions are typically short and the traditional approach based on exact word matching does not perform well. In order to address this issue, a set of features based on neural embeddings and a convolutional neural network (CNN) were used. A system based on the combination of all features achieved an accuracy of 87% on a balanced dataset, which was substantially higher than the accuracy of a baseline system using question-based vector space models (49%). Additionally, this system almost reached the accuracy of a vector space based model using a large set of responses to test questions (93%).
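
One simple embedding-based similarity feature of the kind referred to above can be illustrated as follows. The tiny random vocabulary stands in for pretrained word embeddings, and the paper additionally uses CNN-based features not shown here.

# Toy similarity feature: cosine similarity between the averaged word vectors
# of the question text and the ASR transcript of the response.
import numpy as np

rng = np.random.default_rng(2)
vocab = {w: rng.standard_normal(50) for w in
         "describe your favorite teacher i like my math because".split()}

def avg_embedding(text):
    vecs = [vocab[w] for w in text.split() if w in vocab]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "describe your favorite teacher"
response = "i like my math teacher because"
print(cosine(avg_embedding(question), avg_embedding(response)))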

Improving Mispronunciation Detection for Non-Native Learners with Multisource Information and LSTM-Based Deep Models

Wei Li 1, Nancy F. Chen 2, Sabato Marco Siniscalchi 1, Chin-Hui Lee 1; 1Georgia Institute of Technology, USA; 2A*STAR, Singapore
Wed-P-6-1-5, Time: 10:00–12:00

In this paper, we utilize manner and place of articulation features and deep neural network models (DNNs) with long short-term memory (LSTM) to improve the detection performance of phonetic mispronunciations produced by second language learners. First, we show that speech attribute scores are complementary to conventional phone scores, so they can be concatenated as features to improve a baseline system based only on phone information. Next, pronunciation representation, usually calculated by frame-level averaging in a DNN, is now learned by LSTM, which directly uses sequential context information to embed a sequence of pronunciation scores into a pronunciation vector to improve the performance of subsequent mispronunciation detectors. Finally, when both proposed techniques are incorporated into the baseline phone-based GOP (goodness of pronunciation) classifier system trained on the same data, the integrated system reduces the false acceptance rate (FAR) and false rejection rate (FRR) by 37.90% and 38.44% (relative), respectively, from the baseline system.


Automatic Explanation Spot Estimation Method Targeted at Text and Figures in Lecture Slides

Shoko Tsujimura 1, Kazumasa Yamamoto 2, Seiichi Nakagawa 1; 1Toyohashi University of Technology, Japan; 2Chubu University, Japan
Wed-P-6-1-6, Time: 10:00–12:00

Because of the spread of the Internet in recent years, e-learning, which is a form of learning through the Internet, has been used in school education. Many lecture videos delivered at The Open University of Japan show lecturers and lecture slides alternately. In such video style, it is hard to understand where on the slide the lecturer is explaining. In this paper, we examined methods to automatically estimate spots where the lecturer explains on the slide using lecture speech and slide data. This technology is expected to help learners to study the lectures. For itemized text slides, using DTW with word embedding based distance, we obtained higher estimation accuracy than a previous work. For slides containing figures, we estimated explanation spots using image classification results and text in the charts. In addition, we modified the lecture browsing system to indicate estimation results on slides, and investigated the usefulness of indicating explanation spots by subjective evaluation with the system.
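
A DTW alignment over a word-embedding distance, as mentioned above, can be sketched generically. The embeddings are random placeholders and the simple accumulated-cost DTW below is a stand-in for whatever alignment variant the paper uses.

# DTW over a cosine-distance matrix between recognised lecture words and
# slide bullet items (all embeddings here are random stand-ins).
import numpy as np

def dtw(dist):
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

rng = np.random.default_rng(3)
speech_words = rng.standard_normal((12, 100))   # embeddings of recognised lecture words
slide_items = rng.standard_normal((4, 100))     # embeddings of slide bullet items
norm = lambda X: X / np.linalg.norm(X, axis=1, keepdims=True)
cost = 1.0 - norm(speech_words) @ norm(slide_items).T   # cosine distance matrix
print(dtw(cost))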

Multiview Representation Learning via Deep CCA for Silent Speech Recognition

Myungjong Kim 1, Beiming Cao 1, Ted Mau 2, Jun Wang 1; 1University of Texas at Dallas, USA; 2UT Southwestern, USA
Wed-P-6-1-7, Time: 10:00–12:00

Silent speech recognition (SSR) converts non-audio information such as articulatory (tongue and lip) movements to text. Articulatory movements generally have less information than acoustic features for speech recognition, and therefore, the performance of SSR may be limited. Multiview representation learning, which can learn better representations by analyzing multiple information sources simultaneously, has been recently successfully used in speech processing and acoustic speech recognition. However, it has rarely been used in SSR. In this paper, we investigate SSR based on multiview representation learning via canonical correlation analysis (CCA). When both acoustic and articulatory data are available during training, it is possible to effectively learn a representation of articulatory movements from the multiview data with CCA. To further represent the complex structure of the multiview data, we apply deep CCA, where the functional form of the feature mapping is a deep neural network. This approach was evaluated in a speaker-independent SSR task using a data set collected from seven English speakers using an electromagnetic articulograph (EMA). Experimental results showed the effectiveness of the multiview representation learning via deep CCA over the CCA-based multiview approach as well as baseline articulatory movement data on Gaussian mixture model and deep neural network-based SSR systems.
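
The linear CCA step underlying the multiview idea can be shown compactly; deep CCA replaces the linear maps with neural networks. Feature dimensions and data are arbitrary placeholders, not the EMA/MFCC setup of the paper.

# Learn projections that correlate the articulatory and acoustic views at
# training time; at test time only the articulatory projection is available.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(4)
articulatory = rng.standard_normal((500, 24))    # e.g. articulograph sensor features
acoustic = rng.standard_normal((500, 39))        # e.g. MFCC features
cca = CCA(n_components=8).fit(articulatory, acoustic)
art_proj = cca.transform(articulatory)           # view used by the SSR back end
print(art_proj.shape)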

Use of Graphemic Lexicons for Spoken Language Assessment

K.M. Knill, Mark J.F. Gales, K. Kyriakopoulos, A. Ragni, Y. Wang; University of Cambridge, UK
Wed-P-6-1-8, Time: 10:00–12:00

Automatic systems for practice and exams are essential to support the growing worldwide demand for learning English as an additional language. Assessment of spontaneous spoken English is, however, currently limited in scope due to the difficulty of achieving sufficient automatic speech recognition (ASR) accuracy. “Off-the-shelf” English ASR systems cannot model the exceptionally wide variety of accents, pronunciations and recording conditions found in non-native learner data. Limited training data for different first languages (L1s), across all proficiency levels, often with (at most) crowd-sourced transcriptions, limits the performance of ASR systems trained on non-native English learner speech. This paper investigates whether the effect of one source of error in the system, lexical modelling, can be mitigated by using graphemic lexicons in place of phonetic lexicons based on native speaker pronunciations. Graphemic-based English ASR is typically worse than phonetic-based due to the irregularity of English spelling-to-pronunciation but here lower word error rates are consistently observed with the graphemic ASR. The effect of using graphemes on automatic assessment is assessed on different grader feature sets: audio and fluency derived features, including some phonetic level features; and phone/grapheme distance features which capture a measure of pronunciation ability.

Distilling Knowledge from an Ensemble of Models for Punctuation Prediction

Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Ya Li; Chinese Academy of Sciences, China
Wed-P-6-1-9, Time: 10:00–12:00

This paper proposes an approach to distill knowledge from an ensemble of models to a single deep neural network (DNN) student model for punctuation prediction. This approach makes the DNN student model mimic the behavior of the ensemble. The ensemble consists of three single models. Kullback-Leibler (KL) divergence is used to minimize the difference between the output distribution of the DNN student model and the behavior of the ensemble. Experimental results on the English IWSLT2011 dataset show that the ensemble outperforms the previous state-of-the-art model by up to 4.0% absolute in overall F1-score. The DNN student model also achieves up to 13.4% absolute overall F1-score improvement over the conventionally-trained baseline models.
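
The distillation objective mentioned above reduces to a KL-divergence term between the averaged ensemble distribution and the student distribution, which can be written out directly. The probabilities and tag set below are made up; in the paper both teacher and student are DNNs over punctuation tags.

# KL divergence between the ensemble-averaged output distribution (teacher)
# and the student's output distribution, used as the distillation loss term.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

ensemble_probs = np.mean([[0.7, 0.2, 0.1],        # model 1: none/comma/period
                          [0.6, 0.3, 0.1],        # model 2
                          [0.8, 0.1, 0.1]], axis=0)   # model 3
student_probs = np.array([0.5, 0.3, 0.2])
print(kl_divergence(ensemble_probs, student_probs))  # quantity to minimise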

A Mostly Data-Driven Approach to Inverse Text Normalization

Ernest Pusateri 1, Bharat Ram Ambati 1, Elizabeth Brooks 1, Ondrej Platek 2, Donald McAllaster 1, Venki Nagesha 1; 1Apple, USA; 2Charles University, Czech Republic
Wed-P-6-1-10, Time: 10:00–12:00

For an automatic speech recognition system to produce sensibly formatted, readable output, the spoken-form token sequence produced by the core speech recognizer must be converted to a written-form string. This process is known as inverse text normalization (ITN). Here we present a mostly data-driven ITN system that leverages a set of simple rules and a few hand-crafted grammars to cast ITN as a labeling problem. To this labeling problem, we apply a compact bi-directional LSTM. We show that the approach performs well using practical amounts of training data.

Mismatched Crowdsourcing from Multiple Annotator Languages for Recognizing Zero-Resourced Languages: A Nullspace Clustering Approach

Wenda Chen 1, Mark Hasegawa-Johnson 1, Nancy F. Chen 2, Boon Pang Lim 2; 1University of Illinois at Urbana-Champaign, USA; 2A*STAR, Singapore
Wed-P-6-1-11, Time: 10:00–12:00

It is extremely challenging to create training labels for building acoustic models of zero-resourced languages, in which conventional


resources required for model training — lexicons, transcribed audio, or in extreme cases even an orthographic system or a viable phone set design for the language — are unavailable. Here, language mismatched transcripts, in which audio is transcribed in the orthographic system of a completely different language by possibly non-speakers of the target language, may play a vital role. Such mismatched transcripts have recently been successfully obtained through crowdsourcing and shown to be beneficial to ASR performance. This paper further studies this problem of using mismatched crowdsourced transcripts in a tonal language for which we have no standard orthography, and in which we may not even know the phoneme inventory. It proposes methods to project the multilingual mismatched transcriptions of a tonal language to the target phone segments. The results tested on Cantonese and Singapore Hokkien have shown that the reconstructed phone sequences’ accuracies have an absolute increment of more than 3% from those of previously proposed monolingual probabilistic transcription methods.

Experiments in Character-Level Neural Network Models for Punctuation

William Gale 1, Sarangarajan Parthasarathy 2; 1University of Adelaide, Australia; 2Microsoft, USA
Wed-P-6-1-12, Time: 10:00–12:00

We explore character-level neural network models for inferring punctuation from text-only input. Punctuation inference is treated as a sequence tagging problem where the input is a sequence of unpunctuated characters, and the output is a corresponding sequence of punctuation tags. We experiment with six architectures, all of which use a long short-term memory (LSTM) network for sequence modeling. They differ in the way the context and lookahead for a given character is derived: from simple character embedding and delayed output to enable lookahead, to complex convolutional neural networks (CNN) to capture context. We demonstrate that the accuracy of the proposed character-level models is competitive with the accuracy of a state-of-the-art word-level Conditional Random Field (CRF) baseline with carefully crafted features.

Multi-Channel Apollo Mission Speech Transcripts Calibration

Lakshmish Kaushik, Abhijeet Sangwan, John H.L. Hansen; University of Texas at Dallas, USA
Wed-P-6-1-13, Time: 10:00–12:00

NASA’s Apollo program is a great achievement of mankind in the 20th century. Previously we introduced the UTD-CRSS Apollo data digitization initiative, in which we proposed to digitize Apollo mission speech data (∼100,000 hours) and develop Spoken Language Technology based algorithms to analyze and understand various aspects of conversational speech [1]. A new 30-track analog audio decoder was designed to decode 30-track Apollo analog tapes and is mounted onto the NASA Soundscriber analog audio decoder (in place of the single-channel decoder). Using the new decoder, all 30 channels of data can be decoded simultaneously, thereby reducing the digitization time significantly. We have digitized 19,000 hours of data from Apollo missions (including the entire Apollo-11 mission and most of the Apollo-13, Apollo-1, and Gemini-8 missions). Each audio track corresponds to specific personnel/positions in the NASA mission control room or to astronauts in space. Since many of the planned Apollo-related spoken language technology approaches need transcripts, we have developed an Apollo mission specific custom Deep Neural Network (DNN) based Automatic Speech Recognition (ASR) system. Apollo-specific language models are developed. Most audio channels are degraded due to high channel noise, system noise, attenuated signal bandwidth, transmission noise, cosmic noise, analog tape static noise, noise due to tape aging, etc. In this paper we propose a novel method to improve the transcript quality by using the Signal-to-Noise Ratio of channels and N-gram sentence similarity metrics across data channels. The proposed method shows significant improvement in the transcript quality of noisy channels. The Word Error Rate (WER) analysis of transcripts across channels shows a significant reduction.
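
As a hedged illustration (not the authors' code), cross-channel agreement of the kind used above can be scored with a simple N-gram overlap between the hypothesis texts of two channels; the Jaccard-style function and variable names below are assumptions.

    def ngrams(tokens, n):
        """Return the set of word n-grams in a token list."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def ngram_similarity(hyp_a, hyp_b, n=2):
        """Jaccard overlap of word n-grams between two channel transcripts."""
        a, b = ngrams(hyp_a.split(), n), ngrams(hyp_b.split(), n)
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

    # Toy usage: compare a noisy channel's hypothesis against a cleaner channel's.
    print(ngram_similarity("go for landing thirty seconds",
                           "go for landing thirty second"))

A low similarity on a low-SNR channel would flag that transcript segment for correction from better channels.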

Wed-P-6-2 : Speaker and Language Recognition Applications
Poster 2, 10:00–12:00, Wednesday, 23 Aug. 2017
Chair: Mitchell McLaren

Calibration Approaches for Language Detection

Mitchell McLaren 1, Luciana Ferrer 2, Diego Castan 1, Aaron Lawson 1; 1SRI International, USA; 2Universidad de Buenos Aires, Argentina
Wed-P-6-2-1, Time: 10:00–12:00

To date, automatic spoken language detection research has largely been based on a closed-set paradigm, in which the languages to be detected are known prior to system application. In actual practice, such systems may face previously unseen languages (out-of-set (OOS) languages) which should be rejected, a common problem that has received limited attention from the research community. In this paper, we focus on situations in which either (1) the system-modeled languages are not observed during use or (2) the test data contains OOS languages that are unseen during modeling or calibration. In these situations, the common multi-class objective function for calibration of language-detection scores is problematic. We describe how the assumptions of multi-class calibration are not always fulfilled in a practical sense and explore applying global and language-dependent binary objective functions to relax system constraints. We contrast the benefits and sensitivities of the calibration approaches on practical scenarios by presenting results using both LRE09 data and 14 languages from the BABEL dataset. We show that the global binary approach is less sensitive to the characteristics of the training data and that OOS modeling with individual detectors is the best option when OOS test languages are not known to the system.

Bidirectional Modelling for Short Duration Language Identification

Sarith Fernando, Vidhyasaharan Sethu, Eliathamby Ambikairajah, Julien Epps; University of New South Wales, Australia
Wed-P-6-2-2, Time: 10:00–12:00

Language identification (LID) systems typically employ i-vectors as fixed length representations of utterances. However, it may not be possible to reliably estimate i-vectors from short utterances, which in turn could lead to reduced language identification accuracy. Recently, Long Short Term Memory networks (LSTMs) have been shown to better model short utterances in the context of language identification. This paper explores the use of bidirectional LSTMs for language identification with the aim of modelling temporal dependencies between past and future frame based features in short utterances. Specifically, an end-to-end system for short duration language identification employing bidirectional LSTM models of utterances is proposed. Evaluations on both NIST 2007 and 2015 LRE show state-of-the-art performance.


Conditional Generative Adversarial Nets Classifier for Spoken Language Identification

Peng Shen, Xugang Lu, Sheng Li, Hisashi Kawai; NICT, Japan
Wed-P-6-2-3, Time: 10:00–12:00

The i-vector technique using deep neural networks has been successfully applied in spoken language identification systems. Neural network modeling has shown its effectiveness as both discriminant feature transformation and classification in many tasks, in particular with a large training data set. However, on a small data set, neural networks suffer from the overfitting problem, which degrades the performance. Many strategies have been investigated and used to improve regularization for deep neural networks, for example, weight decay, dropout, and data augmentation. In this paper, we study and use conditional generative adversarial nets as a classifier for the spoken language identification task. Unlike previous work on GANs for image generation, our purpose is to focus on improving the regularization of the neural network by jointly optimizing the “Real/Fake” objective function and the categorical objective function. Compared with the dropout and data augmentation methods, the proposed method obtained 29.7% and 31.8% relative improvements, respectively, on the NIST 2015 i-vector challenge data set for spoken language identification.

Tied Hidden Factors in Neural Networks for End-to-End Speaker Recognition

Antonio Miguel, Jorge Llombart, Alfonso Ortega, Eduardo Lleida; Universidad de Zaragoza, Spain
Wed-P-6-2-4, Time: 10:00–12:00

In this paper we propose a method to model speaker and session variability that is able to generate likelihood ratios using neural networks in an end-to-end phrase dependent speaker verification system. As in Joint Factor Analysis, the model uses tied hidden variables to model speaker and session variability and a MAP adaptation of some of the parameters of the model. In the training procedure our method jointly estimates the network parameters and the values of the speaker and channel hidden variables. This is done in a two-step backpropagation algorithm: first the network weights and factor loading matrices are updated, and then the hidden variables, whose gradients are calculated by aggregating the corresponding speaker or session frames, since these hidden variables are tied. The last layer of the network is defined as a linear regression probabilistic model whose inputs are the previous layer outputs. This choice has the advantage that it produces likelihoods and, additionally, it can be adapted during enrolment using MAP without the need for gradient optimization. The decisions are made based on the ratio of the output likelihoods of two neural network models, speaker adapted and universal background model. The method was evaluated on the RSR2015 database.

Speaker Clustering by Iteratively Finding Discriminative Feature Space and Cluster Labels

Sungrack Yun, Hye Jin Jang, Taesu Kim; Qualcomm, Korea
Wed-P-6-2-5, Time: 10:00–12:00

This paper presents a speaker clustering framework that iteratively performs two stages: a discriminative feature space is obtained given a cluster label set, and the cluster label set is updated using a clustering algorithm given the feature space. During the iterations of the two stages, the cluster labels may differ from the true labels, and thus the feature space obtained from those labels may be poorly discriminated. However, by iteratively performing the above two stages, more accurate cluster labels and a more discriminative feature space can be obtained until they finally converge. In this research, linear discriminant analysis is used for discriminating the i-vector feature space, and variational Bayesian expectation-maximization on a Gaussian mixture model is used for clustering the i-vectors. Our iterative clustering framework was evaluated using a database of keyword utterances and compared with recently published approaches. In all experiments, the results show that our framework outperforms the other approaches and converges in a few iterations.
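
A hedged sketch of the alternating procedure described above, using scikit-learn's LinearDiscriminantAnalysis and BayesianGaussianMixture as stand-ins for the authors' LDA and variational Bayesian EM implementations; the i-vector matrix, cluster count and stopping rule are placeholders.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.mixture import BayesianGaussianMixture

    def iterative_speaker_clustering(ivectors, n_clusters=8, n_iters=5, seed=0):
        """Alternate between (1) learning a discriminative projection from the
        current labels and (2) re-clustering in the projected space."""
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, n_clusters, size=len(ivectors))  # random init
        for _ in range(n_iters):
            classes = np.unique(labels)
            if len(classes) < 2:
                break
            lda = LinearDiscriminantAnalysis(
                n_components=min(len(classes) - 1, ivectors.shape[1]))
            projected = lda.fit_transform(ivectors, labels)       # stage 1: feature space
            gmm = BayesianGaussianMixture(n_components=n_clusters, random_state=seed)
            new_labels = gmm.fit_predict(projected)               # stage 2: cluster labels
            if np.array_equal(new_labels, labels):                # labels stable: converged
                break
            labels = new_labels
        return labels

    # Toy usage with random 100-dimensional "i-vectors" for 200 utterances.
    labels = iterative_speaker_clustering(np.random.randn(200, 100))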

Domain Adaptation of PLDA Models in Broadcast Diarization by Means of Unsupervised Speaker Clustering

Ignacio Viñals 1, Alfonso Ortega 1, Jesús Villalba 2, Antonio Miguel 1, Eduardo Lleida 1; 1Universidad de Zaragoza, Spain; 2Johns Hopkins University, USA
Wed-P-6-2-6, Time: 10:00–12:00

This work presents a new strategy to perform diarization on high-variability data, such as multimedia broadcast content. This variability is highly noticeable among domains (inter-domain variability among chapters, shows, genres, etc.). Therefore, each domain requires its own specific model to obtain optimal results. We propose to adapt the PLDA models of our diarization system with in-domain unlabeled data. To do so, we estimate pseudo-speaker labels by unsupervised speaker clustering. This new method has been included in a PLDA-based diarization system and evaluated on the Multi-Genre Broadcast 2015 Challenge data. Given an audio recording, the system computes short-time i-vectors and clusters them using a variational Bayesian PLDA model with hidden labels. The proposed method improves by 25.41% relative w.r.t. the system without PLDA adaptation.

LSTM Neural Network-Based Speaker Segmentation Using Acoustic and Language Modelling

Miquel India, José A.R. Fonollosa, Javier Hernando; Universitat Politècnica de Catalunya, Spain
Wed-P-6-2-7, Time: 10:00–12:00

This paper presents a new speaker change detection system based on Long Short-Term Memory (LSTM) neural networks using acoustic data and linguistic content. Language modelling is combined with two different Joint Factor Analysis (JFA) acoustic approaches: i-vectors and speaker factors. Both of them are compared with a baseline algorithm that uses cosine distance to detect speaker turn changes. LSTM neural networks with both linguistic and acoustic features have been able to produce a robust speaker segmentation. The experimental results show that our proposal clearly outperforms the baseline system.

Acoustic Pairing of Original and Dubbed Voices in the Context of Video Game Localization

Adrien Gresse, Mickael Rouvier, Richard Dufour, Vincent Labatut, Jean-François Bonastre; LIA (EA 4128), France
Wed-P-6-2-8, Time: 10:00–12:00

The aim of this research work is the development of an automatic voice recommendation system for assisted voice casting. In this article, we propose preliminary work on acoustic pairing of original and dubbed voices. The voice segments are taken from a video game released in two different languages. The paired voice segments come from different languages but belong to the same video game character. Our wish is to exploit the relationship between a set of paired segments in order to model the perceptual aspects of a given character depending on the target language. We use a state-of-the-art approach in speaker recognition (i.e. based on the i-vector/PLDA paradigm). We first evaluate pairs of i-vectors using two different acoustic spaces, one for each of the targeted languages. Secondly, we perform a transformation in order to project the source-language i-vector into the target language. The results showed that this latter approach significantly improves accuracy. Finally, we challenge the system's ability to model the latent information that defines the video-game character independently of the speaker, the linguistic content and the language.

Homogeneity Measure Impact on Target and Non-Target Trials in Forensic Voice Comparison

Moez Ajili 1, Jean-François Bonastre 1, Waad Ben Kheder 1, Solange Rossato 2, Juliette Kahn 3; 1LIA (EA 4128), France; 2LIG (UMR 5217), France; 3LNE, France
Wed-P-6-2-9, Time: 10:00–12:00

It is common to see mobile recordings presented as a forensic trace in court. In such cases, a forensic expert is asked to analyze both the suspect's and the criminal's voice samples in order to determine the strength of evidence. This process is known as Forensic Voice Comparison (FVC). The likelihood ratio (LR) framework is commonly used by experts and quite often required by the “best practice guides” of experts' associations. Nevertheless, the LR suffers from some practical limitations due both to intrinsic aspects of its estimation process and to the information used during the FVC process. These aspects are embedded in a more general one, the lack of knowledge on FVC reliability. The question of reliability remains a major challenge, particularly for FVC systems where numerous variation factors like duration, noise, linguistic content or within-speaker variability are not taken into account. Recently, we proposed an information theory-based criterion able to estimate one of these factors, the homogeneity of information between the two sides of an FVC trial. Thanks to this new criterion, we wish to explore new aspects of homogeneity in this article. We question the impact of homogeneity on reliability separately for target and non-target trials. The study is performed using FABIOLE, a publicly available database dedicated to this kind of study, with a large number of recordings per target speaker. Our experiments report large differences in the impact of homogeneity between FVC genuine and impostor trials. These results clearly show the importance of intra-speaker variability effects in FVC reliability estimation. This study also confirms the value of the homogeneity measure for FVC reliability.

Null-Hypothesis LLR: A Proposal for Forensic Automatic Speaker Recognition

Yosef A. Solewicz 1, Michael Jessen 2, David van der Vloed 3; 1National Police, Israel; 2Bundeskriminalamt, Germany; 3Netherlands Forensic Institute, The Netherlands
Wed-P-6-2-10, Time: 10:00–12:00

A new method named Null-Hypothesis LLR (H0LLR) is proposed for forensic automatic speaker recognition. The method takes into account the fact that forensically realistic data are difficult to collect and that inter-individual variation is generally better represented than intra-individual variation. According to the proposal, intra-individual variation is modeled as a projection from case-customized inter-individual variation. Calibrated log Likelihood Ratios (LLR) that are calculated on the basis of the H0LLR method were tested on two corpora of forensically-founded telephone interception test sets, German-based GFS 2.0 and Dutch-based NFI-FRITS. Five automatic speaker recognition systems were tested based on the scores or the LLRs provided by these systems, which form the input to H0LLR. The speaker-discrimination and calibration performance of H0LLR is comparable to the performance indices of the system-internal LLR calculation methods. This shows that external data and strategies that work with data outside the forensic domain and without case customization are not necessary. It is also shown that H0LLR leads to a reduction in the diversity of LLR output patterns of different automatic systems. This is important for the credibility of the Likelihood Ratio framework in forensics, and its application in forensic automatic speaker recognition in particular.

The Opensesame NIST 2016 Speaker Recognition Evaluation System

Gang Liu 1, Qi Qian 1, Zhibin Wang 1, Qingen Zhao 1, Tianzhou Wang 1, Hao Li 1, Jian Xue 1, Shenghuo Zhu 1, Rong Jin 1, Tuo Zhao 2; 1Alibaba Group, USA; 2University of Missouri, USA
Wed-P-6-2-11, Time: 10:00–12:00

The last two decades have witnessed significant progress in speaker recognition, as evidenced by the improving performance in the speaker recognition evaluations (SRE) hosted by NIST. Despite this progress, little research has focused on speaker recognition under short-duration and language-mismatch conditions, which often lead to poor recognition performance. In NIST SRE2016, these concerns were first systematically investigated by the speaker recognition community. In this study, we address these challenges from the viewpoint of feature extraction and modeling. In particular, we improve the robustness of features by combining GMM and DNN based iVector extraction approaches, and improve the reliability of the back-end model by exploiting a symmetric SVM that can effectively leverage the unlabeled data. Finally, we introduce distance metric learning to improve the generalization capacity of the development data, which is usually of limited size. Then a fusion strategy is adopted to collectively boost the performance. The effectiveness of the proposed scheme for speaker recognition is demonstrated on SRE2016 evaluation data: compared with the DNN-iVector PLDA baseline system, our method yields 25.6% relative improvement in terms of min_Cprimary.

IITG-Indigo System for NIST 2016 SRE Challenge

Nagendra Kumar 1, Rohan Kumar Das 1, Sarfaraz Jelil 1, Dhanush B.K. 2, H. Kashyap 2, K. Sri Rama Murty 3, Sriram Ganapathy 2, Rohit Sinha 1, S.R. Mahadeva Prasanna 1; 1IIT Guwahati, India; 2Indian Institute of Science, India; 3IIT Hyderabad, India
Wed-P-6-2-12, Time: 10:00–12:00

This paper describes the speaker verification (SV) system submitted to the NIST 2016 speaker recognition evaluation (SRE) challenge by the Indian Institute of Technology Guwahati (IITG) under the fixed training condition task. Various SV systems were developed following idea-level collaboration with two other Indian institutions. Unlike the previous SREs, this time the focus was on developing an SV system using non-target language speech data and a small amount of unlabeled data from the target language/dialects. For addressing these novel challenges, we explored the fusion of systems created using different features, data conditioning, and classifiers. On the NIST 2016 SRE evaluation data, the presented fused system resulted in an actual detection cost function (actDCF) and equal error rate (EER) of 0.81 and 12.91%, respectively. Post-evaluation, we explored a recently proposed pairwise support vector machine classifier and applied adaptive S-norm to the decision scores before fusion. With these changes, the final system achieves an actDCF and EER of 0.67 and 11.63%, respectively.


Locally Weighted Linear Discriminant Analysis for Robust Speaker Verification

Abhinav Misra, Shivesh Ranjan, John H.L. Hansen; University of Texas at Dallas, USA
Wed-P-6-2-13, Time: 10:00–12:00

Channel compensation is an integral part of any state-of-the-art speaker recognition system. Typically, Linear Discriminant Analysis (LDA) is used to suppress directions containing channel information. LDA assumes a unimodal Gaussian distribution of the speaker samples to maximize the ratio of the between-speaker variance to the within-speaker variance. However, when speaker samples have multi-modal non-Gaussian distributions due to channel or noise distortions, LDA fails to provide optimal performance. In this study, we propose Locally Weighted Linear Discriminant Analysis (LWLDA). LWLDA computes the within-speaker scatter in a pairwise manner and then scales it by an affinity matrix so as to preserve the within-class local structure. This is in contrast to another recently proposed non-parametric discriminant analysis method called NDA. We show that LWLDA not only performs better than NDA but is also computationally much less expensive. Experiments are performed using the DARPA Robust Automatic Transcription of Speech (RATS) corpus. Results indicate that LWLDA consistently outperforms both LDA and NDA on all trial conditions.
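
An illustrative sketch only, not the authors' exact formulation: it forms a pairwise within-speaker scatter weighted by an RBF affinity matrix (the kernel width and the generalized eigen-solver are assumptions) and keeps the leading eigenvectors as the projection.

    import numpy as np
    from scipy.linalg import eigh

    def locally_weighted_lda(X, labels, n_components=10, sigma=1.0):
        """X: (N, D) embeddings; labels: (N,) speaker ids."""
        D = X.shape[1]
        mean = X.mean(axis=0)
        Sw = np.zeros((D, D))
        Sb = np.zeros((D, D))
        for spk in np.unique(labels):
            Xs = X[labels == spk]
            # Affinity-weighted pairwise within-speaker scatter.
            diff = Xs[:, None, :] - Xs[None, :, :]                 # (n, n, D)
            affinity = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2))
            Sw += np.einsum('ij,ijd,ije->de', affinity, diff, diff) / len(Xs)
            mu = Xs.mean(axis=0) - mean
            Sb += len(Xs) * np.outer(mu, mu)
        # Generalized eigenproblem Sb v = w Sw v; keep leading directions.
        Sw += 1e-6 * np.eye(D)                                      # regularize
        w, V = eigh(Sb, Sw)
        return V[:, np.argsort(w)[::-1][:n_components]]             # (D, n_components)

    # Toy usage: project 200 random 50-dim vectors from 10 "speakers".
    X = np.random.randn(200, 50)
    labels = np.repeat(np.arange(10), 20)
    X_proj = X @ locally_weighted_lda(X, labels, n_components=9)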

Recursive Whitening Transformation for Speaker Recognition on Language Mismatched Condition

Suwon Shon, Seongkyu Mun, Hanseok Ko; Korea University, Korea
Wed-P-6-2-14, Time: 10:00–12:00

Recently in speaker recognition, performance degradation due to channel-domain mismatched conditions has been actively addressed. However, the mismatches arising from language have yet to be sufficiently addressed. This paper proposes an approach which employs a recursive whitening transformation to mitigate the language-mismatched condition. The proposed method is based on multiple whitening transformations, which are intended to remove un-whitened residual components in the dataset associated with i-vector length normalization. The experiments were conducted on the Speaker Recognition Evaluation 2016 trials, of which the task is non-English speaker recognition using a development dataset consisting of both a large-scale out-of-domain (English) dataset and an extremely low-quantity in-domain (non-English) dataset. For performance comparison, we developed a state-of-the-art system using a deep neural network and bottleneck features, which is based on a phonetically aware model. The experimental results, along with other prior studies, validate the effectiveness of the proposed method on the language-mismatched condition.
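
A hedged sketch of the general idea of repeatedly whitening and length-normalizing i-vectors; the number of stages and the use of a single pooled dataset are assumptions, and the paper's exact recursion over in-domain and out-of-domain data may differ.

    import numpy as np

    def whitening_transform(X, eps=1e-8):
        """Return (mean, W) such that (X - mean) @ W has identity covariance."""
        mean = X.mean(axis=0)
        cov = np.cov(X - mean, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)
        W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
        return mean, W

    def length_normalize(X):
        return X / np.linalg.norm(X, axis=1, keepdims=True)

    def recursive_whitening(X, n_stages=2):
        """Alternate whitening and length normalization to remove residual
        un-whitened components left after a single pass."""
        transforms = []
        for _ in range(n_stages):
            mean, W = whitening_transform(X)
            X = length_normalize((X - mean) @ W)
            transforms.append((mean, W))
        return X, transforms

    # Toy usage with random 100-dim "i-vectors".
    X_white, transforms = recursive_whitening(np.random.randn(500, 100), n_stages=3)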

Wed-P-6-3 : Spoken Document Processing
Poster 3, 10:00–12:00, Wednesday, 23 Aug. 2017
Chair: Roland Kuhn

Query-by-Example Search with Discriminative Neural Acoustic Word Embeddings

Shane Settle 1, Keith Levin 2, Herman Kamper 1, Karen Livescu 1; 1TTIC, USA; 2Johns Hopkins University, USA
Wed-P-6-3-1, Time: 10:00–12:00

Query-by-example search often uses dynamic time warping (DTW) for comparing queries and proposed matching segments. Recent work has shown that comparing speech segments by representing them as fixed-dimensional vectors — acoustic word embeddings — and measuring their vector distance (e.g., cosine distance) can discriminate between words more accurately than DTW-based approaches. We consider an approach to query-by-example search that embeds both the query and database segments according to a neural model, followed by nearest-neighbor search to find the matching segments. Earlier work on embedding-based query-by-example, using template-based acoustic word embeddings, achieved competitive performance. We find that our embeddings, based on recurrent neural networks trained to optimize word discrimination, achieve substantial improvements in performance and run-time efficiency over the previous approaches.
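
A minimal sketch of the retrieval step described above; the embeddings themselves would come from the trained recurrent network, which is not shown, and all arrays here are random placeholders.

    import numpy as np

    def cosine_similarity(query_emb, db_embs):
        """query_emb: (D,); db_embs: (N, D); returns (N,) cosine similarities."""
        q = query_emb / np.linalg.norm(query_emb)
        db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
        return db @ q

    def query_by_example(query_emb, db_embs, top_k=5):
        """Nearest-neighbor search over pre-computed segment embeddings."""
        sims = cosine_similarity(query_emb, db_embs)
        ranked = np.argsort(sims)[::-1][:top_k]
        return ranked, sims[ranked]

    # Toy usage: 10000 database segments with 128-dim embeddings.
    db = np.random.randn(10000, 128)
    query = np.random.randn(128)
    indices, scores = query_by_example(query, db)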

Constructing Acoustic Distances Between Subwords and States Obtained from a Deep Neural Network for Spoken Term Detection

Daisuke Kaneko 1, Ryota Konno 1, Kazunori Kojima 1, Kazuyo Tanaka 2, Shi-wook Lee 3, Yoshiaki Itoh 1; 1Iwate Prefectural University, Japan; 2University of Tsukuba, Japan; 3AIST, Japan
Wed-P-6-3-2, Time: 10:00–12:00

The detection of out-of-vocabulary (OOV) query terms is a crucial problem in spoken term detection (STD), because OOV query terms are likely to occur. To enable the search of OOV query terms in STD systems, a query subword sequence is compared with subword sequences generated by an automatic speech recognizer from the spoken documents. When comparing two subword sequences, the edit distance is a typical distance between any two subwords. We previously proposed an acoustic distance defined from statistics between states of the hidden Markov model (HMM) and showed its effectiveness in STD [4]. This paper proposes an acoustic distance between subwords and HMM states where the posterior probabilities output by a deep neural network are used to improve the STD accuracy for OOV query terms. Experiments are conducted to evaluate the performance of the proposed method, using the open test collections for the “Spoken&Doc” tasks of the NTCIR-9 [13] and NTCIR-10 [14] workshops. The proposed method shows improvements in mean average precision.

Fast and Accurate OOV Decoder on High-Level Features

Yuri Khokhlov, Natalia Tomashenko, Ivan Medennikov, Aleksei Romanenko; STC-innovations, Russia
Wed-P-6-3-3, Time: 10:00–12:00

This work proposes a novel approach to the out-of-vocabulary (OOV) keyword search (KWS) task. The proposed approach is based on using high-level features from an automatic speech recognition (ASR) system, so-called phoneme posterior based (PPB) features, for decoding. These features are obtained by calculating time-dependent phoneme posterior probabilities from word lattices, followed by their smoothing. For the PPB features we developed a novel, very fast, simple and efficient OOV decoder. Experimental results are presented on the Georgian language from the IARPA Babel Program, which was the test language in the OpenKWS 2016 evaluation campaign. The results show that, in terms of the maximum term weighted value (MTWV) metric and computational speed, for single ASR systems the proposed approach significantly outperforms the state-of-the-art approach based on using in-vocabulary proxies for OOV keywords in the indexed database. The comparison of the two OOV KWS approaches on the fusion results of nine different ASR systems demonstrates that the proposed OOV decoder outperforms the proxy-based approach in terms of the MTWV metric at comparable processing speed. Other important advantages of the OOV decoder include extremely low memory consumption and the simplicity of its implementation and parameter optimization.


Exploring the Use of Significant Words Language Modeling for Spoken Document Retrieval

Ying-Wen Chen 1, Kuan-Yu Chen 2, Hsin-Min Wang 2, Berlin Chen 1; 1National Taiwan Normal University, Taiwan; 2Academia Sinica, Taiwan
Wed-P-6-3-4, Time: 10:00–12:00

Owing to the rapid global access to tremendous amounts of multimedia associated with speech information on the Internet, spoken document retrieval (SDR) has recently become an emerging application. Apart from much effort devoted to developing robust indexing and modeling techniques for spoken documents, a recent line of research targets enriching and reformulating query representations in an attempt to enhance retrieval effectiveness. In practice, pseudo-relevance feedback is by far the most prevalent paradigm for query reformulation, which assumes that top-ranked feedback documents obtained from the initial round of retrieval are potentially relevant and can be exploited to reformulate the original query. Continuing this line of research, this paper presents a novel modeling framework, which aims at discovering significant words occurring in the feedback documents, to infer an enhanced query language model for SDR. Formally, the proposed framework aims to extract the essential words representing a common notion of relevance (i.e., the significant words which occur in almost all of the feedback documents), so as to deduce a new query language model that captures these significant words and meanwhile modulates the influence of both highly frequent words and overly specific words. Experiments conducted on a benchmark SDR task demonstrate the performance merits of our proposed framework.

Incorporating Acoustic Features for Spontaneous Speech Driven Content Retrieval

Hiroto Tasaki, Tomoyosi Akiba; Toyohashi University of Technology, Japan
Wed-P-6-3-5, Time: 10:00–12:00

A speech-driven information retrieval system is expected to be useful for gathering information with greater ease. In a conventional system, users have to decide on the contents of their utterance before speaking, which takes quite a long time when their request is complicated. To overcome that problem, the retrieval system is required to handle a spontaneously spoken query directly. In this work, we propose an extension of spoken content retrieval (SCR) for effectively using spontaneously spoken queries. Acoustic features of terms that are meaningful for the retrieval may be prominent compared to other terms, and those terms will also have linguistic specificity. From this assumption, we predict the contribution of terms included in spontaneously spoken queries using acoustic and linguistic features, and incorporate it into the query likelihood model (QLM), a probabilistic retrieval model. We verified the effectiveness of the proposed method through experiments. Our proposed method was successful in improving retrieval performance under various conditions.

Order-Preserving Abstractive Summarization for Spoken Content Based on Connectionist Temporal Classification

Bo-Ru Lu, Frank Shyu, Yun-Nung Chen, Hung-Yi Lee, Lin-Shan Lee; National Taiwan University, Taiwan
Wed-P-6-3-6, Time: 10:00–12:00

Connectionist temporal classification (CTC) is a powerful approach for sequence-to-sequence learning, and has been popularly used in speech recognition. The central ideas of CTC include adding a label “blank” during training. With this mechanism, CTC eliminates the need for segment alignment, and hence has been applied to various sequence-to-sequence learning problems. In this work, we applied CTC to abstractive summarization for spoken content. The “blank” in this case implies the corresponding input data are less important or noisy and thus can be ignored. This approach was shown to outperform the existing methods in terms of ROUGE scores over the Chinese Giga-word and MATBN corpora. This approach also has the nice property that the ordering of words or characters in the input documents can be better preserved in the generated summaries.

Automatic Alignment Between Classroom Lecture Utterances and Slide Components

Masatoshi Tsuchiya, Ryo Minamiguchi; Toyohashi University of Technology, Japan
Wed-P-6-3-7, Time: 10:00–12:00

Multimodal alignment between classroom lecture utterances and lecture slide components is one of the crucial problems in realizing a multimodal e-Learning application. This paper proposes a new method for automatic alignment, formulating the alignment as an integer linear programming (ILP) problem that maximizes a score function consisting of three factors: the similarity score between utterances and slide components, the consistency of the explanation order, and the explanation coverage of slide components. The experimental result on the Corpus of Japanese classroom Lecture Contents (CJLC) shows that the automatic alignment information acquired by the proposed method is effective in improving the performance of the automatic extraction of important utterances.

Compensating Gender Variability in Query-by-Example Search on Speech Using Voice Conversion

Paula Lopez-Otero, Laura Docio-Fernandez, Carmen Garcia-Mateo; Universidade de Vigo, Spain
Wed-P-6-3-8, Time: 10:00–12:00

The huge amount of available spoken documents has raised the need for tools to perform automatic searches within large audio databases. These collections usually consist of documents with great variability regarding speaker, language or recording channel, among others. Reducing this variability would boost the performance of query-by-example search on speech systems, especially in zero-resource systems that use acoustic features for audio representation. Hence, in this work, a technique to compensate the variability caused by speaker gender is proposed. Given a data collection composed of documents spoken by both male and female voices, every time a spoken query has to be searched, an alternative version of the query in the opposite gender is generated using voice conversion. After that, the female version of the query is used to search within documents spoken by females and vice versa. Experimental validation of the proposed strategy shows an improvement in search on speech performance caused by the reduction of gender variability.

Zero-Shot Learning Across Heterogeneous Overlapping Domains

Anjishnu Kumar 1, Pavankumar Reddy Muddireddy 2, Markus Dreyer 1, Björn Hoffmeister 1; 1Amazon.com, USA; 2University of Illinois at Urbana-Champaign, USA
Wed-P-6-3-9, Time: 10:00–12:00

We present a zero-shot learning approach for text classification, predicting which natural language understanding domain can handle a given utterance. Our approach can predict domains at runtime that did not exist at training time. We achieve this extensibility by learning to project utterances and domains into the same embedding space while generating each domain-specific embedding from a set of attributes that characterize the domain. Our model is a neural network trained via ranking loss. We evaluate the performance of this zero-shot approach on a subset of a virtual assistant's third-party domains and show the effectiveness of the technique on new domains not observed during training. We compare to generative baselines and show that our approach requires less storage and performs better on new domains.
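
A toy, hedged sketch of the inference-time idea: each domain embedding is generated from the attributes that characterize it, and an utterance is routed to the domain with the highest cosine score. In the real system both mappings are learned neural networks trained with a ranking loss; here random vectors stand in for them and every name is invented.

    import numpy as np

    rng = np.random.default_rng(0)
    EMB_DIM = 32

    # Stand-ins for learned parameters (attribute embeddings, utterance encoder).
    attribute_vocab = ["music", "weather", "shopping", "timer", "news"]
    attribute_emb = {a: rng.standard_normal(EMB_DIM) for a in attribute_vocab}

    def embed_domain(attributes):
        """Domain embedding generated from the attributes that characterize it."""
        v = np.stack([attribute_emb[a] for a in attributes]).mean(axis=0)
        return v / np.linalg.norm(v)

    def embed_utterance(utterance):
        """Placeholder utterance encoder (a trained network in the real system)."""
        h = rng.standard_normal(EMB_DIM)
        return h / np.linalg.norm(h)

    def predict_domain(utterance, domains):
        """Route the utterance to the most similar domain; an unseen domain only
        needs its attribute list, no retraining."""
        u = embed_utterance(utterance)
        scores = {name: float(embed_domain(attrs) @ u) for name, attrs in domains.items()}
        return max(scores, key=scores.get), scores

    domains = {"PlayTunes": ["music"], "WeatherNow": ["weather", "news"]}
    best, scores = predict_domain("play some jazz", domains)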

Hierarchical Recurrent Neural Network for Story Segmentation

Emiru Tsunoo, Peter Bell, Steve Renals; University of Edinburgh, UK
Wed-P-6-3-10, Time: 10:00–12:00

A broadcast news stream consists of a number of stories and each story consists of several sentences. We capture this structure using a hierarchical model based on a word-level Recurrent Neural Network (RNN) sentence modeling layer and a sentence-level bidirectional Long Short-Term Memory (LSTM) topic modeling layer. First, the word-level RNN layer extracts a vector embedding the sentence information from the given transcribed lexical tokens of each sentence. These sentence embedding vectors are fed into a bidirectional LSTM that models the sentence and topic transitions. A topic posterior for each sentence is estimated discriminatively and a Hidden Markov model (HMM) follows to decode the story sequence and identify story boundaries. Experiments on the topic detection and tracking (TDT2) task indicate that the hierarchical RNN topic modeling achieves the best story segmentation performance with a higher F1-measure compared to conventional state-of-the-art methods. We also compare variations of our model to infer the optimal structure for the story segmentation task.

Evaluating Automatic Topic Segmentation as a Segment Retrieval Task

Abdessalam Bouchekif 1, Delphine Charlet 2, Géraldine Damnati 2, Nathalie Camelin 1, Yannick Estève 1; 1LIUM (EA 4023), France; 2Orange Labs, France
Wed-P-6-3-11, Time: 10:00–12:00

Several evaluation metrics have been proposed for topic segmentation. Most of them rely on the paradigm that segmentation is mainly a task that detects boundaries, and thus are oriented towards boundary detection evaluation. Nevertheless, this paradigm is not appropriate for obtaining homogeneous chapters, which is one of the major applications of topic segmentation. For instance, on Broadcast News, topic segmentation enables users to watch a chapter independently of the others.

We propose to consider segmentation as a task that detects homogeneous segments, and we propose evaluation metrics oriented towards segment retrieval. The proposed metrics are tested on various TV shows from different channels. Results are analysed and discussed, highlighting their relevance.

Improving Speech Recognizers by Refining Broadcast Data with Inaccurate Subtitle Timestamps

Jeong-Uk Bang 1, Mu-Yeol Choi 2, Sang-Hun Kim 2, Oh-Wook Kwon 1; 1Chungbuk National University, Korea; 2ETRI, Korea
Wed-P-6-3-12, Time: 10:00–12:00

This paper proposes an automatic method to refine broadcast data collected every week for efficient acoustic model training. For training acoustic models, we use only audio signals, subtitle texts, and subtitle timestamps accompanying recorded broadcast programs. However, the subtitle timestamps are often inaccurate due to inherent characteristics of closed captioning. In the proposed method, we remove subtitle texts with a low subtitle quality index, concatenate adjacent subtitle texts into a merged subtitle text, and correct the timestamp of the merged subtitle text by adding a margin. Then, a speech recognizer is used to obtain a hypothesis text from the speech segment corresponding to the merged subtitle text. Finally, the refined speech segments to be used for acoustic model training are generated by selecting the subparts of the merged subtitle text that match the hypothesis text. It is shown that acoustic models trained using the refined broadcast data give significantly higher speech recognition accuracy than those trained using raw broadcast data. Consequently, the proposed method can efficiently refine a large amount of broadcast data with inaccurate timestamps in about half the time, compared with previous approaches.

A Relevance Score Estimation for Spoken Term Detection Based on RNN-Generated Pronunciation Embeddings

Jan Švec 1, Josef V. Psutka 1, Luboš Šmídl 1, Jan Trmal 2; 1University of West Bohemia, Czech Republic; 2Johns Hopkins University, USA
Wed-P-6-3-13, Time: 10:00–12:00

In this paper, we present a novel method for term score estimation. The method is primarily designed for scoring out-of-vocabulary terms, but it can also estimate scores for in-vocabulary results. The term score is computed as a cosine distance of two pronunciation embeddings. The first one is generated from the grapheme representation of the searched term, while the second one is computed from the recognized phoneme confusion network. The embeddings are generated by a specifically trained recurrent neural network built on the idea of Siamese neural networks. The RNN is trained from recognition results at the word and phone level in an unsupervised fashion without the need for any hand-labeled data. The method is evaluated on the MALACH data in two languages, English and Czech. The results are compared with two baseline methods for OOV term detection.

Wed-P-6-4 : Speech Intelligibility
Poster 4, 10:00–12:00, Wednesday, 23 Aug. 2017
Chair: Preeti Rao

Predicting Automatic Speech Recognition Performance Over Communication Channels from Instrumental Speech Quality and Intelligibility Scores

Laura Fernández Gallardo 1, Sebastian Möller 1, John Beerends 2; 1T-Labs, Germany; 2TNO, The Netherlands
Wed-P-6-4-1, Time: 10:00–12:00

The performance of automatic speech recognition based on coded-decoded speech heavily depends on the quality of the transmitted signals, determined by channel impairments. This paper examines relationships between speech recognition performance and measurements of speech quality and intelligibility over transmission channels. In contrast to previous studies, the effects of super-wideband transmissions are analyzed and compared to those of wideband and narrowband channels. Furthermore, intelligibility scores, gathered by conducting a listening test based on logatomes, are also considered for the prediction of automatic speech recognition results. The modern instrumental measurement techniques POLQA and POLQA-based intelligibility have been applied to estimate the quality and the intelligibility of transmitted speech, respectively. Based on our results, polynomial models are proposed that permit the prediction of speech recognition accuracy from the subjective and instrumental measures, involving a number of channel distortions in the three bandwidths. This approach can save the costs of performing automatic speech recognition experiments and can be seen as a first step towards a useful tool for communication channel designers.
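
A hedged sketch of the kind of polynomial mapping proposed above, using numpy.polyfit to relate an instrumental quality score to ASR accuracy; the polynomial degree, the toy data points and the variable names are placeholders, not the paper's fitted models.

    import numpy as np

    # Toy paired observations: instrumental quality scores (e.g. POLQA MOS-LQO)
    # and measured ASR word accuracy (%) over a set of degraded channels.
    quality_scores = np.array([1.2, 1.8, 2.5, 3.1, 3.7, 4.2, 4.6])
    asr_accuracy   = np.array([41.0, 55.0, 68.0, 77.0, 84.0, 89.0, 91.5])

    # Fit a low-order polynomial model: accuracy ~ p(quality).
    coeffs = np.polyfit(quality_scores, asr_accuracy, deg=2)
    model = np.poly1d(coeffs)

    # Predict ASR accuracy for an unseen channel condition from its quality score alone.
    print(model(3.4))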

Speech Intelligibility in Cars: The Effect of Speaking Style, Noise and Listener Age

Cassia Valentini Botinhao, Junichi Yamagishi; University of Edinburgh, UK
Wed-P-6-4-2, Time: 10:00–12:00

The intelligibility of speech in noise becomes lower as the listener's age increases, even when no apparent hearing impairment is present. The losses are, however, different depending on the nature of the noise and the characteristics of the voice. In this paper we investigate the effect that age, noise type and speaking style have on the intelligibility of speech reproduced by car loudspeakers. Using a binaural mannequin we recorded a variety of voices and speaking styles played from the audio system of a car while driving in different conditions. We used this material to create a listening test where participants were asked to transcribe what they could hear, and we recruited groups of young and older adults to take part in it. We found that intelligibility scores of older participants were lower for the competing speaker and background music conditions. Results also indicate that clear and Lombard speech was more intelligible than plain speech for both age groups. A mixed effects model revealed that the largest effect was the noise condition, followed by sentence type, speaking style, voice, age group and pure tone average.

Predicting Speech Intelligibility Using a Gammachirp Envelope Distortion Index Based on the Signal-to-Distortion Ratio

Katsuhiko Yamamoto 1, Toshio Irino 1, Toshie Matsui 1, Shoko Araki 2, Keisuke Kinoshita 2, Tomohiro Nakatani 2; 1Wakayama University, Japan; 2NTT, Japan
Wed-P-6-4-3, Time: 10:00–12:00

A new intelligibility prediction measure, called the “Gammachirp Envelope Distortion Index (GEDI)”, is proposed for the evaluation of speech enhancement algorithms. This model calculates the signal-to-distortion ratio in the envelope responses, SDRenv, derived from the gammachirp filterbank outputs of clean and enhanced speech, and is an extension of the speech-based envelope power spectrum model (sEPSM) to improve prediction and usability. An evaluation was performed by comparing human subjective results and model predictions for the speech intelligibility of noise-reduced sounds processed by spectral subtraction and a recent Wiener filtering technique. The proposed GEDI predicted the subjective results of the Wiener filtering better than those predicted by the original sEPSM and well-known conventional measures, i.e., STOI, CSII, and HASPI.
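
A simplified, single-band illustration of an envelope signal-to-distortion ratio of the kind GEDI builds on; the real measure uses a gammachirp auditory filterbank and modulation-domain analysis, both omitted here, and the Hilbert-transform envelope is an assumption.

    import numpy as np
    from scipy.signal import hilbert

    def envelope(x):
        """Amplitude envelope via the analytic signal (a simplification)."""
        return np.abs(hilbert(x))

    def envelope_sdr_db(clean, enhanced):
        """10*log10 of envelope energy over envelope distortion energy."""
        e_clean = envelope(clean)
        e_enh = envelope(enhanced)
        distortion = e_clean - e_enh
        return 10.0 * np.log10(np.sum(e_clean ** 2) / (np.sum(distortion ** 2) + 1e-12))

    # Toy usage: a modulated tone as "clean" speech and a noisy version as "enhanced".
    fs = 16000
    t = np.arange(0, 1.0, 1.0 / fs)
    clean = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 440 * t)
    enhanced = clean + 0.1 * np.random.randn(len(t))
    print(envelope_sdr_db(clean, enhanced))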

Intelligibilities of Mandarin Chinese Sentences with Spectral “Holes”

Yafan Chen, Yong Xu, Jun Yang; Chinese Academy of Sciences, China
Wed-P-6-4-4, Time: 10:00–12:00

The speech intelligibility of Mandarin Chinese sentences across various spectral regions, under band-stop conditions (one or two “holes” in the spectrum), was investigated through subjective listening tests. Results demonstrated significant effects on Mandarin Chinese sentence intelligibility when a single spectral hole or a pair of spectral holes was introduced. Meanwhile, the results revealed the importance of the first and second formant (F1, F2) frequencies for the comprehension of Mandarin sentences. More importantly, the first formant frequencies played a more primary role than those of the second formants. Sentence intelligibility declined markedly when F1 frequencies were missing, but the effect became small when the spectral holes covered more than 50% of the F1 frequencies, at which point F2 frequencies played the major role in the intelligibility of Mandarin sentences.

The Effect of Situation-Specific Non-Speech Acoustic Cues on the Intelligibility of Speech in Noise

Lauren Ward, Ben Shirley, Yan Tang, William J. Davies; University of Salford, UK
Wed-P-6-4-5, Time: 10:00–12:00

In everyday life, speech is often accompanied by a situation-specific acoustic cue; a hungry bark as you ask ‘Has anyone fed the dog?’. This paper investigates the effect such cues have on speech intelligibility in noise and evaluates their interaction with the established effect of situation-specific semantic cues. This work is motivated by the introduction of new object-based broadcast formats, which have the potential to optimise intelligibility by controlling the level of individual broadcast audio elements at the point of service. Results of this study show that situation-specific acoustic cues alone can improve word recognition in multi-talker babble by 69.5%, a similar amount to semantic cues. The combination of both semantic and acoustic cues provides a further improvement of 106.0% compared with no cues, and 18.7% compared with semantic cues only. Interestingly, whilst increasing the subjective intelligibility of the target word, the presence of acoustic cues degraded the objective intelligibility of the speech-based semantic cues by 47.0% (equivalent to reducing the speech level by 4.5 dB). This paper discusses the interactions between the two types of cues and the implications that these results have for assessing and improving the intelligibility of broadcast speech.

On the Use of Band Importance Weighting in the Short-Time Objective Intelligibility Measure

Asger Heidemann Andersen 1, Jan Mark de Haan 2, Zheng-Hua Tan 1, Jesper Jensen 1; 1Aalborg University, Denmark; 2Oticon, Denmark
Wed-P-6-4-6, Time: 10:00–12:00

Speech intelligibility prediction methods are popular tools within the speech processing community for the objective evaluation of the speech intelligibility of e.g. enhanced speech. The Short-Time Objective Intelligibility (STOI) measure has become widely used due to its simplicity and high prediction accuracy. In this paper we investigate the use of Band Importance Functions (BIFs) in the STOI measure, i.e. of unequally weighting the contribution of speech information from each frequency band. We do so by fitting BIFs to several datasets of measured intelligibility and cross-evaluating the prediction performance. Our findings indicate that it is possible to improve prediction performance in specific situations. However, it has not been possible to find BIFs which systematically improve prediction performance beyond the data used for fitting. In other words, we find no evidence that the performance of the STOI measure can be improved considerably by extending it with a non-uniform BIF.
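
A hedged sketch of the weighting idea under study: given per-band, per-frame intermediate intelligibility values of the kind STOI computes (a random placeholder here), a band importance function replaces the uniform average; the weights shown are illustrative, not a published BIF.

    import numpy as np

    def weighted_intelligibility(d, band_importance=None):
        """d: (n_bands, n_frames) intermediate correlation values in [0, 1].
        Uniform averaging corresponds to the standard STOI-style combination;
        a band importance function (BIF) re-weights the per-band contributions."""
        n_bands = d.shape[0]
        if band_importance is None:
            w = np.full(n_bands, 1.0 / n_bands)          # uniform weighting
        else:
            w = np.asarray(band_importance, dtype=float)
            w = w / w.sum()                              # normalise the BIF
        return float(np.sum(w[:, None] * d) / d.shape[1])

    # Toy usage: 15 one-third-octave bands, 300 frames, with a BIF that
    # emphasises mid-frequency bands (purely illustrative values).
    d = np.random.rand(15, 300)
    bif = np.concatenate([np.linspace(0.5, 1.5, 8), np.linspace(1.5, 0.7, 7)])
    print(weighted_intelligibility(d), weighted_intelligibility(d, bif))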

Listening in the Dips: Comparing Relevant Features for Speech Recognition in Humans and Machines

Constantin Spille, Bernd T. Meyer; Carl von Ossietzky Universität Oldenburg, Germany
Wed-P-6-4-7, Time: 10:00–12:00

In recent years, automatic speech recognition (ASR) systems have gradually decreased (and for some tasks closed) the gap between human and automatic speech recognition. However, it is unclear whether similar performance implies that humans and ASR systems rely on similar signal cues. In the current study, ASR and HSR are compared using speech material from a matrix sentence test mixed with either a stationary speech-shaped noise (SSN) or amplitude-modulated SSN. Recognition performance of HSR and ASR is measured in terms of the speech recognition threshold (SRT), i.e., the signal-to-noise ratio with a 50% recognition rate, and by comparing psychometric functions. ASR results are obtained with matched-trained DNN-based systems that use FBank features as input and are compared to results obtained from eight normal-hearing listeners and two established models of speech intelligibility. For both maskers, HSR and ASR achieve similar SRTs with an average deviation of only 0.4 dB. A relevance propagation algorithm is applied to identify features relevant for ASR. The analysis shows that relevant features coincide either with spectral peaks of the speech signal or with dips of the noise masker, indicating that similar cues are important in HSR and ASR.

Wed-P-7-2 : Articulatory and Acoustic Phonetics
Poster 2, 13:30–15:30, Wednesday, 23 Aug. 2017
Chair: Rachid Ridouane

Mental Representation of Japanese Mora; Focusing on its Intrinsic Duration

Kosuke Sugai; Kindai University, Japan
Wed-P-7-2-1, Time: 13:30–15:30

Japanese is one of the typical languages in which vowel quantity plays a key role. In Japanese, a phonological structure called “mora” is a fundamental rhythmic unit, and theoretically, each mora is supposed to have a similar duration (isochronicity). The rhythm of a native language has great importance for spoken language processing, including second language speaking; therefore, in order to get a clear picture of bottom-up speech processing, it is crucial to discern how morae are mentally represented. Various studies have been conducted to understand the nature of speech processing as a cognitive construct; however, most of this research was conducted with the target stimuli embedded in words or carrier sentences to clarify specifically the relative duration of morae. In this study, two reaction-time experiments were conducted to investigate whether morae are mentally represented and how long their duration is. The isolated vowels /i/, /e/, /a/, /o/, /u/ and the syllable /tan/ were chosen as target stimuli, and the first morae were digitally manipulated into 15 durations in 20 ms steps, from 150 ms to 330 ms. The results revealed the existence of a durational threshold between one and two morae, ranging around 250 ms.

Temporal Dynamics of Lateral Channel Formation in /l/: 3D EMA Data from Australian English

Jia Ying 1, Christopher Carignan 1, Jason A. Shaw 2, Michael Proctor 3, Donald Derrick 4, Catherine T. Best 1; 1Western Sydney University, Australia; 2Yale University, USA; 3University of Canterbury, New Zealand; 4Macquarie University, Australia
Wed-P-7-2-2, Time: 13:30–15:30

This study investigated the dynamics of lateral channel formation of /l/ in Australian-accented English (AusE) using 3D electromagnetic articulography (EMA). Coils were placed on the tongue both mid-sagittally and para-sagittally. We varied the vowel preceding /l/ between /I/ and /æ/, e.g., filbert vs. talbot, and the syllable position of /l/, e.g., /'tæl.b@t/ vs. /'tæb.l@t/. The articulatory analyses of lateral /l/ show that: (1) the mid-sagittal delay (from the tongue tip gesture to the tongue middle/tongue back gesture) changes across different syllable positions and vowel contexts; (2) the para-sagittal lateralization duration remains the same across syllable positions and vowel contexts; (3) the lateral formation reaches its peak earlier than the mid-sagittal gesture peak; (4) the magnitude of tongue asymmetrical lateralization is greater than the magnitude of tongue curvature in the coronal plane. We discuss these results in light of the temporal dynamics of lateral channel formation. We interpret our results as evidence that the formation of the lateral channel is the primary goal of /l/ production.

Vowel and Consonant Sequences in Three Bavarian Dialects of Austria

Nicola Klingler 1, Sylvia Moosmüller 1, Hannes Scheutz 2; 1ÖAW, Austria; 2Universität Salzburg, Austria
Wed-P-7-2-3, Time: 13:30–15:30

In 1913, Anton Pfalz described a specific relation between vowel and consonant sequences for East Middle Bavarian dialects, located in the eastern parts of Austria. According to his observations, a long vowel is always followed by a lenis consonant, and a short vowel is always followed by a fortis consonant. Consequently, vowel duration depends on the quality of the following consonant. Phonetic examinations of what came to be known as Pfalz's Law yielded different results. Specifically, the occurrence of a third category, namely a long vowel followed by a fortis consonant, seems to be firmly embedded in East Middle Bavarian.

Up till now, phonetic examinations have concentrated on CVCV sequences. The analysis of monosyllables and of sequences including consonant clusters has been largely neglected so far. In the present contribution, we analyse the impact of initial and final consonant clusters in monosyllables on the assumed relationship of vowel + consonant sequences. To this end, we included 18 speakers from three Bavarian varieties. The results show that in all examined varieties long vowel + fortis consonant sequences occur and that cluster complexity has no influence on the absolute vowel duration, contradicting Pfalz's Law.

Acoustic Cues to the Singleton-Geminate Contrast: The Case of Libyan Arabic Sonorants

Amel Issa; University of Leeds, UK
Wed-P-7-2-4, Time: 13:30–15:30

This study examines the acoustic correlates of singleton and geminate consonants in Tripolitanian Libyan Arabic (TLA). Several measurements were obtained, including target segment duration, preceding vowel duration, RMS amplitude for the singleton and geminate consonants, and F1, F2 and F3 for the target consonants. The results confirm that the primary acoustic correlate that distinguishes singletons from geminates in TLA is duration, regardless of sound type, with the ratio of C to CC being 1 to 2.42. The duration of the preceding vowels is suggestive and may be considered another cue to the distinction between them. There was no evidence of differences in RMS amplitude between singleton and geminate consonants of any type. F1, F2 and F3 frequencies are found to show similar patterns for singleton and geminate consonants for all sound types, suggesting no gestural effects of gemination in TLA. Preliminary results from the phonetic cues investigated here suggest that the acoustic distinction between singleton and geminate consonants in TLA depends mainly on durational correlates.


Mel-Cepstral Distortion of German Vowels in Different Information Density Contexts

Erika Brandt, Frank Zimmerer, Bistra Andreeva, Bernd Möbius; Universität des Saarlandes, Germany
Wed-P-7-2-5, Time: 13:30–15:30

This study investigated whether German vowels differ significantly from each other in mel-cepstral distortion (MCD) when they stand in different information density (ID) contexts. We hypothesized that vowels in the same ID contexts are more similar to each other than vowels that stand in different ID conditions. Read speech material from PhonDat2 of 16 German natives (m = 10, f = 6) was analyzed. Bi-phone and word language models were calculated based on DeWaC. To account for additional variability in the data, prosodic factors as well as corpus-specific frequency values were also entered into the statistical models. Results showed that vowels in different ID conditions were significantly different in their MCD values. Unigram word probability and corpus-specific word frequency showed the expected effect on vowel similarity, with a hierarchy between non-contrasting and contrasting conditions. However, these did not form a homogeneous group, since there were group-internal significant differences. The largest distances were found between vowels produced at a fast speech rate and between unstressed vowels.
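
For reference, a hedged sketch of a standard mel-cepstral distortion computation between two time-aligned mel-cepstral sequences; the 10/ln 10 scaling and the exclusion of the 0th coefficient follow common practice, and the paper's exact configuration may differ.

    import numpy as np

    def mel_cepstral_distortion(mcep_a, mcep_b, exclude_c0=True):
        """mcep_a, mcep_b: (n_frames, n_coeffs) time-aligned mel-cepstra.
        Returns the mean MCD in dB: (10/ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)."""
        a, b = np.asarray(mcep_a, dtype=float), np.asarray(mcep_b, dtype=float)
        if exclude_c0:                       # the energy coefficient is usually excluded
            a, b = a[:, 1:], b[:, 1:]
        diff = a - b
        per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
        return float(per_frame.mean())

    # Toy usage: two random 25-coefficient mel-cepstral tracks of 100 frames each.
    print(mel_cepstral_distortion(np.random.randn(100, 25), np.random.randn(100, 25)))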

Effect of Formant and F0 Discontinuity on Perceived Vowel Duration: Impacts for Concatenative Speech Synthesis

Tomáš Boril, Pavel Šturm, Radek Skarnitzl, Jan Volín; Charles University, Czech Republic
Wed-P-7-2-6, Time: 13:30–15:30

Unit selection systems of speech synthesis offer good overall quality, but this may be countervailed by a sporadic and unpredictable occurrence of audible artifacts, such as discontinuities in F0 and the spectrum. Informal observations suggested that such breaks may have an effect on perceived vowel duration. This study therefore investigates the effect of F0 and formant discontinuities on the perceived duration of vowels in Czech synthetic speech. Ten manipulations of F0, F1 and F2 were performed on target vowels in short synthesized phrases, creating abrupt breaks in the contours at the midpoint of the vowels. Listeners decided in a 2AFC task in which phrase the last syllable was longer. The results showed that despite the identical duration of the compared stimuli, vowels which were manipulated in the second part towards centralized values (i.e., less peripheral) were systematically considered shorter by the listeners than stimuli without such discontinuities, and vice versa. However, the influence seems to be distinct from an overall formant change (without a discontinuity), since a control stimulus in which the manipulation was performed over the entire vowel was not perceived as significantly shorter or longer. No effect of F0 manipulations was observed.

An Ultrasound Study of Alveolar and Retroflex Consonants in Arrernte: Stressed and Unstressed Syllables

Marija Tabain 1, Richard Beare 2; 1La Trobe University, Australia; 2Monash University, Australia
Wed-P-7-2-7, Time: 13:30–15:30

This study presents ultrasound data from six female speakers of the Central Australian language Arrernte. We focus on the apical stop contrast, alveolar /t/ versus retroflex /ʈ/, which may be considered phonemically marginal. We compare these sounds in stressed and unstressed position. Consistent with previous results on this apical contrast, we show that there are minimal differences between the retroflex and the alveolar at stop offset; however, at stop onset, the retroflex has a higher front portion of the tongue, and often a more forward posterior portion of the tongue. This difference between the alveolar and the retroflex is particularly marked in the unstressed prosodic context. This result confirms our previous EPG and EMA results from two of the speakers in the present study, which showed that the most prototypical retroflex consonant occurs in the unstressed prosodic position.

Reshaping the Transformed LF Model: Generating the Glottal Source from the Waveshape Parameter Rd

Christer Gobl; Trinity College Dublin, Ireland
Wed-P-7-2-8, Time: 13:30–15:30

Precise specification of the voice source would facilitate better modelling of expressive nuances in human spoken interaction. This paper focuses on the transformed version of the widely used LF voice source model, and proposes an algorithm which makes it possible to use the waveshape parameter Rd to directly control the LF pulse, for more effective analysis and synthesis of voice modulations. The Rd parameter, capturing much of the natural covariation between glottal parameters, is central to the transformed LF model. It is used to predict the standard R-parameters, which in turn are used to synthesise the LF waveform. However, the LF pulse that results from these predictions may have an Rd value noticeably different from the specified Rd, yielding undesirable artefacts, particularly when the model is used for detailed analysis and synthesis of non-modal voice. A further limitation is that only a subset of possible Rd values can be used, to avoid conflicting LF parameter settings. To eliminate these problems, a new iterative algorithm was developed based on the Newton-Raphson method for two variables, but modified to include constraints. This ensures that the correct Rd is always obtained and that the algorithm converges for effectively all permissible Rd values.
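
The abstract does not reproduce the LF-specific equations, but its numerical core, a two-variable Newton-Raphson iteration with constraints, can be sketched generically as follows (the clipping-based constraint handling and all names are illustrative assumptions, not the author's implementation):

    import numpy as np

    def newton_2d_constrained(f, jac, x0, lower, upper, tol=1e-9, max_iter=100):
        # Solve f(x) = 0 for a 2-D vector x, keeping x inside the box
        # [lower, upper] by clipping after every Newton step.
        x = np.clip(np.asarray(x0, dtype=float), lower, upper)
        for _ in range(max_iter):
            fx = f(x)
            if np.linalg.norm(fx) < tol:
                break
            step = np.linalg.solve(jac(x), fx)   # 2x2 Jacobian, 2-D residual
            x = np.clip(x - step, lower, upper)
        return x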

Kinematic Signatures of Prosody in Lombard Speech

Štefan Benuš 1, Juraj Šimko 2, Mona Lehtinen 2; 1UKF, Slovak Republic; 2University of Helsinki, Finland
Wed-P-7-2-9, Time: 13:30–15:30

Human spoken interactions are embodied and situated. Better understanding of the restrictions and affordances that this embodiment and situational awareness have on human speech informs the quest for more natural models of human-machine spoken interactions. Here we examine the articulatory realization of communicative meanings expressed through f0 falling and rising prosodic boundaries in quiet and noisy conditions. Our data show that 1) the effect of environmental noise is more robustly present in the post-boundary than the pre-boundary movements, 2) f0 falls and rises are only weakly differentiated in supra-laryngeal articulation and differ minimally in their response to noise, 3) individual speakers find different solutions for achieving the communicative goals, and 4) lip movements are affected by noise and boundary type more than the tongue movements.

What do Finnish and Central Bavarian Have in Common? Towards an Acoustically Based Quantity Typology

Markus Jochim, Felicitas Kleber; LMU München, Germany
Wed-P-7-2-10, Time: 13:30–15:30

The aim of this study was to investigate vowel and consonant quantity in Finnish, a typical quantity language, and to set up a reference corpus for a large-scale project studying the diachronic development of quantity contrasts in German varieties.


Although German is not considered a quantity language, both tense and lax vowels and voiced and voiceless stops are differentiated by vowel and closure duration, respectively. The role of these cues, however, has undergone different diachronic changes in various German varieties. To understand the conditions for such prosodic changes, the present study investigates the stability of quantity relations in an undisputed quantity language. To this end, recordings of words differing in vowel and stop length were obtained from seven older and six younger L1 Finnish speakers, both in a normal and a loud voice. We then measured vowel and stop duration and calculated the vowel to vowel-plus-consonant ratio (a measure known to differentiate German VC sequences) as well as the geminate-to-singleton ratio. Results show stability across age groups but variability across speech styles. Moreover, VC ratios were similar for Finnish and Bavarian German speakers. We discuss our findings against the background of a typology of vowel and consonant quantity.

Locating Burst Onsets Using SFF Envelope and Phase Information

Bhanu Teja Nellore, RaviShankar Prasad, Sudarsana Reddy Kadiri, Suryakanth V. Gangashetty, B. Yegnanarayana; IIIT Hyderabad, India
Wed-P-7-2-11, Time: 13:30–15:30

Bursts are produced by closing the oral tract at a place of articulation and suddenly releasing the acoustic energy built up behind the closure in the tract. The release of energy is an impulse-like behavior, and it is followed by a short duration of frication. The burst release is short and mostly weak in nature (compared to sonorant sounds), thus making it difficult to detect its presence in continuous speech. This paper attempts to identify burst onsets based on parameters derived from single frequency filtering (SFF) analysis of speech signals. The SFF envelope and phase information give good spectral and temporal resolutions of certain features of the signal. Signal reconstructed from the SFF phase information is shown to be useful in locating burst onsets. Entropy and spectral distance parameters from the SFF spectral envelopes are used to refine the burst onset candidate set. The identified burst onset locations are compared with manual annotations in the TIMIT database.
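
As a rough illustration of the SFF analysis the abstract builds on, one common formulation shifts the frequency of interest towards the Nyquist frequency and applies a single near-unit-circle pole; the sketch below follows that formulation and is not the authors' code:

    import numpy as np
    from scipy.signal import lfilter

    def sff_envelope_phase(x, fs, f_k, r=0.995):
        # Single-frequency-filtered envelope and phase of x at f_k (Hz):
        # shift f_k towards the Nyquist frequency, then filter with a
        # single real pole at z = -r (illustrative parameter values).
        n = np.arange(len(x))
        w_shift = np.pi - 2.0 * np.pi * f_k / fs
        shifted = x * np.exp(1j * w_shift * n)
        y = lfilter([1.0], [1.0, r], shifted)   # pole at z = -r
        return np.abs(y), np.angle(y)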

A Preliminary Phonetic Investigation of Alphabetic Words in Mandarin Chinese

Hongwei Ding 1, Yuanyuan Zhang 1, Hongchao Liu 2, Chu-Ren Huang 2; 1Shanghai Jiao Tong University, China; 2Hong Kong Polytechnic University, China
Wed-P-7-2-12, Time: 13:30–15:30

Chinese words written partly or fully in roman letters have gained popularity in Mandarin Chinese in the last few decades, and an appendix of such Mandarin Alphabetical Words (MAWs) is included in the authoritative dictionary of Standard Mandarin. However, no transcription of MAWs has been provided, because it is not clear whether we should keep the original English pronunciation or transcribe MAWs with the Mandarin Pinyin system. This study aims to investigate the phonetic adaptation of several of the most frequent MAWs extracted from the corpus. We recruited eight students from Shanghai, 18 students from Shandong Province, and one student from the USA. All the subjects were asked to read both 24 Chinese sentences embedding the MAWs and all 26 letters of the English alphabet. The results showed that the letters A, O, N and T were predominantly pronounced in Tone 1; H was often produced with vowel epenthesis after the final consonant; and B was usually produced in Tone 2 by Shanghai speakers and in Tone 4 by Shandong speakers. We conclude that the phonetic adaptation of MAWs is influenced by the dialects of the speakers and the tones of other Chinese characters in the MAWs, as well as individual preferences.

A Quantitative Measure of the Impact of Coarticulation on Phone Discriminability

Thomas Schatz 1, Rory Turnbull 1, Francis Bach 2, Emmanuel Dupoux 1; 1LSCP (UMR 8554), France; 2DIENS (UMR 8548), France
Wed-P-7-2-13, Time: 13:30–15:30

Acoustic realizations of a given phonetic segment are typically affected by coarticulation with the preceding and following phonetic context. While coarticulation has been extensively studied using descriptive phonetic measurements, little is known about the functional impact of coarticulation on speech processing. Here, we use DTW-based similarity defined on raw acoustic features and ABX scores to derive a measure of the effect of coarticulation on phonetic discriminability. This measure does not rely on defining segment-specific phonetic cues (formants, duration, etc.) and can be applied systematically and automatically to any segment in large-scale corpora. We illustrate our method using stimuli in English and Japanese. We confirm some expected trends, i.e., stronger anticipatory than perseveratory coarticulation and stronger coarticulation for lax/short vowels than for tense/long vowels. We then quantify for the first time the impact of coarticulation across different segment types (like vowels and consonants). We discuss how our metric and its possible extensions can help address current challenges in the systematic study of coarticulation.
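
A schematic version of the measurement pipeline described here (DTW distances on acoustic features feeding an ABX discrimination score) could look as follows; the distance, features and sampling scheme are simplified assumptions:

    import numpy as np

    def dtw_distance(a, b):
        # Plain DTW on frame-wise Euclidean distances between two
        # feature matrices (frames x dims), length-normalised.
        na, nb = len(a), len(b)
        cost = np.full((na + 1, nb + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, na + 1):
            for j in range(1, nb + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[na, nb] / (na + nb)

    def abx_score(a_items, b_items, x_items):
        # Fraction of (A, B, X) triplets in which X (same category as A)
        # is closer to A than to B under DTW.
        wins, total = 0, 0
        for a in a_items:
            for b in b_items:
                for x in x_items:
                    wins += dtw_distance(a, x) < dtw_distance(b, x)
                    total += 1
        return wins / total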

Wed-P-7-3 : Music and Audio Processing
Poster 3, 13:30–15:30, Wednesday, 23 Aug. 2017
Chairs: Prasanta Ghosh, Unto Laine

Sinusoidal Partials Tracking for Singing Analysis Using the Heuristic of the Minimal Frequency and Magnitude Difference

Kin Wah Edward Lin 1, Hans Anderson 1, Clifford So 2, Simon Lui 1; 1SUTD, Singapore; 2Chinese University of Hong Kong, China
Wed-P-7-3-1, Time: 13:30–15:30

We present a simple heuristic-based Sinusoidal Partial Tracking (PT) algorithm for singing analysis. Our PT algorithm uses a heuristic of minimal frequency and magnitude difference to track sinusoidal partials in popular music. An Ideal Binary Mask (IBM), which is created from the ground truth of the singing voice and the music accompaniment, is used to identify the sound source of the partials. In this justifiable way, we are able to assess the quality of the partials identified by the PT algorithm. Using the iKala dataset along with the IBM and BSS Eval 3.0 as a new method of quantifying partial quality, the comparative results show that our PT algorithm can achieve 0.8746 ∼ 1.7029 dB GNSDR gain, compared to two common benchmarks, namely the MQ algorithm and the SMS-PT algorithm. Thus, our PT algorithm can be considered a new benchmark for PT algorithms used in singing analysis.
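
The minimal-difference heuristic can be illustrated with a greedy frame-to-frame peak linking step such as the following sketch (weights and thresholds are arbitrary placeholders, not the authors' settings):

    def continue_tracks(prev_peaks, new_peaks, w_freq=1.0, w_mag=0.1, max_cost=50.0):
        # Greedily link spectral peaks of the current frame to those of the
        # previous frame by the smallest weighted frequency + magnitude
        # difference. Each list holds (freq_hz, mag_db) tuples. Returns
        # (prev_index, new_index) links; unmatched new peaks start new partials.
        links, used = [], set()
        for j, (f_new, m_new) in enumerate(new_peaks):
            best_i, best_cost = None, max_cost
            for i, (f_old, m_old) in enumerate(prev_peaks):
                if i in used:
                    continue
                cost = w_freq * abs(f_new - f_old) + w_mag * abs(m_new - m_old)
                if cost < best_cost:
                    best_i, best_cost = i, cost
            if best_i is not None:
                links.append((best_i, j))
                used.add(best_i)
        return links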

Audio Scene Classification with Deep Recurrent Neural Networks

Huy Phan, Philipp Koch, Fabrice Katzberg, Marco Maass, Radoslaw Mazur, Alfred Mertins; Universität zu Lübeck, Germany
Wed-P-7-3-2, Time: 13:30–15:30

We introduce in this work an efficient approach for audio scene classification using deep recurrent neural networks. An audio scene is firstly transformed into a sequence of high-level label tree embedding feature vectors.


The vector sequence is then divided into multiple subsequences, on which a deep GRU-based recurrent neural network is trained for sequence-to-label classification. The global predicted label for the entire sequence is finally obtained via aggregation of the subsequence classification outputs. We will show that our approach obtains an F1-score of 97.7% on the LITIS Rouen dataset, which is the largest dataset publicly available for the task. Compared to the best previously reported result on the dataset, our approach is able to reduce the relative classification error by 35.3%.

Automatic Time-Frequency Analysis of Echolocation Signals Using the Matched Gaussian Multitaper Spectrogram

Maria Sandsten, Isabella Reinhold, Josefin Starkhammar; Lund University, Sweden
Wed-P-7-3-3, Time: 13:30–15:30

High-resolution time-frequency (TF) images of multi-component signals are of great interest for visualization, feature extraction and estimation. The matched Gaussian multitaper spectrogram has been proposed to optimally resolve multi-component transient functions of Gaussian shape. Hermite functions are used as multitapers and the weights of the different spectrogram functions are optimized. For a fixed number of multitapers, the optimization gives the approximate Wigner distribution of the Gaussian shaped function. Increasing the number of multitapers gives a better approximation, i.e. a better resolution, but the cross-terms also become more prominent for close TF components. In this submission, we evaluate a number of different concentration measures to automatically estimate the number of multitapers resulting in the optimal spectrogram for TF images of dolphin echolocation signals. The measures are evaluated for different multi-component signals and noise levels, and a suggestion of an automatic procedure for optimal TF analysis is given. The results are compared to other well-known TF estimation algorithms, and examples of real data measurements of echolocation signals from a beluga whale (Delphinapterus leucas) are presented.

Classification-Based Detection of Glottal Closure Instants from Speech Signals

Jindrich Matoušek, Daniel Tihelka; University of West Bohemia, Czech Republic
Wed-P-7-3-4, Time: 13:30–15:30

In this paper, a classification-based method for the automatic detection of glottal closure instants (GCIs) from the speech signal is proposed. Peaks in the speech waveforms are taken as candidates for GCI placements. A classification framework is used to train a classification model and to classify whether or not a peak corresponds to a GCI. We show that the detection accuracy in terms of F1 score is 97.27%. In addition, despite using the speech signal only, the proposed method behaves comparably to a method utilizing the glottal signal. The method is also compared with three existing GCI detection algorithms on publicly available databases.
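
A hedged sketch of the candidate-selection stage described here, with waveform peaks as GCI candidates to be accepted or rejected by a trained classifier; the parameters are illustrative, not those of the paper:

    from scipy.signal import find_peaks

    def gci_candidates(speech, fs, min_f0=60.0):
        # Negative-going peaks of the waveform as GCI candidates; the
        # minimum peak distance allows several candidates per pitch period.
        distance = max(1, int(fs / (4 * min_f0)))
        peaks, _ = find_peaks(-speech, distance=distance)
        return peaks

    # A classifier trained on features extracted around each candidate peak
    # (e.g. with scikit-learn) would then label each candidate as GCI / non-GCI.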

A Domain Knowledge-Assisted Nonlinear Model for Head-Related Transfer Functions Based on Bottleneck Deep Neural Network

Xiaoke Qi, Jianhua Tao; Chinese Academy of Sciences, China
Wed-P-7-3-5, Time: 13:30–15:30

Many methods have been proposed for modeling head-related transfer functions (HRTFs) and yield a good performance level in terms of log-spectral distortion (LSD). However, most of them utilize linear weighting to reconstruct or interpolate HRTFs, and do not consider the inherent nonlinear relationship between the basis functions and HRTFs. Motivated by this, a domain knowledge-assisted nonlinear modeling method is proposed based on bottleneck features. Domain knowledge is used in two aspects. One is to generate the input features derived from the solution to the sound wave propagation equation at the physical level, and the other is to design the loss function for model training based on the knowledge of an objective evaluation criterion, i.e., LSD. Furthermore, by utilizing the strong representation ability of the bottleneck features, the nonlinear model has the potential to achieve a more accurate mapping. The objective and subjective experimental results show that the proposed method gains lower LSD when compared with a linear model, and the interpolated HRTFs can generate a similar perception to those of the database.

Laryngeal Articulation During Trumpet Performance: An Exploratory Study

Luis M.T. Jesus, Bruno Rocha, Andreia Hall; Universidade de Aveiro, Portugal
Wed-P-7-3-6, Time: 13:30–15:30

Music teachers’ reports suggest that respiratory function and laryngeal control in wind instrument playing stimulate muscular tension in the involved anatomical structures. However, the physiology and acoustics of the larynx during trumpet playing have seldom been studied. Therefore, the current paper describes laryngeal articulation during trumpet performance using biomedical signals and auditory perception. The activation of the laryngeal musculature of six professional trumpeters when playing a standard musical passage was analysed using audio, electroglottography (EGG), oxygen saturation and heart rate signals. Two university trumpet teachers listened to the audio recordings to evaluate the participants’ laryngeal effort (answers on a 100 mm Visual Analogue Scale (VAS): 0 “no perceived effort”; 100 “extreme effort”). Correlations between parameters extracted from the EGG data and the teachers’ perception of the audio stimuli were explored. Two hundred and fifty laryngeal articulations, where raising of the larynx and muscular effort were observed, were annotated and analysed. No correlation between the EGG data and the auditory evaluation was observed. However, both teachers perceived the laryngeal effort (VAS mean scores = 61±14). Our findings show that EGG and auditory perception data can provide new insights into laryngeal articulation and breathing control that are key to low muscular tension.

Matrix of Polynomials Model Based Polynomial Dictionary Learning Method for Acoustic Impulse Response Modeling

Jian Guan 1, Xuan Wang 1, Pengming Feng 2, Jing Dong 3, Wenwu Wang 4; 1Harbin Institute of Technology, China; 2Newcastle University, UK; 3Nanjing Tech University, China; 4University of Surrey, UK
Wed-P-7-3-7, Time: 13:30–15:30

We study the problem of dictionary learning for signals that can be represented as polynomials or polynomial matrices, such as convolutive signals with time delays or acoustic impulse responses. Recently, we developed a method for polynomial dictionary learning based on the fact that a polynomial matrix can be expressed as a polynomial with matrix coefficients, where the coefficient of the polynomial at each time lag is a scalar matrix. However, a polynomial matrix can also be equally represented as a matrix with polynomial elements. In this paper, we develop an alternative method for learning a polynomial dictionary and a sparse representation method for polynomial signal reconstruction based on this model. The proposed methods can be used directly to operate on the polynomial matrix without having to access its coefficient matrices. We demonstrate the performance of the proposed method for acoustic impulse response modeling.


Acoustic Scene Classification Using a CNN-SuperVector System Trained with Auditory and Spectrogram Image Features

Rakib Hyder 1, Shabnam Ghaffarzadegan 2, Zhe Feng 2, John H.L. Hansen 3, Taufiq Hasan 1; 1BUET, Bangladesh; 2Robert Bosch, USA; 3University of Texas at Dallas, USA
Wed-P-7-3-8, Time: 13:30–15:30

Enabling smart devices to infer about the environment using audio signals has been one of the several long-standing challenges in machine listening. The availability of public-domain datasets, e.g., Detection and Classification of Acoustic Scenes and Events (DCASE) 2016, has enabled researchers to compare various algorithms on standard predefined tasks. Most of the current best performing individual acoustic scene classification systems utilize different spectrogram image based features with a Convolutional Neural Network (CNN) architecture. In this study, we first analyze the performance of a state-of-the-art CNN system for different auditory image and spectrogram features, including Mel-scaled, logarithmically scaled and linearly scaled filterbank spectrograms, and Stabilized Auditory Image (SAI) features. Next, we benchmark an MFCC based Gaussian Mixture Model (GMM) SuperVector (SV) system for acoustic scene classification. Finally, we utilize the activations from the final layer of the CNN to form a SuperVector (SV) and use them as feature vectors for a Probabilistic Linear Discriminant Analysis (PLDA) classifier. Experimental evaluation on the DCASE 2016 database demonstrates the effectiveness of the proposed CNN-SV approach compared to conventional CNNs with a fully connected softmax output layer. Score fusion of individual systems provides up to 7% relative improvement in overall accuracy compared to the CNN baseline system.

An Environmental Feature Representation for Robust Speech Recognition and for Environment Identification

Xue Feng 1, Brigitte Richardson 2, Scott Amman 2, James Glass 1; 1MIT, USA; 2Ford, USA
Wed-P-7-3-9, Time: 13:30–15:30

In this paper we investigate environment feature representations, which we refer to as e-vectors, that can be used for environment adaptation in automatic speech recognition (ASR), and for environment identification. Inspired by the fact that i-vectors in the total variability space capture both speaker and channel environment variability, our proposed e-vectors are extracted from i-vectors. Two extraction methods are proposed: one is via linear discriminant analysis (LDA) projection, and the other via a bottleneck deep neural network (BN-DNN). Our evaluations show that by augmenting DNN-HMM ASR systems with the proposed e-vectors for environment adaptation, ASR performance is significantly improved. We also demonstrate that the proposed e-vector yields promising results on environment identification.
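
Of the two extraction routes mentioned, the LDA-based one can be sketched directly with scikit-learn (the BN-DNN variant is not shown; dimensionalities, variable names and labels are assumptions):

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def train_evector_lda(ivectors, env_labels, dim=None):
        # Fit an LDA projection from i-vectors (n_samples x ivector_dim)
        # to environment-discriminant "e-vectors"; `dim` caps the output
        # dimension (at most n_environments - 1).
        lda = LinearDiscriminantAnalysis(n_components=dim)
        lda.fit(ivectors, env_labels)
        return lda

    # lda = train_evector_lda(train_ivecs, train_env_labels)
    # e_vecs = lda.transform(test_ivecs)   # e-vectors for adaptation/ID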

Attention and Localization Based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging

Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang, Mark D. Plumbley; University of Surrey, UK
Wed-P-7-3-10, Time: 13:30–15:30

Audio tagging aims to perform multi-label classification on audio chunks and is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research efforts to better analyze and understand the content of the huge amounts of audio data on the web. The difficulty in audio tagging is that it only has a chunk-level label without a frame-level label. This paper presents a weakly supervised method to not only predict the tags but also indicate the temporal locations of the occurring acoustic events. The attention scheme is found to be effective in identifying the important frames while ignoring the unrelated frames. The proposed framework is a deep convolutional recurrent model with two auxiliary modules: an attention module and a localization module. The proposed algorithm was evaluated on Task 4 of the DCASE 2016 challenge. State-of-the-art performance was achieved on the evaluation set, with the equal error rate (EER) reduced from 0.13 to 0.11 compared with the convolutional recurrent baseline system.

An Audio Based Piano Performance Evaluation Method Using Deep Neural Network Based Acoustic Modeling

Jing Pan 1, Ming Li 1, Zhanmei Song 2, Xin Li 2, Xiaolin Liu 2, Hua Yi 2, Manman Zhu 2; 1Sun Yat-sen University, China; 2Shandong Yingcai University, China
Wed-P-7-3-11, Time: 13:30–15:30

In this paper, we propose an annotated piano performance evaluation dataset with 185 audio pieces and a method to evaluate the performance of piano beginners based on their audio recordings. The proposed framework includes three parts: piano key posterior probability extraction, Dynamic Time Warping (DTW) based matching, and performance score regression. First, a deep neural network model is trained to extract 88-dimensional piano key features from the Constant-Q Transform (CQT) spectrum. The proposed acoustic model shows high robustness to the recording environments. Second, we employ the DTW algorithm on the high-level piano key feature sequences to align the input with the template. Upon the alignment, we extract multiple global matching features that reflect the similarity between the input and the template. Finally, we apply linear regression to these matching features, with the scores annotated by experts in the training data, to estimate performance scores for test audio. Experimental results show that our automatic evaluation method achieves an average absolute score error of 2.64 in a score range from 0 to 100, and an average correlation coefficient of 0.73 on our in-house collected YCU-MPPE-II dataset.

Music Tempo Estimation Using Sub-Band Synchrony

Shreyan Chowdhury, Tanaya Guha, Rajesh M. Hegde; IIT Kanpur, India
Wed-P-7-3-12, Time: 13:30–15:30

Tempo estimation aims at estimating the pace of a musical piece measured in beats per minute. This paper presents a new tempo estimation method that utilizes coherent energy changes across multiple frequency sub-bands to identify the onsets. A new measure, called the sub-band synchrony, is proposed to detect and quantify the coherent amplitude changes across multiple sub-bands. Given a musical piece, our method first detects the onsets using the sub-band synchrony measure. The periodicity of the resulting onset curve, measured using the autocorrelation function, is used to estimate the tempo value. The performance of the sub-band synchrony based tempo estimation method is evaluated on two music databases. Experimental results indicate a reasonable improvement in performance when compared to conventional methods of tempo estimation.
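
A simplified stand-in for the described pipeline (sub-band energy changes, a synchrony-weighted onset curve, and autocorrelation-based tempo picking) might look like this; the window sizes, weighting and tempo search range are assumptions, not the paper's sub-band synchrony measure:

    import numpy as np
    from scipy.signal import stft

    def estimate_tempo(x, fs, n_bands=8, hop=512):
        f, t, Z = stft(x, fs=fs, nperseg=2048, noverlap=2048 - hop)
        power = np.abs(Z) ** 2
        bands = np.array_split(power, n_bands, axis=0)
        band_energy = np.array([b.sum(axis=0) for b in bands])    # (bands, frames)
        flux = np.maximum(np.diff(band_energy, axis=1), 0.0)      # positive changes
        onset = (flux > 0).mean(axis=0) * flux.sum(axis=0)        # synchrony-weighted
        onset = onset - onset.mean()
        ac = np.correlate(onset, onset, mode="full")[len(onset) - 1:]
        lags = np.arange(len(ac)) * hop / fs                      # seconds
        valid = (lags > 60.0 / 240) & (lags < 60.0 / 40)          # 40-240 BPM
        best_lag = lags[valid][np.argmax(ac[valid])]
        return 60.0 / best_lag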


A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification

Yun Wang, Florian Metze; Carnegie Mellon University, USA
Wed-P-7-3-13, Time: 13:30–15:30

Sound event detection is the task of detecting the type, onset time, and offset time of sound events in audio streams. The mainstream solution is recurrent neural networks (RNNs), which usually predict the probability of each sound event at every time step. Connectionist temporal classification (CTC) has been applied in order to relax the need for exact annotations of onset and offset times; the CTC output layer is expected to generate a peak for each event boundary where the acoustic signal is most salient. However, with limited training data, the CTC network has been found to train slowly, and to generalize poorly to new data.

In this paper, we try to introduce knowledge learned from a much larger corpus into the CTC network. We train two variants of SoundNet, a deep convolutional network that takes the audio tracks of videos as input and tries to approximate the visual information extracted by an image recognition network. A lower part of SoundNet or its variants is then used as a feature extractor for the CTC network to perform sound event detection. We show that the new feature extractor greatly accelerates the convergence of the CTC network, and slightly improves the generalization.

A Note Based Query By Humming System Using Convolutional Neural Network

Naziba Mostafa, Pascale Fung; HKUST, China
Wed-P-7-3-14, Time: 13:30–15:30

In this paper, we propose a note-based query by humming (QBH) system with a Hidden Markov Model (HMM) and a Convolutional Neural Network (CNN), since note-based systems are much more efficient than the traditional frame-based systems. A note-based QBH system has two main components: humming transcription and candidate melody retrieval.

For humming transcription, we are the first to use a hybrid model combining HMM and CNN. We use the CNN for its ability to learn features directly from raw audio data and to model the locality and variability often present in a note, and we use the HMM for handling the variability across the time axis.

For candidate melody retrieval, we use locality sensitive hashing to narrow down the candidates for retrieval, and dynamic time warping and earth mover's distance for the final ranking of the selected candidates.

We show that our HMM-CNN humming transcription system outperforms other state-of-the-art humming transcription systems by ∼2% using the transcription evaluation framework by Molina et al., and our overall query by humming system has a Mean Reciprocal Rank of 0.92 using the standard MIREX dataset, which is higher than other state-of-the-art note-based query by humming systems.

Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification

Hardik B. Sailor, Dharmesh M. Agrawal, Hemant A. Patil; DA-IICT, India
Wed-P-7-3-15, Time: 13:30–15:30

In this paper, we propose to use a Convolutional Restricted Boltzmann Machine (ConvRBM) to learn a filterbank from raw audio signals. ConvRBM is a generative model trained in an unsupervised way to model audio signals of arbitrary lengths. ConvRBM is trained using the annealed dropout technique, and its parameters are optimized using Adam optimization. The subband filters of ConvRBM learned from the ESC-50 database resemble the Fourier basis in the mid-frequency range, while some of the low-frequency subband filters resemble the Gammatone basis. The auditory-like filterbank scale is nonlinear w.r.t. the center frequencies of the subband filters and follows the standard auditory scales. We have used our proposed model as a front-end for the Environmental Sound Classification (ESC) task with a supervised Convolutional Neural Network (CNN) as a back-end. Using the CNN classifier, the ConvRBM filterbank (ConvRBM-BANK) and its score-level fusion with the Mel filterbank energies (FBEs) gave absolute improvements of 10.65% and 18.70% in classification accuracy, respectively, over FBEs alone on the ESC-50 database. This shows that the proposed ConvRBM filterbank also contains highly complementary information over the Mel filterbank, which is helpful in the ESC task.

Novel Shifted Real Spectrum for Exact Signal Reconstruction

Meet H. Soni, Rishabh Tak, Hemant A. Patil; DA-IICT, India
Wed-P-7-3-16, Time: 13:30–15:30

Retrieval of the phase of a signal is one of the major problems in signal processing. For exact signal reconstruction, both the magnitude and the phase spectrum of the signal are required. In many speech-based applications, only the magnitude spectrum is processed and the phase is ignored, which leads to degradation in performance. Here, we propose a novel technique that enables the reconstruction of the speech signal from the magnitude spectrum only. We consider the even-odd part decomposition of a causal sequence and operate only on the real part of the DTFT of the signal. We propose shifting the real part of the DTFT of the sequence to make it non-negative. By adding a constant of sufficient value to the real part of the DTFT, exact signal reconstruction is possible from the magnitude or power spectrum alone. Moreover, we have compared our proposed approach with a recently proposed method for phase retrieval from the magnitude spectrum of a Causal Delta Dominant (CDD) signal. We found that the method of phase retrieval from the CDD signal and the proposed method are identical under certain approximations. However, the proposed method involves less computational cost for exact processing of the signal.
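
The reconstruction idea can be sketched with a DFT approximation of the DTFT, assuming the causal sequence was zero-padded to at least twice its length before the forward transform; this is an illustration of the principle, not the authors' algorithm:

    import numpy as np

    def reconstruct_from_shifted_real_spectrum(magnitude, C):
        # magnitude = |R(w) + C|, where R(w) is the real part of the DFT of a
        # causal sequence zero-padded to N >= 2 * support length, and C was
        # chosen so that R(w) + C >= 0 (hence the magnitude equals R(w) + C).
        real_part = magnitude - C            # undo the shift
        even = np.fft.ifft(real_part).real   # even part of the causal sequence
        x = 2.0 * even                       # x[n] = 2 * x_e[n] for n > 0
        x[0] = even[0]                       # x[0] is not doubled
        x[len(x) // 2:] = 0.0                # keep only the causal support
        return x

    # Forward direction (assumed): R = np.fft.fft(x_zero_padded).real,
    # C >= -R.min(), stored spectrum = np.abs(R + C).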

Wed-P-7-4 : Disorders Related to Speech and Language
Poster 4, 13:30–15:30, Wednesday, 23 Aug. 2017
Chair: Jan Rusz

Manual and Automatic Transcriptions in Dementia Detection from Speech

Jochen Weiner, Mathis Engelbart, Tanja Schultz; Universität Bremen, Germany
Wed-P-7-4-1, Time: 13:30–15:30

As the population in developed countries is aging, larger numbers of people are at risk of developing dementia. In the near future there will be a need for time- and cost-efficient screening methods. Speech can be recorded and analyzed in this manner, and as speech and language are affected early on in the course of dementia, automatic speech processing can provide valuable support for such screening methods.

We present two pipelines of feature extraction for dementia detection:


The manual pipeline uses manual transcriptions, while the fully automatic pipeline uses transcriptions created by automatic speech recognition (ASR). The acoustic and linguistic features that we extract need no language-specific tools other than the ASR system. Using these two different feature extraction pipelines, we automatically detect dementia. Our results show that the ASR system's transcription quality is a good single feature, and that the features extracted from automatic transcriptions perform similarly to or slightly better than the features extracted from the manual transcriptions.

An Affect Prediction Approach Through Depression Severity Parameter Incorporation in Neural Networks

Rahul Gupta 1, Saurabh Sahu 2, Carol Espy-Wilson 2, Shrikanth S. Narayanan 3; 1Amazon.com, USA; 2University of Maryland, USA; 3University of Southern California, USA
Wed-P-7-4-2, Time: 13:30–15:30

Humans use emotional expressions to communicate their internal affective states. These behavioral expressions are often multi-modal (e.g. facial expressions, voice and gestures), and researchers have proposed several schemes to predict the latent affective states based on these expressions. The relationship between the latent affective states and their expression is hypothesized to be affected by several factors, depression disorder being one of them. Despite a wide interest in affect prediction, and several studies linking the effect of depression on affective expressions, only a limited number of affect prediction models account for depression severity. In this work, we present a novel scheme that incorporates depression severity as a parameter in Deep Neural Networks (DNNs). In order to predict affective dimensions for an individual at hand, our scheme alters the DNN activation function based on the subject's depression severity. We perform experiments on affect prediction in two different sessions of the Audio-Visual Depressive language Corpus, which involves patients with varying degrees of depression. Our results show improvements in arousal and valence prediction on both sessions using the proposed DNN modeling. We also present an analysis of the impact of such an alteration in DNNs during training and testing.

Cross-Database Models for the Classification of Dysarthria Presence

Stephanie Gillespie 1, Yash-Yee Logan 1, Elliot Moore 1, Jacqueline Laures-Gore 2, Scott Russell 3, Rupal Patel 4; 1Georgia Institute of Technology, USA; 2Georgia State University, USA; 3Grady Memorial Hospital, USA; 4Northeastern University, USA
Wed-P-7-4-3, Time: 13:30–15:30

Dysarthria is a motor speech disorder that impacts verbal articulation and co-ordination, resulting in slow, slurred and imprecise speech. Automated classification of dysarthria subtypes and severities could provide a useful clinical tool in assessing the onset of the disorder and progress in treatment. This study represents a pilot project to train models to detect the presence of dysarthria in continuous speech. Subsets of the Universal Access Research Dataset (UA-Speech) and the Atlanta Motor Speech Disorders Corpus (AMSDC) database were utilized in a cross-database training strategy (training on UA-Speech / testing on AMSDC) to distinguish speech with and without dysarthria. In addition to traditional spectral and prosodic features, the current study also includes features based on the Teager Energy Operator (TEO) and the glottal waveform. Baseline results on the UA-Speech dataset maximize word- and participant-level accuracies at 75.3% and 92.9% using prosodic features. However, the cross-training of UA-Speech tested on the AMSDC maximizes word- and participant-level accuracies at 71.3% and 90% based on a TEO feature. The results of this pilot study reinforce consideration of dysarthria subtypes in cross-dataset training, as well as highlight additional features that may be sensitive to the presence of dysarthria in continuous speech.

Acoustic Evaluation of Nasality in Cerebellar Syndromes

M. Novotný 1, Jan Rusz 1, K. Spálenka 1, Jirí Klempír 2, D. Horáková 2, Evžen Ružicka 2; 1CTU, Czech Republic; 2Charles University, Czech Republic
Wed-P-7-4-4, Time: 13:30–15:30

Although previous studies have reported the occurrence of velopharyngeal incompetence connected with ataxic dysarthria, there is a lack of evidence related to nasality assessment in cerebellar disorders. This is partly due to the limited reliability of challenging analyses and partly due to nasality being a less pronounced manifestation of ataxic dysarthria. Therefore, we employed 1/3-octave spectral analysis as an objective measurement of nasality disturbances. We analyzed 20 subjects with multiple system atrophy (MSA), 13 subjects with cerebellar ataxia (CA), 20 subjects with multiple sclerosis (MS) and 20 healthy (HC) speakers. Although we did not detect the presence of hypernasality, our results showed increased nasality fluctuation in 65% of MSA, 43% of CA and 30% of MS subjects compared to 15% of HC speakers, suggesting inconsistent velopharyngeal motor control. Furthermore, we found a statistically significant difference between MSA and HC participants (p<0.001), and a significant correlation between the cerebellar subscore of the Natural History and Neuroprotection in Parkinson Plus Syndromes — Parkinson Plus Scale and nasality fluctuations in MSA (r=0.51, p<0.05). In conclusion, acoustic analysis showed an increased presence of abnormal nasality fluctuations in all ataxic groups and revealed that nasality fluctuation is associated with distortion of cerebellar functions.

Emotional Speech of Mentally and Physically Disabled Individuals: Introducing the EmotAsS Database and First Findings

Simone Hantke, Hesam Sagha, Nicholas Cummins, Björn Schuller; Universität Passau, Germany
Wed-P-7-4-5, Time: 13:30–15:30

The automatic recognition of emotion from speech is a mature research field with a large number of publicly available corpora. However, to the best of the authors' knowledge, none of these datasets consists solely of emotional speech samples from individuals with mental, neurological and/or physical disabilities. Yet, such individuals could benefit from speech-based assistive technologies to enhance their communication with their environment and to manage their daily work process. With the aim of advancing these technologies, we fill this void in emotional speech resources by introducing the EmotAsS (Emotional Sensitivity Assistance System for People with Disabilities) corpus, consisting of spontaneous emotional German speech data recorded from 17 mentally, neurologically and/or physically disabled participants in their daily work environment, resulting in just under 11 hours of total speech time and featuring approximately 12.7 k utterances after segmentation. Transcription was performed and labelling was carried out in seven emotional categories, as well as for the intelligibility of the speaker. We present a set of baseline results, based on standard acoustic and linguistic features, for arousal and valence emotion recognition.


Phonological Markers of Oxytocin and MDMA Ingestion

Carla Agurto 1, Raquel Norel 1, Rachel Ostrand 1, Gillinder Bedi 2, Harriet de Wit 2, Matthew J. Baggott 2, Matthew G. Kirkpatrick 3, Margaret Wardle 4, Guillermo A. Cecchi 1; 1IBM, USA; 2University of Chicago, USA; 3University of Southern California, USA; 4UTHealth, USA
Wed-P-7-4-6, Time: 13:30–15:30

Speech data has the potential to become a powerful tool to provide quantitative information about emotion beyond that achieved by subjective assessments. Based on this concept, we investigate the use of speech to identify effects in subjects under the influence of two different drugs: Oxytocin (OT) and 3,4-methylenedioxymethamphetamine (MDMA), also known as ecstasy. We extract a set of informative phonological features that can characterize emotion. Then, we perform classification to detect if the subject is under the influence of a drug. Our best results show low error rates of 13% and 17% for the subject classification of OT and MDMA vs. placebo, respectively. We also analyze the performance of the features to differentiate the two levels of MDMA doses, obtaining an error rate of 19%. The results indicate that subtle emotional changes can be detected in the context of drug use.

An Avatar-Based System for Identifying Individuals Likely to Develop Dementia

Bahman Mirheidari 1, Daniel Blackburn 1, Kirsty Harkness 2, Traci Walker 1, Annalena Venneri 1, Markus Reuber 2, Heidi Christensen 1; 1University of Sheffield, UK; 2Royal Hallamshire Hospital, UK
Wed-P-7-4-7, Time: 13:30–15:30

This paper presents work on developing an automatic dementia screening test based on patients' ability to interact and communicate — a highly cognitively demanding process where early signs of dementia can often be detected. Such a test would help general practitioners, with no specialist knowledge, make better diagnostic decisions, as current tests lack specificity and sensitivity. We investigate the feasibility of basing the test on conversations between a 'talking head' (avatar) and a patient, and we present a system for analysing such conversations for signs of dementia in the patient's speech and language. Previously we proposed a semi-automatic system that transcribed conversations between patients and neurologists and extracted conversation analysis style features in order to differentiate between patients with progressive neurodegenerative dementia (ND) and functional memory disorders (FMD). Determining who talks when in the conversations was performed manually. In this study, we investigate a fully automatic system including speaker diarisation, and the use of additional acoustic and lexical features. Initial results from a pilot study are presented, which show that the avatar conversations can successfully classify ND/FMD with around 91% accuracy, in line with previous results for conversations that were led by a neurologist.

Cross-Domain Classification of Drowsiness in Speech: The Case of Alcohol Intoxication and Sleep Deprivation

Yue Zhang 1, Felix Weninger 2, Björn Schuller 1; 1Imperial College London, UK; 2Nuance Communications, Germany
Wed-P-7-4-8, Time: 13:30–15:30

In this work, we study the drowsy state of a speaker, induced by alcohol intoxication or sleep deprivation. In particular, we investigate the coherence between the two pivotal causes of drowsiness, as featured in the Intoxication and Sleepiness tasks of the INTERSPEECH Speaker State Challenge. In this way, we aim to exploit the interrelations between these different, yet highly correlated speaker states, which need to be reliably recognised in safety and security critical environments. To this end, we perform cross-domain classification of alcohol intoxication and sleepiness, thus leveraging the acoustic similarities of these speech phenomena for transfer learning. Further, we conducted an in-depth feature analysis to quantitatively assess the task relatedness and to determine the most relevant features for both tasks. To test our methods in realistic contexts, we use the Alcohol Language Corpus and the Sleepy Language Corpus, containing in total 60 hours of genuine intoxicated and sleepy speech. As a result, cross-domain classification combined with feature selection yields up to 60.3% unweighted average recall, which is significantly above chance (50%) and highly notable given the mismatch between the training and validation data. Finally, we show that an effective, general drowsiness classifier can be obtained by aggregating the training data from both domains.

Depression Detection Using Automatic Transcriptions of De-Identified Speech

Paula Lopez-Otero 1, Laura Docio-Fernandez 1, Alberto Abad 2, Carmen Garcia-Mateo 1; 1Universidade de Vigo, Spain; 2INESC-ID Lisboa, Portugal
Wed-P-7-4-9, Time: 13:30–15:30

Depression is a mood disorder that is usually addressed by outpatient treatments in order to favour patients' inclusion in society. This leads to a need for novel automatic tools exploiting speech processing approaches that can help to monitor the emotional state of patients via telephone or the Internet. However, the transmission, processing and subsequent storage of such sensitive data raise several privacy concerns. Speech de-identification can be used to protect the patients' identity. Nevertheless, these techniques modify the speech signal, eventually affecting the performance of depression detection approaches based on either speech characteristics or automatic transcriptions. This paper presents a study on the influence of speech de-identification when using transcription-based approaches for depression detection. To this effect, a system based on the global vectors method for natural language processing is proposed. In contrast to previous works, two main sources of nuisance have been considered: the de-identification process itself and the transcription errors introduced by the automatic recognition of the patients' speech. Experimental validation on the DAIC-WOZ corpus reveals very promising results, obtaining only a slight performance degradation with respect to the use of manual transcriptions.

An N-Gram Based Approach to the Automatic Diagnosis of Alzheimer’s Disease from Spoken Language

Sebastian Wankerl, Elmar Nöth, Stefan Evert; FAU Erlangen-Nürnberg, Germany
Wed-P-7-4-10, Time: 13:30–15:30

Alzheimer’s disease (AD) is the most common cause of dementia and affects wide parts of the elderly population. Since there exists no cure for this illness, it is of particular interest to develop reliable and easy-to-use diagnostic methods to alleviate its effects. Speech can be a useful indicator to reach this goal. We propose a purely statistical approach towards the automatic diagnosis of AD which is solely based on n-gram models with subsequent evaluation of the perplexity and does not incorporate any further linguistic features. Hence, it works independently of a concrete language.


We evaluate our approach on the DementiaBank, which contains spontaneous speech of test subjects describing a picture. Using the Equal Error Rate as the classification threshold, we achieve an accuracy of 77.1%. In addition to that, we studied the correlation between the calculated perplexities and the Mini-Mental State Examination (MMSE) scores of the test subjects. While there is little correlation for the healthy control group, a higher correlation could be found when considering the demented speakers. This makes it reasonable to conclude that our approach reveals some of the cognitive limitations of AD patients and can help to better diagnose the disease based on speech.
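
A minimal sketch of the perplexity-based decision rule described here, using a small add-one-smoothed bigram model in place of the paper's language models (training data, smoothing and threshold are assumptions):

    import math
    from collections import Counter

    def bigram_counts(sentences):
        # sentences: list of token lists
        uni, bi = Counter(), Counter()
        for words in sentences:
            toks = ["<s>"] + words + ["</s>"]
            uni.update(toks)
            bi.update(zip(toks[:-1], toks[1:]))
        return uni, bi

    def perplexity(sentences, uni, bi, vocab_size):
        # Bigram perplexity with add-one smoothing.
        log_prob, n_tokens = 0.0, 0
        for words in sentences:
            toks = ["<s>"] + words + ["</s>"]
            for prev, cur in zip(toks[:-1], toks[1:]):
                p = (bi[(prev, cur)] + 1) / (uni[prev] + vocab_size)
                log_prob += math.log(p)
                n_tokens += 1
        return math.exp(-log_prob / n_tokens)

    # Decision idea: score a test transcript against an AD-trained model and a
    # control-trained model; the perplexity difference, compared against an
    # EER-derived threshold, gives the predicted label.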

Exploiting Intra-Annotator Rating Consistency Through Copeland’s Method for Estimation of Ground Truth Labels in Couples’ Therapy

Karel Mundnich, Md. Nasir, Panayiotis Georgiou, Shrikanth S. Narayanan; University of Southern California, USA
Wed-P-7-4-11, Time: 13:30–15:30

Behavioral and mental health research and its clinical applications widely rely on quantifying human behavioral expressions. This often requires human-derived behavioral annotations, which tend to be noisy, especially when the psychological objects of interest are latent and subjective in nature. This paper focuses on exploiting multiple human annotations toward improving the reliability of the ensemble decision, by creating a ranking of the evaluated objects. To create this ranking, we employ an adapted version of Copeland’s counting method, which results in robust inter-annotator rankings and agreement. We use a simple mapping between the ranked objects and the scale of evaluation, which preserves the original distribution of ratings, based on maximum likelihood estimation. We apply the algorithm to ratings that lack a ground truth. Therefore, we assess our algorithm in two ways: (1) by corrupting the annotations with different distributions of noise, and computing the inter-annotator agreement between the ensemble estimates derived from the original and corrupted data using Krippendorff’s α; and (2) by replacing one annotator at a time with the ensemble estimate. Our results suggest that the proposed method provides a robust alternative that suffers less from individual annotator preferences/biases and scale misuse.
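
A simplified sketch of a Copeland-style ranking over multiply annotated items; the paper's adapted version and the maximum-likelihood mapping back to the rating scale are not reproduced here:

    import numpy as np
    from itertools import combinations

    def copeland_ranking(ratings):
        # ratings: (n_annotators x n_items) matrix. For every pair of items,
        # the item preferred by more annotators wins; score = wins - losses.
        n_items = ratings.shape[1]
        score = np.zeros(n_items)
        for i, j in combinations(range(n_items), 2):
            prefer_i = np.sum(ratings[:, i] > ratings[:, j])
            prefer_j = np.sum(ratings[:, j] > ratings[:, i])
            score[i] += np.sign(prefer_i - prefer_j)
            score[j] += np.sign(prefer_j - prefer_i)
        return np.argsort(-score), score   # item indices (best first), scores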

Rhythmic Characteristics of Parkinsonian Speech: A Study on Mandarin and Polish

Massimo Pettorino 1, Wentao Gu 2, Paweł Półrola 3, Ping Fan 2; 1Università di Napoli “L’Orientale”, Italy; 2Nanjing Normal University, China; 3UJK, Poland
Wed-P-7-4-12, Time: 13:30–15:30

Previous studies on Italian speech showed that the percentage of vocalic portion in the utterance (%V) and the duration of the interval between two consecutive vowel onset points (VtoV) were larger for parkinsonian (PD) speakers than for healthy controls (HC). In particular, the values of %V were distinctly separated between PD and HC. The present study aimed to further test this finding on Mandarin and Polish. Twenty-five Mandarin speakers (13 PD and 12 HC matched on age) and thirty-one Polish speakers (18 PD and 13 HC matched on age) read aloud a passage of a story. The recorded speech was segmented into vocalic and consonantal intervals, and then %V and VtoV were calculated. For both languages, VtoV overlapped between HC and PD. For Polish, %V was distinctly higher in PD than in HC, while for Mandarin there was no significant difference. This suggests that %V could be used for automatic diagnosis of PD for Italian and Polish, but not for Mandarin. The effectiveness of the rhythmic metric appears to be language-dependent, varying with the rhythmic typology of the language.
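
The two rhythm measures are simple to compute once vocalic and consonantal intervals have been segmented; an illustrative implementation (input formats are assumptions):

    def rhythm_metrics(intervals, vowel_onsets):
        # intervals: list of (label, start_s, end_s) with label "V" or "C";
        # vowel_onsets: sorted list of vowel onset times in seconds.
        vocalic = sum(end - start for lab, start, end in intervals if lab == "V")
        total = sum(end - start for _, start, end in intervals)
        percent_v = 100.0 * vocalic / total
        vtov = [b - a for a, b in zip(vowel_onsets[:-1], vowel_onsets[1:])]
        mean_vtov = sum(vtov) / len(vtov)
        return percent_v, mean_vtov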

Wed-P-8-1 : Prosody
Poster 1, 16:00–18:00, Wednesday, 23 Aug. 2017
Chair: Stefanie Jannedy

Trisyllabic Tone 3 Sandhi Patterns in Mandarin Produced by Cantonese Speakers

Jung-Yueh Tu 1, Janice Wing-Sze Wong 2, Jih-Ho Cha 3; 1Shanghai Jiao Tong University, China; 2Hong Kong Baptist University, China; 3National Tsing Hua University, Taiwan
Wed-P-8-1-1, Time: 16:00–18:00

The third tone sandhi in Mandarin is a well-studied rule, whereby a Tone 3 followed by another Tone 3 is changed to a rising tone, similar to Tone 2. This Tone 3 sandhi rule is straightforward in disyllabic words, where it is phonetically driven for ease of production. In sequences of three or more syllables with Tone 3, however, the application of Tone 3 sandhi is more complicated and involves both the prosodic and morpho-syntactic domains, which makes it difficult for L2 learners. This study aims to understand how L2 learners with experience of another tone language master the Mandarin Tone 3 sandhi rule. Specifically, the study investigates the production of Tone 3 sandhi in trisyllabic Mandarin words by Cantonese speakers. In the current study, 30 Cantonese speakers were requested to produce 15 trisyllabic words (“1+[2+3]” and “[1+2]+3” sandhi patterns) and 5 hexasyllabic sentences with Tone 3 in sequences. The analyses of results center on three major types of error patterns: overgeneralization, underapplication, and combination. The findings are discussed with regard to the phono-syntactic interactions of Tone 3 sandhi at the lexical and phrasal levels, as well as the influence of the Cantonese tonal system.

Intonation of Contrastive Topic in Estonian

Heete Sahkai, Meelis Mihkla; Institute of the Estonian Language, Estonia
Wed-P-8-1-2, Time: 16:00–18:00

Contrastive topic is an information structural category that is usually associated with a specific intonation, which tends to be similar across languages (a rising pitch accent). The aim of the present study is to examine whether this is also true of Estonian. Three potential prosodic correlates of contrastive topics are examined: marking with a particular pitch accent type, an emphatic realization of the pitch accent, and a following prosodic boundary. With respect to pitch accent types, it is found that only two subjects out of eight distinguish sentences with a contrastive topic from other types of information structure; the contour bears resemblance to contrastive topic intonation in other languages (consisting of an H* accent on the contrastive topic and an HL* accent on the focus), but is not restricted to sentences with contrastive topics. A more consistent correlate turns out to be an emphatic realization of the pitch accent carried by the contrastive topic constituent. No evidence is found of a tendency to produce contrastive topics as separate prosodic phrases.

Reanalyze Fundamental Frequency Peak Delay in Mandarin

Lixia Hao, Wei Zhang, Yanlu Xie, Jinsong Zhang; BLCU, China
Wed-P-8-1-3, Time: 16:00–18:00

In Mandarin, Fundamental Frequency (F0) peak delay has been reported to occur frequently in a rising (R) tone or high (H) tone succeeded by a low (L) tone.


Its occurrence has been ascribed to articulatory constraints within a conflicting tonal context: a high offset target followed by a low onset target. To further examine the underlying mechanism of the phenomenon, the current study tests the possibility that valley delay, as opposed to peak delay, may occur in an L+H tonal context, and that peak or valley delay may also occur within a compatible tonal context where adjacent tonal values are identical or similar. An experiment was conducted on the Annotated Speech Corpus of Chinese Discourse to investigate the frequency of occurrence and the amount of peak and valley delay. The results indicated that F0 peak and valley delay frequently occurred in both conflicting and compatible tonal contexts; the phenomenon was found extensively in the R tone and F (falling) tone, but barely in the H tone and L tone. The findings suggest that while peak or valley delay is partially due to articulatory constraints in certain tonal contexts, the speakers' active effort-distribution strategy based on an economical principle is also behind the phenomenon.

How Does the Absence of Shared Knowledge Between Interlocutors Affect the Production of French Prosodic Forms?

Amandine Michelas, Cecile Cau, Maud Champagne-Lavau; LPL (UMR 7309), France
Wed-P-8-1-4, Time: 16:00–18:00

We examine the hypothesis that modelling the addressee in spoken interaction affects the production of prosodic forms by the speaker. This question was tested in an interactive paradigm that enabled us to measure prosodic variations at two levels: the global/acoustic level and the phonological one. We used a semi-spontaneous task in which French speakers gave instructions to addressees about where to place a cross between different objects (e.g., Tu mets la croix entre la souris bordeau et la maison bordeau; ‘You put the cross between the red mouse and the red house’). Each trial was composed of two noun-adjective fragments and the target was the second fragment. We manipulated (i) whether the two interlocutors shared or did not share the same objects and (ii) the informational status of targets, to obtain variations in abstract prosodic phrasing. We found that the absence of shared knowledge between interlocutors affected the speaker's production of prosodic forms at the global/acoustic level (i.e., pitch range and speech rate) but not at the phonological one (i.e., prosodic phrasing). These results are consistent with a mechanism in which global prosodic variations are influenced by audience design because they reflect the way that speakers help addressees to understand speech.

Three Dimensions of Sentence Prosody and Their (Non-)Interactions

Michael Wagner, Michael McAuliffe; McGill University, Canada
Wed-P-8-1-5, Time: 16:00–18:00

Prosody simultaneously encodes different kinds of information, including the type of speech act of an utterance (e.g., falling declarative vs. rising interrogative intonational tunes), the location of semantic focus (via prosodic prominence), and syntactic constituent structure (via prosodic phrasing). The syntactic/semantic functional dimensions (speech act, focus, constituency) are orthogonal to each other, but the extent to which their prosodic correlates (tune, prominence, phrasing) are remains controversial. This paper takes a ‘bottom up’ approach to test for interactions, and reports evidence that, contrary to many current theories of sentence intonation, the cues to the three dimensions are often orthogonal where interactions are predicted.

Using Prosody to Classify Discourse Relations

Janine Kleinhans 1, Mireia Farrús 1, Agustín Gravano 2, Juan Manuel Pérez 2, Catherine Lai 3, Leo Wanner 1; 1Universitat Pompeu Fabra, Spain; 2Universidad de Buenos Aires, Argentina; 3University of Edinburgh, UK
Wed-P-8-1-6, Time: 16:00–18:00

This work aims to explore the correlation between the discourse structure of a spoken monologue and its prosody by predicting discourse relations from different prosodic attributes. For this purpose, a corpus of semi-spontaneous monologues in English has been automatically annotated according to Rhetorical Structure Theory, which models coherence in text via rhetorical relations. From the corresponding audio files, prosodic features such as pitch, intensity, and speech rate have been extracted from different contexts of a relation. Supervised classification tasks using Support Vector Machines have been performed to find relationships between prosodic features and rhetorical relations. Preliminary results show that intensity, combined with other features extracted from intra- and intersegmental environments, is the feature with the highest predictability for a discourse relation. The prediction of rhetorical relations from prosodic features and their combinations is straightforwardly applicable to several tasks such as speech understanding or generation. Moreover, the knowledge of how rhetorical relations should be marked in terms of prosody will serve as a basis to improve speech synthesis applications and make voices sound more natural and expressive.

Canonical Correlation Analysis and Prediction of Perceived Rhythmic Prominences and Pitch Tones in Speech

Elizabeth Godoy, James R. Williamson, Thomas F. Quatieri; MIT Lincoln Laboratory, USA
Wed-P-8-1-7, Time: 16:00–18:00

Speech prosody encodes information about language and communicative intent as well as speaker identity and state. Consequently, a host of speech technologies could benefit from increased understanding of prosodic phenomena and the corresponding acoustics. A recently developed comprehensive prosodic transcription system called RaP (Rhythm-and-Pitch) annotates both perceived rhythmic prominences and pitch tones in speech. Using RaP-annotated speech corpora, the present work analyzes relationships between perceived prosodic events and acoustic features, including syllable duration and novel measures of intensity and fundamental frequency. Canonical Correlation Analysis (CCA) reveals two dominant prosodic dimensions relating the acoustic features and RaP annotations. The first captures perceived prosodic emphasis of syllables, indicated by strong metrical beats and significant pitch variability (i.e. presence of either high or low pitch tones). Acoustically, this dimension is described most by syllable duration, followed by the mean intensity and fundamental frequency measures. The second CCA dimension then primarily discriminates pitch tone level (high versus low), indicated mainly by the mean fundamental frequency measure. Finally, within a leave-one-out cross-validation framework, RaP prosodic events are well predicted from acoustic features (AUC between 0.78 and 0.84). Future work will exploit automated RaP labelling in contexts ranging from language learning to neurological disorder recognition.
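
The CCA step can be reproduced schematically with scikit-learn, given syllable-level matrices of acoustic features and RaP annotation indicators (matrix contents and dimensions are assumptions):

    import numpy as np
    from sklearn.cross_decomposition import CCA

    def prosody_cca(acoustic, annotations, n_components=2):
        # acoustic: (n_syllables x n_acoustic_features),
        # annotations: (n_syllables x n_annotation_indicators).
        cca = CCA(n_components=n_components)
        A, B = cca.fit_transform(acoustic, annotations)
        corrs = [np.corrcoef(A[:, k], B[:, k])[0, 1] for k in range(n_components)]
        return cca, corrs   # fitted model and per-dimension canonical correlations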

Evaluation of Spectral Tilt Measures for Sentence Prominence Under Different Noise Conditions

Sofoklis Kakouros, Okko Räsänen, Paavo Alku; Aalto University, Finland
Wed-P-8-1-8, Time: 16:00–18:00

Spectral tilt has been suggested to be a correlate of prominence in speech, although several studies have not replicated this empirically. This may be partially due to the lack of a standard method for tilt estimation from speech, rendering interpretations and comparisons between studies difficult. In addition, little is known about the performance of tilt estimators for prominence detection in the presence of noise. In this work, we investigate and compare several standard tilt measures on quantifying prominence in spoken Dutch and under different levels of additive noise. We also compare these measures with other acoustic correlates of prominence, namely, energy, F0, and duration. Our results provide further empirical support for the finding that tilt is a systematic correlate of prominence, at least in Dutch, even though energy, F0, and duration appear still to be more robust features for the task. In addition, our results show that there are notable differences between different tilt estimators in their ability to discriminate prominent words from non-prominent ones in different levels of noise.
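
Since the abstract notes that there is no single standard tilt estimator, here is a minimal sketch of one simple variant: the regression slope of the log-magnitude spectrum of a voiced frame. The frequency band and units are illustrative choices, not the measures compared in the paper.

```python
# Hypothetical sketch: spectral tilt as the slope of a line fitted to the log spectrum.
import numpy as np

def spectral_tilt(frame, sr, fmin=50.0, fmax=5000.0):
    """Return the regression slope of the log-magnitude spectrum in dB per kHz."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    band = (freqs >= fmin) & (freqs <= fmax)
    log_mag = 20.0 * np.log10(spectrum[band] + 1e-10)
    slope, _intercept = np.polyfit(freqs[band] / 1000.0, log_mag, deg=1)
    return slope  # more negative = steeper tilt

sr = 16000
t = np.arange(0, 0.025, 1.0 / sr)
frame = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)
print(spectral_tilt(frame, sr))
```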

Creaky Voice as a Function of Tonal Categories and Prosodic Boundaries

Jianjing Kuang; University of Pennsylvania, USA
Wed-P-8-1-9, Time: 16:00–18:00

This study looks into the distribution of creaky voice in Mandarin in continuous speech. A creaky voice detector was used to automatically detect the appearance of creaky voice in a large-scale Mandarin corpus (Sinica COSPRO corpus). As the prosodic information has been annotated in the corpus, we were able to look at the distribution of creaky voice as a function of the interaction between tone and prosodic structures. As expected, among the five tonal categories (four lexical tones and one neutral tone), creaky voice is most likely to occur with Tone 3 and the neutral tone, followed by Tone 2 and Tone 4. Prosodic boundaries also play important roles, as the likelihood of creak increases when the prosodic boundaries are larger, regardless of the tonal categories. It is also confirmed that the pitch range for the occurrence of creaky voice is 110 Hz for male speakers and 170 Hz for female speakers, consistent with previous small-scale studies. Finally, male speakers have a higher overall rate of creaky voice than female speakers. Altogether, this study validates the hypotheses from previous studies, and provides a better understanding of voice-source variation in different prosodic conditions.

The Acoustics of Word Stress in Czech as a Function of Speaking Style

Radek Skarnitzl 1, Anders Eriksson 2; 1Charles University, Czech Republic; 2Stockholm University, Sweden
Wed-P-8-1-10, Time: 16:00–18:00

The study is part of a series of studies which examine the acoustic correlates of lexical stress in several typologically different languages, in three speech styles: spontaneous speech, phrase reading, and wordlist reading. This study focuses on Czech, a language with stress fixed on the first syllable of a prosodic word, with no contrastive function at the level of individual words. The acoustic parameters examined here are F0-level, F0-variation, Duration, Sound Pressure Level, and Spectral Emphasis. Values for over 6,000 vowels were analyzed.

Unlike the other languages examined so far, lexical stress in Czech is not manifested by clear prominence markings on the first, stressed syllable: the stressed syllable is neither higher, realized with greater F0 variation, nor longer; nor does it have a higher SPL or higher Spectral Emphasis. There are slight, but insignificant tendencies pointing to a delayed rise, that is, to higher values of some of the acoustic parameters on the second, post-stressed syllable. Since lexical stress does not serve a contrastive function in Czech, the absence of acoustic marking on the stressed syllable is not surprising.

What You See is What You Get Prosodically Less — Visibility Shapes Prosodic Prominence Production in Spontaneous Interaction

Petra Wagner, Nataliya Bryhadyr; Universität Bielefeld, Germany
Wed-P-8-1-11, Time: 16:00–18:00

We investigated the expression of prosodic prominence related to unpredictability and relevance in spontaneous dyadic interactions in which interlocutors could or could not see each other. Interactions between visibility and prominence were analyzed in a verbal version of the game TicTacToe. This setting allows for disentangling different types of information structure: early moves tend to be unpredictable, but are typically irrelevant for the immediate outcome of the game, while late moves tend to be predictable but relevant, as they usually prevent an opponent’s winning move or constitute a winning move by themselves.

Our analyses on German reveal that prominence expression is affected globally by visibility conditions: speech becomes overall softer and faster when interlocutors can see each other. However, speakers differentiate unpredictability and relevance-related accents rather consistently using intensity cues both under visibility and invisibility conditions. We also find that pitch excursions related to prosodic information structure are not affected by visibility. Our findings support effort-optimization models of speech production, but also models that regard speech production as an integrated bimodal process with a high degree of congruency across domains.

Focus Acoustics in Mandarin Nominals

Yu-Yin Hsu, Anqi Xu; Hong Kong Polytechnic University, China
Wed-P-8-1-12, Time: 16:00–18:00

In addition to deciding what to say, interlocutors have to decide how to say it. One of the important tasks of linguists is then to model how differences in acoustic patterns influence the interpretation of a sentence. In light of previous studies on how prosodic structure conveys discourse-level information in a sentence, this study makes use of a speech production experiment to investigate how expressions related to different information packaging, such as information focus, corrective focus, and old information, are prosodically realized within a complex nominal. Special attention was paid to the sequence of “numeral-classifier-noun” in Mandarin, which consists of closely related sub-syntactic units internally, and provides a phonetically controlled environment comparable to previous phonetic studies on focus prominence at the sentential level. The result shows that a multi-dimensional strategy is used in focus-marking, and that focus prosody is sensitive to the size of the focus domain and is observable in various lexical tonal environments in Mandarin.

Exploring Multidimensionality: Acoustic and Articulatory Correlates of Swedish Word Accents

Malin Svensson Lundmark, Gilbert Ambrazaitis, Otto Ewald; Lund University, Sweden
Wed-P-8-1-13, Time: 16:00–18:00

This study investigates acoustic and articulatory correlates of South Swedish word accents (Accent 1 vs. 2) — a tonal distinction traditionally associated with F0 timing. The study is motivated by previous findings on (i) the acoustic complexity of tonal prosody and (ii) tonal-articulatory interplay in other languages.

Acoustic and articulatory (EMA) data from two controlled experiments are reported (14 speakers in total; pilot EMA recordings with 2 speakers). Apart from the well-established F0 timing pattern, results of Experiment 1 reveal a longer duration of a post-stress consonant in Accent 2 than in Accent 1, a higher degree of creaky voice in Accent 1, as well as a deviant (two-peak) pitch pattern in Accent 2 for one of eight discourse conditions used in the experiment. Experiment 2 reveals an effect of word accent on vowel articulation, as the tongue body gesture target is reached earlier in Accent 2. It also suggests slight but (marginally) significant word-accent effects on word-initial gestural coordination, taking slightly different forms in the two speakers, as well as corresponding differences in word-initial formant patterns. Results are discussed concerning their potential perceptual relevance, as well as with reference to the c-center effect discussed within Articulatory Phonology.

The Perception of English Intonation Patterns by German L2 Speakers of English

Karin Puga 1, Robert Fuchs 2, Jane Setter 3, Peggy Mok 4; 1JLU Gießen, Germany; 2Hong Kong Baptist University, China; 3University of Reading, UK; 4Chinese University of Hong Kong, China
Wed-P-8-1-14, Time: 16:00–18:00

Previous research suggests that intonation is a particularly challenging aspect of L2 speech learning. While most research focuses on speech production, we widen the focus and study the perception of intonation by L2 learners. We investigate whether advanced German learners of English have knowledge of the appropriate English intonation patterns in a narrative context with different sentence types (e.g. statements, questions). The results of a tonal pattern selection task indicate that learners (n=20) performed similarly to British English controls (n=25) for some sentence types (e.g. statements, yes/no-questions), but performed significantly worse than the control group in the case of open and closed tag questions and the expression of sarcasm. The results can be explained by the fact that tag questions are the only sentence type investigated that does not exist in the learners’ L1, and sarcasm is not represented syntactically. This suggests that L1 influence can partly account for why some intonation patterns are more challenging than others, and that contextualized knowledge of the intonation patterns of the target language rather than knowledge of intonation patterns in isolation is crucial for the successful L2 learning of intonation.

Wed-P-8-2 : Speaker States and Traits
Poster 2, 16:00–18:00, Wednesday, 23 Aug. 2017
Chair: Emily Provost

The Perception of Emotions in Noisified Nonsense Speech

Emilia Parada-Cabaleiro, Alice Baird, Anton Batliner, Nicholas Cummins, Simone Hantke, Björn Schuller; Universität Passau, Germany
Wed-P-8-2-1, Time: 16:00–18:00

Noise pollution is part of our daily life, affecting millions of people, particularly those living in urban environments. Noise alters our perception and decreases our ability to understand others. Considering this, speech perception in background noise has been extensively studied, showing that especially white noise can damage listener perception. However, the perception of emotions in noisified speech has not been explored with as much depth. In the present study, we use artificial background noise conditions, by applying noise to a subset of the GEMEP corpus (emotions expressed in nonsense speech). Noises were at varying intensities and ‘colours’: white, pink, and brownian. The categorical and dimensional perceptual test was completed by 26 listeners. The results indicate that background noise conditions influence the perception of emotion in speech — pink noise most, brownian least. Worsened perception invokes higher confusion, especially with sadness, an emotion with less pronounced prosodic characteristics. Yet, all this does not lead to a break-down of the ‘cognitive-emotional space’ in a Non-metric MultiDimensional Scaling representation. The gender of speakers and the cultural background of listeners do not seem to play a role.

Attention Networks for Modeling Behaviors in Addiction Counseling

James Gibson 1, Dogan Can 1, Panayiotis Georgiou 1, David C. Atkins 2, Shrikanth S. Narayanan 1; 1University of Southern California, USA; 2University of Washington, USA
Wed-P-8-2-2, Time: 16:00–18:00

In psychotherapy interactions there are several desirable and undesirable behaviors that give insight into the efficacy of the counselor and the progress of the client. It is important to be able to identify when these target behaviors occur and what aspects of the interaction signal their occurrence. Manual observation and annotation of these behaviors is costly and time intensive. In this paper, we use long short-term memory networks equipped with an attention mechanism to process transcripts of addiction counseling sessions and predict prominent counselor and client behaviors. We demonstrate that this approach gives competitive performance while also providing additional interpretability.
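
A minimal sketch of an LSTM encoder with an additive attention layer for utterance-level behavior classification, in the spirit of the approach above; layer sizes, vocabulary and the behavior label set are assumptions, not the authors’ architecture.

```python
# Hypothetical sketch: attention-weighted LSTM classifier over transcript tokens.
import torch
import torch.nn as nn

class AttentionLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_behaviors=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)        # scores each time step
        self.out = nn.Linear(2 * hidden_dim, num_behaviors)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))                      # (batch, time, 2*hidden)
        weights = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # attention over words
        context = torch.bmm(weights.unsqueeze(1), h).squeeze(1)   # weighted sum
        return self.out(context), weights                         # logits + interpretable weights

model = AttentionLSTMClassifier(vocab_size=10000)
logits, attn = model(torch.randint(0, 10000, (2, 30)))
```

The returned attention weights are what gives the kind of interpretability the abstract mentions: they indicate which words contributed most to each prediction.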

Computational Analysis of Acoustic Descriptors in Psychotic Patients

Torsten Wörtwein 1, Tadas Baltrušaitis 2, Eugene Laksana 3, Luciana Pennant 4, Elizabeth S. Liebson 4, Dost Öngür 4, Justin T. Baker 4, Louis-Philippe Morency 2; 1KIT, Germany; 2Carnegie Mellon University, USA; 3University of Southern California, USA; 4McLean Hospital, USA
Wed-P-8-2-3, Time: 16:00–18:00

Various forms of psychotic disorders, including schizophrenia, can influence how we speak. Therefore, clinicians assess speech and language behaviors of their patients. While it is difficult for humans to quantify speech behaviors precisely, acoustic descriptors, such as tenseness of voice and speech rate, can be quantified automatically. In this work, we identify previously unstudied acoustic descriptors related to the severity of psychotic symptoms within a clinical population (N=29). Our dataset consists of semi-structured interviews between patients and clinicians. Psychotic disorders are often characterized by two groups of symptoms: negative and positive. While negative symptoms are also prevalent in disorders such as depression, positive symptoms in psychotic disorders have rarely been studied from an acoustic and computational perspective. Our experiments show relationships between psychotic symptoms and acoustic descriptors related to voice quality consistency, variation of speech rate and volume, vowel space, and a parameter of glottal flow. Further, we show that certain acoustic descriptors can track a patient’s state from admission to discharge. Finally, we demonstrate that measures from the Brief Psychiatric Rating Scale (BPRS) can be estimated with acoustic descriptors.


Modeling Perceivers’ Neural-Responses Using Lobe-Dependent Convolutional Neural Network to Improve Speech Emotion Recognition

Ya-Tse Wu 1, Hsuan-Yu Chen 1, Yu-Hsien Liao 1, Li-Wei Kuo 2, Chi-Chun Lee 1; 1National Tsing Hua University, Taiwan; 2National Health Research Institute, Taiwan
Wed-P-8-2-4, Time: 16:00–18:00

Developing automatic emotion recognition by modeling expressive behaviors is becoming crucial in enabling the next generation design of human-machine interfaces. Also, with the availability of functional magnetic resonance imaging (fMRI), researchers have conducted studies into quantitative understanding of the vocal emotion perception mechanism. In this work, our aim is two-fold: 1) investigating whether the neural responses can be used to automatically decode the emotion labels of vocal stimuli, and 2) combining acoustic and fMRI features to improve speech emotion recognition accuracies. We introduce a novel framework of lobe-dependent convolutional neural network (LD-CNN) to provide better modeling of perceivers’ neural responses to vocal emotion. Furthermore, by fusing LD-CNN with acoustic features, we demonstrate an overall 63.17% accuracy in a four-class emotion recognition task (9.89% and 14.42% relative improvement compared to the acoustic-only and the fMRI-only features). Our analysis further shows that the temporal lobe possesses the most information in decoding emotion labels; the fMRI and the acoustic information are complementary to each other, where neural responses and acoustic features are better at discriminating along the valence and activation dimensions, respectively.

Implementing Gender-Dependent Vowel-Level Analysis for Boosting Speech-Based Depression Recognition

Bogdan Vlasenko, Hesam Sagha, Nicholas Cummins, Björn Schuller; Universität Passau, Germany
Wed-P-8-2-5, Time: 16:00–18:00

Whilst studies on emotion recognition show that gender-dependent analysis can improve emotion classification performance, the potential differences in the manifestation of depression between male and female speech have yet to be fully explored. This paper presents a qualitative analysis of phonetically aligned acoustic features to highlight differences in the manifestation of depression. Gender-dependent analysis with phonetically aligned gender-dependent features is used for speech-based depression recognition. The presented experimental study reveals gender differences in the effect of depression on vowel-level features. Considering the experimental study, we also show that a small set of knowledge-driven gender-dependent vowel-level features can outperform state-of-the-art turn-level acoustic features when performing a binary depressed speech recognition task. A combination of these preselected gender-dependent vowel-level features with turn-level standardised openSMILE features results in additional improvement for depression recognition.

Bilingual Word Embeddings for Cross-Lingual Personality Recognition Using Convolutional Neural Nets

Farhad Bin Siddique, Pascale Fung; HKUST, China
Wed-P-8-2-6, Time: 16:00–18:00

We propose a multilingual personality classifier that uses text data from social media and Youtube Vlog transcriptions, and maps them into Big Five personality traits using a Convolutional Neural Network (CNN). We first train unsupervised bilingual word embeddings from an English-Chinese parallel corpus, and use these trained word representations as input to our CNN. This enables our model to yield relatively high cross-lingual and multilingual performance on Chinese texts, after training on the English dataset for example. We also train monolingual Chinese embeddings from a large Chinese text corpus and then train our CNN model on a Chinese dataset consisting of conversational dialogue labeled with personality. We achieve an average F-score of 66.1 in our multilingual task compared to 63.3 F-score in cross-lingual, and 63.2 F-score in the monolingual performance.
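
A minimal sketch of a 1-D CNN over word embeddings for trait prediction, illustrating the kind of classifier described above; the embedding dimensionality, filter sizes and trait count are assumptions, and the bilingual embedding lookup is assumed to happen outside the model.

```python
# Hypothetical sketch: 1-D convolutional text classifier over (pretrained) embeddings.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, embed_dim=300, n_filters=100, kernel_sizes=(3, 4, 5), n_traits=5):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, k) for k in kernel_sizes]
        )
        self.out = nn.Linear(n_filters * len(kernel_sizes), n_traits)

    def forward(self, embeddings):                 # (batch, time, embed_dim)
        x = embeddings.transpose(1, 2)             # Conv1d expects (batch, channels, time)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.out(torch.cat(pooled, dim=1))  # one score per Big Five trait

scores = TextCNN()(torch.randn(8, 50, 300))        # 8 texts, 50 tokens each
```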

Emotion Category Mapping to Emotional Space by Cross-Corpus Emotion Labeling

Yoshiko Arimoto 1, Hiroki Mori 2; 1RIKEN, Japan; 2Utsunomiya University, Japan
Wed-P-8-2-7, Time: 16:00–18:00

The psychological classification of emotion has two main approaches. One is emotion category, in which emotions are classified into discrete and fundamental groups; the other is emotion dimension, in which emotions are characterized by multiple continuous scales. The cognitive classification of emotion perceived by humans from speech is not sufficiently established. Although there have been several studies on such classification, they did not discuss it deeply. Moreover, the relationship between emotion category and emotion dimension perceived from speech is not well studied. Aiming to establish common emotion labels for emotional speech, this study elucidated the relationship between the emotion category and the emotion dimension perceived from speech by conducting an experiment of cross-corpus emotion labeling with two different Japanese dialogue corpora (Online Gaming Voice Chat Corpus with Emotional Label (OGVC) and Utsunomiya University Spoken Dialogue Database for Paralinguistic Information Studies (UUDB)). A likelihood ratio test was conducted to assess the independence of one emotion category from the others in three-dimensional emotional space. This experiment revealed that many emotion categories exhibited independence from the other emotion categories. Only the neutral states did not exhibit independence from the three emotions of sadness, disgust, and surprise.

Big Five vs. Prosodic Features as Cues to Detect Abnormality in SSPNET-Personality Corpus

Cedric Fayet, Arnaud Delhay, Damien Lolive, Pierre-François Marteau; IRISA, France
Wed-P-8-2-8, Time: 16:00–18:00

This paper presents an attempt to evaluate three different sets of features extracted from prosodic descriptors and Big Five traits for building an anomaly detector. The Big Five model enables capturing personality information. Big Five traits are extracted from a manual annotation, while prosodic features are extracted directly from the speech signal. Two different anomaly detection methods are evaluated: Gaussian Mixture Model (GMM) and One-Class SVM (OC-SVM), each one combined with a threshold classification to decide the “normality” of a sample. The different combinations of models and feature sets are evaluated on the SSPNET-Personality corpus, which has already been used in several experiments, including a previous work on separating two types of personality profiles in a supervised way. In this work, we propose the above mentioned unsupervised or semi-supervised methods, and discuss their performance, to detect particular audio clips produced by a speaker with an abnormal personality. Results show that using automatically extracted prosodic features competes with the Big Five traits. The overall detection performance achieved by the best model is around 0.8 (F1-measure).
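
A minimal sketch of the two detector families compared above, scoring "normality" of feature vectors with a GMM likelihood threshold and a one-class SVM; the random data, feature dimensionality and thresholds are placeholders only.

```python
# Hypothetical sketch: GMM- and OC-SVM-based anomaly detection over prosodic features.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
train_normal = rng.normal(0, 1.0, size=(300, 6))    # prosodic features of "normal" clips
test_clips = rng.normal(0, 1.5, size=(20, 6))

gmm = GaussianMixture(n_components=4, random_state=0).fit(train_normal)
threshold = np.quantile(gmm.score_samples(train_normal), 0.05)
gmm_flags = gmm.score_samples(test_clips) < threshold   # low likelihood = anomaly

ocsvm = OneClassSVM(kernel="rbf", nu=0.05).fit(train_normal)
svm_flags = ocsvm.predict(test_clips) == -1             # -1 marks outliers

print("GMM anomalies:", gmm_flags.sum(), "OC-SVM anomalies:", svm_flags.sum())
```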


Speech Rate Comparison When Talking to a System and Talking to a Human: A Study from a Speech-to-Speech, Machine Translation Mediated Map Task

Hayakawa Akira 1, Carl Vogel 1, Saturnino Luz 2, Nick Campbell 1; 1Trinity College Dublin, Ireland; 2University of Edinburgh, UK
Wed-P-8-2-9, Time: 16:00–18:00

This study focuses on the adaptation of subjects in Human-to-Human (H2H) communication in spontaneous dialogues in two different settings. The speech rate of sixteen dialogues from the HCRC Map Task corpus has been analyzed as direct H2H communication, while fifteen dialogues from the ILMT-s2s corpus have been analyzed as a Speech-to-Speech Machine Translation (S2S-MT) mediated H2H communication comparison. The analysis shows that while the mean speech rates of the subjects in the two task-oriented corpora differ, in both corpora the role of the subject causes a significant difference in the speech rate, with the Information Giver using a slower speech rate than the Information Follower. Also, the different settings of the dialogue recordings (with or without eye contact in the HCRC corpus and with or without live video streaming in the ILMT-s2s corpus) only show a negligible difference in the speech rate. However, the gender of the subjects has provided an interesting difference, with the female subjects of the ILMT-s2s corpus using a slower speech rate than the male subjects, while gender does not show any difference in the HCRC corpus. This indicates that the difference is not from performing the map task, but a result of their adaptation strategy to the S2S-MT system.

Approaching Human Performance in Behavior Estimation in Couples Therapy Using Deep Sentence Embeddings

Shao-Yen Tseng 1, Brian Baucom 2, Panayiotis Georgiou 1; 1University of Southern California, USA; 2University of Utah, USA
Wed-P-8-2-10, Time: 16:00–18:00

Identifying complex behavior in human interactions for observational studies often involves the tedious process of transcribing and annotating large amounts of data. While there is significant work towards accurate transcription in Automatic Speech Recognition, automatic Natural Language Understanding of high-level human behaviors from the transcribed text is still at an early stage of development. In this paper we present a novel approach for modeling human behavior using sentence embeddings and propose an automatic behavior annotation framework. We explore unsupervised methods of extracting semantic information, using seq2seq models, into deep sentence embeddings and demonstrate that these embeddings capture behaviorally meaningful information. Our proposed framework utilizes LSTM Recurrent Neural Networks to estimate behavior trajectories from these sentence embeddings. Finally, we employ fusion to compare our high-resolution behavioral trajectories with the coarse, session-level behavioral ratings of human annotators in Couples Therapy. Our experiments show that behavior annotation using this framework achieves better results than prior methods and approaches or exceeds human performance in terms of annotator agreement.

Complexity in Speech and its Relation to Emotional Bond in Therapist-Patient Interactions During Suicide Risk Assessment Interviews

Md. Nasir 1, Brian Baucom 2, Craig J. Bryan 2, Shrikanth S. Narayanan 1, Panayiotis Georgiou 1; 1University of Southern California, USA; 2University of Utah, USA
Wed-P-8-2-11, Time: 16:00–18:00

In this paper, we analyze a 53-hour speech corpus of interactions of soldiers who had recently attempted suicide or had strong suicidal ideation conversing with their therapists. In particular, we study the complexity in therapist-patient speech as a marker of their emotional bond. Emotional bond is the extent to which the patient feels understood by and connected to the therapist. First, we extract speech features from audio recordings of their interactions. Then, we consider the nonlinear time series representation of those features and compute complexity measures based on the Lyapunov coefficient and correlation dimension. For the majority of the subjects, we observe that speech complexity in therapist-patient pairs is higher for the interview sessions, when compared to that of the rest of their interactions (intervention and post-interview follow-up). This indicates that entrainment (adapting to each other’s speech) between the patient and the therapist is lower during the interview than regular interactions. This observation is consistent with prior studies in clinical psychology, considering that assessment interviews typically involve the therapist asking routine questions to enquire about the patient’s suicidal thoughts and feelings. In addition, we find that complexity is negatively correlated with the patient’s perceived emotional bond with the therapist.

An Investigation of Emotion Dynamics and Kalman Filtering for Speech-Based Emotion Prediction

Zhaocheng Huang, Julien Epps; University of New South Wales, Australia
Wed-P-8-2-12, Time: 16:00–18:00

Despite recent interest in continuous prediction of dimensional emotions, the dynamical aspect of emotions has received less attention in automated systems. This paper investigates how emotion change can be effectively incorporated to improve continuous prediction of arousal and valence from speech. Significant correlations were found between emotion ratings and their dynamics during investigations on the RECOLA database, and here we examine how to best exploit them using a Kalman filter. In particular, we investigate the correlation between predicted arousal and valence dynamics with arousal and valence ground truth; the Kalman filter internal delay for estimating the state transition matrix; the use of emotion dynamics as a measurement input to a Kalman filter; and how multiple probabilistic Kalman filter outputs can be effectively fused. Evaluation results show that correct dynamics estimation and internal delay settings allow up to 5% and 58% relative improvement in arousal and valence prediction respectively over existing Kalman filter implementations. Fusion based on probabilistic Kalman filter outputs yields further gains.
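
To make the filtering idea concrete, here is a minimal scalar Kalman filter sketch in which the emotion value and its change form the state; the constant-velocity transition model and all noise constants are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical sketch: Kalman filtering of a noisy arousal trajectory with its dynamics.
import numpy as np

def kalman_smooth(measurements, q=1e-4, r=1e-2):
    A = np.array([[1.0, 1.0], [0.0, 1.0]])   # state = [arousal, arousal_delta]
    H = np.array([[1.0, 0.0]])               # only the arousal value is observed
    Q, R = q * np.eye(2), np.array([[r]])
    x, P = np.zeros((2, 1)), np.eye(2)
    out = []
    for z in measurements:
        x, P = A @ x, A @ P @ A.T + Q                        # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)         # Kalman gain
        x = x + K @ (np.array([[z]]) - H @ x)                # update
        P = (np.eye(2) - K @ H) @ P
        out.append(float(x[0]))
    return np.array(out)

noisy = np.sin(np.linspace(0, 6, 200)) + 0.3 * np.random.default_rng(0).normal(size=200)
smoothed = kalman_smooth(noisy)
```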


Wed-P-8-3 : Language Understanding and Generation
Poster 3, 16:00–18:00, Wednesday, 23 Aug. 2017
Chairs: Jose David Lopes, Heriberto Cuayahuitl

Zero-Shot Learning for Natural Language Understanding Using Domain-Independent Sequential Structure and Question Types

Kugatsu Sadamitsu, Yukinori Homma, Ryuichiro Higashinaka, Yoshihiro Matsuo; NTT, Japan
Wed-P-8-3-1, Time: 16:00–18:00

Natural language understanding (NLU) is an important module of spoken dialogue systems. One of the difficulties when it comes to adapting NLU to new domains is the high cost of constructing new training data for each domain. To reduce this cost, we propose a zero-shot learning of NLU that takes into account the sequential structures of sentences together with general question types across different domains. Experimental results show that our methods achieve higher accuracy than baseline methods in two completely different domains (insurance and sightseeing).

Parallel Hierarchical Attention Networks with Shared Memory Reader for Multi-Stream Conversational Document Classification

Naoki Sawada 1, Ryo Masumura 1, Hiromitsu Nishizaki 2; 1NTT, Japan; 2University of Yamanashi, Japan
Wed-P-8-3-2, Time: 16:00–18:00

This paper describes a novel classification method for multi-stream conversational documents. Documents of contact center dialogues or meetings are often composed of multiple source documents that are transcriptions of the recordings of each speaker’s channel. To enhance the classification performance of such multi-stream conversational documents, three main advances over the previous method are introduced. The first is a parallel hierarchical attention network (PHAN) for multi-stream conversational document modeling. PHAN can precisely capture word and sentence structures of individual source documents and efficiently integrate them. The second is a shared memory reader that can yield a shared attention mechanism. The shared memory reader highlights common important information in a conversation. Our experiments on a call category classification in contact center dialogues show that PHAN together with the shared memory reader outperforms the single document modeling method and previous multi-stream document modeling method.

Internal Memory Gate for Recurrent Neural Networks with Application to Spoken Language Understanding

Mohamed Morchid; LIA (EA 4128), France
Wed-P-8-3-3, Time: 16:00–18:00

Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNN) require 4 gates to learn short- and long-term dependencies for a given sequence of basic elements. Recently, the “Gated Recurrent Unit” (GRU) has been introduced; it requires fewer gates than LSTM (reset and update gates) to code short- and long-term dependencies and reaches performance equivalent to LSTM, with less processing time during learning. The “Leaky integration Unit” (LU) is a GRU with a single gate (update) that codes mostly long-term dependencies quicker than LSTM or GRU (small number of operations for learning). This paper proposes a novel RNN that takes advantage of LSTM, GRU (short- and long-term dependencies) and the LU (fast learning), called the “Internal Memory Gate” (IMG). The effectiveness and the robustness of the proposed IMG-RNN are evaluated during a classification task of a small corpus of spoken dialogues from the DECODA project that allows us to evaluate the capability of each RNN to code short-term dependencies. The experiments show that IMG-RNNs reach better accuracies with a gain of 0.4 points compared to LSTM- and GRU-RNNs and 0.7 points compared to the LU-RNN. Moreover, the IMG-RNN requires less processing time than GRU or LSTM, with gains of 19% and 50% respectively.

Character-Based Embedding Models and Reranking Strategies for Understanding Natural Language Meal Descriptions

Mandy Korpusik, Zachary Collins, James Glass; MIT, USA
Wed-P-8-3-4, Time: 16:00–18:00

Character-based embedding models provide robustness for handling misspellings and typos in natural language. In this paper, we explore convolutional neural network based embedding models for handling out-of-vocabulary words in a meal description food ranking task. We demonstrate that character-based models combined with a standard word-based model improve the top-5 recall of USDA database food items from 26.3% to 30.3% on a test set of all USDA foods with typos simulated in 10% of the data. We also propose a new reranking strategy for predicting the top USDA food matches given a meal description, which significantly outperforms our prior method of n-best decoding with a finite state transducer, improving the top-5 recall on the all USDA foods task from 20.7% to 63.8%.

Quaternion Denoising Encoder-Decoder for Theme Identification of Telephone Conversations

Titouan Parcollet, Mohamed Morchid, Georges Linarès; LIA (EA 4128), France
Wed-P-8-3-5, Time: 16:00–18:00

In the last decades, encoder-decoders or autoencoders (AE) have received great interest from researchers due to their capability to construct robust representations of documents in a low-dimensional subspace. Nonetheless, autoencoders reveal little in the way of spoken document internal structure by only considering words or topics contained in the document as isolated basic elements, and tend to overfit with small corpora of documents. Therefore, Quaternion Multi-layer Perceptrons (QMLP) have been introduced to capture such internal latent dependencies, whereas denoising autoencoders (DAE) are composed with different stochastic noises to better process small sets of documents. This paper presents a novel autoencoder based on both the hitherto-proposed DAE (to manage small corpora) and the QMLP (to consider internal latent structures) called “Quaternion denoising encoder-decoder” (QDAE). Moreover, the paper defines an original angular Gaussian noise adapted to the specificity of hyper-complex algebra. The experiments, conducted on a theme identification task of spoken dialogues from the DECODA framework, show that the QDAE obtains promising gains of 3% and 1.5% compared to the standard real-valued denoising autoencoder and the QMLP, respectively.

ASR Error Management for Improving Spoken Language Understanding

Edwin Simonnet 1, Sahar Ghannay 1, Nathalie Camelin 1, Yannick Estève 1, Renato De Mori 2; 1LIUM (EA 4023), France; 2LIA (EA 4128), France
Wed-P-8-3-6, Time: 16:00–18:00

This paper addresses the problem of automatic speech recognition (ASR) error detection and its use for improving spoken language understanding (SLU) systems. In this study, the SLU task consists in automatically extracting, from ASR transcriptions, semantic concepts and concept/value pairs in, e.g., a touristic information system. An approach is proposed for enriching the set of semantic labels with error-specific labels and for using a recently proposed neural approach based on word embeddings to compute well-calibrated ASR confidence measures. Experimental results are reported showing that it is possible to decrease significantly the Concept/Value Error Rate with a state-of-the-art system, outperforming previously published results on the same experimental data. It is also shown that, by combining an SLU approach based on conditional random fields with a neural encoder/decoder attention-based architecture, it is possible to effectively identify confidence islands and uncertain semantic output segments, useful for deciding appropriate error handling actions by the dialogue manager strategy.

Jointly Trained Sequential Labeling and Classification by Sparse Attention Neural Networks

Mingbo Ma 1, Kai Zhao 1, Liang Huang 1, Bing Xiang 2, Bowen Zhou 2; 1Oregon State University, USA; 2IBM, USA
Wed-P-8-3-7, Time: 16:00–18:00

Sentence-level classification and sequential labeling are two fundamental tasks in language understanding. While these two tasks are usually modeled separately, in reality they are often correlated, for example in intent classification and slot filling, or in topic classification and named-entity recognition. In order to utilize the potential benefits from their correlations, we propose a jointly trained model for learning the two tasks simultaneously via Long Short-Term Memory (LSTM) networks. This model predicts the sentence-level category and the word-level label sequence from the stepwise output hidden representations of LSTM. We also introduce a novel mechanism of “sparse attention” to weigh words differently based on their semantic relevance to sentence-level classification. The proposed method outperforms baseline models on ATIS and TREC datasets.
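
A minimal sketch of joint sentence classification and word labeling over a shared LSTM encoder, illustrating the idea of sharing hidden representations between the two tasks; it uses simple mean pooling rather than the paper's sparse attention, and all sizes and label counts are assumptions.

```python
# Hypothetical sketch: joint intent classification and slot labeling on one encoder.
import torch
import torch.nn as nn

class JointLSTM(nn.Module):
    def __init__(self, vocab=5000, embed=100, hidden=128, n_intents=10, n_slots=20):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True, bidirectional=True)
        self.slot_head = nn.Linear(2 * hidden, n_slots)      # one label per word
        self.intent_head = nn.Linear(2 * hidden, n_intents)  # one label per sentence

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))             # (batch, time, 2*hidden)
        slot_logits = self.slot_head(h)
        intent_logits = self.intent_head(h.mean(dim=1))  # mean pooling stands in for attention
        return intent_logits, slot_logits

intent, slots = JointLSTM()(torch.randint(0, 5000, (4, 12)))
# Training would sum a sentence-level and a word-level cross-entropy loss.
```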

To Plan or not to Plan? Discourse Planning in Slot-Value Informed Sequence to Sequence Models for Language Generation

Neha Nayak, Dilek Hakkani-Tür, Marilyn Walker, Larry Heck; Google, USA
Wed-P-8-3-8, Time: 16:00–18:00

Natural language generation for task-oriented dialogue systems aims to effectively realize system dialogue actions. All natural language generators (NLGs) must realize grammatical, natural and appropriate output, but in addition, generators for task-oriented dialogue must faithfully perform a specific dialogue act that conveys specific semantic information, as dictated by the dialogue policy of the system dialogue manager. Most previous work on deep learning methods for task-oriented NLG assumes that the generation output can be an utterance skeleton. Utterances are delexicalized, with variable names for slots, which are then replaced with actual values as part of post-processing. However, the values of slots do, in fact, influence the lexical selection in the surrounding context as well as the overall sentence plan. To model this effect, we investigate sequence-to-sequence (seq2seq) models in which slot values are included as part of the input sequence and the output surface form. Furthermore, we study whether a separate sentence planning module that decides on the grouping of slot value mentions as input to the seq2seq model results in more natural sentences than a seq2seq model that aims to jointly learn the plan and the surface realization.

Online Adaptation of an Attention-Based Neural Network for Natural Language Generation

Matthieu Riou, Bassam Jabaian, Stéphane Huet, Fabrice Lefèvre; LIA (EA 4128), France
Wed-P-8-3-9, Time: 16:00–18:00

Following some recent propositions to handle natural language generation in spoken dialog systems with long short-term memory recurrent neural network models [1], we first investigate a variant thereof with the objective of a better integration of the attention subnetwork. Then our main objective is to propose and evaluate a framework to adapt the NLG module online through direct interactions with the users. When doing so, the basic way is to ask the user to utter an alternative sentence to express a particular dialog act. But then the system has to decide between using an automatic transcription or asking for a manual transcription. To do so, a reinforcement learning approach based on an adversarial bandit scheme is retained. We show that by defining appropriately the rewards as a linear combination of expected payoffs and costs of acquiring the new data provided by the user, a system design can balance between improving the system’s performance towards a better match with the user’s preferences and the burden associated with it.
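
To illustrate what an adversarial bandit scheme looks like in this decision setting, here is a minimal EXP3-style sketch for choosing between two actions (trust the ASR transcript vs. ask the user to type). The reward here is a random placeholder; the paper instead combines expected payoff and data-acquisition cost.

```python
# Hypothetical sketch: EXP3 adversarial-bandit update over two adaptation actions.
import numpy as np

def exp3(n_rounds=1000, n_arms=2, gamma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.ones(n_arms)
    for _ in range(n_rounds):
        probs = (1 - gamma) * weights / weights.sum() + gamma / n_arms
        arm = rng.choice(n_arms, p=probs)
        reward = rng.uniform()                  # placeholder reward in [0, 1]
        estimated = reward / probs[arm]         # importance-weighted reward estimate
        weights[arm] *= np.exp(gamma * estimated / n_arms)
    return weights / weights.sum()

print(exp3())  # final preference over {automatic transcription, manual transcription}
```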

Spanish Sign Language Recognition with Different Topology Hidden Markov Models

Carlos-D. Martínez-Hinarejos 1, Zuzanna Parcheta 2; 1Universidad Politécnica de Valencia, Spain; 2Sciling, Spain
Wed-P-8-3-10, Time: 16:00–18:00

Natural language recognition techniques can be applied not only to speech signals, but to other signals that represent natural language units (e.g., words and sentences). This is the case of sign language recognition, which is usually employed by deaf people to communicate. The use of recognition techniques may allow users of this language to communicate more independently with non-signing users. Several works have been done for different variants of sign languages, but in most cases their vocabulary is quite limited and they only recognise gestures corresponding to isolated words. In this work, we propose gesture recognisers which make use of typical Continuous Density Hidden Markov Models. They solve not only the isolated word problem, but also the recognition of basic sentences using the Spanish Sign Language, with a higher vocabulary than in other approximations. Different topologies and Gaussian mixtures are studied. Results show that our proposal provides promising results that are a first step towards general automatic recognition of Spanish Sign Language.
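
A minimal sketch of isolated-gesture classification with one continuous-density Gaussian HMM per class and maximum log-likelihood decision, using the hmmlearn package; the feature streams and gesture labels below are random placeholders, and the paper's specific topologies are not reproduced.

```python
# Hypothetical sketch: one Gaussian HMM per gesture, classify by highest log-likelihood.
import numpy as np
from hmmlearn import hmm

def train_models(sequences_by_class, n_states=5):
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)                      # stack frames of all training examples
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        models[label] = m.fit(X, lengths)
        # Different topologies (e.g. left-to-right) could be imposed via transition priors.
    return models

def classify(models, sequence):
    return max(models, key=lambda label: models[label].score(sequence))

rng = np.random.default_rng(0)
data = {g: [rng.normal(size=(30, 10)) for _ in range(5)] for g in ["hola", "gracias"]}
models = train_models(data)
print(classify(models, rng.normal(size=(30, 10))))
```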

OpenMM: An Open-Source Multimodal Feature Extraction Tool

Michelle Renee Morales 1, Stefan Scherer 2, Rivka Levitan 3; 1CUNY Graduate Center, USA; 2University of Southern California, USA; 3CUNY Brooklyn College, USA
Wed-P-8-3-11, Time: 16:00–18:00

The primary use of speech is in face-to-face interactions, and situational context and human behavior therefore intrinsically shape and affect communication. In order to usefully model situational awareness, machines must have access to the same streams of information humans have access to. In other words, we need to provide machines with features that represent each communicative modality: face and gesture, voice and speech, and language. This paper presents OpenMM: an open-source multimodal feature extraction tool. We build upon existing open-source repositories to present the first publicly available tool for multimodal feature extraction. The tool provides a pipeline for researchers to easily extract visual and acoustic features. In addition, the tool also performs automatic speech recognition (ASR) and then uses the transcripts to extract linguistic features. We evaluate the OpenMM’s multimodal feature set on deception, depression and sentiment classification tasks and show its performance is very promising. This tool provides researchers with a simple way of extracting multimodal features and consequently a richer and more robust feature representation for machine learning tasks.

Speaker Dependency Analysis, Audiovisual Fusion Cues and a Multimodal BLSTM for Conversational Engagement Recognition

Yuyun Huang, Emer Gilmartin, Nick Campbell; Trinity College Dublin, Ireland
Wed-P-8-3-12, Time: 16:00–18:00

Conversational engagement is a multimodal phenomenon and an essential cue to assess both human-human and human-robot communication. Speaker-dependent and speaker-independent scenarios were addressed in our engagement study. Handcrafted audio-visual features were used. Fixed window sizes for the feature fusion method were analysed. Novel dynamic window size selection and multimodal bi-directional long short-term memory (Multimodal BLSTM) approaches were proposed and evaluated for engagement level recognition.

Wed-P-8-4 : Voice Conversion 2
Poster 4, 16:00–18:00, Wednesday, 23 Aug. 2017
Chair: Chandra Sekhar Seelamantula

Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks

Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, Hsin-Min Wang; Academia Sinica, Taiwan
Wed-P-8-4-1, Time: 16:00–18:00

Building a voice conversion (VC) system from non-parallel speech corpora is challenging but highly valuable in real application scenarios. In most situations, the source and the target speakers do not repeat the same texts or they may even speak different languages. In this case, one possible, although indirect, solution is to build a generative model for speech. Generative models focus on explaining the observations with latent variables instead of learning a pairwise transformation function, thereby bypassing the requirement of speech frame alignment. In this paper, we propose a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) that explicitly considers a VC objective when building the speech model. Experimental results corroborate the capability of our framework for building a VC system from unaligned data, and demonstrate improved conversion quality.

CAB: An Energy-Based Speaker Clustering Model for Rapid Adaptation in Non-Parallel Voice Conversion

Toru Nakashika; University of Electro-Communications, Japan
Wed-P-8-4-2, Time: 16:00–18:00

In this paper, a new energy-based probabilistic model, called CAB (Cluster Adaptive restricted Boltzmann machine), is proposed for voice conversion (VC) that does not require parallel data during training and requires only a small amount of speech data during adaptation. Most of the existing VC methods require parallel data for training. Recently, VC methods that do not require parallel data (called non-parallel VCs) have also been proposed and are attracting much attention because they do not require prepared or recorded parallel speech data, unlike conventional approaches. The proposed CAB model is aimed at statistical non-parallel VC based on cluster adaptive training (CAT). This extends the VC method used in our previous model, ARBM (adaptive restricted Boltzmann machine). The ARBM approach assumes that any speech signal can be decomposed into speaker-invariant phonetic information and speaker-identity information using the ARBM adaptation matrices of each speaker. VC is achieved by switching the source speaker’s identity into that of the target speaker while retaining the phonetic information obtained by decomposition of the source speaker’s speech. In contrast, in CAB speaker identities are represented as cluster vectors that determine the adaptation matrices. As the number of clusters is generally smaller than the number of speakers, the number of model parameters can be reduced compared to ARBM, which enables rapid adaptation to a new speaker. Our experimental results show that the proposed method performed better than the ARBM approach, particularly in adaptation.

Phoneme-Discriminative Features for Dysarthric Speech Conversion

Ryo Aihara 1, Tetsuya Takiguchi 2, Yasuo Ariki 2; 1Mitsubishi Electric, Japan; 2Kobe University, Japan
Wed-P-8-4-3, Time: 16:00–18:00

We present in this paper a Voice Conversion (VC) method for a person with dysarthria resulting from athetoid cerebral palsy. VC is being widely researched in the field of speech processing because of increased interest in using such processing in applications such as personalized Text-To-Speech systems. A Gaussian Mixture Model (GMM)-based VC method has been widely researched, and Partial Least Squares (PLS)-based VC has been proposed to prevent the over-fitting problems associated with the GMM-based VC method. In this paper, we present phoneme-discriminative features, which are associated with PLS-based VC. Conventional VC methods do not consider the phonetic structure of spectral features, although phonetic structures are important for speech analysis. Especially for dysarthric speech, phonetic structures are difficult to discriminate, and discriminative learning will improve the conversion accuracy. This paper employs discriminative manifold learning. Spectral features are projected into a subspace in which nearby points with the same phoneme label are brought closer together and nearby points with different phoneme labels are pushed apart. Our proposed method was evaluated on a dysarthric speaker conversion task which converts dysarthric voice into non-dysarthric speech.

Denoising Recurrent Neural Network for Deep Bidirectional LSTM Based Voice Conversion

Jie Wu 1, D.-Y. Huang 2, Lei Xie 1, Haizhou Li 2; 1Northwestern Polytechnical University, China; 2A*STAR, Singapore
Wed-P-8-4-4, Time: 16:00–18:00

This paper studies the post-processing in deep bidirectional Long Short-Term Memory (DBLSTM) based voice conversion, where the statistical parameters are optimized to generate speech that exhibits similar properties to target speech. However, there always exists residual error between converted speech and the target one. We reformulate the residual error problem as speech restoration, which aims to recover the target speech samples from the converted ones. Specifically, we propose a denoising recurrent neural network (DeRNN) by introducing regularization during training to shape the distribution of the converted data in latent space. We compare the proposed approach with global variance (GV), modulation spectrum (MS) and recurrent neural network (RNN) based postfilters, which serve a similar purpose. The subjective test results show that the proposed approach significantly outperforms these conventional approaches in terms of quality and similarity.

Speaker Dependent Approach for Enhancing a Glossectomy Patient’s Speech via GMM-Based Voice Conversion

Kei Tanaka, Sunao Hara, Masanobu Abe, Masaaki Sato, Shogo Minagi; Okayama University, Japan
Wed-P-8-4-5, Time: 16:00–18:00

In this paper, using a GMM-based voice conversion algorithm, we propose to generate speaker-dependent mapping functions to improve the intelligibility of speech uttered by patients with a wide glossectomy. The speaker-dependent approach makes it possible to generate mapping functions that reconstruct missing spectrum features of speech uttered by a patient without the influence of speaker factors. The proposed idea is simple, i.e., to collect speech uttered by a patient before and after the glossectomy, but in practice it is hard to ask patients to utter speech just for developing algorithms. To confirm the performance of the proposed approach, in this paper, in order to simulate glossectomy patients, we fabricated an intraoral appliance which covers the lower dental arch and tongue surface to restrain tongue movements. In terms of the Mel-frequency cepstrum (MFC) distance, by applying the voice conversion, the distances were reduced by 25% and 42% for the speaker-dependent case and speaker-independent case, respectively. In terms of phoneme intelligibility, dictation tests revealed that speech reconstructed by the speaker-dependent approach almost always showed better performance than the original speech uttered by simulated patients, while the speaker-independent approach did not.
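
For readers unfamiliar with the objective measure mentioned above, here is a minimal sketch of a frame-wise mel-cepstral distance between converted and reference speech; the 10/ln(10)*sqrt(2) scaling follows the common mel-cepstral-distortion convention in dB, and the arrays are random placeholders rather than the paper's data.

```python
# Hypothetical sketch: average mel-cepstral distance between two aligned utterances.
import numpy as np

def mel_cepstral_distance(mcep_a, mcep_b):
    """mcep_*: (frames, order) mel-cepstra with the 0th (energy) coefficient removed."""
    diff = mcep_a - mcep_b
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()   # average distortion in dB over the utterance

rng = np.random.default_rng(0)
converted = rng.normal(size=(200, 24))
target = converted + 0.1 * rng.normal(size=(200, 24))
print(f"MCD: {mel_cepstral_distance(converted, target):.2f} dB")
```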

Generative Adversarial Network-Based Postfilter for STFT Spectrograms

Takuhiro Kaneko 1, Shinji Takaki 2, Hirokazu Kameoka 1, Junichi Yamagishi 2; 1NTT, Japan; 2NII, Japan
Wed-P-8-4-6, Time: 16:00–18:00

We propose a learning-based postfilter to reconstruct the high-fidelity spectral texture in short-term Fourier transform (STFT) spectrograms. In speech-processing systems, such as speech synthesis, conversion, enhancement, separation, and coding, STFT spectrograms have been widely used as key acoustic representations. In these tasks, we normally need to precisely generate or predict the representations from inputs; however, generated spectra typically lack the fine structures that are close to those of the true data. To overcome these limitations and reconstruct spectra having finer structures, we propose a generative adversarial network (GAN)-based postfilter that is implicitly optimized to match the true feature distribution in adversarial learning. The challenge with this postfilter is that a GAN cannot be easily trained for very high-dimensional data such as STFT spectra. We take a simple divide-and-concatenate strategy. Namely, we first divide the spectrograms into multiple frequency bands with overlap, reconstruct the individual bands using the GAN-based postfilter trained for each band, and finally connect the bands with overlap. We tested our proposed postfilter on a deep neural network-based text-to-speech task and confirmed that it was able to reduce the gap between synthesized and target spectra, even in the high-dimensional STFT domain.
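
A minimal sketch of the divide-and-concatenate idea described above: split a spectrogram into overlapping frequency bands, process each band, and blend the overlaps back with linear cross-fades. The band width, overlap and the identity "postfilter" are placeholders, not the paper's settings.

```python
# Hypothetical sketch: overlapping band split and cross-faded concatenation.
import numpy as np

def split_bands(spec, band=128, overlap=16):
    starts = range(0, spec.shape[0] - overlap, band - overlap)
    return [(s, spec[s:s + band]) for s in starts]

def merge_bands(bands, n_bins, n_frames, overlap=16):
    out = np.zeros((n_bins, n_frames))
    weight = np.zeros((n_bins, 1))
    for start, chunk in bands:
        ramp = np.ones((chunk.shape[0], 1))
        if start > 0:                                    # fade in at interior lower edges
            ramp[:overlap, 0] = np.linspace(0, 1, overlap)
        if start + chunk.shape[0] < n_bins:              # fade out at interior upper edges
            ramp[-overlap:, 0] = np.linspace(1, 0, overlap)
        out[start:start + chunk.shape[0]] += chunk * ramp
        weight[start:start + chunk.shape[0]] += ramp
    return out / np.maximum(weight, 1e-8)

spec = np.abs(np.random.default_rng(0).normal(size=(513, 200)))
bands = split_bands(spec)                    # a band-wise postfilter would be applied here
print(np.allclose(merge_bands(bands, *spec.shape), spec))
```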

Generative Adversarial Network-Based Glottal Waveform Model for Statistical Parametric Speech Synthesis

Bajibabu Bollepalli, Lauri Juvela, Paavo Alku; Aalto University, Finland
Wed-P-8-4-7, Time: 16:00–18:00

Recent studies have shown that text-to-speech synthesis quality can be improved by using glottal vocoding. This refers to vocoders that parameterize speech into two parts, the glottal excitation and the vocal tract, that occur in the human speech production apparatus. Current glottal vocoders generate the glottal excitation waveform by using deep neural networks (DNNs). However, the squared error-based training of the present glottal excitation models is limited to generating conditional average waveforms, which fails to capture the stochastic variation of the waveforms. As a result, shaped noise is added as post-processing. In this study, we propose a new method for predicting glottal waveforms by generative adversarial networks (GANs). GANs are generative models that aim to embed the data distribution in a latent space, enabling generation of new instances very similar to the original by randomly sampling the latent distribution. The glottal pulses generated by GANs show a stochastic component similar to natural glottal pulses. In our experiments, we compare synthetic speech generated using glottal waveforms produced by both DNNs and GANs. The results show that the newly proposed GANs achieve synthesis quality comparable to that of widely-used DNNs, without using an additive noise component.

Emotional Voice Conversion with Adaptive Scales F0 Based on Wavelet Transform Using Limited Amount of Emotional Data

Zhaojie Luo, Jinhui Chen, Tetsuya Takiguchi, Yasuo Ariki; Kobe University, Japan
Wed-P-8-4-8, Time: 16:00–18:00

Deep learning techniques have been successfully applied to speech processing. Typically, neural networks (NNs) are very effective in processing nonlinear features, such as mel cepstral coefficients (MCC), which represent the spectrum features in voice conversion (VC) tasks. Despite these successes, the approach is restricted to problems with moderate dimension and sufficient data. Thus, in emotional VC tasks, it is hard to deal with a simple representation of fundamental frequency (F0), which is the most important feature in emotional voice representation. Another problem is that there are insufficient emotional data for training. To deal with these two problems, in this paper, we propose the adaptive scales continuous wavelet transform (AS-CWT) method to systematically capture the F0 features of different temporal scales, which can represent different prosodic levels ranging from micro-prosody to sentence level. Meanwhile, we also use pre-trained conversion functions obtained from other emotional datasets to synthesize new emotional data as additional training samples for target emotional voice conversion. Experimental results indicate that our proposed method achieves the best performance in both objective and subjective evaluations.

Speaker Adaptation in DNN-Based Speech Synthesis Using d-Vectors

Rama Doddipatla 1, Norbert Braunschweiler 1, Ranniery Maia 2; 1Toshiba Research Europe, UK; 2Universidade Federal de Santa Catarina, Brazil
Wed-P-8-4-9, Time: 16:00–18:00

The paper presents a mechanism to perform speaker adaptation in speech synthesis based on deep neural networks (DNNs). The mechanism extracts speaker identification vectors, so-called d-vectors, from the training speakers and uses them jointly with the linguistic features to train a multi-speaker DNN-based text-to-speech synthesizer (DNN-TTS). The d-vectors are derived by applying principal component analysis (PCA) on the bottle-neck features of a speaker classifier network. At the adaptation stage, three variants are explored: (1) d-vectors calculated using data from the target speaker, (2) d-vectors calculated as a weighted sum of d-vectors from training speakers, or (3) d-vectors calculated as an average of the above two approaches. The proposed method of unsupervised adaptation using the d-vector is compared with the commonly used i-vector based approach for speaker adaptation. Listening tests show that: (1) for speech quality, the d-vector based approach is significantly preferred over the i-vector based approach, and all the d-vector variants perform similarly; (2) for speaker similarity, both d-vector and i-vector based adaptation were found to perform similarly, except for a small significant preference for the d-vector calculated as an average over the i-vector.
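
A minimal sketch of the d-vector derivation step described above: apply PCA to frame-level bottleneck features and average the projected frames per speaker. The bottleneck extractor is replaced by random placeholders, and the uniform weights for the weighted-sum variant are purely illustrative.

```python
# Hypothetical sketch: PCA over bottleneck features, averaged per speaker into d-vectors.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Pretend these are bottleneck activations from a speaker-classifier network:
# speaker -> (n_frames, bottleneck_dim) array.
bottlenecks = {f"spk{i}": rng.normal(loc=i, size=(1000, 64)) for i in range(10)}

pca = PCA(n_components=16).fit(np.vstack(list(bottlenecks.values())))

def d_vector(frames):
    return pca.transform(frames).mean(axis=0)     # average projected frames per speaker

train_dvecs = {spk: d_vector(f) for spk, f in bottlenecks.items()}

# Variant (2) above: represent a new speaker as a weighted sum of training d-vectors
# (uniform weights here, only for illustration).
weights = np.full(len(train_dvecs), 1.0 / len(train_dvecs))
adapted = np.average(np.stack(list(train_dvecs.values())), axis=0, weights=weights)
```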

Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion

Runnan Li 1, Zhiyong Wu 1, Yishuang Ning 1, Lifa Sun 2, Helen Meng 1, Lianhong Cai 1; 1Tsinghua University, China; 2Chinese University of Hong Kong, China
Wed-P-8-4-10, Time: 16:00–18:00

From speech, speaker identity can be mostly characterized by the spectro-temporal structures of the spectrum. Although recent research has demonstrated the effectiveness of employing long short-term memory (LSTM) recurrent neural networks (RNN) in voice conversion, traditional LSTM-RNN based approaches usually focus on temporal evolutions of speech features only. In this paper, we improve the conventional LSTM-RNN method for voice conversion by employing the two-dimensional time-frequency LSTM (TFLSTM) to model spectro-temporal warping along both time and frequency axes. A multi-task learned structured output layer (SOL) is afterwards adopted to capture the dependencies between spectral and pitch parameters for further improvement, where spectral parameter targets are conditioned upon pitch parameter predictions. Experimental results show the proposed approach outperforms conventional systems in speech quality and speaker similarity.

Segment Level Voice Conversion with Recurrent Neural Networks

Miguel Varela Ramos 1, Alan W. Black 2, Ramon Fernandez Astudillo 1, Isabel Trancoso 1, Nuno Fonseca 3; 1INESC-ID Lisboa, Portugal; 2Carnegie Mellon University, USA; 3Politécnico de Leiria, Portugal
Wed-P-8-4-11, Time: 16:00–18:00

Voice conversion techniques aim to modify a subject’s voice characteristics in order to mimic those of another person. Due to the difference in utterance length between source and target speaker, state-of-the-art voice conversion systems often rely on a frame alignment pre-processing step. This step aligns the entire utterances with algorithms such as dynamic time warping (DTW) that introduce errors, hindering system performance. In this paper we present a new technique that avoids the alignment of entire utterances at frame level, while keeping the local context during training. For this purpose, we combine an RNN model with the use of phoneme- or syllable-level information, obtained from a speech recognition system. This system divides the utterances into segments which can then be grouped into overlapping windows, providing the needed context for the model to learn the temporal dependencies. We show that with this approach, notable improvements can be attained over a state-of-the-art RNN voice conversion system on the CMU ARCTIC database. It is also worth noting that with this technique it is possible to halve the training data size and still outperform the baseline.

Wed-S&T-6/7-A : Show & Tell 5
E306, 10:00–12:00, 13:30–15:30, Wednesday, 23 Aug. 2017

Creating a Voice for MiRo, the World’s First Commercial Biomimetic Robot

Roger K. Moore, Ben Mitchinson; University of Sheffield, UK
Wed-S&T-6-A-1, Time: 10:00–12:00

This paper introduces MiRo — the world’s first commercial biomimetic robot — and describes how its vocal system was designed using a real-time parametric general-purpose mammalian vocal synthesiser tailored to the specific physical characteristics of the robot. MiRo’s capabilities will be demonstrated live during the hands-on interactive ‘Show & Tell’ session at INTERSPEECH-2017.

A Thematicity-Based Prosody Enrichment Tool for CTS

Mónica Domínguez 1, Mireia Farrús 1, Leo Wanner 2; 1Universitat Pompeu Fabra, Spain; 2ICREA, Spain
Wed-S&T-6-A-2, Time: 10:00–12:00

This paper presents a demonstration of a stochastic prosody tool for enrichment of synthesized speech using SSML prosody tags applied over hierarchical thematicity spans in the context of a CTS application. The motivation for using hierarchical thematicity is exemplified, together with the capabilities of the module to generate a variety of SSML prosody tags within a controlled range of values depending on the input thematicity label.

WebSubDub — Experimental System for Creating High-Quality Alternative Audio Track for TV Broadcasting

Martin Gruber, Jindrich Matoušek, Zdenek Hanzlícek, Jakub Vít, Daniel Tihelka; University of West Bohemia, Czech Republic
Wed-S&T-6-A-3, Time: 10:00–12:00

This paper presents an experimental system (called WebSubDub) for creating a high-quality alternative audio track for TV broadcasting. The system is used to create subtitles for TV shows in a format that allows an alternative audio track with multiple voices to be generated automatically, employing a specially adapted TTS system. This alternative audio track is intended for televiewers with slight hearing impairments, i.e. for a group of televiewers who encounter issues when perceiving the original audio track — especially dialogues with background music, background noise or emotional speech. The system was developed in cooperation with Czech Television, the public service broadcaster in the Czech Republic.

Voice Conservation and TTS System for People Facing Total Laryngectomy

Markéta Juzová, Daniel Tihelka, Jindrich Matoušek, Zdenek Hanzlícek; University of West Bohemia, Czech Republic
Wed-S&T-6-A-4, Time: 10:00–12:00

The presented paper is focused on building personalized text-to-speech (TTS) synthesis for people who are losing their voices due to fatal diseases. The special conditions of this task make the process different from preparing professional synthetic voices for commercial TTS systems, and also make it more difficult. The whole process is described in this paper, and the first results of the personalized voice building are presented as well.

TBT (Toolkit to Build TTS): A High Performance Framework to Build Multiple Language HTS Voice

Atish Shankar Ghone 1, Rachana Nerpagar 1, Pranaw Kumar 1, Arun Baby 2, Aswin Shanmugam 2, Sasikumar M. 1, Hema A. Murthy 2; 1C-DAC, India; 2IIT Madras, India
Wed-S&T-6-A-5, Time: 10:00–12:00

With the development of high quality TTS systems, the application area of synthetic speech is increasing rapidly. Beyond communication aids for the visually impaired and vocally handicapped, TTS voices are being used in various educational, telecommunication and multimedia applications. All around the world people are trying to build TTS voices for their regional languages. TTS voice building requires a number of steps and involves the use of multiple tools, which makes it time consuming, tedious and perplexing to a user. This paper describes a toolkit developed for HMM-based TTS voice building that makes the process much easier and handier. The toolkit uses all required tools, viz. HTS, Festival, Festvox, the Hybrid Segmentation Tool, etc., and handles every step: phone set creation, prompt generation, hybrid segmentation, F0 range finding, voice building, and finally putting the built voice into a synthesis framework. Wherever possible it performs parallel processing to reduce time. It saves manual effort and time to a large extent and enables a person to build a TTS voice very easily. This toolkit is made available under an open source license.

SIAK — A Game for Foreign Language Pronunciation Learning

Reima Karhila 1, Sari Ylinen 2, Seppo Enarvi 1, Kalle Palomäki 1, Aleksander Nikulin 1, Olli Rantula 1, Vertti Viitanen 1, Krupakar Dhinakaran 1, Anna-Riikka Smolander 2, Heini Kallio 2, Katja Junttila 2, Maria Uther 3, Perttu Hämäläinen 1, Mikko Kurimo 1; 1Aalto University, Finland; 2University of Helsinki, Finland; 3University of Winchester, UK
Wed-S&T-6-A-6, Time: 10:00–12:00

We introduce a digital game for children’s foreign-language learning that uses automatic speech recognition (ASR) for evaluating children’s utterances. Our first prototype focuses on the learning of English words and their pronunciation. The game connects to a network server, which handles the recognition and pronunciation grading of children’s foreign-language speech. The server is reusable for different applications. Given suitable acoustic models, it can be used for grading pronunciations in any language.

Wed-S&T-6/7-B : Show & Tell 6
E397, 10:00–12:00, 13:30–15:30, Wednesday, 23 Aug. 2017

Integrating the Talkamatic Dialogue Manager with Alexa

Staffan Larsson 1, Alex Berman 2, Andreas Krona 2, Fredrik Kronlid 2; 1University of Gothenburg, Sweden; 2Talkamatic, Sweden
Wed-S&T-6-B-1, Time: 10:00–12:00

This paper describes the integration of Amazon Alexa with the Talkamatic Dialogue Manager (TDM), and shows how flexible dialogue skills and rapid prototyping of dialogue apps can be brought to the Alexa platform.

A Robust Medical Speech-to-Speech/Speech-to-Sign Phraselator

Farhia Ahmed 1, Pierrette Bouillon 2, Chelle Destefano 3, Johanna Gerlach 2, Sonia Halimi 2, Angela Hooper 4, Manny Rayner 2, Hervé Spechbach 5, Irene Strasly 2, Nikos Tsourakis 2; 1Association Genevoise des Malentendants, Switzerland; 2Université de Genève, Switzerland; 3Gypsysnail Arts, Australia; 4NABS Interpreting Services, Australia; 5Hôpitaux Universitaires de Genève, Switzerland
Wed-S&T-6-B-2, Time: 10:00–12:00

We present BabelDr, a web-enabled spoken-input phraselator for medical domains, which has been developed at Geneva University in a collaboration between a human language technology group and a group at the University hospital. The current production version of the system translates French into Arabic, using exclusively rule-based methods, and has performed credibly in simulated triaging tests with standardised patients. We also present an experimental version which combines large-vocabulary recognition with the main rule-based recogniser; offline tests on unseen data suggest that the new architecture adds robustness while more than halving the 2-best semantic error rate. The experimental version translates from spoken English into spoken French and also two sign languages.

Towards an Autarkic Embedded Cognitive User Interface

Frank Duckhorn 1, Markus Huber 2, Werner Meyer 3, Oliver Jokisch 4, Constanze Tschöpe 1, Matthias Wolff 3; 1Fraunhofer IKTS, Germany; 2InnoTec21, Germany; 3Brandenburgische Technische Universität, Germany; 4Hochschule für Telekommunikation Leipzig, Germany
Wed-S&T-6-B-3, Time: 10:00–12:00

With this paper we present an overview of an autarkic embedded cognitive user interface. It is realized in the form of an integrated device able to communicate with the user via speech and gesture recognition, speech synthesis and a touch display. Semantic processing and cognitive behaviour control support intuitive interaction and help control arbitrary electronic devices. To ensure user privacy and to operate independently of network access, all information processing is done on the device.

Nora the Empathetic Psychologist

Genta Indra Winata 1, Onno Kampman 1, Yang Yang 1, Anik Dey 2, Pascale Fung 1; 1HKUST, China; 2EMOS Technologies, China
Wed-S&T-6-B-4, Time: 10:00–12:00

Nora is a new dialog system that mimics a conversation with a psychologist by screening for stress, anxiety, and depression. She understands, empathizes, and adapts to users using emotional intelligence modules trained via statistical modelling such as Convolutional Neural Networks. These modules also enable her to personalize the content of each conversation.


Modifying Amazon’s Alexa ASR Grammar and Lexicon — A Case Study

Hassan Alam, Aman Kumar, Manan Vyas, Tina Werner, Rachmat Hartono; BCL Technologies, USA
Wed-S&T-6-B-5, Time: 10:00–12:00

In this proof-of-concept study we build a tool that modifies the grammar and the dictionary of an Automatic Speech Recognition (ASR) engine. We evaluated our tool using Amazon’s Alexa ASR engine. The experiments show that with our grammar and dictionary modification algorithms in the military domain, the accuracy of the modified ASR went up significantly — from 20/100 correct to 80/100 correct.

Keynote 3: Björn Lindblom
Aula Magna, 08:30–09:30, Thursday, 24 Aug. 2017
Chair: Olov Engwall

Re-Inventing Speech — The Biological Way

Björn Lindblom; University of Stockholm, Sweden
Thu-K4-1, Time: 08:30–09:30

The mapping of the Speech Chain has so far been focused on the experimentally more accessible links — e.g., acoustics — whereas the brain’s activity during speaking and listening has understandably received less attention. That state of affairs is about to change now thanks to the new sophisticated tools offered by brain imaging technology.

At present many key questions concerning human speech processes remain incompletely understood despite the significant research efforts of the past half century. As speech research goes neuro, we could do with some better answers.

In this paper I will attempt to shed some light on some of the issues. I will do so by heeding the advice that Tinbergen once gave his fellow biologists on explaining behavior. I paraphrase: Nothing in biology makes sense unless you simultaneously look at it with the following questions at the back of your mind: How did it evolve? How is it acquired? How does it work here and now?

Applying the Tinbergen strategy to speech I will, in broad strokes, trace a path from the small and fixed innate repertoires of non-human primates to the open-ended vocal systems that humans learn today.

Such an agenda will admittedly identify serious gaps in our present knowledge but, importantly, it will also bring an overarching possibility:

It will strongly suggest the feasibility of bypassing the traditional linguistic operational approach to speech units and replacing it by a first-principles account anchored in biology.

I will argue that this is the road-map we need for a more profound understanding of the fundamental nature of spoken language and for educational, medical and technological applications.


Thu-SS-9-10 : Special Session: Interspeech 2017 Computational Paralinguistics ChallengE (ComParE) 1
E10, 10:00–12:00, Thursday, 24 Aug. 2017
Chairs: Björn Schuller, Anton Batliner

The INTERSPEECH 2017 Computational Paralinguistics Challenge: Addressee, Cold & Snoring

Björn Schuller 1, Stefan Steidl 2, Anton Batliner 3, Elika Bergelson 4, Jarek Krajewski 5, Christoph Janott 6, Andrei Amatuni 4, Marisa Casillas 7, Amanda Seidl 8, Melanie Soderstrom 9, Anne S. Warlaumont 10, Guillermo Hidalgo 5, Sebastian Schnieder 5, Clemens Heiser 6, Winfried Hohenhorst 11, Michael Herzog 12, Maximilian Schmitt 3, Kun Qian 6, Yue Zhang 1, George Trigeorgis 1, Panagiotis Tzirakis 1, Stefanos Zafeiriou 1; 1Imperial College London, UK; 2FAU Erlangen-Nürnberg, Germany; 3Universität Passau, Germany; 4Duke University, USA; 5Bergische Universität Wuppertal, Germany; 6Technische Universität München, Germany; 7MPI for Psycholinguistics, The Netherlands; 8Purdue University, USA; 9University of Manitoba, Canada; 10University of California at Merced, USA; 11Alfried Krupp Krankenhaus, Germany; 12Carl-Thiem-Klinikum Cottbus, Germany
Thu-SS-9-10-1, Time: 10:00–10:15

The INTERSPEECH 2017 Computational Paralinguistics Challenge addresses three different problems for the first time in a research competition under well-defined conditions: in the Addressee sub-challenge, it has to be determined whether speech produced by an adult is directed towards another adult or towards a child; in the Cold sub-challenge, speech under cold has to be told apart from ‘healthy’ speech; and in the Snoring sub-challenge, four different types of snoring have to be classified. In this paper, we describe these sub-challenges, their conditions, and the baseline feature extraction and classifiers, which include data-learnt feature representations by end-to-end learning with convolutional and recurrent neural networks, and bag-of-audio-words, for the first time in the challenge series.

Description of the Upper Respiratory Tract Infection Corpus (URTIC)

Jarek Krajewski 1, Sebastian Schieder, Anton Batliner 2; 1Bergische Universität Wuppertal, Germany; 2Universität Passau, Germany
Thu-SS-9-10-2, Time: 10:15–10:25

(No abstract available at the time of publication)

Description of the Munich-Passau Snore Sound Corpus (MPSSC)

Christoph Janott 1, Anton Batliner 2; 1Technische Universität München, Germany; 2Universität Passau, Germany
Thu-SS-9-10-3, Time: 10:25–10:35

(No abstract available at the time of publication)

Description of the Homebank Child/Adult Addressee Corpus (HB-CHAAC)

Elika Bergelson 1, Andrei Amatuni 1, Marisa Casillas 2, Amanda Seidl 3, Melanie Soderstrom 4, Anne S. Warlaumont 5; 1Duke University, USA; 2MPI for Psycholinguistics, The Netherlands; 3Purdue University, USA; 4University of Manitoba, Canada; 5University of California at Merced, USA
Thu-SS-9-10-4, Time: 10:35–10:45

(No abstract available at the time of publication)

It Sounds Like You Have a Cold! Testing Voice Features for the Interspeech 2017 Computational Paralinguistics Cold Challenge

Mark Huckvale 1, András Beke 2; 1University College London, UK; 2MTA, Hungary
Thu-SS-9-10-5, Time: 10:45–11:00

This paper describes an evaluation of four different voice feature sets for detecting symptoms of the common cold in speech as part of the Interspeech 2017 Computational Paralinguistics Challenge. The challenge corpus consists of 630 speakers in three partitions, of which approximately one third had a “severe” cold at the time of recording. Success on the task is measured in terms of unweighted average recall of cold/not-cold classification from short extracts of the recordings. In this paper we review previous voice features used for studying changes in health and devise four basic types of features for evaluation: voice quality features, vowel spectra features, modulation spectra features, and spectral distribution features. The evaluation shows that each feature set provides some useful information to the task, with features from the modulation spectrogram being most effective. Feature-level fusion of the feature sets shows small performance improvements on the development test set. We discuss the results in terms of the most suitable features for detecting symptoms of cold and address issues arising from the design of the challenge.

End-to-End Deep Learning Framework for Speech Paralinguistics Detection Based on Perception Aware Spectrum

Danwei Cai 1, Zhidong Ni 1, Wenbo Liu 1, Weicheng Cai 1, Gang Li 2, Ming Li 1; 1Sun Yat-sen University, China; 2JSC, China
Thu-SS-9-10-6, Time: 11:00–11:15

In this paper, we propose an end-to-end deep learning framework to detect speech paralinguistics using a perception aware spectrum as input. Existing studies show that speech under cold has distinct variations of energy distribution in low frequency components compared with speech under the ‘healthy’ condition. This motivates us to use a perception aware spectrum as the input to an end-to-end learning framework with a small-scale dataset. In this work, we try both the Constant Q Transform (CQT) spectrum and the Gammatone spectrum in different end-to-end deep learning networks, where both spectra closely mimic human speech perception and transform it into 2D images. Experimental results show the effectiveness of the proposed perception aware spectrum with an end-to-end deep learning approach on the Interspeech 2017 Computational Paralinguistics Cold sub-Challenge. The final fusion result of our proposed method is 8% better than that of the provided baseline in terms of UAR.
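As a concrete illustration of one of the perception-aware inputs mentioned above, the sketch below computes a constant-Q transform with librosa and converts it to a log-magnitude 2D image suitable for a CNN; the sampling rate, hop length and bin settings are assumptions, and the Gammatone variant is not shown.

    # Compute a CQT "image" of an utterance for an end-to-end CNN.
    import librosa
    import numpy as np

    def cqt_image(wav_path):
        y, sr = librosa.load(wav_path, sr=16000)
        C = np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12))
        return librosa.amplitude_to_db(C, ref=np.max)  # shape: (84, num_frames)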


Infected Phonemes: How a Cold Impairs Speech on a Phonetic Level

Johannes Wagner 1, Thiago Fraga-Silva 2, Yvan Josse 2, Dominik Schiller 1, Andreas Seiderer 1, Elisabeth André 1; 1Universität Augsburg, Germany; 2Vocapia Research, France
Thu-SS-9-10-7, Time: 11:15–11:30

The realization of language through vocal sounds involves a complex interplay between the lungs, the vocal cords, and a series of resonant chambers (e.g. mouth and nasal cavities). Due to their connection to the outside world, these body parts are popular spots for viruses and bacteria to enter the human organism. Affected people may suffer from an upper respiratory tract infection (URTIC) and consequently their voice often sounds breathy, raspy or sniffly. In this paper, we investigate the audible effects of a cold on a phonetic level. Results on a German corpus show that the articulation of consonants is more impaired than that of vowels. Surprisingly, nasal sounds do not follow this trend in our experiments. We finally try to predict a speaker’s health condition by fusing decisions we derive from single phonemes. The presented work is part of the INTERSPEECH 2017 Computational Paralinguistics Challenge.

Phoneme State Posteriorgram Features for Speech Based Automatic Classification of Speakers in Cold and Healthy Condition

Akshay Kalkunte Suresh 1, Srinivasa Raghavan K.M. 2, Prasanta Kumar Ghosh 2; 1PES Institute of Technology, India; 2Indian Institute of Science, India
Thu-SS-9-10-8, Time: 11:30–11:45

We consider the problem of automatically detecting whether a speaker is suffering from a common cold from his/her speech. When a speaker has symptoms of cold, his/her voice quality changes compared to the normal one. We hypothesize that such a change in voice quality could be reflected in lower likelihoods from a model built using normal speech. In order to capture this, we compute a 120-dimensional posteriorgram feature in each frame using Gaussian mixture models from the 120 states of 40 three-state phonetic hidden Markov models trained on approximately 16.4 hours of normal English speech. Finally, a fixed 5160-dimensional phoneme state posteriorgram (PSP) feature vector for each utterance is obtained by computing statistics from the posteriorgram feature trajectory. Experiments on the 2017-Cold sub-challenge data show that when the decisions from bag-of-audio-words (BoAW) and end-to-end (e2e) systems are combined with those from PSP features using an unweighted majority rule, the UAR on the development set becomes 69%, which is 2.9% (absolute) better than the best of the UARs obtained by the baseline schemes. When the decisions from ComParE, BoAW and PSP features are combined with a simple majority rule, it results in a UAR of 68.52% on the test set.

An Integrated Solution for Snoring Sound Classification Using Bhattacharyya Distance Based GMM Supervectors with SVM, Feature Selection with Random Forest and Spectrogram with CNN

Tin Lay Nwe, Huy Dat Tran, Wen Zheng Terence Ng, Bin Ma; A*STAR, Singapore
Thu-SS-9-10-9, Time: 11:45–12:00

Snoring is caused by the narrowing of the upper airway and it is excited at different locations within the upper airways. This irregularity could lead to the presence of Obstructive Sleep Apnea Syndrome (OSAS). Diagnosis of OSAS could therefore be made by snoring sound analysis. This paper proposes a novel method to automatically classify snoring sounds by their excitation locations for the ComParE 2017 challenge. We propose three sub-systems for classification. In the first system, we integrate Bhattacharyya distance based Gaussian Mixture Model (GMM) supervectors with the set of static features provided by the ComParE 2017 challenge. The Bhattacharyya distance based GMM supervectors characterize the spectral dissimilarity among snore sounds excited at different locations, and we employ a Support Vector Machine (SVM) for classification. In the second system, we perform feature selection on the static features provided by the challenge and conduct classification using a Random Forest. In the third system, we extract spectrograms from the audio and employ a Convolutional Neural Network (CNN) for snore sound classification. We then fuse the three sub-systems to produce the final classification results. The experimental results show that the proposed system performs better than the challenge baseline.
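The Bhattacharyya distance mentioned above has a closed form for a pair of Gaussian components; the sketch below shows that single-component building block (the extension to full GMM supervectors, which the paper uses, is not shown).

    # Bhattacharyya distance between two Gaussians N(mu1, cov1) and N(mu2, cov2).
    import numpy as np

    def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
        cov = 0.5 * (cov1 + cov2)                        # average covariance
        diff = mu1 - mu2
        term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
        term2 = 0.5 * np.log(np.linalg.det(cov) /
                             np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
        return term1 + term2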

Thu-SS-9-11 : Special Session: State of the Art in Physics-based Voice Simulation
F11, 10:00–12:00, Thursday, 24 Aug. 2017
Chairs: Sten Ternström, Oriol Guasch

Acoustic Analysis of Detailed Three-Dimensional Shape of the Human Nasal Cavity and Paranasal Sinuses

Tatsuya Kitamura 1, Hironori Takemoto 2, Hisanori Makinae 3, Tetsutaro Yamaguchi 4, Kotaro Maki 4; 1Konan University, Japan; 2Chiba Institute of Technology, Japan; 3National Research Institute of Police Science, Japan; 4Showa University, Japan
Thu-SS-9-11-1, Time: 10:00–10:20

The nasal and paranasal cavities have a labyrinthine shape and their acoustic properties affect speech sounds. In this study, we explored the transfer function of the nasal and paranasal cavities, as well as the contribution of each paranasal cavity, using acoustical and numerical methods. A physical model of the nasal and paranasal cavities was formed using data from a high-resolution 3D X-ray CT and a 3D printer. The data was acquired from a female subject during silent nasal breathing. The transfer function of the physical model was then measured by introducing a white noise signal at the glottis and measuring its acoustic response at a point 20 mm away from the nostrils. We also calculated the transfer function of the 3D model using a finite-difference time-domain (FDTD) method. The results showed that the gross shape and the frequency of peaks and dips of the measured and calculated transfer functions were similar, suggesting that both methods used in this study were reliable. The results of FDTD simulations evaluating the paranasal sinuses individually suggested that they contribute not only to spectral dips but also to peaks, which is contrary to the traditional theories regarding the production of speech sounds.

A Semi-Polar Grid Strategy for the Three-Dimensional Finite Element Simulation of Vowel-Vowel Sequences

Marc Arnela 1, Saeed Dabbaghchian 2, Oriol Guasch 1, Olov Engwall 2; 1Universitat Ramon Llull, Spain; 2KTH, Sweden
Thu-SS-9-11-2, Time: 10:20–10:40

Three-dimensional computational acoustic models need very detailed 3D vocal tract geometries to generate high quality sounds. Static geometries can be obtained from Magnetic Resonance Imaging (MRI), but it is not currently possible to capture dynamic MRI-based geometries with sufficient spatial and time resolution. One possible solution consists in interpolating between static geometries, but this is a complex task. We instead propose herein to use a semi-polar grid to extract 2D cross-sections from the static 3D geometries, and then interpolate them to obtain the vocal tract dynamics. Other approaches such as the adaptive grid have also been explored. In this method, cross-sections are defined perpendicular to the vocal tract midline, as typically done in 1D to obtain the vocal tract area functions. However, intersections between adjacent cross-sections may occur during the interpolation process, especially when the vocal tract midline quickly changes its orientation. In contrast, the semi-polar grid prevents these intersections because the plane orientations are fixed over time. Finite element simulations of static vowels are first conducted, showing that 3D acoustic wave propagation is not significantly altered when the semi-polar grid is used instead of the adaptive grid. The vowel-vowel sequence [Ai] is finally simulated to demonstrate the method.

A Fast Robust 1D Flow Model for a Self-Oscillating Coupled 2D FEM Vocal Fold Simulation

Arvind Vasudevan 1, Victor Zappi 2, Peter Anderson 1, Sidney Fels 1; 1University of British Columbia, Canada; 2Istituto Italiano di Tecnologia, Italy
Thu-SS-9-11-3, Time: 10:40–11:00

A balance between the simplicity and speed of lumped-element vocal fold models and the completeness and complexity of continuum models is required to achieve fast high-quality articulatory speech synthesis. We develop and implement a novel self-oscillating vocal-fold model, composed of a 1D unsteady fluid model loosely coupled with a 2D FEM structural model. The flow model is capable of robustly handling irregular geometries, different boundary conditions, closure of the glottis and unsteady flow states. A method for a fast decoupled solution of the flow equations that does not require the computation of the Jacobian is provided. The model is coupled with a 2D real-time finite-difference wave solver for simulating vocal tract acoustics and a 1D wave-reflection analog representation of the trachea. The simulation results are shown to agree with existing data in the literature, and give realistic pressure-velocity distributions, glottal width and glottal flow values. In addition, the model is more than an order of magnitude faster to run than comparable 2D Navier-Stokes fluid solvers, while better capturing transitional flow than simple Bernoulli-based flow models. The vocal fold model provides an alternative to simple lumped-element models for faster higher-quality articulatory speech synthesis.

Waveform Patterns in Pitch Glides Near a Vocal Tract Resonance

Tiina Murtola, Jarmo Malinen; Aalto University, Finland
Thu-SS-9-11-4, Time: 11:00–11:20

A time-domain model of vowel production is used to simulate fundamental frequency glides over the first vocal tract resonance. A vocal tract geometry extracted from MRI data of a female speaker pronouncing [i] is used. The model contains direct feedback from the acoustic loads to the vocal fold tissues and the inertial effect of the full air column on the glottal flow. The simulations reveal that a perturbation pattern in the fundamental frequency, namely a jump and locking to the vocal tract resonance, is accompanied by a specific pattern of glottal waveform changes.

A Unified Numerical Simulation of Vowel Production That Comprises Phonation and the Emitted Sound

Niyazi Cem Degirmenci 1, Johan Jansson 2, Johan Hoffman 1, Marc Arnela 3, Patricia Sánchez-Martín 3, Oriol Guasch 3, Sten Ternström 1; 1KTH, Sweden; 2BCAM, Spain; 3Universitat Ramon Llull, Spain
Thu-SS-9-11-5, Time: 11:20–11:40

A unified approach for the numerical simulation of vowels is presented, which accounts for the self-oscillations of the vocal folds including contact, the generation of acoustic waves and their propagation through the vocal tract, and the sound emission outwards from the mouth. A monolithic incompressible fluid-structure interaction model is used to simulate the interaction between the glottal jet and the vocal folds, whereas the contact model is addressed by means of a level set application of the Eikonal equation. The coupling with acoustics is done through an acoustic analogy stemming from a simplification of the acoustic perturbation equations. This coupling is one-way in the sense that there is no feedback from the acoustics to the flow and mechanical fields.

All the involved equations are solved together at each time step and in a single computational run, using the finite element method (FEM). As an application, the production of vowel [i] has been addressed. Despite the complexity of all the physical phenomena to be simulated simultaneously, which requires resorting to massively parallel computing, the formant locations of vowel [i] have been well recovered.

Synthesis of VV Utterances from Muscle Activation to Sound with a 3D Model

Saeed Dabbaghchian 1, Marc Arnela 2, Olov Engwall 1, Oriol Guasch 2; 1KTH, Sweden; 2Universitat Ramon Llull, Spain
Thu-SS-9-11-6, Time: 11:40–12:00

We propose a method to automatically generate deformable 3D vocal tract geometries from the surrounding structures in a biomechanical model. This allows us to couple 3D biomechanics and acoustics simulations. The basis of the simulations is muscle activation trajectories in the biomechanical model, which move the articulators to the desired articulatory positions. The muscle activation trajectories for a vowel-vowel utterance are here defined through interpolation between the determined activations of the start and end vowel. The resulting articulatory trajectories of flesh points on the tongue surface and jaw are similar to corresponding trajectories measured using Electromagnetic Articulography, hence corroborating the validity of interpolating muscle activation. At each time step in the articulatory transition, a 3D vocal tract tube is created through a cavity extraction method based on first slicing the geometry of the articulators with a semi-polar grid to extract the vocal tract contour in each plane and then reconstructing the vocal tract through a smoothed 3D mesh generation using the extracted contours. A finite element method applied to these changing 3D geometries simulates the acoustic wave propagation. We present the resulting acoustic pressure changes on the vocal tract boundary and the formant transitions for the utterance [Ai].


Thu-SS-10-10 : Special Session: Interspeech 2017 Computational Paralinguistics ChallengE (ComParE) 2
E10, 13:30–15:30, Thursday, 24 Aug. 2017
Chairs: Björn Schuller, Anton Batliner

A Dual Source-Filter Model of Snore Audio for Snorer Group Classification

Achuth Rao M.V., Shivani Yadav, Prasanta Kumar Ghosh; Indian Institute of Science, India
Thu-SS-10-10-1, Time: 13:30–13:45

Snoring is a common symptom of a serious chronic disease known as obstructive sleep apnea (OSA). Knowledge about the location of the obstruction site (V—Velum, O—Oropharyngeal lateral walls, T—Tongue, E—Epiglottis) in the upper airways is necessary for proper surgical treatment. In this paper we propose a dual source-filter model, similar to the source-filter model of speech, to approximate the generation process of snore audio. The first filter models the vocal tract from the lungs to the point of obstruction, with white noise excitation from the lungs. The second filter models the vocal tract from the obstruction point to the lips/nose, with impulse train excitation representing vibrations at the point of obstruction. The filter coefficients are estimated using the closed and open phases of the snore beat cycle. VOTE classification is done using an SVM classifier with the filter coefficients as features. The classification experiments are performed on the development set (283 snore audios) of the Munich-Passau Snore Sound Corpus (MPSSC). We obtain an unweighted average recall (UAR) of 49.58%, which is higher than the INTERSPEECH-2017 snoring sub-challenge baseline technique by ∼3% (absolute).

An ‘End-to-Evolution’ Hybrid Approach for Snore Sound Classification

Michael Freitag, Shahin Amiriparian, Nicholas Cummins, Maurice Gerczuk, Björn Schuller; Universität Passau, Germany
Thu-SS-10-10-2, Time: 13:45–14:00

Whilst snoring itself is usually not harmful to a person’s health, it can be an indication of Obstructive Sleep Apnoea (OSA), a serious sleep-related disorder. As a result, studies into using snoring as an acoustic marker of OSA are gaining in popularity. Motivated by this, the INTERSPEECH 2017 ComParE Snoring sub-challenge requires classification of which areas in the upper airways different snoring sounds originate from. This paper explores a hybrid approach combining evolutionary feature selection based on competitive swarm optimisation and deep convolutional neural networks (CNN). Feature selection is applied to novel deep spectrum features extracted directly from spectrograms using pre-trained image classification CNNs. Key results presented demonstrate that our hybrid approach can substantially increase the performance of a linear support vector machine on a set of low-level features extracted from the Snoring sub-challenge data. Even without subset selection, the deep spectrum features are sufficient to outperform the challenge baseline, and competitive swarm optimisation further improves system performance. In comparison to the challenge baseline, unweighted average recall is increased from 40.6% to 57.6% on the development partition, and from 58.5% to 66.5% on the test partition, using 2246 of the 4096 deep spectrum features.

Snore Sound Classification Using Image-Based Deep Spectrum Features

Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins, Michael Freitag, Sergey Pugachevskiy, Alice Baird, Björn Schuller; Universität Passau, Germany
Thu-SS-10-10-3, Time: 14:00–14:15

In this paper, we propose a method for automatically detecting various types of snore sounds using image classification convolutional neural network (CNN) descriptors extracted from audio file spectrograms. The descriptors, denoted as deep spectrum features, are derived by forwarding spectrograms through very deep task-independent pre-trained CNNs. Specifically, activations of fully connected layers from two common image classification CNNs, AlexNet and VGG19, are used as feature vectors. Moreover, we investigate the impact of differing spectrogram colour maps and the two CNN architectures on the performance of the system. Results presented indicate that deep spectrum features extracted from the activations of the second fully connected layer of AlexNet using a viridis colour map are well suited to the task. This feature space, when combined with a support vector classifier, outperforms the more conventional knowledge-based features of 6 373 acoustic functionals used in the INTERSPEECH ComParE 2017 Snoring sub-challenge baseline system. In comparison to the baseline, unweighted average recall is increased from 40.6% to 44.8% on the development partition, and from 58.5% to 67.0% on the test partition.
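A minimal sketch of extracting such deep spectrum features, assuming torchvision's pre-trained AlexNet and a spectrogram already rendered as an image file; the input size, normalisation and layer slicing below are reasonable assumptions, not the authors' published code.

    # Forward a (e.g. viridis-coloured) spectrogram image through AlexNet and keep
    # the 4096-d activation of the second fully connected layer as a feature vector.
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    model = models.alexnet(pretrained=True).eval()
    preprocess = T.Compose([
        T.Resize((224, 224)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def deep_spectrum_features(spectrogram_png):
        x = preprocess(Image.open(spectrogram_png).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            h = model.avgpool(model.features(x)).flatten(1)  # convolutional trunk
            h = model.classifier[:5](h)                      # up to and including fc2
        return h.squeeze(0).numpy()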

Exploring Fusion Methods and Feature Space for the Classification of Paralinguistic Information

David Tavarez, Xabier Sarasola, Agustin Alonso, Jon Sanchez, Luis Serrano, Eva Navas, Inma Hernáez; Universidad del País Vasco, Spain
Thu-SS-10-10-4, Time: 14:15–14:30

This paper introduces the different systems developed by the Aholab Signal Processing Laboratory for the INTERSPEECH 2017 Computational Paralinguistics Challenge, which includes three different subtasks: Addressee, Cold and Snoring classification. Several classification strategies and features related to the spectrum, prosody and phase have been tested separately and further combined by using different fusion techniques, such as early fusion by means of multi-feature vectors, late fusion of the standalone classifier scores and label fusion via weighted voting. The obtained results show that the applied fusion methods improve the performance of the standalone detectors and provide systems capable of outperforming the baseline systems in terms of UAR.

DNN-Based Feature Extraction and Classifier Combination for Child-Directed Speech, Cold and Snoring Identification

Gábor Gosztolya 1, Róbert Busa-Fekete 2, Tamás Grósz 1, László Tóth 3; 1University of Szeged, Hungary; 2Yahoo!, USA; 3MTA-SZTE RGAI, Hungary
Thu-SS-10-10-5, Time: 14:30–14:45

In this study we deal with the three sub-challenges of the Interspeech ComParE Challenge 2017, where the goal is to identify child-directed speech, speakers having a cold, and different types of snoring sounds. For the first two sub-challenges we propose a simple, two-step feature extraction and classification scheme: first we perform frame-level classification via Deep Neural Networks (DNNs), and then we extract utterance-level features from the DNN outputs. By utilizing these features for classification, we were able to match the performance of the standard paralinguistic approach (which involves extracting thousands of features, many of them completely irrelevant to the actual task). As for the Snoring sub-challenge, we divided the recordings into segments and averaged some frame-level features segment-wise, which were then used for utterance-level classification. When combining the predictions of the proposed approaches with those obtained by the standard paralinguistic approach, we managed to outperform the baseline values of the Cold and Snoring sub-challenges on the hidden test sets.

Introducing Weighted Kernel Classifiers for Handling Imbalanced Paralinguistic Corpora: Snoring, Addressee and Cold

Heysem Kaya 1, Alexey A. Karpov 2; 1Namık Kemal Üniversitesi, Turkey; 2Russian Academy of Sciences, Russia
Thu-SS-10-10-6, Time: 14:45–15:00

The field of paralinguistics is growing rapidly with a wide range of applications that go beyond recognition of emotions, laughter and personality. The research flourishes in multiple directions, such as signal representation and classification, addressing the issues of the domain. Apart from noise robustness, an important issue with real-life data is its imbalanced nature: some classes of states/traits are under-represented. Combined with the high dimensionality of the feature vectors used in state-of-the-art analysis systems, this issue poses the threat of over-fitting. While the kernel trick can be employed to handle the dimensionality issue, regular classifiers inherently aim to minimize the misclassification error and hence are biased towards the majority class. A solution to this problem is over-sampling of the minority class(es). However, this brings increased memory/computational costs, while not bringing any new information to the classifier. In this work, we propose a new weighting scheme on instances of the original dataset, employing the Weighted Kernel Extreme Learning Machine, and, inspired by it, introduce a Weighted Partial Least Squares Regression based classifier. The proposed methods are applied on all three INTERSPEECH ComParE 2017 challenge corpora, giving better or competitive results compared to the challenge baselines.
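To make the instance-weighting idea concrete, the sketch below assigns each training example a weight inversely proportional to its class frequency; this is a generic class-balancing rule for illustration, not the exact scheme of the paper.

    # Per-instance weights that up-weight minority classes instead of over-sampling.
    import numpy as np

    def class_balanced_weights(labels):
        labels = np.asarray(labels)
        classes, counts = np.unique(labels, return_counts=True)
        per_class = {c: len(labels) / (len(classes) * n) for c, n in zip(classes, counts)}
        return np.array([per_class[y] for y in labels])

    print(class_balanced_weights([0, 0, 0, 0, 1]))  # the minority class gets the larger weight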

The INTERSPEECH 2017 Computational Paralinguistics Challenge: A Summary of Results

Stefan Steidl; FAU Erlangen-Nürnberg, Germany
Thu-SS-10-10-7, Time: 15:00–15:15

(No abstract available at the time of publication)

Discussion

Björn Schuller 1, Anton Batliner 2; 1Imperial College London, UK; 2Universität Passau, Germany
Thu-SS-10-10-8, Time: 15:15–15:30

(No abstract available at the time of publication)

Thu-O-9-1 : Discriminative Training for ASR
Aula Magna, 10:00–12:00, Thursday, 24 Aug. 2017
Chairs: Hagen Soltau, William Hartmann

Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition

Shubham Toshniwal, Hao Tang, Liang Lu, Karen Livescu; TTIC, USA
Thu-O-9-1-1, Time: 10:00–10:20

End-to-end training of deep learning-based models allows for implicit learning of intermediate representations based on the final task loss. However, the end-to-end approach ignores the useful domain knowledge encoded in explicit intermediate-level supervision. We hypothesize that using intermediate representations as auxiliary supervision at lower levels of deep networks may be a good way of combining the advantages of end-to-end training and more traditional pipeline approaches. We present experiments on conversational speech recognition where we use lower-level tasks, such as phoneme recognition, in a multitask training approach with an encoder-decoder model for direct character transcription. We compare multiple types of lower-level tasks and analyze the effects of the auxiliary tasks. Our results on the Switchboard corpus show that this approach improves recognition accuracy over a standard encoder-decoder model on the Eval2000 test set.

Optimizing Expected Word Error Rate via Sampling for Speech Recognition

Matt Shannon; Google, USA
Thu-O-9-1-2, Time: 10:20–10:40

State-level minimum Bayes risk (sMBR) training has become the de facto standard for sequence-level training of speech recognition acoustic models. It has an elegant formulation using the expectation semiring, and gives large improvements in word error rate (WER) over models trained solely using cross-entropy (CE) or connectionist temporal classification (CTC). sMBR training optimizes the expected number of frames at which the reference and hypothesized acoustic states differ. It may be preferable to optimize the expected WER, but WER does not interact well with the expectation semiring, and previous approaches based on computing expected WER exactly involve expanding the lattices used during training. In this paper we show how to perform optimization of the expected WER by sampling paths from the lattices used during conventional sMBR training. The gradient of the expected WER is itself an expectation, and so may be approximated using Monte Carlo sampling. We show experimentally that optimizing WER during acoustic model training gives a 5% relative improvement in WER over a well-tuned sMBR baseline on a 2-channel query recognition task (Google Home).
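As a toy illustration of the Monte Carlo idea above (estimating an expected WER by sampling paths according to their lattice posteriors), the sketch below averages edit distances over sampled hypotheses; the paths, probabilities and reference are hypothetical, and the paper's actual gradient computation is not reproduced.

    # Monte Carlo estimate of the expected WER under a distribution over paths.
    import random

    def edit_distance(ref, hyp):
        """Standard Levenshtein distance between two word sequences."""
        d = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, d[0] = d[0], i
            for j, h in enumerate(hyp, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
        return d[-1]

    def expected_wer_mc(paths, probs, reference, num_samples=1000, rng=random):
        """Average WER of hypotheses sampled according to their posterior probabilities."""
        total = 0.0
        for _ in range(num_samples):
            hyp = rng.choices(paths, weights=probs, k=1)[0]
            total += edit_distance(reference, hyp) / max(len(reference), 1)
        return total / num_samples

    reference = "turn on the kitchen lights".split()
    paths = [reference, "turn on the kitchen light".split(), "turn off the kitchen lights".split()]
    probs = [0.7, 0.2, 0.1]
    print(expected_wer_mc(paths, probs, reference))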

Annealed f-Smoothing as a Mechanism to Speed up Neural Network Training

Tara N. Sainath, Vijayaditya Peddinti, Olivier Siohan, Arun Narayanan; Google, USA
Thu-O-9-1-3, Time: 10:40–11:00

In this paper, we describe a method to reduce the overall number of neural network training steps, during both the cross-entropy and sequence training stages. This is achieved through the interpolation of frame-level CE and sequence-level sMBR criteria during the sequence training stage. This interpolation is known as f-smoothing and has previously been used just to prevent overfitting during sequence training. However, in this paper, we investigate its application to reduce the training time. We explore different interpolation strategies to reduce the overall training steps, and achieve a reduction of up to 25% with almost no degradation in word error rate (WER). Finally, we explore the generalization of f-smoothing to other tasks.
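A minimal sketch of f-smoothing viewed as loss interpolation, with a hypothetical annealing schedule for the cross-entropy weight (the paper's exact strategies are not reproduced):

    # Interpolate frame-level CE and sequence-level sMBR losses during sequence
    # training, annealing the CE weight from alpha_start to alpha_end.
    def f_smoothed_loss(ce_loss, smbr_loss, step, total_steps,
                        alpha_start=0.5, alpha_end=0.0):
        progress = min(step / float(total_steps), 1.0)
        alpha = alpha_start + (alpha_end - alpha_start) * progress
        return alpha * ce_loss + (1.0 - alpha) * smbr_loss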

Non-Uniform MCE Training of Deep Long Short-Term Memory Recurrent Neural Networks for Keyword Spotting

Zhong Meng, Biing-Hwang Juang; Georgia Institute of Technology, USA
Thu-O-9-1-4, Time: 11:00–11:20

It has been shown in [1, 2] that improved performance can be achieved by formulating keyword spotting as a non-uniform error automatic speech recognition problem. In this work, we discriminatively train a deep bidirectional long short-term memory (BLSTM) — hidden Markov model (HMM) based acoustic model with a non-uniform boosted minimum classification error (BMCE) criterion, which imposes a more significant error cost on the keywords than on the non-keywords. By introducing the BLSTM, the context information in both the past and the future is stored and updated to predict the desired output, and the long-term dependencies within the speech signal are well captured. With the non-uniform BMCE objective, the BLSTM is trained so that the recognition errors related to the keywords are remarkably reduced. The BLSTM is optimized using back-propagation through time and stochastic gradient descent. The keyword spotting system is implemented within a weighted finite state transducer framework. The proposed method achieves 5.49% and 7.37% absolute figure-of-merit improvements over the BLSTM and the feedforward deep neural network baseline systems, respectively, both trained with the cross-entropy criterion, for the keyword spotting task on the Switchboard-1 Release 2 dataset.

Exploiting Eigenposteriors for Semi-Supervised Training of DNN Acoustic Models with Sequence Discrimination

Pranay Dighe, Afsaneh Asaei, Hervé Bourlard; Idiap Research Institute, Switzerland
Thu-O-9-1-5, Time: 11:20–11:40

Deep neural network (DNN) acoustic models yield posterior probabilities of senone classes. Recent studies support the existence of low-dimensional subspaces underlying senone posteriors. Principal component analysis (PCA) is applied to identify eigenposteriors and perform a low-dimensional projection of the training data posteriors. The resulting enhanced posteriors are used as soft targets for training a better DNN acoustic model under the student-teacher framework. The present work advances this approach by studying the incorporation of sequence discriminative training. We demonstrate how to combine the gains from eigenposterior based enhancement with sequence discrimination to improve ASR using semi-supervised training. Evaluation on the AMI meeting corpus yields nearly a 4% absolute reduction in word error rate (WER) compared to the baseline DNN trained with the cross-entropy objective. In this context, eigenposterior enhancement of the soft targets is crucial to enable additive improvement using out-of-domain untranscribed data.

Discriminative Autoencoders for Acoustic Modeling

Ming-Han Yang 1, Hung-Shin Lee 1, Yu-Ding Lu 1, Kuan-Yu Chen 1, Yu Tsao 1, Berlin Chen 2, Hsin-Min Wang 1; 1Academia Sinica, Taiwan; 2National Taiwan Normal University, Taiwan
Thu-O-9-1-6, Time: 11:40–12:00

Speech data typically contain information irrelevant to automatic speech recognition (ASR), such as speaker variability and channel/environmental noise, lurking deep within acoustic features. Such unwanted information is always mixed together and stunts the development of an ASR system. In this paper, we propose a new framework based on autoencoders for acoustic modeling in ASR. Unlike other variants of autoencoder neural networks, our framework is able to isolate phonetic components from a speech utterance by simultaneously taking two kinds of objectives into consideration. The first one relates to the minimization of reconstruction errors and helps learn the most salient and useful properties of the data. The second one functions in the middlemost code layer, where the categorical distribution of the context-dependent phone states is estimated for phoneme discrimination and the derivation of acoustic scores, the proximity relationships among utterances spoken by the same speaker are preserved, and the intra-utterance noise is modeled and abstracted away. We describe the implementation of the discriminative autoencoders for training tri-phone acoustic models and present TIMIT phone recognition results, which demonstrate that our proposed method outperforms the conventional DNN-based approach.

Thu-O-9-2 : Speaker Diarization
A2, 10:00–12:00, Thursday, 24 Aug. 2017
Chairs: Eduardo Lleida, Kai Yu

Speaker Diarization Using Convolutional Neural Network for Statistics Accumulation Refinement

Zbynek Zajíc, Marek Hrúz, Ludek Müller; University of West Bohemia, Czech Republic
Thu-O-9-2-1, Time: 10:00–10:20

The aim of this paper is to investigate the benefit of information from a speaker change detection system based on a Convolutional Neural Network (CNN) when applied to the process of accumulating statistics for i-vector generation. The investigation is carried out on the problem of diarization. In our system, the output of the CNN is the probability of a speaker change in a conversation for a given time segment. According to this probability, we cut the conversation into short segments that are then represented by i-vectors (to describe the speaker in them). We propose a technique that utilizes the information from the CNN to weight the acoustic data in a segment and thereby refine the statistics accumulation process. This technique enables us to represent the speaker better in the final i-vector. The experiments on the English part of the CallHome corpus show that our proposed refinement of the statistics accumulation is beneficial, with a relative improvement in Diarization Error Rate of almost 16% when compared to the speaker diarization system without statistics refinement.

Speaker2Vec: Unsupervised Learning and Adaptation of a Speaker Manifold Using Deep Neural Networks with an Evaluation on Speaker Segmentation

Arindam Jati, Panayiotis Georgiou; University of Southern California, USA
Thu-O-9-2-2, Time: 10:20–10:40

This paper presents a novel approach, which we term Speaker2Vec, to derive a speaker-characteristics manifold learned in an unsupervised manner. The proposed representation can be employed in different applications such as diarization, speaker identification or, as in our evaluation test case, speaker segmentation. Speaker2Vec exploits large amounts of unlabeled training data and the assumption of short-term active-speaker stationarity to derive a speaker embedding using Deep Neural Networks (DNN). We assume that temporally-near speech segments belong to the same speaker, and as such a joint representation connecting these nearby segments can encode their common information. Thus, this bottleneck representation will capture mainly speaker-specific information. Such training can take place in a completely unsupervised manner. For testing, our trained model generates the embeddings for the test audio and applies a simple distance metric to detect speaker-change points. The paper also proposes a strategy for unsupervised adaptation of the DNN models to the application domain. The proposed method outperforms state-of-the-art speaker segmentation algorithms and MFCC based baseline methods on four evaluation datasets, while it allows for further improvements by employing this embedding in supervised training methods.

A Triplet Ranking-Based Neural Network for Speaker Diarization and Linking

Gaël Le Lan 1, Delphine Charlet 1, Anthony Larcher 2, Sylvain Meignier 2; 1Orange Labs, France; 2LIUM (EA 4023), France
Thu-O-9-2-3, Time: 10:40–11:00

This paper investigates a novel neural scoring method, based on conventional i-vectors, to perform speaker diarization and linking of large collections of recordings. Using a triplet loss for training, the network projects i-vectors into a space that better separates speakers in terms of cosine similarity. Experiments are run on two French TV collections built from the REPERE [1] and ETAPE [2] campaign corpora, the system being trained on French radio data. Results indicate that the proposed approach outperforms conventional cosine and Probabilistic Linear Discriminant Analysis scoring methods on both within- and cross-recording diarization tasks, with a Diarization Error Rate reduction of 14% on average.
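A minimal sketch of a cosine-similarity triplet ranking loss of the kind described above, with a hypothetical margin value (illustrative only, not the authors' training code):

    # Hinge loss that pushes same-speaker i-vectors closer (in cosine similarity)
    # than different-speaker i-vectors, by at least `margin`.
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def triplet_loss(anchor, positive, negative, margin=0.2):
        return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))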

Estimating Speaker Clustering Quality Using Logistic Regression

Yishai Cohen, Itshak Lapidot; Afeka Tel Aviv Academic College of Engineering, Israel
Thu-O-9-2-4, Time: 11:00–11:20

This paper focuses on estimating clustering validity by using logistic regression. For many applications it might be important to estimate the quality of the clustering, e.g. in the case of clustering speech segments, to decide whether to use the clustered data for speaker verification. In the case of clustering short speaker segments, the common criteria for cluster validity are average cluster purity (ACP), average speaker purity (ASP) and K, the geometric mean of the two measures. In practice, true labels are not available for evaluation, hence these measures have to be estimated from the clustering itself. In this paper, mean-shift clustering with a PLDA score is applied in order to cluster short speaker segments represented as i-vectors. Different statistical parameters are then estimated on the clustered data and are used to train logistic regression models to estimate ACP, ASP and K. It was found that logistic regression can be a good predictor of the actual ACP, ASP and K, and yields reasonable information regarding the clustering quality.
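A simplified illustration of the idea (not the authors' exact setup): clustering-derived statistics are used to predict whether a clustering is good enough to reuse, with a scikit-learn logistic regression model. The feature names and the quality threshold are hypothetical.

    # Predict clustering quality from statistics of the clustered data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row: statistics from one clustering run, e.g.
    # [number of clusters, mean within-cluster PLDA score, mean cluster size].
    X_train = np.array([[12, 0.81, 30.2], [40, 0.35, 9.5], [18, 0.74, 22.1], [55, 0.22, 6.3]])
    # 1 if the true K (geometric mean of ACP and ASP) exceeded a chosen threshold.
    y_train = np.array([1, 0, 1, 0])

    model = LogisticRegression().fit(X_train, y_train)
    print(model.predict_proba([[20, 0.70, 20.0]])[0, 1])  # estimated probability of a "good" clustering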

Combining Speaker Turn Embedding and Incremental Structure Prediction for Low-Latency Speaker Diarization

Guillaume Wisniewksi, Hervé Bredin, G. Gelly, Claude Barras; LIMSI, France
Thu-O-9-2-5, Time: 11:20–11:40

Real-time speaker diarization has many potential applications, including public security, biometrics or forensics. It can also significantly speed up the indexing of increasingly large multimedia archives. In this paper, we address the issue of low-latency speaker diarization, which consists in continuously detecting new or reoccurring speakers within an audio stream, and determining when each speaker is active with a low latency (e.g. every second). This is in contrast with most existing approaches in speaker diarization that rely on multiple passes over the complete audio recording. The proposed approach combines speaker turn neural embeddings with an incremental structure prediction approach inspired by state-of-the-art Natural Language Processing models for Part-of-Speech tagging and dependency parsing. It can therefore leverage both information describing the utterance and the inherent temporal structure of interactions between speakers to learn, in a supervised framework, to identify speakers. Experiments on the Etape broadcast news benchmark validate the approach.

pyannote.metrics: A Toolkit for Reproducible Evaluation, Diagnostic, and Error Analysis of Speaker Diarization Systems

Hervé Bredin; LIMSI, France
Thu-O-9-2-6, Time: 11:40–12:00

pyannote.metrics is an open-source Python library aimed at researchers working in the wide area of speaker diarization. It provides a command line interface (CLI) to improve reproducibility and comparison of speaker diarization research results. Through its application programming interface (API), a large set of evaluation metrics is available for diagnostic purposes of all modules of typical speaker diarization pipelines (speech activity detection, speaker change detection, clustering, and identification). Finally, thanks to visualization capabilities, we show that it can also be used for detailed error analysis purposes. pyannote.metrics can be downloaded from http://pyannote.github.io.
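A minimal usage sketch of the library's evaluation API, following its documented pattern for computing a diarization error rate from reference and hypothesis annotations (exact class locations and behaviour may differ between versions):

    # Compare a hypothesis against a reference annotation with pyannote.metrics.
    from pyannote.core import Annotation, Segment
    from pyannote.metrics.diarization import DiarizationErrorRate

    reference = Annotation()
    reference[Segment(0.0, 10.0)] = "speaker_A"
    reference[Segment(10.0, 20.0)] = "speaker_B"

    hypothesis = Annotation()
    hypothesis[Segment(0.0, 12.0)] = "spk1"
    hypothesis[Segment(12.0, 20.0)] = "spk2"

    metric = DiarizationErrorRate()
    print(metric(reference, hypothesis))  # diarization error rate for this file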

Thu-O-9-4 : Spoken Term Detection
B4, 10:00–12:00, Thursday, 24 Aug. 2017
Chairs: Sanjeev Khudanpur, Murat Saraclar

A Rescoring Approach for Keyword Search Using Lattice Context Information

Zhipeng Chen, Ji Wu; Tsinghua University, China
Thu-O-9-4-1, Time: 10:00–10:20

In this paper we present a rescoring approach for keyword search (KWS) based on neural networks (NN). This approach exploits only the lattice context in a detected time interval instead of its corresponding audio. The most informative arcs in the lattice context are selected and represented as a matrix, where words on arcs are represented in an embedding space with respect to their pronunciations. Then convolutional neural networks (CNNs) are employed to capture distinctive features from this matrix. A rescoring model is trained to minimize a term-weighted sigmoid cross entropy so as to match the evaluation metric. Experiments on single-word queries show that the lattice context brings complementary gains over normalized posterior scores. Performance on both in-vocabulary (IV) and out-of-vocabulary (OOV) queries is improved by combining NN-based scores with standard posterior scores.


The Kaldi OpenKWS System: Improving Low Resource Keyword Search

Jan Trmal, Matthew Wiesner, Vijayaditya Peddinti, Xiaohui Zhang, Pegah Ghahremani, Yiming Wang, Vimal Manohar, Hainan Xu, Daniel Povey, Sanjeev Khudanpur; Johns Hopkins University, USA
Thu-O-9-4-2, Time: 10:20–10:40

The IARPA BABEL program has stimulated worldwide research in keyword search technology for low resource languages, and the NIST OpenKWS evaluations are the de facto benchmark test for such capabilities. The 2016 OpenKWS evaluation featured Georgian speech, and had 10 participants from across the world. This paper describes the Kaldi system developed to assist IARPA in creating a competitive baseline against which participants were evaluated, and to provide a truly open source system to all participants to support their research. This system handily met the BABEL program goals of 0.60 ATWV and 50% WER, achieving 0.70 ATWV and 38% WER with a single ASR system, i.e. without ASR system combination. All except one OpenKWS participant used Kaldi components in their submissions, typically in conjunction with system combination. This paper therefore complements all other OpenKWS-based papers.

The STC Keyword Search System for OpenKWS 2016 Evaluation

Yuri Khokhlov 1, Ivan Medennikov 1, Aleksei Romanenko 1, Valentin Mendelev 1, Maxim Korenevsky 1, Alexey Prudnikov 2, Natalia Tomashenko 3, Alexander Zatvornitsky 1; 1STC-innovations, Russia; 2Mail.Ru Group, Russia; 3LIUM (EA 4023), France
Thu-O-9-4-3, Time: 10:40–11:00

This paper describes the keyword search system developed by the STC team in the framework of the OpenKWS 2016 evaluation. The acoustic modeling techniques included i-vector based speaker adaptation, multilingual speaker-dependent bottleneck features, and a combination of feedforward and recurrent neural networks. To improve the language model, we augmented the training data provided by the organizers with texts generated by character-level recurrent neural networks trained on different data sets. This led to substantial reductions in the out-of-vocabulary (OOV) and word error rates. The OOV search problem was solved with the help of a novel approach based on lattice-generated phone posteriors and a highly optimized decoder. This approach outperformed familiar OOV search implementations in terms of speed and demonstrated comparable or better search quality.

The system was among the top three systems in the evaluation.

Compressed Time Delay Neural Network for Small-Footprint Keyword Spotting

Ming Sun 1, David Snyder 2, Yixin Gao 1, Varun Nagaraja 1, Mike Rodehorst 1, Sankaran Panchapagesan 1, Nikko Strom 1, Spyros Matsoukas 1, Shiv Vitaladevuni 1; 1Amazon.com, USA; 2Johns Hopkins University, USA
Thu-O-9-4-4, Time: 11:00–11:20

In this paper we investigate a time delay neural network (TDNN) for a keyword spotting task that requires low CPU, memory and latency. The TDNN is trained with transfer learning and multi-task learning. Temporal subsampling enabled by the time delay architecture reduces computational complexity. We propose to apply singular value decomposition (SVD) to further reduce TDNN complexity. This allows us to first train a larger full-rank TDNN model which is not limited by CPU/memory constraints. The larger TDNN usually achieves better performance. Afterwards, its size can be compressed by SVD to meet the budget requirements. Hidden Markov models (HMM) are used in conjunction with the networks to perform keyword detection and performance is measured in terms of area under the curve (AUC) for detection error tradeoff (DET) curves. Our experimental results on a large in-house far-field corpus show that the full-rank TDNN achieves a 19.7% DET AUC reduction compared to a similar-size deep neural network (DNN) baseline. If we train a larger size full-rank TDNN first and then reduce it via SVD to the comparable size of the DNN, we obtain a 37.6% reduction in DET AUC compared to the DNN baseline.
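
The SVD step described in this abstract can be illustrated with a short, generic sketch: a trained full-rank weight matrix is factored and truncated to rank k, so one affine layer is replaced by two thinner ones. This is not the authors' code; the matrix sizes and target rank below are arbitrary.

```python
# Generic sketch of SVD-based layer compression (numpy; sizes and rank are
# arbitrary, not taken from the paper).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 512))   # trained full-rank layer weights
k = 64                                  # target rank / compression budget

# Truncated SVD: W (1024x512) ~= A (1024xk) @ B (kx512)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]        # absorb singular values into the first factor
B = Vt[:k, :]

approx_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
params_before = W.size
params_after = A.size + B.size
print(f"relative approximation error: {approx_error:.3f}")
print(f"parameters: {params_before} -> {params_after} "
      f"({100 * params_after / params_before:.1f}%)")
```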

Symbol Sequence Search from Telephone Conversation

Masayuki Suzuki 1, Gakuto Kurata 1, Abhinav Sethy 2, Bhuvana Ramabhadran 2, Kenneth W. Church 2, Mark Drake 2; 1IBM, Japan; 2IBM, USA
Thu-O-9-4-5, Time: 11:20–11:40

We propose a method for searching for symbol sequences in conversations. Symbol sequences can include phone numbers, credit card numbers, and any kind of ticket (identification) numbers, and are often communicated in call center conversations. Automatic extraction of these from speech is a key to many automatic speech recognition (ASR) applications such as question answering and summarization. Compared with spoken term detection (STD), symbol sequence searches have two additional problems. First, the entire symbol sequence is typically not observed continuously but in subsequences, where customers or agents speak these sequences in fragments, while the recipient repeats them to ensure they have the correct sequence. Second, we have to distinguish between different symbol sequences, for example, phone numbers versus ticket numbers or customer identification numbers. To deal with these problems, we propose to apply STD to symbol-sequence fragments and subsequently use confidence scoring to obtain the entire symbol sequence. For the confidence scoring, we propose a long short-term memory (LSTM) based approach that takes as input the words before and after fragments. We also propose to detect repetitions of fragments and use them for confidence scoring. Our proposed method achieves a 0.87 F-measure in an eight-digit customer identification number search task, when operating at 20.3% WER.

Similarity Learning Based Query Modeling for Keyword Search

Batuhan Gundogdu, Murat Saraclar; Bogaziçi Üniversitesi, Turkey
Thu-O-9-4-6, Time: 11:40–12:00

In this paper, we propose a novel approach for query modeling using neural networks for posteriorgram based keyword search (KWS). We aim to help the conventional large vocabulary continuous speech recognition (LVCSR) based KWS systems, especially on out-of-vocabulary (OOV) terms, by converting the task into a template matching problem, just like the query-by-example retrieval tasks. For this, we use a dynamic time warping (DTW) based similarity search on the speaker independent posteriorgram space. In order to model the text queries as posteriorgrams, we propose a non-symmetric Siamese neural network structure which learns both a distance measure to be used in DTW and the frame representations for this specific measure. We compare this new technique with similar DTW based systems using other distance measures and query modeling techniques. We also apply system fusion of the proposed system with the LVCSR based baseline KWS system. We show that the proposed system works significantly better than other similar systems. Furthermore, when combined with the LVCSR based baseline, the proposed system provides up to 37.9% improvement on OOV terms and 9.8% improvement on all terms.
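
A minimal, self-contained sketch of the DTW similarity search underlying this kind of system is shown below; a plain Euclidean frame distance stands in for the learned Siamese measure proposed in the paper, and the posteriorgrams are random toy data.

```python
# Minimal DTW between two posteriorgram-like matrices (numpy).
# The learned Siamese distance from the paper is replaced here by a plain
# Euclidean frame distance; the inputs are random toy data.
import numpy as np

def dtw_distance(query, search, frame_dist=None):
    """Return the length-normalised DTW cost between query (Tq x D) and search (Ts x D)."""
    if frame_dist is None:
        frame_dist = lambda a, b: np.linalg.norm(a - b)
    Tq, Ts = len(query), len(search)
    D = np.full((Tq + 1, Ts + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tq + 1):
        for j in range(1, Ts + 1):
            cost = frame_dist(query[i - 1], search[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[Tq, Ts] / (Tq + Ts)

rng = np.random.default_rng(1)
query_post = rng.dirichlet(np.ones(40), size=30)    # 30 frames, 40 phone posteriors
search_post = rng.dirichlet(np.ones(40), size=200)  # 200-frame search utterance
print(f"normalised DTW cost: {dtw_distance(query_post, search_post):.3f}")
```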

Thu-O-9-6 : Noise Reduction
C6, 10:00–12:00, Thursday, 24 Aug. 2017
Chairs: Yan Huang, Tim Fingscheidt

Deep Recurrent Neural Network Based Monaural Speech Separation Using Recurrent Temporal Restricted Boltzmann Machines

Suman Samui, Indrajit Chakrabarti, Soumya K. Ghosh; IIT Kharagpur, India
Thu-O-9-6-1, Time: 10:00–10:20

This paper presents a single-channel speech separation method implemented with a deep recurrent neural network (DRNN) using recurrent temporal restricted Boltzmann machines (RTRBM). Although deep neural network (DNN) based speech separation (denoising task) methods perform quite well compared to the conventional statistical model based speech enhancement techniques, in DNN-based methods the temporal correlations across speech frames are often ignored, resulting in loss of spectral detail in the reconstructed output speech. In order to alleviate this issue, one RTRBM is employed for modelling the acoustic features of the input (mixture) signal and two RTRBMs are trained for the two training targets (source signals). Each RTRBM attempts to model the abstractions present in the training data at each time step as well as the temporal dependencies in the training data. The entire network (consisting of three RTRBMs and one recurrent neural network) can be fine-tuned by the joint optimization of the DRNN with an extra masking layer which enforces a reconstruction constraint. The proposed method has been evaluated on the IEEE corpus and the TIMIT dataset for the speech denoising task. Experimental results have established that the proposed approach outperforms NMF and conventional DNN and DRNN-based speech enhancement methods.

Improved Codebook-Based Speech Enhancement Based on MBE Model

Qizheng Huang, Changchun Bao, Xianyun Wang; Beijing University of Technology, China
Thu-O-9-6-2, Time: 10:20–10:40

This paper provides an improved codebook-based speech enhancement method using the multi-band excitation (MBE) model. It aims to remove the noise between the harmonics, which may exist in codebook-based enhanced speech. In general, the proposed system is based on an analysis-with-synthesis (AwS) framework. During the analysis stage, acoustic features including pitch, harmonic magnitude and voicing are extracted from noisy speech. These parameters are obtained on the basis of the spectral magnitudes obtained by the codebook-based method. During the synthesis stage, different synthesis strategies for voiced and unvoiced speech are employed. Besides, this paper introduces speech presence probability to modify the codebook-based Wiener filter so that more accurate acoustic parameters can be obtained. The proposed system can eliminate noise not only between the harmonics, but also in the silent segments, especially in low SNR noise environments. Experiments show that the performance of the proposed method is better than the traditional codebook-based method for different types of noise.

Improving Mask Learning Based Speech Enhancement System with Restoration Layers and Residual Connection

Zhuo Chen, Yan Huang, Jinyu Li, Yifan Gong; Microsoft, USA
Thu-O-9-6-3, Time: 10:40–11:00

For single-channel speech enhancement, the mask learning based approach through neural networks has been shown to outperform the feature mapping approach, and to be effective as a pre-processor for automatic speech recognition. However, its assumption that the mixture and clean reference must have the correspondent scale does not hold for data collected in the real world, and thus leads to significant performance degradation on parallel recorded data. In this paper, we first extend the mask learning based speech enhancement by integrating two types of restoration layer to address the scale mismatch problem. We further propose a novel residual learning based speech enhancement model via adding different shortcut connections to a feature mapping network. We show such a structure can benefit from both the mask learning and the feature mapping. We evaluate the proposed speech enhancement models on CHiME 3 data. Without retraining the acoustic model, the best bidirectional LSTM with residual connections yields 24.90% relative WER reduction on real data and 34.57% WER on simulated data.

Exploring Low-Dimensional Structures of Modulation Spectra for Robust Speech Recognition

Bi-Cheng Yan 1, Chin-Hong Shih 1, Shih-Hung Liu 2, Berlin Chen 1; 1National Taiwan Normal University, Taiwan; 2Delta Research Center, Taiwan
Thu-O-9-6-4, Time: 11:00–11:20

Developments of noise robustness techniques are vital to the success of automatic speech recognition (ASR) systems in the face of varying sources of environmental interference. Recent studies have shown that exploring low-dimensional structures of speech features can yield good robustness. Along this vein, research on low-rank representation (LRR), which considers the intrinsic structures of speech features lying on some low dimensional subspaces, has gained considerable interest from the ASR community. When speech features are contaminated with various types of environmental noise, their corresponding modulation spectra can be regarded as superpositions of unstructured sparse noise over the inherent linguistic information. As such, we in this paper endeavor to explore the low dimensional structures of modulation spectra, in the hope of obtaining more noise-robust speech features. The main contribution is that we propose a novel use of the LRR-based method to discover the subspace structures of modulation spectra, thereby alleviating the negative effects of noise interference. Furthermore, we also extensively compare our approach with several well-practiced feature-based normalization methods. All experiments were conducted and verified on the Aurora-4 database and task. The empirical results show that the proposed LRR-based method can provide significant word error reductions for a typical DNN-HMM hybrid ASR system.

SEGAN: Speech Enhancement Generative Adversarial Network

Santiago Pascual 1, Antonio Bonafonte 1, Joan Serrà 2; 1Universitat Politècnica de Catalunya, Spain; 2Telefónica I+D, Spain
Thu-O-9-6-5, Time: 11:20–11:40

Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level feature. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are being increasingly used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm its effectiveness. With that, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.

Concatenative Resynthesis Using Twin Networks

Soumi Maiti, Michael I. Mandel; CUNY Graduate Center, USA
Thu-O-9-6-6, Time: 11:40–12:00

Traditional noise reduction systems modify a noisy signal to make it more like the original clean signal. For speech, these methods suffer from two main problems: under-suppression of noise and over-suppression of target speech. Instead, synthesizing clean speech based on the noisy signal could produce outputs that are both noise-free and high quality. Our previous work introduced such a system using concatenative synthesis, but it required processing the clean speech at run time, which was slow and not scalable. In order to make such a system scalable, we propose here learning a similarity metric using two separate networks, one network processing the clean segments offline and another processing the noisy segments at run time. This system incorporates a ranking loss to optimize for the retrieval of appropriate clean speech segments. This model is compared against our original on the CHiME2-GRID corpus, measuring ranking performance and subjective listening tests of resyntheses.

Thu-O-9-8 : Speech Recognition: Multimodal Systems
D8, 10:00–12:00, Thursday, 24 Aug. 2017
Chairs: Patrick Wambacq, Florian Metze

Combining Residual Networks with LSTMs for Lipreading

Themos Stafylakis, Georgios Tzimiropoulos; University of Nottingham, UK
Thu-O-9-8-1, Time: 10:00–10:20

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500 target words consisting of 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains word accuracy equal to 83.0%, yielding 6.8% absolute improvement over the current state-of-the-art, without using information about word boundaries during training or testing.

Improving Computer Lipreading via DNN Sequence Discriminative Training Techniques

Kwanchiva Thangthai, Richard Harvey; University of East Anglia, UK
Thu-O-9-8-2, Time: 10:20–10:40

Although there have been some promising results in computer lipreading, there has been a paucity of data on which to train automatic systems. However the recent emergence of the TCD-TIMIT corpus, with around 6000 words, 59 speakers and seven hours of recorded audio-visual speech, allows the deployment of more recent techniques in audio-speech such as Deep Neural Networks (DNNs) and sequence discriminative training.

In this paper we combine the DNN with a Hidden Markov Model (HMM) in the so-called hybrid DNN-HMM configuration, which we train using a variety of sequence discriminative training methods. This is then followed with a weighted finite state transducer. The conclusion is that the DNN offers a very substantial improvement over a conventional classifier which uses a Gaussian Mixture Model (GMM) to model the densities, even when optimised with Speaker Adaptive Training. Sequence adaptive training offers further improvements depending on the precise variety employed, but those improvements are of the order of 10% in word accuracy. Putting these two results together implies that lipreading is moving from something of rather esoteric interest to becoming a practical reality in the foreseeable future.

Improving Speaker-Independent Lipreading with Domain-Adversarial Training

Michael Wand, Jürgen Schmidhuber; IDSIA, Switzerland
Thu-O-9-8-3, Time: 10:40–11:00

We present a Lipreading system, i.e. a speech recognition system using only visual features, which uses domain-adversarial training for speaker independence. Domain-adversarial training is integrated into the optimization of a lipreader based on a stack of feedforward and LSTM (Long Short-Term Memory) recurrent neural networks, yielding an end-to-end trainable system which only requires a very small number of frames of untranscribed target data to substantially improve the recognition accuracy on the target speaker. On pairs of different source and target speakers, we achieve a relative accuracy improvement of around 40% with only 15 to 20 seconds of untranscribed target speech data. On multi-speaker training setups, the accuracy improvements are smaller but still substantial.

Turbo Decoders for Audio-Visual Continuous Speech Recognition

Ahmed Hussen Abdelaziz; ICSI, USA
Thu-O-9-8-4, Time: 11:00–11:20

Visual speech, i.e., video recordings of speakers' mouths, plays an important role in improving the robustness properties of automatic speech recognition (ASR) against noise. Optimal fusion of audio and video modalities is still one of the major challenges that attracts significant interest in the realm of audio-visual ASR. Recently, turbo decoders (TDs) have been successful in addressing the audio-visual fusion problem. The idea of the TD framework is to iteratively exchange some kind of soft information between the audio and video decoders until convergence. The forward-backward algorithm (FBA) is mostly applied to the decoding graphs to estimate this soft information. Applying the FBA to the complex graphs that are usually used in large vocabulary tasks may be computationally expensive. In this paper, I propose to apply the forward-backward algorithm to a lattice of most likely state sequences instead of using the entire decoding graph. Using lattices allows for TD to be easily applied to large vocabulary tasks. The proposed approach is evaluated using the newly released TCD-TIMIT corpus, where a standard recipe for large vocabulary ASR is employed. The modified TD performs significantly better than the feature and decision fusion models in all clean and noisy test conditions.

DNN-Based Ultrasound-to-Speech Conversion for a Silent Speech Interface

Tamás Gábor Csapó 1, Tamás Grósz 2, Gábor Gosztolya 2, László Tóth 3, Alexandra Markó 4; 1BME, Hungary; 2University of Szeged, Hungary; 3MTA-SZTE RGAI, Hungary; 4MTA-ELTE LingArt, Hungary
Thu-O-9-8-5, Time: 11:20–11:40

In this paper we present our initial results in articulatory-to-acoustic conversion based on tongue movement recordings using Deep Neural Networks (DNNs). Despite the fact that deep learning has revolutionized several fields, so far only a few researchers have applied DNNs for this task. Here, we compare various possible feature representation approaches combined with DNN-based regression. As the input, we recorded synchronized 2D ultrasound images and speech signals. The task of the DNN was to estimate Mel-Generalized Cepstrum-based Line Spectral Pair (MGC-LSP) coefficients, which then served as input to a standard pulse-noise vocoder for speech synthesis. As the raw ultrasound images have a relatively high resolution, we experimented with various feature selection and transformation approaches to reduce the size of the feature vectors. The synthetic speech signals resulting from the various DNN configurations were evaluated both using objective measures and a subjective listening test. We found that the representation that used several neighboring image frames in combination with a feature selection method was preferred both by the subjects taking part in the listening experiments, and in terms of the Normalized Mean Squared Error. Our results may be useful for creating Silent Speech Interface applications in the future.

Visually Grounded Learning of Keyword Prediction from Untranscribed Speech

Herman Kamper, Shane Settle, Gregory Shakhnarovich, Karen Livescu; TTIC, USA
Thu-O-9-8-6, Time: 11:40–12:00

During language acquisition, infants have the benefit of visual cues to ground spoken language. Robots similarly have access to audio and visual sensors. Recent work has shown that images and spoken captions can be mapped into a meaningful common space, allowing images to be retrieved using speech and vice versa. In this setting of images paired with untranscribed spoken captions, we consider whether computer vision systems can be used to obtain textual labels for the speech. Concretely, we use an image-to-words multi-label visual classifier to tag images with soft textual labels, and then train a neural network to map from the speech to these soft targets. We show that the resulting speech system is able to predict which words occur in an utterance — acting as a spoken bag-of-words classifier — without seeing any parallel speech and text. We find that the model often confuses semantically related words, e.g. "man" and "person", making it even more effective as a semantic keyword spotter.
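
A minimal sketch of the training objective described here — regressing from pooled speech features to the soft word targets produced by a visual tagger — might look as follows; the feature dimension, vocabulary size, mean-pooling, and data are assumptions made for illustration only.

```python
# Sketch of mapping pooled speech features to soft bag-of-words targets
# (PyTorch). Dimensions, pooling, and data are illustrative assumptions.
import torch
import torch.nn as nn

speech_dim, vocab_size = 40, 1000          # e.g. filterbank dim, tagger vocabulary

model = nn.Sequential(
    nn.Linear(speech_dim, 512), nn.ReLU(),
    nn.Linear(512, vocab_size),            # logits over the visual tagger vocabulary
)
criterion = nn.BCEWithLogitsLoss()         # soft targets in [0, 1]
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: 8 utterances of 300 frames, mean-pooled over time,
# paired with soft labels from an image tagger (random here).
speech = torch.randn(8, 300, speech_dim).mean(dim=1)
soft_targets = torch.rand(8, vocab_size)

logits = model(speech)
loss = criterion(logits, soft_targets)
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```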

Thu-O-10-1 : Neural Network Acoustic Models for ASR 3
Aula Magna, 13:30–15:30, Thursday, 24 Aug. 2017
Chairs: Bhuvana Ramabhadran, Rohit Prabhavalkar

Deep Neural Factorization for Speech Recognition

Jen-Tzung Chien, Chen Shen; National Chiao Tung University, Taiwan
Thu-O-10-1-1, Time: 13:30–13:50

Conventional speech recognition systems are constructed by unfolding the spectral-temporal input matrices into one-way vectors and using these vectors to estimate the affine parameters of a neural network according to the vector-based error back-propagation algorithm. System performance is constrained because the contextual correlations in the frequency and time horizons are disregarded and the spectral and temporal factors are excluded. This paper proposes a spectral-temporal factorized neural network (STFNN) to tackle this weakness. The spectral-temporal structure is preserved and factorized in hidden layers through two ways of factor matrices which are trained by using the factorized error backpropagation. The affine transformation in a standard neural network is generalized to the spectro-temporal factorization in STFNN. The structural features or patterns are extracted and forwarded towards the softmax outputs. A deep neural factorization is built by cascading a number of factorization layers with fully-connected layers for speech recognition. An orthogonal constraint is imposed on the factor matrices for redundancy reduction. Experimental results show the merit of integrating the factorized features in deep feedforward and recurrent neural networks for speech recognition.

Semi-Supervised DNN Training with Word Selection for ASR

Karel Veselý, Lukáš Burget, Jan Cernocký; Brno University of Technology, Czech Republic
Thu-O-10-1-2, Time: 13:50–14:10

Not all the questions related to the semi-supervised training of hybrid ASR systems with DNN acoustic models have been deeply investigated yet. In this paper, we focus on the question of the granularity of confidences (per-sentence, per-word, per-frame) and the question of how the data should be used (data selection by masks, or in mini-batch SGD with confidences as weights). Then, we propose to re-tune the system with the manually transcribed data, both with 'frame CE' training and 'sMBR' training.

Our preferred semi-supervised recipe, which is both simple and efficient, is the following: we select words according to the word accuracy we obtain on the development set. This recipe, which does not rely on a grid search of the training hyper-parameter, generalized well for Babel Vietnamese (transcribed 11h, untranscribed 74h), Babel Bengali (transcribed 11h, untranscribed 58h) and our custom Switchboard setup (transcribed 14h, untranscribed 95h). We obtained absolute WER improvements of 2.5% for Vietnamese, 2.3% for Bengali and 3.2% for Switchboard.

Gaussian Prediction Based Attention for Online End-to-End Speech Recognition

Junfeng Hou, Shiliang Zhang, Li-Rong Dai; USTC, China
Thu-O-10-1-3, Time: 14:10–14:30

Recently end-to-end speech recognition has obtained much attention. One of the popular models to achieve end-to-end speech recognition is the attention based encoder-decoder model, which usually generates output sequences iteratively by attending to the whole representation of the input sequence. However, predicting outputs until receiving the whole input sequence is not practical for online or low time latency speech recognition. In this paper, we present a simple but effective attention mechanism which can make the encoder-decoder model generate outputs without attending to the entire input sequence and can be applied to online speech recognition. At each prediction step, the attention is assumed to be a time-moving Gaussian window with variable size, and can be predicted by using previous input and output information instead of the content based computation on the whole input sequence. To further improve the online performance of the model, we employ deep convolutional neural networks as the encoder. Experiments show that the Gaussian prediction based attention works well and, with the help of deep convolutional neural networks, the online model achieves a 19.5% phoneme error rate on the TIMIT ASR task.
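
The central idea — an attention window that is a Gaussian whose centre is only allowed to move forward and whose parameters are predicted rather than computed from the whole input — can be sketched as follows; fixed toy increments stand in for the prediction network, and all shapes are illustrative.

```python
# Sketch of a time-moving Gaussian attention window over encoder states
# (numpy). The mean/variance "predictions" are toy values standing in for
# the outputs of the prediction network described in the paper.
import numpy as np

rng = np.random.default_rng(2)
T, H = 120, 64
encoder_states = rng.standard_normal((T, H))   # T frames of encoder output

def gaussian_attention(mu, sigma, length):
    """Attention weights of a Gaussian window centred at frame index mu."""
    t = np.arange(length)
    w = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    return w / w.sum()

mu, contexts = 0.0, []
for step in range(10):                          # 10 decoding steps
    # Stand-ins for the predicted forward jump and window width.
    delta, sigma = 12.0, 4.0
    mu = mu + delta                             # window only moves forward
    alpha = gaussian_attention(mu, sigma, T)    # no full-sequence content scoring
    contexts.append(alpha @ encoder_states)     # context vector for this step

print(np.stack(contexts).shape)                 # (10, 64)
```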

Efficient Knowledge Distillation from an Ensemble of Teachers

Takashi Fukuda 1, Masayuki Suzuki 1, Gakuto Kurata 1, Samuel Thomas 2, Jia Cui 2, Bhuvana Ramabhadran 2; 1IBM, Japan; 2IBM, USA
Thu-O-10-1-4, Time: 14:30–14:50

This paper describes the effectiveness of knowledge distillation using teacher-student training for building accurate and compact neural networks. We show that with knowledge distillation, information from multiple acoustic models like very deep VGG networks and Long Short-Term Memory (LSTM) models can be used to train standard convolutional neural network (CNN) acoustic models for a variety of systems requiring a quick turnaround. We examine two strategies to leverage multiple teacher labels for training student models. In the first technique, the weights of the student model are updated by switching teacher labels at the minibatch level. In the second method, student models are trained on multiple streams of information from various teacher distributions via data augmentation. We show that standard CNN acoustic models can achieve comparable recognition accuracy with a much smaller number of model parameters compared to teacher VGG and LSTM acoustic models. Additionally we also investigate the effectiveness of using broadband teacher labels as privileged knowledge for training better narrowband acoustic models within this framework. We show the benefit of this simple technique by training narrowband student models with broadband teacher soft labels on the Aurora 4 task.
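
A minimal sketch of the first strategy — distilling from an ensemble by switching the teacher at the minibatch level — is given below; the toy teachers, dimensions, and data are stand-ins, not the VGG/LSTM models used in the paper.

```python
# Sketch of teacher-student distillation with teacher switching at the
# minibatch level (PyTorch). Models, dimensions, and data are toy stand-ins.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, n_states = 40, 500

student = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                        nn.Linear(256, n_states))
# Two frozen "teachers" standing in for e.g. a VGG and an LSTM acoustic model.
teachers = [nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(),
                          nn.Linear(1024, n_states)) for _ in range(2)]
for t in teachers:
    t.requires_grad_(False)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(3):                       # a few toy minibatches
    feats = torch.randn(32, feat_dim)
    teacher = random.choice(teachers)       # switch teacher per minibatch
    with torch.no_grad():
        soft_labels = F.softmax(teacher(feats), dim=-1)
    log_probs = F.log_softmax(student(feats), dim=-1)
    loss = F.kl_div(log_probs, soft_labels, reduction='batchmean')
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: distillation loss {loss.item():.3f}")
```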

An Analysis of “Attention” in Sequence-to-Sequence Models

Rohit Prabhavalkar 1, Tara N. Sainath 1, Bo Li 1, Kanishka Rao 1, Navdeep Jaitly 2; 1Google, USA; 2NVIDIA, USA
Thu-O-10-1-5, Time: 14:50–15:10

In this paper, we conduct a detailed investigation of attention-based models for automatic speech recognition (ASR). First, we explore different types of attention, including “online” and “full-sequence” attention. Second, we explore different subword units to see how much of the end-to-end ASR process can reasonably be captured by an attention model. In experimental evaluations, we find that although attention is typically focused over a small region of the acoustics during each step of next label prediction, “full-sequence” attention outperforms “online” attention, although this gap can be significantly reduced by increasing the length of the segments over which attention is computed. Furthermore, we find that context-independent phonemes are a reasonable sub-word unit for attention models. When used in the second-pass to rescore N-best hypotheses, these models provide over a 10% relative improvement in word error rate.

Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

Hagen Soltau, Hank Liao, Hasim Sak; Google, USA
Thu-O-10-1-6, Time: 15:10–15:30

We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model the output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which enables us to alleviate the data sparsity problem for word models. We show that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need to decode. We demonstrate that the CTC word models perform better than a strong, more complex, state-of-the-art baseline with sub-word units.
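
A minimal sketch of a CTC objective over whole-word output units is given below; the toy vocabulary, encoder size, and data are illustrative, whereas the paper's model is a deep bidirectional LSTM trained over roughly 100,000 word units.

```python
# Sketch of a CTC objective over whole-word output units (PyTorch).
# Vocabulary size, sequence lengths, and the encoder are toy stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 1000 + 1        # toy word vocabulary plus the CTC blank (index 0)
feat_dim, hidden = 40, 128

encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                  bidirectional=True, batch_first=True)
output_layer = nn.Linear(2 * hidden, vocab_size)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 200, feat_dim)                      # 4 utterances, 200 frames
targets = torch.randint(1, vocab_size, (4, 12))            # 12 word labels each
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 12, dtype=torch.long)

enc_out, _ = encoder(feats)
log_probs = F.log_softmax(output_layer(enc_out), dim=-1)   # (N, T, C)
loss = ctc(log_probs.transpose(0, 1), targets,             # CTC expects (T, N, C)
           input_lengths, target_lengths)
print(f"CTC loss: {loss.item():.3f}")
```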

Thu-O-10-2 : Robust Speaker Recognition
A2, 13:30–15:30, Thursday, 24 Aug. 2017
Chairs: John Hansen, Tomi Kinnunen

CNN-Based Joint Mapping of Short and Long Utterance i-Vectors for Speaker Verification Using Short Utterances

Jinxi Guo, Usha Amrutha Nookala, Abeer Alwan; University of California at Los Angeles, USA
Thu-O-10-2-1, Time: 13:30–13:50

Text-independent speaker recognition using short utterances is a highly challenging task due to the large variation and content mismatch between short utterances. I-vector and probabilistic linear discriminant analysis (PLDA) based systems have become the standard in speaker verification applications, but they are less effective with short utterances. To address this issue, we propose a novel method, which trains a convolutional neural network (CNN) model to map the i-vectors extracted from short utterances to the corresponding long-utterance i-vectors. In order to simultaneously learn the representation of the original short-utterance i-vectors and fit the target long-version i-vectors, we jointly train a supervised-regression model with an autoencoder using CNNs. The trained CNN model is then used to generate the mapped version of short-utterance i-vectors in the evaluation stage. We compare our proposed CNN-based joint mapping method with a GMM-based joint modeling method under matched and mismatched PLDA training conditions. Experimental results using the NIST SRE 2008 dataset show that the proposed technique achieves up to 30% relative improvement under duration mismatched PLDA-training conditions and outperforms the GMM-based method. The improved systems also perform better compared with the matched-length PLDA training condition using short utterances.
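
The joint training idea — a shared encoder feeding both an autoencoder branch and a regression branch towards the long-utterance i-vector — can be sketched as follows; dense layers stand in for the CNN used in the paper, and all dimensions and data are illustrative.

```python
# Sketch of jointly training an autoencoder and a short-to-long i-vector
# regression with a shared encoder (PyTorch). Dense layers stand in for the
# CNN used in the paper; dimensions and data are illustrative.
import torch
import torch.nn as nn

ivec_dim, bottleneck = 400, 200

encoder = nn.Sequential(nn.Linear(ivec_dim, bottleneck), nn.Tanh())
decoder = nn.Linear(bottleneck, ivec_dim)     # reconstructs the short i-vector
mapper = nn.Linear(bottleneck, ivec_dim)      # predicts the long-utterance i-vector

params = (list(encoder.parameters()) + list(decoder.parameters())
          + list(mapper.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)
mse = nn.MSELoss()

short_ivec = torch.randn(64, ivec_dim)        # toy short-utterance i-vectors
long_ivec = torch.randn(64, ivec_dim)         # toy matched long-utterance i-vectors

code = encoder(short_ivec)
loss = mse(decoder(code), short_ivec) + mse(mapper(code), long_ivec)
loss.backward()
optimizer.step()

# At test time only the encoder + mapper are used to map new short i-vectors.
mapped = mapper(encoder(torch.randn(1, ivec_dim)))
print(mapped.shape)
```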

Curriculum Learning Based Probabilistic Linear Discriminant Analysis for Noise Robust Speaker Recognition

Shivesh Ranjan, Abhinav Misra, John H.L. Hansen; University of Texas at Dallas, USA
Thu-O-10-2-2, Time: 13:50–14:10

This study introduces a novel Curriculum Learning based Probabilistic Linear Discriminant Analysis (CL-PLDA) algorithm for improving speaker recognition in noisy conditions. CL-PLDA operates by initializing the training EM algorithm with cleaner data (easy examples), and successively adds noisier data (difficult examples) as the training progresses. This curriculum learning based approach guides the parameters of CL-PLDA to better local minima compared to regular PLDA. We test CL-PLDA on the speaker verification task of the severely noisy and degraded DARPA RATS data, and show it to significantly outperform regular PLDA across test sets of varying duration.

i-Vector Transformation Using a Novel Discriminative Denoising Autoencoder for Noise-Robust Speaker Recognition

Shivangi Mahto, Hitoshi Yamamoto, Takafumi Koshinaka; NEC, Japan
Thu-O-10-2-3, Time: 14:10–14:30

This paper proposes i-vector transformations using neural networks for achieving noise-robust speaker recognition. A novel discriminative denoising autoencoder (DDAE) is employed on i-vectors to remove additive noise effects. The DDAE is trained to denoise and classify noisy i-vectors simultaneously, making it possible to add discriminability to the denoised i-vectors. Speaker recognition experiments on the NIST SRE 2012 task show 32% better error performance compared to a baseline system. Also, our proposed method outperforms such conventional methods as multi-condition training and a basic denoising autoencoder.

Unsupervised Discriminative Training of PLDA for Domain Adaptation in Speaker Verification

Qiongqiong Wang, Takafumi Koshinaka; NEC, Japan
Thu-O-10-2-4, Time: 14:30–14:50

This paper presents, for the first time, unsupervised discriminative training of probabilistic linear discriminant analysis (unsupervised DT-PLDA). While discriminative training avoids the problem of generative training based on probabilistic model assumptions that often do not agree with actual data, it has been difficult to apply it to unsupervised scenarios because it can fit data with almost any labels. This paper focuses on unsupervised training of DT-PLDA in the application of domain adaptation in i-vector based speaker verification systems, using unlabeled in-domain data. The proposed method makes it possible to conduct discriminative training, i.e., estimation of model parameters and unknown labels, by employing data statistics as a regularization term in addition to the original objective function in DT-PLDA. An experiment on a NIST Speaker Recognition Evaluation task shows that the proposed method outperforms a conventional method using speaker clustering and performs almost as well as supervised DT-PLDA.

Speaker Verification Under Adverse Conditions Using i-Vector Adaptation and Neural Networks

Jahangir Alam 1, Patrick Kenny 1, Gautam Bhattacharya 1, Marcel Kockmann 2; 1CRIM, Canada; 2VoiceTrust, Germany
Thu-O-10-2-5, Time: 14:50–15:10

The main challenges introduced in the 2016 NIST speaker recognition evaluation (SRE16) are domain mismatch between training and evaluation data, duration variability in test recordings and unlabeled in-domain training data. This paper outlines the systems developed at CRIM for SRE16. To tackle the domain mismatch problem, we apply minimum divergence training to adapt a conventional i-vector extractor to the task domain. Specifically, we take an out-of-domain trained i-vector extractor as an initialization and perform a few iterations of minimum divergence training on the unlabeled data provided. Next, we non-linearly transform the adapted i-vectors by learning a speaker classifier neural network. Speaker features extracted from this network have been shown to be more robust than i-vectors under domain mismatch conditions, with a reduction in equal error rates of 2–3% absolute. Finally, we propose a new Beta-Bernoulli backend that models the features supplied by the speaker classifier network. Our best single system is the speaker classifier network – Beta-Bernoulli backend combination. Overall system performance was very satisfactory for the fixed condition task. With our submitted fused system we achieve an equal error rate of 9.89%.

Improving Robustness of Speaker Recognition to New Conditions Using Unlabeled Data

Diego Castan 1, Mitchell McLaren 1, Luciana Ferrer 2, Aaron Lawson 1, Alicia Lozano-Diez 3; 1SRI International, USA; 2Universidad de Buenos Aires, Argentina; 3Universidad Autónoma de Madrid, Spain
Thu-O-10-2-6, Time: 15:10–15:30

Unsupervised techniques for the adaptation of speaker recognition are important due to the problem of condition mismatch that is prevalent when applying speaker recognition technology to new conditions and the general scarcity of labeled ‘in-domain’ data. In the recent NIST 2016 Speaker Recognition Evaluation (SRE), symmetric score normalization (S-norm) and calibration using unlabeled in-domain data were shown to be beneficial. Because calibration requires speaker labels for training, speaker-clustering techniques were used to generate pseudo-speakers for learning calibration parameters in those cases where only unlabeled in-domain data was available. These methods performed well in the SRE16. It is unclear, however, whether those techniques generalize well to other data sources. In this work, we benchmark these approaches on several distinctly different databases, after we describe our SRI-CON-UAM team system submission for the NIST 2016 SRE. Our analysis shows that while the benefit of S-norm is also observed across other datasets, applying speaker-clustered calibration provides considerably greater benefit to the system in the context of new acoustic conditions.

Thu-O-10-4 : Multimodal Resources and Annotation
B4, 13:30–15:30, Thursday, 24 Aug. 2017
Chairs: Stephanie Strassel, Febe De Wet

CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube

K. Abidi 1, M.A. Menacer 2, Kamel Smaïli 2; 1ESI, Algeria; 2LORIA, France
Thu-O-10-4-1, Time: 13:30–13:50

This paper addresses the issue of comparability of comments extracted from YouTube. The comments concern spoken Algerian, which could be either local Arabic, Modern Standard Arabic or French. This diversity of expression gives rise to a huge number of problems concerning the data processing. In this article, several methods of alignment are proposed and tested. The method which aligns best is a Word2Vec-based approach that is used iteratively. This recurrent call of Word2Vec allows us to improve significantly the results of comparability. In fact, a dictionary-based approach leads to a Recall of 4, while our approach achieves a Recall of 33 at rank 1. Thanks to this approach, we built CALYOU, a comparable corpus of spoken Algerian, from YouTube.

PRAV: A Phonetically Rich Audio Visual Corpus

Abhishek Narwekar 1, Prasanta Kumar Ghosh 2; 1University of Illinois at Urbana-Champaign, USA; 2Indian Institute of Science, India
Thu-O-10-4-2, Time: 13:50–14:10

This paper describes the acquisition of PRAV, a phonetically rich audio-visual corpus. The PRAV Corpus contains audio as well as visual recordings of 2368 sentences from the TIMIT corpus, each spoken by four subjects, making it the largest audio-visual corpus in the literature in terms of the number of sentences per subject. Visual features, comprising the coordinates of points along the contour of the subjects' lips, have been extracted for the entire PRAV Corpus using the Active Appearance Models (AAM) algorithm and have been made available along with the audio and video recordings. The subjects being Indian makes PRAV an ideal resource for audio-visual speech study with non-native English speakers. Moreover, this paper describes how the large number of sentences per subject makes the PRAV Corpus a significant dataset by highlighting its utility in exploring a number of potential research problems including visual speech synthesis and perception studies.

NTCD-TIMIT: A New Database and Baseline for Noise-Robust Audio-Visual Speech Recognition

Ahmed Hussen Abdelaziz; ICSI, USA
Thu-O-10-4-3, Time: 14:10–14:30

Although audio-visual speech is well known to improve the robustness properties of automatic speech recognition (ASR) systems against noise, the realm of audio-visual ASR (AV-ASR) has not gathered the research momentum it deserves. This is mainly due to the lack of audio-visual corpora and the need to combine two fields of knowledge: ASR and computer vision. This paper describes the NTCD-TIMIT database and baseline that can overcome these two barriers and attract more research interest to AV-ASR. The NTCD-TIMIT corpus has been created by adding six noise types at a range of signal-to-noise ratios to the speech material of the recently published TCD-TIMIT corpus. NTCD-TIMIT comprises visual features that have been extracted from the TCD-TIMIT video recordings using the visual front-end presented in this paper. The database also contains Kaldi scripts for training and decoding audio-only, video-only, and audio-visual ASR models. The baseline experiments and results obtained using these scripts are detailed in this paper.

The Extended SPaRKy Restaurant Corpus: Designing a Corpus with Variable Information Density

David M. Howcroft, Dietrich Klakow, Vera Demberg; Universität des Saarlandes, Germany
Thu-O-10-4-4, Time: 14:30–14:50

Natural language generation (NLG) systems rely on corpora for both hand-crafted approaches in a traditional NLG architecture and for statistical end-to-end (learned) generation systems. Limitations in existing resources, however, make it difficult to develop systems which can vary the linguistic properties of an utterance as needed. For example, when users' attention is split between a linguistic and a secondary task such as driving, a generation system may need to reduce the information density of an utterance to compensate for the reduction in user attention.

We introduce a new corpus in the restaurant recommendation and comparison domain, collected in a paraphrasing paradigm, where subjects wrote texts targeting either a general audience or an elderly family member. This design resulted in a corpus of more than 5000 texts which exhibit a variety of lexical and syntactic choices and differ with respect to average word & sentence length and surprisal. The corpus includes two levels of meaning representation: flat ‘semantic stacks’ for propositional content and Rhetorical Structure Theory (RST) relations between these propositions.

Automatic Construction of the Finnish Parliament Speech Corpus

André Mansikkaniemi, Peter Smit, Mikko Kurimo; Aalto University, Finland
Thu-O-10-4-5, Time: 14:50–15:10

Automatic speech recognition (ASR) systems require large amounts of transcribed speech data for training state-of-the-art deep neural network (DNN) acoustic models. Transcribed speech is a scarce and expensive resource, and ASR systems are prone to underperform in domains where there is not a lot of training data available. In this work, we open up a vast and previously unused resource of transcribed speech for Finnish, by retrieving and aligning all the recordings and meeting transcripts from the web portal of the Parliament of Finland. Short speech-text segment pairs are retrieved from the audio and text material by using the Levenshtein algorithm to align the first-pass ASR hypotheses with the corresponding meeting transcripts. DNN acoustic models are trained on the automatically constructed corpus, and performance is compared to other models trained on a commercially available speech corpus. Model performance is evaluated on Finnish parliament speech, by dividing the testing set into seen and unseen speakers. Performance is also evaluated on broadcast speech to test the general applicability of the parliament speech corpus. We also study the use of meeting transcripts in language model adaptation, to achieve additional gains in speech recognition accuracy on Finnish parliament speech.
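
The retrieval step described here can be illustrated with a small sketch that aligns a toy first-pass hypothesis against a toy transcript and keeps long exact matches; difflib's matcher is used as a simple stand-in for the Levenshtein alignment, and the word sequences and minimum match length are invented.

```python
# Sketch of aligning a first-pass ASR hypothesis with a meeting transcript to
# harvest matching speech-text segments (Python standard library). difflib is
# a stand-in for a Levenshtein alignment; the data below are toy examples.
from difflib import SequenceMatcher

transcript = "the committee will now hear the minister of finance on the budget".split()
asr_hyp    = "committee will now hear the minister of finance on a budget".split()

matcher = SequenceMatcher(a=transcript, b=asr_hyp, autojunk=False)
MIN_WORDS = 4    # keep only reasonably long exact matches

segments = []
for block in matcher.get_matching_blocks():
    if block.size >= MIN_WORDS:
        words = transcript[block.a:block.a + block.size]
        segments.append((block.a, block.b, " ".join(words)))

for ref_idx, hyp_idx, text in segments:
    print(f"transcript word {ref_idx} / hypothesis word {hyp_idx}: '{text}'")
```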

Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech

Omnia Abdo 1, Sherif Abdou 2, Mervat Fashal 1; 1Alexandria University, Egypt; 2Cairo University, Egypt
Thu-O-10-4-6, Time: 15:10–15:30

The present research aims to build an MSA audio-visual corpus. The corpus is annotated both phonetically and visually and is dedicated to emotional speech processing studies. The building of the corpus consists of 5 main stages: speaker selection, sentence selection, recording, annotation and evaluation. 500 sentences were critically selected based on their phonemic distribution. The speaker was instructed to read the same 500 sentences with 6 emotions (Happiness – Sadness – Fear – Anger – Inquiry – Neutral). A sample of 50 sentences was selected for annotation. The corpus evaluation modules were: audio, visual and audio-visual subjective evaluation.

The corpus evaluation process showed that the happy, anger and inquiry emotions were better recognized visually (94%, 96% and 96%) than audibly (63.6%, 74% and 74%), with audio-visual evaluation scores of 96%, 89.6% and 80.8%. The sadness and fear emotions, on the other hand, were better recognized audibly (76.8% and 97.6%) than visually (58% and 78.8%), with audio-visual evaluation scores of 65.6% and 90%.

Thu-O-10-8 : Forensic Phonetics and Sociophonetic Varieties
D8, 13:30–15:30, Thursday, 24 Aug. 2017
Chairs: Agustin Gravano, Melanie Weirich

What is the Relevant Population? Considerations for the Computation of Likelihood Ratios in Forensic Voice Comparison

Vincent Hughes, Paul Foulkes; University of York, UK
Thu-O-10-8-1, Time: 13:30–13:50

In forensic voice comparison, it is essential to consider not only the similarity between samples, but also the typicality of the evidence in the relevant population. This is explicit within the likelihood ratio (LR) framework. A significant issue, however, is the definition of the relevant population. This paper explores the complexity of population selection for voice evidence. We evaluate the effects of population specificity in terms of regional background on LR output using combinations of the F1, F2, and F3 trajectories of the diphthong /aI/. LRs were computed using development and reference data which were regionally matched (Standard Southern British English) and mixed (general British English) relative to the test data. These conditions reflect the paradox that without knowing who the offender is, it is not possible to know the population of which he is a member. Results show that the more specific population produced stronger evidence and better system validity than the more general definition. However, as region-specific voice features (lower formants) were removed, the difference in the output from the matched and mixed systems was reduced. This shows that the effects of population selection are dependent on the sociolinguistic constraints on the feature analysed.
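
For readers less familiar with the framework, the likelihood ratio referred to here weighs the similarity of the evidence under the same-speaker hypothesis against its typicality in the relevant population; in the standard formulation (notation ours):

```latex
\mathrm{LR} \;=\; \frac{p(E \mid H_{\text{same-speaker}})}
                       {p(E \mid H_{\text{different-speaker}},\ \text{relevant population})}
```

Values above 1 support the same-speaker proposition and values below 1 the different-speaker proposition, which is why the choice of relevant population directly shapes the denominator and hence the strength of the evidence.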

Voice Disguise vs. Impersonation: Acoustic and Perceptual Measurements of Vocal Flexibility in Non Experts

Véronique Delvaux, Lise Caucheteux, Kathy Huet, Myriam Piccaluga, Bernard Harmegnies; Université de Mons, Belgium
Thu-O-10-8-2, Time: 13:50–14:10

The aim of this study is to assess the potential for deliberately changing one's voice as a means to conceal or falsify identity, comparing acoustic and perceptual measurements of carefully controlled speech productions.

Twenty-two non expert speakers read a phonetically-balanced text 5 times in various conditions including natural speech, free vocal disguise (2 disguises per speaker), impersonation of a common target for all speakers, and impersonation of one specific target per speaker. Long-term average spectra (LTAS) were computed for each reading and multiple pairwise comparisons were performed using the SDDD dissimilarity index.

The acoustic analysis showed that all speakers were able to deliberately change their voice beyond self-typical natural variation, whether in attempting to simply disguise their identity or to impersonate a specific target. Although the magnitude of the acoustic changes was comparable in disguise vs. impersonation, overall it was limited in that it did not achieve between-speaker variation levels. Perceptual judgements performed on the same material revealed that naive listeners were better at discriminating between impersonators and targets than at simply detecting voice disguise.

Schwa Realization in French: Using Automatic Speech Processing to Study Phonological and Socio-Linguistic Factors in Large Corpora

Yaru Wu 1, Martine Adda-Decker 1, Cécile Fougeron 1, Lori Lamel 2; 1LPP (UMR 7018), France; 2LIMSI, France
Thu-O-10-8-3, Time: 14:10–14:30

The study investigates different factors influencing schwa realization in French: phonological factors, speech style, gender, and socio-professional status. Three large corpora, two of public journalistic speech (ESTER and ETAPE) and one of casual speech (NCCFr), are used. The absence/presence of schwa is automatically decided via forced alignment, which has a successful performance rate of 95%. Only polysyllabic words including a potential schwa in the word-initial syllable are studied in order to control for variability in word structure and position. The effect of the left context, grouped into the classes word-final vowel, word-final consonant, and pause, is studied. Words preceded by a vowel (V#) tend to favor schwa deletion. Interestingly, words preceded by a consonant or a pause have similar behaviors: speakers tend to maintain schwa in both contexts. As can be expected, the more casual the speech, the more frequently schwa is dropped. Males tend to delete more schwas than females, and journalists are more likely to delete schwa than politicians. These results suggest that beyond phonology, other factors such as gender, style and socio-professional status influence the realization of schwa.

The Social Life of Setswana Ejectives

Daniel Duran 1, Jagoda Bruni 1, Grzegorz Dogil 1, Justus Roux 2; 1Universität Stuttgart, Germany; 2SADiLaR, South Africa
Thu-O-10-8-4, Time: 14:30–14:50

This paper presents a first phonetic analysis of voiced, devoiced and ejectivized stop sounds in Setswana taken from two different speech databases. It is observed that the rules governing the voicing/devoicing processes depend on sociophonetic and ethnolinguistic factors. Speakers, especially women, from the rural North West area of South Africa tend to preserve the phonologically stronger devoiced (or even ejectivized) forms, both in single standing plosives as well as in the post-nasal context (NC˚). On the other hand, in the more industrialized area of Gauteng, voiced forms of plosives prevail. The empirically observed data is modelled with KaMoso, a computational multi-agent simulation framework. So far, this framework focused on open social structures (whole world networks) that facilitate language modernization through exchange between different phonetic forms. The updated model has been enriched with social/phonetic simulation scenarios in which speech agents interact with each other in a so-called parochial setting, reflecting smaller, closed communities. Both configurations correspond to the sociopolitical changes that have been taking place in South Africa over the last decades, showing the differences in speech between women and men from rural and industrialized areas of the country.

How Long is Too Long? How Pause Features After Requests Affect the Perceived Willingness of Affirmative Answers

Lea S. Kohtz 1, Oliver Niebuhr 2; 1Christian-Albrechts-Universität zu Kiel, Germany; 2University of Southern Denmark, Denmark
Thu-O-10-8-5, Time: 14:50–15:10

A perception experiment involving 28 German listeners is presented. It investigates — for sequences of request, pause, and affirmative answer — the effect of pause duration on the answerer's perceived willingness to comply with the request. Replicating earlier results on American English, perceived willingness was found to decrease with increasing pause duration, particularly above a “tolerance threshold” of 600 ms. Refining and qualifying this replicated result, the perception experiment showed additional effects of speaking-rate context and pause quality (silence vs. breathing vs. café noise) on perceived willingness judgments. The overall pattern of results is discussed with respect to the origin of the “tolerance threshold”, the status of breathing in speech, and the function of pauses in communication.

Shadowing Synthesized Speech — Segmental Analysis of Phonetic Convergence

Iona Gessinger, Eran Raveh, Sébastien Le Maguer, Bernd Möbius, Ingmar Steiner; Universität des Saarlandes, Germany
Thu-O-10-8-6, Time: 15:10–15:30

To shed light on the question whether humans converge phonetically to synthesized speech, a shadowing experiment was conducted using three different types of stimuli — natural speaker, diphone synthesis, and HMM synthesis. Three segment-level phonetic features of German that are well-known to vary across native speakers were examined. The first feature triggered convergence in roughly one third of the cases for all stimulus types. The second feature showed generally a small amount of convergence, which may be due to the nature of the feature itself. Still the effect was strongest for the natural stimuli, followed by the HMM stimuli and weakest for the diphone stimuli. The effect of the third feature was clearly observable for the natural stimuli and less pronounced in the synthetic stimuli. This is presumably a result of the partly insufficient perceptibility of this target feature in the synthetic stimuli and demonstrates the necessity of gaining fine-grained control over the synthesis output, should it be intended to implement capabilities of phonetic convergence on the segmental level in spoken dialogue systems.

Thu-O-10-11 : Speech and Audio Segmentation and Classification 1
F11, 13:30–15:30, Thursday, 24 Aug. 2017
Chairs: Mahadeva Prasanna, Tomoki Toda

Occupancy Detection in Commercial and Residential Environments Using Audio Signal

Shabnam Ghaffarzadegan 1, Attila Reiss 2, Mirko Ruhs 2, Robert Duerichen 2, Zhe Feng 1; 1Robert Bosch, USA; 2Robert Bosch, Germany
Thu-O-10-11-1, Time: 13:30–13:50

Occupancy detection, including presence detection and head count, is a fast growing area that plays an important role in providing safety and comfort and in reducing energy consumption, both in residential and commercial setups. The focus of this study is proposing affordable strategies to increase occupancy detection performance in realistic scenarios using only the audio signal collected from the environment. We use approximately 100 hours of audio data in residential and commercial environments to analyze and evaluate our setup. In this study, we take advantage of developments in feature selection methods to choose the most relevant audio features for the task. Attribute and error vs. human activity analyses are also performed to gain a better understanding of the environmental sounds and possible solutions to enhance the performance. Experimental results confirm the effectiveness of the audio sensor for occupancy detection using a cost effective system, with a presence detection accuracy of 96% and 99%, and a head count accuracy of 70% and 95%, for the residential and commercial setups, respectively.

Data Augmentation, Missing Feature Mask and Kernel Classification for Through-the-Wall Acoustic Surveillance

Huy Dat Tran, Wen Zheng Terence Ng, Yi Ren Leng; A*STAR, Singapore
Thu-O-10-11-2, Time: 13:50–14:10

This paper deals with sound event classification from poor quality signals in the context of “through-the-wall” (TTW) surveillance. The task is extremely challenging due to the high level of distortion and attenuation caused by complex sound propagation and modulation effects from signal acquisition. Another problem faced in TTW surveillance is the lack of comprehensive training data, as the recording is much more complicated than conventional approaches using audio microphones. To address that challenge, we employ a recurrent neural network, particularly the Long Short-Term Memory (LSTM) encoder, to transform conventional clean and noisy audio signals into TTW signals to augment additional training data. Furthermore, a novel missing feature mask kernel classification is developed to optimize the classification accuracy of TTW sound event classification. Particularly, the Wasserstein distance is calculated from reliable intersection regions between pair-wise sound image representations and embedded into a probabilistic distance Support Vector Machine (SVM) kernel to optimize the TTW data separation. The proposed missing feature mask kernel allows effective training with inhomogeneously distorted data, and the experimental results show promising results on TTW audio recordings, outperforming several state-of-the-art methods.

Endpoint Detection Using Grid Long Short-Term Memory Networks for Streaming Speech Recognition

Shuo-Yiin Chang, Bo Li, Tara N. Sainath, Gabor Simko, Carolina Parada; Google, USA
Thu-O-10-11-3, Time: 14:10–14:30

The task of endpointing is to determine when the user has finished speaking. This is important for interactive speech applications such as voice search and Google Home. In this paper, we propose a GLDNN-based (grid long short-term memory deep neural network) endpointer model and show that it provides significant improvements over a state-of-the-art CLDNN (convolutional, long short-term memory, deep neural network) model. Specifically, we replace the convolution layer in the CLDNN with a grid LSTM layer that models both spectral and temporal variations through recurrent connections. Results show that the GLDNN achieves 32% relative improvement in false alarm rate at a fixed false reject rate of 2%, and reduces median latency by 11%. We also include detailed experiments investigating why grid LSTMs offer better performance than convolution layers. Analysis reveals that the recurrent connection along the frequency axis is an important factor that greatly contributes to the performance of grid LSTMs, especially in the presence of background noise. Finally, we also show that multichannel input further increases robustness to background speech. Overall, we achieve 16% (100 ms) endpointer latency improvement relative to our previous best model on a Voice Search Task.

Deep Learning Techniques in Tandem with Signal Processing Cues for Phonetic Segmentation for Text to Speech Synthesis in Indian Languages

Arun Baby, Jeena J. Prakash, Rupak Vignesh, Hema A. Murthy; IIT Madras, India
Thu-O-10-11-4, Time: 14:30–14:50

Automatic detection of phoneme boundaries is an important sub-task in building speech processing applications, especially text-to-speech synthesis (TTS) systems. The main drawback of the Gaussian mixture model - hidden Markov model (GMM-HMM) based forced alignment is that the phoneme boundaries are not explicitly modeled. In an earlier work, we had proposed the use of signal processing cues in tandem with GMM-HMM based forced alignment for boundary correction for building Indian language TTS systems. In this paper, we capitalise on the ability of robust acoustic modeling techniques such as deep neural networks (DNN) and convolutional deep neural networks (CNN) for acoustic modeling. The GMM-HMM based forced alignment is replaced by DNN-HMM/CNN-HMM based forced alignment. Signal processing cues are used to correct the segment boundaries obtained using DNN-HMM/CNN-HMM segmentation. TTS systems built using these boundaries show a relative improvement in synthesis quality.

Gate Activation Signal Analysis for Gated Recurrent Neural Networks and its Correlation with Phoneme Boundaries

Yu-Hsuan Wang, Cheng-Tao Chung, Hung-Yi Lee; National Taiwan University, Taiwan
Thu-O-10-11-5, Time: 14:50–15:10

In this paper we analyze the gate activation signals inside gated recurrent neural networks and find that the temporal structure of such signals is highly correlated with phoneme boundaries. This correlation is further verified by a set of experiments on phoneme segmentation, in which better results were obtained compared to standard approaches.

Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks

Ruiqing Yin, Hervé Bredin, Claude Barras; LIMSI, France
Thu-O-10-11-6, Time: 15:10–15:30

Speaker change detection is an important step in a speaker diarization system. It aims at finding speaker change points in the audio stream. In this paper, it is treated as a sequence labeling task and addressed by bidirectional long short-term memory networks (Bi-LSTM). The system is trained and evaluated on the Broadcast TV subset of the ETAPE database. The results show that the proposed model brings a good improvement over conventional methods based on BIC and Gaussian divergence. For instance, in comparison to Gaussian divergence, it produces speech turns that are 19.5% longer on average, with the same level of purity.
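
The sequence labeling formulation above can be sketched in a few lines of PyTorch: a Bi-LSTM reads frame-level features and emits a per-frame change-point score. The feature type, layer sizes and toy targets below are assumptions rather than the paper's setup.

    # Hedged sketch: Bi-LSTM sequence labeling of speaker change points.
    import torch
    import torch.nn as nn

    class ChangeDetector(nn.Module):
        def __init__(self, n_feats=35, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(n_feats, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, 1)   # per-frame "change point" logit

        def forward(self, x):                     # x: (batch, frames, n_feats), e.g. MFCCs
            h, _ = self.lstm(x)
            return self.out(h).squeeze(-1)        # (batch, frames)

    model = ChangeDetector()
    feats = torch.randn(4, 500, 35)               # 4 sequences of 500 frames
    labels = (torch.rand(4, 500) > 0.99).float()  # sparse change-point targets (toy)
    loss = nn.BCEWithLogitsLoss()(model(feats), labels)
    loss.backward()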

Thu-P-9-1 : Noise Robust and Far-field ASR
Poster 1, 10:00–12:00, Thursday, 24 Aug. 2017
Chair: Volker Leutnant

Improved Automatic Speech Recognition Using Subband Temporal Envelope Features and Time-Delay Neural Network Denoising Autoencoder

Cong-Thanh Do, Yannis Stylianou; Toshiba Research Europe, UK
Thu-P-9-1-1, Time: 10:00–12:00

This paper investigates the use of perceptually-motivated subband temporal envelope (STE) features and a time-delay neural network (TDNN) denoising autoencoder (DAE) to improve deep neural network (DNN)-based automatic speech recognition (ASR). STEs are estimated by full-wave rectification and low-pass filtering of band-passed speech using a Gammatone filter-bank. TDNNs are used either as DAEs or as acoustic models. ASR experiments are performed on the Aurora-4 corpus. STE features provide 2.2% and 3.7% relative word error rate (WER) reductions compared to conventional log-mel filter-bank (FBANK) features when used in ASR systems with DNN and TDNN acoustic models, respectively. Features enhanced by the TDNN DAE are better recognized by an ASR system using DNN acoustic models than by one using TDNN acoustic models. Improved ASR performance is obtained when features enhanced by the TDNN DAE are used in an ASR system with DNN acoustic models; in this scenario, using STE features provides a 9.8% relative WER reduction compared to using FBANK features.
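
A minimal sketch of the STE idea (band-pass filter, full-wave rectify, low-pass filter, downsample to the frame rate) is shown below. A Butterworth band-pass stands in for a Gammatone channel, and the band edges, envelope cutoff and frame hop are assumptions.

    # Hedged sketch: one subband temporal envelope channel.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def subband_temporal_envelope(x, fs, f_lo, f_hi, env_cutoff=30.0, frame_hop=0.010):
        sos_bp = butter(4, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
        sos_lp = butter(4, env_cutoff, btype="lowpass", fs=fs, output="sos")
        band = sosfiltfilt(sos_bp, x)              # band-passed speech
        env = sosfiltfilt(sos_lp, np.abs(band))    # full-wave rectification + low-pass
        hop = int(round(frame_hop * fs))
        return env[::hop]                          # one envelope value per frame

    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 500 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))  # toy AM tone
    ste = subband_temporal_envelope(x, fs, 300, 700)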

Factored Deep Convolutional Neural Networks for Noise Robust Speech Recognition

Masakiyo Fujimoto; NICT, Japan
Thu-P-9-1-2, Time: 10:00–12:00

In this paper, we present a framework of factored deep convolutional neural network (CNN) learning for noise robust automatic speech recognition (ASR). The deep CNN architecture, which has attracted great attention in various research areas, has also been successfully applied to ASR. However, merely introducing a deep CNN architecture into the acoustic modeling of ASR is not sufficient to ensure noise robustness, so we introduce a factored network architecture into deep CNN-based acoustic modeling. The proposed factored deep CNN framework factors out feature enhancement, delta parameter learning, and hidden Markov model state classification into three specific network blocks. By assigning a specific role to each block, the noise robustness of deep CNN-based acoustic models can be improved. Through various comparative evaluations, we show that the proposed method successfully improves ASR accuracy in noisy environments.

Global SNR Estimation of Speech Signals for Unknown Noise Conditions Using Noise Adapted Non-Linear Regression

Pavlos Papadopoulos, Ruchir Travadi, Shrikanth S. Narayanan; University of Southern California, USA
Thu-P-9-1-3, Time: 10:00–12:00

The performance of speech technologies deteriorates in the presence of noise. Additionally, we need these technologies to be able to operate across a variety of noise levels and conditions. SNR estimation can guide the design and operation of such technologies or can be used as a pre-processing tool in database creation (e.g. to identify and discard noisy signals). We propose a new method to estimate the global SNR of a speech signal when prior information about the noise that corrupts the signal, and about speech boundaries within the signal, is not available. To achieve this goal, we train a neural network that performs non-linear regression to estimate the SNR. We use energy ratios as features, as well as i-vectors to provide information about the noise that corrupts the signal. We compare our method against others in the literature using the Mean Absolute Error (MAE) metric, and show that our method consistently outperforms them.
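
For reference, the global SNR of an utterance is ten times the base-10 logarithm of the ratio between total speech energy and total noise energy. The sketch below computes this oracle value from separate speech and noise tracks, which is the kind of regression target a model like the one above can be trained to predict from the noisy signal alone; the toy signals are assumptions.

    # Hedged sketch: oracle global SNR from separate speech and noise tracks.
    import numpy as np

    def global_snr_db(speech, noise, eps=1e-12):
        return 10.0 * np.log10((np.sum(speech ** 2) + eps) / (np.sum(noise ** 2) + eps))

    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000)
    noise = 0.3 * rng.standard_normal(16000)
    print(global_snr_db(speech, noise))   # roughly 10*log10(1/0.09), i.e. about 10.5 dB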

Joint Training of Multi-Channel-Condition Dereverberation and Acoustic Modeling of Microphone Array Speech for Robust Distant Speech Recognition

Fengpei Ge 1, Kehuang Li 2, Bo Wu 3, Sabato Marco Siniscalchi 4, Yonghong Yan 1, Chin-Hui Lee 2; 1Chinese Academy of Sciences, China; 2Georgia Institute of Technology, USA; 3Xidian University, China; 4Università di Enna Kore, Italy
Thu-P-9-1-4, Time: 10:00–12:00

We propose a novel data utilization strategy, called multi-channel-condition learning, leveraging complementary information captured in microphone array speech to jointly train dereverberation and acoustic deep neural network (DNN) models for robust distant speech recognition. Experimental results with a single automatic speech recognition (ASR) system on the REVERB2014 simulated evaluation data show that, on 1-channel testing, the baseline joint training scheme attains a word error rate (WER) of 7.47%, reduced from 8.72% for separate training. The proposed multi-channel-condition learning scheme has been evaluated on different channel data combinations and usages, showing many interesting implications. Finally, training on all 8-channel data and with DNN-based language model rescoring, a state-of-the-art WER of 4.05% is achieved. We anticipate an even lower WER when combining more top ASR systems.

Uncertainty Decoding with Adaptive Sampling for Noise Robust DNN-Based Acoustic Modeling

Dung T. Tran, Marc Delcroix, Atsunori Ogawa, Tomohiro Nakatani; NTT, Japan
Thu-P-9-1-5, Time: 10:00–12:00

Although deep neural network (DNN) based acoustic models have obtained remarkable results, automatic speech recognition (ASR) performance still remains low in noisy and reverberant conditions. To address this issue, a speech enhancement front-end is often used before recognition to reduce noise. However, the front-end cannot fully suppress noise and often introduces artifacts that limit the ASR performance improvement. Uncertainty decoding has been proposed to better interconnect the speech enhancement front-end and the ASR back-end and to mitigate the mismatch caused by residual noise and artifacts. By considering features as distributions instead of point estimates, the uncertainty decoding approach modifies the conventional decoding rules to account for the uncertainty emanating from the speech enhancement. Although the concept of uncertainty decoding has recently been investigated for DNN acoustic models, finding efficient ways to incorporate the distribution of the enhanced features within a DNN acoustic model still requires further investigation. In this paper, we propose to parameterize the distribution of the enhanced features and to estimate the parameters by backpropagation using an unsupervised adaptation scheme. We demonstrate the effectiveness of the proposed approach on real audio data from the CHiME-3 dataset.

Attention-Based LSTM with Multi-Task Learning for Distant Speech Recognition

Yu Zhang, Pengyuan Zhang, Yonghong Yan; Chinese Academy of Sciences, China
Thu-P-9-1-6, Time: 10:00–12:00

Distant speech recognition is a highly challenging task due to background noise, reverberation, and speech overlap. Recently, there has been an increasing focus on attention mechanisms. In this paper, we explore an attention mechanism embedded within the long short-term memory (LSTM) based acoustic model for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM). Furthermore, a multi-task learning architecture is incorporated to improve robustness, in which the network is trained to perform both a primary senone classification task and a secondary feature enhancement task. Experiments were conducted on the AMI meeting corpus. On average our model achieved 3.3% and 5.0% relative improvements in word error rate (WER) over the LSTM baseline model in the SDM and MDM cases, respectively. In addition, the model provided a 2–4% absolute WER reduction compared to a conventional pipeline of independent processing stages on the MDM task.
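
The multi-task setup described above can be sketched as a shared LSTM trunk with a primary senone classification head and a secondary feature enhancement head, trained with a weighted sum of the two losses. Sizes and the task weight below are assumptions.

    # Hedged sketch: multi-task acoustic model (senone classification + enhancement).
    import torch
    import torch.nn as nn

    class MultiTaskAM(nn.Module):
        def __init__(self, n_feats=40, hidden=128, n_senones=500):
            super().__init__()
            self.trunk = nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
            self.senone_head = nn.Linear(hidden, n_senones)   # primary task
            self.enhance_head = nn.Linear(hidden, n_feats)    # secondary task (clean features)

        def forward(self, noisy):
            h, _ = self.trunk(noisy)
            return self.senone_head(h), self.enhance_head(h)

    model = MultiTaskAM()
    noisy, clean = torch.randn(8, 200, 40), torch.randn(8, 200, 40)   # toy batches
    senones = torch.randint(0, 500, (8, 200))
    logits, enhanced = model(noisy)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 500), senones.reshape(-1)) \
           + 0.3 * nn.MSELoss()(enhanced, clean)   # 0.3 is an assumed task weight
    loss.backward()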

To Improve the Robustness of LSTM-RNN Acoustic Models Using Higher-Order Feedback from Multiple Histories

Hengguan Huang, Brian Mak; HKUST, China
Thu-P-9-1-7, Time: 10:00–12:00

This paper investigates a novel multiple-history long short-term memory (MH-LSTM) RNN acoustic model to mitigate the robustness problem of noisy outputs in the form of mis-labeled data and/or mis-alignments. Conceptually, after an RNN is unfolded in time, the hidden units in each layer are re-arranged into ordered sub-layers with a master sub-layer on top and a set of auxiliary sub-layers below it. Only the master sub-layer generates outputs for the next layer, whereas the auxiliary sub-layers run in parallel with the master sub-layer but with increasing time lags. Each sub-layer also receives higher-order feedback from a fixed number of sub-layers below it. As a result, each sub-layer maintains a different history of the input speech, and the ensemble of all the different histories lends itself to the model's robustness. The higher-order connections not only provide shorter feedback paths for error signals to propagate to the farther preceding hidden states, which better models long-term memory, but also provide more feedback paths to each model parameter, smoothing its updates during training. Phoneme recognition results on both real TIMIT data and synthetic TIMIT data with noisy labels or alignments show that the new model outperforms the conventional LSTM RNN model.

End-to-End Speech Recognition with Auditory Attention for Multi-Microphone Distance Speech Recognition

Suyoun Kim, Ian Lane; Carnegie Mellon University, USA
Thu-P-9-1-8, Time: 10:00–12:00

End-to-End speech recognition is a recently proposed approach that directly transcribes input speech to text using a single model. End-to-End speech recognition methods including Connectionist Temporal Classification and Attention-based Encoder Decoder Networks have been shown to obtain state-of-the-art performance on a number of tasks and significantly simplify the modeling, training and decoding procedures for speech recognition. In this paper, we extend our prior work on End-to-End speech recognition focusing on the effectiveness of these models in far-field environments. Specifically, we propose introducing Auditory Attention to integrate input from multiple microphones directly within an End-to-End speech recognition model, leveraging the attention mechanism to dynamically tune the model's attention to the most reliable input sources. We evaluate our proposed model on the CHiME-4 task, and show substantial improvement compared to a model optimized for a single microphone input.

Robust Speech Recognition Based on Binaural Auditory Processing

Anjali Menon 1, Chanwoo Kim 2, Richard M. Stern 1; 1Carnegie Mellon University, USA; 2Google, USA
Thu-P-9-1-9, Time: 10:00–12:00

This paper discusses a combination of techniques for improving speech recognition accuracy in the presence of reverberation and spatially-separated interfering sound sources. Interaural Time Delay (ITD), observed as a consequence of the difference in arrival times of a sound at the two ears, is an important feature used by the human auditory system to reliably localize and separate sound sources. In addition, the “precedence effect” helps the auditory system differentiate between the direct sound and its subsequent reflections in reverberant environments. This paper uses a cross-correlation-based measure across the two channels of a binaural signal to isolate the target source by rejecting portions of the signal corresponding to larger ITDs. To overcome the effects of reverberation, the steady-state components of speech are suppressed, effectively boosting the onsets, so as to retain the direct sound and suppress the reflections. Experimental results show a significant improvement in recognition accuracy using both of these techniques. Cross-correlation-based processing and steady-state suppression are carried out separately, and the order in which these techniques are applied produces differences in the resulting recognition accuracy.
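
The ITD-based selection described above relies on estimating interaural time delay from the two channels. A minimal cross-correlation sketch is given below; the sampling rate, search range and toy delay are assumptions.

    # Hedged sketch: ITD as the lag of the maximum cross-correlation between channels.
    import numpy as np

    def estimate_itd(left, right, fs, max_itd=0.001):
        max_lag = int(max_itd * fs)
        lags = np.arange(-max_lag, max_lag + 1)
        xcorr = [np.sum(left[max(0, -l):len(left) - max(0, l)] *
                        right[max(0, l):len(right) - max(0, -l)]) for l in lags]
        return lags[int(np.argmax(xcorr))] / fs    # seconds; the sign shows which ear leads

    fs = 16000
    rng = np.random.default_rng(0)
    src = rng.standard_normal(fs)
    delay = 8                                      # samples, i.e. 0.5 ms
    left, right = src, np.concatenate([np.zeros(delay), src[:-delay]])
    print(estimate_itd(left, right, fs))           # approximately 8 / 16000 = 5.0e-4 s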

Adaptive Multichannel Dereverberation for Automatic Speech Recognition

Joe Caroselli, Izhak Shafran, Arun Narayanan, Richard Rose; Google, USA
Thu-P-9-1-10, Time: 10:00–12:00

Reverberation is known to degrade the performance of automatic speech recognition (ASR) systems dramatically in far-field conditions. Adopting the weighted prediction error (WPE) approach, we formulate an online dereverberation algorithm for a multi-microphone array. The key contributions of this paper are: (a) we demonstrate that dereverberation using WPE improves performance even when the acoustic models are trained using multi-style training (MTR) with noisy, reverberated speech; (b) we show that the gains from WPE are preserved even in large and diverse real-world data sets; (c) we propose an adaptive version for online multichannel ASR tasks which gives gains similar to the non-causal version; and (d) while the algorithm can simply be applied at evaluation time, we show that also including dereverberation during training gives increased performance gains. We also report how different parameter settings of the dereverberation algorithm impact ASR performance.
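
To make the WPE idea concrete, here is a minimal single-channel sketch of delayed linear prediction in the STFT domain: for each frequency bin, late reverberation is predicted from delayed past frames and subtracted. The paper's algorithm is multichannel and online; the delay, tap count and iteration count below are assumptions.

    # Hedged sketch: single-channel, offline WPE-style dereverberation.
    import numpy as np

    def delayed_linear_prediction_dereverb(Y, delay=3, taps=10, iters=3, eps=1e-8):
        # Y: complex STFT, shape (freq_bins, frames)
        F, T = Y.shape
        X = Y.copy()
        for f in range(F):
            y = Y[f]
            Ybar = np.zeros((taps, T), dtype=complex)      # stack of delayed frames
            for k in range(taps):
                d = delay + k
                Ybar[k, d:] = y[:T - d]
            x = y.copy()
            for _ in range(iters):
                lam = np.maximum(np.abs(x) ** 2, eps)      # PSD estimate of desired signal
                R = (Ybar / lam) @ Ybar.conj().T           # weighted covariance
                p = (Ybar / lam) @ y.conj()                # weighted correlation
                g = np.linalg.solve(R + eps * np.eye(taps), p)
                x = y - g.conj() @ Ybar                    # remove predicted late reverberation
            X[f] = x
        return X

    rng = np.random.default_rng(0)
    Y = rng.standard_normal((257, 200)) + 1j * rng.standard_normal((257, 200))
    X = delayed_linear_prediction_dereverb(Y)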

Thu-P-9-3 : Styles, Varieties, Forensics and Tools
Poster 3, 10:00–12:00, Thursday, 24 Aug. 2017
Chair: Kiyoko Yoneyama

The Effects of Real and Placebo Alcohol on Deaffrication

Urban Zihlmann; Universität Zürich, Switzerland
Thu-P-9-3-1, Time: 10:00–12:00

The more alcohol a person has consumed, the more mispronunciations occur. This study investigates how deaffrication surfaces in Bernese Swiss German when speakers are moderately intoxicated (0.05–0.08% Vol.), whether these effects can be hidden, and whether a placebo effect interacting with mispronunciation occurs. Five participants reading a text were recorded as follows. In stage I, they read the text before and after drinking placebo alcohol, and finally again after being told to enunciate very clearly. 3–7 days later, the same experiment was repeated with real alcohol. The recordings were then analysed with Praat. Despite interspeaker variation, the following generalisations can be made. The most deaffrication occurs in the C_C context both when speakers are sober and inebriated; affricates in _#, V_C, and V_V position encounter more deaffrication in the alcohol stage; and /tʃ/ and /kx/ are deaffricated more when the speaker is intoxicated, with /tʃ/ being the most susceptible to mispronunciation. Moreover, when alcohol is consumed, more deaffrication occurs, which cannot consciously be controlled. Furthermore, a statistically significant difference between the pre- and the post-placebo-drinking experiment could be found, which implies that a placebo effect takes place. Nevertheless, the effects of real alcohol are considerably stronger.

Polyglot and Speech Corpus Tools: A System for Representing, Integrating, and Querying Speech Corpora

Michael McAuliffe 1, Elias Stengel-Eskin 1, Michaela Socolof 2, Morgan Sonderegger 1; 1McGill University, Canada; 2University of Maryland, USA
Thu-P-9-3-2, Time: 10:00–12:00

Speech datasets from many languages, styles, and sources exist in the world, representing significant potential for scientific studies of speech, particularly given structural similarities among all speech datasets. However, studies using multiple speech corpora remain difficult in practice, due to corpus size, complexity, and differing formats. We introduce open-source software for unified corpus analysis: integrating speech corpora and querying across them. Corpora are stored in a custom 'polyglot persistence' scheme that combines three sub-databases mirroring different data types: a Neo4j graph database to represent the temporal annotation graph structure, and SQL and InfluxDB databases to represent metadata and acoustic data. This scheme abstracts away from the idiosyncratic formats of different speech corpora, while mirroring the structure of different data types improves speed and scalability. A Python API and a GUI both allow for: enriching the database with positional, hierarchical, temporal, and signal measures (e.g. utterance boundaries, f0) that are useful for linguistic analysis; querying the database using a simple query language; and exporting query results to standard formats for further analysis. We describe the software, summarize two case studies using it to examine effects on pitch and duration across languages, and outline planned future development.

Mapping Across Feature Spaces in Forensic Voice Comparison: The Contribution of Auditory-Based Voice Quality to (Semi-)Automatic System Testing

Vincent Hughes, Philip Harrison, Paul Foulkes, Peter French, Colleen Kavanagh, Eugenia San Segundo; University of York, UK
Thu-P-9-3-3, Time: 10:00–12:00

In forensic voice comparison, there is increasing focus on the integration of automatic and phonetic methods to improve the validity and reliability of voice evidence to the courts. In line with this, we present a comparison of long-term measures of the speech signal to assess the extent to which they capture complementary speaker-specific information. Likelihood ratio-based testing was conducted using MFCCs and (linear and Mel-weighted) long-term formant distributions (LTFDs). Fusing automatic and semi-automatic systems yielded limited improvement in performance over the baseline MFCC system, indicating that these measures capture essentially the same speaker-specific information. The output from the best performing system was used to evaluate the contribution of auditory-based analysis of supralaryngeal (filter) and laryngeal (source) voice quality in system testing. Results suggest that the problematic speakers for the (semi-)automatic system are, to some extent, predictable from their supralaryngeal voice quality profiles, with the least distinctive speakers producing the weakest evidence and most misclassifications. However, the misclassified pairs were still easily differentiated via auditory analysis. Laryngeal voice quality may thus be useful in resolving problematic pairs for (semi-)automatic systems, potentially improving their overall performance.


Effect of Language, Speaking Style and Speaker on Long-Term F0 Estimation

Pablo Arantes 1, Anders Eriksson 2, Suska Gutzeit 1; 1Universidade Federal de São Carlos, Brazil; 2Stockholm University, Sweden
Thu-P-9-3-4, Time: 10:00–12:00

In this study, we compared three long-term fundamental frequency estimates, the mean, median and base value, with respect to how fast they approach a stable value, as a function of language, speaking style and speaker. The base value concept was developed in the search for an f0 value which should be invariant under prosodic variation. It has since also been tested in forensic phonetics as a possible speaker-specific f0 value. The data used in this study, recorded speech by male and female speakers in seven languages and three speaking styles (spontaneous, phrase reading and word list reading), had been recorded for a previous project. Average stabilisation times for the mean, median and base value are 9.76, 9.67 and 8.01 s, respectively. Base values stabilise significantly faster. Languages differ in both the average and the variability of the stabilisation times. Values range from 7.14 to 11.41 s (mean), 7.5 to 11.33 s (median) and 6.74 to 9.34 s (base value). Spontaneous speech yields the most variable stabilisation times for the three estimators in Italian and Swedish, for the median in French and Portuguese, and for the base value in German. Speakers within each language do not differ significantly in terms of stabilisation time variability for the three estimators.
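
One simple way to operationalise "stabilisation time" is the time after which the running long-term estimate stays within a tolerance of its final value. The sketch below does this for the running mean of an f0 track; the tolerance, frame rate and toy track are assumptions, not the study's criterion.

    # Hedged sketch: stabilisation time of the running mean of an f0 track.
    import numpy as np

    def stabilisation_time(f0, frame_rate=100.0, tol_hz=1.0):
        f0 = np.asarray(f0, dtype=float)
        running_mean = np.cumsum(f0) / np.arange(1, len(f0) + 1)
        final = running_mean[-1]
        outside = np.where(np.abs(running_mean - final) > tol_hz)[0]
        first_stable = 0 if len(outside) == 0 else outside[-1] + 1
        return first_stable / frame_rate           # seconds

    rng = np.random.default_rng(0)
    f0_track = 120 + 10 * rng.standard_normal(3000)   # 30 s of voiced frames at 100 fps
    print(stabilisation_time(f0_track))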

Stability of Prosodic Characteristics Across Age and Gender Groups

Jan Volín 1, Tereza Tykalová 2, Tomáš Boril 1; 1Charles University, Czech Republic; 2CTU, Czech Republic
Thu-P-9-3-5, Time: 10:00–12:00

The indexical function of speech prosody signals the membership of a speaker in a social group. The factors of age and gender are relatively easy to establish but their reflection in speech characteristics can be less straightforward as they interact with other social aspects. Therefore, diverse speaker communities should be investigated with the aim of their subsequent comparison. Our study provides data for the population of adult speakers of Czech, a West Slavic language of Central Europe. The sample consists of six age groups (20 to 80 years of age) with balanced representation of gender. The search for age and gender related attributes covered both global acoustic descriptors and linguistically informed prosodic feature extraction. Apart from commonly used measures and methods we also exploited Legendre polynomials, k-means clustering and a newly designed Cumulative Slope Index (CSI). The results specify general deceleration of articulation rate with age and lowering of F0 in aging Czech women, and reveal an increase in CSI of both F0 tracks and intensity curves with age. Furthermore, various melodic shapes were found to be distributed unequally across the age groups.

Electrophysiological Correlates of Familiar Voice Recognition

Julien Plante-Hébert 1, Victor J. Boucher 1, Boutheina Jemel 2; 1Université de Montréal, Canada; 2Hôpital Rivière-des-Prairies, Canada
Thu-P-9-3-6, Time: 10:00–12:00

Our previous work using voice lineups has established that listeners can recognize with near-perfect accuracy the voice of familiar individuals. From a forensic perspective, however, there are limitations to the application of voice lineups in that some witnesses may not wish to recognize the familiar voice of a parent or close friend, or may provide unreliable responses. Considering this problem, the present study aimed to isolate the electrophysiological markers of voice familiarity. We recorded the evoked response potentials (ERPs) of 11 participants as they listened to a set of similar voices producing varying utterances (the standards of voice lineups were used in selecting the voices). Within the presented set, only one voice was familiar to the listener (the voice of a parent, close friend, etc.). The ERPs showed a marked difference for heard familiar voices compared to the unfamiliar set. These are the first findings of a neural marker of voice recognition based on voices that are actually familiar to a listener and which take into account utterances rather than isolated vowels. The present results thus indicate that protocols of near-perfect voice recognition can be devised without using behavioral responses.

Developing an Embosi (Bantu C25) Speech Variant Dictionary to Model Vowel Elision and Morpheme Deletion

Jamison Cooper-Leavitt 1, Lori Lamel 1, Annie Rialland 2, Martine Adda-Decker 1, Gilles Adda 1; 1LIMSI, France; 2LPP (UMR 7018), France
Thu-P-9-3-7, Time: 10:00–12:00

This paper investigates vowel elision and morpheme deletion in Embosi (Bantu C25), an under-resourced language spoken in the Republic of Congo. We propose that the observed morpheme deletion is morphological, and that vowel elision is phonological. The study focuses on vowel elision that occurs across word boundaries between the contact of long/short vowels (i.e. CV[long] # V[short].CV), and between the contact of short/short vowels (CV[short] # V[short].CV). Several different categories of morphemes are explored: (i) prepositions (ya, mo), (ii) class-noun nominal prefixes (ba, etc.), (iii) singular subject pronouns (ngá, nO, wa). For example, the preposition ya regularly deletes, allowing for vowel elision if vowel contact occurs between the head of the noun phrase and the previous word. Phonetically motivated speech variants are proposed in the lexicon used for forced alignment (segmentation), enabling these phenomena to be quantified in the corpus so as to develop a dictionary containing relevant phonetic variants.

Rd as a Control Parameter to Explore Affective Correlates of the Tense-Lax Continuum

Andy Murphy, Irena Yanushevskaya, Ailbhe Ní Chasaide, Christer Gobl; Trinity College Dublin, Ireland
Thu-P-9-3-8, Time: 10:00–12:00

This study uses the Rd glottal waveshape parameter to simulate the phonatory tense-lax continuum and to explore its affective correlates in terms of activation and valence. Based on a natural utterance which was inverse filtered and source-parameterised, a range of synthesized stimuli varying along the tense-lax continuum were generated using Rd as a control parameter. Two additional stimuli were included, which were versions of the most lax stimulus with additional creak (lax-creaky voice). In a listening test, participants chose an emotion from a set of affective labels and indicated its perceived strength. They also indicated the naturalness of the stimulus and their confidence in their judgment. Results showed that stimuli at the tense end of the range were most frequently associated with angry, at the lax end of the range the association was with sad, and in the intermediate range the association was with content. Results also indicate, as was found in our earlier work, that a particular stimulus can be associated with more than one affect. Overall these results show that Rd can be used as a single control parameter to generate variation along the tense-lax continuum of phonation.


Cross-Linguistic Distinctions Between Professional and Non-Professional Speaking Styles

Plínio A. Barbosa 1, Sandra Madureira 2, Philippe Boula de Mareüil 3; 1Universidade Estadual de Campinas, Brazil; 2Universidade de São Paulo, Brazil; 3LIMSI, France
Thu-P-9-3-9, Time: 10:00–12:00

This work investigates acoustic and perceptual differences in four language varieties by using a corpus of professional and non-professional speaking styles. The professional stimuli are composed of excerpts of broadcast news and political discourses from six subjects in each case. The non-professional stimuli are made up of recordings of 10 subjects who read a long story and narrated it subsequently. All this material was obtained in four language varieties: Brazilian and European Portuguese, standard French and German. The corpus is balanced for gender. Eight melodic and intensity parameters were automatically obtained from excerpts of 10 to 20 seconds. We showed that 6 out of 8 parameters partially distinguish professional from non-professional style in the four language varieties. Classification and discrimination tests carried out with 12 Brazilian listeners using delexicalised speech showed that these subjects are able to distinguish professional style from non-professional style with about 2/3 of hits irrespective of language. In comparison, an automatic classification using an LDA model performed better in classifying non-professional (96%) against professional styles, but not in classifying professional (42%) against non-professional styles.

Perception and Production of Word-Final /ʁ/ in French

Cedric Gendrot; LPP (UMR 7018), France
Thu-P-9-3-10, Time: 10:00–12:00

Variability of (French) /ʁ/ is a frequently studied phenomenon showing that /ʁ/ can have multiple realizations. In French, all these studies were undertaken using small read corpora, and we have reason to believe that these corpora do not allow us to see the full picture. Indeed, factors such as local word frequency, as well as speech rate, can have almost as much influence as phonemic context on the realization of /ʁ/.

According to Ohala's Aerodynamic Voicing principle, /ʁ/ would tend to be either an unvoiced fricative or a voiced approximant. We chose to analyze word-final /ʁ/s as they tend to embrace the largest spectrum of variation. The study presented here is two-fold: first, a perception study in a specific phonemic context, between /a/ and /l/, where /ʁ/ is realized as an approximant, so as to better understand the parameters and their thresholds necessary for /ʁ/ identification, and to provide a measure of rhoticity.

In a second step, keeping the rhoticity measurement in mind, we analyzed the realizations of word-final /ʁ/ in two continuous speech corpora and modelled the realization of /ʁ/ using predictors such as diphone and digram frequency, phonemic context and speech rate.

Glottal Source Estimation from Coded Telephone Speech Using a Deep Neural Network

N.P. Narendra, Manu Airaksinen, Paavo Alku; Aalto University, Finland
Thu-P-9-3-11, Time: 10:00–12:00

In speech analysis, information about the glottal source is obtained from speech by using glottal inverse filtering (GIF). The accuracy of state-of-the-art GIF methods is sufficiently high when the input speech signal is of high quality (i.e., with little noise or reverberation). However, in realistic conditions, particularly when GIF is computed from coded telephone speech, the accuracy of GIF methods deteriorates severely. To robustly estimate the glottal source under coded conditions, a deep neural network (DNN)-based method is proposed. The proposed method utilizes a DNN to map the speech features extracted from the coded speech to the glottal flow waveform estimated from the corresponding clean speech. To generate the coded telephone speech, the adaptive multi-rate (AMR) codec, a widely used speech compression method, is utilized. The proposed glottal source estimation method is compared with two existing GIF methods, closed phase covariance analysis (CP) and iterative adaptive inverse filtering (IAIF). The results indicate that the proposed DNN-based method is capable of estimating glottal flow waveforms from coded telephone speech with considerably better accuracy than CP and IAIF.

Automatic Labelling of Prosodic Prominence, Phrasing and Disfluencies in French Speech by Simulating the Perception of Naïve and Expert Listeners

George Christodoulides, Mathieu Avanzi, Anne Catherine Simon; Université catholique de Louvain, Belgium
Thu-P-9-3-12, Time: 10:00–12:00

We explore the use of machine learning techniques (notably SVM classifiers and Conditional Random Fields) to automate the prosodic labelling of French speech, based on modelling and simulating the perception of prosodic events by naïve and expert listeners. The models are based on previous work on the perception of syllabic prominence and hesitation-related disfluencies, and on an experiment on the real-time perception of prosodic boundaries. Expert and non-expert listeners annotated samples from three multi-genre corpora (CPROM, CPROM-PFC, LOCAS-F). Automatic prosodic annotation is approached as a sequence labelling problem, drawing on multiple information sources (acoustic features, lexical and shallow syntactic features) in accordance with the experimental findings showing that listeners integrate all such information in their perception of prosodic segmentation and events. We test combinations of features and machine learning methods, and we compare the automatic labelling with expert annotation. The result of this study is a tool that automatically annotates prosodic events by simulating the perception of expert and naïve listeners.

Don’t Count on ASR to Transcribe for You: Breaking Bias with Two Crowds

Michael Levit, Yan Huang, Shuangyu Chang, Yifan Gong; Microsoft, USA
Thu-P-9-3-13, Time: 10:00–12:00

A crowdsourcing approach for collecting high-quality speech transcriptions is presented. The approach addresses a typical weakness of traditional semi-supervised transcription strategies that show ASR hypotheses to transcribers to help them cope with unclear or ambiguous audio and to speed up transcription. We explain how the traditional methods introduce bias into the transcriptions that makes it difficult to objectively measure system improvements against existing baselines, and suggest a two-stage crowdsourcing alternative that first iteratively collects transcription hypotheses and then asks a different crowd to pick the best of them. We show that this alternative not only outperforms the traditional method in a side-by-side comparison, but also leads to ASR improvements due to the superior quality of acoustic and language models trained on the transcribed data.


Effects of Training Data Variety in Generating Glottal Pulses from Acoustic Features with DNNs

Manu Airaksinen, Paavo Alku; Aalto University, Finland
Thu-P-9-3-14, Time: 10:00–12:00

The glottal volume velocity waveform, the acoustical excitation of voiced speech, cannot be acquired through direct measurements in normal production of continuous speech. Glottal inverse filtering (GIF), however, can be used to estimate the glottal flow from recorded speech signals. Unfortunately, the usefulness of GIF algorithms is limited since they are sensitive to noise and call for high-quality recordings. Recently, efforts have been taken to expand the use of GIF by training deep neural networks (DNNs) to learn a statistical mapping between frame-level acoustic features and glottal pulses estimated by GIF. This framework has been successfully utilized in statistical speech synthesis in the form of the GlottDNN vocoder, which uses a DNN to generate glottal pulses to be used as the synthesizer's excitation waveform. In this study, we investigate how the DNN-based generation of glottal pulses is affected by training data variety. The evaluation is done using both objective measures as well as subjective listening tests of synthetic speech. The results suggest that the performance of glottal pulse generation with DNNs is affected particularly by how well the training corpus suits GIF: processing low-pitched male speech and sustained phonations shows better performance than processing high-pitched female voices or continuous speech.

Towards Intelligent Crowdsourcing for Audio Data Annotation: Integrating Active Learning in the Real World

Simone Hantke, Zixing Zhang, Björn Schuller; Universität Passau, Germany
Thu-P-9-3-15, Time: 10:00–12:00

In this contribution, we combine the advantages of traditional crowdsourcing with contemporary machine learning algorithms with the aim of ultimately obtaining reliable training data for audio processing in a faster, cheaper and therefore more efficient manner than has been previously possible. We propose a novel crowdsourcing approach, which brings a simulated active learning annotation scenario into a real world environment, creating an intelligent and gamified crowdsourcing platform for manual audio annotation. Our platform combines two active learning query strategies with an internally calculated trustability score to efficiently reduce manual labelling efforts. This reduction is achieved in a twofold manner: first, our system automatically decides if an instance requires annotation; second, it dynamically decides, depending on the quality of previously gathered annotations, on exactly how many annotations are needed to reliably label an instance. Results presented indicate that our approach drastically reduces the annotation load and is considerably more efficient than conventional methods.

Thu-P-9-4 : Speech Synthesis: Data, Evaluation, and Novel Paradigms
Poster 4, 10:00–12:00, Thursday, 24 Aug. 2017
Chair: Sébastien Le Maguer

Principles for Learning Controllable TTS from Annotated and Latent Variation

Gustav Eje Henter, Jaime Lorenzo-Trueba, Xin Wang, Junichi Yamagishi; NII, Japan
Thu-P-9-4-1, Time: 10:00–12:00

For building flexible and appealing high-quality speech synthesisers, it is desirable to be able to accommodate and reproduce fine variations in vocal expression present in natural speech. Synthesisers can enable control over such output properties by adding adjustable control parameters in parallel to their text input. If not annotated in training data, the values of these control inputs can be optimised jointly with the model parameters. We describe how this established method can be seen as approximate maximum likelihood and MAP inference in a latent variable model. This puts previous ideas of (learned) synthesiser inputs such as sentence-level control vectors on a more solid theoretical footing. We furthermore extend the method by restricting the latent variables to orthogonal subspaces via a sparse prior. This enables us to learn dimensions of variation present also within classes in coarsely annotated speech. As an example, we train an LSTM-based TTS system to learn nuances in emotional expression from a speech database annotated with seven different acted emotions. Listening tests show that our proposal successfully can synthesise speech with discernible differences in expression within each emotion, without compromising the recognisability of synthesised emotions compared to an identical system without learned nuances.

Sampling-Based Speech Parameter Generation Using Moment-Matching Networks

Shinnosuke Takamichi 1, Tomoki Koriyama 2, Hiroshi Saruwatari 1; 1University of Tokyo, Japan; 2Tokyo Institute of Technology, Japan
Thu-P-9-4-2, Time: 10:00–12:00

This paper presents sampling-based speech parameter generation using moment-matching networks for Deep Neural Network (DNN)-based speech synthesis. Although people never produce exactly the same speech even if we try to express the same linguistic and para-linguistic information, typical statistical speech synthesis produces completely the same speech, i.e., there is no inter-utterance variation in synthetic speech. To give synthetic speech natural inter-utterance variation, this paper builds DNN acoustic models that make it possible to randomly sample speech parameters. The DNNs are trained so that they make the moments of generated speech parameters close to those of natural speech parameters. Since the variation of speech parameters is compressed into a low-dimensional simple prior noise vector, our algorithm has lower computation cost than direct sampling of speech parameters. As the first step towards generating synthetic speech that has natural inter-utterance variation, this paper investigates whether or not the proposed sampling-based generation deteriorates synthetic speech quality. In evaluation, we compare the speech quality of conventional maximum likelihood-based generation and the proposed sampling-based generation. The result demonstrates that the proposed generation causes no degradation in speech quality.
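
One way to picture a moment-matching training criterion is a loss that penalises differences between the moments of generated and natural parameter batches. The sketch below matches only per-dimension means and variances; moment-matching networks are commonly trained with kernel-based maximum mean discrepancy, so this low-order version is purely illustrative and all sizes are assumptions.

    # Hedged sketch: low-order moment-matching loss for a noise-driven generator.
    import torch
    import torch.nn as nn

    def moment_matching_loss(generated, natural):
        # generated, natural: (batch, dim) speech-parameter vectors
        mean_term = torch.mean((generated.mean(0) - natural.mean(0)) ** 2)
        var_term = torch.mean((generated.var(0) - natural.var(0)) ** 2)
        return mean_term + var_term

    torch.manual_seed(0)
    generator = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 40))
    noise = torch.randn(64, 16)                  # low-dimensional prior noise vectors
    natural = torch.randn(64, 40)                # stand-in for natural speech parameters
    loss = moment_matching_loss(generator(noise), natural)
    loss.backward()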


Unit Selection with Hierarchical Cascaded Long Short Term Memory Bidirectional Recurrent Neural Nets

Vincent Pollet 1, Enrico Zovato 2, Sufian Irhimeh 1, Pier Batzu 2; 1Nuance Communications, Belgium; 2Nuance Communications, Italy
Thu-P-9-4-3, Time: 10:00–12:00

Bidirectional recurrent neural nets have demonstrated state-of-the-art performance for parametric speech synthesis. In this paper, we introduce a top-down application of recurrent neural net models to unit-selection synthesis. A hierarchical cascaded network graph predicts context phone durations, speech unit encodings and frame-level logF0 information that serve as targets for the search of units. The new approach is compared with an existing state-of-the-art hybrid system that uses Hidden Markov Models as the basis for the statistical unit search.

Utterance Selection for Optimizing Intelligibility of TTS Voices Trained on ASR Data

Erica Cooper 1, Xinyue Wang 1, Alison Chang 2, Yocheved Levitan 1, Julia Hirschberg 1; 1Columbia University, USA; 2Google, USA
Thu-P-9-4-4, Time: 10:00–12:00

This paper describes experiments in training HMM-based text-to-speech (TTS) voices on data collected for Automatic Speech Recognition (ASR) training. We compare a number of filtering techniques designed to identify the best utterances from a noisy, multi-speaker corpus for training voices, to exclude speech containing noise and to include speech close in nature to more traditionally-collected TTS corpora. We also evaluate the use of automatic speech recognizers for intelligibility assessment in comparison with crowdsourcing methods. While the goal of this work is to develop natural-sounding and intelligible TTS voices in Low Resource Languages (LRLs) rapidly and easily, without the expense of recording data specifically for this purpose, we focus on English initially to identify the best filtering techniques and evaluation methods. We find that, when a large amount of data is available, selecting from the corpus based on criteria such as standard deviation of f0, fast speaking rate, and hypo-articulation produces the most intelligible voices.

Bias and Statistical Significance in Evaluating Speech Synthesis with Mean Opinion Scores

Andrew Rosenberg, Bhuvana Ramabhadran; IBM, USA
Thu-P-9-4-5, Time: 10:00–12:00

Listening tests and Mean Opinion Scores (MOS) are the most commonly used techniques for the evaluation of speech synthesis quality and naturalness. These are invaluable in the assessment of subjective qualities of machine-generated stimuli. However, there are a number of challenges in understanding the MOS scores that come out of listening tests.

Primarily, we advocate for the use of non-parametric statistical tests in the calculation of statistical significance when comparing listening test results.

Additionally, based on the results of 46 legacy listening tests, we measure the impact of two sources of bias. Bias introduced by individual participants and by the synthesized text can have a dramatic impact on observed MOS scores. For example, we find that on average the mean difference between the highest and lowest scoring rater is over 2 MOS points (on a 5-point scale). From this observation, we caution against using any statistical test without adjusting for this bias, and provide specific non-parametric recommendations.
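
As an illustration of the recommendation above, the sketch below compares two sets of MOS ratings with a non-parametric test instead of assuming normally distributed scores; the toy ratings are invented.

    # Hedged sketch: non-parametric significance test on MOS ratings.
    import numpy as np
    from scipy.stats import mannwhitneyu

    system_a = np.array([4, 4, 3, 5, 4, 3, 4, 5, 4, 3], dtype=float)  # MOS ratings, system A
    system_b = np.array([3, 3, 4, 3, 2, 4, 3, 3, 4, 3], dtype=float)  # MOS ratings, system B

    stat, p_value = mannwhitneyu(system_a, system_b, alternative="two-sided")
    print(f"U = {stat:.1f}, p = {p_value:.4f}")
    # For paired designs (the same sentences rated for both systems),
    # scipy.stats.wilcoxon on per-item differences is the analogous choice.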

Phase Modeling Using Integrated Linear Prediction Residual for Statistical Parametric Speech Synthesis

Nagaraj Adiga, S.R. Mahadeva Prasanna; IIT Guwahati, India
Thu-P-9-4-6, Time: 10:00–12:00

Conventional statistical parametric speech synthesis (SPSS) focuses on characteristics of the magnitude spectrum of speech and ignores the phase characteristics of speech. In this work, the role of phase information in improving the naturalness of synthetic speech is explored. The phase characteristics of the excitation signal are estimated from the integrated linear prediction residual (ILPR) using an all-pass (AP) filter. The coefficients of the AP filter are estimated by minimizing an entropy-based objective function computed from the cosine phase of the analytic signal obtained from the ILPR signal. The AP filter coefficients (APCs) are used as features for modeling phase in SPSS. At synthesis time, to generate the excitation signal, frame-wise generated APCs are used to add group delay to the impulse excitation. The proposed method is compared with the group-delay-based phase excitation used in the STRAIGHT method. The experimental results show that the proposed phase modeling gives better perceptual synthesis quality than the STRAIGHT method.

Evaluation of a Silent Speech Interface Based on Magnetic Sensing and Deep Learning for a Phonetically Rich Vocabulary

Jose A. Gonzalez 1, Lam A. Cheah 2, Phil D. Green 1, James M. Gilbert 2, Stephen R. Ell 3, Roger K. Moore 1, Ed Holdsworth 4; 1University of Sheffield, UK; 2University of Hull, UK; 3Hull and East Yorkshire Hospitals Trust, UK; 4Practical Control, UK
Thu-P-9-4-7, Time: 10:00–12:00

To help people who have lost their voice following total laryngectomy, we present a speech restoration system that produces audible speech from articulator movement. The speech articulators are monitored by sensing changes in magnetic field caused by movements of small magnets attached to the lips and tongue. Then, articulator movement is mapped to a sequence of speech parameter vectors using a transformation learned from simultaneous recordings of speech and articulatory data. In this work, this transformation is performed using a type of recurrent neural network (RNN) with fixed latency, which is suitable for real-time processing. The system is evaluated on a phonetically-rich database with simultaneous recordings of speech and articulatory data made by non-impaired subjects. Experimental results show that our RNN-based mapping obtains more accurate speech reconstructions (evaluated using objective quality metrics and a listening test) than articulatory-to-acoustic mappings using Gaussian mixture models (GMMs) or deep neural networks (DNNs). Moreover, our fixed-latency RNN architecture provides comparable performance to an utterance-level batch mapping using bidirectional RNNs (BiRNNs).

Predicting Head Pose from Speech with a Conditional Variational Autoencoder

David Greenwood, Stephen Laycock, Iain Matthews; University of East Anglia, UK
Thu-P-9-4-8, Time: 10:00–12:00

Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution visual cues make to the degree to which we, as human observers, find an animation acceptable.

Rigid head motion is one visual mode that universally co-occurs with speech, and so it is a reasonable strategy to seek a transformation from the speech mode to predict the head pose. Several previous authors have shown that prediction is possible, but experiments are typically confined to rigidly produced dialogue. Natural, expressive, emotive and prosodic speech exhibits motion patterns that are far more difficult to predict, with considerable variation in expected head pose.

Recently, Long Short Term Memory (LSTM) networks have become an important tool for modelling speech and natural language tasks. We employ Deep Bi-Directional LSTMs (BLSTM), capable of learning long-term structure in language, to model the relationship that speech has with rigid head motion. We then extend our model by conditioning with prior motion. Finally, we introduce a generative head motion model, conditioned on audio features, using a Conditional Variational Autoencoder (CVAE). Each approach mitigates the problems of the one-to-many mapping that a speech-to-head-pose model must accommodate.
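
A conditional VAE of the kind mentioned above can be sketched compactly: an encoder maps audio features plus the observed pose to a latent distribution, and a decoder maps a latent sample plus the audio features back to a pose, so that at generation time different latent samples yield different plausible poses for the same speech. Dimensions, priors and training details below are assumptions, not the paper's architecture.

    # Hedged sketch: conditional VAE mapping audio features + latent code to head pose.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HeadPoseCVAE(nn.Module):
        def __init__(self, audio_dim=26, pose_dim=3, latent_dim=8, hidden=128):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(audio_dim + pose_dim, hidden), nn.ReLU())
            self.to_mu = nn.Linear(hidden, latent_dim)
            self.to_logvar = nn.Linear(hidden, latent_dim)
            self.dec = nn.Sequential(nn.Linear(latent_dim + audio_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, pose_dim))

        def forward(self, audio, pose):
            h = self.enc(torch.cat([audio, pose], dim=-1))
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
            return self.dec(torch.cat([z, audio], dim=-1)), mu, logvar

    def cvae_loss(recon, pose, mu, logvar):
        rec = F.mse_loss(recon, pose)
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kld

    model = HeadPoseCVAE()
    audio, pose = torch.randn(32, 26), torch.randn(32, 3)   # toy batch
    recon, mu, logvar = model(audio, pose)
    cvae_loss(recon, pose, mu, logvar).backward()
    # At generation time: sample z ~ N(0, I) and decode it together with new audio features.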

Real-Time Reactive Speech Synthesis: Incorporating Interruptions

Mirjam Wester, David A. Braude, Blaise Potard, Matthew P. Aylett, Francesca Shaw; CereProc, UK
Thu-P-9-4-9, Time: 10:00–12:00

The ability to be interrupted and react in a realistic manner is a key requirement for interactive speech interfaces. While previous systems have long implemented techniques such as ‘barge in’ where speech output can be halted at word or phrase boundaries, less work has explored how to mimic human speech output responses to real-time events like interruptions which require a reaction from the system. Unlike previous work which has focused on incremental production, here we explore a novel re-planning approach. The proposed system is versatile and offers a large range of possible ways to react. A focus group was used to evaluate the approach, where participants interacted with a system reading out a text. The system would react to audio interruptions, either with no reactions, passive reactions, or active negative reactions (i.e. getting increasingly irritated). Participants preferred a reactive system.

A Neural Parametric Singing Synthesizer

Merlijn Blaauw, Jordi Bonada; Universitat Pompeu Fabra, Spain
Thu-P-9-4-10, Time: 10:00–12:00

We present a new model for singing synthesis based on a modified version of the WaveNet architecture. Instead of modeling raw waveform, we model features produced by a parametric vocoder that separates the influence of pitch and timbre. This allows conveniently modifying pitch to match any target melody, facilitates training on more modest dataset sizes, and significantly reduces training and generation times. Our model makes frame-wise predictions using mixture density outputs rather than categorical outputs in order to reduce the required parameter count. As we found overfitting to be an issue with the relatively small datasets used in our experiments, we propose a method to regularize the model and make the autoregressive generation process more robust to prediction errors. Using a simple multi-stream architecture, harmonic, aperiodic and voiced/unvoiced components can all be predicted in a coherent manner. We compare our method to existing parametric statistical and state-of-the-art concatenative methods using quantitative metrics and a listening test. While naive implementations of the autoregressive generation algorithm tend to be inefficient, using a smart algorithm we can greatly speed up the process and obtain a system that's competitive in both speed and quality.

Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang 1, R.J. Skerry-Ryan 1, Daisy Stanton 1, Yonghui Wu 1, Ron J. Weiss 1, Navdeep Jaitly 1, Zongheng Yang 1, Ying Xiao 1, Zhifeng Chen 1, Samy Bengio 1, Quoc Le 1, Yannis Agiomyrgiannakis 2, Rob Clark 2, Rif A. Saurous 1; 1Google, USA; 2Google, UK
Thu-P-9-4-11, Time: 10:00–12:00

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.

Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System

Tim Capes, Paul Coles, Alistair Conkie, Ladan Golipour, Abie Hadjitarkhani, Qiong Hu, Nancy Huddleston, Melvyn Hunt, Jiangchuan Li, Matthias Neeracher, Kishore Prahallad, Tuomo Raitio, Ramya Rasipuram, Greg Townsend, Becci Williamson, David Winarsky, Zhizheng Wu, Hepeng Zhang; Apple, USA
Thu-P-9-4-12, Time: 10:00–12:00

This paper describes Apple's hybrid unit selection speech synthesis system, which provides the voices for Siri with the requirements of naturalness, personality and expressivity. It has been deployed into hundreds of millions of desktop and mobile devices (e.g. iPhone, iPad, Mac, etc.) via iOS and macOS in multiple languages. The system follows the classical unit selection framework with the advantage of using deep learning techniques to boost performance. In particular, deep and recurrent mixture density networks are used to predict the target and concatenation reference distributions for the respective costs during unit selection. In this paper, we present an overview of the run-time TTS engine and the voice building process. We also describe various techniques that enable on-device capability, such as preselection optimization, caching for low latency, and unit pruning for low footprint, as well as techniques that improve the naturalness and expressivity of the voice, such as the use of long units.

An Expanded Taxonomy of Semiotic Classes for Text Normalization

Daan van Esch, Richard Sproat; Google, USA
Thu-P-9-4-13, Time: 10:00–12:00

We describe an expanded taxonomy of semiotic classes for text normalization, building upon the work in [1]. We add a large number of categories of non-standard words (NSWs) that we believe a robust real-world text normalization system will have to be able to process. Our new categories are based upon empirical findings encountered while building text normalization systems across many languages, for both speech recognition and speech synthesis purposes. We believe our new taxonomy is useful both for ensuring high coverage when writing manual grammars, as well as for eliciting training data to build machine learning-based text normalization systems.


Complex-Valued Restricted Boltzmann Machine for Direct Learning of Frequency Spectra

Toru Nakashika 1, Shinji Takaki 2, Junichi Yamagishi 2; 1University of Electro-Communications, Japan; 2NII, Japan
Thu-P-9-4-14, Time: 10:00–12:00

In this paper, we propose a new energy-based probabilistic model in which a restricted Boltzmann machine (RBM) is extended to deal with complex-valued visible units. The RBM, which automatically learns the relationships between visible units and hidden units (but has no connections within the visible or the hidden units), has been widely used as a feature extractor, a generator, a classifier, for pre-training of deep neural networks, etc. However, all conventional RBMs have assumed the visible units to be either binary-valued or real-valued, and therefore complex-valued data cannot be fed to an RBM.

In various applications, however, complex-valued data is frequently used; examples include complex spectra of speech, fMRI images, wireless signals, and acoustic intensity. For the direct learning of such complex-valued data, we define a new model called the "complex-valued RBM (CRBM)", in which the conditional probability of the complex-valued visible units given the hidden units forms a complex-Gaussian distribution. Another important characteristic of the CRBM is that it has connections between the real and imaginary parts of each visible unit, unlike the conventional real-valued RBM. Our experiments demonstrate that the proposed CRBM can directly encode complex spectra of speech signals without decoupling the imaginary parts or phase from the complex-valued data.

Thu-S&T-9/10-A : Show & Tell 7
E306, 10:00–12:00, 13:30–15:30, Thursday, 24 Aug. 2017

Soundtracing for Realtime Speech Adjustment to Environmental Conditions in 3D Simulations

Bartosz Ziółko, Tomasz Pedzimaz, Szymon Pałka; AGH UST, Poland
Thu-S&T-9-A-1, Time: 10:00–12:00

We present a 3D realtime audio engine which utilizes frustum tracing to create realistic audio auralization, modifying speech in architectural walkthroughs. All audio effects are computed based on both the geometrical (e.g. walls, furniture) and acoustical scene properties (e.g. materials, air attenuation). The sound changes dynamically as we change the point of perception and the sound sources. The engine can be configured to use as little as 10 percent of available processing power. Our demonstration will be based on listening to radio samples in rooms with similar shape but different acoustical properties. The described system is a component of a virtual reality trainer for firefighters using Oculus Rift. It allows users to conduct dialogues with victims and to locate them based on sound cues.

Vocal-Tract Model with Static Articulators: Lips, Teeth, Tongue, and More

Takayuki Arai; Sophia University, Japan
Thu-S&T-9-A-2, Time: 10:00–12:00

Our physical models of the human vocal tract successfully demonstrate theories such as the source-filter theory of speech production, mechanisms such as the relationship between vocal-tract configuration and vowel quality, and phenomena such as formant frequency estimation. Earlier models took one of two directions: either simplification, showing only a few target themes, or diversification, simulating human articulation more broadly. In this study, we have designed a static, hybrid model. Each model of this type produces one vowel. However, the model also simulates the human articulators more broadly, including the lips, teeth, and tongue. The sagittal block is enclosed with transparent plates so that the inside of the vocal tract is visible from the outside. We also colored the articulators to make them more easily identified. In testing, we confirmed that the vocal-tract models can produce the target vowel. These models have great potential, with applications not only in acoustics and phonetics education, but also pronunciation training in language learning and speech therapy in the clinical setting.

Remote Articulation Test System Based on WebRTC

Ikuyo Masuda-Katsuse; Kindai University, Japan
Thu-S&T-9-A-3, Time: 10:00–12:00

A remote articulation test system with multimedia communication has been developed so that outside speech-language-hearing therapists (STs) can examine the pronunciation of students in special education classes in regular elementary schools and give advice to their teachers. The proposed system has video and voice communication and image transmission functions based on WebRTC. Using image transmission, the ST presents picture cards for the word test to the student and asks what is depicted. Using video/voice communication, the ST confirms the student's voice and articulation movement. Compared to our previous system, in which written words were presented, the proposed system enables a more formal and accurate articulation test.

The ModelTalker Project: A Web-Based Voice Banking Pipeline for ALS/MND Patients

H. Timothy Bunnell, Jason Lilley, Kathleen McGrath; Nemours Biomedical Research, USA
Thu-S&T-9-A-4, Time: 10:00–12:00

The Nemours ModelTalker supports voice banking for users diagnosed with ALS/MND and related neurodegenerative diseases. Users record up to 1600 sentences from which a synthetic voice is constructed. For the past two years we have focused on extending and refining a web-based recording tool to support this process. In this demonstration, we illustrate the features of the web-based pipeline that guides patients through the process of setting up to record at home, recording a standard speech inventory, adding custom recordings, and screening alternative versions of their voice and alternative synthesis parameter settings. Finally, we summarize results from 352 individuals with a wide range of speaking ability, who have recently used this voice banking pipeline.

Visible Vowels: A Tool for the Visualization of Vowel Variation

Wilbert Heeringa, Hans Van de Velde; Fryske Akademy, The Netherlands
Thu-S&T-9-A-5, Time: 10:00–12:00

This paper presents Visible Vowels, a web app that visualizes variation in f0, formants and duration. It combines user friendliness with maximum functionality and flexibility, using a live plot view.
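As a toy illustration of the kind of view such a tool offers (not the Visible Vowels app itself), the sketch below draws a conventional F1-F2 vowel chart with matplotlib; the formant values are rough textbook averages used only as example data.

```python
# Toy illustration (not the Visible Vowels app itself): the classic F1-F2
# vowel plot that this kind of tool produces, with reversed axes so that
# vowel height/backness match the usual phonetic layout. The formant values
# below are rough textbook averages used only as example data.
import matplotlib.pyplot as plt

vowels = {            # vowel: (F1 Hz, F2 Hz) -- illustrative values
    "i": (270, 2290),
    "u": (300, 870),
    "a": (730, 1090),
    "ae": (660, 1720),
}

fig, ax = plt.subplots()
for label, (f1, f2) in vowels.items():
    ax.scatter(f2, f1)
    ax.annotate(label, (f2, f1), textcoords="offset points", xytext=(5, 5))
ax.invert_xaxis()                 # F2 decreases to the right (front -> back)
ax.invert_yaxis()                 # F1 increases downward (high -> low vowels)
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
ax.set_title("F1-F2 vowel space (toy data)")
plt.show()
```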


Author Index

A
Aare, Kätlin . . . . . . . . . . . . . Tue-O-3-6-1 118

Tue-P-4-3-2 143Abad, Alberto . . . . . . . . . . . Wed-P-7-4-9 205AbdAlmageed, Wael . . . . Tue-O-4-8-2 127Abdo, Omnia . . . . . . . . . . . Thu-O-10-4-6 232Abdou, Sherif . . . . . . . . . . . Thu-O-10-4-6 232Abe, Masanobu . . . . . . . . . Wed-P-8-4-5 215Abidi, K. . . . . . . . . . . . . . . . . . Thu-O-10-4-1 231Abraham, Basil . . . . . . . . . Mon-P-2-3-8 105

Wed-SS-7-1-9 164Wed-SS-7-1-10 164

Wed-O-8-8-4 186Achanta, Sivanand . . . . . . Mon-SS-2-8-6 80

Wed-O-8-1-1 182Acheson, Daniel J. . . . . . . Mon-P-2-2-7 102Adda, Gilles . . . . . . . . . . . . . Thu-P-9-3-7 238Adda-Decker, Martine . . Mon-SS-1-11-6 78

Wed-O-7-4-3 178Thu-O-10-8-3 233

Thu-P-9-3-7 238Adi, Yossi . . . . . . . . . . . . . . . Tue-O-3-6-5 119

Wed-O-7-8-6 181Adiga, Nagaraj . . . . . . . . . . Tue-P-5-2-4 150

Wed-O-6-4-3 170Thu-P-9-4-6 241

Agiomyrgiannakis, Y. . . . Tue-O-4-1-6 123Thu-P-9-4-11 242

Agrawal, Dharmesh M. . Wed-P-7-3-15 203Agrawal, Purvi . . . . . . . . . . Wed-O-7-2-1 177Aguilar, Lourdes . . . . . . . . Tue-O-5-8-6 135Agurto, Carla . . . . . . . . . . . Wed-P-7-4-6 205Ahmad, W. . . . . . . . . . . . . . . Wed-O-6-10-2 174Ahmed, Farhia . . . . . . . . . . Wed-S&T-6-B-2 217Ahn, ChungHyun . . . . . . . Mon-P-1-2-5 95Aihara, Ryo . . . . . . . . . . . . . Wed-P-8-4-3 214Airaksinen, Manu . . . . . . . Thu-P-9-3-11 239

Thu-P-9-3-14 240Ajili, Moez . . . . . . . . . . . . . . Wed-P-6-2-9 193Akhtiamov, Oleg . . . . . . . . Wed-O-7-6-4 180Akiba, Tomoyosi . . . . . . . Wed-P-6-3-5 195Akira, Hayakawa . . . . . . . . Wed-P-8-2-9 211Alam, Hassan . . . . . . . . . . . Wed-S&T-6-B-5 218Alam, Jahangir . . . . . . . . . . Tue-O-5-2-5 131

Tue-P-3-1-9 138Thu-O-10-2-5 231

Alcorn, Alyssa M. . . . . . . . Tue-SS-3-11-4 111Aldeneh, Zakaria . . . . . . . Tue-O-3-10-3 121

Tue-O-3-10-5 121Tue-O-4-8-4 127

Aleksic, Petar . . . . . . . . . . . Mon-P-1-4-1 96Alexander, Rachel . . . . . . Mon-O-1-10-4 86Algra, Jouke . . . . . . . . . . . . Mon-SS-1-11-1 77Al Hanai, Tuka . . . . . . . . . . Wed-O-7-10-6 182Ali, Ahmed . . . . . . . . . . . . . . Wed-O-7-10-6 182Alku, Paavo . . . . . . . . . . . . . Tue-O-5-4-2 132

Tue-O-5-4-3 132Tue-P-3-1-8 138Wed-P-8-1-8 207Wed-P-8-4-7 215

Thu-P-9-3-11 239Thu-P-9-3-14 240

Allen, James . . . . . . . . . . . . Tue-K2-1 110Alluri, K.N.R.K. Raju . . . . Mon-SS-2-8-6 80

Wed-O-8-1-1 182Almeida, Andre . . . . . . . . . Mon-O-1-6-5 85Alonso, Agustin . . . . . . . . Thu-SS-10-10-4 222Al-Radhi, Mohammed S. Mon-P-1-1-6 93Alumäe, Tanel . . . . . . . . . . Wed-SS-7-1-12 165Alwan, Abeer . . . . . . . . . . . Mon-P-1-2-3 95

Tue-P-3-1-10 138Thu-O-10-2-1 230

Amatuni, Andrei . . . . . . . . Wed-SS-6-11-3 162Thu-SS-9-10-1 219Thu-SS-9-10-4 219

Amazouz, Djegdjiga . . . . Mon-SS-1-11-6 78Ambati, Bharat Ram . . . . Wed-P-6-1-10 190Ambikairajah, E. . . . . . . . . Mon-O-2-4-2 89

Tue-O-4-8-3 127Tue-P-3-1-6 137

Wed-O-7-10-4 182Wed-O-8-1-3 183Wed-P-6-2-2 191

Ambrazaitis, Gilbert . . . . Wed-P-8-1-13 208

Amiriparian, Shahin . . . . Tue-SS-3-11-4 111Thu-SS-10-10-2 222Thu-SS-10-10-3 222

Amman, Scott . . . . . . . . . . Wed-P-7-3-9 202An, Maobo . . . . . . . . . . . . . . Mon-SS-2-8-2 79Ananthapadmanabha, T Mon-P-2-2-4 101Andersen, Asger H. . . . . . Wed-P-6-4-6 197Anderson, Hans . . . . . . . . Wed-P-7-3-1 200Anderson, Peter . . . . . . . . Thu-SS-9-11-3 221Ando, Atsushi . . . . . . . . . . Tue-P-4-3-12 145Ando, Hiroshi . . . . . . . . . . . Tue-O-3-6-4 119André, Elisabeth . . . . . . . . Thu-SS-9-10-7 220Andreeva, Bistra . . . . . . . . Wed-P-7-2-5 199Andrei, Valentin . . . . . . . . Tue-O-4-4-5 125Anjos, André . . . . . . . . . . . Tue-S&T-3-A-2 158Antoniou, Mark . . . . . . . . . Mon-P-2-1-4 99Aono, Yushi . . . . . . . . . . . . . Tue-O-3-10-2 121

Tue-P-4-3-12 145Arai, Jun . . . . . . . . . . . . . . . . Tue-SS-5-11-4 114Arai, Takayuki . . . . . . . . . . Tue-O-3-2-3 116

Thu-S&T-9-A-2 243Araki, Shoko . . . . . . . . . . . . Wed-P-6-4-3 197Arantes, Pablo . . . . . . . . . . Thu-P-9-3-4 238Ardaillon, Luc . . . . . . . . . . . Tue-O-4-10-5 128Arias-Vergara, Tomás . . Tue-P-5-2-7 150Arık, Sercan Ö. . . . . . . . . . . Tue-P-4-1-5 141Ariki, Yasuo . . . . . . . . . . . . . Wed-P-8-4-3 214

Wed-P-8-4-8 215Arimoto, Yoshiko . . . . . . . Wed-P-8-2-7 210Arnela, Marc . . . . . . . . . . . . Thu-SS-9-11-2 220

Thu-SS-9-11-5 221Thu-SS-9-11-6 221

Arora, Raman . . . . . . . . . . . Tue-O-3-2-4 116Arora, Vipul . . . . . . . . . . . . . Tue-O-5-8-4 134Arsikere, Harish . . . . . . . . Wed-O-6-10-4 175Asadiabadi, Sasan . . . . . . Mon-P-2-2-3 101Asaei, Afsaneh . . . . . . . . . . Thu-O-9-1-5 224Asami, Taichi . . . . . . . . . . . Mon-P-2-4-3 106

Tue-P-4-3-1 143Askjær-Jørgensen, T. . . . Tue-P-5-1-2 146Astésano, Corine . . . . . . . Wed-O-7-1-5 176Astolfi, Arianna . . . . . . . . . Tue-P-5-2-2 149Astudillo, Ramon F. . . . . Tue-SS-5-11-8 114

Wed-P-8-4-11 216Asu, Eva Liina . . . . . . . . . . . Tue-O-5-6-6 134Athanasopoulou, A. . . . . Tue-O-5-6-3 133Atkins, David C. . . . . . . . . Wed-P-8-2-2 209Atsushi, Ando . . . . . . . . . . Tue-O-3-10-2 121Audhkhasi, Kartik . . . . . . Mon-O-1-1-5 81

Tue-O-3-1-5 115Audibert, Nicolas . . . . . . . Tue-SS-5-11-2 113Avanzi, Mathieu . . . . . . . . Thu-P-9-3-12 239Aylett, Matthew P. . . . . . . Mon-O-1-10-5 86

Thu-P-9-4-9 242Ayllón, David . . . . . . . . . . . Mon-O-1-4-5 83

B
Baby, Arun . . . . . . . . . . . . . . Wed-S&T-6-A-5 217

Thu-O-10-11-4 234Bacchiani, Michiel . . . . . . Mon-O-2-10-1 91

Mon-O-2-10-5 92Tue-P-4-2-3 142

Bach, Francis . . . . . . . . . . . . Wed-P-7-2-13 200Bäckström, Tom . . . . . . . . Mon-O-2-4-6 90Badin, Pierre . . . . . . . . . . . . Wed-O-6-1-2 169Badino, Leonardo . . . . . . . Tue-O-3-2-4 116

Wed-O-6-6-4 172Bagby, Tom . . . . . . . . . . . . . Tue-P-4-2-3 142Baggott, Matthew J. . . . . . Wed-P-7-4-6 205Bahmaninezhad, F. . . . . . Tue-O-5-2-4 131Bai, Linxue . . . . . . . . . . . . . . Mon-O-2-4-1 89Baird, Alice . . . . . . . . . . . . . Tue-SS-3-11-4 111

Wed-P-8-2-1 209Thu-SS-10-10-3 222

Baker, Justin T. . . . . . . . . . Wed-P-8-2-3 209Balog, András . . . . . . . . . . . Wed-SS-6-2-2 160Baltrušaitis, Tadas . . . . . . Wed-P-8-2-3 209Bandini, Andrea . . . . . . . . Mon-P-2-2-14 104

Tue-P-5-2-3 149Bando, Yoshiaki . . . . . . . . Wed-O-7-2-2 177Bang, Jeong-Uk . . . . . . . . . Wed-P-6-3-12 196Banno, Hideki . . . . . . . . . . . Mon-P-1-1-4 93

Tue-O-5-4-1 131Bao, Changchun . . . . . . . . Tue-P-5-3-7 153

Thu-O-9-6-2 227Bao, Feng . . . . . . . . . . . . . . . . Tue-P-5-3-7 153Bapna, Ankur . . . . . . . . . . . Wed-O-7-4-1 178

Barbosa, Plínio A. . . . . . . . Thu-P-9-3-9 239Barker, Jon . . . . . . . . . . . . . . Mon-P-1-1-2 93

Tue-P-5-4-8 156Wed-O-7-2-5 177

Barlaz, Marissa . . . . . . . . . Mon-O-1-6-1 84Barra-Chicote, Roberto . Tue-O-3-8-2 120Barras, Claude . . . . . . . . . . Thu-O-9-2-5 225

Thu-O-10-11-6 235Barriere, Valentin . . . . . . . Tue-O-5-10-3 135Baskar, Murali Karthick Mon-O-2-1-6 87

Mon-P-2-3-4 105Wed-SS-6-2-1 160

Batista, Fernando . . . . . . . Tue-SS-5-11-8 114Batliner, Anton . . . . . . . . . Tue-SS-3-11-4 111

Wed-P-8-2-1 209Thu-SS-9-10-1 219Thu-SS-9-10-2 219Thu-SS-9-10-3 219

Thu-SS-10-10-8 223Batzu, Pier . . . . . . . . . . . . . . Thu-P-9-4-3 241Baucom, Brian . . . . . . . . . . Wed-P-8-2-10 211

Wed-P-8-2-11 211Bauer, Josef . . . . . . . . . . . . . Tue-S&T-3-A-5 158Baumann, Timo . . . . . . . . . Wed-SS-8-11-9 169Bayer, Ali Orkan . . . . . . . . Wed-O-7-6-3 179Beare, Richard . . . . . . . . . . Wed-P-7-2-7 199Beaufays, Françoise . . . . Tue-O-5-1-1 129

Wed-O-7-8-5 181Beck, Eugen . . . . . . . . . . . . . Tue-O-3-1-2 115Beckman, Mary E. . . . . . . . Tue-P-5-1-4 146Bedi, Gillinder . . . . . . . . . . Wed-P-7-4-6 205Beerends, John . . . . . . . . . Wed-P-6-4-1 196Beke, András . . . . . . . . . . . . Thu-SS-9-10-5 219Belinkov, Yonatan . . . . . . Wed-O-7-10-6 182Bell, Peter . . . . . . . . . . . . . . . Mon-P-2-3-10 106

Mon-S&T-2-A-5 109Wed-P-6-3-10 196

Beneš, Karel . . . . . . . . . . . . . Mon-O-2-1-6 87Bengio, Samy . . . . . . . . . . . Thu-P-9-4-11 242Bengio, Yoshua . . . . . . . . . Tue-O-5-1-3 129

Wed-O-6-10-6 175Ben Jannet, Mohamed A. Wed-O-7-4-3 178Ben Kheder, Waad . . . . . . Tue-P-5-2-6 150

Wed-P-6-2-9 193Benuš, Štefan . . . . . . . . . . . Wed-O-6-6-2 171

Wed-P-7-2-9 199Bergelson, Elika . . . . . . . . . Mon-S&T-2-A-4 108

Wed-SS-6-11-3 162Wed-SS-6-11-4 162Thu-SS-9-10-1 219Thu-SS-9-10-4 219

Bergmann, Christina . . . Tue-S&T-3-A-6 158Wed-SS-6-11-5 162

Berisha, Visar . . . . . . . . . . . Tue-P-5-2-1 149Tue-P-5-2-9 151

Berman, Alex . . . . . . . . . . . Wed-S&T-6-B-1 217Bertero, Dario . . . . . . . . . . . Mon-S&T-2-B-4 109Berthelsen, Harald . . . . . . Tue-O-3-8-5 120

Wed-SS-7-1-1 163Bertoldi, Nicola . . . . . . . . . Wed-O-8-4-3 184Besacier, Laurent . . . . . . . Wed-SS-7-1-6 164Beskow, Jonas . . . . . . . . . . Tue-SS-3-11-5 111

Tue-O-3-8-5 120Best, Catherine T. . . . . . . . Mon-O-1-6-3 84

Wed-P-7-2-2 198Betz, Simon . . . . . . . . . . . . . Tue-O-3-8-6 120Bhat, Chitralekha . . . . . . . Mon-S&T-2-A-6 109

Tue-P-5-2-10 151Bhati, Saurabhchand . . . Wed-SS-7-1-5 163Bhattacharya, Gautam . . Tue-O-5-2-5 131

Tue-P-3-1-9 138Thu-O-10-2-5 231

Biadsy, Fadi . . . . . . . . . . . . . Mon-O-2-1-1 86Wed-O-8-10-1 187Wed-O-8-10-4 188

Bian, Tianling . . . . . . . . . . . Mon-P-2-4-8 107Bin Siddique, Farhad . . . Wed-P-8-2-6 210Birkholz, Peter . . . . . . . . . . Mon-P-1-1-3 93

Tue-O-5-10-5 136Wed-SS-8-11-8 168

Bishop, Judith . . . . . . . . . . Tue-O-4-6-5 126Bjerva, Johannes . . . . . . . . Tue-P-5-1-13 148Björkenstam, Kristina N. Wed-SS-7-11-3 166B.K., Dhanush . . . . . . . . . . . Wed-P-6-2-12 193Blaauw, Merlijn . . . . . . . . . Thu-P-9-4-10 242Black, Alan W. . . . . . . . . . . Mon-SS-1-11-4 78

Mon-SS-1-11-5 78Wed-P-8-4-11 216


Blackburn, Daniel . . . . . . . Wed-P-7-4-7 205Blaylock, Reed . . . . . . . . . . Wed-O-6-1-3 169

Wed-O-6-1-5 170Bocklet, Tobias . . . . . . . . . Tue-S&T-3-A-5 158Boë, Louis-Jean . . . . . . . . . Wed-O-6-1-2 169Boenninghoff, Benedikt Wed-SS-7-1-7 164Bohn, Ocke-Schwen . . . . . Tue-P-5-1-2 146Bollepalli, Bajibabu . . . . . Tue-O-5-4-3 132

Wed-P-8-4-7 215Bölte, Sven . . . . . . . . . . . . . . Mon-O-2-2-5 88Bonada, Jordi . . . . . . . . . . . Thu-P-9-4-10 242Bonafonte, Antonio . . . . . Thu-O-9-6-5 227Bonastre, Jean-François Wed-P-6-2-8 192

Wed-P-6-2-9 193Bone, Daniel . . . . . . . . . . . . Mon-O-1-2-2 82

Wed-O-8-10-5 188Borgström, Bengt J. . . . . . Tue-P-3-2-5 139Boril, Tomáš . . . . . . . . . . . . Wed-P-7-2-6 199

Thu-P-9-3-5 238Borský, Michal . . . . . . . . . . Mon-O-2-2-4 88Bosker, Hans Rutger . . . . Mon-P-2-1-6 99

Wed-SS-8-11-2 167Wed-O-7-1-1 175

Botros, Noor . . . . . . . . . . . . Wed-SS-6-11-2 161Bouchekif, Abdessalam Wed-P-6-3-11 196Boucher, Victor J. . . . . . . . Thu-P-9-3-6 238Bouillon, Pierrette . . . . . . Wed-S&T-6-B-2 217Boula de Mareüil, P. . . . . Thu-P-9-3-9 239Bourlard, Hervé . . . . . . . . . Mon-P-2-3-3 104

Thu-O-9-1-5 224Bousquet, Pierre-Michel Tue-P-3-2-3 139Boves, L. . . . . . . . . . . . . . . . . Tue-O-4-2-3 123Braginsky, Mika . . . . . . . . . Tue-S&T-3-A-6 158Brakel, Philemon . . . . . . . Tue-O-5-1-3 129Brandt, Erika . . . . . . . . . . . . Wed-P-7-2-5 199Brattain, Laura J. . . . . . . . Mon-P-2-2-13 103Braude, David A. . . . . . . . . Thu-P-9-4-9 242Braun, Bettina . . . . . . . . . . Tue-O-4-6-4 126

Tue-O-5-6-1 133Braunschweiler, N. . . . . . . Wed-P-8-4-9 215Bredin, Hervé . . . . . . . . . . . Thu-O-9-2-5 225

Thu-O-9-2-6 225Thu-O-10-11-6 235

Brooks, Elizabeth . . . . . . . Wed-P-6-1-10 190Brueckner, Raymond . . . Mon-O-2-2-5 88

Wed-O-6-8-4 173Bruguier, Antoine . . . . . . Wed-O-7-8-5 181Brümmer, Niko . . . . . . . . . Tue-O-3-4-2 117

Tue-O-5-2-5 131Tue-P-3-1-1 136

Bruni, Jagoda . . . . . . . . . . . Thu-O-10-8-4 233Brusco, Pablo . . . . . . . . . . . Wed-O-6-6-6 172Brutti, Alessio . . . . . . . . . . Mon-P-2-3-5 105Bryan, Craig J. . . . . . . . . . . Wed-P-8-2-11 211Bryhadyr, Nataliya . . . . . . Wed-P-8-1-11 208Buçinca, Zana . . . . . . . . . . . Tue-SS-3-11-3 111Buck, Markus . . . . . . . . . . . Wed-O-6-4-5 171Budzianowski, Paweł . . . Tue-P-4-3-13 145Buera, Luis . . . . . . . . . . . . . . Tue-O-3-4-6 118

Tue-O-5-2-5 131Bulling, Philipp . . . . . . . . . Mon-O-1-4-2 83Bullock, Barbara E. . . . . . . Mon-SS-1-11-7 78Bunnell, H. Timothy . . . . Tue-P-5-2-11 151

Thu-S&T-9-A-4 243Bunt, Harry . . . . . . . . . . . . . Mon-O-1-2-1 81Burchfield, L. Ann . . . . . . Mon-P-2-1-4 99Burget, Lukáš . . . . . . . . . . . Mon-O-2-1-6 87

Mon-P-2-3-4 105Tue-P-3-2-7 140Tue-P-3-2-8 140

Wed-SS-6-2-1 160Thu-O-10-1-2 229

Burileanu, Corneliu . . . . . Tue-O-4-4-5 125Burmania, Alec . . . . . . . . . Mon-O-1-2-3 82Busa-Fekete, Róbert . . . . Thu-SS-10-10-5 222Busso, Carlos . . . . . . . . . . . Mon-O-1-2-3 82

Tue-O-3-10-4 121Tue-P-5-3-14 154

Byrd, Dani . . . . . . . . . . . . . . Mon-P-2-2-5 102Byun, Jun . . . . . . . . . . . . . . . Wed-O-8-6-5 185

C
Cabarrão, Vera . . . . . . . . . . Tue-SS-5-11-8 114
Cabral, João Paulo . . . . . . Mon-O-1-10-1 85
Cai, Danwei . . . . . . . . . . . . . Mon-SS-1-8-4 76

Thu-SS-9-10-6 219Cai, Lianhong . . . . . . . . . . . Mon-P-2-4-6 107

Tue-O-4-8-1 126Wed-P-8-4-10 216

Cai, Weicheng . . . . . . . . . . . Mon-SS-1-8-4 76Thu-SS-9-10-6 219

Camelin, Nathalie . . . . . . . Wed-P-6-3-11 196Wed-P-8-3-6 212

Cameron, Hugh . . . . . . . . . Wed-SS-7-1-2 163Campbell, Nick . . . . . . . . . Wed-O-6-8-6 174

Wed-P-8-2-9 211Wed-P-8-3-12 214

Campbell, William . . . . . . Tue-O-5-2-2 130Campos, Joana . . . . . . . . . . Tue-SS-5-11-8 114Can, Dogan . . . . . . . . . . . . . Wed-P-8-2-2 209Candeias, Sara . . . . . . . . . . Tue-O-5-8-5 135

Wed-P-6-1-3 189Cano, María José . . . . . . . . Mon-SS-1-8-2 76Cao, Beiming . . . . . . . . . . . . Mon-O-1-10-6 86

Wed-P-6-1-7 190Cao, Chong . . . . . . . . . . . . . Mon-P-2-2-6 102Cao, Yuhang . . . . . . . . . . . . Tue-P-5-3-6 152Cao, Zhanzhong . . . . . . . . Tue-P-5-3-2 152Capes, Tim . . . . . . . . . . . . . . Thu-P-9-4-12 242Carignan, Christopher . . Mon-O-1-6-3 84

Wed-P-7-2-2 198Carlson, Jason C. . . . . . . . Tue-O-5-10-1 135Caroselli, Joe . . . . . . . . . . . . Mon-O-2-10-5 92

Thu-P-9-1-10 237Carullo, Alessio . . . . . . . . . Tue-P-5-2-2 149Casanueva, Iñigo . . . . . . . Tue-P-4-3-13 145Caseiro, Diamantino . . . . Wed-O-8-10-1 187

Wed-O-8-10-4 188Casillas, Marisa . . . . . . . . . Wed-SS-6-11-3 162

Wed-SS-6-11-4 162Thu-SS-9-10-1 219Thu-SS-9-10-4 219

Castan, Diego . . . . . . . . . . . Tue-O-5-10-1 135Wed-P-6-2-1 191

Thu-O-10-2-6 231Castellana, Antonella . . . Tue-P-5-2-2 149Cau, Cecile . . . . . . . . . . . . . . Wed-P-8-1-4 207Caucheteux, Lise . . . . . . . . Thu-O-10-8-2 233Cecchi, Guillermo A. . . . . Wed-P-7-4-6 205Cernak, Milos . . . . . . . . . . . Tue-S&T-3-A-2 158Cernocký, Jan . . . . . . . . . . . Mon-P-2-3-4 105

Tue-P-3-2-7 140Tue-P-3-2-8 140

Thu-O-10-1-2 229Cha, Jih-Ho . . . . . . . . . . . . . Wed-P-8-1-1 206Chaabouni, Rahma . . . . . Wed-SS-7-11-6 167Chakrabarti, Indrajit . . . Thu-O-9-6-1 227Chambers, Craig G. . . . . . Tue-SS-4-11-2 112Chaminade, Thierry . . . . Tue-SS-3-11-6 111Champagne-Lavau, M. . . Wed-P-8-1-4 207Chan, William . . . . . . . . . . . Tue-O-3-1-3 115Chandrashekaran, A. . . . Mon-P-1-4-7 97Chandu, Khyathi R. . . . . . Mon-SS-1-11-5 78Chang, Alison . . . . . . . . . . . Thu-P-9-4-4 241Chang, Shiyu . . . . . . . . . . . . Tue-P-5-4-13 157Chang, Shuangyu . . . . . . . Thu-P-9-3-13 239Chang, Shuo-Yiin . . . . . . . Tue-P-5-3-8 153

Thu-O-10-11-3 234Chang, Xuankai . . . . . . . . . Wed-O-7-2-3 177Charlet, Delphine . . . . . . . Wed-P-6-3-11 196

Thu-O-9-2-3 225Charonyktakis, Paulos . Wed-O-8-8-5 187Chasaide, Ailbhe Ní . . . . Tue-O-4-8-5 127Cheah, Lam A. . . . . . . . . . . Thu-P-9-4-7 241Chelba, Ciprian . . . . . . . . . Wed-O-8-10-4 188Chen, Aoju . . . . . . . . . . . . . . Mon-P-2-2-16 104

Wed-SS-8-11-6 168Chen, Berlin . . . . . . . . . . . . . Wed-P-6-3-4 195

Thu-O-9-1-6 224Thu-O-9-6-4 227

Chen, Bo . . . . . . . . . . . . . . . . Mon-P-2-4-8 107Mon-P-2-4-9 107

Chen, Bo-Rui . . . . . . . . . . . . Tue-P-5-3-13 154Chen, Chen . . . . . . . . . . . . . Tue-P-3-2-1 139Chen, Chin-Po . . . . . . . . . . Wed-O-6-8-2 173Chen, Deming . . . . . . . . . . Tue-P-5-3-9 153Chen, Fei . . . . . . . . . . . . . . . . Mon-P-2-1-1 98

Mon-P-2-1-2 98

Chen, Hsuan-Yu . . . . . . . . Wed-P-8-2-4 210Chen, I-Fan . . . . . . . . . . . . . . Wed-O-7-2-6 178Chen, Jinhui . . . . . . . . . . . . Wed-P-8-4-8 215Chen, Kuan-Yu . . . . . . . . . . Wed-P-6-3-4 195

Thu-O-9-1-6 224Chen, Nancy F. . . . . . . . . . . Mon-P-2-3-7 105

Wed-SS-7-11-1 166Wed-P-6-1-5 189

Wed-P-6-1-11 190Chen, Si . . . . . . . . . . . . . . . . . Tue-P-5-1-6 147Chen, Siyuan . . . . . . . . . . . . Mon-O-2-4-2 89Chen, Wenda . . . . . . . . . . . . Wed-P-6-1-11 190Chen, X. . . . . . . . . . . . . . . . . . Mon-O-2-1-3 87Chen, Yafan . . . . . . . . . . . . . Wed-P-6-4-4 197Chen, Ying . . . . . . . . . . . . . . Tue-P-5-1-8 147Chen, Ying-Wen . . . . . . . . . Wed-P-6-3-4 195Chen, Yixiang . . . . . . . . . . . Mon-SS-2-8-3 79

Tue-P-3-2-2 139Chen, Yun-Nung . . . . . . . . Wed-P-6-3-6 195Chen, Zhifeng . . . . . . . . . . . Wed-O-8-4-1 184

Thu-P-9-4-11 242Chen, Zhipeng . . . . . . . . . . Thu-O-9-4-1 225Chen, Zhuo . . . . . . . . . . . . . Thu-O-9-6-3 227Chen, Zhuxin . . . . . . . . . . . Mon-SS-2-8-5 80Cheng, Gaofeng . . . . . . . . . Tue-P-4-1-1 140Cheng, Zuofu . . . . . . . . . . . Tue-P-5-3-9 153Chennupati, Nivedita . . . Tue-S&T-3-A-4 158Chi, Tai-Shih . . . . . . . . . . . . Mon-O-1-4-6 84Chien, Jen-Tzung . . . . . . . Tue-O-4-4-4 125

Tue-O-5-1-4 129Thu-O-10-1-1 229

Chien, Yu-Ren . . . . . . . . . . Mon-O-2-2-4 88Chikhi, Samy . . . . . . . . . . . . Mon-S&T-2-B-1 109Child, Rewon . . . . . . . . . . . Tue-P-4-1-5 141Chin, Kean . . . . . . . . . . . . . . Mon-O-2-10-1 91

Mon-O-2-10-5 92Ching, P.C. . . . . . . . . . . . . . . Wed-O-8-8-1 186Chng, Eng Siong . . . . . . . . Tue-P-5-3-5 152Cho, Eunah . . . . . . . . . . . . . Wed-O-8-4-5 184Choi, Ikkyu . . . . . . . . . . . . . Tue-O-5-8-3 134

Wed-P-6-1-4 189Choi, Inkyu . . . . . . . . . . . . . Mon-P-1-2-6 95Choi, Mu-Yeol . . . . . . . . . . . Wed-P-6-3-12 196Chong, Chee Seng . . . . . . Mon-P-2-1-5 99Choo, Kihyun . . . . . . . . . . . Tue-P-5-4-5 155Chorowski, Jan . . . . . . . . . Mon-P-1-4-4 97

Wed-O-8-4-1 184Chowdhury, Shreyan . . . Wed-P-7-3-12 202Christensen, Heidi . . . . . . Wed-P-7-4-7 205Christensen, Mads G. . . . Mon-O-2-2-1 87Christodoulides, G. . . . . . Thu-P-9-3-12 239Chung, Cheng-Tao . . . . . . Thu-O-10-11-5 235Chung, Joon Son . . . . . . . . Wed-O-8-1-5 183Church, Kenneth W. . . . . Thu-O-9-4-5 226Chwalek, Patrick C. . . . . . Mon-P-2-2-13 103Clark, Rob . . . . . . . . . . . . . . . Thu-P-9-4-11 242Clavel, Chloé . . . . . . . . . . . . Tue-O-5-10-3 135C.M., Vikram . . . . . . . . . . . . Tue-P-5-2-4 150Cmejla, Roman . . . . . . . . . Tue-P-5-2-8 150Coates, Adam . . . . . . . . . . . Tue-P-4-1-5 141Cohen, Yishai . . . . . . . . . . . Thu-O-9-2-4 225Colbath, Sean . . . . . . . . . . . Tue-S&T-3-A-1 158Coles, Paul . . . . . . . . . . . . . . Thu-P-9-4-12 242Colett, Hannah R. . . . . . . . Tue-S&T-3-A-5 158Colibro, Daniele . . . . . . . . Tue-O-5-2-3 130Collins, Zachary . . . . . . . . Wed-P-8-3-4 212Conkie, Alistair . . . . . . . . . Thu-P-9-4-12 242Conlan, Owen . . . . . . . . . . . Wed-O-6-8-6 174Cooke, Martin . . . . . . . . . . . Wed-O-8-8-5 187Cooper, Erica . . . . . . . . . . . Thu-P-9-4-4 241Cooper-Leavitt, J. . . . . . . . Thu-P-9-3-7 238Corris, Miriam . . . . . . . . . . Tue-O-4-6-5 126Cortes, Elísabet Eir . . . . . Tue-P-5-1-15 149Cowan, Benjamin R. . . . . Mon-O-1-10-1 85Cristia, Alejandrina . . . . . Mon-S&T-2-A-4 108

Tue-S&T-3-A-6 158Wed-SS-6-11-4 162Wed-SS-6-11-5 162Wed-SS-6-11-6 162Wed-SS-7-11-2 166

Crook, Paul . . . . . . . . . . . . . Tue-P-4-3-10 145Csapó, Tamás Gábor . . . Mon-P-1-1-6 93

Thu-O-9-8-5 229Cuayáhuitl, Heriberto . . Wed-O-7-6-2 179Cucchiarini, Catia . . . . . . . Wed-O-8-8-2 186Cucu, Horia . . . . . . . . . . . . . Tue-O-4-4-5 125Cui, Jia . . . . . . . . . . . . . . . . . . Thu-O-10-1-4 230


Cui, Xiaodong . . . . . . . . . . . Mon-O-1-1-3 81Mon-O-1-1-5 81

Cumani, Sandro . . . . . . . . Tue-O-5-2-3 130Cummins, Nicholas . . . . . Mon-O-2-2-5 88

Tue-SS-3-11-4 111Tue-S&T-3-B-3 159

Wed-P-7-4-5 204Wed-P-8-2-1 209Wed-P-8-2-5 210

Thu-SS-10-10-2 222Thu-SS-10-10-3 222

Cutler, Anne . . . . . . . . . . . . Mon-P-2-1-4 99Tue-O-4-6-3 126Tue-O-5-6-1 133

D
Dabbaghchian, Saeed . . . Thu-SS-9-11-2 220

Thu-SS-9-11-6 221Dai, Jia . . . . . . . . . . . . . . . . . . Mon-P-1-2-1 94Dai, Li-Rong . . . . . . . . . . . . . Tue-O-4-4-1 124

Wed-O-7-10-2 182Thu-O-10-1-3 229

d’Alessandro, C. . . . . . . . . Tue-SS-4-11-1 112Dalmasso, Emanuele . . . Tue-O-5-2-3 130Damnati, Géraldine . . . . . Wed-P-6-3-11 196Dandapat, Samarendra . Tue-O-3-6-3 119Dang, Jianwu . . . . . . . . . . . Wed-O-7-1-6 176Dang, Ting . . . . . . . . . . . . . . Tue-O-4-8-3 127Daniel, Adrien . . . . . . . . . . Tue-S&T-3-B-1 159Das, Amit . . . . . . . . . . . . . . . Wed-SS-6-2-5 161Das, Biswajit . . . . . . . . . . . . Tue-P-5-2-10 151Das, Rohan Kumar . . . . . Mon-SS-1-8-5 77

Wed-P-6-2-12 193Dasgupta, Hirak . . . . . . . . Mon-O-2-4-5 89D’Ausilio, Alessandro . . Wed-O-6-6-4 172Davel, Marelie . . . . . . . . . . . Wed-SS-7-1-14 165Davies, William J. . . . . . . . Wed-P-6-4-5 197Davis, Chris . . . . . . . . . . . . . Mon-P-2-1-5 99Dawalatabad, Nauman . Wed-O-7-1-3 176Dean, David . . . . . . . . . . . . . Tue-P-3-2-10 140Deena, Salil . . . . . . . . . . . . . Wed-O-8-10-2 187Deepak, K.T. . . . . . . . . . . . . Mon-P-1-2-2 94Degirmenci, Niyazi C. . . Thu-SS-9-11-5 221de Haan, Jan Mark . . . . . . Wed-P-6-4-6 197Dehak, Najim . . . . . . . . . . . Mon-O-2-2-3 88

Tue-O-3-4-2 117Tue-O-5-2-2 130

Dehak, Reda . . . . . . . . . . . . Tue-O-5-2-2 130Deisher, Michael . . . . . . . . Tue-S&T-3-A-5 158Delalez, Samuel . . . . . . . . . Tue-SS-4-11-1 112Delcroix, Marc . . . . . . . . . . Mon-O-2-10-2 91

Tue-O-4-4-2 124Tue-P-4-1-3 141Tue-P-4-1-4 141Tue-P-5-4-3 155

Wed-O-8-6-2 185Thu-P-9-1-5 236

Delgado, Héctor . . . . . . . . Mon-SS-1-8-1 76Del Giudice, Max . . . . . . . . Tue-O-4-2-1 123Delhay, Arnaud . . . . . . . . . Wed-P-8-2-8 210Delvaux, Véronique . . . . . Thu-O-10-8-2 233Demberg, Vera . . . . . . . . . . Thu-O-10-4-4 232Demolin, Didier . . . . . . . . . Wed-O-6-1-1 169De Mori, Renato . . . . . . . . Wed-P-8-3-6 212Demuth, Katherine . . . . . Mon-O-1-6-5 85Derrick, Donald . . . . . . . . . Wed-P-7-2-2 198Destefano, Chelle . . . . . . . Wed-S&T-6-B-2 217de Wit, Harriet . . . . . . . . . . Wed-P-7-4-6 205Dey, Anik . . . . . . . . . . . . . . . Mon-S&T-2-B-4 109

Wed-S&T-6-B-4 217Dey, Subhadeep . . . . . . . . Tue-P-3-1-2 136D’Haro, L.F. . . . . . . . . . . . . . Mon-O-2-2-3 88Dhiman, Jitendra K. . . . . Mon-O-2-4-3 89

Wed-O-6-4-3 170Dhinakaran, Krupakar . . Wed-S&T-6-A-6 217Diakoloukas, Vassilios . Wed-O-7-4-2 178Diez Sánchez, Mireia . . . Tue-O-5-2-5 131

Tue-P-3-2-7 140Digalakis, Vassilios . . . . . Wed-O-7-4-2 178Di Gangi, Mattia A. . . . . . Wed-O-8-4-3 184Dighe, Pranay . . . . . . . . . . . Thu-O-9-1-5 224Dijkstra, Jelske . . . . . . . . . Mon-SS-1-11-1 77Dimitriadis, Dimitrios . . Mon-O-1-1-5 81

Tue-O-3-10-3 121Tue-O-4-8-4 127Wed-P-6-1-1 188

Dinarelli, Marco . . . . . . . . . Wed-O-7-4-4 178

Ding, Hongwei . . . . . . . . . . Mon-O-2-6-6 91Wed-P-7-2-12 200

Ding, Wan . . . . . . . . . . . . . . . Mon-O-1-2-5 82Do, Cong-Thanh . . . . . . . . Thu-P-9-1-1 235Do, Quoc Truong . . . . . . . Wed-O-8-4-4 184Do, Van Hai . . . . . . . . . . . . . Mon-P-2-3-7 105Docio-Fernandez, L. . . . . Wed-P-6-3-8 195

Wed-P-7-4-9 205Doddipatla, Rama . . . . . . Wed-P-8-4-9 215Dogil, Grzegorz . . . . . . . . . Thu-O-10-8-4 233Dohen, Marion . . . . . . . . . . Mon-O-1-2-6 82Dolatian, Hossep . . . . . . . Tue-O-5-6-3 133Domínguez, Mónica . . . . Wed-S&T-6-A-2 216Dong, Jing . . . . . . . . . . . . . . Wed-P-7-3-7 201Dong, Minghui . . . . . . . . . . Mon-O-1-2-5 82Donini, Michele . . . . . . . . . Tue-O-3-2-4 116Downing, Sylvia J. . . . . . . Tue-S&T-3-A-5 158Drager, Katie . . . . . . . . . . . . Mon-P-2-1-13 100Drake, Mark . . . . . . . . . . . . . Thu-O-9-4-5 226Draxler, Christoph . . . . . . Mon-S&T-2-B-2 109Dreyer, Markus . . . . . . . . . Wed-P-6-3-9 195Dreyfus, Gérard . . . . . . . . Mon-S&T-2-B-1 109Droppo, Jasha . . . . . . . . . . Mon-O-1-1-6 81Drude, Lukas . . . . . . . . . . . Mon-P-1-2-7 95

Wed-O-8-6-1 185Drugman, Thomas . . . . . . Tue-O-3-8-2 120Du, Jun . . . . . . . . . . . . . . . . . . Mon-O-2-10-4 92

Tue-O-4-4-1 124Duckhorn, Frank . . . . . . . Wed-S&T-6-B-3 217Duenser, Andreas . . . . . . Wed-O-8-8-3 186Duerichen, Robert . . . . . . Thu-O-10-11-1 234Dufour, Richard . . . . . . . . Wed-P-6-2-8 192Dunbar, Ewan . . . . . . . . . . . Wed-SS-7-11-6 167Dupoux, Emmanuel . . . . Mon-P-2-1-8 100

Wed-SS-7-11-2 166Wed-SS-7-11-6 167Wed-P-7-2-13 200

Duran, Daniel . . . . . . . . . . . Wed-SS-8-11-5 168Thu-O-10-8-4 233

Dutta, Indranil . . . . . . . . . . Mon-O-1-6-6 85Dyer, Chris . . . . . . . . . . . . . . Tue-O-3-1-4 115

E
Ebbers, Janek . . . . . . . . . . . Mon-P-1-2-7 95
Ebhotemhen, Eustace . . . Tue-SS-4-11-6 113
Edlund, Jens . . . . . . . . . . . . Wed-SS-7-11-5 167
Edwards, Jan . . . . . . . . . . . . Tue-P-5-1-4 146
Egorow, Olga . . . . . . . . . . . . Wed-O-6-8-1 173
Eig, Jonathan . . . . . . . . . . . Tue-P-5-2-1 149
Einspieler, Christa . . . . . . Mon-O-2-2-5 88
Eisner, Frank . . . . . . . . . . . . Mon-P-2-2-7 102
Ekström, Jenny . . . . . . . . . Tue-P-5-1-11 148
El Fakhri, Georges . . . . . . Wed-O-6-1-4 170
Elie, Benjamin . . . . . . . . . . Mon-O-1-6-2 84
Elizalde, Benjamin . . . . . . Tue-P-5-3-1 152
El-Khamy, Mostafa . . . . . . Tue-P-4-1-2 140
Ell, Stephen R. . . . . . . . . . . Thu-P-9-4-7 241
Elsner, Micha . . . . . . . . . . . Tue-P-5-1-1 146
El Yagoubi, Radouane . . Wed-O-7-1-5 176
Enarvi, Seppo . . . . . . . . . . . Wed-S&T-6-A-6 217
Engelbart, Mathis . . . . . . . Wed-P-7-4-1 203
Englebienne, Gwenn . . . . Tue-O-3-10-6 122
Engwall, Olov . . . . . . . . . . . Thu-SS-9-11-2 220

Thu-SS-9-11-6 221Enomoto, Mika . . . . . . . . . . Tue-P-4-3-5 144Epps, Julien . . . . . . . . . . . . . Mon-O-2-4-2 89

Tue-SS-3-11-1 110Tue-O-4-8-3 127Wed-P-6-2-2 191

Wed-P-8-2-12 211Eriksson, Anders . . . . . . . Wed-P-8-1-10 208

Thu-P-9-3-4 238Ernestus, M. . . . . . . . . . . . . . Tue-O-4-2-3 123Erzin, Engin . . . . . . . . . . . . . Mon-P-2-2-3 101

Tue-SS-3-11-3 111Tue-P-4-3-15 146

Escudero, Juan Pablo . . . Tue-SS-3-11-2 111Escudero-Mancebo, D. . . Tue-O-5-8-6 135Espic, Felipe . . . . . . . . . . . . Tue-O-5-4-6 132Espín, Juan M. . . . . . . . . . . Mon-SS-1-8-2 76Espy-Wilson, Carol . . . . . Tue-O-3-2-2 116

Tue-O-4-8-2 127Wed-P-7-4-2 204

Essid, Slim . . . . . . . . . . . . . . Tue-O-5-10-3 135Estebas-Vilaplana, Eva . . Tue-O-5-8-6 135Estève, Yannick . . . . . . . . . Wed-P-6-3-11 196

Wed-P-8-3-6 212

Evanini, Keelan . . . . . . . . . Tue-O-5-8-1 134Tue-O-5-8-3 134

Tue-P-4-3-11 145Wed-O-7-10-5 182

Wed-P-6-1-4 189Evans, Nicholas . . . . . . . . . Mon-SS-1-8-1 76Evers, Vanessa . . . . . . . . . . Tue-O-3-10-6 122Evert, Stefan . . . . . . . . . . . . Wed-P-7-4-10 205Ewald, Otto . . . . . . . . . . . . . Tue-O-5-6-6 134

Wed-P-8-1-13 208Eyben, Florian . . . . . . . . . . Tue-S&T-3-B-3 159

F
Fadiga, Luciano . . . . . . . . . Wed-O-6-6-4 172
Fainberg, Joachim . . . . . . Mon-P-2-3-10 106

Mon-S&T-2-A-5 109Falavigna, Daniele . . . . . . Mon-P-2-3-5 105Falck-Ytter, Terje . . . . . . . Mon-O-2-2-5 88Falik, Ohad . . . . . . . . . . . . . . Tue-S&T-3-A-5 158Fan, Ping . . . . . . . . . . . . . . . . Wed-P-7-4-12 206Farrell, Kevin . . . . . . . . . . . Tue-O-5-2-3 130Farrús, Mireia . . . . . . . . . . . Mon-S&T-2-A-1 108

Wed-P-8-1-6 207Wed-S&T-6-A-2 216

Fashal, Mervat . . . . . . . . . . Thu-O-10-4-6 232Fatima, Syeda Narjis . . . . Tue-P-4-3-15 146Fauth, Camille . . . . . . . . . . Mon-O-2-6-1 90Fayet, Cedric . . . . . . . . . . . . Wed-P-8-2-8 210Federico, Marcello . . . . . . Wed-O-8-4-3 184Fels, Sidney . . . . . . . . . . . . . Wed-O-6-1-4 170

Thu-SS-9-11-3 221Feng, Pengming . . . . . . . . . Wed-P-7-3-7 201Feng, Siyuan . . . . . . . . . . . . Wed-SS-6-2-4 160Feng, Xue . . . . . . . . . . . . . . . Wed-P-7-3-9 202Feng, Zhe . . . . . . . . . . . . . . . Wed-P-7-3-8 202

Thu-O-10-11-1 234Fernandez, Raul . . . . . . . . Mon-P-2-4-2 106Fernández Gallardo, L. . Tue-SS-5-11-3 113

Wed-SS-8-11-3 168Wed-P-6-4-1 196

Fernando, Sarith . . . . . . . . Wed-P-6-2-2 191Ferras, Marc . . . . . . . . . . . . . Tue-P-3-1-2 136Ferreira Netto, W. . . . . . . . Mon-P-2-1-7 99Ferrer, Luciana . . . . . . . . . . Wed-P-6-2-1 191

Thu-O-10-2-6 231Fingscheidt, Tim . . . . . . . . Mon-O-1-4-3 83Florêncio, Dinei . . . . . . . . . Tue-P-5-4-13 157

Wed-O-8-6-6 186Fonollosa, José A.R. . . . . Wed-P-6-2-7 192Fonseca, Nuno . . . . . . . . . . Wed-P-8-4-11 216Font, Roberto . . . . . . . . . . . Mon-SS-1-8-2 76Fotedar, Gaurav . . . . . . . . Mon-O-1-2-4 82Fougeron, Cécile . . . . . . . . Thu-O-10-8-3 233Fougner, Chris . . . . . . . . . . Tue-P-4-1-5 141Foulkes, Paul . . . . . . . . . . . . Thu-O-10-8-1 233

Thu-P-9-3-3 237Fousek, Petr . . . . . . . . . . . . . Wed-P-6-1-1 188Fox, Robert A. . . . . . . . . . . Mon-O-2-6-2 90Fraga-Silva, Thiago . . . . . Thu-SS-9-10-7 220Franceschi, Luca . . . . . . . . Tue-O-3-2-4 116Francois, Holly . . . . . . . . . . Tue-P-5-4-5 155Frank, Michael C. . . . . . . . Tue-S&T-3-A-6 158Franken, Matthias K. . . . Mon-P-2-2-7 102Franzen, Jan . . . . . . . . . . . . Mon-O-1-4-3 83Fredes, Josué . . . . . . . . . . . Tue-SS-3-11-2 111Fredouille, Corinne . . . . . Tue-P-5-2-6 150Freitag, Michael . . . . . . . . . Tue-SS-3-11-4 111

Thu-SS-10-10-2 222Thu-SS-10-10-3 222

Frej, Mohamed Yassine . Mon-O-1-6-3 84French, Peter . . . . . . . . . . . . Thu-P-9-3-3 237Freyne, Jill . . . . . . . . . . . . . . Wed-O-8-8-3 186Fridolin, Ivo . . . . . . . . . . . . . Wed-SS-7-1-12 165Fry, Michael . . . . . . . . . . . . . Wed-SS-6-11-1 161Fuchs, Robert . . . . . . . . . . . Wed-P-8-1-14 209Fuentes, Olac . . . . . . . . . . . Tue-O-5-10-1 135Fujimoto, Masakiyo . . . . Thu-P-9-1-2 235Fukuda, Takashi . . . . . . . . Mon-O-2-10-3 92

Tue-P-4-1-7 141Thu-O-10-1-4 230

Fukuoka, Ishin . . . . . . . . . . Mon-P-2-4-5 107Fung, Pascale . . . . . . . . . . . Mon-S&T-2-B-4 109

Wed-P-7-3-14 203Wed-P-8-2-6 210

Wed-S&T-6-B-4 217Funk, Riccarda . . . . . . . . . . Tue-SS-5-11-5 114Furuya, Ken’ichi . . . . . . . . Wed-O-7-2-4 177


G
Gale, William . . . . . . . . . . . . Wed-P-6-1-12 191
Gales, Mark J.F. . . . . . . . . . Mon-O-1-1-2 80

Mon-O-2-1-3 87Tue-P-4-1-6 141Wed-P-6-1-8 190

Galibert, Olivier . . . . . . . . . Wed-O-7-4-3 178Galindo, Luis Angel . . . . . Wed-O-6-6-5 172Gałka, Jakub . . . . . . . . . . . . Mon-SS-1-8-6 77Gálvez, Ramiro H. . . . . . . Wed-O-6-6-2 171Ganapathy, Sriram . . . . . . Wed-O-7-2-1 177

Wed-P-6-2-12 193Gangamohan, P. . . . . . . . . Wed-O-6-4-1 170Gangashetty, S.V. . . . . . . . Mon-SS-2-8-6 80

Wed-O-8-1-1 182Wed-P-7-2-11 200

Ganzeboom, Mario . . . . . Wed-O-8-8-2 186Gao, Guanglai . . . . . . . . . . . Tue-P-5-4-2 155Gao, Shengxiang . . . . . . . . Mon-SS-2-8-2 79Gao, Wei . . . . . . . . . . . . . . . . Tue-P-5-3-4 152Gao, Yixin . . . . . . . . . . . . . . . Thu-O-9-4-4 226Garcia, N. . . . . . . . . . . . . . . . Mon-O-2-2-3 88García, Paola . . . . . . . . . . . . Tue-O-3-4-6 118

Tue-O-5-2-5 131Garcia-Mateo, Carmen . . Wed-P-6-3-8 195

Wed-P-7-4-9 205Garcia-Romero, Daniel . Tue-O-3-4-1 117

Tue-P-3-2-4 139Garimella, Sri . . . . . . . . . . . Wed-O-6-10-4 175Garland, Matt . . . . . . . . . . . Mon-SS-2-8-4 80Garner, Philip N. . . . . . . . . Mon-P-2-3-3 104Gašic, Milica . . . . . . . . . . . . Tue-P-4-3-13 145Gau, Susan Shur-Fen . . . Wed-O-6-8-2 173Gauthier, Elodie . . . . . . . . Wed-SS-7-1-6 164Gauvain, J.L. . . . . . . . . . . . . Wed-O-7-10-1 181Ge, Fengpei . . . . . . . . . . . . . Thu-P-9-1-4 235Gelderblom, Femke B. . . Tue-P-5-4-4 155Gelly, G. . . . . . . . . . . . . . . . . . Wed-O-7-10-1 181

Thu-O-9-2-5 225Gendrot, Cedric . . . . . . . . . Thu-P-9-3-10 239Georges, Munir . . . . . . . . . Tue-S&T-3-A-5 158Georgiadou, Despoina . . Wed-O-7-4-2 178Georgiou, Panayiotis . . . Wed-P-7-4-11 206

Wed-P-8-2-2 209Wed-P-8-2-10 211Wed-P-8-2-11 211Thu-O-9-2-2 224

Gerczuk, Maurice . . . . . . . Tue-SS-3-11-4 111Thu-SS-10-10-2 222Thu-SS-10-10-3 222

Gerholm, Tove . . . . . . . . . . Wed-SS-7-11-4 166Gerkmann, Timo . . . . . . . Tue-P-5-4-7 156Gerlach, Johanna . . . . . . . Wed-S&T-6-B-2 217Gessinger, Iona . . . . . . . . . Thu-O-10-8-6 234Ghaffarzadegan, S. . . . . . Wed-P-7-3-8 202

Thu-O-10-11-1 234Ghahremani, Pegah . . . . . Thu-O-9-4-2 226Ghannay, Sahar . . . . . . . . . Wed-P-8-3-6 212Ghio, Alain . . . . . . . . . . . . . . Wed-O-7-1-5 176Ghodsi, M. . . . . . . . . . . . . . . Wed-O-8-10-1 187Ghone, Atish Shankar . . Wed-S&T-6-A-5 217Ghosh, Prasanta Kumar Mon-O-1-2-4 82

Mon-P-1-2-10 96Tue-P-5-3-12 154

Thu-SS-9-10-8 220Thu-SS-10-10-1 222

Thu-O-10-4-2 232Ghosh, Soumya K. . . . . . . Thu-O-9-6-1 227Ghosh, Sucheta . . . . . . . . . Mon-O-2-6-1 90Ghoshal, Arnab . . . . . . . . . Wed-SS-7-1-8 164Gibiansky, Andrew . . . . . Tue-P-4-1-5 141Gibson, James . . . . . . . . . . Wed-P-8-2-2 209Gideon, John . . . . . . . . . . . . Tue-O-3-10-3 121Gilbert, James M. . . . . . . . Thu-P-9-4-7 241Gillespie, Stephanie . . . . Wed-P-7-4-3 204Gilmartin, Emer . . . . . . . . . Wed-P-8-3-12 214Gil-Pita, Roberto . . . . . . . . Mon-O-1-4-5 83Glarner, Thomas . . . . . . . . Mon-P-1-2-7 95

Wed-SS-7-1-7 164Glass, James . . . . . . . . . . . . Tue-O-4-10-2 128

Wed-O-7-10-6 182Wed-P-7-3-9 202Wed-P-8-3-4 212

Glavitsch, Ulrike . . . . . . . . Tue-S&T-3-B-2 159Glembek, Ondrej . . . . . . . Tue-O-5-2-5 131

Wed-SS-6-2-1 160Wed-SS-7-1-3 163

Gnanapragasam, D. . . . . . Wed-O-7-8-5 181

Gobl, Christer . . . . . . . . . . . Tue-O-3-6-2 118Tue-O-4-8-5 127

Wed-SS-7-1-1 163Wed-P-7-2-8 199Thu-P-9-3-8 238

Godoy, Elizabeth . . . . . . . Wed-P-8-1-7 207Goecke, Roland . . . . . . . . . Tue-SS-3-11-1 110Goehner, Kyle . . . . . . . . . . . Tue-P-5-3-15 154Goel, Vaibhava . . . . . . . . . . Mon-O-1-1-3 81Gogoi, Pamir . . . . . . . . . . . . Mon-O-1-6-6 85Goldstein, Louis . . . . . . . . Mon-P-2-2-5 102

Tue-O-3-2-6 117Golipour, Ladan . . . . . . . . Thu-P-9-4-12 242Gong, Yifan . . . . . . . . . . . . . Wed-O-6-10-1 174

Thu-O-9-6-3 227Thu-P-9-3-13 239

Gonzalez, Jose A. . . . . . . . Thu-P-9-4-7 241González-Ferreras, C. . . Tue-O-5-8-6 135Goo, Jahyun . . . . . . . . . . . . . Mon-P-2-3-6 105Gosztolya, Gábor . . . . . . . Tue-P-4-1-8 142

Tue-P-4-1-9 142Wed-O-6-8-5 173

Thu-SS-10-10-5 222Thu-O-9-8-5 229

Götze, Jana . . . . . . . . . . . . . Wed-SS-7-11-5 167Gowda, Dhananjaya . . . . Tue-P-3-1-8 138Gracco, Vincent L. . . . . . . Mon-P-2-2-10 103Graf, Simon . . . . . . . . . . . . . Wed-O-6-4-5 171Graff, David . . . . . . . . . . . . . Wed-O-8-1-6 183Gravano, Agustín . . . . . . . Wed-O-6-6-2 171

Wed-O-6-6-6 172Wed-P-8-1-6 207

Green, Jordan R. . . . . . . . . Tue-P-5-2-3 149Green, Phil D. . . . . . . . . . . . Thu-P-9-4-7 241Greenberg, Clayton . . . . . Mon-P-1-4-10 98Greenberg, Craig . . . . . . . Tue-O-5-2-6 131Greenwood, David . . . . . . Thu-P-9-4-8 241Greer, Timothy . . . . . . . . . Wed-O-6-1-3 169

Wed-O-6-1-5 170Gref, Michael . . . . . . . . . . . . Tue-P-5-4-6 156Gresse, Adrien . . . . . . . . . . Wed-P-6-2-8 192Grézl, František . . . . . . . . . Mon-P-2-3-4 105Grigonyte, Gintare . . . . . . Tue-P-5-1-10 148Grohe, Ann-Kathrin . . . . Tue-O-5-6-1 133Grossman, Ruth . . . . . . . . Mon-O-1-2-2 82Grósz, Tamás . . . . . . . . . . . Tue-P-4-1-8 142

Tue-P-4-1-9 142Thu-SS-10-10-5 222

Thu-O-9-8-5 229Group, SRE’16 I4U . . . . . . Tue-O-5-2-1 130Gruber, Martin . . . . . . . . . . Wed-S&T-6-A-3 216Gu, Wentao . . . . . . . . . . . . . Tue-P-5-2-13 151

Wed-P-7-4-12 206Gu, Yu . . . . . . . . . . . . . . . . . . . Tue-O-4-1-2 122Guan, Jian . . . . . . . . . . . . . . . Wed-P-7-3-7 201Guasch, Oriol . . . . . . . . . . . Thu-SS-9-11-2 220

Thu-SS-9-11-5 221Thu-SS-9-11-6 221

Guðnason, Jón . . . . . . . . . . Mon-O-2-2-4 88Wed-SS-7-1-11 165Wed-SS-7-1-13 165

Guevara-Rukoz, A. . . . . . . Mon-P-2-1-8 100Guha, Tanaya . . . . . . . . . . . Wed-P-7-3-12 202Gully, Amelia J. . . . . . . . . . Mon-O-1-10-2 85Gundogdu, Batuhan . . . . Thu-O-9-4-6 226Guo, Feng . . . . . . . . . . . . . . . Tue-P-5-3-6 152Guo, Jinxi . . . . . . . . . . . . . . . Mon-P-1-2-3 95

Thu-O-10-2-1 230Guo, Jun . . . . . . . . . . . . . . . . Tue-P-3-1-4 137Guo, Wu . . . . . . . . . . . . . . . . . Wed-O-7-10-2 182Gupta, Rahul . . . . . . . . . . . . Tue-O-4-8-2 127

Tue-O-5-10-4 136Wed-P-7-4-2 204

Gustafson, Joakim . . . . . . Mon-P-2-4-11 108Tue-SS-3-11-5 111

Tue-O-3-8-5 120Gustavsson, Lisa . . . . . . . . Wed-SS-7-11-4 166Gutkin, Alexander . . . . . . Wed-SS-6-2-6 161

Wed-SS-7-1-15 165Gutzeit, Suska . . . . . . . . . . Thu-P-9-3-4 238Guzewich, Peter . . . . . . . . Mon-O-1-4-1 83Guzmán, Gualberto . . . . . Mon-SS-1-11-7 78Gwon, Youngjune . . . . . . Tue-O-5-2-2 130

H
Ha, Linne . . . . . . . . . . . . . . Wed-SS-7-1-14 165
Hadian, Hossein . . . . . . . . Mon-P-1-4-3 97
Hadjitarkhani, Abie . . . . Thu-P-9-4-12 242
Haeb-Umbach, R. . . . . . . . Mon-P-1-2-7 95

Wed-SS-7-1-7 164Wed-O-8-6-1 185

Hagerer, Gerhard . . . . . . . Tue-S&T-3-B-3 159Hagita, Norihiro . . . . . . . . Tue-SS-5-11-4 114Hagoort, Peter . . . . . . . . . . Mon-P-2-2-7 102Hahm, Seongjun . . . . . . . . Tue-O-5-1-6 130Haider, Fasih . . . . . . . . . . . . Wed-O-6-8-6 174Hain, Thomas . . . . . . . . . . . Mon-P-1-1-2 93

Wed-O-7-2-5 177Wed-O-8-10-2 187

Hakkani-Tür, Dilek . . . . . Wed-O-7-4-1 178Wed-P-8-3-8 213

Halimi, Sonia . . . . . . . . . . . Wed-S&T-6-B-2 217Hall, Andreia . . . . . . . . . . . . Wed-P-7-3-6 201Hall, Kathleen Currie . . . Wed-SS-6-11-1 161Hall, Phil . . . . . . . . . . . . . . . . Mon-O-1-1-5 81Hämäläinen, Perttu . . . . . Wed-S&T-6-A-6 217Han, Jiqing . . . . . . . . . . . . . . Tue-P-3-2-1 139Han, Kyu J. . . . . . . . . . . . . . . Tue-O-5-1-6 130Hansen, John H.L. . . . . . . Mon-O-1-4-4 83

Mon-O-2-10-6 92Tue-O-3-4-3 117Tue-O-5-2-4 131

Tue-P-5-3-16 154Wed-O-7-10-3 182Wed-P-6-1-13 191Wed-P-6-2-13 194

Wed-P-7-3-8 202Thu-O-10-2-2 230

Hantke, Simone . . . . . . . . . Wed-P-7-4-5 204Wed-P-8-2-1 209

Thu-P-9-3-15 240Hanulíková, Adriana . . . Tue-P-5-1-11 148Hanzlícek, Zdenek . . . . . . Wed-S&T-6-A-3 216

Wed-S&T-6-A-4 216Hao, Lixia . . . . . . . . . . . . . . . Wed-P-8-1-3 206Hara, Sunao . . . . . . . . . . . . . Wed-P-8-4-5 215Harandi, Negar M. . . . . . . Wed-O-6-1-4 170Harkness, Kirsty . . . . . . . . Wed-P-7-4-7 205Harlow, R. . . . . . . . . . . . . . . . Wed-SS-6-2-3 160Harman, Craig . . . . . . . . . . Wed-O-7-4-6 179Harmegnies, Bernard . . . Thu-O-10-8-2 233Harrison, Philip . . . . . . . . . Thu-P-9-3-3 237Hartmann, William . . . . . Mon-O-1-1-1 80Hartono, Rachmat . . . . . . Wed-S&T-6-B-5 218Harvey, Richard . . . . . . . . Thu-O-9-8-2 228Hasan, Taufiq . . . . . . . . . . . Wed-P-7-3-8 202Hasegawa-Johnson, M. . Mon-P-2-3-7 105

Tue-P-5-3-9 153Tue-P-5-4-13 157Wed-SS-6-2-1 160Wed-SS-6-2-5 161Wed-O-8-6-6 186

Wed-P-6-1-11 190Hashimoto, Kei . . . . . . . . . Mon-O-1-10-2 85Hashimoto, Tetsuya . . . . Tue-O-4-10-3 128Hassid, Sergio . . . . . . . . . . Wed-O-6-1-1 169Hayashi, Tomoki . . . . . . . Tue-O-4-1-1 122

Tue-O-4-1-5 123He, Di . . . . . . . . . . . . . . . . . . . Tue-P-5-3-9 153He, Yunjuan . . . . . . . . . . . . Tue-P-5-1-6 147Heck, Larry . . . . . . . . . . . . . Wed-O-7-4-1 178

Wed-P-8-3-8 213Heck, Michael . . . . . . . . . . . Tue-P-4-1-7 141Heeman, Peter A. . . . . . . . Tue-P-4-3-3 144Heeringa, Wilbert . . . . . . . Thu-S&T-9-A-5 243Hegde, Rajesh M. . . . . . . . Wed-P-7-3-12 202Heiser, Clemens . . . . . . . . Thu-SS-9-10-1 219Hejná, Míša . . . . . . . . . . . . . Tue-O-3-6-5 119Heldner, Mattias . . . . . . . . Tue-P-4-3-2 143Helgadóttir, Inga Rún . . Wed-SS-7-1-11 165Helmke, Hartmut . . . . . . . Wed-O-6-10-5 175Henter, Gustav Eje . . . . . . Mon-P-2-1-10 100

Thu-P-9-4-1 240Hentschel, Michael . . . . . Tue-P-4-1-3 141Heo, Hee-soo . . . . . . . . . . . . Tue-P-3-1-12 138Herbig, Tobias . . . . . . . . . . Wed-O-6-4-5 171Hermes, Zainab . . . . . . . . . Mon-O-1-6-1 84Hermjakob, Ulf . . . . . . . . . Wed-SS-6-2-1 160Hernáez, Inma . . . . . . . . . . Thu-SS-10-10-4 222Hernandez-Cordero, J. . Tue-O-5-2-6 131Hernando, Javier . . . . . . . Wed-P-6-2-7 192Herzog, Michael . . . . . . . . Thu-SS-9-10-1 219


Hestness, Joel . . . . . . . . . . . Tue-P-4-1-5 141Hewer, Alexander . . . . . . . Mon-O-1-10-3 85Heymann, Jahn . . . . . . . . . Mon-P-1-2-7 95Hidalgo, Guillermo . . . . . Thu-SS-9-10-1 219Higashinaka, Ryuichiro . Tue-P-4-3-1 143

Wed-P-8-3-1 212Higuchi, Takuya . . . . . . . . Tue-O-4-4-2 124

Wed-O-8-6-2 185Himawan, Ivan . . . . . . . . . . Tue-P-3-2-10 140

Wed-O-6-10-5 175Hiovain, Katri . . . . . . . . . . . Tue-O-4-6-2 126Hiramatsu, Kaoru . . . . . . Tue-O-4-10-4 128Hirose, Yuki . . . . . . . . . . . . Mon-P-2-1-8 100Hirsch, Hans-Günter . . . . Tue-P-5-4-6 156Hirschberg, Julia . . . . . . . . Mon-SS-1-11-9 79

Tue-O-5-10-6 136Thu-P-9-4-4 241

Hirschfeld, Diane . . . . . . . Wed-O-7-6-6 180Hlavnicka, Jan . . . . . . . . . . Tue-P-5-2-8 150Hoetjes, Marieke . . . . . . . . Mon-P-2-2-8 102Hofer, Joachim . . . . . . . . . Tue-S&T-3-A-5 158Hoffman, Johan . . . . . . . . . Thu-SS-9-11-5 221Hoffmeister, Björn . . . . . . Tue-P-5-3-15 154

Wed-O-7-2-6 178Wed-P-6-3-9 195

Hohenhorst, Winfried . . Thu-SS-9-10-1 219Hojo, Nobukatsu . . . . . . . Mon-P-2-4-3 106

Tue-O-3-8-4 120Holdsworth, Ed . . . . . . . . . Thu-P-9-4-7 241Homayounpour, M.M. . . Tue-O-3-4-5 118Homma, Yukinori . . . . . . Wed-P-8-3-1 212Hooper, Angela . . . . . . . . . Wed-S&T-6-B-2 217Hoory, Ron . . . . . . . . . . . . . . Mon-P-2-4-2 106

Tue-O-3-10-1 121Horáková, D. . . . . . . . . . . . . Wed-P-7-4-4 204Hörberg, Thomas . . . . . . . Tue-P-5-1-12 148

Tue-P-5-1-13 148Hori, Takaaki . . . . . . . . . . . Tue-O-3-1-3 115Horo, Luke . . . . . . . . . . . . . . Tue-O-3-6-3 119Hou, Junfeng . . . . . . . . . . . Thu-O-10-1-3 229Hou, Luying . . . . . . . . . . . . . Tue-O-5-6-2 133Hough, Julian . . . . . . . . . . . Tue-P-4-3-4 144Houghton, Steve . . . . . . . . Mon-O-2-4-1 89Howcroft, David M. . . . . . Thu-O-10-4-4 232Hrúz, Marek . . . . . . . . . . . . Thu-O-9-2-1 224Hsiao, Roger . . . . . . . . . . . . Mon-O-1-1-1 80Hsu, Chin-Cheng . . . . . . . . Tue-P-5-4-1 155

Wed-P-8-4-1 214Hsu, Cristiane . . . . . . . . . . Wed-SS-8-11-8 168Hsu, Hsiang-Ping . . . . . . . Mon-P-1-1-7 94Hsu, Wei-Ning . . . . . . . . . . Tue-O-4-10-2 128Hsu, Yu-Yin . . . . . . . . . . . . . Wed-P-8-1-12 208Hu, Qiong . . . . . . . . . . . . . . . Thu-P-9-4-12 242Hu, Wenping . . . . . . . . . . . . Tue-P-3-1-7 137Hua, Kanru . . . . . . . . . . . . . . Wed-O-6-4-2 170Huang, Chu-Ren . . . . . . . . Wed-P-7-2-12 200Huang, David . . . . . . . . . . . Mon-P-1-2-4 95Huang, D.-Y. . . . . . . . . . . . . Mon-O-1-2-5 82

Wed-P-8-4-4 214Huang, Hengguan . . . . . . Thu-P-9-1-7 236Huang, Liang . . . . . . . . . . . . Wed-P-8-3-7 213Huang, Qiang . . . . . . . . . . . Wed-P-7-3-10 202Huang, Qizheng . . . . . . . . Thu-O-9-6-2 227Huang, Yan . . . . . . . . . . . . . Thu-O-9-6-3 227

Thu-P-9-3-13 239Huang, Yinghui . . . . . . . . . Mon-O-2-1-4 87Huang, Yuchen . . . . . . . . . Mon-P-2-4-6 107Huang, Yuyun . . . . . . . . . . Wed-P-8-3-12 214Huang, Zhaocheng . . . . . . Wed-P-8-2-12 211Huang, Zhaoqiong . . . . . . Tue-P-5-3-2 152Huber, Markus . . . . . . . . . . Wed-S&T-6-B-3 217Huber, Rainer . . . . . . . . . . . Tue-O-4-2-5 124Huckvale, Mark . . . . . . . . . Thu-SS-9-10-5 219Huddleston, Nancy . . . . . Thu-P-9-4-12 242Huet, Kathy . . . . . . . . . . . . . Thu-O-10-8-2 233Huet, Stéphane . . . . . . . . . Wed-P-8-3-9 213Hughes, Thad . . . . . . . . . . . Mon-O-2-10-1 91Hughes, Vincent . . . . . . . . Thu-O-10-8-1 233

Thu-P-9-3-3 237Hung, Jeih-Weih . . . . . . . . Mon-P-1-1-7 94Hunt, Melvyn . . . . . . . . . . . Thu-P-9-4-12 242Hussen Abdelaziz, A. . . Thu-O-9-8-4 228

Thu-O-10-4-3 232Huston, Timothy . . . . . . . Tue-P-5-2-1 149Hwang, Hsin-Te . . . . . . . . . Tue-P-5-4-1 155

Wed-P-8-4-1 214Hyder, Rakib . . . . . . . . . . . . Wed-P-7-3-8 202

I
Ichikawa, Osamu . . . . . . . Mon-O-2-10-3 92
Ijima, Yusuke . . . . . . . . . . . Mon-P-2-4-3 106

Tue-O-3-8-4 120Ikauniece, Indra . . . . . . . . Mon-S&T-2-B-3 109Inaguma, Hirofumi . . . . . Tue-P-4-3-7 144India, Miquel . . . . . . . . . . . . Wed-P-6-2-7 192Inoue, Koji . . . . . . . . . . . . . . Tue-P-4-3-7 144Ip, Martin Ho Kwan . . . . . Tue-O-4-6-3 126Irhimeh, Sufian . . . . . . . . . Thu-P-9-4-3 241Irino, Toshio . . . . . . . . . . . . Mon-P-2-1-9 100

Tue-O-4-2-2 123Tue-O-5-4-1 131Wed-P-6-4-3 197

Irtza, Saad . . . . . . . . . . . . . . Wed-O-7-10-4 182Ishi, Carlos . . . . . . . . . . . . . . Tue-SS-4-11-3 112

Tue-SS-5-11-4 114Tue-P-4-3-6 144

Ishida, Mako . . . . . . . . . . . . Mon-P-2-1-3 99Ishiguro, Hiroshi . . . . . . . Tue-SS-4-11-3 112

Tue-P-4-3-6 144Ishii, Ryo . . . . . . . . . . . . . . . . Tue-P-4-3-1 143Ishimoto, Yuichi . . . . . . . . Tue-P-4-3-5 144Issa, Amel . . . . . . . . . . . . . . . Wed-P-7-2-4 198Ito, Kayoko . . . . . . . . . . . . . Tue-O-5-8-2 134Ito, Kiwako . . . . . . . . . . . . . . Tue-P-5-1-1 146Ito, Takayuki . . . . . . . . . . . . Mon-P-2-2-10 103Itoh, Yoshiaki . . . . . . . . . . . Wed-P-6-3-2 194Iwata, Kazuhiko . . . . . . . . Mon-P-2-4-5 107

J
Jabaian, Bassam . . . . . . . . Wed-P-8-3-9 213
Jacewicz, Ewa . . . . . . . . . . . Mon-O-2-6-2 90
Jahromi, Mohsen Z. . . . . . Tue-O-4-2-4 124
Jaitly, Navdeep . . . . . . . . . Mon-P-1-4-4 97

Mon-P-2-4-1 106Tue-O-3-1-1 115Wed-O-8-4-1 184

Thu-O-10-1-5 230Thu-P-9-4-11 242

Jancovic, Peter . . . . . . . . . . Mon-O-2-4-1 89Jang, Hye Jin . . . . . . . . . . . . Wed-P-6-2-5 192Jang, Inseon . . . . . . . . . . . . Mon-P-1-2-5 95Jang, Younseon . . . . . . . . . Mon-P-1-2-5 95Janott, Christoph . . . . . . . Thu-SS-9-10-1 219

Thu-SS-9-10-3 219Jansche, Martin . . . . . . . . . Wed-SS-7-1-14 165Jansson, Johan . . . . . . . . . . Thu-SS-9-11-5 221Janu, Thomas . . . . . . . . . . . Tue-P-5-2-7 150Jati, Arindam . . . . . . . . . . . Thu-O-9-2-2 224Jaumard-Hakoun, A. . . . . Mon-S&T-2-B-1 109Jelil, Sarfaraz . . . . . . . . . . . Mon-SS-1-8-5 77

Wed-P-6-2-12 193Jemel, Boutheina . . . . . . . Thu-P-9-3-6 238Jensen, Jesper . . . . . . . . . . Tue-O-4-2-4 124

Wed-P-6-4-6 197Jensen, Jesper Rindom . Mon-O-2-2-1 87Jeon, Kwang Myung . . . . Tue-S&T-3-B-4 159Jessen, Michael . . . . . . . . . Wed-P-6-2-10 193Jesus, Luis M.T. . . . . . . . . . Wed-P-7-3-6 201Ji, Heng . . . . . . . . . . . . . . . . . Wed-SS-6-2-1 160Ji, Youna . . . . . . . . . . . . . . . . Wed-O-8-6-5 185Ji, Zhe . . . . . . . . . . . . . . . . . . . Mon-SS-2-8-2 79Jia, Jia . . . . . . . . . . . . . . . . . . . Tue-O-4-8-1 126Jiao, Li . . . . . . . . . . . . . . . . . . Wed-SS-8-11-8 168Jiao, Yishan . . . . . . . . . . . . . Tue-P-5-2-1 149Jin, Ma . . . . . . . . . . . . . . . . . . Wed-O-7-10-2 182Jin, Rong . . . . . . . . . . . . . . . . Wed-P-6-2-11 193Jochim, Markus . . . . . . . . . Mon-S&T-2-A-3 108

Wed-P-7-2-10 199Johnson, Leif . . . . . . . . . . . Tue-O-3-1-1 115

Wed-O-7-8-5 181Jokisch, Oliver . . . . . . . . . . Wed-S&T-6-B-3 217Jonell, Patrik . . . . . . . . . . . . Tue-SS-3-11-5 111Jones, Caroline . . . . . . . . . Mon-O-1-6-5 85Jones, Karen . . . . . . . . . . . . Wed-O-8-1-6 183Jorrín, Jesús . . . . . . . . . . . . Tue-O-3-4-6 118Jorrín-Prieto, Jesús . . . . . Tue-O-5-2-5 131Joseph, Shaun . . . . . . . . . . Tue-P-5-3-15 154Josse, Yvan . . . . . . . . . . . . . Thu-SS-9-10-7 220Joy, Neethu Mariam . . . . Mon-P-2-3-8 105

Wed-SS-7-1-9 164Wed-O-8-8-4 186

Juang, Biing-Hwang . . . . . Wed-O-7-4-5 179Thu-O-9-1-4 224

Jung, Jee-weon . . . . . . . . . . Tue-P-3-1-12 138

Junttila, Katja . . . . . . . . . . . Wed-S&T-6-A-6 217Juvela, Lauri . . . . . . . . . . . . Tue-O-5-4-2 132

Tue-O-5-4-3 132Wed-P-8-4-7 215

Juzová, Markéta . . . . . . . . Wed-S&T-6-A-4 216

K
K., Nikitha . . . . . . . . . . . . . . Tue-P-5-2-5 150
Kaburagi, Tokihiko . . . . . Wed-O-6-1-6 170
Kachkovskaia, Tatiana . Wed-SS-7-1-4 163
Kacprzak, Stanisław . . . . Mon-SS-1-8-6 77
Kadiri, Sudarsana R. . . . . Mon-SS-2-8-6 80

Wed-O-8-1-1 182Wed-P-7-2-11 200

Kager, René . . . . . . . . . . . . . Tue-O-5-6-2 133Kahn, Juliette . . . . . . . . . . . Wed-P-6-2-9 193Kain, Alexander . . . . . . . . . Tue-O-4-2-1 123

Tue-O-4-10-6 129Kakouros, Sofoklis . . . . . Wed-P-8-1-8 207Kalita, Sishir . . . . . . . . . . . . Tue-O-3-6-3 119

Tue-P-5-2-5 150Kalkunte Suresh, A. . . . . Thu-SS-9-10-8 220Kallio, Heini . . . . . . . . . . . . . Wed-S&T-6-A-6 217Kamble, Madhu R. . . . . . . Mon-SS-1-8-3 76

Wed-O-8-1-2 183Kameoka, Hirokazu . . . . Tue-O-3-8-3 120

Tue-O-3-8-4 120Tue-O-4-1-3 122

Tue-O-4-10-4 128Tue-P-5-4-10 156Wed-P-8-4-6 215

Kamiyama, Hosana . . . . . Tue-P-4-3-12 145Kamper, Herman . . . . . . . Wed-P-6-3-1 194

Thu-O-9-8-6 229Kampman, Onno . . . . . . . Wed-S&T-6-B-4 217Kampstra, Frederik . . . . . Mon-SS-1-11-1 77Kaneko, Daisuke . . . . . . . . Wed-P-6-3-2 194Kaneko, Takuhiro . . . . . . Tue-O-4-10-4 128

Wed-P-8-4-6 215Kano, Takatomo . . . . . . . . Wed-O-8-4-2 184Kant, Anjali . . . . . . . . . . . . . Mon-S&T-2-A-6 109Karafiát, Martin . . . . . . . . . Mon-P-2-3-4 105

Wed-SS-6-2-1 160Karhila, Reima . . . . . . . . . . Tue-S&T-3-B-6 159

Wed-S&T-6-A-6 217Karita, Shigeki . . . . . . . . . . Tue-P-4-1-3 141

Tue-P-4-1-4 141Karpov, Alexey A. . . . . . . Wed-O-7-6-4 180

Thu-SS-10-10-6 223Karthik, Girija R. . . . . . . . . Tue-P-5-3-12 154Karvitsky, Gennady . . . . . Tue-O-5-2-3 130Kashino, Kunio . . . . . . . . . Tue-O-4-10-4 128Kashyap, H. . . . . . . . . . . . . . Wed-P-6-2-12 193Kasten, Conner . . . . . . . . . Tue-SS-5-11-9 115Kathania, H.K. . . . . . . . . . . Wed-O-6-10-2 174Katzberg, Fabrice . . . . . . . Wed-P-7-3-2 200Kaushik, Lakshmish . . . . Wed-P-6-1-13 191Kavanagh, Colleen . . . . . . Thu-P-9-3-3 237Kawahara, Hideki . . . . . . . Mon-P-1-1-4 93

Mon-P-2-1-9 100Tue-O-5-4-1 131

Kawahara, Tatsuya . . . . . Tue-P-4-3-7 144Tue-P-4-3-14 146Wed-O-7-2-2 177

Kawai, Hisashi . . . . . . . . . . Mon-P-2-4-4 107Wed-P-6-2-3 192

Kaya, Heysem . . . . . . . . . . . Thu-SS-10-10-6 223Keating, Patricia A. . . . . . Tue-P-3-1-10 138Keegan, P.J. . . . . . . . . . . . . . Wed-SS-6-2-3 160Keidel Fernández, A. . . . Tue-P-5-1-12 148Keith, Francis . . . . . . . . . . . Mon-O-1-1-1 80

Mon-O-1-1-4 81Kember, Heather . . . . . . . Tue-O-4-6-4 126

Tue-O-5-6-1 133Kenny, Patrick . . . . . . . . . . Tue-O-5-2-5 131

Tue-P-3-1-9 138Thu-O-10-2-5 231

Keshet, Joseph . . . . . . . . . . Tue-O-3-6-5 119Wed-O-7-8-6 181

Kheyrkhah, Timothée . . Tue-O-5-2-6 131Khokhlov, Yuri . . . . . . . . . Wed-P-6-3-3 194

Thu-O-9-4-3 226Khonglah, B.K. . . . . . . . . . . Mon-P-1-2-2 94Khorram, Soheil . . . . . . . . Tue-O-3-10-3 121

Tue-O-4-8-4 127Khosravani, Abbas . . . . . Tue-O-3-4-5 118Khoury, Elie . . . . . . . . . . . . . Mon-SS-2-8-4 80


Khudanpur, Sanjeev . . . . Mon-P-1-4-3 97Tue-O-3-4-1 117Tue-P-4-1-1 140Tue-P-4-2-1 142

Wed-O-7-4-6 179Wed-O-7-8-2 180Thu-O-9-4-2 226

Khurana, Sameer . . . . . . . Wed-O-7-10-6 182Kibira, William . . . . . . . . . . Wed-SS-7-1-2 163Kim, Byung-Hak . . . . . . . . Tue-O-5-1-6 130Kim, Chanwoo . . . . . . . . . . Mon-O-2-10-1 91

Mon-O-2-10-5 92Thu-P-9-1-9 236

Kim, Hoirin . . . . . . . . . . . . . Mon-P-2-3-6 105Kim, Hong Kook . . . . . . . . Tue-S&T-3-B-4 159Kim, Jaebok . . . . . . . . . . . . . Tue-O-3-10-6 122Kim, Jaeyoung . . . . . . . . . . Tue-P-4-1-2 140Kim, Jangwon . . . . . . . . . . . Mon-P-2-2-5 102Kim, Jeesun . . . . . . . . . . . . . Mon-P-2-1-5 99Kim, Jonny . . . . . . . . . . . . . . Mon-P-2-1-13 100Kim, Jungsuk . . . . . . . . . . . Tue-O-5-1-6 130Kim, Myungjong . . . . . . . . Mon-O-1-10-6 86

Wed-P-6-1-7 190Kim, Nam Kyun . . . . . . . . . Tue-S&T-3-B-4 159Kim, Nam Soo . . . . . . . . . . Mon-P-1-2-6 95Kim, Sang-Hun . . . . . . . . . . Wed-P-6-3-12 196Kim, Suyoun . . . . . . . . . . . . Thu-P-9-1-8 236Kim, Taesu . . . . . . . . . . . . . . Wed-P-6-2-5 192Kim, Taesup . . . . . . . . . . . . Wed-O-6-10-6 175Kim, Wooil . . . . . . . . . . . . . . Tue-O-3-4-4 118Kim, Yoon-Chul . . . . . . . . . Mon-P-2-2-5 102Kim, Younggwan . . . . . . . Mon-P-2-3-6 105Kimball, Owen . . . . . . . . . . Mon-O-1-1-4 81King, Brian . . . . . . . . . . . . . . Wed-O-7-2-6 178King, J. . . . . . . . . . . . . . . . . . . Wed-SS-6-2-3 160King, Simon . . . . . . . . . . . . . Tue-O-4-1-4 122

Tue-O-5-4-6 132Wed-SS-7-1-16 165

Kinnunen, Tomi . . . . . . . . Mon-SS-1-8-1 76Tue-P-3-1-8 138

Wed-O-8-1-4 183Kinoshita, Keisuke . . . . . . Mon-O-2-10-2 91

Tue-O-4-4-2 124Tue-P-5-4-3 155

Wed-O-8-6-2 185Wed-P-6-4-3 197

Kirkpatrick, Matthew G. Wed-P-7-4-6 205Kitahara, Mafuyu . . . . . . . Mon-O-2-6-3 90Kitamura, Tatsuya . . . . . . Thu-SS-9-11-1 220Kjaran, Róbert . . . . . . . . . . Wed-SS-7-1-11 165

Wed-SS-7-1-13 165Kjartansson, Oddur . . . . Wed-SS-7-1-14 165Klakow, Dietrich . . . . . . . . Mon-O-2-1-2 87

Mon-P-1-4-10 98Tue-SS-4-11-6 113Wed-O-8-10-3 188Thu-O-10-4-4 232

Kleber, Felicitas . . . . . . . . . Tue-P-5-1-3 146Wed-P-7-2-10 199

Kleinhans, Janine . . . . . . . Wed-P-8-1-6 207Klempír, Jirí . . . . . . . . . . . . . Tue-P-5-2-8 150

Wed-P-7-4-4 204Kleynhans, Neil . . . . . . . . . Wed-SS-7-1-14 165Kliegl, Markus . . . . . . . . . . Tue-P-4-1-5 141Klimkov, Viacheslav . . . . Tue-O-3-8-2 120Klingler, Nicola . . . . . . . . . Wed-P-7-2-3 198Klumpp, Philipp . . . . . . . . Tue-P-5-2-7 150Klüpfel, Simon . . . . . . . . . . Wed-SS-7-1-13 165K.M., Srinivasa R. . . . . . . . Thu-SS-9-10-8 220Knight, Kevin . . . . . . . . . . . Wed-SS-6-2-1 160Knill, K.M. . . . . . . . . . . . . . . . Wed-P-6-1-8 190Ko, Hanseok . . . . . . . . . . . . Tue-O-3-4-4 118

Wed-P-6-2-14 194Kobashikawa, Satoshi . . Tue-O-3-10-2 121

Tue-P-4-3-12 145Kobayashi, Kazuhiro . . . Tue-O-4-1-1 122

Tue-O-4-1-5 123Kobayashi, Tetsunori . . . Mon-P-2-4-5 107Koch, Philipp . . . . . . . . . . . Wed-P-7-3-2 200Kocharov, Daniil . . . . . . . . Wed-SS-7-1-4 163Kockmann, Marcel . . . . . . Thu-O-10-2-5 231Köhler, Joachim . . . . . . . . Wed-O-7-8-1 180Kohtz, Lea S. . . . . . . . . . . . . Thu-O-10-8-5 233Koishida, Kazuhito . . . . . Tue-P-3-1-3 137Kojima, Kazunori . . . . . . . Wed-P-6-3-2 194Kokkinakis, Kostas . . . . . Tue-O-4-4-3 125Komatani, Kazunori . . . . Tue-P-4-2-2 142Komaty, Alain . . . . . . . . . . Tue-S&T-3-A-2 158

Kong, Lingpeng . . . . . . . . . Tue-O-3-1-4 115Kong, Qiuqiang . . . . . . . . . Wed-P-7-3-10 202Konno, Ryota . . . . . . . . . . . Wed-P-6-3-2 194Kons, Zvi . . . . . . . . . . . . . . . . Mon-P-2-4-2 106Kontogiorgos, D. . . . . . . . Tue-SS-3-11-5 111Kopparapu, Sunil K. . . . . Mon-S&T-2-A-6 109

Tue-P-5-2-10 151Korenevsky, Maxim . . . . . Thu-O-9-4-3 226Koriyama, Tomoki . . . . . . Thu-P-9-4-2 240Korpusik, Mandy . . . . . . . Wed-P-8-3-4 212Kösem, Anne . . . . . . . . . . . Wed-O-7-1-1 175Koshinaka, Takafumi . . . Thu-O-10-2-3 231

Thu-O-10-2-4 231Kothapally, Vinay . . . . . . . Tue-P-5-3-16 154Kothinti, Sandeep R. . . . . Mon-P-2-3-8 105Kotlerman, Lili . . . . . . . . . . Mon-P-1-4-9 98Kouklia, Charlotte . . . . . . Tue-SS-5-11-2 113Koutsogiannaki, Maria . Tue-P-5-4-5 155Kowalczyk, Konrad . . . . . Mon-SS-1-8-6 77Kozlov, Alexander . . . . . . Mon-SS-2-8-1 79Krahmer, Emiel . . . . . . . . . Mon-O-2-6-4 90Krajewski, Jarek . . . . . . . . Thu-SS-9-10-1 219

Thu-SS-9-10-2 219Kraljevski, Ivan . . . . . . . . . Wed-O-7-6-6 180Kreiman, Jody . . . . . . . . . . Tue-P-3-1-10 138Krona, Andreas . . . . . . . . . Wed-S&T-6-B-1 217Kronlid, Fredrik . . . . . . . . Wed-S&T-6-B-1 217Kroos, Christian . . . . . . . . Mon-P-1-1-10 94Kuang, Jianjing . . . . . . . . . Wed-P-8-1-9 208Kudashev, Oleg . . . . . . . . . Mon-SS-2-8-1 79Kumar, Aman . . . . . . . . . . . Wed-S&T-6-B-5 218Kumar, Anish . . . . . . . . . . . Tue-P-4-3-8 144Kumar, Anjishnu . . . . . . . Wed-P-6-3-9 195Kumar, Anurag . . . . . . . . . Tue-P-5-3-1 152Kumar, Avinash . . . . . . . . Mon-P-1-1-5 93

Tue-P-5-3-3 152Kumar, Manoj . . . . . . . . . . Wed-O-8-10-5 188Kumar, Nagendra . . . . . . . Wed-P-6-2-12 193Kumar, Pranaw . . . . . . . . . Wed-S&T-6-A-5 217Kumar, Shankar . . . . . . . . Mon-O-2-1-1 86Kuo, Kuan-Ting . . . . . . . . . Tue-O-4-4-4 125Kuo, Li-Wei . . . . . . . . . . . . . . Wed-P-8-2-4 210Kurata, Gakuto . . . . . . . . . Mon-O-1-1-5 81

Mon-O-2-1-5 87Mon-O-2-10-3 92

Tue-P-4-1-7 141Thu-O-9-4-5 226

Thu-O-10-1-4 230Kurimo, Mikko . . . . . . . . . . Tue-S&T-3-B-6 159

Wed-O-7-8-4 181Wed-S&T-6-A-6 217

Thu-O-10-4-5 232Kwak, Chan Woong . . . . . Tue-S&T-3-B-4 159Kwon, Haeyong . . . . . . . . . Mon-O-2-10-2 91Kwon, Oh-Wook . . . . . . . . Wed-P-6-3-12 196Kyaw, Win Thuzar . . . . . . Mon-P-2-2-12 103Kyriakopoulos, K. . . . . . . Wed-P-6-1-8 190

L
Laaridh, Imed . . . . . . . . . . . Tue-P-5-2-6 150Labatut, Vincent . . . . . . . . Wed-P-6-2-8 192Laface, Pietro . . . . . . . . . . . Tue-O-5-2-3 130Laha, Anirban . . . . . . . . . . . Mon-P-1-4-9 98Lahiri, Aditi . . . . . . . . . . . . . Tue-O-5-8-4 134Lai, Catherine . . . . . . . . . . . Mon-S&T-2-A-5 109

Wed-P-8-1-6 207Lai, Jiahao . . . . . . . . . . . . . . . Mon-P-2-4-9 107Lai, Wei . . . . . . . . . . . . . . . . . Mon-P-2-2-9 102Lai, Ying-Hui . . . . . . . . . . . . Mon-P-1-1-7 94Laine, Unto K. . . . . . . . . . . . Mon-P-1-1-9 94Laksana, Eugene . . . . . . . . Wed-P-8-2-3 209Lalhminghlui, Wendy . . . Tue-O-3-6-3 119Lamalle, Laurent . . . . . . . . Wed-O-6-1-2 169Lamel, Lori . . . . . . . . . . . . . . Mon-SS-1-11-6 78

Thu-O-10-8-3 233Thu-P-9-3-7 238

Lammert, Adam . . . . . . . . Mon-P-2-2-5 102Lancia, Leonardo . . . . . . . Tue-SS-3-11-6 111Landman, Rogier . . . . . . . Wed-O-7-1-3 176Lane, Ian . . . . . . . . . . . . . . . . Mon-P-1-4-7 97

Tue-O-5-1-6 130Wed-O-7-6-1 179Thu-P-9-1-8 236

Lange, Patrick L. . . . . . . . . Tue-P-4-3-11 145Wed-O-7-10-5 182

Lapidot, Itshak . . . . . . . . . . Thu-O-9-2-4 225

Laprie, Yves . . . . . . . . . . . . . Mon-O-1-6-2 84Mon-O-2-6-1 90

Larcher, Anthony . . . . . . . Thu-O-9-2-3 225Larsen, Elin . . . . . . . . . . . . . Wed-SS-7-11-2 166Larsson, Staffan . . . . . . . . Wed-S&T-6-B-1 217Laskowski, Kornel . . . . . . Tue-P-4-3-2 143Laures-Gore, Jacqueline Wed-P-7-4-3 204Lavrentyeva, Galina . . . . . Mon-SS-2-8-1 79Law, Thomas K.T. . . . . . . . Wed-O-8-8-1 186Lawson, Aaron . . . . . . . . . . Wed-P-6-2-1 191

Thu-O-10-2-6 231Laycock, Stephen . . . . . . . Thu-P-9-4-8 241Le, Duc . . . . . . . . . . . . . . . . . . Mon-O-2-2-2 88

Tue-O-3-10-5 121Le, Phu Ngoc . . . . . . . . . . . . Mon-O-2-4-2 89Le, Quoc . . . . . . . . . . . . . . . . Thu-P-9-4-11 242Le Bruyn, Bert . . . . . . . . . . . Tue-O-5-6-2 133Lee, Chia-Fone . . . . . . . . . . Mon-O-1-4-6 84Lee, Chi-Chun . . . . . . . . . . . Wed-O-6-8-2 173

Wed-O-6-8-3 173Wed-P-8-2-4 210

Lee, Chin-Hui . . . . . . . . . . . Mon-O-2-10-4 92Tue-O-4-4-1 124Wed-P-6-1-5 189Thu-P-9-1-4 235

Lee, Chong Min . . . . . . . . . Tue-O-5-8-1 134Tue-O-5-8-3 134Wed-P-6-1-4 189

Lee, Huang-Yi . . . . . . . . . . . Tue-P-5-3-13 154Lee, Hung-Shin . . . . . . . . . . Thu-O-9-1-6 224Lee, Hung-Yi . . . . . . . . . . . . Wed-P-6-3-6 195

Thu-O-10-11-5 235Lee, Jin Won . . . . . . . . . . . . Tue-P-5-4-9 156Lee, Jungwon . . . . . . . . . . . Tue-P-4-1-2 140Lee, Kai-Zhan . . . . . . . . . . . Tue-O-5-10-6 136Lee, Kathy Y.S. . . . . . . . . . . Wed-O-8-8-1 186Lee, Kong Aik . . . . . . . . . . . Mon-SS-1-8-1 76

Tue-O-5-2-1 130Tue-P-3-1-6 137

Tue-P-3-1-11 138Lee, Lin-Shan . . . . . . . . . . . . Wed-P-6-3-6 195Lee, Nayeon . . . . . . . . . . . . . Mon-S&T-2-B-4 109Lee, Shi-wook . . . . . . . . . . . Wed-P-6-3-2 194Lee, Sungbok . . . . . . . . . . . Mon-O-1-2-2 82Lee, Tan . . . . . . . . . . . . . . . . . Wed-SS-6-2-4 160

Wed-O-6-10-3 174Wed-O-8-8-1 186

Lefèvre, Fabrice . . . . . . . . . Wed-P-8-3-9 213Lehtinen, Mona . . . . . . . . . Wed-P-7-2-9 199Le Lan, Gaël . . . . . . . . . . . . . Thu-O-9-2-3 225Le Maguer, Sébastien . . . Mon-O-1-10-3 85

Thu-O-10-8-6 234Lenarczyk, Michał . . . . . . Tue-S&T-3-A-3 158Leng, Yi Ren . . . . . . . . . . . . Thu-O-10-11-2 234Lennes, Mietta . . . . . . . . . . Mon-S&T-2-B-5 110Leong, Chee Wee . . . . . . . Wed-O-7-6-5 180Le Roux, Jonathan . . . . . . Wed-O-7-2-4 177Lev, Guy . . . . . . . . . . . . . . . . . Mon-P-1-4-9 98Levin, Keith . . . . . . . . . . . . . Wed-P-6-3-1 194Levit, Michael . . . . . . . . . . . Thu-P-9-3-13 239Levitan, Rivka . . . . . . . . . . . Wed-P-8-3-11 213Levitan, Sarah Ita . . . . . . . Tue-O-5-10-6 136Levitan, Yocheved . . . . . . Thu-P-9-4-4 241Levow, Gina-Anne . . . . . . Tue-O-5-10-2 135Lewandowski, Natalie . . Wed-SS-8-11-5 168Lewis, Molly . . . . . . . . . . . . . Tue-S&T-3-A-6 158Li, Aijun . . . . . . . . . . . . . . . . . Tue-O-5-6-5 133

Tue-P-5-2-12 151Li, Baoqing . . . . . . . . . . . . . . Tue-P-5-3-6 152Li, Bei . . . . . . . . . . . . . . . . . . . Tue-P-5-1-6 147Li, Bo . . . . . . . . . . . . . . . . . . . . Mon-O-2-10-5 92

Tue-O-3-1-1 115Tue-O-3-1-6 116

Thu-O-10-1-5 230Thu-O-10-11-3 234

Li, Gang . . . . . . . . . . . . . . . . . Mon-SS-1-8-4 76Thu-SS-9-10-6 219

Li, Haizhou . . . . . . . . . . . . . Mon-K1-1 76Mon-O-1-2-5 82Tue-P-3-1-11 138

Tue-P-5-3-5 152Wed-O-7-10-4 182

Wed-P-8-4-4 214Li, Hao . . . . . . . . . . . . . . . . . . Wed-P-6-2-11 193Li, Jiangchuan . . . . . . . . . . Thu-P-9-4-12 242Li, Jinyu . . . . . . . . . . . . . . . . . Wed-O-6-10-1 174

Thu-O-9-6-3 227Li, Junfeng . . . . . . . . . . . . . . Tue-O-4-4-6 125

Li, Kehuang . . . . . . . . . . . . . Thu-P-9-1-4 235Li, Lantian . . . . . . . . . . . . . . . Mon-SS-2-8-3 79

Tue-P-3-2-2 139Li, Li . . . . . . . . . . . . . . . . . . . . . Tue-P-5-4-10 156Li, Li-Jia . . . . . . . . . . . . . . . . . Mon-P-1-2-3 95Li, Ming . . . . . . . . . . . . . . . . . Mon-SS-1-8-4 76

Wed-P-7-3-11 202Thu-SS-9-10-6 219

Li, Peng . . . . . . . . . . . . . . . . . Mon-SS-2-8-2 79Li, Ruizhi . . . . . . . . . . . . . . . . Tue-O-5-2-2 130Li, Runnan . . . . . . . . . . . . . . Mon-P-2-4-6 107

Wed-P-8-4-10 216Li, Sheng . . . . . . . . . . . . . . . . Wed-P-6-2-3 192Li, Wei . . . . . . . . . . . . . . . . . . . Wed-P-6-1-5 189Li, Weicong . . . . . . . . . . . . . Mon-O-1-6-5 85Li, Wenpeng . . . . . . . . . . . . . Mon-P-1-4-5 97Li, Xin . . . . . . . . . . . . . . . . . . . Wed-P-7-3-11 202Li, Xu . . . . . . . . . . . . . . . . . . . . Tue-O-4-4-6 125Li, Ya . . . . . . . . . . . . . . . . . . . . Mon-P-2-4-7 107

Wed-P-6-1-9 190Li, Zhi-Yi . . . . . . . . . . . . . . . . Mon-SS-2-8-2 79Liang, Jiaen . . . . . . . . . . . . . Tue-P-5-3-6 152Liang, Zhi-Pei . . . . . . . . . . . Mon-O-1-6-1 84Liao, Hank . . . . . . . . . . . . . . Thu-O-10-1-6 230Liao, Yu-Hsien . . . . . . . . . . Wed-P-8-2-4 210Licata, Keli . . . . . . . . . . . . . . Mon-O-2-2-2 88Liebson, Elizabeth S. . . . Wed-P-8-2-3 209Lilley, Jason . . . . . . . . . . . . . Tue-P-5-2-11 151

Thu-S&T-9-A-4 243Lim, Boon Pang . . . . . . . . . Mon-P-2-3-7 105

Wed-P-6-1-11 190Lim, Hyungjun . . . . . . . . . . Mon-P-2-3-6 105Lim, Lynn-Li . . . . . . . . . . . . Mon-O-1-1-5 81Lin, Kin Wah Edward . . . Wed-P-7-3-1 200Lin, Ying . . . . . . . . . . . . . . . . Wed-SS-6-2-1 160Lin, Yun-Shao . . . . . . . . . . . Wed-O-6-8-3 173Linarès, Georges . . . . . . . . Wed-P-8-3-5 212Lindblom, Björn . . . . . . . . Thu-K4-1 218Ling, Zhen-Hua . . . . . . . . . Tue-O-4-1-2 122Linhard, Klaus . . . . . . . . . . Mon-O-1-4-2 83Lippus, Pärtel . . . . . . . . . . . Tue-O-3-6-1 118Liss, Julie . . . . . . . . . . . . . . . Tue-P-5-2-1 149

Tue-P-5-2-9 151Litman, Diane . . . . . . . . . . . Tue-P-4-3-8 144Little, Max A. . . . . . . . . . . . . Mon-O-2-2-1 87Liu, Bin . . . . . . . . . . . . . . . . . . Mon-P-2-4-7 107Liu, Bing . . . . . . . . . . . . . . . . Wed-O-7-6-1 179Liu, Chaoran . . . . . . . . . . . . Tue-P-4-3-6 144Liu, Chunxi . . . . . . . . . . . . . Wed-O-7-4-6 179Liu, Daben . . . . . . . . . . . . . . Wed-SS-7-1-8 164Liu, Gang . . . . . . . . . . . . . . . . Wed-P-6-2-11 193Liu, Hong . . . . . . . . . . . . . . . Tue-P-5-3-11 153Liu, Hongchao . . . . . . . . . . Wed-P-7-2-12 200Liu, Shih-Hung . . . . . . . . . . Thu-O-9-6-4 227Liu, Wenbo . . . . . . . . . . . . . . Mon-SS-1-8-4 76

Thu-SS-9-10-6 219Liu, Wenju . . . . . . . . . . . . . . Mon-P-1-2-1 94Liu, X. . . . . . . . . . . . . . . . . . . . Mon-O-2-1-3 87Liu, Xiaolin . . . . . . . . . . . . . . Wed-P-7-3-11 202Liu, Xunying . . . . . . . . . . . . Wed-O-6-10-3 174Liu, Yi-Wen . . . . . . . . . . . . . . Tue-P-5-3-13 154Liu, Yuanyuan . . . . . . . . . . Wed-O-8-8-1 186Liu, Yuzong . . . . . . . . . . . . . Wed-O-7-2-6 178Liu, Zheng . . . . . . . . . . . . . . Tue-P-5-3-6 152Livescu, Karen . . . . . . . . . . Tue-P-4-2-6 143

Wed-P-6-3-1 194Thu-O-9-1-1 223Thu-O-9-8-6 229

Lleida, Eduardo . . . . . . . . . Wed-P-6-2-4 192Wed-P-6-2-6 192

Llombart, Jorge . . . . . . . . . Wed-P-6-2-4 192Logan, Yash-Yee . . . . . . . . Wed-P-7-4-3 204Lolive, Damien . . . . . . . . . . Wed-P-8-2-8 210Lopes, Carla . . . . . . . . . . . . Tue-O-5-8-5 135

Wed-P-6-1-3 189Lopez-Otero, Paula . . . . . Wed-P-6-3-8 195

Wed-P-7-4-9 205Lord, Alekzandra . . . . . . . Wed-SS-6-11-2 161Lorenzo-Trueba, Jaime . Mon-P-2-1-10 100

Thu-P-9-4-1 240Loweimi, Erfan . . . . . . . . . . Mon-P-1-1-2 93

Wed-O-7-2-5 177Lozano-Diez, Alicia . . . . . Thu-O-10-2-6 231Lu, Bo-Ru . . . . . . . . . . . . . . . . Wed-P-6-3-6 195Lu, Di . . . . . . . . . . . . . . . . . . . . Wed-SS-6-2-1 160Lu, Liang . . . . . . . . . . . . . . . . Tue-O-3-1-4 115

Thu-O-9-1-1 223Lu, Xugang . . . . . . . . . . . . . . Wed-P-6-2-3 192

Lu, Yu-Ding . . . . . . . . . . . . . Thu-O-9-1-6 224Lucero, Jorge C. . . . . . . . . . Tue-O-3-2-1 116Lui, Simon . . . . . . . . . . . . . . Wed-P-7-3-1 200Luk, San-hei Kenny . . . . . Mon-P-2-1-4 99Lunsford, Rebecca . . . . . . Tue-P-4-3-3 144Luo, Dean . . . . . . . . . . . . . . . Tue-P-5-1-9 147Luo, Qinyi . . . . . . . . . . . . . . . Tue-O-5-10-4 136Luo, Ruxin . . . . . . . . . . . . . . Tue-P-5-1-9 147Luo, Zhaojie . . . . . . . . . . . . Wed-P-8-4-8 215Luque, Jordi . . . . . . . . . . . . Wed-O-6-6-5 172Luz, Saturnino . . . . . . . . . . Wed-O-6-8-6 174

Wed-P-8-2-9 211Lyon, Thomas D. . . . . . . . Wed-O-8-10-5 188

M
M., Sasikumar . . . . . . . . . . . Wed-S&T-6-A-5 217Ma, Bin . . . . . . . . . . . . . . . . . . Wed-SS-7-11-1 166

Thu-SS-9-10-9 220Ma, Feng . . . . . . . . . . . . . . . . Mon-O-2-10-4 92Ma, Jeff . . . . . . . . . . . . . . . . . . Mon-O-1-1-1 80

Mon-O-1-1-4 81Ma, Jianbo . . . . . . . . . . . . . . Tue-P-3-1-6 137Ma, Min . . . . . . . . . . . . . . . . . Mon-O-2-1-1 86Ma, Mingbo . . . . . . . . . . . . . Wed-P-8-3-7 213Ma, Xi . . . . . . . . . . . . . . . . . . . Tue-O-4-8-1 126Ma, Zhanyu . . . . . . . . . . . . . Tue-P-3-1-4 137Maas, Roland . . . . . . . . . . . Tue-P-5-3-15 154

Wed-O-7-2-6 178Maass, Marco . . . . . . . . . . . Wed-P-7-3-2 200Mackie, Scott . . . . . . . . . . . . Wed-SS-6-11-1 161Maclagan, M.A. . . . . . . . . . . Wed-SS-6-2-3 160Madhyastha, Pranava . . . Wed-O-8-10-2 187Madikeri, Srikanth . . . . . . Tue-P-3-1-2 136Madureira, Sandra . . . . . . Thu-P-9-3-9 239Maekawa, Kikuo . . . . . . . . Tue-O-4-6-6 126Mahshie, James . . . . . . . . . Tue-P-5-2-13 151Mahto, Shivangi . . . . . . . . . Thu-O-10-2-3 231Mahu, Rodrigo . . . . . . . . . . Tue-SS-3-11-2 111Maia, Ranniery . . . . . . . . . . Wed-P-8-4-9 215Maier, Angelika . . . . . . . . . Tue-P-4-3-4 144Maiti, Soumi . . . . . . . . . . . . Thu-O-9-6-6 228Mak, Brian . . . . . . . . . . . . . . Mon-P-2-3-9 106

Thu-P-9-1-7 236Mak, Man-Wai . . . . . . . . . . . Tue-P-3-2-6 139Maki, Kotaro . . . . . . . . . . . . Thu-SS-9-11-1 220Makinae, Hisanori . . . . . . Thu-SS-9-11-1 220Makino, Shoji . . . . . . . . . . . Tue-P-5-4-10 156Malandrakis, Nikolaos . . Wed-SS-6-2-1 160

Wed-SS-7-1-3 163Malinen, Jarmo . . . . . . . . . Thu-SS-9-11-4 221Malisz, Zofia . . . . . . . . . . . . Tue-O-3-8-5 120Mallidi, Harish . . . . . . . . . . Tue-O-5-2-2 130Malykh, Egor . . . . . . . . . . . . Mon-SS-2-8-1 79Mandal, Tanumay . . . . . . Mon-P-2-2-15 104Mandel, Michael I. . . . . . . Thu-O-9-6-6 228Manohar, Vimal . . . . . . . . . Tue-P-4-1-1 140

Wed-O-7-8-2 180Thu-O-9-4-2 226

Manríquez, Rodrigo . . . . Tue-O-5-4-5 132Mansikkaniemi, André . Thu-O-10-4-5 232Marcel, Sébastien . . . . . . . Tue-S&T-3-A-2 158Marcusson, Amelie . . . . . Wed-SS-6-11-2 161Marin, Alex . . . . . . . . . . . . . Tue-P-4-3-10 145Marklund, Ellen . . . . . . . . . Tue-P-5-1-15 149

Wed-SS-6-11-2 161Wed-SS-7-11-4 166

Markó, Alexandra . . . . . . . Thu-O-9-8-5 229Marques, Luciana . . . . . . . Mon-P-2-1-12 100Marschik, Peter B. . . . . . . Mon-O-2-2-5 88Marteau, P.-F. . . . . . . . . . . . Wed-P-8-2-8 210Martínez-Hinarejos, CD Wed-P-8-3-10 213Marxer, Ricard . . . . . . . . . . Tue-P-5-4-8 156Masataki, Hirokazu . . . . . Tue-P-4-3-1 143Maslowski, Merel . . . . . . . Mon-P-2-1-6 99Mason, Lisa . . . . . . . . . . . . . Tue-O-5-2-6 131Masuda-Katsuse, Ikuyo . Thu-S&T-9-A-3 243Masumura, Ryo . . . . . . . . . Mon-P-2-4-3 106

Tue-P-4-3-1 143Tue-P-4-3-12 145Wed-P-8-3-2 212

Matassoni, Marco . . . . . . . Mon-P-2-3-5 105Matejka, Pavel . . . . . . . . . . Mon-P-2-3-4 105

Tue-O-5-2-5 131Tue-P-3-2-7 140

Matoušek, Jindrich . . . . . Wed-P-7-3-4 201Wed-S&T-6-A-3 216Wed-S&T-6-A-4 216

Matsoukas, Spyros . . . . . Thu-O-9-4-4 226Matsui, Toshie . . . . . . . . . . Mon-P-2-1-9 100

Tue-O-4-2-2 123Wed-P-6-4-3 197

Matsuo, Yoshihiro . . . . . . Wed-P-8-3-1 212Matthews, Iain . . . . . . . . . . Thu-P-9-4-8 241Matthiesen, Martin . . . . . Mon-S&T-2-B-5 110Mau, Ted . . . . . . . . . . . . . . . . Mon-O-1-10-6 86

Wed-P-6-1-7 190Mauranen, Anna . . . . . . . . Mon-S&T-2-A-2 108May, Jonathan . . . . . . . . . . Wed-SS-6-2-1 160Mazur, Radoslaw . . . . . . . Wed-P-7-3-2 200McAllaster, Donald . . . . . Wed-P-6-1-10 190McAuliffe, Michael . . . . . . Mon-P-1-2-9 96

Wed-P-8-1-5 207Thu-P-9-3-2 237

McCree, Alan . . . . . . . . . . . . Tue-P-3-2-4 139McDermott, Erik . . . . . . . . Mon-O-2-10-5 92

Tue-P-4-2-3 142McDonnell, Rachel . . . . . . Mon-O-1-10-1 85McGrath, Kathleen . . . . . . Thu-S&T-9-A-4 243McInnis, Melvin . . . . . . . . . Tue-O-4-8-4 127McLaren, Mitchell . . . . . . . Wed-P-6-2-1 191

Thu-O-10-2-6 231McLoughlin, Ian . . . . . . . . . Wed-O-7-10-2 182McQueen, James M. . . . . . Mon-P-2-2-7 102McWilliams, Kelly . . . . . . . Wed-O-8-10-5 188Medani, Takfarinas . . . . . Mon-S&T-2-B-1 109Medennikov, Ivan . . . . . . . Wed-P-6-3-3 194

Thu-O-9-4-3 226Meenakshi, G. Nisha . . . . Mon-P-1-2-10 96Meermeier, Ralf . . . . . . . . . Tue-S&T-3-A-1 158Mehta, Daryush D. . . . . . . Mon-P-2-2-13 103Meignier, Sylvain . . . . . . . Thu-O-9-2-3 225Meireles, Alexsandro R. Mon-O-2-4-4 89Meister, Einar . . . . . . . . . . . Wed-SS-7-1-12 165Menacer, M.A. . . . . . . . . . . . Thu-O-10-4-1 231Mendelev, Valentin . . . . . Thu-O-9-4-3 226Mendels, Gideon . . . . . . . . Tue-O-5-10-6 136Mendelson, Joseph . . . . . Mon-O-1-10-5 86

Mon-P-2-4-11 108Tue-SS-3-11-5 111Wed-SS-7-1-16 165

Meng, Helen . . . . . . . . . . . . Mon-P-2-4-6 107Tue-O-4-8-1 126Tue-P-3-1-7 137

Wed-P-8-4-10 216Meng, Zhong . . . . . . . . . . . . Wed-O-7-4-5 179

Thu-O-9-1-4 224Menon, Anjali . . . . . . . . . . . Thu-P-9-1-9 236Menon, Raghav . . . . . . . . . Wed-SS-7-1-2 163Merritt, Thomas . . . . . . . . Tue-O-3-8-2 120Mertens, Julia . . . . . . . . . . . Mon-O-1-2-2 82Mertins, Alfred . . . . . . . . . Wed-P-7-3-2 200Metze, Florian . . . . . . . . . . Mon-P-1-4-2 96

Wed-P-7-3-13 203Meunier, Christine . . . . . . Tue-P-5-2-6 150Meyer, Antje S. . . . . . . . . . . Mon-P-2-1-6 99Meyer, Bernd T. . . . . . . . . . Tue-O-4-2-5 124

Wed-P-6-4-7 197Meyer, Werner . . . . . . . . . . Wed-S&T-6-B-3 217Michalsky, Jan . . . . . . . . . . Wed-SS-8-11-7 168Michel, Wilfried . . . . . . . . . Tue-P-4-2-5 143Michelas, Amandine . . . . Wed-P-8-1-4 207Michelsanti, Daniel . . . . . Tue-P-5-4-12 157Miguel, Antonio . . . . . . . . Wed-P-6-2-4 192

Wed-P-6-2-6 192Mihajlik, Péter . . . . . . . . . . Wed-SS-6-2-2 160Mihkla, Meelis . . . . . . . . . . Wed-P-8-1-2 206Mihuc, Sarah . . . . . . . . . . . . Mon-P-1-2-9 96Mikušová, Nina . . . . . . . . . Mon-S&T-2-A-2 108Milde, Benjamin . . . . . . . . Wed-O-7-8-1 180Milner, Ben . . . . . . . . . . . . . . Tue-P-5-4-11 157Miloševic, Milana . . . . . . . Tue-S&T-3-B-2 159Mimura, Masato . . . . . . . . Tue-P-4-3-7 144

Wed-O-7-2-2 177Minagi, Shogo . . . . . . . . . . . Wed-P-8-4-5 215Minamiguchi, Ryo . . . . . . Wed-P-6-3-7 195Minato, Takashi . . . . . . . . Tue-SS-4-11-3 112Minematsu, Nobuaki . . . Mon-P-1-4-8 97

Tue-O-3-2-5 117Tue-O-4-10-3 128Tue-O-5-8-2 134

Ming, Huaiping . . . . . . . . . Mon-O-1-2-5 82Minker, Wolfgang . . . . . . . Wed-O-7-6-4 180Mirheidari, Bahman . . . . Wed-P-7-4-7 205Mirkin, Shachar . . . . . . . . . Mon-P-1-4-9 98Mirsamadi, S. . . . . . . . . . . . Mon-O-2-10-6 92

Misra, Abhinav . . . . . . . . . . Wed-P-6-2-13 194Thu-O-10-2-2 230

Misra, Ananya . . . . . . . . . . Mon-O-2-10-1 91Mon-O-2-10-5 92

Mitchinson, Ben . . . . . . . . Wed-S&T-6-A-1 216Miura, Iori . . . . . . . . . . . . . . . Wed-O-7-2-4 177Miwa, Kenichiro . . . . . . . . Wed-O-6-4-4 171Miyashita, Genta . . . . . . . . Mon-P-1-1-1 92Miyoshi, Hiroyuki . . . . . . . Tue-O-4-10-1 128Möbius, Bernd . . . . . . . . . . Tue-SS-4-11-5 112

Wed-P-7-2-5 199Thu-O-10-8-6 234

Mochihashi, Daichi . . . . . Wed-O-7-8-3 181Mohammadi, Amir . . . . . Tue-S&T-3-A-2 158Mohammadi, Seyed H. . . Tue-O-4-10-6 129Moinet, Alexis . . . . . . . . . . Tue-O-3-8-2 120Mok, Peggy . . . . . . . . . . . . . . Wed-P-8-1-14 209Mokhtari, Parham . . . . . . Tue-O-3-6-4 119Möller, Sebastian . . . . . . . Wed-SS-8-11-3 168

Wed-P-6-4-1 196Molloy, Hillary R. . . . . . . . Tue-P-4-3-11 145

Wed-O-7-10-5 182Moniz, Helena . . . . . . . . . . Tue-SS-5-11-8 114Monta, Natsuki . . . . . . . . . Wed-O-7-2-4 177Montas, Eva . . . . . . . . . . . . . Mon-P-2-2-10 103Moon, Jung Min . . . . . . . . . Tue-S&T-3-B-4 159Moore, Elliot . . . . . . . . . . . . Wed-P-7-4-3 204Moore, Roger K. . . . . . . . . Wed-S&T-6-A-1 216

Thu-P-9-4-7 241Moosmüller, Sylvia . . . . . Wed-P-7-2-3 198Morales, Michelle R. . . . . Wed-P-8-3-11 213Morchid, Mohamed . . . . . Wed-P-8-3-3 212

Wed-P-8-3-5 212Morency, L.-P. . . . . . . . . . . . Wed-P-8-2-3 209Morgan, Angela . . . . . . . . . Wed-O-8-8-3 186Mori, Hiroki . . . . . . . . . . . . . Wed-P-8-2-7 210Mori, Takuma . . . . . . . . . . . Mon-O-2-10-2 91Morise, Masanori . . . . . . . Mon-P-1-1-1 92

Mon-P-1-1-4 93Tue-O-5-4-1 131Wed-O-6-4-6 171

Moró, Anna . . . . . . . . . . . . . Mon-P-1-4-11 98Mostafa, Naziba . . . . . . . . Wed-P-7-3-14 203Motlicek, Petr . . . . . . . . . . . Tue-P-3-1-2 136

Wed-O-6-10-5 175Mower Provost, Emily . . Mon-O-2-2-2 88

Tue-O-3-10-3 121Tue-O-3-10-5 121Tue-O-4-8-4 127

Mrkšic, Nikola . . . . . . . . . . Tue-P-4-3-13 145Muddireddy, P.R. . . . . . . . Wed-P-6-3-9 195Mukherjee, Sankar . . . . . . Wed-O-6-6-4 172Mulholland, Matthew . . . Tue-O-5-8-1 134

Tue-O-5-8-3 134Wed-P-6-1-4 189

Müller, Ludek . . . . . . . . . . . Thu-O-9-2-1 224Müller, Markus . . . . . . . . . . Tue-SS-4-11-4 112Mun, Seongkyu . . . . . . . . . Tue-O-3-4-4 118

Wed-P-6-2-14 194Mundnich, Karel . . . . . . . . Wed-P-7-4-11 206Munson, Benjamin . . . . . . Tue-P-5-1-4 146Murphy, Andy . . . . . . . . . . Wed-SS-7-1-1 163

Thu-P-9-3-8 238Murphy, Damian T. . . . . . Mon-O-1-10-2 85Murthy, B.H.V.S. N. . . . . . Tue-S&T-3-A-4 158Murthy, Hema A. . . . . . . . Wed-O-7-1-3 176

Wed-S&T-6-A-5 217Thu-O-10-11-4 234

Murtola, Tiina . . . . . . . . . . . Thu-SS-9-11-4 221Murty, K. Sri Rama . . . . . Wed-SS-7-1-5 163

Wed-P-6-2-12 193

N
Naaman, Einat . . . . . . . . . . Wed-O-7-8-6 181Nadolski, Adam . . . . . . . . Tue-O-3-8-2 120Nagaraja, Varun . . . . . . . . Thu-O-9-4-4 226Nagarsheth, Parav . . . . . . Mon-SS-2-8-4 80Nagesha, Venki . . . . . . . . . Wed-P-6-1-10 190Nagrani, Arsha . . . . . . . . . . Wed-O-8-1-5 183Nahamoo, David . . . . . . . . Tue-O-3-1-5 115Nair, Angelika . . . . . . . . . . Mon-S&T-2-B-1 109Najafian, Maryam . . . . . . . Wed-O-7-10-6 182Nakadai, Kazuhiro . . . . . . Tue-P-4-2-2 142Nakagawa, Seiichi . . . . . . Wed-P-6-1-6 190

Nakamura, Satoshi . . . . . Tue-O-3-8-3 120Tue-P-4-1-7 141

Wed-O-7-1-4 176Wed-O-8-4-2 184Wed-O-8-4-4 184

Nakamura, Shizuka . . . . . Tue-P-4-3-14 146Nakanishi, Ryosuke . . . . Tue-P-4-3-14 146Nakashika, Toru . . . . . . . . Wed-P-8-4-2 214

Thu-P-9-4-14 243Nakatani, Tomohiro . . . . Mon-O-2-10-2 91

Tue-O-4-4-2 124Tue-P-4-1-3 141Tue-P-4-1-4 141Tue-P-5-4-3 155

Wed-O-8-6-2 185Wed-P-6-4-3 197Thu-P-9-1-5 236

Namasivayam, Aravind . Mon-P-2-2-14 104Nankaku, Yoshihiko . . . . Mon-O-1-10-2 85Nara, Kiranpreet . . . . . . . . Tue-O-3-6-6 119Narayanan, Arun . . . . . . . Mon-O-2-10-1 91

Mon-O-2-10-5 92Tue-P-4-2-4 143Thu-O-9-1-3 223

Thu-P-9-1-10 237Narayanan, Shrikanth S. Mon-O-1-2-2 82

Mon-O-1-10-4 86Mon-P-2-2-2 101Mon-P-2-2-5 102Tue-O-3-2-6 117

Tue-O-5-10-4 136Tue-P-3-2-9 140

Wed-SS-6-2-1 160Wed-SS-7-1-3 163Wed-O-6-1-3 169Wed-O-6-1-5 170

Wed-O-8-10-5 188Wed-P-7-4-2 204

Wed-P-7-4-11 206Wed-P-8-2-2 209

Wed-P-8-2-11 211Thu-P-9-1-3 235

Narendra, N.P. . . . . . . . . . . Thu-P-9-3-11 239Narita, Tomohiro . . . . . . . Wed-O-7-2-4 177Narwekar, Abhishek . . . . Thu-O-10-4-2 232Nasir, Md. . . . . . . . . . . . . . . . Wed-P-7-4-11 206

Wed-P-8-2-11 211Nataraj, K.S. . . . . . . . . . . . . . Mon-O-2-4-5 89Navas, Eva . . . . . . . . . . . . . . Thu-SS-10-10-4 222Nayak, Krishna . . . . . . . . . Mon-P-2-2-5 102Nayak, Neha . . . . . . . . . . . . Wed-P-8-3-8 213Nayak, Shekhar . . . . . . . . . Wed-SS-7-1-5 163Neeracher, Matthias . . . . Thu-P-9-4-12 242Nellore, Bhanu Teja . . . . Wed-P-7-2-11 200Németh, Géza . . . . . . . . . . . Mon-P-1-1-6 93Nercessian, Shahan . . . . . Tue-O-5-2-2 130Nerpagar, Rachana . . . . . Wed-S&T-6-A-5 217Neubig, Graham . . . . . . . . Wed-O-7-8-3 181Neufeld, Chris . . . . . . . . . . Tue-O-4-2-6 124Neumann, Michael . . . . . . Tue-O-4-8-6 127Neuschaefer-Rube, C. . . . Wed-O-6-1-2 169Ney, Hermann . . . . . . . . . . Tue-O-3-1-2 115

Tue-P-4-2-5 143Ng, Raymond W.M. . . . . . Wed-O-8-10-2 187Ng, Tim . . . . . . . . . . . . . . . . . Mon-O-1-1-1 80

Mon-O-1-1-4 81Ng, Wen Zheng Terence Thu-SS-9-10-9 220

Thu-O-10-11-2 234Nguyen, Noël . . . . . . . . . . . Tue-SS-3-11-6 111

Wed-O-6-6-4 172Ni, Jinfu . . . . . . . . . . . . . . . . . Mon-P-2-4-4 107Ni, Zhidong . . . . . . . . . . . . . Thu-SS-9-10-6 219Ní Chasaide, Ailbhe . . . . Tue-O-3-6-2 118

Wed-SS-7-1-1 163Thu-P-9-3-8 238

Ní Chiaráin, Neasa . . . . . . Wed-SS-7-1-1 163Nidadavolu, Phani S. . . . . Tue-O-5-2-2 130Niebuhr, Oliver . . . . . . . . . Mon-P-2-1-11 100

Tue-SS-5-11-1 113Wed-SS-8-11-6 168Thu-O-10-8-5 233

Niehues, Jan . . . . . . . . . . . . Mon-P-1-4-2 96Wed-O-8-4-5 184

Niesler, Thomas . . . . . . . . Mon-SS-1-11-8 78Wed-SS-7-1-2 163

Nikulásdóttir, Anna B. . . Wed-SS-7-1-11 165Wed-SS-7-1-13 165

Nikulin, Aleksander . . . . Wed-S&T-6-A-6 217

Nilsson Björkenstam, K. Wed-SS-7-11-5 167Ning, Yishuang . . . . . . . . . Wed-P-8-4-10 216Nirschl, Michael . . . . . . . . Mon-O-2-1-1 86Nishizaki, Hiromitsu . . . Wed-P-8-3-2 212Nookala, Usha A. . . . . . . . Thu-O-10-2-1 230Norel, Raquel . . . . . . . . . . . Wed-P-7-4-6 205Nöth, Elmar . . . . . . . . . . . . . Mon-O-2-2-3 88

Mon-O-2-2-6 88Tue-P-5-2-7 150

Wed-P-7-4-10 205Novák-Tót, Eszter . . . . . . Wed-SS-8-11-6 168Novoa, José . . . . . . . . . . . . . Tue-SS-3-11-2 111Novoselov, Sergey . . . . . . Mon-SS-2-8-1 79Novotný, M. . . . . . . . . . . . . . Wed-P-7-4-4 204Novotný, Ondrej . . . . . . . . Tue-O-5-2-5 131

Tue-P-3-2-7 140Nowicki, Jakub . . . . . . . . . . Tue-S&T-3-A-5 158Nwe, Tin Lay . . . . . . . . . . . . Thu-SS-9-10-9 220Nyström, Pär . . . . . . . . . . . . Mon-O-2-2-5 88

O
Obuchi, Yasunari . . . . . . . Wed-SS-8-11-1 167Oertel, Catharine . . . . . . . Tue-SS-3-11-5 111Ogawa, Atsunori . . . . . . . . Tue-P-4-1-3 141

Tue-P-4-1-4 141Tue-P-5-4-3 155

Wed-O-8-6-2 185Thu-P-9-1-5 236

Oh, Eunmi . . . . . . . . . . . . . . Tue-P-5-4-5 155Ohashi, Hiroki . . . . . . . . . . Mon-P-2-2-10 103Ohsugi, Yasuhito . . . . . . . Tue-O-3-8-4 120Öktem, Alp . . . . . . . . . . . . . Mon-S&T-2-A-1 108Omologo, Maurizio . . . . . Tue-O-5-1-3 129Öngür, Dost . . . . . . . . . . . . . Wed-P-8-2-3 209Oplustil, Pilar . . . . . . . . . . . Wed-SS-7-1-16 165Orio, Patricio . . . . . . . . . . . . Tue-O-5-4-5 132Orozco-Arroyave, J.R. . . Mon-O-2-2-3 88

Mon-O-2-2-6 88Tue-P-5-2-7 150

Ortega, Alfonso . . . . . . . . . Wed-P-6-2-4 192Wed-P-6-2-6 192

Østergaard, Jan . . . . . . . . . Tue-O-4-2-4 124Östling, Robert . . . . . . . . . Tue-P-5-1-13 148

Wed-SS-7-11-3 166Ostrand, Rachel . . . . . . . . Wed-P-7-4-6 205Ottl, Sandra . . . . . . . . . . . . . Thu-SS-10-10-3 222Oualil, Youssef . . . . . . . . . Mon-O-2-1-2 87

Mon-P-1-4-10 98Wed-O-6-10-5 175Wed-O-8-10-3 188

Ozawa, Kenji . . . . . . . . . . . . Mon-P-1-1-1 92

P
Paats, Andrus . . . . . . . . . . . Wed-SS-7-1-12 165Pagmar, David . . . . . . . . . . Wed-SS-7-11-4 166Pahuja, Vardaan . . . . . . . . Mon-P-1-4-9 98Paiva, Ana . . . . . . . . . . . . . . . Tue-SS-5-11-8 114Paletz, Susannah . . . . . . . Tue-P-4-3-8 144Pałka, Szymon . . . . . . . . . . Thu-S&T-9-A-1 243Palmer, Frederik . . . . . . . . Tue-SS-5-11-5 114Palomäki, Kalle . . . . . . . . . Wed-S&T-6-A-6 217Pampouchidou, A. . . . . . . Wed-O-8-8-5 187Pan, Jielin . . . . . . . . . . . . . . . Tue-P-5-3-2 152Pan, Jing . . . . . . . . . . . . . . . . Wed-P-7-3-11 202Pan, Xiaoman . . . . . . . . . . . Wed-SS-6-2-1 160Pan, Yilin . . . . . . . . . . . . . . . . Tue-P-3-2-1 139Panchapagesan, S. . . . . . . Thu-O-9-4-4 226Pandey, Prem C. . . . . . . . . Mon-O-2-4-5 89Pandia, Karthik . . . . . . . . . Wed-O-7-1-3 176Pang, Cheng . . . . . . . . . . . . Tue-P-5-3-11 153Pantic, Maja . . . . . . . . . . . . . Wed-O-6-8-4 173Papadopoulos, Pavlos . . Wed-SS-6-2-1 160

Thu-P-9-1-3 235Parada, Carolina . . . . . . . . Tue-P-5-3-8 153

Thu-O-10-11-3 234Parada-Cabaleiro, E. . . . . Wed-P-8-2-1 209Parcheta, Zuzanna . . . . . . Wed-P-8-3-10 213Parcollet, Titouan . . . . . . . Wed-P-8-3-5 212Park, Ji Ho . . . . . . . . . . . . . . Mon-S&T-2-B-4 109Park, Se Rim . . . . . . . . . . . . Tue-P-5-4-9 156Park, Soo Jin . . . . . . . . . . . . Tue-P-3-1-10 138Park, Young-cheol . . . . . . Wed-O-8-6-5 185Parlato-Oliveira, Erika . . Mon-P-2-1-8 100Parthasarathi, Sree H.K. Wed-O-7-2-6 178Parthasarathy, S. . . . . . . . . Wed-P-6-1-12 191Parthasarathy, Srinivas . Tue-O-3-10-4 121Pascual, Santiago . . . . . . . Thu-O-9-6-5 227

Patel, Rupal . . . . . . . . . . . . . Wed-P-7-4-3 204Patel, Tanvina B. . . . . . . . . Mon-SS-1-8-3 76Patil, Hemant A. . . . . . . . . Mon-SS-1-8-3 76

Wed-O-8-1-2 183Wed-P-7-3-15 203Wed-P-7-3-16 203

Patil, Kailash . . . . . . . . . . . . Mon-SS-2-8-4 80Patil, Nimisha . . . . . . . . . . . Wed-O-6-1-3 169

Wed-O-6-1-5 170Patterson, Roy D. . . . . . . . Mon-P-2-1-9 100

Tue-O-4-2-2 123Paulik, Matthias . . . . . . . . Wed-SS-7-1-8 164Peddinti, Vijayaditya . . . Tue-P-4-1-1 140

Tue-P-4-2-1 142Thu-O-9-1-3 223Thu-O-9-4-2 226

Pederson, Eric . . . . . . . . . . Tue-P-5-1-8 147Pedzimaz, Tomasz . . . . . Thu-S&T-9-A-1 243Pelachaud, Catherine . . . Wed-K3-1 160Peng, Gang . . . . . . . . . . . . . . Tue-P-5-1-14 148

Wed-O-7-1-2 175Pennant, Luciana . . . . . . . Wed-P-8-2-3 209Peperkamp, Sharon . . . . . Mon-P-2-1-8 100Perdigão, Fernando . . . . . Tue-O-5-8-5 135

Wed-P-6-1-3 189Peres, Daniel Oliveira . . . Mon-P-2-1-7 99Pérez, Juan Manuel . . . . . Wed-O-6-6-6 172

Wed-P-8-1-6 207Pernkopf, Franz . . . . . . . . Mon-P-1-2-8 96

Tue-O-5-1-5 130Wed-O-8-6-3 185

Peters, Judith . . . . . . . . . . . Mon-P-2-2-8 102Peterson, Sean D. . . . . . . . Tue-O-5-4-5 132Pettorino, Massimo . . . . . Wed-P-7-4-12 206Petukhova, Volha . . . . . . . Mon-O-1-2-1 81

Tue-SS-4-11-6 113Pétursson, Matthías . . . . Wed-SS-7-1-13 165Pfeifenberger, Lukas . . . Wed-O-8-6-3 185Phan, Huy . . . . . . . . . . . . . . . Wed-P-7-3-2 200Piccaluga, Myriam . . . . . . Thu-O-10-8-2 233Piccinini, Page . . . . . . . . . . Tue-S&T-3-A-6 158Picheny, Michael . . . . . . . . Mon-O-1-1-5 81

Tue-O-3-1-5 115Piitulainen, Jussi . . . . . . . . Mon-S&T-2-B-5 110Pirhosseinloo, Shadi . . . . Tue-O-4-4-3 125Plante-Hébert, Julien . . . Thu-P-9-3-6 238Platek, Ondrej . . . . . . . . . . Wed-P-6-1-10 190Plchot, Oldrich . . . . . . . . . . Tue-O-5-2-5 131

Tue-P-3-2-7 140Plug, Leendert . . . . . . . . . . Tue-O-5-6-4 133Plumbley, Mark D. . . . . . . Mon-P-1-1-10 94

Wed-P-7-3-10 202Pokorny, Florian B. . . . . . Mon-O-2-2-5 88Pollet, Vincent . . . . . . . . . . Thu-P-9-4-3 241Półrola, Paweł . . . . . . . . . . . Wed-P-7-4-12 206Pompili, Anna . . . . . . . . . . . Tue-SS-5-11-8 114Pontil, Massimiliano . . . . Tue-O-3-2-4 116Poorjam, Amir Hossein . Mon-O-2-2-1 87Potard, Blaise . . . . . . . . . . . Thu-P-9-4-9 242Pourdamghani, Nima . . . Wed-SS-6-2-1 160Povey, Daniel . . . . . . . . . . . Mon-P-1-4-3 97

Tue-O-3-4-1 117Tue-P-4-1-1 140Tue-P-4-2-1 142

Wed-O-7-8-2 180Thu-O-9-4-2 226

Prabhavalkar, Rohit . . . . Tue-O-3-1-1 115Thu-O-10-1-5 230

Pradhan, Gayadhar . . . . . Mon-P-1-1-5 93Tue-P-5-3-3 152

Wed-O-6-10-2 174Prado, Pavel . . . . . . . . . . . . . Tue-O-5-4-5 132Prahallad, Kishore . . . . . . Thu-P-9-4-12 242Prakash, Jeena J. . . . . . . . . Thu-O-10-11-4 234Prasad, RaviShankar . . . . Wed-P-7-2-11 200Prasanna, S.R. M. . . . . . . . Mon-SS-1-8-5 77

Mon-P-1-1-8 94Mon-P-1-2-2 94Tue-O-3-6-3 119Tue-P-5-2-4 150Tue-P-5-2-5 150

Wed-P-6-2-12 193Thu-P-9-4-6 241

Prateek, K.L. . . . . . . . . . . . . Wed-O-7-1-3 176Prenger, Ryan . . . . . . . . . . . Tue-P-4-1-5 141Prévot, Laurent . . . . . . . . . Tue-SS-3-11-6 111Proctor, Michael . . . . . . . . Wed-P-7-2-2 198

Proença, Jorge . . . . . . . . . . Tue-O-5-8-5 135Wed-P-6-1-3 189

Prudnikov, Alexey . . . . . . Thu-O-9-4-3 226Psutka, Josef V. . . . . . . . . . Wed-P-6-3-13 196Puga, Karin . . . . . . . . . . . . . Wed-P-8-1-14 209Pugachevskiy, Sergey . . . Tue-SS-3-11-4 111

Thu-SS-10-10-3 222Pugh, Robert A. . . . . . . . . . Wed-O-7-10-5 182Pullela, Keerthi . . . . . . . . . Tue-P-5-2-4 150Pundak, Golan . . . . . . . . . . Mon-O-2-10-5 92

Tue-O-5-1-2 129Pusateri, Ernest . . . . . . . . . Wed-P-6-1-10 190Pushpavathi, M. . . . . . . . . . Tue-P-5-2-5 150Pust, Michael . . . . . . . . . . . . Wed-SS-6-2-1 160Putrycz, Bartosz . . . . . . . . Tue-O-3-8-2 120

Q
Qi, Xiaoke . . . . . . . . . . . . . . . Wed-P-7-3-5 201Qian, Kaizhi . . . . . . . . . . . . . Tue-P-5-4-13 157Qian, Kun . . . . . . . . . . . . . . . Thu-SS-9-10-1 219Qian, Qi . . . . . . . . . . . . . . . . . Wed-P-6-2-11 193Qian, Yanmin . . . . . . . . . . . Mon-P-1-4-6 97

Tue-P-3-1-5 137Wed-O-7-2-3 177

Qian, Yao . . . . . . . . . . . . . . . Tue-O-5-8-1 134Wed-O-7-10-5 182

Quatieri, Thomas F. . . . . Mon-P-2-2-13 103Wed-P-8-1-7 207

Quinn, John . . . . . . . . . . . . . Wed-SS-7-1-2 163Quiroz, Sergio I. . . . . . . . . Tue-O-4-6-1 125

R
Rábai, Krisztina . . . . . . . . . Wed-SS-6-2-2 160Ragni, A. . . . . . . . . . . . . . . . . Mon-O-2-1-3 87

Wed-P-6-1-8 190Rahimi, Zahra . . . . . . . . . . . Tue-P-4-3-8 144Rahman, Md. Hafizur . . Tue-P-3-2-10 140Raitio, Tuomo . . . . . . . . . . Thu-P-9-4-12 242Raj, Bhiksha . . . . . . . . . . . . Mon-P-1-2-7 95

Tue-P-5-3-1 152Raju, Manoj . . . . . . . . . . . . . Mon-O-1-2-1 81Rallabandi, SaiKrishna . Mon-SS-1-11-4 78

Mon-SS-1-11-5 78Ramabhadran, B. . . . . . . . Mon-O-1-1-5 81

Mon-O-2-1-4 87Mon-O-2-1-5 87Mon-P-2-4-2 106Tue-O-3-1-5 115Thu-O-9-4-5 226

Thu-O-10-1-4 230Thu-P-9-4-5 241

Ramakrishnan, A.G. . . . . Mon-P-2-2-4 101Ramanarayanan, V. . . . . . Mon-SS-1-11-3 78

Mon-P-2-2-5 102Tue-P-4-3-11 145Wed-O-7-6-5 180

Ramírez López, Ana . . . . Tue-O-5-4-2 132Ramos, Miguel Varela . . Wed-P-8-4-11 216Ranjan, Shivesh . . . . . . . . . Tue-O-3-4-3 117

Tue-O-5-2-4 131Wed-P-6-2-13 194Thu-O-10-2-2 230

Rantula, Olli . . . . . . . . . . . . Wed-S&T-6-A-6 217Rao, Kanishka . . . . . . . . . . Tue-O-3-1-1 115

Tue-O-5-1-1 129Wed-O-7-8-5 181

Thu-O-10-1-5 230Rao, K. Sreenivasa . . . . . . Mon-P-2-2-15 104Rao, Wei . . . . . . . . . . . . . . . . . Tue-P-5-3-5 152Rao M.V., Achuth . . . . . . . Thu-SS-10-10-1 222Raposo de Medeiros, B. . Mon-O-2-4-4 89Räsänen, Okko . . . . . . . . . . Tue-O-5-4-2 132

Wed-P-6-1-2 189Wed-P-8-1-8 207

Rasipuram, Ramya . . . . . Thu-P-9-4-12 242Rastrow, Ariya . . . . . . . . . . Tue-P-5-3-15 154Ratajczak, Martin . . . . . . . Tue-O-5-1-5 130Ratnagiri, Madhavi . . . . . Tue-P-5-2-11 151Rautara, Sarita . . . . . . . . . . Mon-S&T-2-A-6 109Ravanelli, Mirco . . . . . . . . Tue-O-5-1-3 129Raveh, Eran . . . . . . . . . . . . . Tue-SS-4-11-5 112

Thu-O-10-8-6 234Raykar, Vikas . . . . . . . . . . . Mon-P-1-4-9 98Raymond, Christian . . . . Wed-O-7-4-4 178Rayner, Manny . . . . . . . . . . Wed-S&T-6-B-2 217Reetz, Henning . . . . . . . . . Tue-O-5-8-4 134Rehr, Robert . . . . . . . . . . . . Tue-P-5-4-7 156

Reidy, Patrick F. . . . . . . . . Tue-P-5-1-4 146Reinhold, Isabella . . . . . . Wed-P-7-3-3 201Reiss, Attila . . . . . . . . . . . . . Thu-O-10-11-1 234Remes, Ulpu . . . . . . . . . . . . Wed-P-6-1-2 189Renals, Steve . . . . . . . . . . . . Mon-P-2-3-10 106

Wed-P-6-3-10 196Rendel, Asaf . . . . . . . . . . . . Mon-P-2-4-2 106

Tue-O-5-4-4 132Renner, Lena F. . . . . . . . . . Mon-P-2-2-11 103Rennie, Steven J. . . . . . . . . Mon-O-2-10-3 92Reuber, Markus . . . . . . . . . Wed-P-7-4-7 205Reverdy, Justine . . . . . . . . Tue-P-4-3-9 145Reynolds, Douglas . . . . . . Tue-O-5-2-6 131

Tue-P-3-2-5 139Rialland, Annie . . . . . . . . . Thu-P-9-3-7 238Ribeiro, Antonio Celso . Mon-O-2-4-4 89Ribeiro, M. Sam . . . . . . . . . Mon-P-2-4-10 108Ricard, Joseph . . . . . . . . . . Mon-SS-1-11-7 78Riccardi, Giuseppe . . . . . Wed-O-7-6-3 179Richardson, Brigitte . . . . Wed-P-7-3-9 202Richardson, Fred . . . . . . . Tue-O-5-2-2 130Ridouane, Rachid . . . . . . . Mon-O-1-6-4 84Riou, Matthieu . . . . . . . . . . Wed-P-8-3-9 213Rocha, Bruno . . . . . . . . . . . Wed-P-7-3-6 201Rodehorst, Mike . . . . . . . . Thu-O-9-4-4 226Roebel, Axel . . . . . . . . . . . . Tue-O-4-10-5 128Rognoni, Luca . . . . . . . . . . Tue-O-4-6-5 126Rohdin, Johan . . . . . . . . . . Tue-O-5-2-5 131Rojas-Barahona, Lina . . . Tue-P-4-3-13 145Romanenko, Aleksei . . . . Wed-P-6-3-3 194

Thu-O-9-4-3 226Romøren, Anna Sara H. Mon-P-2-2-16 104Ronanki, Srikanth . . . . . . Tue-O-4-1-4 122Roomi, Bergul . . . . . . . . . . . Mon-O-1-1-5 81Rosa-Zurera, Manuel . . . Mon-O-1-4-5 83Rose, Richard . . . . . . . . . . . Mon-O-2-10-5 92

Thu-P-9-1-10 237Rosenberg, Andrew . . . . Mon-P-2-4-2 106

Thu-P-9-4-5 241Rossato, Solange . . . . . . . . Wed-P-6-2-9 193Rosset, Sophie . . . . . . . . . . Wed-O-7-4-3 178Rosti, Antti-Veikko . . . . . Wed-SS-7-1-8 164Rouat, Jean . . . . . . . . . . . . . Tue-S&T-3-B-5 159

Wed-O-8-6-4 185Rouhe, Aku . . . . . . . . . . . . . Tue-S&T-3-B-6 159Roustan, Benjamin . . . . . Mon-O-1-2-6 82Rouvier, Mickael . . . . . . . . Tue-P-3-2-3 139

Wed-P-6-2-8 192Roux, Justus . . . . . . . . . . . . Thu-O-10-8-4 233Rozen, Piotr . . . . . . . . . . . . . Tue-S&T-3-A-5 158Rozenberg, Shai . . . . . . . . Tue-O-3-10-1 121Ruede, Robin . . . . . . . . . . . Tue-SS-4-11-4 112Ruhs, Mirko . . . . . . . . . . . . . Thu-O-10-11-1 234Ruiz, Nicholas . . . . . . . . . . Wed-O-8-4-3 184Russell, Martin . . . . . . . . . . Mon-O-2-4-1 89Russell, Scott . . . . . . . . . . . Wed-P-7-4-3 204Rusz, Jan . . . . . . . . . . . . . . . . Tue-P-5-2-8 150

Wed-P-7-4-4 204Ružicka, Evžen . . . . . . . . . Tue-P-5-2-8 150

Wed-P-7-4-4 204

S
S., Irfan . . . . . . . . . . . . . . . . . . Mon-O-1-6-6 85Sadamitsu, Kugatsu . . . . Wed-P-8-3-1 212Sadeghian, Roozbeh . . . . Wed-O-8-8-6 187Sadjadi, Seyed Omid . . . Tue-O-5-2-6 131

Tue-P-3-2-5 139Saeb, Armin . . . . . . . . . . . . . Wed-SS-7-1-2 163Sagha, Hesam . . . . . . . . . . . Wed-P-7-4-5 204

Wed-P-8-2-5 210Sagisaka, Yoshinori . . . . . Mon-P-2-2-12 103Sahidullah, Md. . . . . . . . . . Mon-SS-1-8-1 76

Tue-P-3-1-8 138Wed-O-8-1-4 183

Sahkai, Heete . . . . . . . . . . . Wed-P-8-1-2 206Sahu, Saurabh . . . . . . . . . . Tue-O-4-8-2 127

Wed-P-7-4-2 204Sailor, Hardik B. . . . . . . . . Wed-O-8-1-2 183

Wed-P-7-3-15 203Sainath, Tara N. . . . . . . . . . Mon-O-2-10-1 91

Mon-O-2-10-5 92Tue-O-3-1-1 115Tue-O-3-1-6 116Tue-O-5-1-2 129Thu-O-9-1-3 223

Thu-O-10-1-5 230Thu-O-10-11-3 234

Saito, Daisuke . . . . . . . . . . Mon-P-1-4-8 97Tue-O-3-2-5 117

Tue-O-4-10-3 128Tue-O-5-8-2 134

Saito, Yuki . . . . . . . . . . . . . . Tue-O-4-10-1 128Sak, Hasim . . . . . . . . . . . . . . Mon-O-2-10-5 92

Tue-O-5-1-1 129Thu-O-10-1-6 230

Sakai, Shinsuke . . . . . . . . . Wed-O-7-2-2 177Sakakibara, Ken-Ichi . . . . Mon-P-1-1-4 93

Tue-O-5-4-1 131Sakti, Sakriani . . . . . . . . . . Wed-O-7-1-4 176

Wed-O-8-4-2 184Wed-O-8-4-4 184

Salim, Fahim A. . . . . . . . . . Wed-O-6-8-6 174Salimbajevs, Askars . . . . Mon-S&T-2-B-3 109Salvi, Giampiero . . . . . . . . Tue-P-5-2-2 149Samaddar, A.B. . . . . . . . . . Wed-O-6-10-2 174Samarakoon, Lahiru . . . . Mon-P-2-3-9 106Sameti, Hossein . . . . . . . . . Mon-P-1-4-3 97Samui, Suman . . . . . . . . . . Thu-O-9-6-1 227Sanabria, Ramon . . . . . . . Mon-P-1-4-2 96Sánchez, Ariadna . . . . . . . Wed-O-6-6-5 172Sanchez, Jon . . . . . . . . . . . . Thu-SS-10-10-4 222Sánchez-Martín, P. . . . . . . Thu-SS-9-11-5 221Sandhan, Tushar . . . . . . . Mon-P-1-2-6 95Sandsten, Maria . . . . . . . . . Wed-P-7-3-3 201Sangwan, Abhijeet . . . . . . Wed-P-6-1-13 191San Segundo, Eugenia . . Thu-P-9-3-3 237Saon, George . . . . . . . . . . . . Mon-O-1-1-3 81

Mon-O-1-1-5 81Mon-O-2-1-5 87Tue-O-3-1-5 115

Saraclar, Murat . . . . . . . . . Thu-O-9-4-6 226Sarasola, Xabier . . . . . . . . . Thu-SS-10-10-4 222Sarkar, Achintya Kr. . . . . Wed-O-8-1-4 183Sarmah, Priyankoo . . . . . Mon-O-1-6-6 85

Tue-O-3-6-3 119Saruwatari, Hiroshi . . . . . Tue-O-4-10-1 128

Thu-P-9-4-2 240Saryazdi, Raheleh . . . . . . Tue-SS-4-11-2 112Sato, Masaaki . . . . . . . . . . . Wed-P-8-4-5 215Satt, Aharon . . . . . . . . . . . . Tue-O-3-10-1 121Saurous, Rif A. . . . . . . . . . . Thu-P-9-4-11 242Sawada, Naoki . . . . . . . . . . Wed-P-8-3-2 212Saz Torralba, Oscar . . . . Mon-P-1-1-2 93S.B., Sunil Kumar . . . . . . . Mon-P-2-2-15 104Scarborough, Rebecca . . Mon-P-2-1-12 100Schaffer, J. David . . . . . . . Wed-O-8-8-6 187Schatz, Thomas . . . . . . . . . Wed-P-7-2-13 200Scherer, Stefan . . . . . . . . . . Wed-P-8-3-11 213Scheutz, Hannes . . . . . . . . Wed-P-7-2-3 198Schieder, Sebastian . . . . . Thu-SS-9-10-2 219Schiller, Dominik . . . . . . . Thu-SS-9-10-7 220Schlangen, David . . . . . . . Tue-P-4-3-4 144Schlüter, Ralf . . . . . . . . . . . Tue-O-3-1-2 115

Tue-P-4-2-5 143Schmidhuber, Jürgen . . . Thu-O-9-8-3 228Schmidt, Christoph . . . . . Wed-O-7-8-1 180Schmidt, Gerhard . . . . . . . Mon-O-1-4-2 83

Wed-O-6-4-5 171Schmitt, Maximilian . . . . Wed-O-6-8-4 173

Thu-SS-9-10-1 219Schneider, Gerold . . . . . . Tue-P-5-1-10 148Schnieder, Sebastian . . . Thu-SS-9-10-1 219Schoffelen, Jan-Mathijs . Mon-P-2-2-7 102Schoormann, Heike . . . . . Wed-SS-8-11-7 168Schötz, Susanne . . . . . . . . Tue-O-5-6-6 134Schröder, Anne . . . . . . . . . Tue-O-5-10-5 136Schuller, Björn . . . . . . . . . . Mon-O-2-2-5 88

Tue-SS-3-11-4 111Tue-S&T-3-B-3 159

Wed-O-6-8-4 173Wed-P-7-4-5 204Wed-P-7-4-8 205Wed-P-8-2-1 209Wed-P-8-2-5 210

Thu-SS-9-10-1 219Thu-SS-10-10-2 222Thu-SS-10-10-3 222Thu-SS-10-10-8 223

Thu-P-9-3-15 240Schultz, Tanja . . . . . . . . . . Wed-P-7-4-1 203Schwarz, Iris-Corinna . . Wed-SS-6-11-2 161Schweitzer, Antje . . . . . . . Tue-SS-5-11-6 114

Wed-SS-8-11-5 168Wed-O-6-6-3 172

Schweitzer, Katrin . . . . . . Tue-SS-5-11-6 114

Seelamantula, C. S. . . . . . Mon-O-2-4-3 89Wed-O-6-4-3 170

Seeram, Tejaswi . . . . . . . . Wed-SS-7-1-10 164Segura, Carlos . . . . . . . . . . Wed-O-6-6-5 172Seiderer, Andreas . . . . . . . Thu-SS-9-10-7 220Seidl, Amanda . . . . . . . . . . Wed-SS-6-11-3 162

Thu-SS-9-10-1 219Thu-SS-9-10-4 219

Selamtzis, Andreas . . . . . Tue-P-5-2-2 149Sell, Gregory . . . . . . . . . . . . Tue-P-3-2-4 139Seltzer, Michael L. . . . . . . Wed-O-6-10-1 174Seo, Jeongil . . . . . . . . . . . . . Mon-P-1-2-5 95Sercu, Tom . . . . . . . . . . . . . . Mon-O-1-1-5 81Serigos, Jacqueline . . . . . Mon-SS-1-11-7 78Serrà, Joan . . . . . . . . . . . . . . Thu-O-9-6-5 227Serrano, Luis . . . . . . . . . . . . Thu-SS-10-10-4 222Serrurier, Antoine . . . . . . Wed-O-6-1-2 169Seshadri, Shreyas . . . . . . . Tue-O-5-4-2 132

Wed-P-6-1-2 189Sethu, Vidhyasaharan . . Tue-O-4-8-3 127

Tue-P-3-1-6 137Wed-O-7-10-4 182Wed-O-8-1-3 183Wed-P-6-2-2 191

Sethy, Abhinav . . . . . . . . . . Mon-O-2-1-4 87Mon-O-2-1-5 87Thu-O-9-4-5 226

Setter, Jane . . . . . . . . . . . . . Wed-P-8-1-14 209Settle, Shane . . . . . . . . . . . . Wed-P-6-3-1 194

Thu-O-9-8-6 229Sezgin, Metin . . . . . . . . . . . Tue-SS-3-11-3 111Shafran, Izhak . . . . . . . . . . Mon-O-2-10-5 92

Thu-P-9-1-10 237Shahnawazuddin, S. . . . . Mon-P-1-1-5 93

Tue-P-5-3-3 152Wed-O-6-10-2 174

Shakhnarovich, G. . . . . . . Thu-O-9-8-6 229Shanmugam, Aswin . . . . Wed-S&T-6-A-5 217Shannon, Matt . . . . . . . . . . Mon-O-2-10-5 92

Tue-O-5-1-1 129Tue-P-5-3-8 153Thu-O-9-1-2 223

Sharma, Bidisha . . . . . . . . Mon-P-1-1-8 94Sharma, Jitendra . . . . . . . . Wed-O-7-1-3 176Sharma, Shubham . . . . . . Mon-P-2-2-4 101Shaw, Francesca . . . . . . . . Thu-P-9-4-9 242Shaw, Jason A. . . . . . . . . . . Wed-P-7-2-2 198Shchemelinin, Vadim . . . Mon-SS-2-8-1 79Shechtman, Slava . . . . . . . Tue-O-5-4-4 132Sheena, Yaniv . . . . . . . . . . . Tue-O-3-6-5 119Shen, Chen . . . . . . . . . . . . . . Tue-O-5-1-4 129

Thu-O-10-1-1 229Shen, Peng . . . . . . . . . . . . . . Wed-P-6-2-3 192Shen, Xiaoyu . . . . . . . . . . . . Mon-P-1-4-10 98Shi, Ying . . . . . . . . . . . . . . . . Tue-P-3-2-2 139Shiga, Yoshinori . . . . . . . . Mon-P-2-4-4 107Shih, Chin-Hong . . . . . . . . Thu-O-9-6-4 227Shimada, Kazuki . . . . . . . . Wed-O-7-2-2 177Shinozaki, Takahiro . . . . Wed-O-7-8-3 181Shiozawa, Fumiya . . . . . . Tue-O-5-8-2 134Shirley, Ben . . . . . . . . . . . . . Wed-P-6-4-5 197Shokouhi, Navid . . . . . . . . Tue-O-5-2-4 131Shon, Suwon . . . . . . . . . . . . Tue-O-3-4-4 118

Wed-P-6-2-14 194Shosted, Ryan . . . . . . . . . . Mon-O-1-6-1 84Shoul, Karim . . . . . . . . . . . . Mon-O-1-6-4 84Shriberg, Elizabeth E. . . . Tue-O-5-10-1 135Shyu, Frank . . . . . . . . . . . . . Wed-P-6-3-6 195Sidorov, Maxim . . . . . . . . . Wed-O-7-6-4 180Signorello, Rosario . . . . . Wed-O-6-1-1 169Silen, Hanna . . . . . . . . . . . . Tue-O-4-1-6 123Silnova, Anna . . . . . . . . . . . Tue-O-5-2-5 131

Tue-P-3-2-8 140Silva, Samuel . . . . . . . . . . . . Mon-P-2-2-1 101Silvera-Tawil, David . . . . Wed-O-8-8-3 186Sim, Khe Chai . . . . . . . . . . . Mon-O-2-10-5 92

Mon-P-2-3-9 106Tue-P-4-2-4 143

Simantiraki, Olympia . . . Wed-O-8-8-5 187Simko, Gabor . . . . . . . . . . . Tue-P-5-3-8 153

Thu-O-10-11-3 234Šimko, Juraj . . . . . . . . . . . . Tue-O-3-6-1 118

Tue-O-4-6-2 126Wed-P-7-2-9 199

Simões, Antônio R.M. . . . Mon-O-2-4-4 89Simon, Anne Catherine . Thu-P-9-3-12 239Simonnet, Edwin . . . . . . . . Wed-P-8-3-6 212

Simpson, Adrian P. . . . . . Tue-SS-5-11-5 114Tue-SS-5-11-7 114

Sinclair, Mark . . . . . . . . . . . Mon-S&T-2-A-5 109Singer, Elliot . . . . . . . . . . . . Tue-O-5-2-6 131

Tue-P-3-2-5 139Singh, Mittul . . . . . . . . . . . . Mon-P-1-4-10 98

Wed-O-8-10-3 188Sinha, Ashok Kumar . . . . Mon-S&T-2-A-6 109Sinha, Rohit . . . . . . . . . . . . . Mon-SS-1-8-5 77

Wed-P-6-2-12 193Sini, Aghilas . . . . . . . . . . . . Mon-O-2-6-1 90Siniscalchi, Sabato M. . . Wed-P-6-1-5 189

Thu-P-9-1-4 235Siohan, Olivier . . . . . . . . . . Mon-O-2-10-5 92

Mon-P-2-3-2 104Thu-O-9-1-3 223

Sitaram, Sunayana . . . . . . Mon-SS-1-11-5 78Siu, Man-Hung . . . . . . . . . . Mon-O-1-1-1 80

Mon-O-1-1-4 81Sivaraman, Ganesh . . . . . Tue-O-3-2-2 116

Tue-O-4-8-2 127Sjons, Johan . . . . . . . . . . . . Tue-P-5-1-13 148

Tue-P-5-1-15 149Skarnitzl, Radek . . . . . . . . Wed-P-7-2-6 199

Wed-P-8-1-10 208Skerry-Ryan, R.J. . . . . . . . . Thu-P-9-4-11 242Skordilis, Zisis . . . . . . . . . . Mon-P-2-2-5 102Skrelin, Pavel . . . . . . . . . . . Wed-SS-7-1-4 163Sloetjes, Han . . . . . . . . . . . . Wed-SS-6-11-4 162Smaïli, Kamel . . . . . . . . . . . Thu-O-10-4-1 231Šmídl, Luboš . . . . . . . . . . . . Wed-P-6-3-13 196Smit, Peter . . . . . . . . . . . . . . Tue-S&T-3-B-6 159

Wed-O-7-8-4 181Thu-O-10-4-5 232

Smith, Daniel . . . . . . . . . . . Wed-O-8-8-3 186Smith, Noah A. . . . . . . . . . . Tue-O-3-1-4 115Smith, Rachel . . . . . . . . . . . Tue-O-5-6-4 133Smolander, A.-R. . . . . . . . . Wed-S&T-6-A-6 217Sneddon, Alex . . . . . . . . . . Wed-O-8-8-3 186Snyder, David . . . . . . . . . . . Tue-O-3-4-1 117

Thu-O-9-4-4 226So, Clifford . . . . . . . . . . . . . . Wed-P-7-3-1 200Socolof, Michaela . . . . . . . Mon-P-1-2-9 96

Thu-P-9-3-2 237Soderstrom, Melanie . . . Wed-SS-6-11-3 162

Wed-SS-6-11-4 162Thu-SS-9-10-1 219Thu-SS-9-10-4 219

Sohel, Ferdous . . . . . . . . . . Mon-P-1-2-4 95Solera-Ureña, Rubén . . . . Tue-SS-5-11-8 114Solewicz, Yosef A. . . . . . . Wed-P-6-2-10 193Soltau, Hagen . . . . . . . . . . . Thu-O-10-1-6 230Somandepalli, Krishna . Mon-P-2-2-2 101Sonderegger, Morgan . . . Mon-P-1-2-9 96

Thu-P-9-3-2 237Song, Inchul . . . . . . . . . . . . Wed-O-6-10-6 175Song, Yan . . . . . . . . . . . . . . . Wed-O-7-10-2 182Song, Zhanmei . . . . . . . . . . Wed-P-7-3-11 202Soni, Meet H. . . . . . . . . . . . . Mon-SS-1-8-3 76

Wed-P-7-3-16 203Sonowal, Sukanya . . . . . . Mon-P-1-2-6 95Soong, Frank K. . . . . . . . . . Tue-P-3-1-7 137

Tue-P-5-1-5 147Wed-O-7-10-5 182

Sorensen, Tanner . . . . . . . Mon-O-1-10-4 86Mon-P-2-2-5 102Tue-O-3-2-6 117

Sorin, Alexander . . . . . . . . Tue-O-5-4-4 132Soto, Victor . . . . . . . . . . . . . Mon-SS-1-11-9 79Spálenka, K. . . . . . . . . . . . . . Wed-P-7-4-4 204Spechbach, Hervé . . . . . . . Wed-S&T-6-B-2 217Specia, Lucia . . . . . . . . . . . . Wed-O-8-10-2 187Sperber, Matthias . . . . . . . Mon-P-1-4-2 96Spille, Constantin . . . . . . . Tue-O-4-2-5 124

Wed-P-6-4-7 197Sproat, Richard . . . . . . . . . Mon-P-2-4-1 106

Wed-SS-6-2-6 161Thu-P-9-4-13 242

Sreeram, Victor . . . . . . . . . Tue-P-5-3-4 152Sridharan, Sridha . . . . . . . Tue-P-3-2-10 140Srinivasamurthy, Ajay . . Wed-O-6-10-5 175Sriskandaraja, Kaavya . . Wed-O-8-1-3 183Stafylakis, Themos . . . . . Thu-O-9-8-1 228Stanton, Daisy . . . . . . . . . . Thu-P-9-4-11 242Starkhammar, Josefin . . Wed-P-7-3-3 201Stasak, Brian . . . . . . . . . . . . Tue-SS-3-11-1 110Stehwien, Sabrina . . . . . . . Wed-O-6-6-1 171

Steidl, Stefan . . . . . . . . . . . . Thu-SS-9-10-1 219Thu-SS-10-10-7 223

Steiner, Ingmar . . . . . . . . . Mon-O-1-10-3 85Tue-SS-4-11-5 112Thu-O-10-8-6 234

Steiner, Peter . . . . . . . . . . . Mon-P-1-1-3 93Stemmer, Georg . . . . . . . . Tue-S&T-3-A-5 158Stengel-Eskin, Elias . . . . . Thu-P-9-3-2 237Stepanov, Evgeny A. . . . . Wed-O-7-6-3 179Stern, Richard M. . . . . . . . Tue-SS-3-11-2 111

Thu-P-9-1-9 236Stolcke, Andreas . . . . . . . . Mon-O-1-1-6 81

Tue-O-5-8-5 135Wed-P-6-1-3 189

Stone, Maureen . . . . . . . . . Wed-O-6-1-4 170Stone, Simon . . . . . . . . . . . . Mon-P-1-1-3 93

Tue-O-5-10-5 136Strasly, Irene . . . . . . . . . . . . Wed-S&T-6-B-2 217Strassel, Stephanie . . . . . Wed-O-8-1-6 183Strik, Helmer . . . . . . . . . . . Wed-O-8-8-2 186Strom, Nikko . . . . . . . . . . . . Thu-O-9-4-4 226Strömbergsson, Sofia . . . Wed-SS-7-11-5 167Stüker, Sebastian . . . . . . . Mon-P-1-4-2 96

Tue-SS-4-11-4 112Sturim, Douglas . . . . . . . . Tue-O-5-2-2 130Šturm, Pavel . . . . . . . . . . . . Wed-P-7-2-6 199Stylianou, Yannis . . . . . . . Tue-P-5-4-15 157

Thu-P-9-1-1 235Su, Pei-Hao . . . . . . . . . . . . . . Tue-P-4-3-13 145Suendermann-Oeft, D. . . Mon-SS-1-11-3 78

Tue-P-4-3-11 145Wed-O-7-6-5 180

Wed-O-7-10-5 182Sugai, Kosuke . . . . . . . . . . . Wed-P-7-2-1 198Sun, Lei . . . . . . . . . . . . . . . . . Mon-O-2-10-4 92Sun, Lifa . . . . . . . . . . . . . . . . Wed-P-8-4-10 216Sun, Ming . . . . . . . . . . . . . . . Thu-O-9-4-4 226Sun, Sining . . . . . . . . . . . . . . Tue-P-5-3-5 152Sun, Wen . . . . . . . . . . . . . . . . Tue-P-5-2-12 151Suni, Antti . . . . . . . . . . . . . . Tue-O-4-6-2 126Sur, Mriganka . . . . . . . . . . . Wed-O-7-1-3 176Suthokumar, Gajan . . . . . Wed-O-8-1-3 183Sutton, Brad . . . . . . . . . . . . Mon-O-1-6-1 84Suzuki, Kyori . . . . . . . . . . . Mon-S&T-2-B-6 110Suzuki, Masayuki . . . . . . . Tue-P-4-1-7 141

Thu-O-9-4-5 226Thu-O-10-1-4 230

Švec, Jan . . . . . . . . . . . . . . . . Wed-P-6-3-13 196Svensson Lundmark, M. Wed-P-8-1-13 208Swart, Albert . . . . . . . . . . . . Tue-O-5-2-5 131

Tue-P-3-1-1 136Swerts, Marc . . . . . . . . . . . . Mon-O-2-6-4 90Szabó, Lili . . . . . . . . . . . . . . . Wed-SS-6-2-2 160Szaszák, György . . . . . . . . Mon-P-1-4-11 98

Wed-O-6-10-5 175Székely, Éva . . . . . . . . . . . . . Mon-P-2-4-11 108

T
Tabain, Marija . . . . . . . . . . Wed-P-7-2-7 199Tachioka, Yuuki . . . . . . . . Wed-O-7-2-4 177Tajima, Keiichi . . . . . . . . . . Mon-O-2-6-3 90Tak, Rishabh . . . . . . . . . . . . Wed-P-7-3-16 203Takaki, Shinji . . . . . . . . . . . Tue-O-3-8-1 119

Tue-O-4-1-3 122Wed-P-8-4-6 215

Thu-P-9-4-14 243Takamichi, Shinnosuke Tue-O-4-10-1 128

Thu-P-9-4-2 240Takanashi, Katsuya . . . . . Tue-P-4-3-14 146Takeda, Kazuya . . . . . . . . . Tue-O-4-1-1 122Takeda, Ryu . . . . . . . . . . . . Tue-P-4-2-2 142Takemoto, Hironori . . . . Thu-SS-9-11-1 220Takiguchi, Izumi . . . . . . . Mon-O-2-6-5 91Takiguchi, Tetsuya . . . . . Wed-P-8-4-3 214

Wed-P-8-4-8 215Takimoto, Eri . . . . . . . . . . . Tue-O-4-2-2 123Tamamori, Akira . . . . . . . Tue-O-4-1-1 122

Tue-O-4-1-5 123Tan, Zheng-Hua . . . . . . . . . Tue-P-3-1-4 137

Tue-P-5-4-12 157Wed-O-8-1-4 183Wed-P-6-4-6 197

Tan, Zhili . . . . . . . . . . . . . . . . Tue-P-3-2-6 139Tanaka, Hiroki . . . . . . . . . . Wed-O-7-1-4 176Tanaka, Kazuyo . . . . . . . . Wed-P-6-3-2 194Tanaka, Kei . . . . . . . . . . . . . Wed-P-8-4-5 215Tanaka, Kou . . . . . . . . . . . . Tue-O-3-8-3 120

Tang, Hao . . . . . . . . . . . . . . . Thu-O-9-1-1 223Tang, Keyi . . . . . . . . . . . . . . Wed-O-6-1-4 170Tang, Qingming . . . . . . . . Tue-P-4-2-6 143Tang, Yan . . . . . . . . . . . . . . . Wed-P-6-4-5 197Tang, Zhiyuan . . . . . . . . . . Tue-P-3-2-2 139Tao, Fei . . . . . . . . . . . . . . . . . Tue-P-5-3-14 154Tao, Jianhua . . . . . . . . . . . . Mon-P-2-4-7 107

Wed-P-6-1-9 190Wed-P-7-3-5 201

Tarján, Balázs . . . . . . . . . . Wed-SS-6-2-2 160Tasaki, Hiroto . . . . . . . . . . Wed-P-6-3-5 195Tatman, Rachael . . . . . . . . Tue-SS-5-11-9 115Tavarez, David . . . . . . . . . . Thu-SS-10-10-4 222Teixeira, António . . . . . . . Mon-P-2-2-1 101ten Bosch, L. . . . . . . . . . . . . Tue-O-4-2-3 123Teraoka, Takehiro . . . . . . Tue-P-4-3-5 144te Rietmolen, Noémie . . Wed-O-7-1-5 176Ternström, Sten . . . . . . . . Thu-SS-9-11-5 221Thangthai, Kwanchiva . . Thu-O-9-8-2 228Thomas, Samuel . . . . . . . . Mon-O-1-1-5 81

Thu-O-10-1-4 230Tidelius, Henrik . . . . . . . . Wed-SS-6-11-2 161Tihelka, Daniel . . . . . . . . . . Wed-P-7-3-4 201

Wed-S&T-6-A-3 216Wed-S&T-6-A-4 216

Tiwari, Gautam . . . . . . . . . Tue-P-5-3-15 154Tjaden, Kris . . . . . . . . . . . . . Tue-O-4-2-1 123Tjalve, Michael . . . . . . . . . . Tue-O-5-8-5 135

Wed-P-6-1-3 189Tkachman, Oksana . . . . . Wed-SS-6-11-1 161Toda, Tomoki . . . . . . . . . . . Mon-P-1-1-4 93

Tue-O-3-8-3 120Tue-O-4-1-1 122Tue-O-4-1-5 123Tue-O-5-4-1 131

Tue-P-5-4-10 156Todisco, Massimiliano . . Mon-SS-1-8-1 76Töger, Johannes . . . . . . . . Tue-O-3-2-6 117Togneri, Roberto . . . . . . . Mon-P-1-2-4 95

Tue-P-5-3-4 152Tokuda, Keiichi . . . . . . . . . Mon-O-1-10-2 85Tomashenko, Natalia . . . Wed-P-6-3-3 194

Thu-O-9-4-3 226Tong, Audrey . . . . . . . . . . . Tue-O-5-2-6 131Tong, Rong . . . . . . . . . . . . . Wed-SS-7-11-1 166Tong, Sibo . . . . . . . . . . . . . . Mon-P-2-3-3 104Toribio, Almeida J. . . . . . Mon-SS-1-11-7 78Torres-Carrasquillo, PA Tue-O-5-2-2 130Toshniwal, Shubham . . . Thu-O-9-1-1 223Tóth, László . . . . . . . . . . . . Tue-P-4-1-8 142

Tue-P-4-1-9 142Thu-SS-10-10-5 222

Thu-O-9-8-5 229Toutios, Asterios . . . . . . . Mon-O-1-10-4 86

Mon-P-2-2-2 101Mon-P-2-2-5 102Tue-O-3-2-6 117

Townsend, Greg . . . . . . . . Thu-P-9-4-12 242Toyama, Shohei . . . . . . . . . Mon-P-1-4-8 97

Tue-O-5-8-2 134Tran, Dung T. . . . . . . . . . . . Tue-P-4-1-3 141

Thu-P-9-1-5 236Tran, Huy Dat . . . . . . . . . . . Thu-SS-9-10-9 220

Thu-O-10-11-2 234Trancoso, Isabel . . . . . . . . Tue-SS-5-11-8 114

Wed-P-8-4-11 216Travadi, Ruchir . . . . . . . . . Tue-P-3-2-9 140

Wed-SS-6-2-1 160Thu-P-9-1-3 235

Trigeorgis, George . . . . . . Thu-SS-9-10-1 219Trmal, Jan . . . . . . . . . . . . . . Wed-O-7-4-6 179

Wed-P-6-3-13 196Thu-O-9-4-2 226

Trnka, Marian . . . . . . . . . . . Wed-O-6-6-2 171Tronstad, Tron V. . . . . . . Tue-P-5-4-4 155Trouvain, Jürgen . . . . . . . Wed-SS-8-11-4 168Truong, Khiet P. . . . . . . . . Tue-O-3-10-6 122Tsao, Yu . . . . . . . . . . . . . . . . Mon-P-1-1-7 94

Tue-P-5-4-1 155Wed-P-8-4-1 214Thu-O-9-1-6 224

Tschiatschek, S. . . . . . . . . Tue-O-5-1-5 130Tschöpe, Constanze . . . . Wed-S&T-6-B-3 217Tseng, Shao-Yen . . . . . . . . Wed-P-8-2-10 211Tseng, Xian-Hong . . . . . . . Wed-O-6-8-2 173Tsiaras, Vassilios . . . . . . . Wed-O-7-4-2 178Tsiartas, Andreas . . . . . . . Tue-O-5-10-1 135

Tsiknakis, Manolis . . . . . . Wed-O-8-8-5 187Tsourakis, Nikos . . . . . . . . Wed-S&T-6-B-2 217Tsuchiya, Masatoshi . . . . Wed-P-6-3-7 195Tsuji, Sho . . . . . . . . . . . . . . . Tue-S&T-3-A-6 158

Wed-SS-6-11-5 162Wed-SS-6-11-6 162

Tsujimura, Shoko . . . . . . . Wed-P-6-1-6 190Tsunoo, Emiru . . . . . . . . . . Wed-P-6-3-10 196Tu, Jung-Yueh . . . . . . . . . . Wed-P-8-1-1 206Tu, Ming . . . . . . . . . . . . . . . . Tue-P-5-2-9 151Tu, Yan-Hui . . . . . . . . . . . . . Mon-O-2-10-4 92Tür, Gokhan . . . . . . . . . . . . Wed-O-7-4-1 178Turco, Giuseppina . . . . . . Mon-O-1-6-4 84Türker, Bekir Berker . . . . Tue-SS-3-11-3 111Turnbull, Rory . . . . . . . . . . Wed-P-7-2-13 200Tüske, Zoltán . . . . . . . . . . . Tue-P-4-2-5 143Tykalová, Tereza . . . . . . . Tue-P-5-2-8 150

Thu-P-9-3-5 238Tzimiropoulos, G. . . . . . . Thu-O-9-8-1 228Tzirakis, Panagiotis . . . . Thu-SS-9-10-1 219

U
Uchida, Hidetsugu . . . . . . Tue-O-3-2-5 117

Tue-O-4-10-3 128Uenohara, Shingo . . . . . . . Wed-O-7-2-4 177Uezu, Yasufumi . . . . . . . . Wed-O-6-1-6 170Ultes, Stefan . . . . . . . . . . . . Tue-P-4-3-13 145Umbert, Martí . . . . . . . . . . . Wed-O-6-6-5 172Umesh, S. . . . . . . . . . . . . . . . Mon-P-2-3-8 105

Wed-SS-7-1-9 164Wed-SS-7-1-10 164

Wed-O-8-8-4 186Unoki, Masashi . . . . . . . . . Wed-O-6-4-4 171Uramoto, Takanobu . . . . Wed-O-7-2-4 177Uther, Maria . . . . . . . . . . . . Wed-S&T-6-A-6 217

V
Vachhani, Bhavik . . . . . . . Mon-S&T-2-A-6 109

Tue-P-5-2-10 151Vainio, Martti . . . . . . . . . . . Tue-O-4-6-2 126Vair, Claudio . . . . . . . . . . . . Tue-O-5-2-3 130Vaizman, Yonatan . . . . . . Wed-O-7-2-6 178Valentini Botinhao, C. . . Mon-P-2-1-10 100

Tue-O-5-4-6 132Wed-P-6-4-2 197

VanDam, Mark . . . . . . . . . . Mon-S&T-2-A-4 108Wed-SS-6-11-4 162

van den Heuvel, Henk . . Mon-SS-1-11-1 77Mon-SS-1-11-2 77

van der Vloed, David . . . Wed-P-6-2-10 193van der Westhuizen, E. . Mon-SS-1-11-8 78Van de Velde, Hans . . . . . Mon-SS-1-11-1 77

Thu-S&T-9-A-5 243van Esch, Daan . . . . . . . . . Thu-P-9-4-13 242Van hamme, Hugo . . . . . . Tue-P-5-3-10 153van Heerden, Charl . . . . . Wed-SS-7-1-14 165Van Leeuwen, David . . . . Mon-SS-1-11-1 77

Mon-SS-1-11-2 77van Maastricht, Lieke . . . Mon-O-2-6-4 90van Niekerk, Daniel . . . . . Wed-SS-7-1-14 165van Santen, Jan . . . . . . . . . Mon-O-1-10-6 86Variani, Ehsan . . . . . . . . . . Mon-O-2-10-5 92

Tue-P-4-2-3 142Vásquez-Correa, J.C. . . . Mon-O-2-2-6 88

Tue-P-5-2-7 150Vasudevan, Arvind . . . . . Thu-SS-9-11-3 221Vattam, Swaroop . . . . . . . Tue-O-5-2-2 130Vaz, Colin . . . . . . . . . . . . . . . Wed-SS-6-2-1 160Venneri, Annalena . . . . . . Wed-P-7-4-7 205Verma, Sakshi . . . . . . . . . . Wed-O-7-1-3 176Veselý, Karel . . . . . . . . . . . . Mon-P-2-3-4 105

Wed-SS-6-2-5 161Thu-O-10-1-2 229

Vestman, Ville . . . . . . . . . . Tue-P-3-1-8 138Vetchinnikova, S. . . . . . . . Mon-S&T-2-A-2 108Vialatte, F.-B. . . . . . . . . . . . . Mon-S&T-2-B-1 109Viggen, Erlend Magnus . Tue-P-5-4-4 155Vignesh, Rupak . . . . . . . . . Thu-O-10-11-4 234Viitanen, Vertti . . . . . . . . . Wed-S&T-6-A-6 217Vijayan, Karthika . . . . . . . Mon-O-2-4-3 89Vikram, C.M. . . . . . . . . . . . . Tue-P-5-2-5 150Villalba, Jesús . . . . . . . . . . . Tue-O-3-4-2 117

Wed-P-6-2-6 192Viñals, Ignacio . . . . . . . . . . Wed-P-6-2-6 192Virpioja, Sami . . . . . . . . . . . Wed-O-7-8-4 181

Vít, Jakub . . . . . . . . . . . . . . . Tue-O-4-1-6 123Wed-S&T-6-A-3 216

Vitaladevuni, Shiv . . . . . . Thu-O-9-4-4 226Vlasenko, Bogdan . . . . . . Wed-P-8-2-5 210Vogel, Carl . . . . . . . . . . . . . . Tue-P-4-3-9 145

Wed-O-6-8-6 174Wed-P-8-2-9 211

Vogel, Irene . . . . . . . . . . . . . Tue-O-5-6-3 133Voisin, Sylvie . . . . . . . . . . . Wed-SS-7-1-6 164Volín, Jan . . . . . . . . . . . . . . . Wed-P-7-2-6 199

Thu-P-9-3-5 238Voße, Jana . . . . . . . . . . . . . . Tue-O-3-8-6 120Vu, Ngoc Thang . . . . . . . . . Tue-O-4-8-6 127

Wed-O-6-6-1 171Vukotic, Vedran . . . . . . . . Wed-O-7-4-4 178Vuppala, Anil Kumar . . . Mon-SS-2-8-6 80

Wed-O-8-1-1 182Vyas, Manan . . . . . . . . . . . . Wed-S&T-6-B-5 218

W
Wagner, Johannes . . . . . . Thu-SS-9-10-7 220Wagner, Michael . . . . . . . . Mon-P-1-2-9 96

Wed-P-8-1-5 207Wagner, Petra . . . . . . . . . . . Tue-O-3-8-6 120

Wed-P-8-1-11 208Waibel, Alex . . . . . . . . . . . . Mon-P-1-4-2 96

Tue-SS-4-11-4 112Wed-O-8-4-5 184

Walker, Kevin . . . . . . . . . . . Wed-O-8-1-6 183Walker, Marilyn . . . . . . . . . Wed-P-8-3-8 213Walker, Traci . . . . . . . . . . . Wed-P-7-4-7 205Walsh, Michael . . . . . . . . . . Tue-SS-5-11-6 114Walter, Oliver . . . . . . . . . . . Wed-SS-7-1-7 164Wan, Vincent . . . . . . . . . . . Tue-O-4-1-6 123Wand, Michael . . . . . . . . . . Thu-O-9-8-3 228Wang, Chengxia . . . . . . . . Wed-SS-8-11-8 168Wang, DeLiang . . . . . . . . . . Tue-P-5-4-14 157Wang, Dong . . . . . . . . . . . . . Mon-SS-2-8-3 79

Tue-P-3-2-2 139Wang, Dongmei . . . . . . . . . Mon-O-1-4-4 83Wang, Hsin-Min . . . . . . . . . Mon-P-1-1-7 94

Tue-P-5-4-1 155Wed-P-6-3-4 195Wed-P-8-4-1 214Thu-O-9-1-6 224

Wang, Jun . . . . . . . . . . . . . . . Mon-O-1-10-6 86Wed-P-6-1-7 190

Wang, Lan . . . . . . . . . . . . . . . Wed-O-6-10-3 174Wang, Lei . . . . . . . . . . . . . . . Mon-P-2-1-1 98Wang, Lixin . . . . . . . . . . . . . Tue-P-5-1-9 147Wang, Qiongqiong . . . . . . Thu-O-10-2-4 231Wang, Shi-yu . . . . . . . . . . . . Mon-P-2-1-2 98Wang, Shuai . . . . . . . . . . . . . Tue-P-3-1-5 137Wang, Syu-Siang . . . . . . . . Mon-P-1-1-7 94

Tue-P-5-4-1 155Wang, Tianzhou . . . . . . . . Wed-P-6-2-11 193Wang, Weiran . . . . . . . . . . . Tue-P-4-2-6 143Wang, Wenwu . . . . . . . . . . . Wed-P-7-3-7 201

Wed-P-7-3-10 202Wang, Xi . . . . . . . . . . . . . . . . Wed-O-6-10-1 174Wang, Xianliang . . . . . . . . Mon-SS-1-8-7 77Wang, Xianyun . . . . . . . . . . Tue-P-5-3-7 153

Thu-O-9-6-2 227Wang, Xiao . . . . . . . . . . . . . . Wed-O-7-1-2 175Wang, Xihao . . . . . . . . . . . . Tue-O-5-8-3 134Wang, Xin . . . . . . . . . . . . . . . Tue-O-3-8-1 119

Thu-P-9-4-1 240Wang, Xinhao . . . . . . . . . . . Tue-O-5-8-1 134

Wed-O-7-10-5 182Wed-P-6-1-4 189

Wang, Xinyue . . . . . . . . . . . Thu-P-9-4-4 241Wang, Xuan . . . . . . . . . . . . . Wed-P-7-3-7 201Wang, Y. . . . . . . . . . . . . . . . . Wed-P-6-1-8 190Wang, Yannan . . . . . . . . . . Tue-O-4-4-1 124Wang, Yiming . . . . . . . . . . . Tue-P-4-2-1 142

Thu-O-9-4-2 226Wang, Yu-Hsuan . . . . . . . . Thu-O-10-11-5 235Wang, Yun . . . . . . . . . . . . . . Wed-P-7-3-13 203Wang, Yuxuan . . . . . . . . . . Thu-P-9-4-11 242Wang, Zhibin . . . . . . . . . . . . Wed-P-6-2-11 193Wankerl, Sebastian . . . . . Wed-P-7-4-10 205Wanner, Leo . . . . . . . . . . . . Mon-S&T-2-A-1 108

Wed-P-8-1-6 207Wed-S&T-6-A-2 216

Ward, Lauren . . . . . . . . . . . Wed-O-8-8-3 186Wed-P-6-4-5 197

Ward, Nigel G. . . . . . . . . . . Tue-O-5-10-1 135Wardle, Margaret . . . . . . . Wed-P-7-4-6 205Warlaumont, Anne S. . . . Mon-S&T-2-A-4 108

Wed-SS-6-11-3 162Wed-SS-6-11-4 162Thu-SS-9-10-1 219Thu-SS-9-10-4 219

Watanabe, Hayato . . . . . . Mon-S&T-2-B-6 110Watanabe, Hiroki . . . . . . . Wed-O-7-1-4 176Watanabe, Shinji . . . . . . . . Tue-O-3-1-3 115

Wed-O-7-2-4 177Wed-O-7-8-3 181

Watson, C.I. . . . . . . . . . . . . . Wed-SS-6-2-3 160Watt, Dominic . . . . . . . . . . Mon-P-2-1-7 99Watts, Oliver . . . . . . . . . . . . Mon-P-2-4-10 108

Tue-O-4-1-4 122Wed-SS-7-1-16 165

Weber, Andrea . . . . . . . . . . Tue-O-5-6-1 133Weber, Philip . . . . . . . . . . . Mon-O-2-4-1 89Websdale, Danny . . . . . . . Tue-P-5-4-11 157Weiner, Jochen . . . . . . . . . Wed-P-7-4-1 203Weintraub, Mitchel . . . . . Mon-O-2-10-5 92Weirich, Melanie . . . . . . . . Tue-SS-5-11-7 114Weiss, Benjamin . . . . . . . . Tue-SS-5-11-3 113Weiss, Ron J. . . . . . . . . . . . . Mon-O-2-10-5 92

Wed-O-8-4-1 184Thu-P-9-4-11 242

Wen, Tsung-Hsien . . . . . . Tue-P-4-3-13 145Wen, Zhengqi . . . . . . . . . . . Mon-P-2-4-7 107

Wed-P-6-1-9 190Wendemuth, Andreas . . Wed-O-6-8-1 173Wendler, Christoph . . . . . Wed-SS-7-1-1 163Weninger, Felix . . . . . . . . . Wed-P-7-4-8 205Werner, Tina . . . . . . . . . . . . Wed-S&T-6-B-5 218Wester, Mirjam . . . . . . . . . Thu-P-9-4-9 242Wieling, Martijn . . . . . . . . . Tue-O-3-2-2 116Wiener, Seth . . . . . . . . . . . . Tue-P-5-1-7 147Wiesner, Matthew . . . . . . Wed-O-7-4-6 179

Thu-O-9-4-2 226Wijenayake, Chamith . . . Wed-O-8-1-3 183Williams, Ian . . . . . . . . . . . . Mon-P-1-4-1 96Williams, Shanna . . . . . . . Wed-O-8-10-5 188Williamson, Becci . . . . . . . Thu-P-9-4-12 242Williamson, James R. . . . Wed-P-8-1-7 207Wilson, Ian . . . . . . . . . . . . . . Mon-S&T-2-B-6 110Wilson, Kevin W. . . . . . . . . Mon-O-2-10-5 92Winarsky, David . . . . . . . . Thu-P-9-4-12 242Winata, Genta Indra . . . . Wed-S&T-6-B-4 217Winkler, Jana . . . . . . . . . . . Mon-P-2-1-11 100Wirén, Mats . . . . . . . . . . . . . Wed-SS-7-11-3 166Wisler, Alan . . . . . . . . . . . . . Tue-P-5-2-1 149Wisniewksi, Guillaume . Thu-O-9-2-5 225Witkowski, Marcin . . . . . . Mon-SS-1-8-6 77Włodarczak, Marcin . . . . Mon-P-2-2-11 103

Tue-P-4-3-2 143Wolf, Arthur . . . . . . . . . . . . Mon-O-1-4-2 83Wolff, Matthias . . . . . . . . . Wed-S&T-6-B-3 217Wong, Janice Wing-Sze . Wed-P-8-1-1 206Wong, Jeremy H.M. . . . . . Mon-O-1-1-2 80Woo, Jonghye . . . . . . . . . . . Wed-O-6-1-4 170Wood, Sean U.N. . . . . . . . . Tue-S&T-3-B-5 159

Wed-O-8-6-4 185Wörtwein, Torsten . . . . . . Wed-P-8-2-3 209Wright, Jonathan . . . . . . . Wed-O-8-1-6 183Wright, Richard A. . . . . . . Tue-O-5-10-2 135Wu, Bo . . . . . . . . . . . . . . . . . . Thu-P-9-1-4 235Wu, Chia-Lung . . . . . . . . . . Mon-P-1-1-7 94Wu, Chunyang . . . . . . . . . . Tue-P-4-1-6 141Wu, Dan . . . . . . . . . . . . . . . . . Mon-SS-2-8-2 79Wu, Ji . . . . . . . . . . . . . . . . . . . Thu-O-9-4-1 225Wu, Jie . . . . . . . . . . . . . . . . . . Wed-P-8-4-4 214Wu, Tsung-Chen . . . . . . . . Mon-O-1-4-6 84Wu, Yaru . . . . . . . . . . . . . . . . Thu-O-10-8-3 233Wu, Ya-Tse . . . . . . . . . . . . . . Wed-P-8-2-4 210Wu, Yi-Chiao . . . . . . . . . . . . Tue-P-5-4-1 155

Wed-P-8-4-1 214Wu, Yonghui . . . . . . . . . . . . Wed-O-8-4-1 184

Thu-P-9-4-11 242Wu, Zhiyong . . . . . . . . . . . . Mon-P-2-4-6 107

Tue-O-4-8-1 126Wed-P-8-4-10 216

Wu, Zhizheng . . . . . . . . . . . Thu-P-9-4-12 242Wuth, Jorge . . . . . . . . . . . . . Tue-SS-3-11-2 111

XXia, Xianjun . . . . . . . . . . . . . Mon-P-1-2-4 95Xiang, Bing . . . . . . . . . . . . . . Wed-P-8-3-7 213Xiang, Xu . . . . . . . . . . . . . . . . Mon-P-1-4-6 97Xiao, Xiong . . . . . . . . . . . . . . Tue-P-5-3-5 152Xiao, Yanhong . . . . . . . . . . Mon-SS-1-8-7 77Xiao, Ying . . . . . . . . . . . . . . . Thu-P-9-4-11 242Xiao, Yujia . . . . . . . . . . . . . . Tue-P-5-1-5 147Xie, Lei . . . . . . . . . . . . . . . . . . Mon-P-1-4-5 97

Wed-P-8-4-4 214Xie, Xurong . . . . . . . . . . . . . Wed-O-6-10-3 174Xie, Yanlu . . . . . . . . . . . . . . . Mon-P-2-2-6 102

Wed-P-8-1-3 206Xie, Zhifeng . . . . . . . . . . . . . Mon-SS-2-8-5 80Xu, Anqi . . . . . . . . . . . . . . . . . Wed-P-8-1-12 208Xu, Bo . . . . . . . . . . . . . . . . . . . Mon-P-2-3-1 104Xu, Chenglin . . . . . . . . . . . . Tue-P-5-3-5 152Xu, Hainan . . . . . . . . . . . . . . Tue-P-4-2-1 142

Thu-O-9-4-2 226Xu, Li . . . . . . . . . . . . . . . . . . . . Tue-O-5-6-5 133Xu, Mingxing . . . . . . . . . . . . Tue-O-4-8-1 126Xu, Mingyu . . . . . . . . . . . . . . Mon-O-1-2-5 82Xu, Ning . . . . . . . . . . . . . . . . . Mon-P-1-2-3 95Xu, Shuang . . . . . . . . . . . . . . Mon-P-2-3-1 104Xu, Xiangmin . . . . . . . . . . . Mon-SS-2-8-5 80Xu, Yi . . . . . . . . . . . . . . . . . . . . Wed-SS-8-11-8 168Xu, Yong . . . . . . . . . . . . . . . . Wed-P-7-3-10 202Xu, Yong . . . . . . . . . . . . . . . . Wed-P-6-4-4 197Xue, Jian . . . . . . . . . . . . . . . . Wed-P-6-2-11 193Xue, Wei . . . . . . . . . . . . . . . . . Mon-P-1-2-1 94

YYadav, Shivani . . . . . . . . . . Thu-SS-10-10-1 222Yamagishi, Junichi . . . . . Mon-SS-1-8-1 76

Mon-P-2-1-10 100Mon-P-2-4-10 108

Tue-O-3-8-1 119Tue-O-4-1-3 122Tue-O-5-4-3 132Wed-P-6-4-2 197Wed-P-8-4-6 215Thu-P-9-4-1 240

Thu-P-9-4-14 243Yamaguchi, Tetsutaro . . Thu-SS-9-11-1 220Yamamoto, Hitoshi . . . . . Thu-O-10-2-3 231Yamamoto, Katsuhiko . . Wed-P-6-4-3 197Yamamoto, Kazumasa . Wed-P-6-1-6 190Yamamoto, Kodai . . . . . . Mon-P-2-1-9 100Yamauchi, Yutaka . . . . . . Tue-O-5-8-2 134Yan, Bi-Cheng . . . . . . . . . . . Thu-O-9-6-4 227Yan, Yonghong . . . . . . . . . Tue-O-4-4-6 125

Tue-P-4-1-1 140Tue-P-5-3-2 152Thu-P-9-1-4 235Thu-P-9-1-6 236

Yang, Bing . . . . . . . . . . . . . . Tue-P-5-3-11 153Yang, IL-ho . . . . . . . . . . . . . . Tue-P-3-1-12 138Yang, Jing . . . . . . . . . . . . . . . Tue-O-5-6-5 133Yang, Jun . . . . . . . . . . . . . . . Wed-P-6-4-4 197Yang, Ming-Han . . . . . . . . . Thu-O-9-1-6 224Yang, Xuesong . . . . . . . . . . Tue-P-5-4-13 157Yang, Yang . . . . . . . . . . . . . . Wed-S&T-6-B-4 217Yang, Yike . . . . . . . . . . . . . . Tue-P-5-1-6 147Yang, Zongheng . . . . . . . . Thu-P-9-4-11 242Yanushevskaya, Irena . . Tue-O-3-6-2 118

Tue-O-4-8-5 127Thu-P-9-3-8 238

Yegnanarayana, B. . . . . . . Tue-S&T-3-A-4 158Wed-O-6-4-1 170

Wed-P-7-2-11 200Yemez, Yücel . . . . . . . . . . . Tue-SS-3-11-3 111Yeung, Gary . . . . . . . . . . . . . Tue-P-3-1-10 138Yi, Hua . . . . . . . . . . . . . . . . . . Wed-P-7-3-11 202Yi, Jiangyan . . . . . . . . . . . . . Wed-P-6-1-9 190Yılmaz, Emre . . . . . . . . . . . Mon-SS-1-11-1 77

Mon-SS-1-11-2 77Wed-O-8-8-2 186

Yin, Jiao . . . . . . . . . . . . . . . . . Tue-P-5-2-13 151Yin, Ruiqing . . . . . . . . . . . . Thu-O-10-11-6 235Ying, Dongwen . . . . . . . . . Tue-P-5-3-2 152Ying, Jia . . . . . . . . . . . . . . . . . Wed-P-7-2-2 198Ylinen, Sari . . . . . . . . . . . . . Wed-S&T-6-A-6 217Yoma, Nestor Becerra . . Tue-SS-3-11-2 111Yoneyama, Kiyoko . . . . . . Mon-O-2-6-3 90


Yoon, Sung-hyun . . . . . . Tue-P-3-1-12 138
Yoon, Su-Youn . . . . . . Tue-O-5-8-3 134
    Wed-P-6-1-4 189
Yoshii, Kazuyoshi . . . . . . Wed-O-7-2-2 177
Yoshimura, Takenori . . . . . . Mon-O-1-10-2 85
Young, Steve . . . . . . Tue-P-4-3-13 145
Yu, Chengzhu . . . . . . Tue-O-5-2-4 131
Yu, Dong . . . . . . Mon-P-1-4-5 97
    Wed-O-7-2-3 177
Yu, Ha-jin . . . . . . Tue-P-3-1-12 138
Yu, Hong . . . . . . Tue-P-3-1-4 137
Yu, Kai . . . . . . Mon-P-1-4-6 97
    Mon-P-2-4-8 107
    Mon-P-2-4-9 107
    Tue-P-3-1-5 137
Yu, Mingzhi . . . . . . Tue-P-4-3-8 144
Yu, Seunghak . . . . . . Wed-O-7-6-2 179
Yu, Shi . . . . . . Mon-P-2-1-8 100
Yu, Xinguo . . . . . . Mon-O-1-2-5 82
Yuan, Xiaobing . . . . . . Tue-P-5-3-6 152
Yue, Junwei . . . . . . Tue-O-5-8-2 134
Yuen, Chun Wah . . . . . . Tue-P-5-1-6 147
Yun, Sungrack . . . . . . Wed-P-6-2-5 192
Yunusova, Yana . . . . . . Mon-P-2-2-14 104
    Tue-P-5-2-3 149

Z
Zafeiriou, Stefanos . . . . . . Thu-SS-9-10-1 219
Zahner, Katharina . . . . . . Tue-O-4-6-4 126
    Tue-O-5-6-1 133
Zahorian, Stephen A. . . . . . . Mon-O-1-4-1 83
    Wed-O-8-8-6 187
Zajíc, Zbynek . . . . . . Thu-O-9-2-1 224
Zañartu, Matías . . . . . . Tue-O-5-4-5 132
Zane, Emily . . . . . . Mon-O-1-2-2 82
Zappi, Victor . . . . . . Thu-SS-9-11-3 221
Zarrieß, Sina . . . . . . Tue-O-3-8-6 120
Zatvornitsky, A. . . . . . . Thu-O-9-4-3 226
Zee, Tim . . . . . . Mon-O-2-6-4 90
Zegers, Jeroen . . . . . . Tue-P-5-3-10 153
Zeghidour, Neil . . . . . . Wed-SS-7-11-6 167
Zelasko, Piotr . . . . . . Mon-SS-1-8-6 77
Zellers, Margaret . . . . . . Wed-O-6-6-3 172
Zenkel, Thomas . . . . . . Mon-P-1-4-2 96
Zequeira Jiménez, R. . . . . . . Wed-SS-8-11-3 168
Zeyer, Albert . . . . . . Tue-O-3-1-2 115
Zhang, Binbin . . . . . . Mon-P-1-4-5 97
Zhang, Boliang . . . . . . Wed-SS-6-2-1 160
Zhang, Chunlei . . . . . . Tue-O-5-2-4 131
    Tue-P-3-1-3 137
Zhang, Gaoyan . . . . . . Wed-O-7-1-6 176
Zhang, Hepeng . . . . . . Thu-P-9-4-12 242
Zhang, Hua . . . . . . Tue-P-5-2-12 151
Zhang, Hui . . . . . . Tue-P-5-4-2 155
Zhang, Jinsong . . . . . . Mon-P-2-2-6 102
    Wed-P-8-1-3 206
Zhang, Kaile . . . . . . Tue-P-5-1-14 148
Zhang, Pengyuan . . . . . . Thu-P-9-1-6 236
Zhang, Qi . . . . . . Mon-P-2-2-6 102
Zhang, Qian . . . . . . Wed-O-7-10-3 182
Zhang, Ruo . . . . . . Tue-O-3-10-2 121
Zhang, Shiliang . . . . . . Thu-O-10-1-3 229
Zhang, Wei . . . . . . Wed-P-8-1-3 206
Zhang, Weibin . . . . . . Mon-SS-2-8-5 80
Zhang, Xiaohui . . . . . . Tue-P-4-2-1 142
    Wed-O-7-8-2 180
    Thu-O-9-4-2 226
Zhang, Xueliang . . . . . . Tue-P-5-4-2 155
    Tue-P-5-4-14 157
Zhang, Yang . . . . . . Tue-P-5-4-13 157
    Wed-O-8-6-6 186
Zhang, Yanhui . . . . . . Wed-O-7-1-2 175
Zhang, Yuanyuan . . . . . . Mon-O-2-6-6 91
    Wed-P-7-2-12 200
Zhang, Yue . . . . . . Wed-P-7-4-8 205
    Thu-SS-9-10-1 219
Zhang, Yu . . . . . . Tue-O-5-6-5 133
Zhang, Yu . . . . . . Thu-P-9-1-6 236
Zhang, Yu . . . . . . Tue-O-3-1-3 115
    Tue-O-4-10-2 128
Zhang, Zixing . . . . . . Thu-P-9-3-15 240
Zhao, Bin . . . . . . Wed-O-7-1-6 176
Zhao, Faru . . . . . . Mon-SS-2-8-2 79
Zhao, Kai . . . . . . Wed-P-8-3-7 213
Zhao, Qingen . . . . . . Wed-P-6-2-11 193
Zhao, Rui . . . . . . Wed-O-6-10-1 174
Zhao, Tuo . . . . . . Wed-P-6-2-11 193
Zhao, Yuanyuan . . . . . . Mon-P-2-3-1 104
Zheng, Thomas Fang . . . . . . Mon-SS-2-8-3 79
Zheng, Yibin . . . . . . Mon-P-2-4-7 107
Zhong, Jinghua . . . . . . Tue-P-3-1-7 137
Zhou, Bowen . . . . . . Wed-P-8-3-7 213
Zhou, Shiyu . . . . . . Mon-P-2-3-1 104
Zhu, Manman . . . . . . Wed-P-7-3-11 202
Zhu, Shenghuo . . . . . . Wed-P-6-2-11 193
Zhu, Weiwu . . . . . . Wed-O-8-10-6 188
Zhu, Xuan . . . . . . Mon-SS-1-8-7 77
Zhu, Yinghua . . . . . . Mon-P-2-2-5 102
Zhuang, Xiaodan . . . . . . Wed-SS-7-1-8 164
Zibrek, Katja . . . . . . Mon-O-1-10-1 85
Zihlmann, Urban . . . . . . Thu-P-9-3-1 237
Zimmerer, Frank . . . . . . Wed-SS-8-11-4 168
    Wed-P-7-2-5 199
Zinman, Lorne . . . . . . Tue-P-5-2-3 149
Ziółko, Bartosz . . . . . . Thu-S&T-9-A-1 243
Zisserman, Andrew . . . . . . Wed-O-8-1-5 183
Žmolíková, Katerina . . . . . . Tue-O-4-4-2 124
    Wed-O-8-6-2 185
Zöhrer, Matthias . . . . . . Mon-P-1-2-8 96
    Wed-O-8-6-3 185
Zorila, Tudor-Catalin . . . . . . Tue-P-5-4-15 157
Zovato, Enrico . . . . . . Thu-P-9-4-3 241
Zygis, Marzena . . . . . . Tue-O-4-6-1 125


[INTERSPEECH 2017 Program at a Glance: map of the Frescati Campus (1 Underground station, 2 Aula Magna, 3 Södra Huset, 4 Allhuset) and a day-by-day schedule grid, 7:00–23:00, covering Sunday 20 August through Thursday 24 August 2017. Scheduled items include Registration, Morning Tutorials T1–T5, Afternoon Tutorials T6–T9, Coffee Breaks, Lunch Breaks, Welcome Words and Refreshments, the Welcome Reception, keynotes by Catherine Pelachaud, Björn Lindblom and James Allen, the ISCA Medalist session, the ISCA General Assembly, Oral, Poster & Special Sessions with Show & Tell, the Exhibition, the Student Reception, the Standing Banquet, and the Closing Session.]