Page 1: Language and Speech Technology: Introduction

Language and Speech Technology: Introduction

Jan Odijk

January 2011

LOT Winter School 2011

1

Page 2: Language and Speech Technology: Introduction

Overview

• What is language and speech technology (LST)? (3-7)

• Major Subfields of LST (8-25)

• Characterization of the last 30 years (26-27)
  – 80s (28-36), 90s (37-49), 00s (50-56)
  – Current Status (57-69)

• CLARIN infrastructure (70-75)

• This week’s programme (76)

2

Page 3: Language and Speech Technology: Introduction

Language Technology

• Language Technology is the study of computational systems that process natural language

• Alternative names:
  – Human Language Technology (HLT)
  – Natural Language Processing (NLP)

3

Page 4: Language and Speech Technology: Introduction

Speech Technology

• Speech Technology is the study of computational systems that process speech

• Speech Technology is a part of Language Technology

• Often, however, the term “Language Technology” is reserved for the study of computational systems that process written language

4

Page 5: Language and Speech Technology: Introduction

Computational Linguistics

• Computational Linguistics (CL) is the study of language from a computational perspective

• Often used interchangeably with language technology

• Often grouped under Artificial Intelligence (AI), although CL predates AI
  – AI: the study and design of intelligent systems

5

Page 6: Language and Speech Technology: Introduction

Computational Systems

• Computational systems to process natural language do not exist naturally (except in the human brain)
  – They must be designed, implemented, and evaluated
  – Therefore LST is a kind of engineering

6

Page 7: Language and Speech Technology: Introduction

Computational Systems

• LST is NOT the study of the processing of natural language by humans, as pursued in
  – cognition
  – (cognitive) psychology
  – (psycho)linguistics
  – phonetics

7

Page 8: Language and Speech Technology: Introduction

Language Technology Subfields

• Orthographic processing
  – Text = sequence of characters
  – Tokenization
    • Text => sequence of tokens
    • Token = occurrence of a word form
    • Relatively simple for languages that use punctuation and whitespace (space, dot, comma, etc.) to separate tokens
    • More difficult for languages such as Chinese, Thai, etc.
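As a minimal sketch of tokenization for a language that separates tokens with whitespace and punctuation, the following uses a regular expression. The pattern and the function name `tokenize` are illustrative assumptions, not part of the slides; it would not work for languages such as Chinese or Thai.

```python
import re
from typing import List

# Assumed token definition: a run of word characters, or a single
# punctuation mark (anything that is neither a word character nor whitespace).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Text => sequence of tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The man bought a present, for his wife."))
```

Real tokenizers also need rules for abbreviations, numbers, and clitics, which this sketch ignores.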

8

Page 9: Language and Speech Technology: Introduction

Language Technology Subfields

• Orthographic processing
  – Orthographic normalization
  – Token => (token, normalized token)
  – Normalized token = canonical orthographic representation for a set of orthographic variants
  – Examples:
    • Contemporary spelling variants: aktie => actie
    • Older spelling variants: vleesch => vlees
    • Typos: actei => actie
    • OCR errors: raarn => raam
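The mapping token => (token, normalized token) can be sketched as a table lookup. The table below contains only the four examples from the slide; the name `VARIANTS` and the fallback (an unknown token is its own normalized form) are assumptions made for illustration.

```python
from typing import Tuple

# Toy variant table taken from the slide's examples; a real system would
# need a large lexicon or a learned model instead.
VARIANTS = {
    "aktie": "actie",    # contemporary spelling variant
    "vleesch": "vlees",  # older spelling variant
    "actei": "actie",    # typo
    "raarn": "raam",     # OCR error
}


def normalize(token: str) -> Tuple[str, str]:
    """Token => (token, normalized token); unknown tokens map to themselves."""
    return (token, VARIANTS.get(token.lower(), token))


print(normalize("vleesch"))
```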

9

Page 10: Language and Speech Technology: Introduction

Language Technology Subfields

• Morphological processing
  – Lemmatization: token => (token, lemma)
    • Lemma = canonical orthographic representation for an inflectional paradigm
    • Often ambiguous
    • Examples
      – lemma(walked) = walk; lemma(men) = man
      – lemma(graven) = {graf, graaf, graven} (Dutch)

10

Page 11: Language and Speech Technology: Introduction

Language Technology Subfields

• Morphological processing
  – Inflection analysis/generation
    • Word form <=> (lemma, inflectional features)
    • Examples:
      – graven <=> (graf, PoS=Noun, number=plural)
      – graven <=> (graaf, PoS=Noun, number=plural)
      – graven <=> (graven, PoS=Verb, form=infinitive)
      – graven <=> (graven, PoS=Verb, form=indicative, tense=present, number=plural)
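A minimal sketch of inflection analysis as a lookup that returns every (lemma, features) reading of a word form. The table holds only the four readings of "graven" listed on the slide; the names `ANALYSES`, `analyse`, and `lemmas` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Analysis = Tuple[str, Dict[str, str]]  # (lemma, inflectional features)

# Toy analysis table with the slide's four readings of Dutch "graven".
ANALYSES: Dict[str, List[Analysis]] = {
    "graven": [
        ("graf", {"PoS": "Noun", "number": "plural"}),
        ("graaf", {"PoS": "Noun", "number": "plural"}),
        ("graven", {"PoS": "Verb", "form": "infinitive"}),
        ("graven", {"PoS": "Verb", "form": "indicative",
                    "tense": "present", "number": "plural"}),
    ],
}


def analyse(word_form: str) -> List[Analysis]:
    """Return all analyses of a word form (empty list if unknown)."""
    return ANALYSES.get(word_form.lower(), [])


def lemmas(word_form: str) -> Set[str]:
    """Lemmatization falls out of analysis: collect the distinct lemmas."""
    return {lemma for lemma, _ in analyse(word_form)}


print(lemmas("graven"))
```

The ambiguity is left unresolved here; choosing one reading requires context, which is what tagging provides.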

11

Page 12: Language and Speech Technology: Introduction

Language Technology Subfields

• Morphological processing
  – Compound processing
  – word form => ((word form, affix?)+, word form)
  – lemma => ((word form, affix?)+, lemma)
  – Examples:
    • Vleeskoeienhouders => ([vlees, koeien], houders) ‘meat cow farmers’
    • gebiedsbepaling => ([(gebied, s)], bepaling)
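Compound processing can be sketched as a recursive search for known word forms, optionally followed by a linking element. The lexicon, the linker list, and the function name `split_compound` are assumptions for illustration; the sketch returns the first split it finds rather than ranking alternatives.

```python
from typing import List, Optional, Tuple

# Toy lexicon and linking elements covering only the slide's two examples.
LEXICON = {"vlees", "koeien", "houders", "gebied", "bepaling"}
LINKERS = ("", "s")  # "" = no linking element; "s" as in gebied-s-bepaling

Split = Tuple[List[Tuple[str, str]], str]  # ([(modifier, linker), ...], head)


def split_compound(word: str) -> Optional[Split]:
    """Return the first split into known parts, or None if there is none."""
    word = word.lower()
    if word in LEXICON:
        return ([], word)
    for i in range(1, len(word)):
        part, rest = word[:i], word[i:]
        if part not in LEXICON:
            continue
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = split_compound(rest[len(linker):])
            if tail is not None:
                modifiers, head = tail
                return ([(part, linker)] + modifiers, head)
    return None


print(split_compound("Vleeskoeienhouders"))
print(split_compound("gebiedsbepaling"))
```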

12

Page 13: Language and Speech Technology: Introduction

Language Technology Subfields

• Morphological processing
  – Derivational morphology processing
  – word form => (prefix*, lemma, suffix*)
  – Example:
    • Characterization => ([], characterize, [ation])

13

Page 14: Language and Speech Technology: Introduction

Language Technology Subfields

• (PoS-)tagging
  – Assignment of a grammatical tag to a token in context (tag = label for grammatical properties)
  – Token => (token, tag) in context
  – Usually assignment of PoS-tags
  – Often more detailed grammatical (inflectional) tags

14

Page 15: Language and Speech Technology: Introduction

Language Technology Subfields

• (PoS-)tagging
  – Context, usually:
    • Some words and/or tags preceding
    • Some words following
  – Examples:
    • (graven, Zij __ een graf) => Vindprespl
    • (graven, De __ zijn boos) => Npl
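To illustrate how context selects a tag, here is a toy rule-based sketch that looks only at the preceding word. The word lists, the fallback tag "Unknown", and the function name `tag` are hypothetical; practical taggers of the kind discussed later in these slides estimate such preferences statistically from annotated corpora.

```python
from typing import List

# Toy resources covering only the slide's two examples.
CANDIDATE_TAGS = {"graven": ["Vindprespl", "Npl"]}
DETERMINERS = {"de", "het", "een"}
PLURAL_PRONOUNS = {"wij", "zij", "jullie"}


def tag(tokens: List[str], i: int) -> str:
    """Tag tokens[i], using the preceding word as the only context."""
    candidates = CANDIDATE_TAGS.get(tokens[i].lower(), ["Unknown"])
    if len(candidates) == 1:
        return candidates[0]
    previous = tokens[i - 1].lower() if i > 0 else ""
    if previous in DETERMINERS and "Npl" in candidates:
        return "Npl"          # determiner + graven: plural noun
    if previous in PLURAL_PRONOUNS and "Vindprespl" in candidates:
        return "Vindprespl"   # plural pronoun + graven: finite verb
    return candidates[0]      # no rule applies: fall back to the first candidate


print(tag("Zij graven een graf".split(), 1))
print(tag("De graven zijn boos".split(), 1))
```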

15

Page 16: Language and Speech Technology: Introduction

Language Technology Subfields

• Chunking
  – Identifying major phrases in a sentence
  – Example
    • The man bought a present for his wife =>
    • [NP The man] bought [NP a present] [PP for his wife]
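A minimal chunker sketch over PoS-tagged input, assuming two hand-written patterns: NP = optional determiner, any adjectives, one noun; PP = preposition followed by an NP. The tag names (DET, ADJ, NOUN, PREP, VERB) and the function name `chunk` are assumptions, and "his" is treated as a determiner for simplicity.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, PoS tag)


def chunk(tagged: List[Tagged]) -> str:
    """Bracket NPs and PPs; words outside any chunk are copied unchanged."""
    out, i = [], 0
    while i < len(tagged):
        start = i
        is_pp = tagged[i][1] == "PREP"
        j = i + 1 if is_pp else i
        if j < len(tagged) and tagged[j][1] == "DET":   # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":  # any adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NOUN":   # obligatory noun
            words = " ".join(word for word, _ in tagged[start:j + 1])
            out.append("[%s %s]" % ("PP" if is_pp else "NP", words))
            i = j + 1
        else:
            out.append(tagged[start][0])
            i = start + 1
    return " ".join(out)


sentence = [("The", "DET"), ("man", "NOUN"), ("bought", "VERB"),
            ("a", "DET"), ("present", "NOUN"),
            ("for", "PREP"), ("his", "DET"), ("wife", "NOUN")]
print(chunk(sentence))
```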

16

Page 17: Language and Speech Technology: Introduction

Language Technology Subfields

• Parsing
  – Assign a syntactic structure to a sentence
  – Example: The man bought a present for his wife =>

[S
  [subj/NP The man]
  [pred/VP bought
    [obj/NP a present]
    [pobj/PP for [obj/NP his wife]]
  ]
]

17
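The output of a parser is usually stored as a tree. The sketch below shows one possible in-memory representation of the structure above and how to render it back to bracket notation; the class name `Node` and its methods are illustrative assumptions, and the tree is built by hand rather than produced by a parser.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    label: str  # e.g. "subj/NP": grammatical function / category
    children: List[Union["Node", str]] = field(default_factory=list)

    def to_brackets(self) -> str:
        """Render the tree in the bracket notation used on the slide."""
        parts = [c.to_brackets() if isinstance(c, Node) else c
                 for c in self.children]
        return "[%s %s]" % (self.label, " ".join(parts))

    def words(self) -> List[str]:
        """Return the words of the sentence in order."""
        result: List[str] = []
        for c in self.children:
            result.extend(c.words() if isinstance(c, Node) else c.split())
        return result


tree = Node("S", [
    Node("subj/NP", ["The man"]),
    Node("pred/VP", [
        "bought",
        Node("obj/NP", ["a present"]),
        Node("pobj/PP", ["for", Node("obj/NP", ["his wife"])]),
    ]),
])
print(tree.to_brackets())
```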

Page 18: Language and Speech Technology: Introduction

Language Technology Subfields

• Machine Translation
  – Automatic translation of an input text
  – Example
    • The man bought a present for his wife =>
    • L’homme a acheté un cadeau pour sa femme

18

Page 19: Language and Speech Technology: Introduction

Language Technology Subfields

• Content extraction and processing
  – Named entity recognition
  – Question-answering
  – Information retrieval
  – Information extraction
  – Sentiment/opinion mining
  – Reasoning/Inference on semantic representation
  – …

19

Page 20: Language and Speech Technology: Introduction

Speech Technology Subfields

• Speech Synthesis
  – Artificial production of human speech
  – Text => speech
  – Often called Text-To-Speech (TTS)
  – TTS system usually contains two components
    • Grapheme to Phoneme (G2P) component
      – Text => symbolic speech representation (phonetic representation)
    • Speech Synthesis component
      – Symbolic speech representation => speech

20

Page 21: Language and Speech Technology: Introduction

Speech Technology Subfields

• Speech Synthesis (cont.)
  – The term Speech Synthesis is often reserved for this second component
  – Meaning => speech
  – Usually called Speech Generation, Concept-To-Speech, or Data-to-Speech

21

Page 22: Language and Speech Technology: Introduction

Speech Technology Subfields

• Speech Recognition
  – Recognition of human speech
  – Audio containing speech => text
  – Often called automatic speech recognition (ASR)

• Speech Understanding
  – Understanding of human speech
  – Audio containing speech => meaning or action

22

Page 23: Language and Speech Technology: Introduction

Speech Technology Subfields

• Speaker Recognition
  – Recognition of a speaker given a speech signal
  – Speech => person identity

• Speaker Verification
  – Verification of the identity of a person
  – Speech + claimed identity => Boolean
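The difference between the two tasks shows up in their signatures. The sketch below assumes, purely for illustration, that each speech sample has already been reduced to a fixed-length feature vector and that speakers are compared by cosine similarity with a hypothetical threshold of 0.8; the slides do not prescribe any of these choices.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recognize(sample: List[float], enrolled: Dict[str, List[float]]) -> str:
    """Speech => person identity: pick the most similar enrolled speaker."""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))


def verify(sample: List[float], claimed: str,
           enrolled: Dict[str, List[float]], threshold: float = 0.8) -> bool:
    """Speech + claimed identity => Boolean."""
    return claimed in enrolled and cosine(sample, enrolled[claimed]) >= threshold


enrolled = {"anna": [1.0, 0.0], "ben": [0.0, 1.0]}  # hypothetical speakers
sample = [0.9, 0.1]
print(recognize(sample, enrolled), verify(sample, "ben", enrolled))
```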

23

Page 24: Language and Speech Technology: Introduction

Speech Technology Subfields

• Speech Compression
  – Reduction of the size of speech representations (speech encoding), or
  – Time-compression of speech representations (so that they sound faster to the listener)

24

Page 25: Language and Speech Technology: Introduction

Related fields

• Speech is often used in dialogues
  – Study of spoken dialogues (human-human, human-machine)

• Speech is often combined with other modalities
  – Study of Multimodal Interaction

• Speech is part of a man-machine interface
  – Study of Human-Machine Interaction

25

Page 26: Language and Speech Technology: Introduction

Introduction

• Three decades:
  – “80s” = 1980-1994
  – “90s” = 1990-2005
  – “00s” = 2000-2011

26

Page 27: Language and Speech Technology: Introduction

Overview

• 80s: Language Technology

• 80s: Speech Technology

• 90s: Language and Speech Technology

• 90s: Commercial Activity

• 90s: Importance of Data

• 00s: Language and Speech Technology

27

Page 28: Language and Speech Technology: Introduction

80s: Language Technology

• Focus on MT (in Europe)
  – Eurotra (Europe)
  – Rosetta (Philips, Netherlands)
  – Distributed Translation (BSO, Netherlands)

28

Page 29: Language and Speech Technology: Introduction

80s: Language Technology

• Linguistic “Research Approach”
• Focus on Research
  – not/less on Technology Development
• Knowledge-based approach
  – hand-crafted lexicons and rules
  – based on a theory / grammatical formalism
• Focus on linguistically interesting complex phenomena
  – less on phenomena that occur often
  – not strongly data-driven

29

Page 30: Language and Speech Technology: Introduction

80s: Language Technology

• Focus on an idealized language
  – not on actual language use
  – no focus on robustness

• Computational approach seen (in research) as a way to gain insight into language, grammar and grammar formalisms
  – no focus on developing a working system
  – no pragmatic solutions

30

Page 31: Language and Speech Technology: Introduction

80s: Language Technology

• Little formal (quantitative) evaluation
  – only with test suites
    • constructed sentences illustrating linguistic phenomena
    • e.g. the HP Test Suite (Flickinger et al. 1987)

• Computational linguistics rather than language technology

31

Page 32: Language and Speech Technology: Introduction

80s: Language Technology

Major Problems (from a technology point of view):
• Ambiguity
  – Real
  – Temporary
• Computational Complexity
  – computation-intensive grammar formalisms
• Complexity of language
  – handcrafting lexicons and rules
    • requires linguistic and computational expertise
    • requires a lot of effort and time

32

Page 33: Language and Speech Technology: Introduction

80s: Language Technology

• Major problems (cont.):

• Idealized Language v. actual Language Use

• Systems require large and rich lexicons suited to the application domain: these take a large effort to build and to tune (adapt) to specific domains

33

Page 34: Language and Speech Technology: Introduction

80s: Speech Technology

• Automatic Speech Recognition (ASR)

• Statistical “Engineering Approach”

• approach based on Noisy Channel Model

• derive acoustic models from a lot of annotated speech examples

• derive statistical language models from large text corpora (n-gram probabilities)
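As a minimal sketch of how n-gram probabilities can be derived from a text corpus, the following estimates bigram probabilities by relative frequency. The two-sentence corpus and the function names are assumptions for illustration; no smoothing is applied, so unseen bigrams get probability 0, which practical systems avoid.

```python
from collections import Counter
from typing import List, Tuple


def train_bigrams(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams


def bigram_prob(prev: str, word: str, contexts: Counter, bigrams: Counter) -> float:
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0


corpus = [["the", "man", "bought", "a", "present"], ["the", "man", "left"]]
contexts, bigrams = train_bigrams(corpus)
print(bigram_prob("man", "bought", contexts, bigrams))
```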

34

Page 35: Language and Speech Technology: Introduction

80s: Speech Technology

• Focus on making (small) working systems

• Statistical approach: system uses probabilities derived from data

• Focus initially on limited, “simple” tasks (e.g. digit recognition), and increasingly on more complex tasks

35

Page 36: Language and Speech Technology: Introduction

80s: Speech Technology

• Focus on real language use under realistic conditions

• Progress made by making concrete systems and evaluating them rigorously

36

Page 37: Language and Speech Technology: Introduction

90s: Language Technology

• Statistical MT
  – derive language models from monolingual corpora (probabilities of word (sequence)s)
  – align “sentences” with their translations
  – derive translation model from parallel corpora:
    • estimate translation probabilities for words and word sequences from the aligned “sentences”
    • use these probabilities to compute translations for new “sentences”
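The two models combine in the noisy-channel formulation commonly used for statistical MT. As a sketch, with f the source sentence and e a candidate translation (symbols chosen here, not taken from the slides):

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        \;=\; \arg\max_{e} \underbrace{P(e)}_{\text{language model}} \;
                           \underbrace{P(f \mid e)}_{\text{translation model}}
```

The denominator P(f) can be dropped because it does not depend on e. P(e) is estimated from monolingual corpora and P(f | e) from the aligned parallel corpora.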

37

Page 38: Language and Speech Technology: Introduction

90s: Language Technology

• Ambiguity: resolved by probabilities based on statistics
• Computational Complexity
  – computationally feasible formalisms
  – proven in speech recognition
• Complexity of language
  – language and translation models automatically derived from data
• Strong focus on actual language use
  – Highly data-driven
• Lexicons can be simpler and are derived automatically from the data; adaptation to specific domains is easy once the data are available

38

Page 39: Language and Speech Technology: Introduction

90s: Language Technology

• Rise of Internet
• Increasing need for information retrieval
• Approximated by search for word and word sequence strings
• Information Retrieval
  – strongly statistically based
  – limited linguistics
  – formal evaluation (recall, precision, F-score)
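The three evaluation measures can be computed from the set of retrieved documents and the set of relevant documents. A minimal sketch follows; the function name and the document IDs in the example are made up, and the F-score shown is the balanced F1 (harmonic mean of precision and recall).

```python
from typing import Set, Tuple


def precision_recall_f1(retrieved: Set[int], relevant: Set[int]) -> Tuple[float, float, float]:
    """Precision = correct / retrieved; recall = correct / relevant; F1 = harmonic mean."""
    correct = len(retrieved & relevant)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical query result: 4 documents retrieved, 3 relevant, 2 in common.
print(precision_recall_f1({1, 2, 3, 4}, {2, 4, 5}))
```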

39

Page 40: Language and Speech Technology: Introduction

90s: Language Technology

• Resulted in
  – strongly data-driven approach in language technology
  – increasing use of machine learning techniques
  – explicit focus on formal, esp. quantitative, evaluation
  – re-examination of simpler/computationally less intensive formalisms (finite-state) for syntax

40

Page 41: Language and Speech Technology: Introduction

90s: Speech Technology

• Continued working under the established paradigm

• increasingly improving performance and extending environments and application areas

41

Page 42: Language and Speech Technology: Introduction

90s: Companies

• Many companies active in Speech technology
  – IBM, Microsoft, Siemens, Nokia, Philips, Motorola, Matra Nortel, Nortel, ...
  – Dragon, Kurzweil, Lernout & Hauspie, SpeechWorks, Nuance, Babel, Loquendo, Rhetorical, Vocalis, Telisma, Elan, ...

42

Page 43: Language and Speech Technology: Introduction

90s: Companies

• Many companies in Language technology
  – IBM, Microsoft, INSO, Novell, ...
  – GMS, Apptek, Globalink, Lernout & Hauspie, Systran, LANT (Xplanation), ...

43

Page 44: Language and Speech Technology: Introduction

90s: Companies

• MT systems:
  – knowledge-based systems
  – developed under an engineering approach
• Simple grammatical formalisms, or pruning of the search space
  – to reduce ambiguity
  – to reduce computational resource requirements
  – to reduce hand-crafting of rules

44

Page 45: Language and Speech Technology: Introduction

90s: Companies

• Resulted in low-quality MT systems
  – still useful in many circumstances

• Differentiating factors
  – rapid adaptation to (multi-word) terms / vocabulary of new domain
  – good performance on named entity recognition

45

Page 46: Language and Speech Technology: Introduction

90s: Data

• Knowledge-based NLP: researchers realized that cooperation on lexicons was required

• ASR methodology requires a lot of data:
  – “There is no data like more data”

• This led to
  – Data creation projects
  – Set-up of data distribution centers
  – Projects for developing standards for data

46

Page 47: Language and Speech Technology: Introduction

90s: Data

• Projects
  – Lexicon projects
    • Multilex
    • Genelex
    • Acquilex
    • Parole
    • WordNet, EuroWordNet
  – SpeechDat projects
    • SpeechDat, SpeechDat-Car, SpeechDat-East, SPEECON, OrienTel
  – National / Local projects
    • Spoken Dutch Corpus (Netherlands and Flanders)

47

Page 48: Language and Speech Technology: Introduction

90s: Data

• Data distribution centers are set up
  – LDC (1993)
  – ELRA (1995)

• Standards:
  – TEI for text corpora
    • CES, XCES
  – EAGLES, ISLE for grammatical properties

48

Page 49: Language and Speech Technology: Introduction

Automating Data Production

• Usually existing (imperfect) tools are used to create data (semi-)automatically
  – G2P for creating phonetic dictionaries
  – PoS-tagging for PoS-tagged text corpora
  – Parsers for treebanks

• For bootstrapping annotations
  – Faster and more consistent results

• Followed by (partial) manual correction

49

Page 50: Language and Speech Technology: Introduction

00s

• Early 00s
  – Many data and research initiatives, nationally
  – Netherlands
    • IMIX 2001-2008
    • STEVIN 2004-2011
    • TST-Centrale (HLT Agency) 2005-..
  – France
    • EVALDA
    • Technolangue

50

Page 53: Language and Speech Technology: Introduction

00s

• More recent projects

• FLaReNet

• META-NET

53

Page 54: Language and Speech Technology: Introduction

00s

• Companies offer services via the internet and via mobile (smart) phones
  – Search: Google, Bing, Yahoo!, etc.
  – Social networks: Facebook, LinkedIn, YouTube
  – Cloud Computing: Amazon, Google, Salesforce

• Companies gain access to huge amounts of data (text, pictures, movies, etc.), including user behavior

54

Page 55: Language and Speech Technology: Introduction

00s

• Data are used
  – to improve existing services
  – to create new services
  – to personalize services and advertisements

55

Page 56: Language and Speech Technology: Introduction

00s

• New Services relevant for LST
  – Google: Translation, search by voice, open platform for mobile devices (Android)
  – Amazon: Mechanical Turk
    • Allows large-scale distribution of work, e.g. on manual annotation of language resources
  – Apple: several iPhone Apps
    • Dragon Dictate (for SMS, e-mail)
    • Jibbigo
  – reCAPTCHA: transcription of (hand-written) documents (now part of Google)

56

Page 57: Language and Speech Technology: Introduction

Current Status

• Language and Speech Technology in 2011:
  – Exciting area!

• A lot of commercial activity, and expanding

• A large and active research community

• A lot of interesting topics are open for research

57

Page 58: Language and Speech Technology: Introduction

Commercial Activity

• Many companies in Language technology
  – Google, Yahoo!, IBM, Microsoft, ...
  – Apptek, Linguatec, Systran, Knowledge Concepts, Q-go, ...

• Applications
  – MT, content management, information retrieval, dealing with customer questions, sentiment and opinion mining, ...

58

Page 59: Language and Speech Technology: Introduction

Commercial Activity

• Many companies in Speech technology
  – Google, IBM, Microsoft, Motorola, Nokia, ...
  – Nuance, Loquendo, Acapela, SVOX, Telisma, ...

• even more in application development and system integration

59

Page 60: Language and Speech Technology: Introduction

Commercial Activity

• Applications
  – Network IVR applications (call centers, banking, information services, ...)
  – Embedded applications
    • in-car applications, e.g. voice-activated dialing, navigation (voice destination entry)
    • mobile phone/PDA applications
      – multimodal output, e.g. for navigation
      – command and control
      – (SMS) dictation coming soon

60

Page 61: Language and Speech Technology: Introduction

Commercial Activity

• Applications
  – Office Applications
    • Dictation, horizontal and vertical (medical, legal)
    • Language learning
  – Audiomining
    • information retrieval from recorded speech (possibly incl. other modalities): Radio/TV broadcasts, parliamentary sessions, ...

61

Page 62: Language and Speech Technology: Introduction

Research Topics?

• Speech Technology (Recognition)
  – New paradigms?
    • cf. FLaVoR project http://www.esat.kuleuven.be/psi/spraak/projects/FLaVoR/
  – Combination with other modalities
    • AMI http://www.amiproject.org
    • CHIL http://chil.server.de/servlet/is/101/
    • IMIX (Interactive Multimodal Information eXtraction)

62

Page 63: Language and Speech Technology: Introduction

Research Topics?

• Speech Technology (Recognition)
  – Robustness against noise and other speakers
    • increasing use in cars and in public places, on PDAs and mobile phones
    • MIDAS project
  – Pronunciation of names
    • Autonomata I and TOO (incl. Nuance, Ghent, Nijmegen and Utrecht)

63

Page 64: Language and Speech Technology: Introduction

Research Topics?

• Speech Technology (Text-to-Speech)
  – Better control over prosody in corpus-based TTS?
  – Combination with other modalities

64

Page 65: Language and Speech Technology: Introduction

Research Topics?

• Language Technology
  – Semantic lexical databases created
  – WordNet and EuroWordNet
  – Cornetto

65

Page 66: Language and Speech Technology: Introduction

Research Topics?

• Language Technology
  – Focus now on Semantic Annotation of Corpora
    • OntoNotes http://www.isi.edu/natural-language/people/hovy/papers/06HLT-NAACL-OntoNotes-short.pdf
    • STEVIN D-COI and SONAR
    • DutchSemCor
  – How to use this semantic annotation in practical systems?

66

Page 67: Language and Speech Technology: Introduction

Research Topics?

• Language Technology
  – (Semi-)automatic lexicon creation/adaptation
  – Sophisticated information retrieval
    • Information extraction, summarization and merging, opinion and sentiment mining, ...

67

Page 68: Language and Speech Technology: Introduction

Research Topics?

• Language and Speech Technology
  – Speech-to-Speech Translation
    • TC-STAR http://www.tc-star.org/

68

Page 69: Language and Speech Technology: Introduction

Research Topics?

• Dutch-Flemish STEVIN programme
  – Running from 2004 to 2011
  – 11.4M€ budget
    • resources
    • research
    • applications
    • demonstration projects
  – Most projects finished
  – Some projects are still running
  – http://www.taalunieversum.nl/stevin

69

Page 70: Language and Speech Technology: Introduction

CLARIN

• Aims to design, construct, validate, and exploit
  – a research infrastructure that is needed to provide a sustainable and persistent eScience working environment
  – for researchers in the Social Sciences & Humanities
  – who want to make use of language data and tools

70

Page 71: Language and Speech Technology: Introduction

CLARIN

• Make data and tools at different locations easily accessible
  – via web interfaces and services
  – CLARIN portal(s) (with intelligent searching, browsing, viewing and querying services)

• Make it possible for non-technical researchers to extract / combine / enrich data (supported by dissemination and training)

71

Page 72: Language and Speech Technology: Introduction

CLARIN

• Will make available interoperable data and tools based on existing standards and best practices
  – Formal interoperability and
  – Semantic interoperability

72

Page 73: Language and Speech Technology: Introduction

CLARIN

• For researchers who work with language data and tools
  – Humanities and Social Sciences
    • Linguistics (broadly construed)
    • Literary and Theatrical Studies
    • Media and Culture
    • History
    • Political Sciences
    • …

73

Page 74: Language and Speech Technology: Introduction

CLARIN

• Preparatory Project (CLARIN-prep)
  – Funded by EU
  – 2008-2011
  – >33 partners from >23 countries
  – Goals
    • Get commitments from EU countries to contribute to the CLARIN infrastructure after CLARIN-prep
    • Investigate needs, requirements
    • Make initial specification (and prototype implementations)

74

Page 75: Language and Speech Technology: Introduction

CLARIN

• Current Status
  – Most countries in the process
  – CLARIN infrastructure to start in mid-2011
  – Netherlands committed and has leading role

• CLARIN-NL
  – Funded by NWO
  – 2009-2015
  – Many subprojects running
  – Focus on Humanities

75

Page 76: Language and Speech Technology: Introduction

This week’s Programme

• Tuesday: Parsing
• Wednesday: Machine Learning
• Thursday: Speech Recognition
  – Guest lecturer: Arjan van Hessen
• Friday: Machine Translation

76

Page 77: Language and Speech Technology: Introduction

Thanks for Your Attention!

77

Page 78: Language and Speech Technology: Introduction

References

• Flickinger D., Nerbonne J., Sag I., Wasow T., "Toward Evaluation of NLP Systems", Hewlett-Packard Laboratories, Palo Alto, CA, 1987.

78