
ACL HLT 2011

The 49th Annual Meeting of the
Association for Computational Linguistics:
Human Language Technologies

Proceedings of Tutorial Abstracts

19 June, 2011
Portland, Oregon, USA

©2011 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
acl@aclweb.org


ISBN-13: 978-1-937284-07-7

Production and Manufacturing by
Omnipress, Inc.
2600 Anderson Street
Madison, WI 53704 USA

Tutorials Chairs:

Andy Way, Dublin City University, Ireland
Patrick Pantel, Microsoft Research, USA

iii

Table of Contents

Beyond Structured Prediction: Inverse Reinforcement Learning
Hal Daume III . . . . . . . . . . . . . . . . . . . . 1

Formal and Empirical Grammatical Inference
Jeffrey Heinz, Colin de la Higuera and Menno van Zaanen . . . . . . . . . . . . . . . . . . . . 2

Automatic Summarization
Ani Nenkova, Sameer Maskey and Yang Liu . . . . . . . . . . . . . . . . . . . . 3

Web Search Queries as a Corpus
Marius Pasca . . . . . . . . . . . . . . . . . . . . 4

Rich Prior Knowledge in Learning for Natural Language Processing
Gregory Druck, Kuzman Ganchev and Joao Graca . . . . . . . . . . . . . . . . . . . . 5

Dual Decomposition for Natural Language Processing
Michael Collins and Alexander M. Rush . . . . . . . . . . . . . . . . . . . . 6


Conference Program

Sunday, June 19, 2011

Morning Tutorials

9:00–12:30  Beyond Structured Prediction: Inverse Reinforcement Learning
Hal Daume III

9:00–12:30  Formal and Empirical Grammatical Inference
Jeffrey Heinz, Colin de la Higuera and Menno van Zaanen

9:00–12:30  Automatic Summarization
Ani Nenkova, Sameer Maskey and Yang Liu

Afternoon Tutorials

2:00–5:30  Web Search Queries as a Corpus
Marius Pasca

2:00–5:30  Rich Prior Knowledge in Learning for Natural Language Processing
Gregory Druck, Kuzman Ganchev and Joao Graca

2:00–5:30  Dual Decomposition for Natural Language Processing
Michael Collins and Alexander M. Rush



Beyond Structured Prediction: Inverse Reinforcement Learning

Hal Daume III
Computer Science and UMIACS
University of Maryland, College Park
[email protected]

1 Introduction

Machine learning is all about making predictions; language is full of complex, rich structure. Structured prediction marries these two. However, structured prediction isn't always enough: sometimes the world throws even more complex data at us, and we need reinforcement learning techniques. This tutorial is all about the how and the why of structured prediction and inverse reinforcement learning (aka inverse optimal control): participants should walk away comfortable that they could implement many structured prediction and IRL algorithms, and have a sense of which ones might work for which problems. I gave a similar tutorial at ACL 2010. The first half of these two tutorials is 90% the same; the second half is only 50% the same.

2 Content Overview

The first half of the tutorial will cover the "basics" of structured prediction: the structured perceptron and Magerman's incremental parsing algorithm. It will then build up to more advanced algorithms that are shockingly reminiscent of these simple approaches: maximum margin techniques and search-based structured prediction.
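As a concrete illustration of the flavor of algorithm covered in this first half, here is a minimal structured perceptron sketch in Python (an editorial illustration, not code from the tutorial; the feature map, candidate generator, and data format are assumed placeholders that a real tagger or parser would supply):

def structured_perceptron(data, candidates, features, epochs=5):
    # data: list of (x, y_gold) pairs; candidates(x): iterable of possible outputs;
    # features(x, y): sparse feature dict for an (input, output) pair.
    w = {}  # weight vector stored as a sparse dict

    def score(x, y):
        return sum(w.get(f, 0.0) * v for f, v in features(x, y).items())

    for _ in range(epochs):
        for x, y_gold in data:
            # the "argmax" problem: find the best-scoring output under current weights
            y_hat = max(candidates(x), key=lambda y: score(x, y))
            if y_hat != y_gold:
                # additive update: toward gold features, away from predicted features
                for f, v in features(x, y_gold).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in features(x, y_hat).items():
                    w[f] = w.get(f, 0.0) - v
    return w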

The second half of the tutorial will ask the question: what happens when our standard assumptions about our data are violated? This is what leads us into the world of reinforcement learning (the basics of which we'll cover) and then to inverse reinforcement learning and inverse optimal control.

Throughout the tutorial, we will see examples ranging from simple (part-of-speech tagging, named entity recognition, etc.) through complex (parsing, machine translation).

The tutorial does not assume attendees know anything about structured prediction or reinforcement learning (though it will hopefully be interesting even to those who know some!), but does assume some knowledge of simple machine learning.

3 Tutorial Outline

Part I: Structured prediction

• What is structured prediction?
• Refresher on binary classification
  – What does it mean to learn?
  – Linear models for classification
  – Batch versus stochastic optimization
• From perceptron to structured perceptron
  – Linear models for structured prediction
  – The "argmax" problem
  – From perceptron to margins
• Search-based structured prediction
  – Stacking
  – Training classifiers to make parsing decisions
  – Searn and generalizations

Part II: Inverse reinforcement learning

• Refresher on reinforcement learning
  – Markov decision processes
  – Q learning (a one-step sketch follows this outline)
• Inverse optimal control and A* search
  – Maximum margin planning
  – Learning to search
  – Dagger
• Apprenticeship learning
• Open problems
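For reference, the one-step tabular Q-learning update mentioned in the refresher can be sketched as follows (a generic textbook update, not tutorial material; the state, action, and reward variables are placeholders):

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Move Q(s, a) toward the observed reward plus the discounted value
    # of the best action available in the next state.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))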

References

See http://www.cs.utah.edu/~suresh/mediawiki/index.php/MLRG/spring10.



Formal and Empirical Grammatical Inference

Jeffrey Heinz
University of Delaware
42 E. Delaware Ave
Newark, DE 19716
[email protected]

Colin de la Higuera
Université de Nantes
2 rue de la Houssinière
44322 Nantes Cedex 03
[email protected]

Menno van Zaanen
Tilburg University
P.O. Box 90153
NL-5000 LE Tilburg
The Netherlands
[email protected]

1 Introduction

Computational linguistics (CL) often faces learning problems: what algorithm takes as input some finite dataset, such as (annotated) corpora, and outputs a system that behaves "correctly" on some task? Data chunking, dependency parsing, named entity recognition, part-of-speech tagging, semantic role labelling, and the more general problem of deciding which out of all possible (structured) strings are grammatical are all learning problems in this sense.

The subfield of theoretical computer science known as grammatical inference (GI) studies how grammars can be obtained from data. Although there has been limited interaction between the fields of GI and CL to date, research by the GI community directly impacts the study of human language and in particular CL. The purpose of this tutorial is to introduce GI to computational linguists.

2 Content Overview

This tutorial shows how the theories, algorithms, and models studied by the GI community can benefit CL. Particular attention is paid to both foundational issues (e.g. how learning ought to be defined), which are relevant to all learning problems, even those outside CL, and issues specific to problems within CL. One theme that runs throughout the tutorial is how properties of natural languages can be used to reduce the instance space of the learning problem, which can lead to provably correct learning solutions. E.g., some natural language patterns are mildly context-sensitive (MCS), but many linguists agree that not all MCS languages are possible natural languages.

So instead of trying to learn all MCS languages (or context-free, regular or stochastic variants thereof), one approach is to try to define a smaller problem space that better approximates natural languages. Interestingly, many of the regions that the GI community has carved out for natural language appear to have the right properties for feasible learning even under the most stringent definitions of learning.
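To make the idea of a provably correct learner for a restricted class concrete, the following is a sketch (an editorial illustration, not part of the tutorial) of the textbook learner for strictly 2-local languages, one of the sub-regular classes discussed below: the grammar is simply the set of attested adjacent symbol pairs, padded with word boundaries, and a string is accepted only if all of its pairs were observed; this learner identifies the class in the limit from positive data.

def learn_sl2(sample):
    # Grammar = set of observed adjacent symbol pairs, with "#" as word boundary.
    grammar = set()
    for string in sample:
        padded = ["#"] + list(string) + ["#"]
        grammar.update(zip(padded, padded[1:]))
    return grammar

def accepts(grammar, string):
    padded = ["#"] + list(string) + ["#"]
    return all(pair in grammar for pair in zip(padded, padded[1:]))

# Toy phonotactic pattern in which "a" never follows "b":
g = learn_sl2(["ab", "aab"])
print(accepts(g, "aaab"))  # True: every adjacent pair was attested
print(accepts(g, "abab"))  # False: the pair ('b', 'a') was never observed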

3 Tutorial Outline

1. Formal GI and learning theory. Those learnability properties which should be regarded as necessary (not sufficient) conditions for "good" language learning are discussed. Additionally, the most important proofs that use similar algorithmic ideas will be described.

2. Empirical approaches to regular and sub-regular natural language classes. This focuses on sub-regular classes that are learnable under many (including the hardest) settings. General learning strategies are introduced as well as probabilistic variants, which are illustrated with natural language examples, primarily in the domain of phonology.

3. Empirical approaches to non-regular natural language classes. This treats learnability of context-free/sensitive formalisms where the aim is to approximate the grammar of a specific language from data, not to show learnability of classes of languages. An overview of well-known empirical grammatical inference systems will be given in the context of natural language syntax and semantics.



Automatic Summarization

Ani Nenkova
Univ. of Pennsylvania
Philadelphia, PA
[email protected]

Sameer Maskey
IBM Research
Yorktown Heights, NY
[email protected]

Yang Liu
Univ. of Texas at Dallas
Richardson, TX
[email protected]

1 Introduction

In the past decade, we have seen the amount of digital data, such as news, scientific articles, blogs, and conversations, increase at an exponential pace. The need to address 'information overload' by developing automatic summarization systems has never been more pressing. At the same time, approaches and algorithms for summarization have matured and increased in complexity, and interest in summarization research has intensified, with numerous publications on the topic each year. A newcomer to the field may find navigating the existing literature to be a daunting task. In this tutorial, we aim to give a systematic overview of traditional and more recent approaches for text and speech summarization.

2 Content Overview

A core problem in summarization research is devising methods to estimate the importance of a unit, be it a word, clause, sentence or utterance, in the input. A few classical methods will be introduced, but the overall emphasis will be on the most recent advances. We will cover the log-likelihood test for topic word discovery and graph-based models for sentence importance, and will discuss semantically rich approaches based on latent semantic analysis and lexical resources. We will then turn to the most recent Bayesian models of summarization. For supervised machine learning approaches, we will discuss the suite of traditional features used in summarization, as well as issues with data annotation and acquisition.
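As one concrete instance of the importance-estimation methods mentioned above, the log-likelihood ratio statistic commonly used for topic-word discovery can be sketched as follows (an illustrative implementation, not code from the tutorial; the counts come from the input documents and a background corpus):

import math

def binomial_ll(k, n, p):
    # Binomial log-likelihood, with guards for the degenerate p = 0 or p = 1 cases.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def llr(k1, n1, k2, n2):
    # How much better two separate binomials (input vs. background) explain a
    # word's counts than a single shared one; large values mark topic words.
    p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    return 2.0 * (binomial_ll(k1, n1, p1) + binomial_ll(k2, n2, p2)
                  - binomial_ll(k1, n1, p) - binomial_ll(k2, n2, p))

# Words whose statistic exceeds a chi-square cutoff (e.g. 10.83 for p < 0.001)
# are treated as topic words.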

Ultimately, the summary will be a collection of important units. The summary can be selected in a greedy manner, choosing the most informative sentence one by one; or the units can be selected jointly and optimized for informativeness. We discuss both approaches, with emphasis on recent optimization work.
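The greedy selection strategy can be sketched in a few lines; the informativeness score, similarity measure, and word-count budget below are stand-ins for whatever a real system would use (an illustration, not the tutorial's reference implementation):

def greedy_summary(sentences, score, similarity, budget, redundancy_penalty=0.5):
    # Pick sentences one by one, trading informativeness against similarity
    # to what has already been selected (maximal-marginal-relevance style).
    selected, remaining, length = [], list(sentences), 0
    while remaining:
        def gain(s):
            overlap = max((similarity(s, t) for t in selected), default=0.0)
            return score(s) - redundancy_penalty * overlap
        best = max(remaining, key=gain)
        remaining.remove(best)
        if length + len(best.split()) <= budget:
            selected.append(best)
            length += len(best.split())
    return selected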

In the part on evaluation we will discuss the standard manual and automatic metrics for evaluation, as well as very recent work on fully automatic evaluation.

We then turn to domain-specific summarization, particularly summarization of scientific articles and speech data (telephone conversations, broadcast news, meetings and lectures). In speech, the acoustic signal brings more information that can be exploited as features in summarization, but it also poses unique problems, related to disfluencies, lack of sentence or clause boundaries, and recognition errors, which we discuss.

We will only briefly touch on key but under-researched issues of linguistic quality of summaries, deeper semantic analysis for summarization, and abstractive summarization.

3 Tutorial Outline

1: Computing informativeness
- Frequency-driven: topic words, clustering, graph approaches
- Semantic approaches: lexical chains, latent semantic analysis
- Probabilistic (Bayesian) models
- Supervised approaches

2: Optimizing informativeness and minimizing redundancy
- Maximal marginal relevance
- Integer linear programming
- Redundancy removal

3: Evaluation
- Manual evaluation: Responsiveness and Pyramid
- Automatic: ROUGE
- Fully automatic

4: Domain specific summarization
- Scientific articles
- Biographical
- Speech summarization
  * Utterance segmentation
  * Acoustic features
  * Dealing with recognition errors
  * Disfluency removal and compression



Web Search Queries as a Corpus

Marius Pasca
Google Inc.
Mountain View, California 94043
[email protected]

1 Introduction

As noisy and unreliable as they may be, Web search queries indirectly convey knowledge just as they request knowledge. Indeed, queries specify constraints, even if as brittle as the mere presence of an additional keyword or phrase, that loosely describe what knowledge is being requested. In the process of asking "how many calories are burned during skiing", one implies that skiing may be an activity during which the body consumes calories, whereas the more condensed "amg latest album" still suggests that amg may be a musician or band, even to someone unfamiliar with the respective topics. As such, search queries are cursory reflections of knowledge encoded deeply within unstructured and structured content available in documents on the Web and elsewhere.

The notion that inherently-noisy Web search queries may collectively serve as a text corpus is an intriguing alternative to using document corpora. This tutorial gives an overview of the characteristics of, and types of knowledge available in, queries as a corpus. It reviews extraction methods developed recently for extracting such knowledge. Considering the building blocks that would contribute towards the automatic construction of knowledge bases, queries lend themselves as a useful data source in the acquisition of classes of instances (e.g., palo alto, santa barbara, twentynine palms), where the classes are unlabeled or labeled (e.g., california cities), possibly organized as hierarchies of search intents; as well as relations, including class attributes (e.g., population density, mayor).

2 Content Overview

The tutorial covers characteristics of search queries, when considered as an input data source in open-domain information extraction, and their impact on extraction methods operating over queries as opposed to documents; types of knowledge for which queries lend themselves as a useful data source in information extraction; detailed methods for extracting classes, instances and relations from queries; and implications in semantic annotation of queries, understanding query intent, and information access and retrieval in general.
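As a toy illustration of how extraction over queries differs from extraction over documents, the sketch below mines candidate class attributes from queries matching an "A of I" pattern, where I is a known instance; the pattern and the tiny query sample are assumptions made purely for this example, not methods prescribed by the tutorial:

import re
from collections import Counter

def mine_attributes(queries, instances):
    # Count candidate attributes A from queries of the form "A of I",
    # where I is a known class instance (here, a city name).
    pattern = re.compile(r"^(.+?) of (.+)$")
    counts = Counter()
    for q in queries:
        m = pattern.match(q.lower().strip())
        if m and m.group(2) in instances:
            counts[m.group(1)] += 1
    return counts

queries = ["population density of palo alto", "mayor of santa barbara",
           "population density of twentynine palms", "history of skiing"]
cities = {"palo alto", "santa barbara", "twentynine palms"}
print(mine_attributes(queries, cities).most_common())
# [('population density', 2), ('mayor', 1)]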

3 Tutorial Outline

Introduction
- Overview of knowledge acquisition from text
- Goals of open-domain information extraction
- Extraction from documents vs. queries

Queries as a corpus
- Intrinsic aspects: distribution, lexical structure
- Extrinsic aspects: temporality, demographics
- Beyond individual queries: sessions, clicks

Methods for knowledge acquisition from queries
- Extraction of instances and classes
- Extraction of attributes and relations

Discussion
- Implications and limitations
- Applications



Rich Prior Knowledge in Learning for Natural Language Processing

Gregory Druck
U. of Massachusetts Amherst
[email protected]

Kuzman Ganchev
Google Inc.
[email protected]

Joao Graca
University of Pennsylvania
[email protected]

1 Introduction

We possess a wealth of prior knowledge about most prediction problems, and particularly so for many of the fundamental tasks in natural language processing. Unfortunately, it is often difficult to make use of this type of information during learning, as it typically does not come in the form of labeled examples, may be difficult to encode as a prior on parameters in a Bayesian setting, and may be impossible to incorporate into a tractable model. Instead, we usually have prior knowledge about the values of output variables. For example, linguistic knowledge or an out-of-domain parser may provide the locations of likely syntactic dependencies for grammar induction. Motivated by the prospect of being able to naturally leverage such knowledge, four different groups have recently developed similar, general frameworks for expressing and learning with side information about output variables.

2 Content Overview

This tutorial describes how to encode side information about output variables, and how to leverage this encoding and an unannotated corpus during learning. We survey the different frameworks, explaining how they are connected and the trade-offs between them. We also survey several applications that have been explored in the literature, including applications to grammar and part-of-speech induction, word alignment, information extraction, text classification, and multi-view learning. Prior knowledge used in these applications ranges from structural information that cannot be efficiently encoded in the model, to knowledge about the approximate expectations of some features, to knowledge of some incomplete and noisy labellings. These applications also address several different problem settings, including unsupervised, lightly supervised, and semi-supervised learning, and utilize both generative and discriminative models.
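To make "knowledge about the approximate expectations of some features" concrete, the sketch below computes the divergence between a model's expected label distribution on instances where a labeled feature fires and the distribution we believe that feature indicates; this is the kind of quantity that frameworks such as Generalized Expectation Criteria penalize during training (the toy model and target here are editorial assumptions, not output of the surveyed toolkits):

import math

def expectation_penalty(model_probs, feature_fires, target):
    # model_probs: per-instance label distributions p(y|x) from the current model;
    # feature_fires: parallel booleans marking instances containing the feature;
    # target: the label distribution the feature is believed to indicate.
    fired = [p for p, f in zip(model_probs, feature_fires) if f]
    expected = {y: sum(p[y] for p in fired) / len(fired) for y in target}
    # KL(target || model expectation) for this one labeled feature
    return sum(t * math.log(t / expected[y]) for y, t in target.items() if t > 0)

# Toy check: we believe documents containing "puck" are about hockey 90% of the time.
model_probs = [{"hockey": 0.6, "baseball": 0.4}, {"hockey": 0.7, "baseball": 0.3}]
print(expectation_penalty(model_probs, [True, True],
                          {"hockey": 0.9, "baseball": 0.1}))  # about 0.17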

Additionally, we discuss issues that come up in implementation, and describe a toolkit that provides out-of-the-box support for the applications described in the tutorial, and is extensible to other applications and new types of prior knowledge.

3 Tutorial Outline

1. Introduction: prior knowledge in NLP, previous approaches for leveraging it, motivation for constraining output variables, demonstration

2. Learning with Prior Knowledge: Constraint-Driven Learning (UIUC), Posterior Regularization (UPenn), Generalized Expectation Criteria (UMass Amherst), Learning from Measurements (UC Berkeley)

3. Applications: document classification (labeled features), information extraction (long-range dependencies), word alignment (symmetry), POS tagging (posterior sparsity), dependency parsing (linguistic knowledge, noisy labels)

4. Implementation: implementation guidance, tutorial on existing software packages

4 References

See http://sideinfo.wikkii.com.



Dual Decomposition for Natural Language Processing

Michael Collins
Department of Computer Science
Columbia University
New York, NY, USA
[email protected]

Alexander M. Rush
CSAIL
Massachusetts Institute of Technology
Cambridge, MA, USA
[email protected]

1 Introduction

For many tasks in natural language processing, finding the best solution requires a search over a large set of possible choices. Solving this decoding problem exactly can be complex and inefficient, and so researchers often use approximate techniques at the cost of model accuracy. In this tutorial, we present dual decomposition as an alternative method for decoding in natural language problems. Dual decomposition produces exact algorithms that rely only on basic combinatorial algorithms, like shortest path or minimum spanning tree, as building blocks. Since these subcomponents are straightforward to implement and are well-understood, the resulting decoding algorithms are often simpler and significantly faster than exhaustive search, while still producing certified optimal solutions in practice.

Dual decomposition is a variant of Lagrangian relaxation, a general method for combinatorial optimization, and is also closely related to belief propagation for both graphical models and combinatorial structures. Lagrangian relaxation has been applied to a wide variety of problems in other domains as diverse as network routing and auction pricing. Recently, this technique has been used in natural language processing to solve decoding problems in combined parsing and tagging, non-projective dependency parsing, MT system combination, CCG supertagging, and syntactic translation.

2 Content Overview

This tutorial presents dual decomposition as a general inference technique, while utilizing examples and applications from natural language processing. The goal is for attendees to learn to derive algorithms for problems in their domain and understand the formal guarantees and practical considerations in using these algorithms.

We begin the tutorial with a step-by-step construction of an algorithm for combined CFG parsing and trigram tagging. To analyze this algorithm, we derive formal properties for the techniques used in the construction. We then give further examples for how to derive similar algorithms for other difficult tasks in NLP and discuss some of the practical considerations for using these algorithms on real-world problems. In the last section of the tutorial, we explore the relationship between these algorithms and the theory of linear programming, focusing particularly on the connection between linear programming and dynamic programming algorithms. We conclude by using this connection to linear programming to develop practical tightening techniques that help obtain exact solutions.
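As a preview of that construction, a generic subgradient loop for the combined parsing-and-tagging example might be sketched as follows; the two subproblem solvers are assumed to be supplied by dynamic programs (e.g. CKY for the weighted CFG and Viterbi for the trigram tagger), and the constant step size is a simplification of the schedules discussed in the tutorial:

def dual_decompose(parse_argmax, tag_argmax, positions, tags,
                   iterations=50, step=0.5):
    # parse_argmax(u): tag assignment {i: t} of the best parse under scores + u
    # tag_argmax(u):   tag assignment {i: t} of the best tag sequence under scores - u
    u = {(i, t): 0.0 for i in positions for t in tags}  # Lagrange multipliers
    for _ in range(iterations):
        y = parse_argmax(u)  # decode the parsing subproblem
        z = tag_argmax(u)    # decode the tagging subproblem
        if y == z:
            return y         # agreement gives a certificate of optimality
        # subgradient step on multipliers wherever the two solutions disagree
        for i in positions:
            for t in tags:
                u[(i, t)] -= step * ((y[i] == t) - (z[i] == t))
    return y  # no certificate within the iteration limit; return last parse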

3 Outline

1. Step-By-Step Example Derivation

2. Formal Properties

3. Further Examples

(a) Dependency Parsing

(b) Translation Decoding

(c) Document-level Tagging

4. Practical Considerations

5. Linear Programming Interpretation

6. Connections between DP and LP

7. Tightening Methods for Exact Solutions


Author Index

Collins, Michael, 6

Daume III, Hal, 1
de la Higuera, Colin, 2
Druck, Gregory, 5

Ganchev, Kuzman, 5
Graca, Joao, 5

Heinz, Jeffrey, 2

Liu, Yang, 3

Maskey, Sameer, 3

Nenkova, Ani, 3

Pasca, Marius, 4

Rush, Alexander M., 6

van Zaanen, Menno, 2
