Lecture 7: Learning from Massive Datasets October 2013 Machine Learning for Language Technology Marina Santini, Uppsala University Department of Linguistics and Philology

Upload: others

Post on 25-Feb-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine Learning for Language Technology (santini.se/teaching/ml/Lecture07_ML4LT_MarinaSantini_2013.pdf)

Lecture 7:

Learning from Massive Datasets

October 2013

Machine Learning for Language Technology

Marina Santini, Uppsala University

Department of Linguistics and Philology

Outline

Lect. 7: Learning from Massive Datasets

Watch the pitfalls

Learning from massive datasets

Data Mining

Text Mining – Text Analytics

Web Mining

Big Data

Programming Languages and Framework for Big Data

Big Textual Data & Commercial Applications

Events, MeetUps, Coursera

Practical Machine Learning

Data Mining

Data mining is the extraction of implicit, previously unknown and potentially useful information from data (Witten and Frank, 2005)

Watch out!

Machine Learning is not just about:

1. Finding data and blindly applying learning algorithms to it

2. Blindly comparing machine learning methods, without taking into account:

1. Model complexity

2. Representativeness of training data distribution

3. Reliability of class labels

Remember: Practitioners’ expertise counts!

Massive Datasets

Space and Time

Three ways to make learning feasible (the old way)

Small subset

Parallelization

Data chunks

The new way:

Develop new algorithms with lower computational complexity

Increase background knowledge
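
One of the "old way" options listed above, processing the data in chunks, can be sketched as follows. This is a minimal illustration and is not taken from the lecture: a multinomial Naive Bayes text classifier whose counts are updated one chunk at a time, so the full dataset never has to be held in memory at once. The class name `ChunkedNaiveBayes`, the helper `chunks`, and the toy documents and labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class ChunkedNaiveBayes:
    """Multinomial Naive Bayes trained incrementally, one chunk at a time."""

    def __init__(self):
        self.class_counts = Counter()             # documents seen per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        self.vocab = set()

    def partial_fit(self, chunk):
        """Update the counts with one chunk of (text, label) pairs."""
        for text, label in chunk:
            words = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        """Return the class with the highest log-posterior (add-one smoothing)."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def chunks(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical toy data standing in for a stream too large to load at once.
data = [
    ("great phone love it", "pos"),
    ("terrible battery hate it", "neg"),
    ("love the screen great colours", "pos"),
    ("hate the slow terrible camera", "neg"),
]

model = ChunkedNaiveBayes()
for chunk in chunks(data, 2):
    model.partial_fit(chunk)   # only one chunk is processed at a time

prediction = model.predict("great screen")
```

Because the model only stores counts, each chunk can be discarded after `partial_fit`; the same idea would apply if the chunks were read from disk or from a stream.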

Domain Knowledge

Metadata

Semantic relation

Causal relation

Functional dependencies

Text Mining

Actionable information

Comprehensible information

Problems

Text Analytics

Definition: Text Analytics

A set of NLP techniques that provide some structure to textual documents and help identify and extract important information.

Set of NLP (Natural Language Processing) techniques

Common components of a text analytic package are:

Tokenization

Morphological Analysis

Syntactic Analysis

Named Entity Recognition

Sentiment Analysis

Automatic Summarization

Etc.
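
As a rough illustration of the first component in the list, tokenization, here is a minimal regular-expression tokenizer. It is a sketch under simple assumptions (English-like text, a hand-written pattern) and does not describe how any of the packages mentioned in this lecture work; the names `TOKEN_PATTERN` and `tokenize` are made up for the example.

```python
import re

# Assumed pattern: words with optional internal apostrophes or hyphens,
# numbers (optionally with a decimal part), or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:\.\d+)?|\S")


def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Text mining isn't easy: 85 percent is unstructured!")
# tokens == ['Text', 'mining', "isn't", 'easy', ':', '85',
#            'percent', 'is', 'unstructured', '!']
```

A pattern like this works on edited text but quickly breaks down on tweets, URLs or emoticons, which is one reason the later components in the list are harder than they look.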

NLP at Coursera (www.coursera.org)

NLP is pervasive

Ex: spell-checkers

Google Search

Google Mail

Facebook

Office Word

[…]

Sentiment Analysis

Text Analytics Products and Frameworks

Commercial Products:

Attensity

Clarabridge

Temis

Lexalytics

Texify

SAS

SPSS

IBM Cognos

etc.

Open Source Frameworks:

GATE

NLTK

UIMA

OpenNLP

etc.

However… (I)

NLP tools and applications (both commercial and open source) are not perfect. Research is still very active in all NLP fields.

Ex: Syntactic Parser

Connexor

What about parsing a tweet?

“My son, Ky/o, asked me for the first time today how my DAY was . . . I about melted. Told him that I had pizza for lunch. Response? No fair” (Twitter Tutorial 1: How to Tweet Well)

Why NLP and Text Analytics for Text Mining?

Why is it important to know that a word is a noun, a verb or the name of a brand?

Broadly speaking (think about these as features for a classification problem!):

Nouns and verbs (a.k.a. content words): nouns are important for topic detection; verbs are important if you want to identify actions or intentions.

Adjectives = sentiment identification.

Function words (a.k.a. stop words) are important for authorship attribution, plagiarism detection, etc.

etc.
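
The idea of treating word classes as classification features can be sketched as below. The example builds relative-frequency features from a small, hand-picked list of function words, the kind of feature the slide associates with authorship attribution. The word list, the function name `function_word_features` and the sample sentence are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

# Assumed, deliberately tiny list of English function words (stop words).
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that", "it"]


def function_word_features(text):
    """Return the relative frequency of each function word in `text`."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in FUNCTION_WORDS}


features = function_word_features("the cat sat in the hat and the dog ran to it")
# "the" occurs 3 times in 12 words, so features["the"] is 0.25
```

The resulting dictionary can be fed to any classifier as a fixed-length feature vector; content-word or adjective counts could be built the same way once a part-of-speech tagger is available.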

However… (II)

At present, the main pitfall of many NLP applications is that they are not flexible enough to:

Completely disambiguate language

Identify how language is used in different types of documents (a.k.a. genres).

For instance, language is used differently in tweets than in emails, and the language of emails differs from the language of academic papers, etc.

Tweaking NLP tools for different types of text, or resolving language ambiguity in an ad hoc manner, is often time-consuming, difficult and unrewarding…

What for?

Text summarization

Document clustering

Authorship attribution

Automatic metadata extraction

Entity extraction

Information extraction

Information discovery

ACTIONABLE INTELLIGENCE

Actionable Textual Intelligence

Business Intelligence (BI) + Customer Analytics + Social Network Analytics + Crisis Intelligence […] = Actionable Intelligence

Actionable Intelligence is information that:

1. must be accurate and verifiable

2. must be timely

3. must be comprehensive

4. must be comprehensible

5. !!! must give the power to make decisions and to act straightaway !!!

6. !!! must handle BIG BIG BIG UNSTRUCTURED TEXTUAL DATA !!!

Big Data [Wikipedia]:

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, new platforms of "big data" tools are being developed to handle various aspects of large quantities of data.

Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.

Big Unstructured TEXTUAL Data

“Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e‐mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and web pages.” [DM Review Magazine, February 2003 Issue]

ECONOMIC LOSS!

Merrill Lynch is one of the world's leading financial management and advisory companies, providing financial advice.

Simple search is not enough…

Of course, it is possible to use simple search. But simple search is unrewarding, because it is based on single terms.

“a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies” [Source: Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization" from "Managing Unstructured data in the organization", Prentice Hall 2008, pp. 1–13]
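
The contrast drawn in the quotation can be sketched in a few lines. Below, `simple_search` matches only the literal query term, while `expanded_search` also matches narrower terms taken from a small hand-written taxonomy. The taxonomy, the documents and both function names are hypothetical and only serve to illustrate the point; a real system would more plausibly rely on a curated thesaurus or ontology and on proper tokenization rather than substring matching.

```python
# Hypothetical taxonomy: each term maps to narrower terms (types of it).
TAXONOMY = {
    "felony": ["arson", "murder", "embezzlement", "vehicular homicide"],
}

# Hypothetical unstructured documents.
DOCUMENTS = [
    "The suspect was charged with a felony last year.",
    "Police are investigating a case of arson downtown.",
    "The accountant was convicted of embezzlement.",
    "The meeting was moved to Friday.",
]


def simple_search(term, documents):
    """Return documents that contain the literal term (crude substring match)."""
    return [doc for doc in documents if term in doc.lower()]


def expanded_search(term, documents, taxonomy):
    """Return documents that contain the term or any of its narrower terms."""
    terms = [term] + taxonomy.get(term, [])
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]


simple_hits = simple_search("felony", DOCUMENTS)                 # 1 document
expanded_hits = expanded_search("felony", DOCUMENTS, TAXONOMY)   # 3 documents
```

The simple search misses the arson and embezzlement documents even though both describe felonies, which is exactly the weakness the quotation points out.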

Programming languages and frameworks for big data

R

R is a statistical programming language. It is a free software programming language and a software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls and surveys of data miners are showing R's popularity has increased substantially in recent years (Wikipedia).

http://www.r-project.org/

MeetUps: R in Stockholm

Can R help out?

Can R help overcome NLP shortcomings and open a new direction in order to extract useful information from Big TEXTUAL Data?

Existing literature for linguists

Stefan Th. Gries (2013) Statistics for Linguistics with R: A Practical Introduction. De Gruyter Mouton. New Edition.

Stefan Th. Gries (2009) Quantitative Corpus Linguistics with R: A Practical Introduction. Routledge, Taylor & Francis Group (companion website).

R. Harald Baayen (2008) Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge University Press.

….

Companion website by Stefan Th. Gries

BNC = British National Corpus (PoS tagged)

BNC

The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. The latest edition is the BNC XML Edition, released in 2007.

The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.

R & the BNC: Excerpt from Google Books

What about Big Textual Data?

Non-standardized language

Non-standard texts

Electronic documents of all kinds, e.g. formal, informal, short, long, private, public, etc.

Non-distributed systems

Open Source

R

Scala (also distributed systems)

RapidMiner

Weka

Commercial

SPSS

SAS

MATLAB

The name Scala is a portmanteau of "scalable" and "language", signifying that it is designed to grow with the demands of its users. James Strachan, the creator of Groovy, described Scala as a possible successor to Java.

From The Economist:

The Big Data scenario

Commercial applications for Big Textual Data

Recorded Future: web intelligence (anticipating emerging threats, future trends, anticipating competitors’ actions, etc.)

Gavagai: large-scale textual analysis (prediction and future trends)

Thanks to Staffan Truvé for the following slides

Size

In a few pictures…

Metrics, structure and time

Metric

Structure

Time

Facts

Pipeline

Multi-Language

Text Analytics

Predictions

Gavagai

Jussi Karlgren (PhD in Stylistics in Information Retrieval)

Magnus Sahlgren (PhD thesis in distributional semantics)

Fredrik Olsson (PhD thesis in Active Learning) (co-workers at SICS)

The indeterminacy of translation is a thesis propounded by 20th-century American analytic philosopher W. V. Quine.

Quine uses the example of the word "gavagai" uttered by a native speaker of the unknown language Arunta upon seeing a rabbit. A speaker of English could do what seems natural and translate this as "Lo, a rabbit." But other translations would be compatible with all the evidence he has: "Lo, food"; "Let's go hunting"; "There will be a storm tonight" (these natives may be superstitious)… (wikipedia)

Ethersource presented

Thanks to F. Olsson for the following slides

Associations

Language is flux

Learning from use

Scope

Architecture

Web vs printed world

Noise…

Multi-linguality

SICS

Watch the videos!

Big Data MeetUp, Stockholm

BIG DATA communities

Future Directions in Machine Learning for Language Technology

Deluge of data

Little linguistic analysis in the realm of big-data real-world platforms and applications

Top-down systems cannot efficiently deal with the irregularity and unpredictability of big textual data

Data-driven systems can make it. However,

…we know that computers are not at ease with natural languages used by humans, unless they learn how to learn the linguistic structure underlying natural language from data…

For a data-driven approach…

Annotated datasets that are needed for complete supervised machine learning are costly and time-consuming to produce, and require specialist expertise.

Is complete supervision even thinkable when we talk about tera-, peta- or yottabytes? How big should the training set be then?

Alternative solutions:

Semi-supervised methods (combination of labelled and unlabelled data)

Weakly supervised methods (human-constructed rules are typically used to guide the unsupervised learner)

Results from unsupervised learning still cannot compete with supervised learning in many tasks…
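
To make the semi-supervised option concrete, here is a minimal self-training sketch: a classifier is trained on the few labelled examples, labels the unlabelled examples it is most confident about, and is retrained on the enlarged set. Self-training is only one of several semi-supervised techniques and is chosen here for brevity; it is not presented in the lecture. The nearest-centroid classifier, the one-dimensional toy data, the function names and the confidence margin are all assumptions made for the example.

```python
def centroids(labelled):
    """Mean feature value per class for (value, label) pairs."""
    sums, counts = {}, {}
    for value, label in labelled:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(value, cents):
    """Return (label, margin): the nearest centroid and how much closer it is
    than the runner-up (a crude confidence score). Needs at least two classes."""
    ranked = sorted(cents, key=lambda label: abs(value - cents[label]))
    best, second = ranked[0], ranked[1]
    margin = abs(value - cents[second]) - abs(value - cents[best])
    return best, margin


def self_train(labelled, unlabelled, min_margin=2.0, max_rounds=10):
    """Repeatedly move confidently classified points into the labelled set."""
    labelled, unlabelled = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        cents = centroids(labelled)
        confident, rest = [], []
        for value in unlabelled:
            label, margin = classify(value, cents)
            (confident if margin >= min_margin else rest).append((value, label))
        if not confident:
            break  # nothing new to learn from
        labelled.extend(confident)
        unlabelled = [value for value, _ in rest]
    return centroids(labelled), unlabelled


# Hypothetical data: two labelled points and several unlabelled ones.
labelled = [(1.0, "short"), (9.0, "long")]
unlabelled = [2.0, 3.0, 8.0, 7.0, 5.2]

final_centroids, still_unlabelled = self_train(labelled, unlabelled)
label_for_4, _ = classify(4.0, final_centroids)
```

In this toy run the ambiguous point 5.2 is never labelled because its margin stays below the threshold, which illustrates the main design choice in self-training: a strict threshold adds fewer examples but limits the risk of reinforcing the model's own mistakes.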

A new way to explore: Incomplete Supervision

Relies on partially labelled data:

” Human experts — or possibly a crowd of laymen — annotate

text with some linguistic structure related to the structure

that one wants to predict. This data is then used for partially

supervised learning with a statistical model that exploits the

annotated structure to infer the linguistic structure of interest.” p. 4

Example

”…it is possible to construct accurate and robust part-of-speech taggers

for a wide range of languages, by combining (1) manually annotated

resources in English, or some other language for which such resources are

already available, with (2) a crowd-sourced target-language specific

lexicon, which lists the potential parts of speech that each word may take

in some context, at least for a subset of the words.

Both (1) and (2) only provide partial information for the part-of-speech

tagging task. However, taken together they turn out to provide substantially

more information than either taken alone.” pp. 4-6

Oscar Täckström “Predicting Linguistic Structure with Incomplete and

Cross-Lingual Supervision” PhD Thesis, Uppsala University, 2013

(http://soda.swedish-ict.se/5513/)

Conclusions

This course is an introduction to “Machine Learning for Language Technology”.

You get a flavour of the problems we come across when devising models for enabling machines to analyse and make sense of natural human language.

The next big big big step is to bring as much linguistic awareness as possible into big data.

Reading

Witten and Frank (2005) Ch. 8

Thanx for your attention!
