Lecture 7:
Learning from Massive Datasets
October 2013
Machine Learning for Language Technology
Marina Santini, Uppsala University
Department of Linguistics and Philology
Outline
Lect. 7: Learning from Massive Datasets
Watch the pitfalls
Learning from massive datasets
Data Mining
Text Mining – Text Analytics
Web Mining
Big Data
Programming Languages and Framework for Big Data
Big Textual Data & Commercial Applications
Events, MeetUps, Coursera
Practical Machine Learning
Data Mining
Data mining is the extraction of implicit, previously
unknown and potentially useful information from data
(Witten and Frank, 2005)
Watch out!
Machine Learning is not just about:
1. Finding data and blindly applying learning algorithms to it
2. Blindly comparing machine learning methods. A comparison should take into account:
1. Model complexity
2. Representativeness of training data distribution
3. Reliability of class labels
Remember: Practitioners’ expertise counts!
Massive Datasets
Space and Time
Three ways to make learning feasible (the old way)
Small subset
Parallelization
Data chunks
The new way:
Develop new algorithms with lower computational complexity
Increase background knowledge
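Two of the older strategies on this slide, working on a small subset and processing the data in chunks, can be sketched in a few lines of Python. This is a minimal illustration rather than anything from the lecture: the function names (`reservoir_sample`, `chunks`, `count_tokens_in_chunks`) and the token-counting task are assumptions chosen for the example.

```python
import random
from itertools import islice


def reservoir_sample(stream, k, seed=0):
    """Small subset: keep a uniform random sample of k items from a stream
    whose length is not known in advance (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def chunks(stream, size):
    """Data chunks: yield successive lists of at most `size` items."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def count_tokens_in_chunks(lines, chunk_size):
    """Accumulate whitespace-token counts one chunk at a time, so the full
    dataset never has to be held in memory at once."""
    counts = {}
    for chunk in chunks(lines, chunk_size):
        for line in chunk:
            for token in line.split():
                counts[token] = counts.get(token, 0) + 1
    return counts
```

Both strategies read their input once and keep only `k` items, or one chunk plus the running counts, in memory.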
Domain Knowledge
Metadata
Semantic relation
Causal relation
Functional dependencies
Text Mining
Actionable information
Comprehensible information
Problems
Text Analytics
Definition: Text Analytics
A set of NLP techniques that provide some structure to
textual documents and help identify and extract
important information.
Set of NLP (Natural Language Processing) techniques
Common components of a text analytics package are:
Tokenization
Morphological Analysis
Syntactic Analysis
Named Entity Recognition
Sentiment Analysis
Automatic Summarization
Etc.
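Tokenization, the first component in the list above, can be illustrated with a small regular-expression sketch. The pattern and the function name `tokenize` are assumptions made for this example; real text analytics packages use considerably more elaborate rules.

```python
import re

# A word (optionally with an internal apostrophe part, as in "don't"),
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("Don't panic, it's fine."))
# ["Don't", 'panic', ',', "it's", 'fine', '.']
```

The later components (morphological analysis, parsing, named entity recognition and so on) typically take such a token list as their input.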
NLP at Coursera (www.coursera.org)
NLP is pervasive
Ex: spell-checkers
Google Search
Google Mail
Office Word
[…]
NLP is pervasive
Ex: Named Entity Recognition
Opinion mining
Brand Trends
Conversation clouds
on web magazines
and online
newspapers…
Sentiment Analysis
Text Analytics Products and Frameworks
Commercial Products:
Attensity
Clarabridge
Temis
Lexalytics
Texify
SAS
SPSS
IBM Cognos
etc.
Open Source Frameworks:
GATE
NLTK
UIMA
OpenNLP
etc.
However… (I)
NLP tools and applications (both commercial and open
source) are not perfect. Research is still very active in all
NLP fields.
Ex: Syntactic Parser
Connexor
What about parsing a tweet?
“My son, Ky/o, asked me for the first time today how my DAY was . . . I about melted. Told him that I had pizza for lunch. Response? No fair” (Twitter Tutorial 1: How to Tweet Well)
Why NLP and Text Analytics for Text Mining?
Why is it important to know that a word is a noun, or a
verb, or the name of a brand?
Broadly speaking (Think about these as features for a
classification problem!)
Nouns and verbs (a.k.a. content words): Nouns are important
for topic detection; verbs are important if you want to identify
actions or intentions.
Adjectives = sentiment identification.
Function words (a.k.a. stop words) are important for
authorship attribution, plagiarism detection, etc.
etc.
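The idea of treating word classes as classification features can be sketched for the function-word case. The example below turns a text into a vector of relative function-word frequencies, the kind of feature often used for authorship attribution. The short `FUNCTION_WORDS` list and the function name are illustrative assumptions, not a standard resource.

```python
from collections import Counter

# A tiny illustrative list; real stylometric studies use far longer ones.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it")


def function_word_features(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


features = function_word_features("the cat and the dog")
print(dict(zip(FUNCTION_WORDS, features)))
```

For topic detection or sentiment identification, the same pattern would apply with noun or adjective counts taken from a part-of-speech tagger instead of a fixed word list.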
However… (II)
At present, the main pitfall of many NLP applications is that they are not flexible enough to:
Completely disambiguate language
Identify how language is used in different types of documents (a.k.a. genres).
For instance, language is used differently in tweets than in emails, and the language of emails differs from that of academic papers, etc.
Tweaking NLP tools for different types of text, or resolving language ambiguity in an ad-hoc manner, is often time-consuming, difficult and unrewarding…
What for?
Text summarization
Document clustering
Authorship attribution
Automatic metadata extraction
Entity extraction
Information extraction
Information discovery
ACTIONABLE INTELLIGENCE
Actionable Textual Intelligence
Business Intelligence (BI) + Customer Analytics + Social Network Analytics + Crisis Intelligence […] = Actionable Intelligence
Actionable Intelligence is information that:
1. must be accurate and verifiable
2. must be timely
3. must be comprehensive
4. must be comprehensible
5. !!! give the power to make decisions and to act straightaway !!!
6. !!! must handle BIG BIG BIG UNSTRUCTURED TEXTUAL DATA !!!
Big Data
BIG DATA [Wikipedia]:
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, new platforms of "big data" tools are being developed to handle various aspects of large quantities of data.
Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.
Big Unstructured
TEXTUAL Data
“Merrill Lynch estimates that more than 85 percent of all business
information exists as unstructured data –commonly appearing in
e‐mails, memos, notes from call centers and support
operations, news, user groups, chats, reports, letters,
surveys, white papers, marketing material, research,
presentations and web pages.” [DM Review Magazine,
February 2003 Issue]
ECONOMIC LOSS!
Merrill Lynch is one of the world's leading financial management and advisory companies, providing financial advice.
Simple search is not enough…
Of course, it is possible to use simple search. But simple
search is unrewarding, because it is based on single terms.
“a search is made on the term felony. In a simple search, the
term felony is used, and everywhere there is a reference to
felony, a hit to an unstructured document is made. But a simple
search is crude. It does not find references to crime, arson,
murder, embezzlement, vehicular homicide, and such, even though
these crimes are types of felonies” [ Source: Inmon, B. & A.
Nesavich, "Unstructured Textual Data in the Organization"
from "Managing Unstructured data in the organization",
Prentice Hall 2008, pp. 1–13]
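The limitation described in the quotation can be made concrete with a small sketch that contrasts a single-term search with a search expanded through a concept taxonomy. The `TAXONOMY` dictionary, the sample documents and the function names are hypothetical and exist only for this illustration.

```python
# Hypothetical hand-made taxonomy: a broad term mapped to narrower terms.
TAXONOMY = {"felony": {"arson", "murder", "embezzlement"}}


def simple_search(term, docs):
    """Return indices of documents that contain `term` as a token."""
    return [i for i, doc in enumerate(docs) if term in doc.lower().split()]


def expanded_search(term, docs, taxonomy=TAXONOMY):
    """Return indices of documents that contain `term` or any narrower term."""
    terms = {term} | taxonomy.get(term, set())
    return [i for i, doc in enumerate(docs) if terms & set(doc.lower().split())]


docs = [
    "He was charged with a felony",
    "The arson case was closed",
    "Lunch was pizza",
]
print(simple_search("felony", docs))    # [0]
print(expanded_search("felony", docs))  # [0, 1]
```

The simple search misses the arson document even though arson is a type of felony; the expanded search finds it because the taxonomy supplies the missing link.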
Programming languages and
frameworks for big data
R
R is a statistical programming language. It is a free software programming language and a software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls and surveys of data miners show that R's popularity has increased substantially in recent years (Wikipedia).
http://www.r-project.org/
MeetUps: R in Stockholm
Can R help out?
Can R help overcome NLP shortcomings and open a new
direction in order to extract useful information from Big
TEXTUAL Data?
Existing literature for linguists
Stefan Th. Gries (2013) Statistics for Linguistics with R: A Practical Introduction. De Gruyter Mouton. New Edition.
Stefan Th. Gries (2009) Quantitative Corpus Linguistics with R: A Practical Introduction. Routledge, Taylor & Francis Group (companion website).
R. Harald Baayen (2008) Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge University Press.
….
Companion website by Stefan Th. Gries
BNC = British National Corpus (PoS tagged)
BNC
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. The latest edition is the BNC XML Edition, released in 2007.
The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.
R & the BNC: Excerpt from Google Books
What about Big Textual Data?
Non-standardized language
Non-standard texts
Electronic documents of all kinds, e.g. formal, informal,
short, long, private, public, etc.
Non-distributed systems
Open Source
R
Scala (also distributed systems)
Rapid Miner
Weka
…
Commercial
SPSS
SAS
MatLab
…
The name Scala is a portmanteau of "scalable" and "language", signifying that it is designed to grow with the demands of its users. James Strachan, the creator of Groovy, described Scala as a possible successor to Java.
From The Economist:
The Big Data scenario
Commercial applications for Big Textual Data
Recorded Future: web intelligence (anticipating emerging threats, future trends, anticipating competitors’ actions, etc.)
Gavagai: large-scale textual analysis (prediction and future trends)
Thanks to Staffan Truvé for the following slides
Size
In a few pictures…
Metrics, structure and time
Metric
Structure
Time
Facts
Pipeline
Multi-Language
Text Analytics
Predictions
Gavagai
Jussi Karlgren (PhD in Stylistics in Information Retrieval)
Magnus Sahlgren (PhD thesis in distributional semantics)
Fredrik Olsson (PhD thesis in Active Learning)
(co-workers at SICS)
The indeterminacy of translation is a thesis propounded by 20th-century American analytic philosopher W. V. Quine.
Quine uses the example of the word "gavagai" uttered by a native speaker of the unknown language Arunta upon seeing a rabbit. A speaker of English could do what seems natural and translate this as "Lo, a rabbit." But other translations would be compatible with all the evidence he has: "Lo, food"; "Let's go hunting"; "There will be a storm tonight" (these natives may be superstitious)… (wikipedia)
Ethersource presented
Thanks to F. Olsson for the following slides
Associations
Language is flux
Learning from use
Scope
Architecture
Web vs printed world
Noise…
Multi-linguality
SICS
Watch the videos!
Big Data MeetUp, Stockholm
BIG DATA
communities
Future Directions in Machine Learning for
Language Technology
Deluge of data
Little linguistic analysis in the realm of big-data real-world
platforms and applications
Top-down systems cannot efficiently deal with irregularity
and unpredictability of big textual data
Data-driven systems can cope. However,
…we know that computers are not at ease with the natural
languages used by humans, unless they learn how to learn the
linguistic structure underlying natural language from data…
![Page 64: Machine Learning for Language Technology - …stp.lingfil.uu.se/~santinim/ml/Lecture07_ML4LT_MarinaSantini_2013.pdf · Machine Learning for Language Technology ... (Twitter Tutorial](https://reader035.vdocuments.mx/reader035/viewer/2022062909/5b5b32c27f8b9ab8578d8afe/html5/thumbnails/64.jpg)
For a data-driven approach…
Annotated datasets needed for fully supervised machine learning are costly and time-consuming to produce, and they require specialist expertise.
Is complete supervision even thinkable when we talk about tera-, peta- or yottabytes? How big should the training set be, then?
Alternative solutions:
Semi-supervised methods (combination of labelled and unlabelled data)
Weakly supervised methods (human-constructed rules are typically used to guide the unsupervised learner)
Unsupervised learning results still cannot compete with supervised learning in many tasks…
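To make the semi-supervised idea concrete, here is a minimal sketch of self-training, one common way of combining labelled and unlabelled data. It is an illustration only and not a method taken from the lecture: the tiny Naive Bayes classifier, the function names (`train`, `posterior`, `self_train`), the confidence threshold and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train(labelled):
    """Fit a multinomial Naive Bayes model on (tokens, label) pairs."""
    class_counts = Counter(label for _, label in labelled)
    word_counts = defaultdict(Counter)
    for tokens, label in labelled:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def posterior(model, tokens):
    """Return {label: P(label | tokens)} using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        n_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        log_scores[label] = score
    # Normalise in log space to avoid numerical underflow.
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(exp.values())
    return {label: v / z for label, v in exp.items()}


def self_train(labelled, unlabelled, threshold=0.6, max_rounds=5):
    """Repeatedly add confidently predicted unlabelled documents to the training set."""
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        model = train(labelled)
        confident, rest = [], []
        for tokens in pool:
            probs = posterior(model, tokens)
            label = max(probs, key=probs.get)
            (confident if probs[label] >= threshold else rest).append((tokens, label))
        if not confident:
            break  # nothing new can be labelled with enough confidence
        labelled.extend(confident)
        pool = [tokens for tokens, _ in rest]
    return train(labelled)


# Toy data (invented for illustration): two labelled and four unlabelled documents.
labelled = [(["good", "great"], "pos"), (["bad", "awful"], "neg")]
unlabelled = [["great", "fine"], ["awful", "poor"], ["fine", "nice"], ["poor", "dull"]]

baseline = train(labelled)
final = self_train(labelled, unlabelled)
print(posterior(baseline, ["nice"]))  # supervised only: "nice" was never seen
print(posterior(final, ["nice"]))     # after self-training on the unlabelled pool
```

The supervised-only model has no evidence about the word "nice", whereas the self-trained model has picked it up through co-occurrence with words it could already classify. In practice the threshold matters a great deal: a low value lets early mistakes reinforce themselves in later rounds.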
![Page 65: Machine Learning for Language Technology - …stp.lingfil.uu.se/~santinim/ml/Lecture07_ML4LT_MarinaSantini_2013.pdf · Machine Learning for Language Technology ... (Twitter Tutorial](https://reader035.vdocuments.mx/reader035/viewer/2022062909/5b5b32c27f8b9ab8578d8afe/html5/thumbnails/65.jpg)
A new way to explore: Incomplete Supervision
Relies on partially labelled data:
” Human experts — or possibly a crowd of laymen — annotate
text with some linguistic structure related to the structure
that one wants to predict. This data is then used for partially
supervised learning with a statistical model that exploits the
annotated structure to infer the linguistic structure of interest.” (Täckström, 2013, p. 4)
![Page 66: Machine Learning for Language Technology - …stp.lingfil.uu.se/~santinim/ml/Lecture07_ML4LT_MarinaSantini_2013.pdf · Machine Learning for Language Technology ... (Twitter Tutorial](https://reader035.vdocuments.mx/reader035/viewer/2022062909/5b5b32c27f8b9ab8578d8afe/html5/thumbnails/66.jpg)
Example
”…it is possible to construct accurate and robust part-of-speech taggers
for a wide range of languages, by combining (1) manually annotated
resources in English, or some other language for which such resources are
already available, with (2) a crowd-sourced target-language specific
lexicon, which lists the potential parts of speech that each word may take
in some context, at least for a subset of the words.
Both (1) and (2) only provide partial information for the part-of-speech
tagging task. However, taken together they turn out to provide substantially
more information than either taken alone.” pp. 4-6
Oscar Täckström, “Predicting Linguistic Structure with Incomplete and
Cross-Lingual Supervision”, PhD thesis, Uppsala University, 2013
(http://soda.swedish-ict.se/5513/)
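The following is a simplified illustration of how two partial sources of information can be combined. It is not the model from the thesis. It assumes that source (1) has already been turned into one projected tag per target-language token (or `None` where no projection exists), and that source (2) is a lexicon mapping some words to their possible tags. The function name `allowed_tags`, the combination rule, the 12-tag universal tagset and the toy Swedish entries are all assumptions made for this sketch.

```python
# The 12 universal part-of-speech tags (assumed tagset for this sketch).
ALL_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", ".", "X"}


def allowed_tags(word, projected_tag, lexicon):
    """Return the set of tags a tagger would be allowed to choose for one token.

    lexicon       -- dict: lower-cased word -> set of possible tags (partial coverage)
    projected_tag -- tag transferred from the source language, or None
    """
    lexicon_tags = lexicon.get(word.lower(), set())
    if lexicon_tags:
        # The lexicon knows the word: trust the projection only if they agree.
        if projected_tag in lexicon_tags:
            return {projected_tag}
        return set(lexicon_tags)
    if projected_tag is not None:
        # Word missing from the lexicon: fall back to the projected tag.
        return {projected_tag}
    # No information from either source: every tag stays possible.
    return set(ALL_TAGS)


# Toy example (invented entries, for illustration only).
lexicon = {"ser": {"VERB"}, "en": {"DET", "NUM"}, "fisk": {"NOUN", "VERB"}}
sentence = ["jag", "ser", "en", "fisk"]
projected = ["PRON", "NOUN", None, "NOUN"]  # one noisy projection per token

constraints = [allowed_tags(w, t, lexicon) for w, t in zip(sentence, projected)]
for word, tags in zip(sentence, constraints):
    print(word, sorted(tags))
```

Neither source is sufficient alone in this toy case: the lexicon says nothing about "jag" and leaves "fisk" ambiguous, while the projection mislabels "ser". Taken together they narrow three of the four tokens down to a single candidate tag, and a statistical tagger trained under such constraints would then resolve the remaining ambiguity from context.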
![Page 67: Machine Learning for Language Technology - …stp.lingfil.uu.se/~santinim/ml/Lecture07_ML4LT_MarinaSantini_2013.pdf · Machine Learning for Language Technology ... (Twitter Tutorial](https://reader035.vdocuments.mx/reader035/viewer/2022062909/5b5b32c27f8b9ab8578d8afe/html5/thumbnails/67.jpg)
Conclusions
This course is an introduction to “Machine Learning for Language Technology”.
You get a flavour of the problems we come across when devising models that enable machines to analyse and make sense of natural human language.
The next big step is to bring as much linguistic awareness as possible into big data.
![Page 68: Machine Learning for Language Technology - …stp.lingfil.uu.se/~santinim/ml/Lecture07_ML4LT_MarinaSantini_2013.pdf · Machine Learning for Language Technology ... (Twitter Tutorial](https://reader035.vdocuments.mx/reader035/viewer/2022062909/5b5b32c27f8b9ab8578d8afe/html5/thumbnails/68.jpg)
Reading
Witten and Frank (2005) Ch. 8
![Page 69: Machine Learning for Language Technology - …stp.lingfil.uu.se/~santinim/ml/Lecture07_ML4LT_MarinaSantini_2013.pdf · Machine Learning for Language Technology ... (Twitter Tutorial](https://reader035.vdocuments.mx/reader035/viewer/2022062909/5b5b32c27f8b9ab8578d8afe/html5/thumbnails/69.jpg)
Thanks for your attention!