session 2 - akyildiz, beinecke, yee at mlconf nyc

Contact: Ben Beinecke | [email protected] | 646-918-6435 | Copyright © 2015

Scaling Your Machine Learning

MLConf 2015 NYC

Innovation relies on iteratively testing ideas quickly and easily

But Big Data Has Broken The Iterative Workflow

Prototyping is essential for data analysis, but big data has made prototyping expensive, painful, and slow.

The Problem:

1. Small data tools don’t scale
• e.g. MATLAB

2. Fixed frameworks are not customizable enough for real-world problems
• e.g. Hadoop MapReduce

3. Customized solutions break, are hard to modify, and are expensive to maintain
• e.g. C++ with MPI

Hand-Coded Infrastructure Isn’t Practical

Business Logic (Algorithm Code)

Implementation Logic (Infrastructure Code)

Data Science 1.0 (Business Logic Encumbered with Implementation)

Data Science 2.0: Automatic (Business Logic free from Implementation)

Apply Learning Techniques to Data Distribution and Parallelization

[Figure: Data Science 2.0 vs. Data Science 1.0; labels: CPUs, RAM, Smart Compute]

Part-of-Speech Tagging for Noisy Data Sets

Connie Yee

Text Analytics and Machine Learning (TAML)

Financial & Risk

Part-of-Speech Tagging

• Many uses including:

– Input to a full parser in order to facilitate deep processing

– Named-entity recognition

– How to pronounce “lead”?

INPUT:     Plays    well            with  others
AMBIGUITY: NNS/VBZ  UH/JJ/NN/RB     IN    NNS
OUTPUT:    VBZ      RB              IN    NNS

Supervised Classification

A. Training

Training Data → Feature Generator → features → Trainer using Parameter Estimation → Classifier Model

B. Decoding (Prediction on unseen data)

Input Sentence → Feature Generator → features → Decoder using Beam Search → Tag Sequence

A model includes parameter values for an event and all its possible outcomes.
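
The decoding step (input sentence, feature generator, decoder using beam search, tag sequence) could look roughly like the sketch below. This is an illustration under stated assumptions, not the TAML implementation: `beam_search`, `toy_score`, `TOY_PROBS`, and the beam width are hypothetical, and the hand-set probabilities stand in for a trained classifier model.

```python
import math


def beam_search(words, tags, score, beam_width=3):
    """Return the best tag sequence found by beam search.

    score(words, i, prev_tags, tag) must return a log-probability for
    assigning `tag` to words[i] given the tags chosen so far.
    """
    beam = [(0.0, [])]  # (cumulative log-probability, tags so far)
    for i in range(len(words)):
        candidates = []
        for logp, seq in beam:
            for tag in tags:
                step = score(words, i, seq, tag)
                candidates.append((logp + step, seq + [tag]))
        # Prune: keep only the `beam_width` best partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1]


# Hypothetical toy "model": hand-set probabilities, for illustration only.
TOY_PROBS = {
    "plays": {"VBZ": 0.6, "NNS": 0.4},
    "well": {"RB": 0.7, "UH": 0.1, "JJ": 0.1, "NN": 0.1},
    "with": {"IN": 1.0},
    "others": {"NNS": 1.0},
}


def toy_score(words, i, prev_tags, tag):
    """Look up a hand-set probability; unseen (word, tag) pairs get a tiny value."""
    p = TOY_PROBS.get(words[i].lower(), {}).get(tag, 1e-6)
    return math.log(p)


TAGS = ["VBZ", "NNS", "RB", "UH", "JJ", "NN", "IN"]
best = beam_search("Plays well with others".split(), TAGS, toy_score)
print(best)
```

A real scorer would condition on `prev_tags` (the toy one ignores it), which is where keeping several hypotheses in the beam pays off over a greedy left-to-right pass.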

Tagging News and Twitter Data

• Wall St. Journal treebank from UPenn (PTB)

– Training: 38k sentences

– Test: 5k sentences

• Features

– Preceding tags

– Words surrounding target word

– Word shape, such as case, prefix, and suffix
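
The three feature families listed above (preceding tags, surrounding words, word shape) might be generated along these lines. A minimal sketch only: the feature names, the window of one word on each side, and the three-character prefix/suffix length are assumptions, not details given in the talk.

```python
def word_shape(word):
    """Collapse a word to a coarse case/digit pattern, e.g. 'McDonald' -> 'XxXx'."""
    shape = []
    for ch in word:
        if ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        elif ch.isdigit():
            c = "d"
        else:
            c = ch  # keep punctuation as-is
        # Merge runs of the same class into a single symbol.
        if not shape or shape[-1] != c:
            shape.append(c)
    return "".join(shape)


def generate_features(words, i, prev_tags):
    """Build a feature dict for words[i] given the tags already assigned."""
    word = words[i]
    return {
        # Preceding tags
        "prev_tag": prev_tags[-1] if prev_tags else "<s>",
        "prev_two_tags": "+".join(prev_tags[-2:]) if len(prev_tags) >= 2 else "<s>",
        # Words surrounding the target word
        "word": word.lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>",
        # Word shape: case, prefix, suffix
        "is_capitalized": word[:1].isupper(),
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "shape": word_shape(word),
    }


feats = generate_features(["Plays", "well", "with", "others"], 1, ["VBZ"])
print(feats)
```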

System    Accuracy (%)
TAML      96.6

• Twitter dataset from CMU, sampled from 10/27/2010

– Training: 1000 tweets

– Test: 500 tweets

• Build features on top of News features

– Word clustering
    111010100010 : "lmao", "lmfao", "lmaoo", …
    111010100011 : "haha", "hahaha", "hehe", …

– Use PTB as a soft-constraint tag dictionary
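
One plausible way to turn the two Twitter additions into features is sketched below. Assumptions, labeled as such: the cluster IDs are treated as hierarchical bit-string paths (so shared prefixes mean related clusters), the prefix lengths are arbitrary, and "soft constraint" is read as "the dictionary contributes features rather than restricting which tags the decoder may choose". The lookup tables and function names are hypothetical; only the two cluster IDs and the tag ambiguities for "plays" and "well" come from the slides.

```python
# Cluster IDs as shown on the slide; the dict itself is a hypothetical lookup.
WORD_CLUSTERS = {
    "lmao": "111010100010",
    "lmfao": "111010100010",
    "lmaoo": "111010100010",
    "haha": "111010100011",
    "hahaha": "111010100011",
    "hehe": "111010100011",
}

# Hypothetical tag dictionary: word -> tags observed for it in PTB.
PTB_TAG_DICT = {
    "plays": {"NNS", "VBZ"},
    "well": {"UH", "JJ", "NN", "RB"},
}


def cluster_features(word, prefix_lengths=(4, 8, 12)):
    """Emit bit-string prefixes so nearby clusters share coarse features."""
    bits = WORD_CLUSTERS.get(word.lower())
    if bits is None:
        return {}
    return {f"cluster_prefix_{n}": bits[:n] for n in prefix_lengths}


def tag_dict_features(word):
    """Soft constraint: report which tags PTB allows, without forbidding others."""
    tags = PTB_TAG_DICT.get(word.lower(), set())
    return {f"ptb_allows_{t}": True for t in sorted(tags)}


print(cluster_features("lmao"))
print(cluster_features("haha"))
print(tag_dict_features("Plays"))
```

With these prefixes, "lmao" and "haha" share the 4-bit and 8-bit features and differ only at full length, so a model can generalize across both clusters from very little Twitter training data.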

System                   Accuracy (%)
TAML – news features     74.56
+ normalization          84.84
+ word clustering        88.37
+ tag dictionary         88.53
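
To make the contribution of each step in the Twitter results explicit, the incremental gains can be computed directly from the reported accuracies. The numbers are the ones on the slide; the variable names are only for illustration.

```python
# Accuracies (%) as reported on the slide, in the order the steps were added.
results = [
    ("TAML - news features", 74.56),
    ("+ normalization", 84.84),
    ("+ word clustering", 88.37),
    ("+ tag dictionary", 88.53),
]

# Gain of each step over the previous row, in percentage points.
gains = [
    (name, round(acc - results[i][1], 2))
    for i, (name, acc) in enumerate(results[1:])
]

for name, gain in gains:
    print(f"{name}: +{gain:.2f} points")

total_gain = round(results[-1][1] - results[0][1], 2)
print(f"total: +{total_gain:.2f} points")
```

Normalization accounts for most of the improvement (10.28 of 13.97 points), with word clustering adding 3.53 and the tag dictionary 0.16.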

Sample Tagged Twitter Data

• Spending_V the_D day_N withhh_P mommma_N !_,

• Its_L hard_A for_P me_O when_R I_O have_V too_R ask_V ,_, is_V it_O really_R that_P dull_A !?_,

• @JBieberzLuvies_@ LOL_! i_O ranther_R go_V see_V payton_^ rae_^ and_& MAYBE_R caitlin_^ beadles_N XDD_E

N Common noun

O Pronoun

^ Proper noun

V Verb

D Determiner

P Pre- or postposition, or subordinating conjunction

R Adverb

A Adjective

L Nominal + verbal

@ At-mention

E Emoticon

, Punctuation

! Interjection
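
The `word_TAG` notation in the samples can be read back into (word, tag) pairs with a few lines of code. A small sketch: `TAG_NAMES` copies the legend above, and `parse_tagged` and `describe` are hypothetical helper names. Splitting on the last underscore keeps tokens such as `!_,` and `@JBieberzLuvies_@` intact.

```python
# Legend as given on the slide.
TAG_NAMES = {
    "N": "Common noun",
    "O": "Pronoun",
    "^": "Proper noun",
    "V": "Verb",
    "D": "Determiner",
    "P": "Pre- or postposition, or subordinating conjunction",
    "R": "Adverb",
    "A": "Adjective",
    "L": "Nominal + verbal",
    "@": "At-mention",
    "E": "Emoticon",
    ",": "Punctuation",
    "!": "Interjection",
}


def parse_tagged(line):
    """Split 'word_TAG' tokens on the LAST underscore into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


def describe(pairs):
    """Attach the legend description; tags missing from the legend are flagged."""
    return [(w, t, TAG_NAMES.get(t, "(not in legend)")) for w, t in pairs]


sample = "Spending_V the_D day_N withhh_P mommma_N !_,"
pairs = parse_tagged(sample)
for word, tag, name in describe(pairs):
    print(f"{word}\t{tag}\t{name}")
```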
