short course robust optimization and machine learning [.2 ...people.eecs.berkeley.edu › ~elghaoui...

34
Robust Optimization & Machine Learning 7. Text Analytics Information Overload Sparse Machine Learning Topic imaging Dynamic images Cross-language imaging ASRS Study References Short Course Robust Optimization and Machine Learning Lecture 7: Sparse Machine Learning for Text Analytics Laurent El Ghaoui EECS and IEOR Departments UC Berkeley Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012

Upload: others

Post on 04-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Short CourseRobust Optimization and Machine Learning

Lecture 7: Sparse Machine Learning for Text Analytics

Laurent El Ghaoui

EECS and IEOR DepartmentsUC Berkeley

Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012

January 19, 2012

Page 2: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Outline

Information Overload

Sparse Machine Learning

Topic imagingDynamic imagesCross-language imaging

ASRS Study

References

Page 3: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Outline

Information Overload

Sparse Machine Learning

Topic imagingDynamic imagesCross-language imaging

ASRS Study

References

Page 4: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Information Overload

Avalanche of “information” in text format, e.g.I News articles, press releases, RSS feeds, TV captioning data.I 10-K filings, marketing brochures, financial analyst reports, and

other company-related documents.I Consumer reviews, blogs, emails, and other social media content.I Scientific papers, patents, law documents, bills, medical reports,

literature.

Page 5: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Information Overload

Avalanche of “information” in text format, e.g.I News articles, press releases, RSS feeds, TV captioning data.I 10-K filings, marketing brochures, financial analyst reports, and

other company-related documents.I Consumer reviews, blogs, emails, and other social media content.I Scientific papers, patents, law documents, bills, medical reports,

literature.

The top 20 most important news sources have generated ∼ 40,000news articles yesterday.

Page 6: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

What might be useful?I Summarize large text databases.I Detect and visualize trends in term usage.I Compare how topics of interest are treated across different

sources.I Group similar text documents.I Provide interpretable visualizations.

Page 7: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

What might be useful?I Summarize large text databases.I Detect and visualize trends in term usage.I Compare how topics of interest are treated across different

sources.I Group similar text documents.I Provide interpretable visualizations.

Approach: sparse machine learning tools to help in these tasks.

Page 8: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Outline

Information Overload

Sparse Machine Learning

Topic imagingDynamic imagesCross-language imaging

ASRS Study

References

Page 9: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Sparse Machine Learning

Let X be the term-by-document matrix, y a label vector.I Cardinality-penalized least-squares :

minw,b‖X T w + b1− y‖2

2 + λCard(w)

I Cardinality-penalized low-rank approximations (or, sparse PCA):

minw,v‖X − wvT‖F + λCard(w) + µCard(v).

Despite the hardness of these problems, we can solve themheuristically, at very high scale.

Page 10: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Why are these problems scalable?

Both of these problems allow to eliminate features (or documents)prior to solving the problem at very cheap cost, in an embarrassinglyparallel way. This is known as a SAFE elimination procedure.

For example, for the sparse low-rank approximation problem it can beproven that no feature appears if the corresponding variance is lessthan the penalty parameter λ.

Page 11: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

SAFE for sparse PCA

The sparse PCA problem can be expressed as (n = number offeatures)

maxz : ‖z‖2=1

n∑i=1

max((xT

i z)2 − λ, 0).

I xi is the data for the i-th feature (i-th column of matrix X ).I For no cardinality penalty (λ = 0), reduces to an eigenvalue

problem.I When λ > 0, i-th feature can be removed safely when its

variance = ‖xi‖2 < λ.

In our text applications, cardinality penalty λ is very high , allowing togreatly reduce the size of the problem.

Page 12: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

SAFE for the LASSO

LASSO is a convex approximation to cardinality-penalizedleast-squares. There is a SAFE procedure for it.

0.05 0.07 0.1 0.2 0.3 0.4 0.6 0.8 10

10

100

1,000

10,000

100,000140,000

!/!max

Fe

atu

re

s

Original

M

Lasso

λ = 0.04λmax

25LF M = 1, 000 LF ≤ 1, 000

PubMed data has 3M documents,150K words in dictionary. SAFE forLASSO brings downs that numberto 1000. This allows to load thedata and solve the LASSOproblem.

Page 13: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Outline

Information Overload

Sparse Machine Learning

Topic imagingDynamic imagesCross-language imaging

ASRS Study

References

Page 14: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

What is topic imaging?

Topic image: A small set of terms that are semantically related to agiven topic (“the query”).

As a predictive problem: predict appearance of query term in adocument given the term use in that document.

I Predictive model must be interpretable: number of predictors(other terms) must be few (sparse classification).

I Model must be obtained fast .

Page 15: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

What is topic imaging?

Topic image: A small set of terms that are semantically related to agiven topic (“the query”).

As a predictive problem: predict appearance of query term in adocument given the term use in that document.

I Predictive model must be interpretable: number of predictors(other terms) must be few (sparse classification).

I Model must be obtained fast .

Page 16: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

What is topic imaging?

Topic image: A small set of terms that are semantically related to agiven topic (“the query”).

As a predictive problem: predict appearance of query term in adocument given the term use in that document.

I Predictive model must be interpretable: number of predictors(other terms) must be few (sparse classification).

I Model must be obtained fast .

Page 17: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Occurence vs. ClassificationTwo NYT op-ed columnists

Occurence analysis : top 10 words

Nicholas Kristof Roger Cohenmr obama

people iranobama said

said americanpresident president

world iraniannew israel

american statesyears newunited united

Page 18: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Occurence vs. ClassificationTwo NYT op-ed columnists

Occurence analysis : top 10 words Sparse classification

Nicholas Kristof Roger Cohenmr obama

people iranobama said

said americanpresident president

world iraniannew israel

american statesyears newunited united

Nicholas Kristof Roger Cohenvideos olmertdarfur persian

antibiotics chemicalfacebook mohammadsudanese alijanjaweed dialogueyoutube ceasesudan iranian

sweatshops tehraninvite holocaust

Page 19: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Image across time: “Microsoft”Data: The New York Times headlines, 1985-2007

Page 20: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

“China”Data: The New York Times headlines, 1985-2007

Page 21: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

“Cancer”Data: The New York Times headlines, 1985-2007

Page 22: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

“Diabetes”Data: The New York Times headlines, 1985-2007

Page 23: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Topic imaging in foreign languagesI Translate query term.I Run topic imaging task on foreign press data in original language.I Translate the few terms in the resulting list.

Avoids huge translation task!

Page 24: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Topic imaging in foreign languagesI Translate query term.I Run topic imaging task on foreign press data in original language.I Translate the few terms in the resulting list.

Avoids huge translation task!

Query: can you guess?Source: People’s Daily, Feb-Apr 2011.

file:///C|/Users/Dai%20Xinyu/Desktop/result1.txt[2011-04-05 12:43:34]

LIBYA:

(The first column is chinese words of LIBYA, the second colume is the most related chinese words

of LIBYA from China DAILY NEWSPAPER in 2011,

the third column is the translation of the second column)

利比亚 欧佩克 opec

利比亚 武力 force

利比亚 局势 situation

利比亚 行动 action

利比亚 平民 civilians

利比亚 撤出 withdrawal

利比亚 空袭 airstrike

利比亚 北非 french-speaking

利比亚 瓦莱塔 valletta

利比亚 撤离 evacuate

利比亚 军机 planes

利比亚 人道主义 humanitarianism

利比亚 卡扎菲 qadhafi

DALAI:

达赖 中央政府 government

达赖 访 visit

达赖 藏传 tibetan

达赖 祖国 motherland

达赖 分裂 split

达赖 台 taiwan

达赖 宗教 religion

达赖 集团 clique

Page 25: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Outline

Information Overload

Sparse Machine Learning

Topic imagingDynamic imagesCross-language imaging

ASRS Study

References

Page 26: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

ExampleDiscovery of emerging issues in flight security

After each commercial flight in the US, pilots generate “ASRS reports”to document flight-related issues.

Key problem: detect emerging issues that are not being classified intoexisting categories, e.g.:

I “Wake vortex” problem of the Boeing 757.I Increased number of runway incursions at LAX.

Page 27: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

ExampleDiscovery of emerging issues in flight security

After each commercial flight in the US, pilots generate “ASRS reports”to document flight-related issues.

Key problem: detect emerging issues that are not being classified intoexisting categories, e.g.:

I “Wake vortex” problem of the Boeing 757.I Increased number of runway incursions at LAX.

Don’t search for a needle — picture the haystack!

Page 28: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Sparse PCA Imaging

CLE

DFW

ORDMIA

BOSLAX

STLPHLMDW

DCA SFOZZZEWRATL

LGA

LAS

PIT

HOU BWICYYZ

SEAJFK

clevelandProximitytaillinesconfusingnonstandardrptrorangelinesnowCoordinatedUniversalTimeGMTRouteerrordepictedoperatingbrasilia

Most distinguishing words for "CLE"

rwy19lrwy31ltxwydrwy34lrwy16lrwy34rtxwyerwy28crwy28rrwy28l

Specificstakeoff

rwyustaxi

clearancehold

aircraftclearshorttower

runway

Aviate

apologizedrwy33lrwy16rvehicletxwyhrwy15rtxwyfrwy1rrwy1lrwy12rS

pecifics

stoppedlanding

controllercross

intermediatefixfirstofficer

timepositionrwy23lline

Communicate

CLE DFW ORD MIA BOS LAX STL PHL MDW DCA SFO ZZZ EWR ATL LGA LAS PIT HOU BWI CYYZ SEA JFK

file:///Users/elghaoui/Dropbox/ASRS/Visualization/Runway%20Plot%20-%20v1/view.html

1 of 1 6/14/11 10:53 AM

Figure 2: A sparse PCA plot of the runway ASRS data. Here, each data point is an airport, withsize of the circles consistent with the number of reports for each airport.

CLEDFW

ORD

MIA

BOSLAX

STLPHLMDW

DCASFO

ZZZ

EWR

ATL

LGA

LASPIT

HOUBWICYYZ

SEA JFK clevelandProximitytaillinesconfusingnonstandardrptrorangelinesnowCoordinatedUniversalTimeGMTRouteerrordepictedoperatingbrasilia

Most distinguishing words for "CLE"

frequentlymotionpermissionopenworkloadportmiddleapologizeddirectvehicleM

anage

linetakeofftaxi

clearancehold

aircraftclearshorttower

runway

Aviate

soundflowsrejoineddoublechkingmarkingislandcarefulmaintainedanglevisionN

avigate

captaincrossing

secondofficerposition

intermediatefixtime

crossfirstofficercontrollerlanding

Communicate

CLE DFW ORD MIA BOS LAX STL PHL MDW DCA SFO ZZZ EWR ATL LGA LAS PIT HOU BWI CYYZ SEA JFK

file:///Users/elghaoui/Dropbox/ASRS/Visualization/Runway%20Plot%20-%20v1%20without%20Rwy-Txwy/view...

1 of 1 6/14/11 10:52 AM

Figure 3: A sparse PCA plot of the runway ASRS data, with runway features removed.

10

!"#$%&'("#)"*+()%,-%'&+("#+'(&.&+

+ /(01+-2+'1$&$+#-'&+%$,%$&$"'&+%$,-%'&+'1('+-%)*)"('$+2%-3+(+,(%4056(%+()%,-%'7+

Page 29: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Sparse PCA Imaging

CLE

DFW

ORDMIA

BOSLAX

STLPHLMDW

DCA SFOZZZEWRATL

LGA

LAS

PIT

HOU BWICYYZ

SEAJFK

clevelandProximitytaillinesconfusingnonstandardrptrorangelinesnowCoordinatedUniversalTimeGMTRouteerrordepictedoperatingbrasilia

Most distinguishing words for "CLE"

rwy19lrwy31ltxwydrwy34lrwy16lrwy34rtxwyerwy28crwy28rrwy28l

Specificstakeoff

rwyustaxi

clearancehold

aircraftclearshorttower

runway

Aviate

apologizedrwy33lrwy16rvehicletxwyhrwy15rtxwyfrwy1rrwy1lrwy12rS

pecifics

stoppedlanding

controllercross

intermediatefixfirstofficer

timepositionrwy23lline

Communicate

CLE DFW ORD MIA BOS LAX STL PHL MDW DCA SFO ZZZ EWR ATL LGA LAS PIT HOU BWI CYYZ SEA JFK

file:///Users/elghaoui/Dropbox/ASRS/Visualization/Runway%20Plot%20-%20v1/view.html

1 of 1 6/14/11 10:53 AM

Figure 2: A sparse PCA plot of the runway ASRS data. Here, each data point is an airport, withsize of the circles consistent with the number of reports for each airport.

CLEDFW

ORD

MIA

BOSLAX

STLPHLMDW

DCASFO

ZZZ

EWR

ATL

LGA

LASPIT

HOUBWICYYZ

SEA JFK clevelandProximitytaillinesconfusingnonstandardrptrorangelinesnowCoordinatedUniversalTimeGMTRouteerrordepictedoperatingbrasilia

Most distinguishing words for "CLE"

frequentlymotionpermissionopenworkloadportmiddleapologizeddirectvehicleM

anage

linetakeofftaxi

clearancehold

aircraftclearshorttower

runway

Aviate

soundflowsrejoineddoublechkingmarkingislandcarefulmaintainedanglevisionN

avigate

captaincrossing

secondofficerposition

intermediatefixtime

crossfirstofficercontrollerlanding

Communicate

CLE DFW ORD MIA BOS LAX STL PHL MDW DCA SFO ZZZ EWR ATL LGA LAS PIT HOU BWI CYYZ SEA JFK

file:///Users/elghaoui/Dropbox/ASRS/Visualization/Runway%20Plot%20-%20v1%20without%20Rwy-Txwy/view...

1 of 1 6/14/11 10:52 AM

Figure 3: A sparse PCA plot of the runway ASRS data, with runway features removed.

10

!"#$%&'("#)"*+()%,-%'&+("#+'(&.&+

+ /(01+-2+'1$+2-3%+#)%$04-"&+0-%%$&,-"#&+'-+-"$+-2+'1$+2-3%+5(&)0+,)6-'+'(&.&+7$*8+99:;)('$<<=>+

+ ++ +?1$+'$%@&+A$%$+

!"#$%!&'!(()*2-3"#>+

Page 30: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Sparse PCA Imaging

CLE

DFW

ORDMIA

BOSLAX

STLPHLMDW

DCA SFOZZZEWRATL

LGA

LAS

PIT

HOU BWICYYZ

SEAJFK

clevelandProximitytaillinesconfusingnonstandardrptrorangelinesnowCoordinatedUniversalTimeGMTRouteerrordepictedoperatingbrasilia

Most distinguishing words for "CLE"

rwy19lrwy31ltxwydrwy34lrwy16lrwy34rtxwyerwy28crwy28rrwy28l

Specificstakeoff

rwyustaxi

clearancehold

aircraftclearshorttower

runway

Aviate

apologizedrwy33lrwy16rvehicletxwyhrwy15rtxwyfrwy1rrwy1lrwy12rS

pecifics

stoppedlanding

controllercross

intermediatefixfirstofficer

timepositionrwy23lline

Communicate

CLE DFW ORD MIA BOS LAX STL PHL MDW DCA SFO ZZZ EWR ATL LGA LAS PIT HOU BWI CYYZ SEA JFK

file:///Users/elghaoui/Dropbox/ASRS/Visualization/Runway%20Plot%20-%20v1/view.html

1 of 1 6/14/11 10:53 AM

Figure 2: A sparse PCA plot of the runway ASRS data. Here, each data point is an airport, withsize of the circles consistent with the number of reports for each airport.

CLEDFW

ORD

MIA

BOSLAX

STLPHLMDW

DCASFO

ZZZ

EWR

ATL

LGA

LASPIT

HOUBWICYYZ

SEA JFK clevelandProximitytaillinesconfusingnonstandardrptrorangelinesnowCoordinatedUniversalTimeGMTRouteerrordepictedoperatingbrasilia

Most distinguishing words for "CLE"

frequentlymotionpermissionopenworkloadportmiddleapologizeddirectvehicleM

anage

linetakeofftaxi

clearancehold

aircraftclearshorttower

runway

Aviate

soundflowsrejoineddoublechkingmarkingislandcarefulmaintainedanglevisionN

avigate

captaincrossing

secondofficerposition

intermediatefixtime

crossfirstofficercontrollerlanding

Communicate

CLE DFW ORD MIA BOS LAX STL PHL MDW DCA SFO ZZZ EWR ATL LGA LAS PIT HOU BWI CYYZ SEA JFK

file:///Users/elghaoui/Dropbox/ASRS/Visualization/Runway%20Plot%20-%20v1%20without%20Rwy-Txwy/view...

1 of 1 6/14/11 10:52 AM

Figure 3: A sparse PCA plot of the runway ASRS data, with runway features removed.

10

!"#$%&'("#)"*+()%,-%'&+("#+'(&.&+

+ !"#$%&'()*+%,()%,-%'&+(%$+/-&'01+$2,-&$#+'-+(3)(4-"+5$*6+'(.$7-89+("#+:-//;"):(4-"+)&&;$&<+

Page 31: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Sparse PCA Imaging

CLE

DFW

ORDMIA

BOSLAX

STLPHLMDW

DCA SFOZZZEWRATL

LGA

LAS

PIT

HOU BWICYYZ

SEAJFK

clevelandProximitytaillinesconfusingnonstandardrptrorangelinesnowCoordinatedUniversalTimeGMTRouteerrordepictedoperatingbrasilia

Most distinguishing words for "CLE"

rwy19lrwy31ltxwydrwy34lrwy16lrwy34rtxwyerwy28crwy28rrwy28l

Specificstakeoff

rwyustaxi

clearancehold

aircraftclearshorttower

runway

Aviate

apologizedrwy33lrwy16rvehicletxwyhrwy15rtxwyfrwy1rrwy1lrwy12rS

pecifics

stoppedlanding

controllercross

intermediatefixfirstofficer

timepositionrwy23lline

Communicate

CLE DFW ORD MIA BOS LAX STL PHL MDW DCA SFO ZZZ EWR ATL LGA LAS PIT HOU BWI CYYZ SEA JFK

file:///Users/elghaoui/Dropbox/ASRS/Visualization/Runway%20Plot%20-%20v1/view.html

1 of 1 6/14/11 10:53 AM

Figure 2: A sparse PCA plot of the runway ASRS data. Here, each data point is an airport, withsize of the circles consistent with the number of reports for each airport.

CLEDFW

ORD

MIA

BOSLAX

STLPHLMDW

DCASFO

ZZZ

EWR

ATL

LGA

LASPIT

HOUBWICYYZ

SEA JFK clevelandProximitytaillinesconfusingnonstandardrptrorangelinesnowCoordinatedUniversalTimeGMTRouteerrordepictedoperatingbrasilia

Most distinguishing words for "CLE"

frequentlymotionpermissionopenworkloadportmiddleapologizeddirectvehicleM

anage

linetakeofftaxi

clearancehold

aircraftclearshorttower

runway

Aviate

soundflowsrejoineddoublechkingmarkingislandcarefulmaintainedanglevisionN

avigate

captaincrossing

secondofficerposition

intermediatefixtime

crossfirstofficercontrollerlanding

Communicate

CLE DFW ORD MIA BOS LAX STL PHL MDW DCA SFO ZZZ EWR ATL LGA LAS PIT HOU BWI CYYZ SEA JFK

file:///Users/elghaoui/Dropbox/ASRS/Visualization/Runway%20Plot%20-%20v1%20without%20Rwy-Txwy/view...

1 of 1 6/14/11 10:52 AM

Figure 3: A sparse PCA plot of the runway ASRS data, with runway features removed.

10

!"#$%&'("#)"*+()%,-%'&+("#+'(&.&+

+ !"#$$%&'()$*"%+()%,-%'&+(%$+/-&'01+$2,-&$#+'-+/("(*$/$"'+34-%0-(#5+("#+"(6)*(7-"+3$*8+/(%.)"*&+-"+%9"4(15+)&&9$&:+

Page 32: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

LASSO for DFW!"#$%&'(()(*+&&,-..-/01"*2&3"*24&5#*)"*2&

airport term 1 term 2 term 3 term 4 term 5 term 6 term 7 term 8CLE Rwy23L Rwy24L Rwy24C Rwy23R Rwy5R Line Rwy6R Rwy5LDFW Rwy35C Rwy35L Rwy18L Rwy17R Rwy18R Rwy17C cross TowerORD Rwy22R Rwy27R Rwy32R Rwy27L Rwy32L Rwy22L Rwy9L Rwy4LMIA Rwy9L TxwyQ Rwy8R Line Rwy9R PilotInCommand TxwyM TakeoffBOS Rwy4L Rwy33L Rwy22R Rwy4R Rwy22L TxwyK Frequency CaptainLAX Rwy25R Rwy25L Rwy24L Rwy24R Speed cross Line TowerSTL Rwy12L Rwy12R Rwy30L Rwy30R Line cross short TxwyPPHL Rwy27R Rwy9L Rwy27L TxwyE amass TxwyK AirCarrier TxwyYMDW Rwy31C Rwy31R Rwy22L TxwyP Rwy4R midway Rwy22R TxwyYDCA TxwyJ Airplane turn Captain Line Traffic Landing shortSFO Rwy28L Rwy28R Rwy1L Rwy1R Rwy10R Rwy10L b747 CaptainZZZ hangar radio Rwy36R gate Aircraft Line Ground TowerERW Rwy22R Rwy4L Rwy22L TxwyP TxwyZ Rwy4R papa TxwyPBATL Rwy26L Rwy26R Rwy27R Rwy9L Rwy8R atlanta dixie crossLGA TxwyB4 ILS Line notes TxwyP hold vehicle TaxiwayLAS Rwy25R Rwy7L Rwy19L Rwy1R Rwy1L Rwy25L TxwyA7 Rwy19RPIT Rwy28C Rwy10C Rwy28L TxwyN1 TxwyE TxwyW Rwy28R TxwyVHOU Rwy12R Rwy12L citation Takeoff Heading Rwy30L Line TowerBWI TxwyP Rwy15R Rwy33L turn TxwyP1 Intersection TxwyE Taxiway

CYYZ TxwyQ TxwyH Rwy33R Line YYZ Rwy24R short torontoSEA Rwy34R Rwy16L Rwy34L Rwy16R AirCarrier FirstOfficer TxwyJ SMAJFK Rwy31L Rwy13R Rwy22R Rwy13L vehicle Rwy4L amass Rwy31R

Table 3: The terms recovered with LASSO image analysis of a few airports in the “runway” ASRSdata set.

DFW

Rwy35Cairline

reconfirmed

final

TxwyZ

a319

Rwy17C

RwyERlahso

chargeCommunication

Rwy35L

mu2 cbasimultaneouslylima

TxwyEL

Rwy18L

TxwyWM

xxxx

Rwy18R

TxwyA

xyz

Rwy13R

Rwy36Lswitch

AirlineTransportRating

Rwy17R

RwyXX

spot

abed

Rwy36R

Week

cleared

TxwyWLTxwyF

turn

md80

b757abcdRwy22RcbaTxwyEL

committed

downfield

received

cba

b727

danger

Crossing

TxwyY

statnews.org http://statnews.org/tree_dfw

1 of 1 6/14/11 9:40 PM

CYYZ

YYZdisplaced

may

Crossing

across

along

toronto

ClearAirTurbulencelooks

holdoverold bleed Report

TxwyQ

needed

TxwyH

separationholdover

Rwy33R

Knot

TxwyE

TxwyR

radio

initial

Ground

Line

protected

never

area

proceeded

stop

landed

final

length

fulldata

Minutealongcloseout

previous

rightfollowingwhile

conflictsign

inadvertently

distance

Knot

bleed

placed

toward

statnews.org http://statnews.org/tree_cyyz

1 of 1 6/14/11 9:39 PM

Figure 5: A tree LASSO analysis of the DFW (left panel) and CYYZ (right panel) airports,showing the LASSO image (inner circle) and for each term in that image, a further image.

Figure 6: Diagram of DFW.

12

• &6'($78(/&9*#79-.&*:$;-</=&#$&)-*79:.-*&*:$;-<>2-?#;-<&#$2(*/(97"$/&24-2&9-:/(&2*":@.(A&

• &5./"&#'($78(/&9"BB:$#9-7"$&#//:(/A&

C-?#&3-<&3D&

E:$;-<&FGE&

Page 33: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Outline

Information Overload

Sparse Machine Learning

Topic imagingDynamic imagesCross-language imaging

ASRS Study

References

Page 34: Short Course Robust Optimization and Machine Learning [.2 ...people.eecs.berkeley.edu › ~elghaoui › Teaching › Zinal2012 › Lect7.pdfLaurent El Ghaoui EECS and IEOR Departments

Robust Optimization &Machine Learning7. Text Analytics

Information Overload

Sparse MachineLearning

Topic imagingDynamic images

Cross-language imaging

ASRS Study

References

Xinyu Dai, Jinzhu Jia, Laurent El Ghaoui, and Bin Yu.

SBA-term: Sparse bilingual association for terms.In Fifth IEEE International Conference on Semantic Computing, Palo Alto, CA, USA, 2011.

Laurent El Ghaoui, Vivian Viallon, and Tarek Rabbani.

Safe feature elimination for the LASSO.Submitted to Journal of Machine Learning Research, April 2011.Early draft: EECS Technical Report no. 126, September 2010.

B. Gawalt, J. Jia, L. Miratrix, L. El Ghaoui, B. Yu, and S. Clavier.

Discovering word associations in news media via feature selection and sparse classification.In Proc. 11th ACM SIGMM International Conference on Multimedia Information Retrieval, 2010.

Luke Miratrix, Jinzhu Jia, Brian Gawalt, Bin Yu, and Laurent El Ghaoui.

Summarizing large-scale, multiple-document news data: sparse methods and human validation.submitted to JASA.

Haipeng Shen and Jianhua Z. Huang.

Sparse principal component analysis via regularized low rank matrix approximation.J. Multivar. Anal., 99:1015–1034, July 2008.

Y. Zhang, A. d’Aspremont, and L. El Ghaoui.

Sparse PCA: Convex relaxations, algorithms and applications.In M. Anjos and J.B. Lasserre, editors, Handbook on Semidefinite, Cone and Polynomial Optimization: Theory,Algorithms, Software and Applications. Springer, 2011.To appear.