research exploring massive learning via a prediction system omid madani yahoo! research

ResearchResearch

Exploring Massive Learning via a Prediction System

Omid Madani

Yahoo! Research www.omadani.net


Goal

Convey a taste of:
• The motivations/considerations/assumptions/speculations/hopes, …
• The game, a first system, and its algorithms


Talk Overview

1. Motivational part

2. The approach:
  • The game (categories, …)
  • Algorithms
  • Some experiments


Fill in the Blank(s)!

Would ---- like ------ ------- ----- ------ ?

(Hidden words: “you”; “your coffee with sugar”)


What is this object?


Categorization is Fundamental!

“Well, categorization is one of the most basic functions of living creatures. We live in a categorized world – table, chair, male, female, democracy, monarchy – every object and event is unique, but we act towards them as members of classes.” From an interview with Eleanor Rosch (psychologist, a pioneer on the phenomenon of “basic level” concepts)

“Concepts are the glue that holds our mental world together.” From “The Big Book of Concepts”, Gregory Murphy


“Rather, the formation and use of categories is the stuff of experience.”

Philosophy in the Flesh, Lakoff and Johnson.


Two Questions Arise

• Repeated and rapid classification …
• … in the presence of myriad classes

[Figure: instances $x^{(1)}, x^{(2)}, \ldots$ flow into a classification system; for a new $x$, which category?]

In the presence of myriad categories:
1. How to categorize efficiently?
2. How to efficiently learn to categorize efficiently?


Now, a 3rd Question ..

• How can so many inter-related categories be acquired?

• Programming them unlikely to be successful/scale:
  • Limits of our explicit/conscious knowledge
  • Unknown/unfamiliar domains
  • The required scale ..
  • Making the system operational ..


Learn? … How?

• “Supervised” learning (explicit human involvement) likely inadequate:
  • Required scale, or a good sign post:
    • ~millions of categories and beyond ..
    • Billions of weights, and beyond ..
  • Inaccessible “knowledge” (see last slide!)
• Other approaches likely do not meet the needs (incomplete, different goals, etc.): active learning, semi-supervised learning, clustering, density learning, RL, etc.

Desiderata/Requirements (or Speculations)

• Higher intelligence, such as “advanced” pattern recognition/generation (e.g. vision), may require:
  • Long term learning (weeks, months, years, …)
  • Cumulative learning (learn these first, then these, then these, …)
  • Massive learning: myriad inter-related categories/concepts
  • Systems learning
  • Autonomy (relatively little human involvement)

What’s the learning task?


This Work: An Exploration

• An avenue: “prediction games in infinitely rich worlds”

• Exciting part:
  • World provides unbounded learning opportunity! (The world is the validator, the system is the experimenter, and it actively builds much of its own concepts.)
  • World enjoys many regularities (e.g. “hierarchical”)
• Based in part on “supervised” techniques!! (“discriminative”, “feedback driven”; the supervisory signal doesn’t originate from humans)

In a Nutshell

[Figure: a stream (…. 0011101110000….) feeds a Prediction System that repeatedly predicts, then observes and updates. At first it works with low level or “hard-wired” categories (text: characters, ..; vision: edges, curves, …). After a while (much learning) it predicts higher level categories, i.e. bigger chunks (e.g. words, digits, phrases, phone numbers, faces, visual objects, home pages, sites, …).]


The Game

• Repeat:
  • Hide part(s) of the stream
  • Predict (use context)
  • Update
  • Move on

• Objective: predict better ... subject to efficiency constraints

• In the process: categories at different levels of size and abstraction should be learned
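The hide/predict/update/move-on loop above can be sketched in a few lines. This is only a toy illustration, not the system described in the talk: it assumes fixed single-character categories, a context of the preceding `context_size` characters, and a simple count-based predictor. The names `play_game` and `context_size` are hypothetical.

```python
from collections import defaultdict

def play_game(stream, context_size=1):
    """Play the prediction game over a character stream; return the hit rate."""
    # counts[context][target]: how often `target` followed `context` so far.
    counts = defaultdict(lambda: defaultdict(int))
    correct = 0
    episodes = 0
    for i in range(context_size, len(stream)):
        context = stream[i - context_size:i]   # visible part of the window
        target = stream[i]                     # hidden part to be predicted
        seen = counts[context]
        # Predict: the most frequent continuation of this context (None if unseen).
        guess = max(seen, key=seen.get) if seen else None
        correct += (guess == target)
        # Update: observe the hidden part and strengthen that connection.
        seen[target] += 1
        episodes += 1                          # move on to the next position
    return correct / episodes if episodes else 0.0
```

On a highly regular stream such as `"abababab"` the toy player misses the first two episodes and gets the rest right. The real game differs in that the categories themselves grow over time.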


Research Goals

• Conjecture: There is much value to be attained from this task

• Beyond language modeling: more advanced pattern recognition/generation

• If so, should yield a wealth of new problems (=> Fun)


Overview

• Goal: Convey a taste of the motivations/considerations, the system and algorithms,..

• Motivation
• The approach:
  • The game (categories, …)
  • Algorithms
  • Some experiments


Upshot

• Takes streams of text
• Makes categories (strings)
• Approx. three hours on 800k documents
• Large-scale discriminative learning (evidence: better than language modeling)


Caveat Emptor!

• Exploratory research

• Many open problems (many I’m not aware of … )

• Chosen algorithms, system org, or objective/performance measures, etc., etc… are likely not even near the best possible


Categories

• Building blocks (atoms!) of intelligence?

• Patterns that frequently occur
  • External
  • Internal ..
• Useful for predicting other categories!
• They can have structure/regularities, in particular:
  1. Composition (~conjunctions) of other categories (Part-Of)
  2. Grouping (~disjunctions) (Is-A relations)


Categories

• Low level “primitive” examples: 0 and 1, or characters (“a”, “b”, .., “0”, “-”, ..)
  • Provided to the system (easy to detect)
• Higher/composite levels:
  • Sequences of bits/characters
  • Words
  • Phrases
  • More general: phone number, contact info, resume, ...


Example Concept

• Area code is a concept that involves both composition and grouping:
  • Composition of 3 digits
  • A digit is a grouping, i.e., the set {0, 1, 2, …, 9} (2 is a digit)

• Other example concepts: phone number, address, resume page, face (in visual domain), etc.
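One way to picture the area-code example is as two small data types, one for grouping (Is-A) and one for composition (Part-Of). This is a sketch under assumptions: the class names `Grouping` and `Composition` and the `match` method are illustrative only, and the first system described later does not learn groupings yet.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union

@dataclass(frozen=True)
class Grouping:
    """Is-A: matches any single member (a disjunction)."""
    members: FrozenSet[str]

    def match(self, text: str, start: int = 0) -> int:
        """Return the position after the match, or -1 if no member matches."""
        # Try longer members first so the longest alternative wins.
        for m in sorted(self.members, key=len, reverse=True):
            if text.startswith(m, start):
                return start + len(m)
        return -1

@dataclass(frozen=True)
class Composition:
    """Part-Of: matches its parts one after another (a conjunction)."""
    parts: Tuple[Union["Grouping", "Composition"], ...]

    def match(self, text: str, start: int = 0) -> int:
        pos = start
        for part in self.parts:
            pos = part.match(text, pos)
            if pos < 0:
                return -1
        return pos

# A digit is a grouping; an area code is a composition of three digits.
DIGIT = Grouping(frozenset("0123456789"))
AREA_CODE = Composition((DIGIT, DIGIT, DIGIT))
```

Because a `Composition` may contain other compositions, larger concepts such as a phone number could be built from `AREA_CODE` in the same way.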


Again, our goal, informally, is to build a system that acquires millions of useful concepts on its own.


Questions for a First System

• Functionality? Architecture? Org?
• Would many-class learning scale to millions of concepts?
• Choice of concept building methods?
• How would various learning processes interact?


Expedition: a First System

• Plays the game in text

• Begins at character level

• No segmentation, just a stream

• Makes and predicts larger sequences, via composition

• No grouping yet


Learning Episodes

[Figure: a window over the stream “… New Jersey in …”, containing the context and the target. The predictors (active categories) form the context; the target is the category to predict. At the next time step the window moves along the stream, giving new predictors and a new target.]

In this example, context contains one category on each side


.. Some Time Later ..

[Figure: a window over the stream “… loves New York life …”, containing the context and the target; the predictors and the target (category to predict) are now larger categories.]

In terms of supervised learning/classification, in this learning activity (prediction games):
• The set of concepts grows over time
• Same for features/predictors (concepts ARE the predictors!)
• Instance representation (segmentation of the data stream) changes/grows over time ..


Prediction/Recall

$x = \{f_2, f_3\}$

[Figure: a bipartite graph with features f1–f4 on one side and categories c1–c5 on the other, connected by weighted edges (weights shown: 0.40, 0.30, 0.20, 0.10, 0.10).]

1. Features are “activated”
2. Edges are activated
3. Receiving categories are activated
4. Categories sorted/ranked

Sorted list: $(c_4, 0.50),\ (c_3, 0.40),\ (c_5, 0.10),\ (c_1, 0.10)$

1. Like use of inverted indices
2. Sparse dot products
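The four recall steps amount to a sparse dot product over an inverted-index-like structure. A minimal sketch follows, assuming the edges are stored as a dictionary from each feature to its outgoing category weights. The function name `rank_categories`, the data layout, and the example weights are illustrative assumptions, not the system's actual representation or the figure's values.

```python
from collections import defaultdict

def rank_categories(active_features, edges):
    """Score categories by summing the weights of edges from active features."""
    scores = defaultdict(float)
    for f in active_features:                    # 1. features are activated
        for c, w in edges.get(f, {}).items():    # 2. their edges are activated
            scores[c] += w                       # 3. receiving categories are activated
    # 4. categories are sorted by score (ties broken by name for determinism)
    return sorted(scores.items(), key=lambda cw: (-cw[1], cw[0]))

# Hypothetical weights, chosen only for illustration.
example_edges = {
    "f1": {"c1": 0.5},
    "f2": {"c3": 0.25, "c4": 0.5},
    "f3": {"c4": 0.25, "c5": 0.125},
}
ranking = rank_categories(["f2", "f3"], example_edges)
```

Only the edges of active features are touched, so the cost depends on the features' out-degrees rather than on the total number of categories.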

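Updating a Feature’s Connections

[Figure: the bipartite graph of features f1–f4 and categories c1–c5; for an active feature $f \in x$, the connection to the observed category $c_x$ is updated.]

1. Identify connection
2. Increase weight
3. Normalize/weaken weights
4. Drop tiny weights

Degrees are constrained

One such update: $\forall c,\; w_{f,c} \leftarrow (1-\beta)\,w_{f,c} + \beta\,[c = c_x]$, where $0 < \beta \le 1$ and $[c = c_x]$ is the Kronecker delta.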
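The four update steps can be sketched for a single active feature as below. This assumes a moving-average style update, in which every weight is weakened by a factor (1 - beta) and the observed category's weight is then raised by beta. The function name `update_feature` and the parameter names and defaults (`beta`, `w_min`) are hypothetical.

```python
def update_feature(weights, true_category, beta=0.2, w_min=0.01):
    """Update one feature's outgoing weights in place after an observation.

    weights: dict mapping category -> weight, for a single feature f.
    """
    # 1. Identify the connection to the observed category (create it if missing).
    weights.setdefault(true_category, 0.0)
    # 2-3. Weaken every weight by (1 - beta); boost the observed category by beta.
    for c in weights:
        boost = beta if c == true_category else 0.0
        weights[c] = (1.0 - beta) * weights[c] + boost
    # 4. Drop tiny weights so the feature's out-degree stays bounded.
    for c in [c for c, w in weights.items() if w < w_min]:
        del weights[c]
    return weights
```

Under this assumed rule, if a feature's weights sum to at most 1 before the update they still do afterwards, so dropping weights below `w_min` keeps at most `1 / w_min` connections per feature. This is one way the degrees could be constrained.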


Example Category Node (from Jane Austen’s)

[Figure: the category node “ther ” and its neighbors “and ”, “heart”, “nei”, “toge”, “ far”, “ bro”, “love ”, “by ”, some of them labeled “categories appearing before”. Edges carry prediction weights (0.087, 0.07, 0.057, 0.052, 0.13, 0.11, 0.10); the node keeps local statistics (7.1, 0.41).]

A category node keeps track of various weights, such as edge (or prediction) weights and predictiveness weights, and other statistics (e.g. frequency, first/last time seen), and updates them when it is activated as a predictor or target.
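A category node with local statistics might look roughly like the following record. The field and method names (`prediction_weights`, `touch`, `is_stale`) are assumptions made for illustration; the talk does not specify the actual data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class CategoryNode:
    name: str
    # Edge (prediction) weights to other categories, keyed by category name.
    prediction_weights: Dict[str, float] = field(default_factory=dict)
    frequency: int = 0
    first_seen: Optional[int] = None
    last_seen: Optional[int] = None

    def touch(self, time: int) -> None:
        """Refresh local statistics whenever the node is activated."""
        self.frequency += 1
        if self.first_seen is None:
            self.first_seen = time
        self.last_seen = time

    def is_stale(self, now: int, horizon: int) -> bool:
        """True if the node has not been seen in the last `horizon` time points."""
        return self.last_seen is None or now - self.last_seen > horizon
```

Keeping the last-seen time on the node makes recency checks cheap, for example when deciding whether a rarely used category should be removed.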

Network

• Categories and their edges form a network (a directed weighted graph, with different kinds of edges ...)

• The network grows over time: millions of nodes and beyond


When and How to Compose?

• Two major approaches:
  1. Pre-filter: don’t compose if certain conditions are not met (simplest: only consider possibilities that you see)
  2. Post-filter: compose and use, but remove if certain conditions are not met (e.g. if not seen recently enough, remove)

• I expect both are needed …

Some Composition (Prefilter) Heuristics

• FRAC: If you see c1 then c2 in the stream, then, with some probability, add c = c1c2
• MU: use the pointwise mutual information between c1 and c2, based on the ratio $\frac{p(c_2 \mid c_1)}{p(c_2)}$
• IMPROVE: take string lengths into account and see whether joining is better
• BOUND: Generate all strings under length Lt.
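As a rough illustration of a mutual-information pre-filter, the ratio p(c2|c1)/p(c2) can be estimated from stream counts and compared against a cutoff. The function names, the base-2 logarithm, and the `threshold` value are assumptions for this sketch; the talk does not say how the estimate or the cutoff are chosen.

```python
import math

def pmi(count_c1, count_c2, count_c1c2, total):
    """Pointwise mutual information log2(p(c2|c1) / p(c2)) from raw counts."""
    if min(count_c1, count_c2, count_c1c2, total) <= 0:
        return float("-inf")          # no usable counts: no evidence for composing
    p_c2_given_c1 = count_c1c2 / count_c1
    p_c2 = count_c2 / total
    return math.log2(p_c2_given_c1 / p_c2)

def should_compose(count_c1, count_c2, count_c1c2, total, threshold=1.0):
    """Pre-filter: propose c = c1c2 only when PMI clears a hypothetical threshold."""
    return pmi(count_c1, count_c2, count_c1c2, total) >= threshold
```

For example, if c2 always follows c1 (4 of 4 times) but c2 makes up only a quarter of the stream, the ratio is 4 and the PMI is 2 bits, so the pair would be proposed at the default cutoff.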


Prediction Objective

• Desirable: learn higher level categories (bigger/abstract categories are useful externally)

• Question: how does this relate to improving predictions?

1. Higher level categories improve “context” and can save memory

2. Bigger categories save time in playing the game (categories are atomic)


Objective (evaluation criterion)

• The Matching Performance: number of bits (characters) correctly predicted per unit time or per prediction action
• Subject to constraints (space, time, ..)
• How about entropy/perplexity? Categories are structured, so perplexity seems difficult to use ..
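A per-prediction version of the matching performance could be computed as below. The sketch assumes that "correctly predicted" means the number of leading characters of the predicted category that agree with what actually follows in the stream. That reading, and the function names, are assumptions rather than the talk's definition.

```python
def matched_chars(predicted, actual):
    """Number of leading characters of `predicted` that agree with `actual`."""
    n = 0
    for p, a in zip(predicted, actual):
        if p != a:
            break
        n += 1
    return n

def matching_performance(episodes):
    """Average matched characters per prediction action.

    episodes: iterable of (predicted_string, actual_continuation) pairs.
    """
    episodes = list(episodes)
    if not episodes:
        return 0.0
    return sum(matched_chars(p, a) for p, a in episodes) / len(episodes)
```

Under this reading, predicting "New York" when the stream continues with "New Jersey" earns 4 characters ("New "), so longer categories pay off only when they are right.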

Linearity and Non-Linearity (a motivation for new concept creation)

Which one predicts better (better constrains what comes next)?
• Aggregate the votes of “n”, “e”, and “w” to predict what comes next
• Versus: use the single category “new” to predict what comes next (“new????”)


Data

• Reuters RCV1: 800k news articles
• Several online books of Jane Austen, etc.
• Web search query logs


Some Observations

• Ran on Reuters RCV1 (text body) (simply zcat dir/file*)
• ~800k articles
• >= 150 million learning/prediction episodes
• Over 10 million categories built
• 3-4 hours each pass (depends on parameters)


Observations

• Performance on held-out data (one of the Reuters files):
  • 8-9 characters long to predict on average
  • Almost two characters correct on average, per prediction action
• Can overfit/memorize! (long categories)
• Current: stop category generation after first pass


Some Example Categories (in order of first time appearance and increasing length)

cat name= "<"
cat name= " t"
cat name= ".</"
cat name= "p>- "
cat name= " the "
cat name= "ation "
cat name= "of the "
cat name= "ing the "
cat name= ""The "
cat name= "company said "
cat name= ", the company "
cat name= "said on Tuesday"
cat name= " said on Tuesday"
cat name= "," said one "
cat name= "," he said.
cat name= "--------------------------------"
cat name= "--------------------------------------------------------"
cat name= "---------------------------------------------------------------
cat name= ". Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "press on Tuesday. Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "press on Thursday. Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "press on Wednesday. Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "within 10 percentage points in either direction of the key 225-share Nikkei average over the next six month"
cat name= "ing and selling rates for leading world currencies and gold against the dollar on the London foreign exchange and bullion "

Example “Recall” Paths

From processing one month of Reuters:

"Sinn Fei" (0.128) "n a seat" (0.527) " in the " (0.538) "talks." (0.468) " B" (0.0185) "rokers " **** The end: connection weight less than: 0.04

" Gas in S" (1) "cotland" (1.04) " and north" (1.18) "ern E" (0.572) "ngland" (0.165) "," a " (0.0542) "spokeswo" (0.551) "mansaid " (0.044) "the idea" (0.0869) " was to " (0.144) "quot" (0.164) "e the d" (0.0723) "ivision" (0.0671) " in N" (0.397) "ew York" (0.062) " where " (0.0557) "the main " (0.0474) "marque" (0.229) "swere " (0.253) "base" (0.264) "d. "" (0.0451) "It will " (0.117) "certain" (0.0691) "ly b" (0.0892) "e New " (0.353) "York" (0.112) "party" (0.0917) "s is goin" (0.559) "g to " (0.149) "end."" (0.239) " T" (0.104) "wedish " (0.125) "Export" (0.0211) "Credi" **** The end: connection weight less than: 0.04

Search Query Logs

"bureoofi" (1) "migration" (1.13) "andci" (1.04) "tizenship." (0.31) "com www," (0.11) "ictions" (0.116) "zenship." **** The end: this concept wasn't seen in last 1000000 time points.

Random Recall: "bureoofi" (1) "migration" (0.0129) "dept.com" **** The end: this concept wasn't seen in last 1000000 time points.
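Recall paths like the ones above can be imitated by repeatedly following a category's strongest outgoing connection until the weight falls below a cutoff. The sketch below assumes greedy selection and a plain dictionary of edges. The name `recall_path`, the `max_steps` guard, and the greedy rule are illustrative assumptions; the "Random Recall" variant would sample the next category instead of taking the maximum.

```python
def recall_path(start, edges, min_weight=0.04, max_steps=50):
    """Greedy recall: repeatedly follow the strongest outgoing connection.

    edges: dict mapping category -> dict of next category -> connection weight.
    Returns a list of (category, weight) steps; the start has weight None.
    """
    path = [(start, None)]
    current = start
    for _ in range(max_steps):                      # guard against cycles
        successors = edges.get(current)
        if not successors:
            break                                   # nothing is predicted from here
        nxt, w = max(successors.items(), key=lambda cw: cw[1])
        if w < min_weight:
            break                                   # connection weight below the cutoff
        path.append((nxt, w))
        current = nxt
    return path

# Hypothetical edges, loosely modeled on the first path above.
demo_edges = {
    "Sinn Fei": {"n a seat": 0.128, "x": 0.05},
    "n a seat": {" in the ": 0.527},
    " in the ": {"talks.": 0.02},
}
demo_path = recall_path("Sinn Fei", demo_edges)
```

The default cutoff of 0.04 mirrors the stopping message in the paths above. A staleness check on the next category could be added as a second stopping rule.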


Much Related Work!

• Online learning, cumulative learning, feature and concept induction, neural networks, clustering, Bayesian methods, language modeling, deep learning, “hierarchical” learning, importance/ubiquity of predictions/anticipations in the brain (“On Intelligence”, “natural computations”,…), models of neocortex (“circuits of the mind”), concepts and conceptual phenomena (e.g. “big book of concepts”), compression, ….


Summary

• Large-scale learning and classification (data hungry, efficiency paramount)

• A systems approach: Integration of multiple learning processes

• The system makes its own classes
• Driving objective: improve prediction (currently: “matching” performance)
• The underlying goal: effectively acquire complex concepts
• See www.omadani.net


Current/Future

• Much work:
  • Integrate learning of groupings
  • Recognize/use “structural” categories? (learn to “parse”/segment?)
  • Prediction objective .. ok?
  • Control over input stream, etc. ..
  • Category generation .. What are good methods?
  • Other domains (vision, …)

• Compare: language modeling, etc
