
Posted on 17-Jan-2016


Patch Based Prediction Techniques
University of Houston

By: Paul Amalaman
From: UH-DMML Lab
Director: Dr. Eick

Introduction
1. Research Goals
2. Problem Setting
3. Solutions: TPRTI-A & TPRTI-B
4. Results
5. Conclusion
Future Work

UH-DMML Lab 1

Research Goals

To improve Machine Learning techniques for inducing predictive models based on efficient subdivisions of the input space (patches)

Areas of Focus:
- Linear Regression Tree Induction
- Classification Tree Induction


Linear regression is a global model: a single predictive formula holds over the entire data space.

Y = β0 + βᵀX + ε
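As a concrete illustration of the global model, here is a minimal least-squares fit in NumPy; the data and coefficient values are made up for the example:

```python
import numpy as np

# Synthetic data from a known global linear model Y = b0 + X @ beta + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
beta, beta0 = np.array([1.5, -2.0]), 4.0
y = beta0 + X @ beta + rng.normal(scale=0.1, size=200)

# Recover (b0, beta) by least squares; the column of ones carries the intercept
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
```

One formula, fit once, over the whole input space — this is what a regression tree relaxes.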

Linear Regression Tree

When the data has many input attributes which interact in complicated, nonlinear ways, assembling a single global model can be very difficult. An alternative approach to nonlinear regression is to split, or partition, the space into smaller regions, where the interactions are more manageable. We then partition the subdivisions again (this is called recursive partitioning) until we finally reach chunks of the space to which simple models can be fit.
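The recursive-partitioning idea can be sketched as follows. This is an illustrative toy for a single input attribute, not the TPRTI algorithms introduced later; the function names, thresholds, and stopping rules are invented for the example:

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line y = a*x + b; returns coefficients and RSS."""
    a, b = np.polyfit(x, y, 1)
    rss = float(np.sum((y - (a * x + b)) ** 2))
    return (a, b), rss

def grow_tree(x, y, min_leaf=10, depth=0, max_depth=3):
    """Recursively partition a 1-D input space, fitting a line per region."""
    (a, b), rss = fit_line(x, y)
    if depth == max_depth or len(x) < 2 * min_leaf:
        return {"leaf": (a, b)}
    best = None
    for split in np.unique(x)[min_leaf:-min_leaf]:   # candidate split values
        left = x <= split
        _, rss_l = fit_line(x[left], y[left])
        _, rss_r = fit_line(x[~left], y[~left])
        if best is None or rss_l + rss_r < best[0]:
            best = (rss_l + rss_r, split)
    if best is None or best[0] >= rss:               # no improving split found
        return {"leaf": (a, b)}
    mask = x <= best[1]
    return {"split": float(best[1]),
            "left": grow_tree(x[mask], y[mask], min_leaf, depth + 1, max_depth),
            "right": grow_tree(x[~mask], y[~mask], min_leaf, depth + 1, max_depth)}
```

On data with a sharp turn (e.g. y = x below 5, y = 10 − x above it), the first split lands near the turn and each region gets its own simple line.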

Splitting Method: select the pair {split variable, split value} that minimizes some error/objective function.

Background (Research Goals Continued)


Popular approaches:

1. Variance-based {split variable, split value} selection:
   - Try the mean value of each input attribute
   - Objective function: variance minimization
   - Scalable, but complex trees and often less accurate

2. RSS-based {split variable, split value} selection:
   - Try each value of each input attribute (exhaustive search)
   - Objective function: minimization of the RSS (Residual Sum of Squared Errors)
   - Less scalable, but smaller trees and better accuracy
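The difference between the two candidate-generation strategies can be sketched as follows. This is a simplified version that fits each side on the split attribute only; the helper names are invented for the illustration:

```python
import numpy as np

def rss_of_line(x, y):
    """RSS of a least-squares line fit to (x, y)."""
    a, b = np.polyfit(x, y, 1)
    return float(np.sum((y - (a * x + b)) ** 2))

def variance_based_candidates(X):
    """One candidate per attribute: its mean value (M5-style)."""
    return [(j, float(X[:, j].mean())) for j in range(X.shape[1])]

def exhaustive_candidates(X):
    """Every observed value of every attribute (RETIS-style exhaustive search)."""
    return [(j, float(v)) for j in range(X.shape[1]) for v in np.unique(X[:, j])]

def best_split(X, y, candidates, min_leaf=5):
    """Pick the {attribute, value} pair minimizing the summed RSS of two fits."""
    best = None
    for j, v in candidates:
        left = X[:, j] <= v
        if left.sum() < min_leaf or (~left).sum() < min_leaf:
            continue
        score = rss_of_line(X[left, j], y[left]) + rss_of_line(X[~left, j], y[~left])
        if best is None or score < best[0]:
            best = (score, j, v)
    return best  # (RSS, attribute index, split value)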

Our Research Goals:

To induce smaller trees with better accuracy while improving on scalability by designing better splitting methods (patches), and objective functions


Problem Setting

[Figure: two example datasets (x vs. y) comparing the split points found by exhaustive search with those found by a variance-based method; regions labeled A through G.]


1. Variance-based approaches like M5 will miss the optimum split point.

2. Exhaustive-search approaches like RETIS will find the optimum split point, but at the cost of an expensive search (not scalable).

Example (Problem Setting Continued)

[Figure: example dataset with input attributes x1, x2 and target y.]

Current Proposed Solution

- Detect areas in the dataset where the general trend makes sharp turns (turning points)
- Use turning points as potential split points in a Linear Regression Tree induction algorithm

Challenges:
- Determining the turning points
- Balancing accuracy, model complexity, and runtime complexity
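The slides do not spell out the detection procedure, but the idea of finding sharp turns in the general trend can be illustrated with a simple two-window slope comparison; the window size and angle threshold here are invented for this sketch:

```python
import numpy as np

def turning_points(x, y, window=10, angle_thresh=0.5):
    """Illustrative turning-point detector: slide two adjacent windows
    along x-sorted data and flag the point where the fitted slopes of
    the windows differ most sharply (a 'sharp turn' in the trend)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    scored = []
    for i in range(window, len(xs) - window):
        a1, _ = np.polyfit(xs[i - window:i], ys[i - window:i], 1)
        a2, _ = np.polyfit(xs[i:i + window], ys[i:i + window], 1)
        # angle between the two fitted line directions
        angle = abs(np.arctan(a2) - np.arctan(a1))
        scored.append((angle, float(xs[i])))
    angle, pt = max(scored)
    return [pt] if angle > angle_thresh else []
```

On y = |x − 4| the two windows straddle the kink with slopes −1 and +1 (an angle of π/2), so the detector reports a single turning point near x = 4.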


Solutions


Determining Turning Points (Solutions continued)

Two algorithms: TPRTI-A and TPRTI-B. Both rely on:
1. Detecting potential split points (turning points) in the dataset
2. Feeding the split points to a tree induction algorithm

TPRTI-A and TPRTI-B differ by their objective functions:
- TPRTI-A: RSS-based node evaluation approach
- TPRTI-B: two-step node evaluation function
  - Select the split point based on distance
  - Use an RSS computation to select the pair {split variable, split value}


Two New Algorithms (Solutions continued)

TPRTI-A: RSS-based node evaluation approach. It performs a look-ahead split for each turning point and selects the split that best minimizes the RSS.
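A sketch of this look-ahead evaluation for a single input attribute; the helper is illustrative, not the paper's implementation:

```python
import numpy as np

def rss_of_fit(x, y):
    """RSS of a least-squares line fit to (x, y)."""
    a, b = np.polyfit(x, y, 1)
    return float(np.sum((y - (a * x + b)) ** 2))

def lookahead_best_split(x, y, turning_pts):
    """For each candidate turning point, do a look-ahead split: fit a
    line on each side and score the split by the summed RSS; keep the
    turning point whose split minimizes that score."""
    best = None
    for t in turning_pts:
        left = x <= t
        if left.sum() < 2 or (~left).sum() < 2:
            continue
        score = rss_of_fit(x[left], y[left]) + rss_of_fit(x[~left], y[~left])
        if best is None or score < best[0]:
            best = (score, t)
    return best  # (RSS, split value) or None
```

Only the handful of detected turning points are scored, instead of every observed value — this is where the scalability gain over exhaustive search comes from.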


TPRTI-B: two-step node evaluation function:
- Select the split point based on distance
- Use an RSS computation to select the pair {split variable, split value}

Two New Algorithms (Solutions continued)

Results On Accuracy

Table 1. Comparison between TPRTI-A, TPRTI-B and state-of-the-art approaches with respect to accuracy (wins/ties/losses)

          M5       TPRTI-A  TPRTI-B  RETIS    GUIDE    SECRET
TPRTI-A   (6/5/1)  -        (4/6/2)  (4/6/0)  (5/1/2)  (4/2/2)
TPRTI-B   (4/6/2)  (2/6/4)  -        (3/7/0)  (5/1/2)  (1/4/3)


Results On Complexity

Table 2. Number of times an approach obtained the combination (best accuracy, fewest leaf nodes)

M5   TPRTI-A  TPRTI-B  RETIS  GUIDE  SECRET
0    5        3        5      N.A.   N.A.


Results On Scalability


Conclusion

We propose a new approach to Linear Regression Tree construction, called Turning Point Regression Tree Induction (TPRTI), that infuses turning points into a regression tree induction algorithm to achieve improved scalability while maintaining high accuracy and low model complexity.

Two novel linear regression tree induction algorithms, TPRTI-A and TPRTI-B, which incorporate turning points into the node evaluation, were introduced. Experimental results indicate that TPRTI is scalable and capable of obtaining high predictive accuracy using smaller decision trees than other approaches.


Future Work

We are investigating how turning point detection can also be used to induce better classification trees.


Thank You
