USING A HIERARCHY IN PLACE OF FLAT DATA FOR SEQUENTIAL PATTERN MINING Matt Ramsey 100063967 Supervisor: Jan Stanek Minor Thesis Presentation


Page 1

USING A HIERARCHY IN PLACE OF FLAT DATA FOR SEQUENTIAL PATTERN MINING

Matt Ramsey

100063967

Supervisor: Jan Stanek

Minor Thesis Presentation

Page 2

PRESENTATION STRUCTURE

Introduction
Concepts
Research Question
Methodology
Results
Conclusion
Discussion

Page 3

INTRODUCTION

What the project is:
  Performing sequential pattern mining on patient prescription data using the ATC drug hierarchy

Some important concepts:
  Sequential Pattern Mining
  ATC drug hierarchy

Page 4

CONCEPT 1 - SEQUENTIAL PATTERN MINING

Sequential pattern mining tries to find relationships between occurrences of sequential events, to determine whether those occurrences follow any specific order (Zhao & Bhowmick 2003).

Page 5

CONCEPT 2 - ATC CODES

Anatomical Therapeutic Chemical (ATC): a drug classification that creates distinct groups of drugs
E.g. Propicillin (J01CE03)
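As a minimal sketch of how an ATC code encodes its place in the hierarchy: a code such as J01CE03 can be cut back to a broader group by keeping only its leading characters. The prefix lengths below (1, 3, 4, 5 and 7 characters for levels 1 to 5) follow the standard ATC layout; the function names are illustrative and are not taken from the thesis.

```python
# Number of leading characters that identify an ATC group at each level
# (level 1 = broadest anatomical group, level 5 = single chemical substance).
ATC_PREFIX_LENGTH = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_at_level(code: str, level: int) -> str:
    """Truncate a full 7-character ATC code to the given hierarchy level."""
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return code[:ATC_PREFIX_LENGTH[level]]


def atc_ancestry(code: str) -> list:
    """Return the code at levels 5, 4, 3, 2 and 1, most specific first."""
    return [atc_at_level(code, level) for level in range(5, 0, -1)]


print(atc_ancestry("J01CE03"))
# ['J01CE03', 'J01CE', 'J01C', 'J01', 'J']
```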

Page 6

RESEARCH QUESTION

How will the use of a hierarchy affect the success of the sequential pattern mining?

Page 7

THEORY

The theory: using the hierarchy will yield more patterns, but they may suffer from “contamination” and “dilution”

Contamination
  Occurs when a level of the hierarchy that is too high is used: each group contains too many unrelated subgroups
  E.g. if we have a pattern at level 2 and move up to pattern mine at level 1, we still have that pattern, but its meaning might be lost
  The pattern is contaminated

Dilution
  Occurs when a level of the hierarchy that is too low is used, meaning the information becomes too specific
  E.g. at level 5 of the ATC hierarchy there may be too many possible states (prescriptions) for strong patterns to emerge
  The pattern is diluted by the level of detail

Page 8

METHODOLOGY – PRE-PROCESSING

Remove prescriptions with no drug recorded
Calculate Duration = Packsize * (Repeats + 1) / Dose
Convert drug names to ATC codes

Raw prescription record:

Patient Key  Provider Key  Date        Script number  Drug Name    Dose     Packsize  Repeats
A2rJFF8mDe   Amrr6MjVUS    22/04/2008  A006507        Propicillin  1 daily  20        2

After pre-processing:

Patient Key  Date        Drug Code  Duration (days)
A2rJFF8mDe   22/04/2008  J01CE03    60

^Prescription(1)="A2rJFF8mDe,Amrr6MjVUS,22/04/2008,A006507,Propicillin,1 daily,20,2,Tablets,....."
^Prescription(1,"ATC Code")="J01CE03"
^Prescription(1,"Patient ID")="A2rJFF8mDe"
^Prescription(1,"Prescription Duration")=60
^Prescription(1,"Script Date")="22/04/2008"
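The three pre-processing steps can be sketched roughly as follows. This is an illustration rather than the thesis implementation: the field names, the DRUG_TO_ATC lookup table and the rule that the dose text begins with the number of units taken per day are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup; a real run would need a full drug-name-to-ATC mapping.
DRUG_TO_ATC = {"Propicillin": "J01CE03"}


@dataclass
class RawScript:
    patient_key: str
    date: str
    drug_name: Optional[str]
    dose: str       # free text such as "1 daily"
    pack_size: int
    repeats: int


def units_per_day(dose: str) -> float:
    # ASSUMPTION: the dose text starts with the number of units taken per day.
    return float(dose.split()[0])


def preprocess(scripts):
    """Return (patient_key, date, atc_code, duration_days) rows."""
    rows = []
    for s in scripts:
        if not s.drug_name:
            continue  # step 1: drop prescriptions with no drug recorded
        # step 2: Duration = Packsize * (Repeats + 1) / Dose
        duration = s.pack_size * (s.repeats + 1) / units_per_day(s.dose)
        # step 3: convert the drug name to its ATC code
        rows.append((s.patient_key, s.date, DRUG_TO_ATC[s.drug_name], duration))
    return rows


example = [
    RawScript("A2rJFF8mDe", "22/04/2008", "Propicillin", "1 daily", 20, 2),
    RawScript("A2rJFF8mDe", "22/04/2008", None, "1 daily", 20, 2),  # made-up row with no drug
]
print(preprocess(example))
# [('A2rJFF8mDe', '22/04/2008', 'J01CE03', 60.0)]
```

For the record shown on the slide this gives 20 * (2 + 1) / 1 = 60 days, matching the Duration column.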

Page 9

METHODOLOGY – SEQUENTIAL PATTERN MINING

There are many sequential pattern mining algorithms, but the majority are overly complex or specific to multi-dimensional data, time-sensitive data, etc.

We want a simple method for a proof of concept

There are some problems with existing “simple” algorithms such as AprioriAll (Agrawal & Srikant 1995) and PrefixSpan (Pei et al. 2004)

Page 10

METHODOLOGY – PROBLEM 1: PATTERNS WITHIN SEQUENCES

Patterns: A -> B

SPM with Min Support: 3

Page 11

METHODOLOGY – PROBLEM 1: PATTERNS WITHIN SEQUENCES

Patterns: ---

SPM with Min Support: 3
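The diagrams for these two slides are not preserved in the transcript. One plausible reading, offered only as an assumption, is that a conventional miner counts support as the number of sequences that contain a pattern, so a pattern repeated inside a single patient's sequence is counted once and falls below the threshold. The toy sequences below are invented for illustration.

```python
def contains_in_order(sequence, pattern):
    """True if the pattern's items appear in the sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # 'item in remaining' advances the iterator, so later items must come later.
    return all(item in remaining for item in pattern)


def sequence_support(sequences, pattern):
    """Conventional support: the number of sequences containing the pattern."""
    return sum(contains_in_order(seq, pattern) for seq in sequences)


MIN_SUPPORT = 3
across = [["A", "B"], ["A", "B"], ["A", "B"]]   # A -> B once in each of three sequences
within = [["A", "B", "A", "B", "A", "B"]]       # A -> B three times in one sequence

print(sequence_support(across, ["A", "B"]) >= MIN_SUPPORT)  # True: A -> B is reported
print(sequence_support(within, ["A", "B"]) >= MIN_SUPPORT)  # False: nothing is reported
```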

Page 12

METHODOLOGY – PROBLEM 2: TRANSITIVE PATTERNS

SPM with Min Support: 3

Patterns: A -> C
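The diagram for this slide is also missing, so the following is a hedged illustration of what a transitive pattern presumably means here: when gaps are allowed between items, A -> C is reported even though C never directly follows A. The sequences are made up, and the helper names are illustrative.

```python
from collections import Counter
from itertools import combinations


def gapped_pairs(sequence):
    """All ordered pairs (x, y) with x somewhere before y, counted once per sequence."""
    return {(sequence[i], sequence[j]) for i, j in combinations(range(len(sequence)), 2)}


def adjacent_pairs(sequence):
    """Only pairs where y immediately follows x, counted once per sequence."""
    return set(zip(sequence, sequence[1:]))


def support(sequences, pair_fn):
    """Number of sequences in which each pair occurs."""
    counts = Counter()
    for seq in sequences:
        counts.update(pair_fn(seq))
    return counts


sequences = [["A", "B", "C"], ["A", "B", "C"], ["A", "B", "C"]]
gapped = support(sequences, gapped_pairs)
adjacent = support(sequences, adjacent_pairs)

print(gapped[("A", "C")], adjacent[("A", "C")])
# 3 0
```

With a minimum support of 3, the gapped count reports A -> C while the adjacent count does not.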

Page 13

METHODOLOGY – ADDRESSING PROBLEMS

1. Link data with the same Patient Key
2. Break each patient’s sequence up into individual 2-drug sequences

No transitive effect, because there are no possible gaps
No missing patterns within patients, as all prescriptions are treated as their own sequence
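The two steps above can be sketched as below. The slides do not say how prescriptions are ordered or how Duration is used, so sorting by script date and pairing each prescription with the next one for the same patient is an assumption, as are the function name and the sample rows.

```python
from collections import defaultdict
from datetime import datetime


def two_drug_sequences(prescriptions):
    """prescriptions: iterable of (patient_key, 'dd/mm/yyyy', atc_code).

    Returns (patient_key, first_code, second_code) for consecutive prescriptions.
    """
    by_patient = defaultdict(list)
    for patient, date, code in prescriptions:       # step 1: link by Patient Key
        by_patient[patient].append((datetime.strptime(date, "%d/%m/%Y"), code))

    pairs = []
    for patient, scripts in by_patient.items():
        scripts.sort(key=lambda s: s[0])             # ASSUMPTION: order by script date
        codes = [code for _, code in scripts]
        # step 2: every adjacent pair becomes its own 2-drug sequence
        pairs.extend((patient, a, b) for a, b in zip(codes, codes[1:]))
    return pairs


scripts = [  # invented rows
    ("P1", "01/03/2008", "A10BA02"),
    ("P1", "01/01/2008", "N05CD07"),
    ("P1", "01/02/2008", "N02BE01"),
    ("P2", "15/01/2008", "A10BA02"),
    ("P2", "15/02/2008", "A10BB09"),
]
for pair in two_drug_sequences(scripts):
    print(pair)
# ('P1', 'N05CD07', 'N02BE01')
# ('P1', 'N02BE01', 'A10BA02')
# ('P2', 'A10BA02', 'A10BB09')
```

Because only adjacent prescriptions are paired, N05CD07 -> A10BA02 is never produced for P1, which is the "no gaps" property the slide describes. This sketch covers 2-item patterns only; how the thesis builds the longer patterns is not shown on the slides.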

Page 14

METHODOLOGY – PATTERN MINING

Use 2 support thresholds
  Minimum Support
  Minimum Patients

Pattern mine iteratively:
  Mine for patterns in level 5 (flat data)
  Modify the ATC codes (e.g. A01AA01 -> A01AA)
  Mine for patterns in level 4
  Modify the ATC codes (e.g. A01AA -> A01A)
  Etc.

Reflect on the strength and relevance of the patterns gained
  How strongly supported the patterns are
  How meaningful they are
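A minimal sketch of the iterative loop, restricted to 2-item patterns. How the two thresholds combine is not stated on the slide; keeping a pattern when it meets either threshold is an assumption, chosen because it appears consistent with the example patterns shown later in the deck. The function names and the input rows are invented.

```python
from collections import defaultdict

# Leading characters that identify an ATC group at each level (standard ATC layout).
ATC_PREFIX_LENGTH = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def mine_level(pairs, level, min_support, min_patients):
    """pairs: (patient, first_code, second_code) tuples holding level-5 ATC codes.

    Returns {pattern: (occurrences, patients)} for the patterns kept at this level.
    """
    n = ATC_PREFIX_LENGTH[level]
    occurrences = defaultdict(int)
    patients = defaultdict(set)
    for patient, first, second in pairs:
        pattern = (first[:n], second[:n])   # modify the codes to this level
        occurrences[pattern] += 1
        patients[pattern].add(patient)
    # ASSUMPTION: meeting either threshold is enough to keep a pattern.
    return {
        p: (occurrences[p], len(patients[p]))
        for p in occurrences
        if occurrences[p] >= min_support or len(patients[p]) >= min_patients
    }


def mine_all_levels(pairs, min_support, min_patients):
    """Mine level 5 first, then each coarser level in turn."""
    return {
        level: mine_level(pairs, level, min_support, min_patients)
        for level in range(5, 0, -1)
    }


pairs = [  # invented 2-drug sequences
    ("P1", "A10BA02", "A10BB09"),
    ("P2", "A10BA02", "A10BB01"),
    ("P3", "A10BA02", "A10BB07"),
]
result = mine_all_levels(pairs, min_support=6, min_patients=3)
print(result[5])  # {}
print(result[4])  # {('A10BA', 'A10BB'): (3, 3)}
```

In this toy input no single drug pair is frequent enough at level 5, but the three pairs merge into one pattern at level 4.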

Page 15

EXAMPLE PATTERNS WITH MINSUP = 6, MINPAT = 3

LEVEL 5 : 2 ITEM PATTERNS

PATTERN: Temazepam (N05CD07) -> Paracetamol (N02BE01)
PATTERN OCCURS 8 TIMES IN TOTAL OF 1 PATIENTS
OCCURRENCES: (3, 49) (3, 52) (3, 56) (3, 58) (3, 65) (3, 67) (3, 80) (3, 100)

PATTERN: Metformin (A10BA02) -> Gliclazide (A10BB09)
PATTERN OCCURS 5 TIMES IN TOTAL OF 3 PATIENTS
OCCURRENCES: (1, 64) (4, 1) (4, 3) (13, 45) (13, 49)

....

LEVEL 4 : 2 ITEM PATTERNS

PATTERN: Biguanides (A10BA) -> Sulfonamides, urea derivatives (A10BB)
PATTERN OCCURS 5 TIMES IN TOTAL OF 3 PATIENTS
OCCURRENCES: (1, 64) (4, 1) (4, 3) (13, 45) (13, 49)

....

Page 16

RESULTS FOR MINSUP = 6, MINPAT = 3

Level  Pattern Length  Unique Patterns  Total Patterns
5      2               8                55
4      2               11               73
3      2               15               109
3      3               1                6
2      2               20               152
2      3               1                6
1      2               31               430
1      3               25               211
1      4               11               73
1      5               2                17
1      6               1                7


Page 18

RESULTS FOR MINSUP = 6, MINPAT = 3

Level  Example Code  Pattern Length  Unique Patterns  Total Patterns
5      A01AA01       2               8                55
4      A01AA         2               11               73
3      A01A          2               15               109
3      A01A          3               1                6
2      A01           2               20               152
2      A01           3               1                6
1      A             2               31               430
1      A             3               25               211
1      A             4               11               73
1      A             5               2                17
1      A             6               1                7

Page 19

INTERESTING PATTERNS DISCOVERED

Only emerges at level 4, and is one of the strongest patterns

Gets diluted at level 5

PATTERN: ACE inhibitors, plain (C09AA) -> HMG CoA reductase inhibitors (C10AA)
PATTERN OCCURS 9 TIMES IN TOTAL OF 2 PATIENTS

PATTERN: OTHER ANALGESICS AND ANTIPYRETICS (N02B) -> HYPNOTICS AND SEDATIVES (N05C)
PATTERN OCCURS 6 TIMES IN TOTAL OF 1 PATIENTS

Identified as unusual by supervisor Jan Stanek

PATTERN: Metformin (A10BA02) -> Gliclazide (A10BB09)
PATTERN OCCURS 5 TIMES IN TOTAL OF 3 PATIENTS

PATTERN: Biguanides (A10BA) -> Sulfonamides, urea derivatives (A10BB)
PATTERN OCCURS 5 TIMES IN TOTAL OF 3 PATIENTS

PATTERN: ORAL BLOOD GLUCOSE LOWERING DRUGS (A10B) -> ORAL BLOOD GLUCOSE LOWERING DRUGS (A10B)
PATTERN OCCURS 19 TIMES IN TOTAL OF 3 PATIENTS

Very vague pattern; it is present at the lower levels
By the higher level it loses its meaning and becomes contaminated
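To see how this contamination arises mechanically, the sketch below truncates the Metformin -> Gliclazide pattern from the slide to each coarser level. The prefix lengths assume the standard ATC layout, and pattern_at_level is an illustrative name rather than anything from the thesis.

```python
# Leading characters that identify an ATC group at each level (standard ATC layout).
ATC_PREFIX_LENGTH = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def pattern_at_level(pattern, level):
    """Rewrite every code in a pattern as its ancestor group at the given level."""
    n = ATC_PREFIX_LENGTH[level]
    return tuple(code[:n] for code in pattern)


metformin_gliclazide = ("A10BA02", "A10BB09")
for level in (5, 4, 3):
    print(level, pattern_at_level(metformin_gliclazide, level))
# 5 ('A10BA02', 'A10BB09')
# 4 ('A10BA', 'A10BB')
# 3 ('A10B', 'A10B')
```

At level 3 both sides fall into A10B, so any pair of oral glucose-lowering drugs feeds the same A10B -> A10B pattern. That may be why the count on the slide rises from 5 to 19 while the pattern says less about which drug follows which.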

Page 20

ACHIEVEMENTS

Created a modified algorithm
  Finds patterns within sequences as well as across sequences
  Uses 2 support thresholds

Discovered more rules than when performing pattern mining on flat data

Assessed the impact of using a hierarchy for sequential pattern mining
  YES, using a hierarchy DOES further pattern mining
  BUT it is up to an expert in the field to assess whether the extra patterns are useful or not

Page 21

CONCLUSION

Unique research
  Using a hierarchy to enhance pattern mining
  Finding patterns within sequences as well as across all sequences
  Small data set required

This research shows the potential importance of using a hierarchy to enhance data mining

Forms the basis for further research

Page 22

REFERENCES

Agrawal, R & Srikant, R 1995, 'Mining Sequential Patterns', paper presented at the Eleventh International Conference on Data Engineering.

Pei, J, Han, J, Dayal, U, Mortazavi-Asl, B, Wang, J, Pinto, H, Chen, Q & Hsu, M 2004, 'Mining sequential patterns by pattern-growth: The PrefixSpan approach', IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11.

Zhao, Q & Bhowmick, S 2003, 'Sequential pattern mining: A survey', Technical Report, CAIS, Nanyang Technological University, Singapore, pp. 1–26.

Page 23

DISCUSSION