PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems IEEE Big Data 2014

Upload: june-burke

Post on 06-Jan-2018



Page 1

PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems

IEEE Big Data 2014

Page 2

Agenda

• Introduction
• Motivation
• Model Structure
• Progressive Learning
• Use Cases
– Automate MS Annotation (Multi-label Classification)
– Latent Semantic Discovery
• Conclusion

Page 3

Introduction

• Probabilistic graphical models (PGM) consist of a structural model and a set of conditional probabilities.
• Graphical models can be classified into two major categories:
– (1) directed graphical models (Bayesian networks)
– (2) undirected graphical models (Markov Random Fields)

Page 4

Motivation

[Figure: hierarchy diagram with the level labels MS1, MS2, and MS3 and the node labels GOG1, GOG2, Frag1, Frag2, ..]

1300 * 2,979,334 = 3,873,134,200

Page 5

Model Structure

[Figure: hierarchical graph with the level labels MS1, MS2, and MS3, the nodes GOG1 and GOG2, the fragment nodes F1 to F11, and numeric count labels]

P(GOG1 | F1,F3,F7) = P(GOG1|F1) * P(GOG1|F3) * P(F3|F7) = 50/50 * 20/60 * 10/25
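The product on the Model Structure slide can be reproduced with a short sketch. It assumes that each factor P(parent|child) is the count on the edge from parent to child divided by the sum of the counts on all edges entering that child, which is what the ratios 50/50, 20/60, and 10/25 suggest. The edge table below is hypothetical: the entries (GOG2, F3) = 40 and (F5, F7) = 15 are assumptions chosen only so that the totals match the slide, and the function names are illustrative rather than taken from the paper.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical edge counts: (parent, child) -> co-occurrence frequency.
# Only the ratios 50/50, 20/60 and 10/25 come from the slide; the
# (GOG2, F3) and (F5, F7) entries are assumptions that make the totals fit.
EDGE_COUNTS = {
    ("GOG1", "F1"): 50,
    ("GOG1", "F3"): 20,
    ("GOG2", "F3"): 40,
    ("F3", "F7"): 10,
    ("F5", "F7"): 15,
}


def incoming_totals(edge_counts):
    """Sum the counts of all edges that enter each child node."""
    totals = defaultdict(int)
    for (_, child), count in edge_counts.items():
        totals[child] += count
    return totals


def edge_probability(edge_counts, parent, child):
    """Estimate P(parent | child) as edge count / total incoming count."""
    total = incoming_totals(edge_counts)[child]
    if total == 0:
        return Fraction(0)
    return Fraction(edge_counts.get((parent, child), 0), total)


def path_score(edge_counts, edges):
    """Multiply the edge probabilities along the observed edges."""
    score = Fraction(1)
    for parent, child in edges:
        score *= edge_probability(edge_counts, parent, child)
    return score


observed = [("GOG1", "F1"), ("GOG1", "F3"), ("F3", "F7")]
score = path_score(EDGE_COUNTS, observed)
print(score)  # 2/15, i.e. 50/50 * 20/60 * 10/25
```

Using `Fraction` keeps the result exact, so it can be compared directly with the ratios printed on the slide.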

Page 6

Progressive Learning

• This learning technique is very attractive in the big data age for the following reasons:
– Training the model does not require processing all data upfront.
– It can easily learn from new data without the need to re-include the previous training data in the learning.
– The training session can be distributed instead of doing it in one long-running session.
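One way to picture the three properties of progressive learning is a model that stores only edge counts. The sketch below is an assumption about how such counts could be maintained, not the authors' implementation: `learn_batch` and `merge_models` are hypothetical helper names, and the observations are made-up (parent, child) pairs.

```python
from collections import Counter


def learn_batch(model, observations):
    """Add one batch of (parent, child) observations to the edge counts."""
    model.update(observations)
    return model


def merge_models(*models):
    """Combine edge counts that were learned separately, e.g. on different workers."""
    merged = Counter()
    for partial in models:
        merged.update(partial)
    return merged


# Made-up observations, split into two batches.
batch_1 = [("GOG1", "F1"), ("GOG1", "F3")]
batch_2 = [("GOG2", "F3"), ("GOG1", "F1")]

# Incremental: the second batch is learned without revisiting the first.
sequential = learn_batch(learn_batch(Counter(), batch_1), batch_2)

# Distributed: each batch is learned on its own, then the counts are merged.
distributed = merge_models(
    learn_batch(Counter(), batch_1),
    learn_batch(Counter(), batch_2),
)

print(sequential == distributed)  # True
```

Because the counts are additive, the order of the batches and the way they are split across sessions do not change the final model in this sketch.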

Page 7

Automate MS Annotation (Multi-label Classification)

• Data Set Includes:

Item                 Count
Scan                 1974
Peak                 266571
Edges                10743
Root                 450
MS2 Fragment Node    5983
MS3 Fragment Node    201

Page 8

Results

Page 9

Results

Page 10

Results

Page 11
Page 12

Latent Semantic Discovery

[Figure: graph linking the nodes Java Developer, .NET Developer, Nurse, and Health Care to the nodes Java, J2EE, C#, Care giver, RN, and Senior Home, with numeric count labels]

P(Java,J2EE | Java Developer) = P(Java|Java Developer) * P(J2EE|Java Developer) = 5/7 * 10/10

P(Java,C# | Java Dev, .NET Dev) = P(Java|Java Dev) * P(Java|.NET Dev) * P(C#|Java Dev) * P(C#|.NET Dev)
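The two products on the Latent Semantic Discovery slide follow the same pattern: one conditional-probability factor for every (term, parent) pair. A minimal sketch of that pattern is shown below. Only the values 5/7 and 10/10 come from the slide; the other three table entries are assumed placeholder values, and `joint_probability` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

# P(term | parent). Only the first two values appear on the slide;
# the remaining three are assumed placeholders for illustration.
COND_PROB = {
    ("Java", "Java Developer"): Fraction(5, 7),    # from the slide
    ("J2EE", "Java Developer"): Fraction(10, 10),  # from the slide
    ("Java", ".NET Developer"): Fraction(2, 7),    # assumed
    ("C#", "Java Developer"): Fraction(1, 10),     # assumed
    ("C#", ".NET Developer"): Fraction(9, 10),     # assumed
}


def joint_probability(terms, parents, table=COND_PROB):
    """Multiply P(term | parent) over every (term, parent) pair."""
    result = Fraction(1)
    for term, parent in product(terms, parents):
        result *= table.get((term, parent), Fraction(0))
    return result


# First formula on the slide: 5/7 * 10/10.
p_java_j2ee = joint_probability(["Java", "J2EE"], ["Java Developer"])
print(p_java_j2ee)  # 5/7

# Second formula on the slide, evaluated with the assumed placeholder values.
p_java_csharp = joint_probability(
    ["Java", "C#"], ["Java Developer", ".NET Developer"]
)
print(p_java_csharp)  # 9/490
```

A missing (term, parent) pair is treated as probability zero here, which is a simplifying choice of this sketch rather than something the slides specify.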

Page 13

Results

Page 14

Conclusion

• We propose an efficient and scalable probabilistic graphical model for massive hierarchical data (PGMHD).
• We successfully applied PGMHD to the bioinformatics domain to automatically classify and annotate high-throughput mass spectrometry data.
• We successfully applied this model to large-scale latent semantic discovery by using 1.6 billion search log entries provided by CareerBuilder.com within a Hadoop Map/Reduce framework.

Page 15

Questions