
Post on 06-Jan-2018

PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems

IEEE Big Data 2014

Agenda

• Introduction
• Motivation
• Model Structure
• Progressive Learning
• Use Cases
  – Automate MS Annotation (Multi-label Classification)
  – Latent Semantic Discovery
• Conclusion


Introduction

• Probabilistic graphical models (PGM) consist of a structural model and a set of conditional probabilities.
• Graphical models can be classified into two major categories:
  – (1) directed graphical models (Bayesian networks)
  – (2) undirected graphical models (Markov Random Fields)


Motivation

[Figure: MS1, MS2, and MS3 levels, with fragment nodes Frag1, Frag2, ... and nodes GOG1, GOG2]

1300 * 2,979,334 = 3,873,134,200


Model Structure

[Figure: PGMHD graph spanning the MS1, MS2, and MS3 levels, with nodes GOG1 and GOG2, fragment nodes F1 to F11, and edges labeled with numeric counts]

P(GOG1 | F1, F3, F7) = P(GOG1 | F1) * P(GOG1 | F3) * P(F3 | F7) = 50/50 * 20/60 * 10/25
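The score above is a product of edge-count ratios. The following is a minimal sketch of that arithmetic, assuming each conditional probability is an edge count divided by the total count entering the child node. The `edge_counts` dictionary and both helper functions are hypothetical names; only the counts 50, 20, and 10 come from the slide, and the remaining counts are assumed so that the denominators come out as 50, 60, and 25.

```python
from fractions import Fraction

# Hypothetical edge counts keyed by (parent, child). Only the counts 50, 20
# and 10 appear in the slide's formula; the rest are assumed so that the
# totals entering F1, F3 and F7 are 50, 60 and 25.
edge_counts = {
    ("GOG1", "F1"): 50,
    ("GOG1", "F3"): 20,
    ("GOG2", "F3"): 40,
    ("F3", "F7"): 10,
    ("F4", "F7"): 15,
}

def parent_given_child(parent, child, counts):
    """Edge count divided by the total count entering `child`."""
    total = sum(c for (_, ch), c in counts.items() if ch == child)
    return Fraction(counts[(parent, child)], total)

def path_score(pairs, counts):
    """Multiply the conditional probabilities for each (parent, child) pair."""
    score = Fraction(1)
    for parent, child in pairs:
        score *= parent_given_child(parent, child, counts)
    return score

score = path_score([("GOG1", "F1"), ("GOG1", "F3"), ("F3", "F7")], edge_counts)
print(score, float(score))
```

Using `Fraction` keeps the ratios exact, so the result can be compared directly with the hand calculation 50/50 * 20/60 * 10/25.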


Progressive Learning

• This learning technique is very attractive in the big data age for the following reasons:
  – Training the model does not require processing all data upfront.
  – It can easily learn from new data without the need to re-include the previous training data in the learning.
  – The training session can be distributed instead of doing it in one long-running session.
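The three properties listed above follow naturally if learning only increments counts. The sketch below illustrates this under that assumption; the `EdgeCounter` class, its methods, and the sample observations are hypothetical and are not taken from the PGMHD implementation.

```python
from collections import Counter

class EdgeCounter:
    """Hypothetical count store: learning only increments edge counts."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, observations):
        """Add a batch of (parent, child) observations to the running counts."""
        self.counts.update(observations)

    def merge(self, other):
        """Combine counts learned in a separate session (e.g. another worker)."""
        self.counts.update(other.counts)

# Made-up observations, used only to exercise the class.
batch_1 = [("GOG1", "F1"), ("GOG1", "F3")]
batch_2 = [("GOG2", "F3"), ("GOG1", "F1")]

# One model sees everything at once.
full = EdgeCounter()
full.learn(batch_1 + batch_2)

# Another learns batch by batch, never revisiting batch_1.
incremental = EdgeCounter()
incremental.learn(batch_1)
incremental.learn(batch_2)

# Two workers train separately and are merged afterwards.
worker_a, worker_b = EdgeCounter(), EdgeCounter()
worker_a.learn(batch_1)
worker_b.learn(batch_2)
worker_a.merge(worker_b)

print(full.counts == incremental.counts == worker_a.counts)
```

Because addition of counts is order-independent, all three training schedules end with the same counts, which is what makes batch-by-batch and distributed training equivalent to a single pass in this sketch.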


Automate MS Annotation (Multi-label Classification)

• Data set includes:

Item                 Count
Scan                 1974
Peak                 266571
Edges                10743
Root                 450
MS2 Fragment Node    5983
MS3 Fragment Node    201


Results


Latent Semantic Discovery

[Figure: job titles (Java Developer, .NET Developer, Nurse, Health Care) linked to terms (Java, J2EE, C#, Care giver, RN, Senior Home) by edges labeled with numeric counts]

P(Java,J2EE| Java Developer) = P(Java|Java Developer) * P(J2EE|Java Developer) = 5/7 * 10/10

P(Java,C#|Java Dev, .NET Dev) = P(Java|Java Dev)*P(Java|.NET Dev) * P(C#|Java Dev) * P(C#|.NET Dev)
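Both formulas above multiply P(term | job title) over every term and title pair. Here is a minimal sketch of that product, assuming the conditional probabilities are already available in a lookup table. The `cond_prob` dictionary and the `joint_probability` function are hypothetical names, and the only values used are the 5/7 and 10/10 shown on the slide.

```python
from fractions import Fraction
from itertools import product

# P(term | job title). The two values are the ones on the slide (5/7, 10/10).
cond_prob = {
    ("Java", "Java Developer"): Fraction(5, 7),
    ("J2EE", "Java Developer"): Fraction(10, 10),
}

def joint_probability(terms, titles, cond):
    """Multiply P(term | title) over every (term, title) pair."""
    result = Fraction(1)
    for term, title in product(terms, titles):
        result *= cond[(term, title)]
    return result

p = joint_probability(["Java", "J2EE"], ["Java Developer"], cond_prob)
print(p)
```

Passing two titles, for example Java Developer and .NET Developer, multiplies over all four pairs in the same order as the second formula. A pair missing from the table raises `KeyError` in this sketch; how the actual model treats unseen pairs is not stated on the slides.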


Results


Conclusion

• We propose an efficient and scalable probabilistic graphical model for massive hierarchical data (PGMHD).

• We successfully applied PGMHD to the bioinformatics domain to automatically classify and annotate high-throughput mass spectrometry data.

• We successfully applied this model to large-scale latent semantic discovery, using 1.6 billion search log entries provided by CareerBuilder.com within a Hadoop Map/Reduce framework.


Questions
