PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems
IEEE Big Data 2014
Agenda
• Introduction
• Motivation
• Model Structure
• Progressive Learning
• Use Cases
  – Automate MS Annotation (Multi-label Classification)
  – Latent Semantic Discovery
• Conclusion
Introduction
• Probabilistic graphical models (PGM) consist of a structural model and a set of conditional probabilities.
• Graphical models can be classified into two major categories:
  – (1) directed graphical models (Bayesian networks)
  – (2) undirected graphical models (Markov Random Fields)
Motivation
[Figure: MS1 → MS2 → MS3 hierarchy, with fragment nodes (Frag1, Frag2, ...) linked to GOG1, GOG2, ...]
1300 * 2,979,334 = 3,873,134,200
Model Structure
[Figure: example hierarchy across levels MS1, MS2, and MS3, with nodes GOG1 and GOG2 and fragment nodes F1 to F11; each edge is labeled with a numeric count]
P(GOG1 | F1, F3, F7) = P(GOG1|F1) * P(GOG1|F3) * P(F3|F7) = 50/50 * 20/60 * 10/25
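The product on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic: the three factors are copied from the slide, while the dictionary layout and variable names are assumptions made for illustration, not part of PGMHD itself.

```python
from fractions import Fraction
from math import prod

# Factors copied from the slide; the dictionary layout is illustrative only.
factors = {
    "P(GOG1|F1)": Fraction(50, 50),
    "P(GOG1|F3)": Fraction(20, 60),
    "P(F3|F7)": Fraction(10, 25),
}

# Multiply the individual conditional probabilities, as the slide does.
score = prod(factors.values())

print(score)         # exact value as a fraction
print(float(score))  # decimal approximation
```

Using exact fractions keeps the count ratios visible instead of rounding each factor first.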
Progressive Learning
• This learning technique is very attractive in the big data age for the following reasons:
  – Training the model does not require processing all data upfront.
  – It can easily learn from new data without the need to re-include the previous training data in the learning.
  – The training session can be distributed instead of doing it in one long-running session.
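One way to picture these three properties is a model whose parameters are plain counts. The sketch below assumes that edge probabilities are ratios of co-occurrence counts, which is consistent with fractions such as 20/60 elsewhere in the deck; the class `EdgeCounts` and its methods are hypothetical names, not the authors' implementation.

```python
from collections import Counter


class EdgeCounts:
    """Hypothetical count store: one counter per (parent, child) edge."""

    def __init__(self):
        self.edge = Counter()   # (parent, child) -> co-occurrence count
        self.child = Counter()  # child -> total occurrences

    def update(self, observations):
        """Fold in a new batch of (parent, child) pairs; old data is not revisited."""
        for parent, child in observations:
            self.edge[(parent, child)] += 1
            self.child[child] += 1

    def merge(self, other):
        """Add counts learned elsewhere, for example by another worker."""
        self.edge.update(other.edge)
        self.child.update(other.child)

    def prob(self, parent, child):
        """Estimate P(parent | child) as edge count divided by child total."""
        total = self.child[child]
        return self.edge[(parent, child)] / total if total else 0.0


batch1 = [("GOG1", "F1"), ("GOG1", "F3")]
batch2 = [("GOG2", "F3"), ("GOG1", "F3")]

# Sequential training: the second batch is added without replaying the first.
sequential = EdgeCounts()
sequential.update(batch1)
sequential.update(batch2)

# Distributed training: two workers learn separately, then merge their counts.
worker_a, worker_b = EdgeCounts(), EdgeCounts()
worker_a.update(batch1)
worker_b.update(batch2)
worker_a.merge(worker_b)
```

Because addition of counts is order-independent, the sequential and merged models end up with identical counts in this sketch.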
Automate MS Annotation (Multi-label Classification)
• Data Set Includes:

Item                 Count
Scan                 1974
Peak                 266571
Edges                10743
Root                 450
MS2 Fragment Node    5983
MS3 Fragment Node    201
Results
Latent Semantic Discovery
[Figure: two-level graph linking the nodes Java Developer, .NET Developer, Nurse, and Health Care to the term nodes Java, J2EE, C#, Care giver, RN, and Senior Home; each edge is labeled with a numeric count]
P(Java,J2EE|Java Developer) = P(Java|Java Developer) * P(J2EE|Java Developer) = 5/7 * 10/10
P(Java,C#|Java Dev, .NET Dev) = P(Java|Java Dev)*P(Java|.NET Dev) * P(C#|Java Dev) * P(C#|.NET Dev)
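Both formulas follow the same pattern: multiply P(term | parent) over every term and parent pair. A minimal sketch of that pattern follows. The helper `joint` and the `cond` table are hypothetical names, and only the two probabilities stated on the slide are filled in; the values needed for the two-parent case are not given in the transcript, so they are left out.

```python
from fractions import Fraction
from math import prod

# cond[(term, parent)] = P(term | parent); both values come from the slide.
cond = {
    ("Java", "Java Developer"): Fraction(5, 7),
    ("J2EE", "Java Developer"): Fraction(10, 10),
}


def joint(terms, parents, cond):
    """Product of P(term | parent) over every term/parent pair."""
    return prod(cond[(t, p)] for t in terms for p in parents)


# Single-parent case from the slide: P(Java, J2EE | Java Developer).
p = joint(["Java", "J2EE"], ["Java Developer"], cond)
print(p)
```

The two-parent formula would use the same call with both parents listed, once the four corresponding entries are present in `cond`.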
Results
Conclusion
• We propose an efficient and scalable probabilistic graphical model for massive hierarchical data (PGMHD).
• We successfully applied PGMHD to the bioinformatics domain to automatically classify and annotate high-throughput mass spectrometry data.
• We successfully applied this model to large-scale latent semantic discovery by using 1.6 billion search log entries provided by CareerBuilder.com within a Hadoop Map/Reduce framework.
Questions