the power of declarative analytics
DESCRIPTION
Invited Talk at Modern Data Management Systems Summit on August 29-30, 2014 at Tsinghua University in Beijing, China. http://ise.thss.tsinghua.edu.cn/MDMS/English/program.jsp
Abstract: Modern enterprises are increasingly relying on complex analyses on large data sets to drive business decisions. Tasks such as root cause analysis from system logs and lead generation based on social media, customer retention and digital marketing are rapidly gaining importance. These applications generally consist of three major analytic phases: text analytics, semi-structured data processing (joins, group-by, aggregation), and statistical/predictive modeling. The size of the datasets in conjunction with the complexity of the analysis necessitates large-scale distributed processing of the analytical algorithms. At IBM we are building tools and technologies based on declarative languages to support each of these analytic phases. The declarative nature of the language abstracts away the need for programmer-optimization. Furthermore, the syntax of these languages is designed to appeal to the corresponding communities. As an example for statistical modeling, we expose a high-level language with syntax similar to R -- a very popular statistical processing language. In this talk I will give an overview of some real-world big data applications we are currently working on and use that to motivate the need for declarative analytics consisting of the three major phases discussed above. I will then describe, in some detail, declarative systems for text analytics along with a discussion on speeds, feeds and comparisons.
TRANSCRIPT
© 2014 IBM Corporation
The Power of Declarative Analytics
August 30, 2014
Acknowledgement: Shiv, Sekar, Fred, Laura, Berthold, and many more than can be listed here.
Yunyao Li
IBM Almaden Research Center
Unlocking the value from big data
Case Study: Sentiment Analysis
Social Media
Product catalog, Customer Master Data, …
Text Analytics
360° Profile
• Products Interests
• Relationships
• Personal Attributes
• Life Events
Statistical Analysis, Report Gen.
Chart:
• Bank 6: 23%
• Bank 5: 20%
• Bank 4: 21%
• Bank 3: 5%
• Bank 2: 15%
• Bank 1: 15%
Customer 360°
Who can we cross/up sell?
What are our customers thinking of our brand?
What do our customers want?
Drug Development Pipeline
Phases: 0-I, II, III, IV
Laboratory: 50,000+ compounds
Pre-Clinical: 250 compounds
Clinical: 5 compounds
FDA approval: 1 compound
Time to develop a drug: 12-15 years
Avg cost to develop a drug: 1.2 billion
“..Toxicity and Serious Adverse Events in Late Stage Drug Development are the Major Causes of Drug Failure”
Intervention should happen at critical transition bottlenecks between stages (most likely to impact outcome)
Adapted from PhRMA (Pharmaceutical Research and Manufacturers of America) 2013 profile
Case Study: Drug Discovery
Structured and unstructured data sources
• Structure
• Indication
• Contraindication
• Mode of action / target
• Effect level
• Side Effects
Data-driven decision making
• More efficient clinical trial design, data analytics and drug success / failure predictions
Case Study: Water Cost Index
• Financial reports
• News feeds
• Websites
• …
Financial Analytics
What is the cost of water in different regions?
Who cares about the cost of water?
• Water agencies: Improve credit profile for water infrastructure projects
• Lenders: Better estimate cost and profits of such projects
• Insurers: Better understand underlying risk of such projects
• Consumers: Access water at an affordable price despite increasing population and demand for water
• Provides market benchmark
• Spurs growth of financial products for both water producers and investors
Case Study: Water Cost Index
• Financial reports
• News feeds
• Websites
• …
Text Analytics
Statistical Analysis
Financial Analytics
Water Cost Index
• Uganda signed up as 1st customer for WCI
• WCI published on ongoing basis starting end of Sep. 2013
• Wall Street Journal article on WCI
What is this talk about?
• What makes analytics tasks difficult and what can be learnt from the success of relational systems
• Brief description of declarative systems being built at IBM for
  √ Information Extraction (SystemT)
  √ Machine Learning (SystemML)
  X Entity Resolution (DeeR)
IBM Research – Almaden, 9/10/2014
Data integration
Statistical Analysis / Machine Learning
Information Extraction
Databases
Semi-/Unstructured Documents
Challenges in Information Extraction
Example: Named Entity Recognition
… Laura Haas works for IBM in San Jose, CA. …
Information Extraction
Person | Org | Loc
Laura Haas | IBM | San Jose, CA
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
Challenges in Information Extraction
Breadth: Wide variety of extraction tasks
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
IE development takes effort!!
• Collecting dictionaries
• Writing regular expressions
• Collecting other word-level features
Labeling + training/tuning machine learning models
or
Writing + testing rules
Challenges in Information Extraction
Breadth: Wide variety of extraction tasks
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
IE development takes effort!!
Complexity: In development & customization
… Pres. Barack Obama arrived today at the White House …
Entity Boundary: Person or Position + Person?
Entity Definition: Location/Facility/Organization?
Domain customization is usually required!!
Challenges in Information Extraction
Breadth: Wide variety of extraction tasks
Named Entity Hierarchy: 200+ types: Person, Organization, Location, …
IE development takes effort!!
Complexity: In development & customization
… Pres. Barack Obama arrived today at the White House …
Entity Boundary: Person or Position + Person?
Entity Definition: Location/Facility/Organization?
Domain customization is usually required!!
State-of-the-art Open-Source Rule-based System
• 80,000+ dictionary entries
• 4,800 lines of JAPE and Java code
• Accuracy (English): 50%-80%
• Performance: 20KB/sec, 8GB RAM
State-of-the-art Machine-learning system
• Combination of 4 classifiers
• 150,000+ dictionary entries
• 15+ regexes for word features
• Accuracy: 89%
• Throughput: ~10 KB/sec
Scale: 450M+ tweets per day, …
Challenges in Scalable Machine Learning
V ≈ W x H, where V is documents x words, W is documents x topics, and H is topics x words
[Liu, WWW 2010]
• Billions of non-zeros within tens of hours
• Careful partitioning of data
• Maximize data locality and parallelism
• ~1500 lines of Java code
Complexity: In implementation
% initialize W, H
while (~converged)
W = W*(V%*%t(H))/(W%*%H%*%t(H))
H = H*(t(W)%*%V)/(t(W)%*%W%*%H)
end
Parallel implementation is half the story! Typical application requires experimenting with multiple variants
Breadth: Wide variety of ML models
Regularizers JW, JH:
W = W*max(V%*%t(H) - alphaW*JW, 0)/(W%*%H%*%t(H))
H = H*max(t(W)%*%V - alphaH*JH, 0)/(t(W)%*%W%*%H)
Weighted Sq Loss/Matrix Completion Setting:
W = W*((S*V)%*%t(H))/((S*(W%*%H))%*%t(H))
H = H*(t(W)%*%(S*V))/(t(W)%*%(S*(W%*%H)))
Different Loss function, KL-divergence:
W = W*((V/(W%*%H))%*%t(H))/(E%*%t(H))
H = H*(t(W)%*%(V/(W%*%H)))/(t(W)%*%E)
Scale: 450M+ tweets per day, …
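To make the basic multiplicative update rules on this slide concrete, here is a minimal single-machine sketch in plain Python. It is only an illustration of the two update formulas, not the distributed implementation from [Liu, WWW 2010] or SystemML. The helper names (`matmul`, `scale`, `gnmf_step`, `factorize`) and the small constant `EPS` are assumptions introduced for this example.

```python
import random

EPS = 1e-12  # assumption: guards against division by zero, not on the slide


def transpose(a):
    """Transpose a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Dense matrix product (the %*% operator in the slide's notation)."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def scale(m, num, den):
    """Element-wise m * num / den, the pattern shared by both updates."""
    return [[x * n / (d + EPS) for x, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(m, num, den)]


def gnmf_step(v, w, h):
    """One round of W = W*(V t(H))/(W H t(H)), then H = H*(t(W) V)/(t(W) W H)."""
    ht = transpose(h)
    w = scale(w, matmul(v, ht), matmul(matmul(w, h), ht))
    wt = transpose(w)
    h = scale(h, matmul(wt, v), matmul(matmul(wt, w), h))
    return w, h


def squared_loss(v, w, h):
    """Sum of squared differences between V and W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rp in zip(v, wh) for x, y in zip(rv, rp))


def random_matrix(rows, cols, rng):
    """Strictly positive random matrix, a common starting point for NMF."""
    return [[rng.uniform(0.1, 1.0) for _ in range(cols)] for _ in range(rows)]


def factorize(v, topics, iterations=50, seed=0):
    """Run a fixed number of update rounds from a random start."""
    rng = random.Random(seed)
    w = random_matrix(len(v), topics, rng)
    h = random_matrix(topics, len(v[0]), rng)
    for _ in range(iterations):
        w, h = gnmf_step(v, w, h)
    return w, h
```

Each variant on the slide (regularizers, weighted loss, KL-divergence) would change only the numerator and denominator passed to `scale`, which is the point the slide makes about needing to experiment with many variants.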
What is common across these analytics tasks?
Breadth
• Variety of problems and solutions
– Every customer’s data & problems are unique in some way
– Need to quickly implement new business logic
– Need to experiment with multiple algorithms for a particular analytic problem
Complexity
• Quality of answers is very important!!
– High quality analytics requires “complex” programs
– Skilled developers + domain experts
Scale
• Performance is critical
– Bigger data demands faster execution
  • Social Media: Twitter alone has 400M+ messages / day; 1TB+ per day
  • Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents ranging from a few KBs to a few MBs
  • Machine Data: One application server under moderate load at medium logging level → 1GB of logs per day
Declarative Systems: The Relational World
Task: Compute average salary for each department
SQL Query:
select D.did, avg(E.salary)
from Employee E, Department D
where E.did = D.did
group by D.did
Declarative High-level Language: User specifies tasks in a high-level language, w/o specifying algorithms for data processing
Query Optimization: System uses optimization strategies to choose from alternate execution plans
Physical Data Independence: User does not have to worry about physical data representation and access aids while writing queries; system manages the physical layer
Diagram labels: SQL Query, Query Optimizer, Execution Strategy, Tables, Indices
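The three properties on this slide can be tried out with Python's built-in `sqlite3` module. The sketch below runs the slide's query unchanged, then adds an index and runs it again: the query text stays the same while the system is free to pick a different plan. The table contents, the `eid` and `name` columns, the index name, and the helper function names are made up for illustration.

```python
import sqlite3

# The query from the slide, unchanged.
QUERY = """
    select D.did, avg(E.salary)
    from Employee E, Department D
    where E.did = D.did
    group by D.did
"""


def build_demo_db():
    """Create a tiny in-memory database (rows are illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table Department (did integer primary key, name text);
        create table Employee (eid integer primary key, did integer, salary real);
        insert into Department values (1, 'Research'), (2, 'Sales');
        insert into Employee values (1, 1, 100.0), (2, 1, 140.0), (3, 2, 90.0);
    """)
    return conn


def average_salaries(conn):
    """Run the declarative query; no join method or index is named."""
    return sorted(conn.execute(QUERY).fetchall())


def query_plan(conn):
    """Ask SQLite which execution strategy its optimizer chose."""
    rows = conn.execute("explain query plan " + QUERY).fetchall()
    return [row[-1] for row in rows]  # last column holds the plan text


if __name__ == "__main__":
    db = build_demo_db()
    print(average_salaries(db))
    print(query_plan(db))
    # Changing the physical layer does not require touching the query.
    db.execute("create index emp_did on Employee(did)")
    print(average_salaries(db))
    print(query_plan(db))
```

The exact plan text depends on the SQLite version, so the sketch prints it rather than assuming a particular plan.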
Why did Relational Systems succeed?
Pat Selinger
Boeing said “We can ask questions we could
never find the answers to before. We’re now able
to do more than we could ever do before.”
Bruce Lindsay The invention of nonprocedural specification was
a tremendous simplification that made it much
easier to specify applications. No longer did you
have to say which index to use and which join
method to use to get the job done.
SIGMOD Record, June 2005
SIGMOD Record, December 2003
Michael Stonebraker
Query optimizers can beat all but the best
DBMS application programmers.
“What Goes Around Comes Around”,
Readings in Database Systems, 4th Edition, 2005
What is common across these analytics tasks?
Breadth
• Variety of problems and solutions
– Every customer’s data & problems are unique in some way
– Need to quickly implement new business logic
– Need to experiment with multiple algorithms for a particular analytic problem
Complexity
• Quality of answers is very important!!
– High quality analytics → “complex” programs
– Skilled developers + domain experts
Scale
• Performance is critical
– Bigger data → faster execution
What am I going to talk about?
• What makes analytics tasks difficult and what can be learnt from the success of relational systems
• Brief description of declarative systems built at IBM and the design choices made along the way
Analytics Systems: SystemT (Information Extraction), SystemML (Machine Learning)
Design Choices: Data Model, Operations, Language Syntax, Platform
Information Extraction - SystemT
Informal Music Band Reviews from Blogs
I went … to the OTIS concert last night
They played … “I Will Survive”…
a bunch of other bands also playing
The sax player in that band…
Concert Mention Pattern
• Start with Concert Mention
• At least 3 occurrences of Music Review Snippet or Generic Review Snippet
• Consecutive Review Snippets are within 25 tokens
• Review ends with one of these.
• Complete review is within 200 tokens
State-of-the-art: Common Pattern Specification Language (CPSL)
A common language to specify and represent extraction rules as cascading grammars
Developed jointly between SRI and Department of Defense (1999)
Level 0 (input text):
… risus in sagittis facilisis Jon Foreman their lead vocal/guitarist hendrerit faucibus pede mi ipsum. …
Level 1 rules:
⟨Token⟩[~ “pipe | guitarist | …”] → ⟨Instrument⟩
⟨Token⟩[~ “([A-Z]\w+)\s+[A-Z]\w+”] → ⟨BandMember⟩
Level 1 (output):
… risus in sagittis facilisis <BandMember> their lead vocal/<Instrument> hendrerit faucibus pede mi ipsum. …
Example Rule: Band Member name followed within 5 tokens by Instrument clue is a Music Review Snippet
Level 2 rule:
⟨BandMember⟩ ⟨Token⟩{0,5} ⟨Instrument⟩ → ⟨MusicReviewSnippet⟩
Level 2 (output):
… lorem velit, sed <ReviewSnippet>, hendrerit faucibus pede mi sed ipsum. …
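The cascade on this slide can be imitated in a few lines of Python. This is a rough sketch of the idea only, not a CPSL engine: it labels tokens at level 1 and combines labels at level 2, and it ignores CPSL details such as rule priorities and left-to-right consumption of input. The clue set, the tokenizer, and all function names are assumptions made for illustration.

```python
import re

# Hypothetical clue dictionary; the slide only lists "pipe | guitarist | …".
INSTRUMENT_CLUES = {"pipe", "guitarist"}
CAPITALIZED = re.compile(r"[A-Z]\w+")


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def level1(tokens):
    """Level 1: label two capitalized words as BandMember, clue words as Instrument.

    Annotations are (label, start, end) over token positions, end exclusive.
    """
    annotations = []
    i = 0
    while i < len(tokens):
        pair = tokens[i:i + 2]
        if len(pair) == 2 and all(CAPITALIZED.fullmatch(t) for t in pair):
            annotations.append(("BandMember", i, i + 2))
            i += 2
        else:
            if tokens[i].lower() in INSTRUMENT_CLUES:
                annotations.append(("Instrument", i, i + 1))
            i += 1
    return annotations


def level2(annotations, max_gap=5):
    """Level 2: BandMember followed within max_gap tokens by Instrument."""
    members = [a for a in annotations if a[0] == "BandMember"]
    instruments = [a for a in annotations if a[0] == "Instrument"]
    return [("MusicReviewSnippet", m_start, i_end)
            for _, m_start, m_end in members
            for _, i_start, i_end in instruments
            if 0 <= i_start - m_end <= max_gap]


if __name__ == "__main__":
    toks = tokenize("We saw Jon Foreman their lead vocal/guitarist on stage")
    for label, start, end in level2(level1(toks)):
        print(label, " ".join(toks[start:end]))
```

Each level only sees the output of the level below it, which is why counting and aggregation across matches do not fit naturally into this style.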
Why is this not sufficient?
• Counting and aggregations are not natural primitives in grammar and have to be handled in custom code [Chiticariu, ACL 2010]
• Finely tuned grammar-based extraction system, with custom code for counting and aggregation, took ~6 hours to extract reviews from a million web logs
Concert Mention Pattern
• Start with Concert Mention
• At least 3 occurrences of Music Review Snippet or Generic Review Snippet
• Consecutive Review Snippets are within 25 tokens
• Review ends with one of these.
• Complete review is within 200 tokens
SystemT – Declarative Approach to Information Extraction
AQL: Rule language with familiar SQL-like syntax. Specify annotator semantics declaratively
SystemT Optimizer: Choose an efficient execution plan that implements the semantics
Compiled Operator Graph
SystemT Runtime: Highly scalable, embeddable Java runtime
Input Document Stream → SystemT Runtime → Annotated Document Stream
See SIGMOD 2010 tutorial [Chiticariu et al., 2010] for details on other recent declarative IE systems
Expressing Music Review Snippet Rule in AQL
<BandMember> followed within 0-5 tokens by <Instrument>
create view BandMember as
extract regex /[A-Z]\w+\s+[A-Z]\w+/ on D.text as name
from Document D;
create view MusicReviewSnippet as
select B.name as member, I.value as instrument,
CombineSpans(B.name,I.value) as review
from BandMember B, Instrument I
where FollowsTok(B.name, I.value, 0, 5);
Choice of SQL-like syntax for AQL motivated by wider adoption of SQL
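For readers without access to SystemT, the relational reading of the AQL rule above can be approximated in Python: views become lists of spans, the `from ... where` clause becomes a filtered cross product, and `FollowsTok` and `CombineSpans` become small functions. This is a sketch of the semantics as described on the slide, not the SystemT runtime. The tokenizer, the dictionary-based `Instrument` view, and the Python function names are assumptions for illustration.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    """Character offsets into the document text, end exclusive."""
    begin: int
    end: int


TOKEN = re.compile(r"\w+|[^\w\s]")  # assumed tokenizer: words and punctuation


def extract_regex(pattern, text):
    """Rough analogue of 'extract regex ... on D.text'."""
    return [Span(m.start(), m.end()) for m in re.finditer(pattern, text)]


def extract_dictionary(entries, text):
    """Assumed stand-in for the Instrument view: token-level dictionary match."""
    return [Span(m.start(), m.end()) for m in TOKEN.finditer(text)
            if m.group().lower() in entries]


def follows_tok(first, second, min_tok, max_tok, text):
    """True if 'second' starts min_tok..max_tok tokens after 'first' ends."""
    if second.begin < first.end:
        return False
    gap = len(TOKEN.findall(text[first.end:second.begin]))
    return min_tok <= gap <= max_tok


def combine_spans(first, second):
    """Smallest span covering both inputs, for first preceding second."""
    return Span(first.begin, second.end)


def music_review_snippets(text, instrument_clues):
    """Join BandMember with Instrument, mirroring the select/from/where."""
    band_members = extract_regex(r"[A-Z]\w+\s+[A-Z]\w+", text)
    instruments = extract_dictionary(instrument_clues, text)
    return [{"member": b, "instrument": i, "review": combine_spans(b, i)}
            for b in band_members for i in instruments
            if follows_tok(b, i, 0, 5, text)]
```

Because the rule is stated as a join with a predicate rather than as a left-to-right grammar, an optimizer is free to evaluate either side first, which the later optimization slide relies on.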
What makes AQL expressive?
• Extraction primitives
– Regular Expressions
– Dictionary
• Text-specific primitives
– Multi-lingual tokenization and parts-of-speech
– Sentence and paragraph boundary detection
– Span-based predicates
• Set-level primitives
– Join
– Block
– Consolidation
– Group By
How will the Music Band Review extractor work in SystemT?
• Music Review Snippet and Generic Review Snippet → Union → Review Snippet
• Review Snippet → Block → Blocks of Review Snippet: find blocks of three or more “Review Snippet” patterns
• Concert Mention and Blocks of Review Snippet → Join: join predicates enforce additional constraints
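The Block step is where the counting constraint ("at least 3 snippets, consecutive ones within 25 tokens") is handled. A minimal sketch of that grouping logic in Python follows. It is a simplification that returns only maximal runs and assumes non-overlapping input spans given as token offsets; the actual AQL block operator may define its output differently. The function name and parameters are made up for this example.

```python
def find_blocks(spans, max_gap, min_count):
    """Group spans into maximal runs and keep the runs that are long enough.

    spans: (start, end) token offsets, end exclusive, assumed non-overlapping.
    max_gap: largest allowed token distance between neighbouring spans.
    min_count: smallest number of spans a block must contain.
    Returns one (start, end) span per qualifying block.
    """
    blocks = []
    run = []
    for span in sorted(spans):
        # A gap larger than max_gap closes the current run.
        if run and span[0] - run[-1][1] > max_gap:
            if len(run) >= min_count:
                blocks.append((run[0][0], run[-1][1]))
            run = []
        run.append(span)
    if len(run) >= min_count:
        blocks.append((run[0][0], run[-1][1]))
    return blocks


if __name__ == "__main__":
    review_snippets = [(0, 4), (10, 15), (30, 34), (100, 104), (110, 112)]
    print(find_blocks(review_snippets, max_gap=25, min_count=3))
```

In the example, the first three snippets form one block and the last two are dropped because a block needs at least three snippets.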
Class of Optimizations in SystemT
• Rewrite-based: rewrite algebraic operator graph
– Shared Dictionary Matching
– Shared Regular Expression Evaluation
– On-demand tokenization
Tokenization overhead is paid only once
• Cost-based: relies on novel selectivity estimation for text-specific operators
– Standard transformations, e.g., push down selections
– Restricted Span Evaluation: evaluate expensive operators on restricted regions of the document
Example: BandMember followed within 5 tokens by Instrument
• Plan A: Join of BandMember and Instrument
• Plans B and C (Restricted Span Evaluation):
– Start from BandMember, extract text to the right, identify Instrument starting within 5 tokens
– Start from Instrument, extract text to the left, identify BandMember ending within 5 tokens
Performance benefits using SystemT [Chiticariu et al. ACL’10]
• Music Band Review extraction task over a million web logs
– SystemT vs. the grammar implementation: 10 minutes vs. ~6 hours
• Named-entity extraction task over multiple document corpora
– SystemT throughput ranges from 400 – 900 KB/sec/core (depending on the size of the document)
– SystemT vs. State-of-the-Art Learning-based System [Florian et al, CoNLL’03]: ~50 times higher throughput
– SystemT vs. State-of-the-Art Grammar-based System [ANNIE, Cunningham et al, ACL’02]: ~10 - 50 times higher throughput, ~60 - 90% less memory consumption
• Revisiting the Twitter example, for keeping up with today’s tweets with 18 cores
– SystemT takes 30 minutes per day as opposed to running 24/7 for the state-of-the-art system
Runs fast! But is SystemT expressive enough to compare on quality? [Chiticariu et al. ACL’10, EMNLP’10]
• SystemT outperforms current best results on multiple benchmark datasets
– CoNLL 2003
  • F-measure between 89% and 92% for Person, Organization and Location tasks
  • Beats the state-of-the-art results consistently by up to 4%
– Enron Email
  • F-measure 85% for Person task
  • Better than the state-of-the-art result by 7%
What design choices did we make for SystemT?
Analytics Systems: SystemT (Information Extraction), SystemML (Machine Learning)
Design choices for SystemT:
• Data Model: Document-at-a-time model. Data types: Span, Tuple, Relation
• Operations: Feature extraction primitives, text-specific primitives, set-level primitives
• Language Syntax: SQL-like syntax
• Platform: Embeddable runtime deployed in a wide range of execution environments
SystemML
Status Quo of Machine Learning Algorithms
• Machine Learning algorithm implementations today
– Specialized languages for Machine Learning
  • R, Matlab
  • Execution strategy for programs is determined by user
– Low-level implementations
  • Directly implement ML algorithms on specific platforms
– Hand-tuned implementations on specialized hardware: GPU, BlueGene, etc.
• But the programmer has to handle
– Performance optimizations due to data and compute platform characteristics
– Parallelization for specific platforms
SystemML Goals
• Provide language to implement ML algorithms
• Support specific ML constructs such as cross validation, bootstrapping, ensembles as first class citizens
• Optimizations based on data and system characteristics
• Scalable operator implementations
GNMF: V ≈ U = W H
V=readMM("in/V", rows=1e8, cols=1e5);
W=readMM("in/W", rows=1e8, cols=10);
H=readMM("in/H", rows=10, cols=1e5);
max_iteration=20;
i=0;
while(i<max_iteration){
H=H*(t(W)%*%V)/(t(W)%*%W%*%H);
W=W*(V%*%t(H))/(W%*%H%*%t(H));
i=i+1;}
Diagram labels: Higher level, Optimizations, Operator Implementations, MapReduce Platform (MR1, MR2, …, MRn)
SystemML Architecture
• DML: Declarative Machine Learning Language
– Retain expressivity of current ML languages including procedural constructs like while and for loops
• High-Level Operator (HOP) Component
– Represent dataflow in DAGs of matrices and scalar operations
– Choose from alternative execution plans using algebraic rewrites and cost-based optimization
• Low-Level Operator (LOP) Component
– Low-level physical execution plan over key-value pairs
– “Piggyback” operations to reduce number of MapReduce jobs
• Runtime
– Efficient data representation and implementation of individual operations in MapReduce framework
– Control module to orchestrate MR jobs
Simple Example of how SystemML works
Example statement: A = B * (C / D)
• Language: Input DML parsed into statement blocks with typed variables
• HOP Component: HOP represents the logical flow of the program as DAGs for each statement block. HOP operates on matrices and scalars. In the example: Binary hop Divide (C, D), then Binary hop Multiply (with B)
• LOP Component: LOP represents the physical plan for the program with a DAG for each statement block. LOP operates on key-value pairs and scalars. In the example: Group lop and Binary lop Divide, then Group lop and Binary lop Multiply
• Runtime: Multiple low-level operators combined in a MapReduce job (MR Job: M1, R1)
Declarative Machine Learning Language
• Syntax borrowed from R
• What is supported
– Data Types: matrix, vector, scalar
– Statements
  • Input/Output, Assignment, Control Structures (while, for), Rand
– Expressions
  • Operators: Arithmetic, Comparative, Boolean, Matrix Multiplication
  • Built-in Functions: Linear Algebra (transpose, …), Matrix aggregation (colSum, ...), Mathematical (ln, sqrt, …)
– External Functions
– Machine Learning specific constructs: Cross validation, Ensemble learning
Categories of Optimization in SystemML
• HOP component
– Algebraic rewrites (e.g., matrix computation reordering)
– Cost-based optimization (e.g., choosing between different plans for matrix multiplication)
– Selection of physical representation of matrices (e.g., cell versus block representation)
• LOP component
– Piggybacking (packing lops that can be evaluated together in a single MapReduce job)
• Runtime
– Data representation (e.g., sparse versus dense)
– Sparsity-aware operator implementations
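A small back-of-the-envelope calculation shows why matrix computation reordering matters. The sketch below counts scalar multiplications for the two ways of evaluating t(W) %*% W %*% H from the GNMF script, using the dimensions given on the SystemML Goals slide (W is 1e8 x 10, H is 10 x 1e5). It assumes dense matrices and the textbook multiplication cost, so it illustrates the principle rather than SystemML's actual cost model; the function names are made up.

```python
def matmul_cost(rows, inner, cols):
    """Scalar multiplications for a dense (rows x inner) by (inner x cols) product."""
    return rows * inner * cols


def chain_costs(m, k, n):
    """Cost of t(W) %*% W %*% H for W of size m x k and H of size k x n.

    Returns (cost of (t(W) %*% W) %*% H, cost of t(W) %*% (W %*% H)).
    """
    left_first = matmul_cost(k, m, k) + matmul_cost(k, k, n)
    right_first = matmul_cost(m, k, n) + matmul_cost(k, m, n)
    return left_first, right_first


if __name__ == "__main__":
    left, right = chain_costs(m=10**8, k=10, n=10**5)
    print("(t(W) %*% W) %*% H :", left)
    print("t(W) %*% (W %*% H) :", right)
    print("ratio:", right / left)
```

Both orders give the same matrix, but under these assumptions the second does roughly four orders of magnitude more work, and it also materializes a 1e8 x 1e5 intermediate. An optimizer that picks the order means the script author does not have to.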
Performance Numbers
Gaussian NMF:
V = readMM ("example.GNMF.V", rows= 1000, cols=100, nnzs= 2000, format="text");
W = readMM ("example.GNMF.W", rows= 1000, cols=20, nnzs= 20000, format="text");
H = readMM ("example.GNMF.H", rows= 20, cols=100, nnzs= 2000, format="text");
max_iteration = 10
i = 0
while (i < max_iteration) {
H = H * ((t(W) %*% V) / ( (t(W) %*% W) %*% H))
W = W * ((V %*% t(H)) / ( W %*% (H %*% t(H))))
i = i + 1
}
writeMM (W,"example.GNMF.W.result", format="text");
writeMM (H,"example.GNMF.H.result", format="text");
Comparison: In SystemML | WWW 2010
Data Size: 5 billion non zeros (50m X 100k, sparsity 1x10^-3) | 4.38 billion non zeros (43.9m X 768m, sparsity 1.3x10^-7)
Time per iteration: 1.2 hours | 7 hours
Lines of Code: 11 lines of DML code | 1500 lines of Java code
Runtime Platform: 40 cores, 4 GB RAM per core | SCOPE cluster
Additional Algorithms
PageRank (G: n x n, sparsity=0.001)
G=readMM("in/G", rows=1e6, cols=1e6);
p=readMM("in/p", rows=1e6, cols=1);
e=readMM("in/e", rows=1e6, cols=1);
ut=readMM("in/ut", rows=1, cols=1e6);
alpha=0.85;
max_iteration=20;
i=0;
while(i<max_iteration){
p=alpha*(G%*%p)+(1-alpha)*(e%*%ut%*%p);
i=i+1;}
writeMM(p, "out/p");
[Figure: DML PageRank, execution time (sec) vs. #rows and #columns in G (thousand)]
Sparse Linear Regression (V: d x 100000, sparsity=0.001)
V=readMM("in/V", rows=1e8, cols=1e5);
b=readMM("in/b", rows=1e8, cols=1);
lambda = 1e-6;
r=-b ;
p=-r ;
norm_r2=sum(r*r);
max_iteration=20;
i=0;
while(i<max_iteration){
q=((t(V) %*% (V %*% p)) + lambda*p);
alpha= norm_r2/(t(p)%*%q);
w=w+alpha*p;
old_norm_r2=norm_r2;
r=r+alpha*q;
norm_r2=sum(r*r);
beta=norm_r2/old_norm_r2;
p=-r+beta*p;
i=i+1;}
writeMM(w, "out/w");
[Figure: DML Linear Regression, execution time (sec) vs. #rows in V (million)]
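The linear regression script on this slide follows the conjugate gradient pattern for solving (t(V) %*% V + lambda*I) w = b. A minimal plain-Python sketch of that loop is given below, for small dense inputs only. It makes a few assumptions that the slide leaves open: `w` starts as a zero vector, `b` is computed as t(V) %*% y from a target vector `y`, the squared residual norm is recomputed after each residual update, and the loop stops early once the residual is tiny. All function names are made up for illustration.

```python
def mat_vec(m, x):
    """m %*% x for a dense matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def t_mat_vec(m, x):
    """t(m) %*% x without building the transpose."""
    return [sum(row[j] * xi for row, xi in zip(m, x)) for j in range(len(m[0]))]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def ridge_cg(v, y, lam=1e-6, max_iteration=20, tol=1e-18):
    """Conjugate gradient for (t(V) V + lam I) w = t(V) y."""
    b = t_mat_vec(v, y)             # assumption: b = t(V) %*% y
    w = [0.0] * len(b)              # assumption: w starts at zero
    r = [-bi for bi in b]           # r = -b
    p = [-ri for ri in r]           # p = -r
    norm_r2 = dot(r, r)
    for _ in range(max_iteration):
        if norm_r2 <= tol:          # converged; avoids dividing by ~0
            break
        q = [a + lam * pi for a, pi in zip(t_mat_vec(v, mat_vec(v, p)), p)]
        alpha = norm_r2 / dot(p, q)
        w = [wi + alpha * pi for wi, pi in zip(w, p)]
        old_norm_r2 = norm_r2
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        norm_r2 = dot(r, r)
        beta = norm_r2 / old_norm_r2
        p = [-ri + beta * pi for ri, pi in zip(r, p)]
    return w


if __name__ == "__main__":
    V = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    y = [1.0, 2.0, 3.0]
    print(ridge_cg(V, y, lam=0.0))
```

Note that the loop never forms t(V) %*% V explicitly; it only needs matrix-vector products, which is what makes the same script practical on large sparse inputs.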
What design choices did we make for SystemML?
Analytics Systems: SystemT (Information Extraction), SystemML (Machine Learning)
Design Choices:
• Data Model
– SystemT: Document-at-a-time model. Data types: Span, Tuple, Relation
– SystemML: Data types: Matrix, Vector, Scalar
• Operations
– SystemT: Feature extraction primitives, text-specific primitives, set-level primitives
– SystemML: Procedural constructs, e.g., while, for. Linear Algebra operations. External Functions. Machine Learning specific constructs, e.g., Cross Validation, Ensemble Learning
• Language Syntax
– SystemT: SQL-like syntax
– SystemML: R-like syntax
• Platform
– SystemT: Embeddable runtime deployed in a wide range of execution environments
– SystemML: MapReduce Runtime
Summary
Lessons Learned
– SystemT
  • Ships with eight IBM products
  • To date have not encountered a request that is not expressible in AQL
– SystemML
  • Ships with IBM BigInsights August beta this year
  • Declarative is the goal, but procedural constructs are needed to express Machine Learning algorithms
  • Users naturally gravitate to procedural constructs. Limiting usage of such constructs to only when required to specify “what needs to be done” may need a lot of training
– SystemT
  • Choice of SQL-like syntax and Eclipse-based tooling quickly enabled hundreds of users with varied backgrounds
  • But traditional NLP-trainees prompted us to provide a layer on top of AQL with grammar-like syntax
  • Business users demand even simpler and more usable tooling
– SystemML
  • Early days, but multiple users inside IBM and almost all are previous R / Matlab users
  • Familiar R syntax gets ML users up and running almost immediately
– SystemT
  • Document-at-a-time model and all in-memory optimizations
  • Demonstrates that an order-of-magnitude throughput improvement can be obtained
  • Hardware acceleration further speeds up the execution
– SystemML
  • Computation on a large-scale distributed platform
  • Initial experiences reinforce the argument “Query optimizers can beat all but the best programmers”
Tooling Research for the Development Life-Cycle
Development: Develop, Test, Analyze
Maintenance: Deploy, Refine, Test
Task Analysis
[ACL’11,12,13,CHI’13]
• Concordance Viewer
• Active labeling
• Labeling tool
• Extraction plan
• Track provenance [VLDB’10]
• Contextual clue discovery[CIKM’11]
• Regex learning [EMNLP’08]
• Suggest rule changes [VLDB’10]
• Rule induction [EMNLP’12]
• Dictionary refinement [SIGMOD’13]
• Rule learning
• NE Interface [EMNLP’10]
• Tagger UI [SIGMOD’07]
Eclipse Tools Overview
Ease of Programming
• AQL Editor: syntax highlighting, auto-complete, hyperlink navigation
• Result Viewer: visualize/compare/evaluate
• Explain: show how each result was generated
• Workflow UI: end-to-end development wizard
Automatic Discovery
• Regex Generator: generate regular expressions from examples
• Pattern Discovery: identify patterns in the data
Performance Tuning
• Profiler: identify performance bottlenecks to be hand tuned
Screenshots: AQL Editor, Explain, Pattern Discovery, Result Viewer, Regex Learner
Web Tools Overview
Ease of Programming
• Canvas: visual construction of extractors; customization of existing extractors
• Result Viewer: visualize/compare/evaluate
Ease of Sharing
• Concept catalog: share concepts
• Project: share extractor development
Even for non-programmers
Don has the last word … Maybe ..
Don Chamberlin
We set out to help non-programmers interact with databases to open up access to data to a whole new class of people who could do things that were never possible before. The problem that we didn't think we were working on at all was how to embed query languages into host languages, or how to make a language that would serve as an interchange medium between different systems - those are the ways in which SQL ultimately turned out to be very successful, …
SQL Reunion, 1995
Don observed that the success of SQL was due to the language serving as an interchange medium between systems. In contrast, declarative systems for analytics may indeed be successful for the original purpose that SQL was intended for: opening up access to analytics to a whole new class of people.
Thank You!