© 2015 IBM Corporation
Streaming Meetup
30 August 2016
Roger Rea, IBM Streams Offering Manager
Streaming Analytics and Python
Streaming overview and contenders
Programming Languages: SPL and Python
Why are we here?
Pizza?
Reese’s Peanut butter cups?
Two fast growing trends are coming together - streaming analytics and Python. It's like peanut butter
and chocolate!
Streaming analytics is a superset of complex event processing, with clustered runtimes to support
greater volume of events, ability to analyze unstructured data and more expressive programming
paradigms. And very low latency to enable real time analytic processing.
Python is a widely used high-level, general-purpose, interpreted, dynamic programming language. Its
design philosophy emphasizes code readability, and its syntax allows programmers to express
concepts in fewer lines of code than possible in languages such as C++ or Java. The language provides
constructs intended to enable clear programs on both a small and large scale.
This meet up will provide an overview and comparison of many different contenders in the fast growing
streaming analytics space, then show a demo of IBM Streams technology allowing programs written
completely in Python to call Streams libraries, and then deploy those apps to the Streams runtime.
Speaker Biography
Roger Rea leads the cross functional team for marketing, sales, development,
services, product management and support for IBM Streams within Analytics
Platform Services. Prior to this assignment, Roger held a variety of sales,
technical, educational, marketing and management jobs at IBM, Skill
Dynamics and Tivoli Systems.
Roger earned a Bachelor of Science in Mathematics and Computer Science,
cum laude, from the University of California at Los Angeles (UCLA). He has
also received a Masters' Certificate in Project Management from George
Washington University.
Roger lives in Cary, North Carolina, USA with his wife and two children and
enjoys skiing, kayaking, reading, cooking and singing in his church choir.
Roger Rea, IBM Streams Senior Offering Manager
[email protected], 1-919-345-7386
[Map: England and Wales, showing London, Birmingham, and the Rea River]
Rhea and Rea evolved in Britain from the ancient
Welsh word “Rhe” meaning “rapid stream.”
Audience Biography
Developers?
What languages?
•Python, Java, C/C++/C#, SPL, Ruby, ??
Data Scientists?
What tools?
•R, SPSS, SAS, WEKA, MATLAB, ?
What is Streaming Analytics?
Software that can filter, aggregate, enrich,
and analyze a high throughput of data from
multiple, disparate live data sources and in
any data format to identify simple and
complex patterns to provide applications with
context to detect opportune situations,
automate immediate actions, and
dynamically adapt.
Time is ripe for a new era of computing
Emerging trends create need for new languages
Scientific programming Fortran
Business programming Cobol
Systems programming at higher level C
Increased productivity C++
Database programming SQL
Web programming Java
Data scientist Python
Streaming data and multicore architectures
Streams Processing Language
Stored data and multicore architectures
Hadoop, Map-Reduce, Spark
Who delivers Streaming Analytics?
The Forrester Wave™: Big Data Streaming
Analytics Platforms, Q1 2016
Market Report Paper by Bloor, Author Ronnie Beggs
Publish date June 2016 Streaming analytics 2016
The Forrester Wave™:
Big Data Streaming Analytics, Q1 2016
Stream Computing
[Feature comparison chart; the per-platform ratings are not recoverable from this transcript]
Platforms compared: Esper, IBM Streams, Storm, Flink, Spark Streaming, Dataflow
Features compared: Open Source, Extensible platform, Managed Service, Batch & Streaming, Command Line i/face, Web & JMX mgmt, At Least Once, Exactly once, State, Windows, Back pressure, Machine Learning, Model scoring, Video/Image, Geospatial, Text Analytics, Visual development, Automated HA, Enterprise adapters, Open source adapters
IBM Streams Overview
Integrated Development Environment: Development and Management
Analytic Toolkits & Adapters: Functional and Optimized
Scale-Out Runtime: Flexibility and Scalability
Cloud and on premise available for flexible deployment
Streams next release 3Q16
• Apache Edgent support:
Java based Streaming analytics targeted at Internet of Things
market to deliver analytics at the edge
• Streams Rules:
Rules compiler to enable ODM Rules to run natively on Streams
for superior performance and low latency
• Python development:
Python developers can easily call APIs to Streams libraries which
are then compiled and deployed to Streams
Technical Foundation:
1. Speech to Text toolkit
2. Cybersecurity Toolkit enhancements
3. Submission time fusion of operators
4. Asynch non-blocking checkpointing
5. Streams consistent region using RDMA
Information regarding potential future products is intended to outline our
general product direction and it should not be relied on in making a
purchasing decision. The information mentioned regarding potential future
products is not a commitment, promise, or legal obligation to deliver any
material, code or functionality. Information about potential future products
may not be incorporated into any contract. The development, release, and
timing of any future features or functionality described for our products
remains at our sole discretion.
IBM Streams: Overview of our best of breed programming model
Streams Processing Language (SPL)
Input (Source)
Process (Operators)
Output (Sink)
Platform optimized compilation
[Diagram: an application graph fed by Meters and Weather Data. Operator labels include Usage Model, Company Filter, Usage Contract, Text Extract, Degree History, Compare History, Temp Action, Store History, Season Adjust, and Daily Adjust]
Processing stages: Filter, Fuse, Cleanse, Classify, Analyze, Model, Act, Persist
Operators:
- SPL or custom with Java, C++, and now, Python
- Compiled into processing elements (PEs) for deployment
IBM Streams at a glance
Communications Data Sources:
• TCP/IP, UDP/IP, HTTP, FTP, RSS
• Messaging Toolkit (Kafka, XMS, IBM MQ, Apache ActiveMQ, RabbitMQ, MQTT, MQ Low Latency Messaging)
• IBM DataStage
• IBM Data Replication
IBM Streams Scale-out Runtime
Functions:
• Filter
• Enrich
• Normalize
• Windowed Aggregations
• Machine Learning
• Scoring (SPSS, R, MLlib)
• CEP & Pattern Matching
• Geospatial
• Video/Image
• Text Analytics (AQL)
• Speech to Text
• IBM ODM Rules
Hadoop: HDFS, GPFS, Hive, HBase, BigSQL, Parquet, Thrift, Avro
Data Warehouse (RDBMS): IBM DB2, IBM DB2 Parallel writer, IBM Informix, IBM BigInsights BigSQL, IBM Netezza, IBM Netezza NZLoad, solidDB, Oracle, Microsoft SQL Server, MySQL, Teradata, Aster, HP Vertica
NoSQL:
• Key Value Stores (Memcached, Redis, Redis-Cluster, Aerospike)
• Column Oriented Stores (Cassandra, HBase)
• Document Oriented Stores (IBM Cloudant, Mongo, Couchbase)
IBM Streams: A pioneering platform rooted in real-time analytics since 2003
A technology hardened in the IBM Research labs for the first six years in collaboration with a quality conscious U.S. Government agency. It has been a fully supported IBM product since 2009 (v1.0 to v4.1 as of 2015).
• Mining in Microseconds
• Statistics
• Predictive
• Advanced Mathematical Models (IBM Research)
• Natural Language Processing
• Geospatial
• Acoustic (IBM Research and Open Source)
• Entities & Relationships
• Image & Video (Open Source)
Development Environment
Integrated Development
Environment
Development and Management
Streams Processing Language
Visual Composition Tools
IBM Streams: Development time terminology
Operator The fundamental building block of the Streams Processing
Language
Operators process data from streams and may produce new streams
Stream An infinite sequence of structured tuples
Can be consumed by operators on a tuple-by-tuple basis or through the definition of a window
Tuple A structured list of attributes and their types. Each tuple on
a stream has the form dictated by its stream type
Stream type Specification of the name and data type of each attribute in
the tuple
Window A finite, sequential group of tuples
Based on count, time, attribute value,or punctuation marks
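[Figure: a Streams application. Operators are connected by streams, and each stream carries tuples of one stream type. One stream carries tuples such as directory:"/img", filename:"farm"; directory:"/img", filename:"bird"; directory:"/opt", filename:"java"; directory:"/img", filename:"cat". Another carries tuples such as height:640, width:480, data: and height:1280, width:1024, data:]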
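The terms above can be sketched in plain Python. This is only an illustration of the vocabulary, not Streams code: the FileInfo type and the count_windows helper are hypothetical names, a generator stands in for a stream, and a bounded deque stands in for a count-based sliding window.

```python
from collections import deque
from typing import Iterable, Iterator, NamedTuple, Tuple


class FileInfo(NamedTuple):
    """Plays the role of a stream type: attribute names and their types."""
    directory: str
    filename: str


def count_windows(stream: Iterable[FileInfo], size: int) -> Iterator[Tuple[FileInfo, ...]]:
    """Yield a count-based sliding window after every incoming tuple."""
    window = deque(maxlen=size)  # the oldest tuple is evicted automatically
    for tup in stream:           # the stream is consumed tuple by tuple
        window.append(tup)
        yield tuple(window)


# A finite list stands in for what would be an unbounded stream.
stream = [
    FileInfo("/img", "farm"),
    FileInfo("/img", "bird"),
    FileInfo("/opt", "java"),
    FileInfo("/img", "cat"),
]
windows = list(count_windows(stream, size=2))
```

Each element of windows holds at most the two most recent tuples, which is the sense in which a window is a finite, sequential group taken from an infinite sequence.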
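Anatomy of an Operator Invocation
Operators share a common structure
In the syntax, stream-type, stream-name, input-stream, logic, windowspec, parameters, output, and configuration are sections to fill in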
Reading an operator invocation
Declare a stream stream-name
With attributes from stream-type
that is produced by MyOperator
from the input(s) input-stream
MyOperator behavior defined by
logic, parameters, windowspec, and configuration; output
attribute assignments are specified in output
For the example:
Declare the stream Sale with the attribute item, which is a raw
(ASCII) string
Join the Bid and Ask streams with
sliding windows of 30 seconds on Bid, and 50 tuples of Ask
When items are equal, and Bid price is greater than or equal to
Ask price
Output the item value on the Sale stream
Syntax:
stream<stream-type> stream-name
    = MyOperator(input-stream; …)
{
    logic   logic ;
    window  windowspec ;
    param   parameters ;
    output  output ;
    config  configuration ;
}
Example:
stream<rstring item> Sale = Join(Bid; Ask)
{
    window  Bid: sliding, time(30);
            Ask: sliding, count(50);
    param   match : Bid.item == Ask.item && Bid.price >= Ask.price;
    output  Sale: item = Bid.item;
}
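For readers more at home in Python, the behavior of the Join example can be approximated as follows. This is a simplified sketch, not how Streams implements the operator: the BidAskJoin class and its method names are hypothetical, timestamps are passed in explicitly, and the exact eviction and ordering rules of the real Join operator may differ.

```python
from collections import deque


class BidAskJoin:
    """Toy model of: Join(Bid; Ask) with Bid sliding time(30), Ask sliding count(50)."""

    def __init__(self, bid_seconds=30.0, ask_count=50):
        self.bid_seconds = bid_seconds
        self.bids = deque()                  # (timestamp, item, price)
        self.asks = deque(maxlen=ask_count)  # (item, price); oldest evicted by count

    def _evict_old_bids(self, now):
        # Drop bids that have fallen out of the time-based window.
        while self.bids and now - self.bids[0][0] > self.bid_seconds:
            self.bids.popleft()

    def on_bid(self, now, item, price):
        """Match a new Bid against the Ask window, then store it."""
        self._evict_old_bids(now)
        sales = [item for ask_item, ask_price in self.asks
                 if item == ask_item and price >= ask_price]
        self.bids.append((now, item, price))
        return sales

    def on_ask(self, now, item, price):
        """Match a new Ask against the Bid window, then store it."""
        self._evict_old_bids(now)
        sales = [bid_item for _, bid_item, bid_price in self.bids
                 if bid_item == item and bid_price >= price]
        self.asks.append((item, price))
        return sales


join = BidAskJoin()
first = join.on_bid(0.0, "ibm", 10.0)    # no asks yet, so no sale
second = join.on_ask(1.0, "ibm", 9.0)    # bid 10.0 >= ask 9.0 within 30 s
third = join.on_ask(40.0, "ibm", 9.0)    # the bid has aged out of the window
```

Each returned list corresponds to the item values that would be emitted on the Sale stream for that arrival.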
IBM Streams: A rich set of data types to code powerful analytics and optimize performance
(any)
    (primitive)
        boolean
        enum
        (numeric)
            (integral)
                (signed): int8, int16, int32, int64
                (unsigned): uint8, uint16, uint32, uint64
            (floatingpoint)
                (float): float32, float64, float128
                (decimal): decimal32, decimal64, decimal128
            (complex): complex32, complex64, complex128
        timestamp
        (string): rstring, ustring
        blob
        xml
    (composite)
        (collection): list, set, map
        tuple
User-defined types:
type Integers = list<int32>;
type MySchema = rstring s, Integers ls;
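As a rough point of comparison, the two user-defined types above might be written in Python like this. The mapping is an assumption made for illustration only (Python's int is unbounded rather than 32-bit, and str is not the same as rstring), and it says nothing about how Streams converts between SPL and Python types.

```python
from typing import List, NamedTuple

# Loose analogue of: type Integers = list<int32>;
Integers = List[int]


class MySchema(NamedTuple):
    """Loose analogue of: type MySchema = rstring s, Integers ls;"""
    s: str
    ls: Integers


example = MySchema(s="reading", ls=[1, 2, 3])
```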
IBM Streams: Runtime terminology
Application
– Data flow graph of operator instances connected to each other via stream connections
Operator
– Reusable stream analytic
– Input ports: receive data / Output ports: produce data
– Source: No input ports / Sink: No output ports
Operator Instance
– A specific instantiation of an operator
Stream
– Continuous series of tuples, generated by an operator instance’s output port
Stream connection
– A stream connected to a specific operator instance input port
Processing Element (PE)
– A runtime process that executes a set of operator instances
Job
– An application instance running on a set of hosts
Example:
(stream<Type> A) as O1 = MySrc() {}
() as O2 = MySink(A) {}
() as O3 = MySink(A) {}
[Figure: operator instance O1 (MySrc) produces stream A, which has stream connections to operator instances O2 and O3 (both MySink)]
IBM Streams: From operators to running jobs
Streams application graph:
A directed, possibly cyclic, graph
A collection of operators
Connected by streams
Each complete application is a potentially deployable job
Jobs are deployed to a Streams runtime environment, known as a Streams
Instance (or simply, an instance)
An instance can include a single processing node (hardware)
Or multiple processing nodes
[Figure: a Streams instance running an application graph of Src, OP, and Sink operators, hosted either on a single hardware node or spread across multiple nodes]
Linear Road Benchmark: Streams application graph for 1 expressway
Graph components:
• Data Feeder (TCP or Kafka or File)
• Position report and accident analytics for East and West traffic (Type 0 and 1)
• Account balance analytics (Type 2)
• Daily expenditure analytics (Type 3)
• Result notifications
End to end average throughput: 1.87K events per second (20.2 Million total events in 3 hours)
Response type      Response time below 1 second   Response time at 1 second
Type 0 responses   98%                            2%
Type 1 responses   97.8%                          2.2%
Type 2 responses   98.5%                          1.5%
Type 3 responses   99.9%                          0.1%
(Linear Road specification states 1 to 5 seconds as an acceptable response time)
Application components        # of CPU cores
Data feeder                   1
Event receiver and router     1
Type 0 and Type 1 analytics   1
Type 2 analytics              1
Type 3 analytics              1
Type 0 result writer          1
Type 1 result writer          1
Memory Usage: 1.8GB
CPU Utilization: 2%
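The quoted throughput follows directly from the event count and the run length. A quick check, using only the two figures stated on the slide:

```python
# Figures taken from the slide: 20.2 million events over a 3-hour run.
total_events = 20.2e6
duration_seconds = 3 * 60 * 60

events_per_second = total_events / duration_seconds
print(f"{events_per_second / 1000:.2f}K events per second")  # prints 1.87K events per second
```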
Streams results
L-Rating 50 on one Azure node, 200 on 4
Azure nodes
1 node, 16 cores, nearly 1B events
4 nodes, 64 cores, nearly 4B events
Linear scalability
Handles bursty traffic
99% of responses sub-second
# of x-ways   # of cars      Entries         Memory     CPU
1             278973         19.2 Million    2.2 GB     2%
2             558726         38.5 Million    4.5 GB     4%
5             1.3 Million    96.3 Million    10.9 GB    7%
10            2.7 Million    192.5 Million   22.0 GB    11%
15            4.1 Million    289.7 Million   33.0 GB    16%
20            5.6 Million    385.2 Million   43.5 GB    20%
25            6.9 Million    482.0 Million   54.5 GB    26%
50            14.0 Million   963.1 Million   109.0 GB   31%
100           27.6 Million   1.9 Billion     220 GB     22%
150           41.5 Million   2.8 Billion     330 GB     33%
200           55.0 Million   3.8 Billion     440 GB     45%
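One way to read the "linear scalability" claim against this table is to divide memory by the number of expressways and see whether the ratio stays flat. The sketch below does that with the memory column only; the variable names are illustrative, and the check says nothing about throughput or CPU.

```python
# (number of expressways, memory in GB), copied from the table above.
memory_by_xways = [
    (1, 2.2), (2, 4.5), (5, 10.9), (10, 22.0), (15, 33.0), (20, 43.5),
    (25, 54.5), (50, 109.0), (100, 220.0), (150, 330.0), (200, 440.0),
]

gb_per_xway = [memory / xways for xways, memory in memory_by_xways]

for (xways, _), ratio in zip(memory_by_xways, gb_per_xway):
    print(f"{xways:>3} x-ways: {ratio:.2f} GB per x-way")
```

A roughly constant ratio across all rows is what linear growth in memory would look like.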
[Charts: average throughput (K events/second) versus number of expressways, one chart for 1 to 50 expressways (y-axis 0 to 100) and one for 50 to 200 expressways (y-axis 0 to 400)]
Python History
Conceived in the late 1980s
Implementation began by December 1989
Multi-paradigm programming language: object-oriented programming and
structured programming are fully supported
Dynamic typing and late binding
Core philosophy
Beautiful is better than ugly
Explicit is better than implicit
Simple is better than complex
Complex is better than complicated
Readability counts
Guido van Rossum,
the creator of Python
Explore Python
Indentation matters
Variables
Numbers (integers and floats), Strings, Lists, Tuples, Dictionaries
Functions
Looping:
for <iteration variable> in <list>:
• <block of statements>
Conditional execution:
if <condition>:
• <block of statements>
Classes
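A small sample that touches each item on this list may help. All names here (shout, Greeter, and the variables) are made up for illustration.

```python
# Variables: numbers, a string, a list, a tuple, and a dictionary.
count = 3                               # integer
ratio = 0.5                             # float
greeting = "hello"                      # string
languages = ["Python", "Java", "SPL"]   # list
size = (640, 480)                       # tuple
lengths = {}                            # dictionary, filled in below


# Function: the indented block is the function body.
def shout(text):
    return text.upper() + "!"


# Looping and conditional execution: indentation marks each block.
for language in languages:
    if len(language) > 3:
        lengths[language] = len(language)


# Class: bundles data (greeting) with behavior (greet).
class Greeter:
    def __init__(self, greeting):
        self.greeting = greeting

    def greet(self, name):
        return shout(f"{self.greeting} {name}")


message = Greeter(greeting).greet("SPL")
```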
Explore Python
As of August 2016, the Python Package Index, the official repository of
third-party software for Python, contains over 86,000 packages offering a
wide range of functionality, including:
graphical user interfaces, web frameworks, multimedia, databases,
networking and communications
test frameworks, automation and web scraping, documentation tools,
system administration
scientific computing, text processing, image processing
Notebooks
Streams & Python together: 2 capabilities
For the Python developer: Code all in Python, call Streams toolkits, run on Streams
Hello World:
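# mymodule supplies hello_world, the callable that produces the source tuples
import mymodule
from streamsx.topology.topology import Topology
import streamsx.topology.context
# Declare the topology, add a source stream, and print each tuple
topo = Topology("HelloWorld")
hw = topo.source(mymodule.hello_world)
hw.sink(print)
# Run the application as a standalone Streams executable
streamsx.topology.context.submit("STANDALONE", topo.graph)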
For more, visit:
http://ibmstreams.github.io/streamsx.topology/doc/spldoc/html/tk$com.ibm.streamsx.topology/ns$com.ibm.streamsx.topology.python$1.html
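The Hello World listing imports mymodule but the slide does not show it. A minimal sketch of what such a module could contain is below, assuming the source callable simply returns an iterable whose items become the stream's tuples; the actual demo module may differ.

```python
# mymodule.py (hypothetical contents; not shown on the slide)

def hello_world():
    """Return the items that the source will emit, one tuple per item."""
    return ["Hello", "World!"]


# Outside Streams, the same callable can be exercised directly,
# which mirrors what hw.sink(print) does with each tuple.
for item in hello_world():
    print(item)
```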
Streams & Python together: 2 capabilities
For the SPL developer: Decorate Python functions inline in SPL
# Import the SPL decorators
from streamsx.spl import spl

# Defines the SPL namespace for any functions in this module
# Multiple modules can map to the same namespace
def splNamespace():
    return "com.ibm.streamsx.topology.pysamples.mail"

@spl.pipe
def SimpleFilter(a, b):
    """Filter tuples only allowing output if the first attribute is less than
    the second. Returns the sum of the first two attributes."""
    if (a < b):
        # The trailing comma makes the return value a one-element Python tuple
        return a+b,

For more, visit:
http://ibmstreams.github.io/streamsx.topology/doc/spldoc/html/tk$com.ibm.streamsx.topology/ns$com.ibm.streamsx.topology.python$6.html
Steps to try it out
1. Download Streams Quick Start Edition: ibm.com/streams
2. Clone the streamsx.topology project: github.com/IBMStreams/streamsx.topology
   • First clone, then hit 'clone or download' to download to your machine
3. Extract to streamsx.topology in streamsadmin of Streams Quick Start Edition
4. cd to streamsx.topology directory and type 'ant'
5. cd to com.ibm.streamsx.topology/opt/python/packages
   • Note the current directory path and type 'export PYTHONPATH=$PYTHONPATH:<directory path>', for example:
     export PYTHONPATH=$PYTHONPATH:/home/streamsadmin/com.ibm.streamsx.topology/com.ibm.streamsx.topology/opt/python/packages
   • Also, add this to your .bashrc profile: run 'gedit ~/.bashrc', then add the export statement as the last line of the file and save
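After setting PYTHONPATH it can be worth confirming that Python actually sees the streamsx package before moving on. The check below is a suggested sanity test, not part of the original steps; the helper names are hypothetical and it uses only the standard library.

```python
import importlib.util
import os


def python_path_entries(env=None):
    """Return the non-empty directories listed in PYTHONPATH."""
    env = os.environ if env is None else env
    return [p for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]


def streamsx_available():
    """True if the streamsx package can be found by the import system."""
    return importlib.util.find_spec("streamsx") is not None


print("PYTHONPATH entries:", python_path_entries())
print("streamsx importable:", streamsx_available())
```

If the second line reports False, the export statement most likely points at the wrong directory or was not applied to the current shell.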
Steps to try it out (continued)
6. Download Anaconda for Jupyter notebook: continuum.io/downloads
7. Install Anaconda
8. In Streams Quick Start Edition, open Streams Domain Manager
   • Ensure the domain is running!
9. In Streams Quick Start Edition Domain Manager, start the Streams Console
10. In Streams Console, ensure Instance is running
11. In Streams Quick Start Edition terminal window, type
• 'pip install git+https://github.com/pybrain/pybrain.git'
• This is a dependency for the NetDemo demo
12. Download and extract demos.zip into your streamsadmin directory
13. In the terminal, cd to the demos folder
14. Type 'jupyter notebook'
15. A browser should pop up. In it, change to the NetDemo directory
Steps to try it out (continued)
16. Click on the NetDemo.ipynb
17. The NetDemo demo does three things:
a) Creates a dataset (engine temp vs. probability of failure). This is the first line.
b) Creates a model to predict a probability of failure given an engine temp.
c) Creates a streaming application using the Python API which uses the model.
18. Put cursor in first cell, click on ‘run cell’
19. Repeat with next cell to build model
20. Repeat with last cell to run on Streams
1. This compiles Python to SPL
2. Then compiles SPL to C/C++
3. Then creates the .sab executable
4. Then deploys to the Streams runtime
5. Returning code to the Jupyter notebook
21. When in doubt, go to kernel -> restart and clear output.
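The three stages of the demo (dataset, model, streaming scoring) can be imitated in a few lines of standard-library Python. This is not the NetDemo code and does not use pybrain or Streams: the logistic curve, its parameters, and the interpolation "model" are all assumptions chosen to keep the sketch self-contained.

```python
import bisect
import math


def make_dataset(low=50.0, high=150.0, steps=21):
    """Stage a: hypothetical (engine temp, failure probability) pairs."""
    temps = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    return [(t, 1.0 / (1.0 + math.exp(-(t - 100.0) / 10.0))) for t in temps]


def build_model(dataset):
    """Stage b: a stand-in model that interpolates between data points."""
    temps = [t for t, _ in dataset]
    probs = [p for _, p in dataset]

    def predict(temp):
        if temp <= temps[0]:
            return probs[0]
        if temp >= temps[-1]:
            return probs[-1]
        i = bisect.bisect_right(temps, temp)
        t0, t1, p0, p1 = temps[i - 1], temps[i], probs[i - 1], probs[i]
        return p0 + (p1 - p0) * (temp - t0) / (t1 - t0)

    return predict


def score_stream(readings, model):
    """Stage c: score each incoming reading as it arrives."""
    for temp in readings:
        yield temp, model(temp)


model = build_model(make_dataset())
scored = list(score_stream([60.0, 100.0, 140.0], model))
```

In the real demo the third stage runs on the Streams runtime rather than in a local generator.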
Some observations
I’m not a programmer – that was probably obvious!
Even when I was, it was procedural, not OO, so I found Python confusing
Some confusing terminology
‘tuple’ used in both Streams and Python
SPL Map == Python Dict {}
Python idiosyncrasies – indents, parentheses, square and curly brackets
© 2015 IBM Corporation32
Additional resources
Visit:
ibm.com/streams
github.com/Walmart
github.com/IBMStreams/benchmarks
Legal Disclaimer
• © IBM Corporation 2015. All Rights Reserved.
• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is
provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall
not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of,
creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this
presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way.
Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
• If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete:
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending
upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no
assurance can be given that an individual user will achieve results similar to those stated here.
• If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete:
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance
characteristics may vary by customer.
• Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM Lotus® Sametime® Unyte™).
Subsequent references can drop “IBM” but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server). Please refer to http://www.ibm.com/legal/copytrade.shtml for
guidance on which trademarks require the ® or ™ symbol. Do not use abbreviations for IBM product names in your presentation. All product names must be used as adjectives rather than nouns. Please list all
of the trademarks that you use in your presentation as follows; delete any not included in your presentation. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2,
PartnerWorld and Lotusphere are trademarks of International Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other
countries, or both.
• If you reference Adobe® in the text, please mark the first use and include the following; otherwise delete:
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.
• If you reference Java™ in the text, please mark the first use and include the following; otherwise delete:
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
• If you reference Microsoft® and/or Windows® in the text, please mark the first use and include the following, as applicable; otherwise delete:
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
• If you reference Intel® and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete:
Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
• If you reference UNIX® in the text, please mark the first use and include the following; otherwise delete:
UNIX is a registered trademark of The Open Group in the United States and other countries.
• If you reference Linux® in your presentation, please mark the first use and include the following; otherwise delete:
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.
• If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations, Zeta Bank, Acme) please update
and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration purposes only.
Realtime ECG Monitoring with Python and Streams
Real-time analytics using Python and IBM Streams
Demo consists of two applications:
PhysionetIngestService – ingest ECG data from physionet.org – Data is published using Publish operator for
downstream Analytics
ECGPatientDataViz
• Application written in Python
• Ingest data from Physionet -> R Peak Detection using Biosppy -> Print
• Sets up two views – one for visualizing raw ECG data, one for R-Peak detection
Realtime ECG Monitoring with Python and Streams
Python Application in Jupyter Notebook
Real-time ECG visualization
Demonstrates how we can integrate with Python Visualization Library using View
Python Bokeh Visualization Library (http://bokeh.pydata.org/en/latest/)
Realtime ECG Monitoring with Python and Streams
Real-time R-Peak Detection in ECG Data
Real-time Poincaré plot shows Heart Rate Variability – the more variability, the healthier the heart is.
Demonstrates how to use an existing Python analytics library in real-time analytics (http://biosppy.readthedocs.io/en/stable/#)
Info about Poincaré plot (https://en.wikipedia.org/wiki/Poincaré_plot)
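A Poincaré plot is built from successive R-R intervals: each interval is plotted against the one that follows it. The sketch below shows that step only, assuming R-peak times have already been detected (the demo uses Biosppy for that); the function names and the sample times are hypothetical.

```python
def rr_intervals(r_peak_times_ms):
    """Time between consecutive R peaks, in milliseconds."""
    return [later - earlier
            for earlier, later in zip(r_peak_times_ms, r_peak_times_ms[1:])]


def poincare_points(rr):
    """Pair each R-R interval with the next one: (RR[n], RR[n+1])."""
    return list(zip(rr, rr[1:]))


# Hypothetical R-peak times in milliseconds.
peaks = [0, 800, 1700, 2500]
rr = rr_intervals(peaks)
points = poincare_points(rr)
```

Points that cluster tightly on the diagonal indicate little beat-to-beat variation, while a wider scatter indicates more heart rate variability.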