ParStream - Big Data for Business Users
DESCRIPTION
ParStream CTO Joerg Bienert gave a presentation on February 25, 2014 about Big Data for Business Users. He talked about several use cases of current ParStream customers and about ParStream's technology itself.
TRANSCRIPT
Real-time in Big Data
Big Data
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.”
Eric Schmidt, Ex Google CEO
“85% of respondents say the issue is not about volume but the ability to analyze and act on data in real time”
Cap Gemini Study on Big Data 2012
Real Time
“It’s About Fast (not just Big) Data”
Karl Keirstead, BMO Capital Markets 2013
Fast Data
Real-time on Big Data becomes essential for survival of businesses
Realtime
Campaign steering
Trading risk analytics
Network monitoring
Algorithmic decisions
Algo trading
Programmatic ad-serving
Recommendation engine
Fraud prevention
Interactive Analytics
A/B-Testing
App analytics
M2M
Big Data
Network Data
Point of Sale Data
Shopping Cart
Web Logs
Sensors
Twitter
Stock Data
Locations
Financial TX
Logistics
Car Data
Real Time?
Immediate Answers & Immediate Availability
[Chart: use cases positioned by availability (import frequency) and answers (response time), on a scale from Lag Time to Real-Time]
Availability (import frequency), from Batch Import to Continuous Import: Weekly, Daily, Hourly, Every minute, Every second
Answers (response time): 1 h, 10 min, 1 min, 10 sec, 1 sec, 10..100 ms, < 1..10 ms
Zones: Post-mortem Analytics, Online Investigation, Interactive Analytics, Automatic response systems
Use cases:
● Smart Grids
● Fraud detection
● Ad-Serving
● Guided Shopping
● Campaign Control
● Web-Analytics
● Re-Targeting
● Offer-Caches
● Trend-Spotting
● Customer churn rate reduction
● Revenue assurance
● Recommendation / promotional items
● Application monitoring
● Trading analytics
● Prepaid-accounts
● Customer account analytics
● Investment risk analytics
● Geo-spatial analytics
● Geo-Steering
● SEO analytics
USE CASES IN ALL INDUSTRIES
eCommerce Services
Facetted Search
Web analytics
SEO-analytics
Online-Advertising
Ad serving, Profiling, Targeting
Social Networks
Trend analysis
Fraud detection
Automatic trading
Risk analysis
Finance
Customer attrition prevention
Network monitoring
Targeting, Prepaid account mgmt
Telco
Smart metering
Smart grids, Wind parks, Mining, Solar Panels
Energy, Oil and Gas
Many More
Production, Mining, M2M, Sensors, Genetics, Intelligence, Weather
Many Applications
All Industries
Real-time Requires New Technology
Realtime Big Data Engine
Continuous Data Import: Any Bus, Any File, Any Stream
Real-Time Monitoring
Interactive Analytics
Real-Time Dashboarding
Ultra-fast Querying
1 Immediate Availability
2 Billion Records
3 Immediate Answers
4 Interactive Analytics
5 Geo-Distributed Processing
6 Low TCO
etracker is a leading web-analytics and campaign steering company in Europe
Web-Analytics
Real-time web-analytics for 50,000 domains delivering 10 billion web-clicks
Continuous data import with maximum latency of 30 seconds
Complex interactive analytics for live segmentation of customer groups
< 2 sec query response time for > 100 concurrent interactive users
Campaign steering – moving ahead from trial and error to continuous multidimensional optimization
ParStream imports 500,000 sensor readings per sec delivering real-time monitoring and long-term analytics
Gas turbines
5,000 sensors are delivering 1,800,000,000 measurements per hour
ParStream immediately imports and stores all sensor readings
Real-time monitoring with ParStream ensures early issue identification
Long-term analytics for predictive maintenance reduces downtime
Maintenance of gas turbines is a more lucrative business than the initial build
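The two throughput figures on this slide agree with each other, as a quick calculation shows. The numbers are taken from the slide; the variable names are illustrative only.

```python
# Figures taken from the gas turbine slide.
SENSORS = 5_000
MEASUREMENTS_PER_HOUR = 1_800_000_000
SECONDS_PER_HOUR = 3_600

# Aggregate ingest rate the database has to sustain.
readings_per_second = MEASUREMENTS_PER_HOUR // SECONDS_PER_HOUR

# Average sampling rate of a single sensor.
readings_per_sensor_per_second = readings_per_second // SENSORS

print(readings_per_second)             # 500000
print(readings_per_sensor_per_second)  # 100
```

So each sensor reports about 100 readings per second on average, which adds up to the 500,000 readings per second quoted in the headline.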
ParStream extends usage of QlikView installation from 400M to 6B records for interactive analytics
FMCG Retailer
The customer is the leading retail chain in Austria and a long-term QlikView customer
POS-data analytics is heavily used for price negotiations with vendors
QlikView is easy to use and ultra fast but limits data volume to 400M records
Limited volume, time range and granularity of data hinder negotiations
ParStream extends usage of QlikView from 2 weeks to 6 months of data
Further extension to 30 billion records planned to cover 2.5 years of data
End-to-end network monitoring at packet-level detail unveils bottlenecks unseen for decades
Telecom
Continuous import with >1 million rows per second per node
Packet-level granularity delivers previously impossible insights
Field trial discovered a bottleneck nobody expected, saving a billion-dollar investment
Decentralized architecture capturing, storing and analyzing data at source
Massive reduction in network traffic due to decentralized storage
Solution is a blueprint for Internet-of-Things use cases
[Diagram: Decentralized storage & analytics. Labels: NDC (x5), Local (x5), M2M Analytics, Network Analytics, CRM/CEM Analytics, NPI Analytics, Ad-hoc integration, Logical data warehouse, NoSQL, Federation Server, Cache]
SEO Analytics at Searchmetrics
• Keyword analysis of competitor domains
• Complex SQL queries in real time
• 7 TByte import
• 10 billion records
• < 1 sec response time
• Reduction from 150 to 4 servers
[Diagram labels: Google Search, Application Server]
Complex correlative SQL queries of many concurrent users
10,000,000,000 domain keyword relations
< 1 sec response time
First 100 domains for 10 million keywords in 10 countries
Interactive domain traffic competitor report & analysis
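The record count on this slide can be reproduced from the other figures, assuming that 10 million keywords are tracked in each of the 10 countries (the slide does not spell this out). The constant names are illustrative.

```python
# Figures from the Searchmetrics slide; the per-country reading is an assumption.
TOP_DOMAINS_PER_KEYWORD = 100
KEYWORDS_PER_COUNTRY = 10_000_000
COUNTRIES = 10

relations = TOP_DOMAINS_PER_KEYWORD * KEYWORDS_PER_COUNTRY * COUNTRIES
print(f"{relations:,}")  # 10,000,000,000
```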
INRA MetaGenoPolis (MGP) analyzes 17 billion records interactively – growing 100x per year
Bio-Technology
INRA is the world leader in meta-genomic research
Up to 50 million different bacteria are identified per stool sample
Sample size will grow by 100x over the next 12 months
Data volume will grow from 17 billion to 2 trillion records
Researchers analyze correlation of bacteria presence with illnesses
ParStream is used to interactively discover and analyze correlations
Detection of Hurricane Risk Areas
Science: Climate Research
• Interactive Analytics of weather simulation data
• Response time 0.1 sec on 3 billion data records
• Multi-dimensional querying on geo-location data
• Run complex queries in-database at very high speed
• No need for Cubes – up-to-date & full granularity
• Continuously import new data with low-latency
Coface Services is the Innovation Leader in reliable Business Information
Facetted Search
Interactive guided selection process delivers better conversion rate
Multi-lingual text search and numeric-multiple-choice filters
15 billion data points
1,000 Coface columns + 10,000 customer columns
>100 concurrent users
< 100 ms response time
Needs vs. Reality
You want… | What you get…
Sub-second queries, high speed import | Too slow (Hadoop, Map Reduce)
Fully flexible, fully granular | Inflexible (Cassandra, KVS)
Scales on big data and big streams | Does not scale (traditional DBMS)
ParStream Is Built For Fast Data
Billions of Records
Ultra-fast Querying
Continuous Import
Thousands of Columns
High Query Throughput
ParStream is the fastest real-time database for smart data
Unique combination of continuous high speed import and ultra-fast query response times
Map-Reduce RDBMS
Front-End
Raw-Data
Application Tool
Real-Time Analytics Engine
High Speed Loader with Low Latency
C++ UDF API
SQL API / JDBC / ODBC
In-Memory and Disk Technology
Massively Parallel Processing (MPP)
Multi-Dimensional Partitioning
Shared Nothing Architecture
3rd generation Columnar Storage
High Performance Compressed Index (HPCI)
Patented high performance compressed index - USP!
Built from scratch in C++
100 % own patented IP
Leading edge DB architecture
Massively parallel shared nothing cluster architecture
Optimized for standard hardware and many Linux distributions
Runs on single server, cluster and all clouds
Outstanding Technology with USP – high performance compressed index
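The slides name a massively parallel, shared-nothing architecture but do not describe how ParStream implements it. The following is only a generic sketch of the idea, not ParStream code: rows are partitioned across nodes, each node aggregates its own partition without touching any other node's data, and the partial results are merged. All function names and the sample rows are invented for illustration.

```python
from collections import defaultdict

def partition(rows, key, num_nodes):
    """Assign each row to exactly one node by hashing the partition key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

def local_sum(rows, group_key, value_key):
    """Partial aggregate computed by one node on its own rows only."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return totals

def merge(partials):
    """Combine the per-node partial results into the final answer."""
    result = defaultdict(int)
    for partial in partials:
        for group, total in partial.items():
            result[group] += total
    return dict(result)

# Invented sample data.
rows = [
    {"state": "FL", "distance": 1000},
    {"state": "NY", "distance": 200},
    {"state": "FL", "distance": 950},
    {"state": "CA", "distance": 2500},
    {"state": "NY", "distance": 300},
]

nodes = partition(rows, "state", num_nodes=3)
totals = merge(local_sum(node_rows, "state", "distance") for node_rows in nodes)
print(totals)
```

In a real cluster the `local_sum` calls would run concurrently on separate machines; here they run one after another, which is enough to show that the merged result does not depend on how the rows were distributed.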
Massive Performance Gain On Analytical Operations – Major Technological Innovation and Differentiation
High Performance Compressed Index (HPCI)
Standard index architecture
– High Memory Requirements
– High Load on CPUs
– Latency due to Decompression
– Not Suitable for Big Data
Superior ParStream index architecture
+ Immediate Query Processing
+ No Need for Decompression
+ Massively reduced memory + IO load
+ Ultra-high Throughput
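The slides do not explain how the patented HPCI works internally. To illustrate the general idea behind "No Need for Decompression", here is a minimal sketch of a different, well-known technique: a run-length encoded bitmap index whose AND operation works directly on the compressed runs. It is a stand-in for illustration, not ParStream's index, and all names and sample bitmaps are invented.

```python
def rle_encode(bits):
    """Compress a list of 0/1 values into [bit, run_length] runs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return runs

def rle_and(a, b):
    """AND two run-length encoded bitmaps run by run, without expanding them."""
    out = []
    i = j = 0
    left_a = a[0][1] if a else 0
    left_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)      # overlap of the two current runs
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1][1] += step          # extend the previous output run
        else:
            out.append([bit, step])
        left_a -= step
        left_b -= step
        if left_a == 0:
            i += 1
            left_a = a[i][1] if i < len(a) else 0
        if left_b == 0:
            j += 1
            left_b = b[j][1] if j < len(b) else 0
    return out

def count_ones(runs):
    """Number of matching rows, read straight from the compressed form."""
    return sum(length for bit, length in runs if bit)

# One bit per row: 1 means the row satisfies the predicate (invented data).
dest_is_ny = rle_encode([1, 1, 1, 0, 0, 0, 1, 1])
quarter_is_3 = rle_encode([0, 1, 1, 1, 0, 0, 0, 1])

both = rle_and(dest_is_ny, quarter_is_3)
print(both)              # [[0, 1], [1, 2], [0, 4], [1, 1]]
print(count_ones(both))  # 3
```

The work done by `rle_and` depends on the number of runs rather than the number of rows, which is why operating on the compressed form can save both memory and CPU when the bitmaps contain long runs.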
Highly Scalable
Embedded Systems
Single Server
Cluster
Cloud
Standard Hardware + Standard Linux
Real-time Query Performance
[Bar chart: response time from 0 to 9000 for queries 1 to 4, ParStream vs. RedShift]
Query # QUERY
1 select count(distinct AirlineID) as airlines, count(distinct FlightNum) from otp where YearD BETWEEN 1997 AND 2012 AND DestState='NY' AND Quarter=3 AND DayOfWeek=4 AND OriginState='FL'
2 select count(distinct AirlineID) as airlines, count(distinct FlightNum), sum(Distance) from otp where YearD BETWEEN 1997 AND 2012 AND DestState='NY' AND Quarter=3 AND DayOfWeek=4 AND OriginState='FL'
3 select count(distinct AirlineID) as airlines, count(distinct FlightNum), count(distinct Distance), sum(Distance) from otp where YearD BETWEEN 1997 AND 2012 AND DestState='NY' AND Quarter=3 AND DayOfWeek=4 AND OriginState='FL'
4 select max(TaxiIn), sum(DepDelayMinutes), min(TaxiIn), avg(ArrDelayMinutes) from otp where YearD BETWEEN 1997 AND 2012 AND DestState='NY' AND Quarter=3 AND DayOfWeek=4 AND OriginState='FL'
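To get a feel for what these queries compute, query 2 can be run against a tiny in-memory SQLite table. This is only a sketch: the column types and the six sample rows are invented, and SQLite stands in for the databases that were actually benchmarked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE otp (
           AirlineID INTEGER, FlightNum INTEGER, YearD INTEGER, Quarter INTEGER,
           DayOfWeek INTEGER, OriginState TEXT, DestState TEXT, Distance INTEGER
       )"""
)

# Invented sample rows, not taken from the real OTP data set.
rows = [
    (1, 100, 2005, 3, 4, "FL", "NY", 1000),  # matches all filters
    (1, 101, 2006, 3, 4, "FL", "NY", 1100),  # matches all filters
    (2, 100, 2010, 3, 4, "FL", "NY", 950),   # matches all filters
    (2, 200, 2010, 1, 4, "FL", "NY", 950),   # wrong quarter
    (3, 300, 1995, 3, 4, "FL", "NY", 900),   # year out of range
    (3, 301, 2011, 3, 4, "CA", "NY", 2500),  # wrong origin state
]
conn.executemany("INSERT INTO otp VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Query 2 from the slide, reformatted for readability.
QUERY_2 = """
    SELECT count(DISTINCT AirlineID) AS airlines,
           count(DISTINCT FlightNum),
           sum(Distance)
    FROM otp
    WHERE YearD BETWEEN 1997 AND 2012
      AND DestState = 'NY' AND Quarter = 3
      AND DayOfWeek = 4 AND OriginState = 'FL'
"""
result = conn.execute(QUERY_2).fetchone()
print(result)  # (2, 2, 3050)
```

All four benchmark queries share the same five-column filter and differ only in the aggregates they compute, so they mainly stress filtering and distinct counting over a wide fact table.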
Query # | RedShift (ms) | ParStream (ms) | Factor
1 | 7797 | 264 | 29
2 | 8036 | 313 | 25
3 | 7949 | 381 | 20
4 | 7086 | 129 | 55
Environment: Single EC2 XL node with 15 GB RAM, 2 TB disk on Amazon AWS. OTP data set with about 150 million records.
Comparisons with leading analytical databases are available on request
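The Factor column can be recomputed from the two timing columns. The sketch below reads RS as RedShift and PS as ParStream, following the chart legend; the timings are copied from the results table and the variable names are illustrative.

```python
# (RedShift ms, ParStream ms) per query, copied from the results table.
timings_ms = {1: (7797, 264), 2: (8036, 313), 3: (7949, 381), 4: (7086, 129)}

# Speedup = slower time divided by faster time.
speedups = {query: rs / ps for query, (rs, ps) in timings_ms.items()}

for query, factor in speedups.items():
    print(f"query {query}: {factor:.1f}x")
```

The computed ratios fall between roughly 20x and 55x and agree with the whole-number factors on the slide to within one unit.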
Query Response Time
ParStream – real-time demo
Try out the interactive ParStream demo on https://www.parstream.com/product/demos/
ParStream – The Company
• Founded 2008 in Cologne
• 50 employees in Cologne, Paris, Silicon Valley, Boston
• International Customers
• Running 24x7 in production for more than 3 years
• $ 15.6 M funding: Khosla Ventures (lead), Andy Bechtolsheim, Crunchfund, Data Collective, Baker Capital, Tola Capital, and others