End User Panel on Real-Time Data Analytics
Building Predictive Applications with Real-Time Data Pipelines and Streamliner
Eric Frenkiel, CEO and Co-Founder, MemSQL
Going Real-Time is the Next Phase for Big Data
MoreDevices
More Interconnectivity
MoreUser Demand
…and companies are at risk of being left behind
MemSQL Architecture
S t r e a m i n g Da t a W a r e h o u s e
Streaming
Integrated streamingwith Streamliner
Database
High volume transactions for structured and unstructured data
Data Warehouse
Fast, scalableSQL for immediate
analytics
Applications and Technology Trends
Real-Time Analytics Risk-Management Personalization
Portfolio Tracking Monitoring and Detection
Internet of Things | Real-Time Data Pipelines | Operationalizing Apache Spark
Changing the Way the World Invests
Noah Zucker, Vice President – Tactical Engineering, Novus Partners
Scalable Portfolio Intelligence with MemSQL
100+ Investment Managers, $2 Trillion AUM Research Platform: 10,000+ Institutions Founded 2007, Privately Held
We help investors discover their true investment acumen and risk
About Novus
24/7 ETL Handholding
Overnight Failure = Business Hours Slowdown
Scala worker pool limited by the database
Non-trivial code changes needed to shard and scale
Before MemSQL…
Today’s Portfolio Intelligence…Right Now
Before MemSQL:
With MemSQL:
90 Min.
2 Min.
Customer Data
Persistent StoreETL Analytics
(Scala)
First-Class JSON Support…Happy Developers
memsql> select * from tasks t where t.task::uid::%clientId = 7;
+---------+---------------------------------------------------------------+| task_id | task |+---------+---------------------------------------------------------------+| 3 | {"uid":{"clientId":7,"id":1009,"which":"P"},"user":"noahlz"} |+---------+---------------------------------------------------------------+1 row in set (0.00 sec)
Salat
Client team focuses on service, not ETL
Predictable application performance
Scala workers: 12 126 Add servers to scale –
No code changes needed
With MemSQL…
Problem: Business Intelligence Slows as We Grow
Data lives in SQL Easy to ask new questions in SQL But… Business Intelligence tasks taking longer Database isn’t built for quick aggregations
Solution: Scale-out SQL Database SQL team stays powerful Quick to iterate with quick answers Prepare for the future!
Problem: Data isn’t in MemSQL
Plus
You don’t have an engineer on your team
It’s hard to get an engineer’s time
You’ve got a job to do… (which is taking more and more time)
Solution: ETL Using REPLACE INTO MySQL SQL flavor (available in MemSQL) Handles new rows and updates on rows Easy to write
• Query source database then replace into target database
Many other scale-out SQL databases don’t have equivalent
Solution: MemSQL Loader + JSON Type
Only loads new files (or files whose content has changed)
Parallelizes the process Transformation script
simple: return id and raw json data
SQL team unaffected by new JSON events
./memsql-loader load /opt/events/** --table events --script=/opt/events-etl --file-id-column file_id --columns id,data
Problem: Processing Data on Select Need computed value in SQL query Computing the value slows down queries Computed value used on many queries
• e.g. domain from a URL string
Solution: Persistent Columns
Pre-compute result and save it on the row
Automatically updated if row changes
No need to alter ETL pipeline
ALTER TABLE events ADD COLUMN ( referring_domain AS substring_index(substring(data::$referrer, (locate('//', data::$referrer)) + 2), '/', 1) PERSISTED varchar(255))
Solution: Persistent Columns
Use pre-computed value in select
memsql> select data, referring_domain from events limit 2;+-------------------------------------+------------------+| data | referring_domain |+-------------------------------------+------------------+| {"referrer":"http://example.com/b"} | example.com || {"referrer":"http://example.com/a"} | example.com |+-------------------------------------+------------------+
We are the leading provider of cloud services for delivering, optimizing and securing online content and business applications
$1.96BRevenue
1,300Locations
5,000+Customers
5,100+Employees
CORPORATE STATS (2014):
OUR HISTORY: Founded 1998 and rooted in MIT technology—solving Internet congestion with math not hardware
The Business of Billing
Billing domino effect Akamai Customers Sub-customers
Daily billing requires: Fast data delivery
Accurate data
Old Model New Model
Generating a bill at end of month for customer services
Generating a bill at the end of every day for sub-customer services
Current Billing Data Management
Gather logs from 190,000+ servers in 1400 locations in 110 countries Multiple PBs/day aggregate/reduce into relevant billing data feed
Typical data record: 3 key fields plus metrics
Load resulting data record into our RDBMS system
Greatest Challenges Current system cannot handle expected throughput Difficult to quickly scale up existing environments New model will generate 10x+ data
Deploying MemSQLApplicationDaily Sub-customer billing
ProblemExisting RDMS pipeline loads were maxed out at 150-300K upserts/second, could not keep up with projected size of new billing model
ResultsMemSQL cluster performs at 1.9 million upserts/second, allowing transition from monthly to daily billing
Billing Data resource usage statistics
INSERT... ON DUPLICATE KEY UPDATE... (1.9 million/sec)
Billing Application
• Compute sub-customer charges daily
• Roll up sub-customer usage by customer/cloud provider
• More sophisticated platform offers customers better service, partners new business opportunities
Results Speak for Themselves 2M upserts/second on AWS EC2
instances
Scalability on commodity hardware
Meeting our billing windows
Unlocking revenue
Adapt PoC for real-world situations
Continue scaling linearly
Optimize results with small cluster deployment
What Next?
Eric Frenkiel, MemSQL CEO and co-founderSeptember 30, 2015 • New York, NY
Introducing MemSQL Streamliner
One click deployment of integrated Apache Spark
Put Spark in the Fast Lane• GUI pipeline setup• Multiple data pipelines• Real-time transformation
Eliminates batch ETL Open source on GitHub
Introducing the MemSQL Streamliner
Streamliner ArchitectureFirst of many integrated Apache Spark solutions
Other Real-Time Data
Sources Application
Apache Spark
Future Solution
Future Machine Learning Solution
STREAMLINER
Streamliner ETL Detail
Other Real-Time Data
Sources Application
Apache Spark
Future Solution
Future Machine Learning Solution
STREAMLINER
STREAMLINER
Custom
Future Extractor
JSON
Custom
Future Transformer
Extract Transform Load
Building Predictive Applications
StreamlinerInput
User JarSAS Generated PMML
Industrial Equipment
Sensor Data
S1 S2 S3 P1 P2 P3
Scoring Real-Time Data with Predictive Models
Sensor 1 Predictive Model 1
Streamliner Benefits
Build end-to-end data pipelines in minutes
Reduce data latency from days or hours to ZERO
Support thousands of concurrent users running real-time queries
Give users immediate access to fresh data via innovative applications