YZStack: Provisioning Customizable Solution for Big Data
Sai Wu, Chun Chen, Gang Chen, Lidan Shou, Ke Chen
Zhejiang University
Hui Cao, yzBigData Co. Ltd.
He Bai, City Cloud Technology
3H Problem in Deploying the Big Data System
• How can I build and deploy a big data system without background knowledge?
• How can I migrate existing applications to the big data system?
• How can I use my big data system to do the analysis job?
Too Many Choices
• Virtualization:
– OpenStack
– CloudStack
– VMware
• Cloud storage:
– Key-value store (HBase, Cassandra, Redis, …)
– Relational service (AWS, Spanner, …)
• Processing engine:
– MapReduce/Hadoop
– Dryad
– Pregel, GraphLab
– Spark
– epiC
• Application service:
– Mahout
– Hive
– Spatial Hadoop
Can I Deploy a Big Data System Like Installing Windows Software?
• Configure the installation as a customization process
• The installation software will copy the binary code to all servers and do the configuration automatically
• A browser-based management system to start/stop the services and monitor the status
YZStack: the Architecture
• Layers are loosely connected
• Each layer includes many selectable modules
• Modules of different layers are linked via the common interfaces
• Optimizations are implemented as special plugins
[Figure: YZStack architecture. Layers shown: IaaS, DaaS, PaaS, SaaS, Applications, and Plugins. Component labels: Cloud Virtual Server, Cloud Storage, Cloud Network; Relational Data Service, Object Based Service, Distributed File System, Key-Value Store; YZepiC, Graph Engine, Relational Analytical Engine, Relational Transactional Engine; Data Mining, OLAP Processing, Stream Processing, Visualization, OLTP Processing; Security Module, Data Integration Module, System Monitor, Optimization Modules, Data Importer/ETL tools; Smart Traffic, Hangzhou E-card, Analyzer for Power Grid, Green Hangzhou.]
Features of YZStack
• Adaptive Image
– Based on OpenStack, partition the big image into small chunks
– Different images share the same chunk
• Optimization Plugins
– Column-oriented plugin
– Index plugin
– Query optimization plugin
– Iterative job plugin
• Visualization Tool
– Zoom in/out for different dimensions
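The adaptive-image idea (split images into chunks, let images share identical chunks) can be sketched as a content-addressed chunk store. This is only an illustration: the slides do not say how YZStack identifies or sizes chunks, so the fixed chunk size, the SHA-256 digests, and the `ChunkStore` class are all assumptions.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; assumed and tiny on purpose so the example is readable


class ChunkStore:
    """Hypothetical content-addressed store: identical chunks are kept once."""

    def __init__(self):
        self.chunks = {}  # chunk digest -> chunk bytes
        self.images = {}  # image name -> ordered list of chunk digests

    def add_image(self, name, data):
        digests = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # stored only if unseen
            digests.append(digest)
        self.images[name] = digests

    def read_image(self, name):
        # Reassemble the image from its chunk list.
        return b"".join(self.chunks[d] for d in self.images[name])


store = ChunkStore()
store.add_image("base", b"AAAABBBBCCCC")
store.add_image("with_hadoop", b"AAAABBBBDDDD")

# Chunks that both images point to (stored once, referenced twice).
shared = set(store.images["base"]) & set(store.images["with_hadoop"])
```

Here the two images contain six chunks in total, but only four distinct chunks are stored because the first two are shared.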
Optimization Plugin
[Figure: two layers (Layer 1 and Layer 2), each exposing a "Common Interface of Layer" over Module A … Module K, with a default implementation and hooks. Optimization Plugin 1 and Optimization Plugin 2 supply customized functions.]
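The hook mechanism in this slide can be sketched as follows: a layer offers a common interface with a default implementation, and a plugin may register a customized function that replaces the default. The names `StorageLayer`, `register_plugin`, and `IndexPlugin` are hypothetical and are not taken from YZStack.

```python
class StorageLayer:
    """Hypothetical layer: common interface, default implementation, hooks."""

    def __init__(self, rows):
        self.rows = rows  # list of (key, value) pairs
        self._hooks = {}  # operation name -> customized function

    def register_plugin(self, operation, func):
        self._hooks[operation] = func

    def lookup(self, key):
        # Common interface: prefer a plugin's function, else use the default.
        impl = self._hooks.get("lookup", self._default_lookup)
        return impl(key)

    def _default_lookup(self, key):
        # Default implementation: linear scan over all rows.
        for k, v in self.rows:
            if k == key:
                return v
        return None


class IndexPlugin:
    """Hypothetical optimization plugin: answers lookups from a hash index."""

    def __init__(self, rows):
        self.index = dict(rows)  # assumes keys are unique

    def lookup(self, key):
        return self.index.get(key)


rows = [("a", 1), ("b", 2), ("c", 3)]
layer = StorageLayer(rows)
before = layer.lookup("b")  # served by the default linear scan

layer.register_plugin("lookup", IndexPlugin(rows).lookup)
after = layer.lookup("b")   # served by the plugin's customized function
```

Callers keep using `layer.lookup`, so an optimization can be added or removed without touching the modules of other layers.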
Use Case: the Smart Financial System
• Built for the Zhejiang Provincial Department of Finance (ZPDF)
[Figure: architecture of the smart financial system. Labels: Virtual Servers; Distributed File System; Schema Metadata, Data Statistics, Table File, Tablet Files; Relational Data Service; YZepiC; SQL Query Parser, Query Optimizer, Query Engine; Relational Analytical Engine; OLAP Module; Data Mining Module; Data Importer with sources Tax, Energy, Environment, Traffic, Human, Electronic; Index Plugin; Visualization Tool; Security Plugin; Monitor Plugin.]
Economic Prediction
• Collaborate with researchers from the College of Economics, Zhejiang University
• Step 1:
– Use the OLAP module to provide a basic view for each registered company
Economic Prediction (cont.)
• Step 2:
– Healthy Model: Based on the historical data, the healthy model discovers risks and predicts prospects of an industry
– Energy Consumption Model: We link the financial data with the electronic, water, and environment data to rank each industry based on its energy consumption per unit of output value
– Economic Impact Model: By connecting the financial data to the human resource data, we study how many workers are employed in an industry and their average salary
– Combine all three models to rank all industries accordingly
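The ranking criterion of the Energy Consumption Model, energy consumption per unit of output value, can be illustrated with a few lines of arithmetic. The industries and numbers below are made up for illustration, and sorting from lowest to highest intensity is an assumption; the slides do not state the ranking direction or the units.

```python
# name: (energy consumed, output value); made-up numbers in arbitrary units
industries = {
    "textiles": (500.0, 1000.0),
    "software": (50.0, 2000.0),
    "chemicals": (900.0, 1500.0),
}


def energy_intensity(energy, output_value):
    """Energy consumption per unit of output value."""
    return energy / output_value


# Assumed ordering: the least energy-intensive industry comes first.
ranking = sorted(
    industries,
    key=lambda name: energy_intensity(*industries[name]),
)
```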
Economic Prediction (cont.)
• Step 3: Economic Index (ongoing work)
– To predict the status of the whole Zhejiang Province using statistics generated by the previous two steps
– Involving multiple complex economic models
– Our economic researchers are using the visualization tools to build and study their models
Detection of Improper Payment
• What is an improper payment?
– A person is classified as low-income and buys a house reserved for low-and-medium wage earners. However, he is actually employed by an IT company.
– One company may submit different registration files to different government departments (e.g., it registers as a high-tech company in the Department of Science, but as a labor-intensive one in the Department of Labor) to enjoy various allowances from the government.
Why ZPDF?
• A harbor of financial data in Zhejiang Province
– Electronic department
– Traffic department
– Tax department
– …
• It is well motivated
– Expected to save more than 1 billion CNY
Improper Payment
• Step 1 (Consistent Problem):
– To detect improper payment from two databases, D0 and D1, we first generate two star-join queries, Q0 and Q1, which selectively merge the fact tables with the dimension tables.
– The trick is that the entities returned by Q0 should not exist in the results of Q1.
– E.g., Q0 returns the high-income persons, while Q1 returns the users who own a house specially for low-and-medium wage earners.
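A minimal sketch of the Q0/Q1 check, using SQLite from the Python standard library. Everything concrete here is an assumption made for illustration: the table and column names, the income cutoff, the sample rows, and above all the shared `person_id` key. Each fact table is joined with a single dimension table for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- D0: hypothetical tax database (fact table + dimension table)
    CREATE TABLE d0_person (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE d0_income (person_id INTEGER, year INTEGER, amount REAL);
    -- D1: hypothetical housing database (fact table + dimension table)
    CREATE TABLE d1_house_type (type_id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE d1_purchase (person_id INTEGER, type_id INTEGER);

    INSERT INTO d0_person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO d0_income VALUES
        (1, 2013, 300000), (2, 2013, 30000), (3, 2013, 250000);
    INSERT INTO d1_house_type VALUES (10, 'low_income'), (20, 'commercial');
    INSERT INTO d1_purchase VALUES (1, 10), (2, 10), (3, 20);
""")

# Q0: high-income persons (fact table joined with its dimension table).
Q0 = """
    SELECT p.person_id
    FROM d0_income AS f
    JOIN d0_person AS p ON p.person_id = f.person_id
    WHERE f.amount > 200000
"""

# Q1: owners of houses reserved for low-and-medium wage earners.
Q1 = """
    SELECT f.person_id
    FROM d1_purchase AS f
    JOIN d1_house_type AS t ON t.type_id = f.type_id
    WHERE t.label = 'low_income'
"""

high_income = {row[0] for row in conn.execute(Q0)}
subsidized_owners = {row[0] for row in conn.execute(Q1)}

# Entities returned by Q0 should not appear in Q1; any overlap is suspicious.
suspects = sorted(high_income & subsidized_owners)
```

With this sample data only person 1 appears in both results. Real databases from different departments rarely share a clean key like this, so records have to be matched approximately.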
Consistent Problem
• We apply LSH (Locality Sensitive Hashing) to generate k hash values for each tuple from T0 and T1.
• Tuples sharing the same hash value are considered a candidate group.
• We define a similarity function sim(ti, tj) to evaluate the probability that two tuples represent the same entity. If sim(ti, tj) is greater than a predefined threshold, the pair is forwarded to the verification module, where a human-aided algorithm is applied to filter out the false positives.
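The candidate-group step can be sketched with MinHash, one common LSH family for set similarity. The slides do not say which LSH family or similarity function the system uses, so MinHash over character 3-grams, Jaccard similarity as `sim`, and the values of `K` and `THRESHOLD` are all assumptions, as are the sample tuples.

```python
import hashlib
from collections import defaultdict

K = 4            # number of hash values per tuple (the "k" in the text)
THRESHOLD = 0.5  # hypothetical similarity threshold


def shingles(text, n=3):
    """Character n-grams of the normalized tuple text."""
    text = text.lower().replace(" ", "")
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def signature(text):
    """K MinHash values: for each seed, the smallest hash over all shingles."""
    tokens = shingles(text)
    return [
        min(
            int(hashlib.sha256(f"{seed}:{tok}".encode()).hexdigest(), 16)
            for tok in tokens
        )
        for seed in range(K)
    ]


def sim(a, b):
    """Jaccard similarity of the shingle sets (assumed similarity function)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def candidate_pairs(t0, t1):
    """Pair up tuples from t0 and t1 that share at least one hash value."""
    buckets = defaultdict(lambda: ([], []))  # (hash index, value) -> groups
    for side, tuples in enumerate((t0, t1)):
        for t in tuples:
            for i, h in enumerate(signature(t)):
                buckets[(i, h)][side].append(t)
    pairs = set()
    for left, right in buckets.values():
        pairs.update((a, b) for a in left for b in right)
    return pairs


T0 = ["Zhang Wei, Hangzhou"]
T1 = ["zhang wei, hangzhou", "Zhang Wei, Hangzhou City", "Li Na, Ningbo"]

# Pairs above the threshold would go on to the human-aided verification step.
to_verify = sorted(p for p in candidate_pairs(T0, T1) if sim(*p) > THRESHOLD)
```

Only tuples that land in the same bucket are compared, so `sim` runs on a small set of candidate pairs rather than on every pair from T0 and T1.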
[Figure: two star schemas, each a Fact Table with Dimension Tables, feed Candidate Groups, which feed Verification.]
Conclusion
• YZStack is tailored for users who have little or no experience in deploying and maintaining a cloud system.
• It simplifies the development of a new big data application as the process of module selection and customization.
• To show the flexibility and usability of YZStack, we demonstrate how we build a smart financial system for the Zhejiang Provincial Department of Finance using YZStack.