infobright python slides · •training and consulting services at published rate $12,950 $15,950...

30
Czy moŜna pogodzić dzialalność Czy moŜna pogodzić dzialalność naukową i komercyjną? Na przykladzie historii silnika bazodanowego stworzonego przez firmę Infobright Inc. Dominik Ślęzak

Upload: others

Post on 15-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Czy mo Ŝna pogodzi ć działalno ść Czy mo Ŝna pogodzi ć działalno ść naukow ą i komercyjn ą?

Na przykładzie historii silnika bazodanowego stworzonego przez firm ę Infobright Inc.

Dominik Ślęzak

Page 2: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

O czym b ędzie

� Infobright dzisiaj� Historia rozwoju� Szczegóły technologiczne� Jak się w tym nie pogubić?

2

Page 3: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Infobright

� Technology Innovation� First commercial open source data

warehouse (analytical database engine)� Community & Enterprise Editions

� Lowest cost technology in the market� Simplest & easiest to build & manage� Powerful & scalable MySQL

STARTUP TO WATCH

Cool Vendor in Data Management and Integration

2009

2008 High Performance Data Warehousing Partner

of the YearInfobright: Economic

Data Warehouse Choice

Partner of the Year 2009

� Technology unlike anything in the industry

� Strong Momentum & Mature Product� Release 3.x generally available� +100 customers in 10 Countries� +40 Partners on 6 continents� A vibrant open source community

� +1 million visitors� 20,000 downloads

3

Page 4: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Infobright: Designed for Analytics

Best FitTrue Analytics

Good FitStandard Analytics

Poor FitOLTP

Analytics Questionsabout the dataAggregates: sums, counts...

Some requests for data, optimized by Knowledge Grid

Primarily data retrieval &mixed read/write activity(but we handle mixed read/writes in analytics)

Environment Analytical ad hoc queries

Analytical and standard queries

Transaction applications& operational reportingad hoc queries standard queries & operational reporting

Sample Queries

How many new customers were identified recently?What was the largest sale made for item A?

Report the sales figures from California for X campaign for Jan 2008

SELECT *“Go get the data”

Notes Knowledge Grid is used heavily to resolve the query – almost no I/O

Knowledge Grid is used to reduce I/O times by optimizing data access

It does not use the advantages of a columnar database

Page 5: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Why Customers Buy Infobright (1)

� Fast query response with no tuning

Customer’s Test Alternative Infobright

Analytical queries 2+ hours with MySQL <10 seconds

Query (AND – Left Join) 26.4 seconds in Oracle .02 seconds

� Fast and consistent data load speed as database grows. Up to 300GB/hour on a single server

5

Oracle query set 10 seconds – 15 minutes 0.43 – 22 seconds

BI report 7 hours in Informix 17 seconds

Data load 11 hours in MySQL ISAM 11 minutes

Page 6: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Online Marketing Need Infobright’s Solution

� Data Volumes – online marketing data generates staggering volumes of information that needs to be captured and analyzed

� Real-time Response needed for

� Support for storing up to 50 TBs of data compressed, with up to 40:1 compression, providing the lowest hardware footprint

� Blazing response times for deep

Online Marketing

� Real-time Response needed for rapid response to campaigns and specific targeting

� Deep Ad Hoc Analysis –differentiation through deeper analytics across more granular data

� Low Maintenance – insane growth rates, tight resourcing means a solution with as little maintenance and cost as possible

� Blazing response times for deep business analytic queries on TBs of granular clickstream data

� Lowest total cost of ownership in data warehousing – simple to implement &manage, requiring little administration

� Customer experience, including:

6

Page 7: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Telecommunications Need Infobright’s Solution

� Data Volumes are simply staggering; call center analysis, call data records (CDR), customer billing data, network data incl. alarms, alerts and events

� Support for storing up to 50 TBs of data compressed, with up to 40:1 compression, providing the lowest hardware footprint

� Keeps data together while providing fast

Telecommunications

and events

� Avoiding Data Silos – storing vast amounts of data and ensuring timely access typically results in multiple development projects, creating unconnected silos of data

� Highly Complex Ad Hoc Analysis– vast cross-section of TELCOapplications means widely differing requirements & highly complex, ad hoc queries

� Keeps data together while providing fast response times on TBs of varied data

� Complex business analytics without the maintenance overhead provides the lowest TCO in data warehousing

� Customer experience, including:

7

Page 8: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Financial Services Need Infobright’s Solution

� Response Times – Capital markets firms depend on the speed at which they can analyze and process trades

� Risk Management – Must prevent risk and identify fraud in real-time to

� Fast analytic response times regardless of TBs of ever-increasing trade and tick data volumes

� Detailed business analytics to spot inconsistencies across huge volumes of

Financial Services

risk and identify fraud in real-time to maintain the business

� Regulatory Compliance –Compliance with rising regulatory requirements, directly impacts their ability to stay in business:

� Trading compliance (Reg NMS, MiFID)

� Corporate governance (Sarbanes-Oxley, Basel II)

inconsistencies across huge volumes of live and historical data, prevents losses of money and reputation

� Reporting accuracy regardless of rising volumes helps maintain compliance

� Customer experience, including:

8

Page 9: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Dominik (back to 2005)

9

Page 10: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Challenges (2005-2010)

O u tlo o k T em p . H u m id . W in d S p o rt?

1 S u n n y H o t H ig h W e ak N o

2 S u n n y H o t H ig h S tro n g N o

3 O v ercast H o t H ig h W e ak Y es

4 R ain M ild H ig h W e ak Y es

5 R ain C o ld N o rm al W e ak Y es

6 R ain C o ld N o rm al S tro n g N o

7 O v ercast C o ld N o rm al S tro n g Y es

8 S u n n y M ild H ig h W e ak N o

9 S u n n y C o ld N o rm al W e ak Y es

1 0 R ain M ild N o rm al W e ak Y es

1 1 S u n n y M ild N o rm al S tro n g Y es

� Data tables are incomparably larger than on this illustration but the users want the query speed to stay within minutes

� Data may grow very quickly

�Current data warehouse solutions (MPP, indices, sorting, partitioning, etc.) are not flexible and often require special hardware

10

1 1 S u n n y M ild N o rm al S tro n g Y es

1 2 O v ercast M ild H ig h S tro n g Y es

1 3 O v ercast H o t N o rm al W e ak Y es

1 4 R ain M ild H ig h S tro n g N o

� Data may grow very quickly but the users expect the data warehouse to adjust to the new data very quickly as well

� Types of queries are hard to predict and the users expect that arbitrary queries will be executed reasonably quickly

Page 11: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Column vs. Row-Oriented

EMP_ID FNAME LNAME SALARY

1 Moe Howard 10000

2 Curly Joe 12000

3 Larry Fine 9000

Row Oriented ( 1,Moe,Howard,10000; 2,Curly, Joe,12000; 3,Larry,Fine,9000;)

� Works well if all the columns are needed for � Works well if all the columns are needed for every query.

� Efficient for transactional processing if all the data for the row is available

� Works well with aggregate results (sum, count, avg, min, max, ... )

� Only relevant columns need to be touched� Allows for very efficient compression

Column Oriented ( 1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000;)

11

Page 12: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Data Packs and Compression

64K

64K

Data Packs� Each data pack contains 65,536 data values� Compression is applied to each individual data pack� The compression algorithm varies depending on data

type and distribution

Compression64K

64K

Compression� Results vary depending on data

distribution among data packs� A typical overall compression ratio

seen in the field is around 10:1� Some customers have seen results

have been as high as 40:1

Patent PendingCompression

Algorithms

12

Page 13: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Knowledge Grid

Data Pack Nodes (DPN)A separate DPN is created for every data pack created in the database to store basic statistical information

HistogramsHistograms are created for every Data Pack that contains numeric data and creates 1024 MIN-MAX intervals.

13

This metadata layer is ~1% of the compressed volume.For example, a 1TB (raw) database would have Knowledge Grid of less than 1 GB.

Character Maps (CMAPs)Every Data Pack that contains text creates a matrix that records the occurrence of every possible ASCII character

MIN-MAX intervals.

Page 14: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

salary age job city

A Simple Query using Knowledge Grid

SELECT count (*) FROM employeesWHERE salary > 50000AND age < 65AND job = ‘Shipping’AND city = ‘TORONTO’;

2. Find the Data Packs that contain age < 65

3. Find the Data Packs that have job = ‘Shipping’

All packs ignored

All packs ignored

1. Find the Data Packs with salary > 50000

‘Shipping’

4. Find the Data Packs that have City = ‘Toronto’

All packs ignored

5. Now we eliminate all rows that have been flagged as irrelevant.

Only this pack will be decompressed

6. Finally we have identified the data pack that needs to be decompressed

Completely Irrelevant

Suspect

All values match

14

Page 15: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

SELECT MAX(A) FROM T WHERE B>15;

E E

I/E I/E

1 2 3

I/E I/E

I/S/R denotes irrelevant/suspect/relevant; E – exact computation (decompression)

15

Page 16: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

HashBlockJoin (advanced usage of metadata)

16

Page 17: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

MySQL Pluggable Storage Engine Architecture

17

Page 18: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Infobright and MySQL (2006-2010)

Integration provides a simple path to highly scalable data warehousing for MySQL users

No new management interface to learninterface to learn

MySQL integration enables seamless connectivity to BI tools and MySQL drivers for ODBC, JDBC, C/C++, .NET, Perl, Python, PHP, Ruby, Tcl, etc.

18

Page 19: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

19

Page 20: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Features

Technical Support Forums and/orone-time 4-hr support pack

Silver, Gold, Platinum

Warranty and Indemnification

No Included

Mixed Read/Writes No Supported

Comparison of ICE and IEE (2008-2010)

Infobright Loader Up to 50 GB/hrMulti-threaded, Up to 300GB/hr

Data Load Types Text onlyText

Binary (up to 100% faster)

MySQL Loader No Supported

Temp Tables No Supported

Platform Support(keeps changing)

64-bit Intel and AMDRHEL 5, CentOS 5,

Debian 32-bit, Ubuntu 8.04, Fedora 9, Windows XP...

64-bit Intel and AMDSolaris 10, Windows Server,

RHEL 5, RHEL AS, CentOS 5, Debian...

Page 21: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

SubscriptionHours of Service

Response Time SLA

# Named Client

Contacts

Support Access

MethodsServices

* Two year PrePay

Charge/TB

One YearAnnual

Charge/TB

Silver

9am-5pmEDT

(no phone support)

Severity 1=4 hr

Severity 2=8 hr

1

� Web� Email� Self-Service Knowledge Base

•Training and consulting services at published rate

$12,950 $15,950

8am-6pmSeverity

� Phone� Web

•Health Check Service included

IEE Annual Subscriptions

Gold8am-6pm

(ET for NA,CET Europe)

Severity 1=2 hr

Severity 2=6 hr

3

� Web� Email� Self-Service Knowledge Base

included•10% Discount training and other consulting services•Sev 1 hot fixes

$15,950 $18,950

Platinum 7 x 24 x 365

Severity 1=1 hr

Severity 2=4 hr

5

� Phone� Web� Email� Self-Service Knowledge Base

•Health Check Service included•20% discount training and other consulting services•Sev 1 and 2 hot fixes

$18,950 $21,950

Perpetual Perpetual

(Same as Gold Support Level)

18% of Purchase Price not to increase by more then 10% per Yr.

$40,000

* Two year subscription pricing requires two year p repay

Page 22: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Why Customers Buy Infobright (2)

• Less cost

• Capacity-based subscription model

• Less hardware required

• Less work

• Simple schemas and standard SQL• Simple schemas and standard SQL

• Reduced monitoring requirements

• Reduced maintenance requirements

• Open source

• Fast implementation

• Open source community

• Broad platform support

22

Page 23: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions
Page 24: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Dominik (in 2010)

24

Page 25: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Dominik (in 2010)

25

Page 26: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Approximate SQL

� In such areas as, e.g., Business Intelligence and Web Analytics, there is an ongoing debate whether the answers to SQL statements have to be always exact.

� The same question occurs in the case of SQL-based machine learning algorithms, which are often based on heuristics, randomness and inexactness anyway.on heuristics, randomness and inexactness anyway.

� Motivation for SQL approximations is related also to such aspects as complexity of queries and data sources, dynamically changing data with a limited access, and huge data sets with a need to monitor convergence of query execution in time, regardless of whether the final answers are to be exact or not.

26

Page 27: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Data Granulation

27

Page 28: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

SQL-based Machine Learning

28

Page 29: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

Community Inspirations

� New Objectives� New Schemas� New Volumes� New Queries

� Count Distinct� Self-Joins� Decision Trees� Contingencies

� New Statistics

� New Data Types� SQL Extensions� Feature Extraction� Data Compression

29

Page 30: Infobright Python Slides · •Training and consulting services at published rate $12,950 $15,950 8am -6pm Severity Phone Web •Health Check Service included IEE Annual Subscriptions

DZIĘKUJĘ ZA UWAGĘ!!!www.infobright.com

www.infobright.org

[email protected]

http://en.wikipedia.org/wiki/Infobright