Big Data & PostgreSQL: Using TABLESAMPLE to Analyze Very Large Datasets

By Umair Shahid

Who am I?

● Got “pushed” into PostgreSQL in 2004, ended up falling in love with it
● Not a hardcore techie, yet passionate about open source software
● Heading the productization efforts at 2ndQuadrant
● Interested in Big Data, specifically the newer PostgreSQL features supporting it

What is the problem?

Number of Rows    Size on Disk (MB)    Time Taken (ms)
1k                0.23                 219.706
100k              24                   1,302.135
1M                195                  7,696.386
5M                951                  40,691.603
10M               1,923                60,012.457
100M              19,456               801,493.319

Why is this significant?

● Data mining has typically been a painful process
● A major contributor to the pain has been the time it takes for queries to return
● Many false steps are taken before the required data is identified
● Waiting time is wasted time
● Sampling, count based or time based, reduces the wasted time significantly

What is TABLESAMPLE?

● Ability to read a random sample of data in a table

● Defined in SQL:2003 (5th revision of SQL)

● Implemented in PostgreSQL 9.5

Syntax

SELECT select_expression

FROM table_name

TABLESAMPLE sampling_method ( argument [, ...] )

[ REPEATABLE ( seed ) ]
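As a concrete instance of the syntax above, the following sketch fills in each placeholder. The table sample_demo, its columns, and the seed value 42 are illustrative assumptions and do not come from the slides; the built-in sampling methods need PostgreSQL 9.5 or later.

```sql
-- Hypothetical demo table (an assumption for illustration):
-- one million rows of synthetic data.
CREATE TABLE IF NOT EXISTS sample_demo AS
SELECT g AS id, random() AS val
FROM generate_series(1, 1000000) AS g;

-- Read roughly 1 percent of the rows with the BERNOULLI method.
-- REPEATABLE (42) fixes the seed, so the same sample is returned
-- again as long as the table has not changed.
SELECT id, val
FROM sample_demo
TABLESAMPLE BERNOULLI (1)
REPEATABLE (42);
```

Note that a WHERE clause, if present, is applied after sampling, so it filters the sampled rows rather than the whole table.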

sampling_method

● argument is the percentage of rows to sample
● SYSTEM
  ○ Block level sampling
  ○ Very fast
  ○ Non-independent rows
● BERNOULLI
  ○ Row level sampling
  ○ Slower than SYSTEM
  ○ Independent rows (uniformly random)

Demo sampling methods
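The live demo is not part of the slides, so what follows is only a sketch of how the two built-in methods might be compared. The table sample_demo is a hypothetical stand-in, and the ctid cast is a common idiom for extracting a row's block number, used here to make the block-level versus row-level difference visible.

```sql
-- Hypothetical demo table (an assumption for illustration).
CREATE TABLE IF NOT EXISTS sample_demo AS
SELECT g AS id, random() AS val
FROM generate_series(1, 1000000) AS g;

-- SYSTEM picks whole blocks, so the sampled rows are expected to
-- come from a small number of distinct blocks.
SELECT count(*) AS sampled_rows,
       count(DISTINCT (ctid::text::point)[0]) AS blocks_touched
FROM sample_demo TABLESAMPLE SYSTEM (1);

-- BERNOULLI decides row by row, so a similar number of rows is
-- expected to be spread across many more blocks.
SELECT count(*) AS sampled_rows,
       count(DISTINCT (ctid::text::point)[0]) AS blocks_touched
FROM sample_demo TABLESAMPLE BERNOULLI (1);

-- A 1 percent sample can be scaled up for a rough row-count estimate.
SELECT count(*) * 100 AS estimated_rows
FROM sample_demo TABLESAMPLE SYSTEM (1);
```

Prefixing each query with EXPLAIN ANALYZE, or enabling \timing in psql, is one way to observe the speed difference between the two methods.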

REPEATABLE results

● (Reminder: [ REPEATABLE ( seed ) ])
● Optional argument
● Used if random, yet repeatable results are required
● seed and argument need to be the same to produce repeatable results
● Any changes made to the table will result in a different data set
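The behaviour described above can be tried out with a sketch along these lines. The table sample_demo and the seed values 7 and 8 are illustrative assumptions, not taken from the slides.

```sql
-- Hypothetical demo table (an assumption for illustration).
CREATE TABLE IF NOT EXISTS sample_demo AS
SELECT g AS id, random() AS val
FROM generate_series(1, 1000000) AS g;

-- Run this statement twice: with the same percentage (0.01) and the
-- same seed (7), both runs should return the same ids, provided the
-- table is not modified in between.
SELECT id
FROM sample_demo TABLESAMPLE BERNOULLI (0.01) REPEATABLE (7)
ORDER BY id;

-- Changing the seed (or the percentage) selects a different sample.
SELECT id
FROM sample_demo TABLESAMPLE BERNOULLI (0.01) REPEATABLE (8)
ORDER BY id;

-- Without REPEATABLE, each execution draws a fresh random sample.
SELECT id
FROM sample_demo TABLESAMPLE BERNOULLI (0.01)
ORDER BY id;
```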

Now it gets interesting …

● TABLESAMPLE allows for additional sampling methods via extensions
● tsm_system_time specifies the maximum number of milliseconds to spend reading a table
● Implements the syntax:

SELECT select_expression

FROM table_name

TABLESAMPLE SYSTEM_TIME (argument)

Demo tsm_system_time
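The demo itself is not captured in the slides; a minimal sketch of what such a session could look like follows. The table sample_demo and the 100 ms and 1000-row limits are illustrative assumptions. The second half uses tsm_system_rows, a count-based companion module that the slides do not name explicitly, included here only because count-based sampling is mentioned earlier.

```sql
-- Hypothetical demo table (an assumption for illustration).
CREATE TABLE IF NOT EXISTS sample_demo AS
SELECT g AS id, random() AS val
FROM generate_series(1, 1000000) AS g;

-- tsm_system_time is a contrib module; install it once per database.
CREATE EXTENSION IF NOT EXISTS tsm_system_time;

-- Spend at most about 100 milliseconds reading blocks and aggregate
-- whatever rows were read in that time.
SELECT count(*) AS rows_read, avg(val) AS approx_avg
FROM sample_demo TABLESAMPLE SYSTEM_TIME (100);

-- Count-based counterpart (not named in the slides): read a fixed
-- number of rows instead of a fixed amount of time.
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

SELECT count(*) AS rows_read, avg(val) AS approx_avg
FROM sample_demo TABLESAMPLE SYSTEM_ROWS (1000);
```

On a table small enough to be read completely within the time limit, SYSTEM_TIME simply returns the whole table. Both methods sample at block level, like SYSTEM, and neither accepts a REPEATABLE clause.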

Enter Orange ...

● Funded by AXLE (http://axleproject.eu)
● Same project funded TABLESAMPLE
● Available integrated with PostgreSQL in 2UDA (http://2ndquadrant.com/2uda)
● Uses TABLESAMPLE to very quickly create visualizations for data
● Can quickly create predictive models

Demo Orange

You can find a very helpful tutorial at http://2ndquadrant.com/2uda
