Learn How to Run Python on Redshift


TRANSCRIPT


How Bellhops Leverages Amazon Redshift UDFs for Massively Parallel Data Science
Ian Eaves, Bellhops
May 12th, 2016

Today's Speakers

Chartio: AJ Welch, Chartio.com

Bellhops: Ian Eaves, GetBellhops.com

AWS: Brandon Chavis, aws.amazon.com

The recording will be sent to all webinar participants after the event.
Questions? Type them in the chat box and we will answer them at the end.
Posting to social? Use #AWSandChartio

Housekeeping

Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour

Amazon Redshift
What is Amazon Redshift?

For those unfamiliar with Amazon Redshift, it is a fast, fully managed, petabyte-scale data warehouse for less than $1000 per terabyte per year.

Fast, cost-effective, and easy to use: launch a cluster in a few minutes and scale with the push of a button.


Amazon Redshift is easy to use
Provision in minutes

Monitor query performance

Point and click resize

Built in security

Automatic backups

Redshift is not only cheaper but also easy to use. Provisioning takes 15 minutes.

Amazon Redshift System Architecture

(Diagram: SQL clients and BI tools connect to the leader node over JDBC/ODBC; the leader node talks to the compute nodes over 10 GigE (HPC); each compute node has 128 GB RAM, 16 TB disk, and 16 cores; ingestion, backup, and restore go through S3 / EMR / DynamoDB / SSH.)

The Amazon Redshift view of data warehousing

10x cheaper
Easy to provision
Higher DBA productivity
10x faster
No programming
Easily leverage BI tools, Hadoop, machine learning, streaming
Analysis in-line with process flows
Pay as you go, grow as you need
Managed availability & DR

Enterprise | Big Data | SaaS

The legacy view of data warehousing...

Global 2,000 companies
Sell to central IT
Multi-year commitment
Multi-year deployments
Multi-million dollar deals


Leads to dark data
This is a narrow view.

Small companies also have big data (mobile, social, gaming, adtech, IoT)

Long cycles, high costs, administrative complexity all stifle innovation


New SQL Functions
We add SQL functions regularly to expand Amazon Redshift's query capabilities.

Added 25+ window and aggregate functions since launch, including:
LISTAGG
[APPROXIMATE] COUNT
DROP IF EXISTS, CREATE IF NOT EXISTS
REGEXP_SUBSTR, _COUNT, _INSTR, _REPLACE
PERCENTILE_CONT, _DISC, MEDIAN
PERCENT_RANK, RATIO_TO_REPORT
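To give a flavor of a couple of these, here is a minimal sketch; the page_views table and its columns are hypothetical, not part of the talk:

-- Approximate (HyperLogLog-based) distinct counts per market
SELECT market, APPROXIMATE COUNT(DISTINCT user_id) AS approx_visitors
FROM page_views
GROUP BY market;

-- Pull a host-like token out of a URL with REGEXP_SUBSTR
SELECT url, REGEXP_SUBSTR(url, '[a-z0-9.-]+[.][a-z]+') AS domain
FROM page_views
LIMIT 10;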

We'll continue iterating, but we also want to enable you to write your own.

Scalar User-Defined Functions
You can write UDFs using Python 2.7
Syntax is largely identical to PostgreSQL UDF syntax
System and network calls within UDFs are prohibited

Comes with Pandas, NumPy, and SciPy pre-installed
You'll also be able to import your own libraries for even more flexibility

Scalar UDF Example

CREATE FUNCTION f_hostname (url VARCHAR)
RETURNS VARCHAR
IMMUTABLE AS $$
import urlparse
return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;

Rather than using complex regular expressions, you can import standard Python URL-parsing libraries and use them in your SQL.
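Calling it then looks like any other SQL function. A minimal usage sketch, assuming a hypothetical weblogs table with a url column:

SELECT url, f_hostname(url) AS hostname
FROM weblogs
LIMIT 10;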

Analytics for Everyone
The best platform for everyone to explore and visualize data


The Smartest Companies Use Chartio


Legacy BI

Expensive to set up

Expensive to maintain

Requires technical skills

Creates a bottleneck

Limits your ability to make decisions

Modern BI

Faster time to value

Easier to maintain

Modes for both technical and non-technical users

Alleviates bottlenecks

Enhances your ability to make decisions

Chartio's Modern Architecture


The Chartio Schema Editor

Team-specific schemas

Rename tables/columns

Hide tables/columns

Define custom tables/columns

Define data types and foreign keys


Schema Editor Live Demo


UDFs (A Brave New World)
Using UDFs in the Real World
Ian Eaves - Data Scientist

The Land Between SQL and Analysis

While SQL is a phenomenal tool for data extraction, it's either painful or impossible to work with for analysis.

General-purpose programming languages like Python, on the other hand, are better suited to analysis and visualization but more difficult to use for pure extraction.

Into this gap, services like Chartio have emerged, providing extended visualization and analysis options that would usually be accomplished with those more traditional programming tools.

The Land Between SQL and Scripts

UDF

UDFs begin to bridge this gap by providing limited Python functionality within the scope of your standard SQL toolbox.

A Little About Bellhops

On-demand moving and labor company
Self-scheduling capacity (a la Uber)
Located in 83 markets

Because our labor supply (lovingly referred to as Bellhops) is free to set its own schedule, understanding the health of a market is extremely important.

Too many Bellhops chasing too little work yields high churn and inexperienced laborers.

On the other hand, having only a handful of Bellhops might be sufficient to service demand in small or growing markets. However, this dynamic is unstable: what happens if a Bellhop decides to take a month off? How will the market respond to the sudden spikes in demand that happen during the summer?

One of the measures we use to determine when a market has entered an unstable dynamic like this is the Herfindahl Index.

Herfindahl Index: $H = \sum_{i=1}^{N} s_i^2$, where $N$ is the number of market actors and $s_i$ is the market share of the $i$th actor.

There's no need to linger on this, but the Herfindahl Index is the sum of the squared market shares of the actors in a market.

So, for example, if we were looking at the soda industry, the actors in that market would be Coca-Cola, Pepsi, Fanta, etc. (For instance, two actors with shares of 0.6 and 0.4 would give an index of 0.6^2 + 0.4^2 = 0.52.)

More important is how it's used (next slide).

Herfindahl Index

It's found common usage in economics to determine whether an industry has become monopolistic and to what degree that might be the case.

We at Bellhops use a similar idea to measure the concentration of work amongst our labor supply in each market.

A combination of this metric and UDFs allows us to provide real-time feedback to the organization that would otherwise require separate extraction, processing, and analysis steps.

Take our index as an example. We are interested in knowing if there have been any sudden changes in indexed concentration for any of our markets.

This can be used as a call to action for our market health team. So how do we do that?

Market Health Feedback Process

(Diagram: monthly calculations → data warehouse → UDFs → users.)

First, we calculate the current state of each market every day, week, month, etc., as part of our ETL process.
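As a rough sketch of what that calculation can look like in plain SQL, each Bellhop's share of a market's monthly jobs is squared and summed. The job_fact table and its columns here are hypothetical, not Bellhops' actual schema:

WITH shares AS (
    -- Each Bellhop's share of the jobs completed in a market for a given month
    SELECT market_name,
           DATE_TRUNC('month', completed_at) AS month_start,
           bellhop_id,
           COUNT(*)::float
               / SUM(COUNT(*)) OVER (PARTITION BY market_name, DATE_TRUNC('month', completed_at)) AS share
    FROM job_fact
    GROUP BY market_name, DATE_TRUNC('month', completed_at), bellhop_id
)
-- The sum of squared shares is the Herfindahl index per market per month
SELECT market_name, month_start, SUM(share * share) AS herfindahl_index
FROM shares
GROUP BY market_name, month_start;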


These values are then fed from our warehouse into psql, Chartio, or any other tool of your choice. In our case, our end users (the Market Health Team) interact with data primarily through Chartio, so that's where it will sit.


We next feed these values into a Python UDF, which in this case is an implementation of Student's t-test.

A t-test effectively allows you to determine whether a value differs significantly from its historical distribution and the degree to which it differs. The t-distribution is especially important when the number of samples being used is small.

In this case, we will be determining whether a market's concentration differs significantly from its historical observations (say, the past six months).


Finally, these significance warnings are surfaced directly to the relevant users through pre-made Chartio dashboards so they can take action when necessary.

T-test UDF

create function f_t_test (val float, mean float, stddev float, n_samps float, alpha float)
returns varchar
stable
as $$
from scipy.stats import t

df = n_samps - 1
tval = (mean - val) / stddev
p = t.sf(abs(tval), df) * 2  # two sided

if p < alpha:
    # a value below the historical mean (lower concentration) reads as 'Better'
    return 'Better' if tval > 0 else 'Worse'
else:
    return 'No Change'
$$ LANGUAGE plpythonu;

Finally, here is our actual UDF. This is an implementation of a two-sided t-test with a couple of notable features.

We were able to make use of prebuilt Python functionality like SciPy. So long as our UDF exclusively uses data available immediately within its scope (unfortunately meaning no disk or network access), we have all the power of Python at our fingertips. That means things like complicated conditional logic can be trivially implemented, bypassing otherwise clumsy SQL.
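For a quick sanity check, the UDF can also be called directly on literal values; the numbers below are made up purely for illustration:

-- current value, historical mean, historical stddev, number of samples, alpha
SELECT f_t_test(0.52, 0.31, 0.05, 6, 0.2);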

Example Table Schema

market_month_dimension: id, date, herfindahl_index, ...

market_fact: id, month_key, market_name, ...

Let's just take a toy model of two tables in our data warehouse. The first is a fact table containing a market name and a month_key foreign key to the market_month_dimension table.

The market_month_dimension table contains a variety of statistics calculated monthly for each market, one of which is the herfindahl_index.
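For concreteness, here is a minimal DDL sketch of that toy schema; the column types are assumptions, and the columns the slide elides are left out:

CREATE TABLE market_month_dimension (
    id               INT,
    date             DATE,
    herfindahl_index FLOAT
    -- ... the other monthly market statistics
);

CREATE TABLE market_fact (
    id          INT,
    month_key   INT,         -- joins to market_month_dimension
    market_name VARCHAR(64)
    -- ... the other fact columns
);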

Our Query

WITH market_stats AS (
    SELECT market_name,
           date,
           mmd.herfindahl_index,
           avg(herfindahl_index) OVER (PARTITION BY market_name ORDER BY date
               ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) AS avg,
           stddev_samp(herfindahl_index) OVER (PARTITION BY market_name ORDER BY date
               ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) AS stddev
    FROM market_fact
    LEFT JOIN (
        SELECT herfindahl_index, month_key AS join_key, date
        FROM market_month_dimension
    ) AS mmd ON join_key = market_fact.month_key
    GROUP BY market_name, date, month_key, mmd.herfindahl_index
    ORDER BY date
)
SELECT market_name, date, herfindahl_index, avg, stddev,
       f_t_test(herfindahl_index, avg, stddev, 6, .2)
FROM market_stats
WHERE market_name = 'Atlanta'
ORDER BY market_name, date;

With our UDF and schema in hand we can now execute a query!

In this case, we are using our t-test UDF to determine in which months Atlanta's Herfindahl index changed dramatically compared to the past six months.

As you can see, actually using the UDF is extremely simple; it behaves as if it were any other function. The majority of the hard work lies in constructing the temporary table containing the six-month moving average and standard deviation.

Sample Result

General Thoughts

UDF Use Cases

By their scalar nature, UDFs are in some sense reflective rather than prescriptive. We found that reflective nature to be most useful in support of the analytics being performed by our BI team.

They are additionally useful when cumbersome SQL expressions might be simplified by an equivalent Python library or representation. Things like the following:


Complicated Conditional Logic

Complicated conditional logic that would otherwise be clumsy to express in pure SQL.
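As one hypothetical illustration (not from the talk), a handful of tiering rules reads more naturally as Python in a UDF than as nested CASE expressions:

CREATE FUNCTION f_market_tier (herfindahl FLOAT, monthly_jobs INT)
RETURNS VARCHAR
IMMUTABLE AS $$
# Hypothetical thresholds, purely for illustration
if herfindahl is None or monthly_jobs is None:
    return 'unknown'
if monthly_jobs < 20:
    return 'emerging'
if herfindahl > 0.25:
    return 'concentrated'
return 'healthy'
$$ LANGUAGE plpythonu;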


Text Processing

Text processing, especially when the equivalent regular expression is complicated or contains numerous edge cases (URLs, emails, etc.).
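In the same spirit as the f_hostname example earlier, a hypothetical sketch that pulls the domain out of an email address:

CREATE FUNCTION f_email_domain (email VARCHAR)
RETURNS VARCHAR
IMMUTABLE AS $$
# Return the part after the last '@', or None for malformed input
if email is None or '@' not in email:
    return None
return email.rsplit('@', 1)[1].lower()
$$ LANGUAGE plpythonu;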


Basic Statistical Analysis

And doing basic statistical analysis.

Thank You

Questions?

Chartio: AJ Welch, [email protected]

Bellhops: Ian Eaves, [email protected]

AWS: Brandon Chavis, [email protected], aws.amazon.com