dataprep-the easiest way to prepare data in python

41
Jiannan Wang DataPrep - The easiest way to prepare data in Python Simon Fraser University Apr 21, 2021, Thomson Reuters

Upload: others

Post on 20-May-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DataPrep-The easiest way to prepare data in Python

Jiannan Wang

DataPrep - The easiest way to prepare data in Python

Simon Fraser University

Apr 21, 2021, Thomson Reuters

Page 2: DataPrep-The easiest way to prepare data in Python

FromModel-Centric to Data-Centric

2https://analyticsindiamag.com/big-data-to-good-data-andrew-ng-urges-ml-community-to-be-more-data-centric-and-less-model-centric/

Page 3: DataPrep-The easiest way to prepare data in Python

Data Preparation Is Still the Bottleneck!!!

https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

https://www.anaconda.com/state-of-data-science-2020

2014

3

2020

Page 4: DataPrep-The easiest way to prepare data in Python

Why Is Data Preparation Hard?

Collection Cleaning Integration Analysis

How much time is spent on preparation?

1. Too many small problems (e.g., standardize date, dedup address, etc)

2. Humans have different levels of expertise (in data science and programming)

3. Domain specific (finance, social science, healthcare, economics, etc.)

4

Page 5: DataPrep-The easiest way to prepare data in Python

Human-in-the-loop Data PreparationThree Directions

• Spreadsheet GUI

• Workflow GUI

• Notebook GUI

5

Page 6: DataPrep-The easiest way to prepare data in Python

Spreadsheet GUI

6

Page 7: DataPrep-The easiest way to prepare data in Python

7

Workflow GUI

Page 8: DataPrep-The easiest way to prepare data in Python

Notebook GUI

8

Page 9: DataPrep-The easiest way to prepare data in Python

Which Direction To Go?

Source: https://www.verifiedmarketresearch.com/product/data-prep-market/

Data Prep Market was valued at USD 3.29 Billion in2019 and is projected to reach USD 18.11 Billion by2027, growing at a CAGR of 25.64% from 2020 to 2027“ ”Three Directions

• Spreadsheet GUI

• Workflow GUI

• Notebook GUI9

Targeted at non-programmers

Targeted at data scientists

Page 10: DataPrep-The easiest way to prepare data in Python

Our Vision

Machine Learning Made Easy

Data Preparation Made Easy

Deep Learning Made Easy

Big Data Made Easy

Visualization Made Easy

10

Page 11: DataPrep-The easiest way to prepare data in Python

DataPrep Components

DataPrep.EDA

DataPrep.Connector

DataPrep.Clean

DataPrep.Feature

Simplify Web Data Collection

Simplify Exploratory Data Analysis

Simplify Data Cleaning

DataPrep.Integrate

Simplify Feature Engineering

Simplify Data Integration

Planning

11

May 2019 - Now

Nov 2019 - Now

Sept 2020 - Now

Planning

Page 12: DataPrep-The easiest way to prepare data in Python

User Feedback

https://www.reddit.com/r/Python/comments/hlqnim/understand_your_data_with_a_few_lines_of_code_in/ 12

. . .

Page 13: DataPrep-The easiest way to prepare data in Python

Talk Outline

1. DataPrep Overview

2. Dive into DataPrep

• DataPrep.EDA

• DataPrep.Connector

3. Future Direction

13

Page 14: DataPrep-The easiest way to prepare data in Python

DataPrep.EDATask-Centric Exploratory Data Analysis

14Jinglin Peng*, Weiyuan Wu*, et al.DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. SIGMOD 2021

Page 15: DataPrep-The easiest way to prepare data in Python

Exploratory Data Analysis (EDA)Understand data and discover insights

via data visualization, data summarization, etc.

15

Understand “Age” column

Page 16: DataPrep-The easiest way to prepare data in Python

Current EDA Solutions in Python

16

Solution 1: Pandas + Matplotlib

L Hard to Use

• Beginner: Need to know how to write plotting code

• Expert: Need to write lengthy and repetitive code

Write Code Write Code Write CodeUnderstand “Age” column

Page 17: DataPrep-The easiest way to prepare data in Python

Current EDA Solutions in Python

17

Solution 2: Pandas-profiling

L Slow

L Hard to Customize

Page 18: DataPrep-The easiest way to prepare data in Python

DataPrep.EDA Design Goals

18

EDA Solutions Easy to Use InteractiveSpeed

Easy toCustomize

1. Pandas + Matplotlib L J J

2. Pandas-profiling J L L

3. DataPrep.EDA J J J

Page 19: DataPrep-The easiest way to prepare data in Python

Key Idea

19

Task-Centric API Design

• Declarative

• Support both coarse-grained and fine-grained EDA tasks

Example• plot(df): “I want to see an overview of the dataset”

• plot_missing(df): “I want to understand the missing values of the dataset”

• plot(df, x): “I want to understand the column x”

• plot(df, x, y): “I want to understand the relationship between x and y”

• …

Page 20: DataPrep-The easiest way to prepare data in Python

DataPrep.EDA (Demo)

20

Page 21: DataPrep-The easiest way to prepare data in Python

Under the Hood

21

??

??

Mapping Rules

Data Processing Pipeline

Page 22: DataPrep-The easiest way to prepare data in Python

Mapping RulesN = Numerical, C = Categorical

22

[1] https://www.data-to-viz.com/[2] Exploratory data analysis with R[3] Missingno: a missing data visualization suite…

Page 23: DataPrep-The easiest way to prepare data in Python

Data Processing Pipeline

ConfigManager

Compute Module

Render Module

Intermediates

Config1

2

Data

3

23

Page 24: DataPrep-The easiest way to prepare data in Python

Interactive Speed

24

Ubuntu 16.04 Linux server with 64 GB memory and 8 Intel E7-4830 cores

Page 25: DataPrep-The easiest way to prepare data in Python

Efficiency ComparisonDataPrep.EDA vs Pandas-Profiling

Pandas-Profiling DataPrep.EDA

25

Page 26: DataPrep-The easiest way to prepare data in Python

DataPrep.EDA TakeawaysInnovation

The first task-centric EDA system in Python

Achieve three design goalsEasy to useInteractive speedEasy to customize

26

Page 27: DataPrep-The easiest way to prepare data in Python

DataPrep.ConnectorA Unified API Wrapper

to Simplify Web Data Collection

27

Page 28: DataPrep-The easiest way to prepare data in Python

Data Collection Through Restful APIs

Business DataSocial Data

Event Data Publication Data

28

Page 29: DataPrep-The easiest way to prepare data in Python

Restful API Example

29

Page 30: DataPrep-The easiest way to prepare data in Python

Restful API WrapperWrap API calls into Easy-to-Use Python Functions

. . .30

Page 31: DataPrep-The easiest way to prepare data in Python

Build a New API Wrapper is Tedious!

Authorization

HTTP Connection

Pagination

Concurrency

Result Parsing

…...

Connect to the website server

Handle authorization schemes

Request data from multiple pages

Retrieve data in parallel with less time

Convert Json string to Pandas Dataframe

…...

31

Page 32: DataPrep-The easiest way to prepare data in Python

Yelp Spotify TwitterYoutube Wiki Facebook

IMDb Pinterest WalmartNY Times Reddit

...

If we don’t unify API wrappers, then ...

32

Page 33: DataPrep-The easiest way to prepare data in Python

Yelp Spotify TwitterYoutube Wiki Facebook

IMDb Pinterest WalmartNY Times Reddit

...

If we don’t unify API wrappers, then ...

● Bad for developers (repetitive building efforts)

● Bad for users (burden to learn many API wrappers)

33

Page 34: DataPrep-The easiest way to prepare data in Python

Reusable Components

DataPrep.ConnectorA Unified API Wrapper

Yelp Config File

Spotify Config File

Youtube Config File

Twitter Config File

DBLP Config File

Facebook Config File

Reddit Config File

….

Configuration Files

Good for developers (No repetitive building efforts)

34

Page 35: DataPrep-The easiest way to prepare data in Python

The Unified API

Good for users (No burden to learn many API wrappers)

35

Page 36: DataPrep-The easiest way to prepare data in Python

DataPrep.Connector (Demo)

36

Page 37: DataPrep-The easiest way to prepare data in Python

DataPrep.Connector TakeawaysInnovation

The first unified API Wrapper in Python

Good For DevelopersSpeed up wrapper development process

Good For UsersSpeed up data collection from Web APIs

37

Page 38: DataPrep-The easiest way to prepare data in Python

Talk Outline

1. DataPrep Overview

2. Dive into DataPrep

• DataPrep.EDA

• DataPrep.Connector

3. Future Direction

38

Page 39: DataPrep-The easiest way to prepare data in Python

Future DirectionDataPrep.EDA• Make plots look attractive

• Understand multiple dataframes (plot_diff, plot_db, …)

DataPrep.Connector

• Speed up read_sql() with arrow and parallel connection

39

(http://cx.dataprep.ai)

Page 40: DataPrep-The easiest way to prepare data in Python

Future DirectionDataPrep.Clean• Goal: Implement 100+ clean_{type}(df, x) functions

• Example: clean_email, clean_date, clean_phone, clean_country, etc.

• Application: Data Validation, Data Standardization, Semantic Type Detection

40

Page 41: DataPrep-The easiest way to prepare data in Python

The easiest way to prepare data in Python

http://dataprep.ai

pip install –U dataprep

41