sharing about my data science journey and what i do at lazada

Hi, I’m Eugene. I’m here to share about my data science journey and what I do at Lazada. 4th April 2016, SMU Masters of IT in Business

Upload: eugene-yan-ziyou

Post on 16-Apr-2017


Category:

Data & Analytics



TRANSCRIPT


Before I begin, any questions you would like addressed?

I’ll answer throughout my sharing.

An introduction about myself

Studied Psychology and Business at Singapore Management University (SMU); wanted to use data to create positive impact

Did economic and political analysis at Ministry of Trade & Industry (MTI)

Joined IBM to pursue passion in working with data

First step into data science as a data analyst, where I…

Developed dashboards and analytics for end-to-end supply chain optimization

Worked on an anti-money laundering and entity resolution system for a global bank

Collected and analyzed tweets to provide insight on tweet share and sentiment for an electronics conglomerate

Then, was transferred to workforce analytics team, working on data from IBM’s 450k employees to build…

Forecast models for global job demand to optimize recruitment and workforce allocation

Job recommendation engine to increase internal transfers, skill renewal, satisfaction, and reduce attrition

Currently at Lazada’s Data Science team; more later

My data science journey

Skill sets needed to be a data analyst and how I acquired them

Probability, statistics and experimental design from education in Psychology

Technical skills in SPSS Statistics and R from undergraduate education in Psychology

Written and verbal communication from essays and presentations (SMU), and briefs and stakeholder engagement with industry leaders (MTI)

Teamwork from projects in SMU and MTI

Skill sets needed to be a data scientist and how I acquired them

- Statistics
- Experimental Design
- SPSS & R
- Communication
- Teamwork

More R via MOOCs:
- Data Analysis and Statistical Inference (Duke)
- Computing for Data Analysis (Johns Hopkins)

Python via MOOCs:
- Computer Science and Programming in Python (MIT)
- Interactive Programming in Python (Rice)

SQL via any site with in-browser query engine

Machine Learning via MOOCs:
- Machine Learning (Stanford)
- Statistical Learning (Stanford)
- Social and Economic Networks (Stanford)
- Text Mining and Analytics (Urbana-Champaign)

Distributed storage and processing via MOOCs:
- Mining Massive Datasets (Stanford)
- Big Data with Apache Spark (UC Berkeley)
- Scalable Machine Learning with Apache Spark (UC Berkeley)

Learning alone is insufficient; I also had to practice (a lot)

Volunteer for things people don’t want to do
- Volunteered for project on Twitter tracking with $0 budget

Twitter project: Connect to API, download tweets 24/7 over 2 weeks, analyze tweets; learnt how to:
- Work with APIs
- Recover from failure automatically
- Work with data that can’t fit in memory
- Do text analytics and sentiment analysis
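Two of the lessons above, recovering from failure automatically and working with data that can't fit in memory, can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function names `with_retries` and `count_words` are hypothetical, no real Twitter API call is made, and the storage format (one JSON object per line with a `text` field) is an assumption.

```python
import json
import time
from collections import Counter


def with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


def count_words(lines):
    """Tally words one tweet at a time, so the full dataset is never in memory.

    `lines` is any iterable of strings, e.g. an open file handle.
    Assumes one JSON object per line with a "text" field.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        counts.update(tweet["text"].lower().split())
    return counts
```

Passing an open file handle to `count_words` streams the file line by line, which is what keeps memory use flat however many tweets were collected.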

Volunteer with DataKind SG, helping NGOs tackle problems through data science

Volunteer to facilitate Johns Hopkins Data Science Specialization (Statistical Inference)

Kaggle meaningfully on competitions with real-world applications; competitions I’ve tried include…

Otto Product Classification: Classify products into 9 main product categories

Springleaf Marketing Response: Predict if customers will respond to direct mail

Telstra Network Disruptions: Predict severity of service disruption

Skill sets to be a better data scientist (what I’m focusing on now)

- Statistics
- Experimental Design
- SPSS & R
- Communication
- Teamwork

- Python
- SQL
- Machine Learning
- Distributed Storage & Processing

Finding problems and opportunities people overlook

Proper software engineering

Designing and building data products end-to-end

Building data products using Spark (Scala)

My journey so far…

- Statistics
- Experimental Design
- SPSS & R
- Communication
- Teamwork

- Python
- SQL
- Machine Learning
- Distributed Storage & Processing

- Finding use cases
- Software Engineering
- Designing data products
- Spark & Scala

So what can you do?
- Get very good at basic SQL
- Get very good at either R or Python
- Understand basic machine learning techniques
- Understand distributed systems and processing
- Improve communication by writing and sharing
- Get experience by doing projects on machine learning and distributed processing (e.g., Open data, Volunteering, Kaggle, etc)

What I do at Lazada

Lazada Data Science: Data Engineers, Scientists, Tool Developers

A rough guide to each role

- Engineers: Collect, store, maintain
- Scientists: Explore, prepare, model
- Tool Developers: Expose, integrate, platform-ize

Lines may blur between roles

Problems we work on…

Product-related:
- Product Categorization
- Attribute Extraction
- Spam Detection
- Image Quality Checking

Consumer-related:
- Recommendations
- Product Ranking
- Consumer Segmentation
- Customer Lifetime Value

Seller-related:
- Price Elasticity
- Detecting Counterfeits

Operation-related:
- Delivery time forecasting

What I’m working on

Product categorization (diagram labels):
- Product title & description
- Machine Learning Categorization
- Rules-based Categorization
- Crowd Categorization
- Product Category
- Quality Checking and Validation
- Sufficient confidence
- If insufficient confidence
- API for self-service
- Production
- Scheduled batch jobs
- Product Category
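One way to read the categorization flow is as confidence-based routing: accept the machine learning prediction when it is confident enough, otherwise fall back to other methods. The sketch below is only an illustration of that idea. The fallback order (machine learning, then rules, then crowd), the 0.8 threshold, and the names `categorize`, `ml_predict`, and `rules` are all assumptions, not details from the slides.

```python
def categorize(title, ml_predict, rules, threshold=0.8):
    """Return (category, source) for a product title.

    ml_predict: hypothetical callable returning (category, confidence in [0, 1]).
    rules: hypothetical list of (keyword, category) pairs, checked in order.
    threshold: assumed cut-off for "sufficient confidence".
    """
    category, confidence = ml_predict(title)
    if confidence >= threshold:
        return category, "ml"

    # Insufficient confidence: try simple keyword rules next.
    lowered = title.lower()
    for keyword, rule_category in rules:
        if keyword in lowered:
            return rule_category, "rules"

    # Nothing matched: leave the category open for crowd categorization.
    return None, "crowd"
```

Returning the source alongside the category makes it easy to measure how much traffic each path handles, which is useful when tuning the threshold.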

Product Ranking for onsite display (diagram labels):
- Product Data
- Purchase Data
- Behavioral Data (e.g., clickstream)
- Other Data (e.g., ratings, etc)
- Merging datasets
- Feature Engineering
- Model product rankings
- Data Cleaning
- Rule-based modifiers
- Measurement & A/B Testing
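For the measurement and A/B testing step, a common way to compare conversion rates between a control and a variant is a two-proportion z-test. The slides do not say which test is used, so treat this as a generic sketch; the function name and the example numbers in the usage note are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare two conversion rates; return (z, two-sided p-value).

    conv_a, conv_b: number of conversions in groups A and B.
    n_a, n_b: number of users exposed in groups A and B.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 130, 1000)` compares a 10% rate against a 13% rate with 1,000 users in each group.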

Recommendations for newsletter subscribers (diagram labels):
- Product Data
- Purchase Data
- Behavioral Data (e.g., clickstream)
- Other Data (e.g., ratings, etc)
- Merging datasets
- Feature Engineering
- Data Cleaning
- Customer Segmentation
- Forecasted Top Sellers
- Recommendations Newsletter Creation
- Measurement & A/B Testing
- Rule-based modifiers

How is my time spent

Majority of time spent coding (thankfully)

- Coding, 55%
- Engagement, 30%
- Others, 15%

Coding Breakdown

- Data Preparation, 50%
- Modeling, 20%
- Productionizing, 30%

Data Preparation
- Merging data
- Imputing nulls
- Removing duplicates
- Handling outliers
- Fixing formats
- Etc, etc, etc
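The data preparation steps listed above can be illustrated with a small standard-library Python sketch. Everything here is an assumption made for the example: the record layout (`id`, `title`), the `prices` lookup, median imputation, and the cap at 10x the median are illustrative choices, not the pipeline used at Lazada. It also assumes at least one product has a known price.

```python
from statistics import median


def prepare(products, prices):
    """Clean a list of product dicts and merge in prices by product id."""
    # Removing duplicates: keep the first record seen for each id.
    unique = {}
    for row in products:
        unique.setdefault(row["id"], dict(row))
    rows = list(unique.values())

    # Fixing formats: collapse whitespace and normalise casing in titles.
    for row in rows:
        row["title"] = " ".join(row["title"].split()).title()

    # Merging data: attach a price to each product (None when missing).
    for row in rows:
        row["price"] = prices.get(row["id"])

    # Imputing nulls: fill missing prices with the median of known prices.
    known = [r["price"] for r in rows if r["price"] is not None]
    fill = median(known)
    for row in rows:
        if row["price"] is None:
            row["price"] = fill

    # Handling outliers: cap prices at 10x the median (an arbitrary choice).
    cap = 10 * fill
    for row in rows:
        row["price"] = min(row["price"], cap)
    return rows
```

In practice the same steps are usually expressed with a dataframe library, but the order of operations and the decisions behind each step stay much the same.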

Building the model
- Feature engineering
- Machine learning
- Validation
- Iterate, iterate, iterate

Deploying to production
- Proof-of-concept
- Developing API
- Scheduling jobs
- Continuous integration
- Fixing bugs

Engagement (with stakeholders)
- Roadmap planning (quarterly)
- Aligning solution with problem
- Explaining and getting buy-in

Other tasks
- Providing assistance
- Research and brainstorming
- Team sharing

Any further questions?

[email protected]@lazada.com