Ronald Menich, Chief Data Scientist, Predictix, LLC, at mlconf NYC

Retail Demand Forecasting with Machine Learning

Ronald P. (Ron) Menich

mlconf NYC 27 Mar 2015

GO, TEAM!

▪ Syrine Besbes
▪ Wafa Hwess
▪ Rihab Ben Aicha
▪ Abhijit Oka
▪ Mark Tabladillo
▪ Ahmed Yassine Khaili


▪ Nikolaos Vasiloglou
▪ Eugene Kamarchik
▪ Kurt Stirewalt
▪ Andy Dean
▪ Firas Aloui
▪ Molham Aref
▪ Rafael Gonzalez-Coloni

Forgive me if I’ve missed someone

PREDICTIX’ CORE RETAIL DECISION SUPPORT OFFERINGS

▪ Planning
  ▪ Assortment Planning
  ▪ Merchandise Financial Planning
  ▪ Item Planning

▪ Forecasting
  ▪ Machine-learning models
  ▪ All demand drivers
    ▪ Internal (promo, price, etc.)
    ▪ External (weather, competition, events, etc.)

▪ Supply Chain Optimization
  ▪ Network flow optimization
  ▪ Optimize for profit


GETTING DEMAND FORECASTING RIGHT TRANSLATES TO $$$

▪ Size of the problem
  ▪ 62 billion weekly forecasts (150K active SKUs × 8,000 stores × 52 weeks)
  ▪ Many TBs of data
  ▪ 3,000 computing cores, elastically provisioned

▪ Forecast accuracy
  ▪ Measured 25% to 50% reduction in MAPE
  ▪ The harder the problem, the better the improvement
  ▪ Measured reduction of bias in forecasts

▪ Benefits
  ▪ $125M from inventory reductions alone
  ▪ 20% ongoing benefit
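As a rough illustration of the accuracy measures named on this slide, the sketch below computes MAPE and forecast bias and checks the problem-size arithmetic. The definitions are the common textbook ones and are assumed here; the slides do not state which variants were used. All sales numbers are hypothetical.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error over periods with non-zero actual demand."""
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    return sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def bias(actuals, forecasts):
    """Total forecast minus total actual, as a fraction of total actual."""
    return (sum(forecasts) - sum(actuals)) / sum(actuals)


# Hypothetical weekly unit sales and two competing forecasts.
actuals = [100, 80, 120, 50]
baseline = [130, 60, 100, 75]
improved = [120, 70, 100, 60]

# Relative reduction in MAPE when moving from the baseline to the improved forecast.
mape_reduction = 1 - mape(actuals, improved) / mape(actuals, baseline)

# Problem size from the slide: SKUs x stores x weeks.
n_forecasts = 150_000 * 8_000 * 52
```

A positive `bias` means the forecast overshoots demand in aggregate; a value near zero means over- and under-forecasts roughly cancel.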


IN THE BEGINNING, DEMAND FORECASTING SEEMED SIMPLE...


Time-series forecasting

…BUT THEN EVER GREATER COMPLEXITY AROSE


A Last year’s sales

B Manual partitioning of data, different TS models for different partitions

C Croston’s for sparse, Winters for dense

D Forecast at aggregate levels, spread down

J if/then/else assignment of different TS algorithms

...

N Have user manually map a new SKU to an existing one

...

O Have user manually inject local market knowledge

L Linear regression for promotions

Alarm Clock: Demand forecasts. But are they really “simple”?

…AND SO NOW WE ASK THE QUESTION


A Last year’s sales

B Manual partitioning of data, different TS models for different partitions

C Croston’s for sparse demand, Winters for dense

D Forecast at different hierarchical levels, spread down

J Automated if/then/else assignment of different TS algorithms

...

N Have user manually map a new SKU to an existing one

...

O Have user manually inject local market knowledge

L Linear regression for promo

Alarm Clock: Demand forecasts. But are they really “simple”?

REALLY?

Machine learning can provide a modern, simpler, theoretically sound and more extensible alternative for retail demand forecasting

CAUSAL FACTORS DRIVE RETAIL DEMAND

How much additional demand was generated for Post Cereals because these were on promotion?

How much does the $4 in-store coupon contribute to the total uplift?

Does the table highlighting the $1.50 coupon and the final offer price drive any additional uplift?

Competition

Weather

SO AN ATTRIBUTE-BASED FORECASTING APPROACH IS APT

Inputs include:
• Product Attributes (including text descriptions, e.g. reviews)
• Hierarchies
• Competitor Data
• Promotions
• Pricing
• Display
• Store Attributes
• Local events
• Weather
• Customer data
• ...

CLOUD ELASTICITY

Machine Learning:
• 2-way interactions
• 3-way
• 4-way

Predictive Analytics: What If on price/promo/display changes

Demand Forecasts
▪ Basic products
▪ New products
▪ Short lifecycle
▪ Customer specific
▪ ...
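One way to picture the attribute-based approach is a one-hot encoding in which every (attribute, value) pair gets its own feature column. The sketch below is an assumption about how such inputs could be encoded, not the encoding used in the talk. The attribute names and values are hypothetical.

```python
def build_index(rows):
    """Assign one column per distinct (attribute, value) pair seen in training."""
    index = {}
    for row in rows:
        for key, value in row.items():
            index.setdefault((key, value), len(index))
    return index


def encode(row, index):
    """One-hot encode a row; unseen (attribute, value) pairs are dropped."""
    vec = [0.0] * len(index)
    for key, value in row.items():
        col = index.get((key, value))
        if col is not None:
            vec[col] = 1.0
    return vec


# Hypothetical attribute names and values.
training_rows = [
    {"brand": "A", "category": "cereal", "store_region": "NE", "on_promo": "yes"},
    {"brand": "B", "category": "cereal", "store_region": "SE", "on_promo": "no"},
]
index = build_index(training_rows)

# A new SKU has no sales history, but its attributes map onto known columns.
new_sku = {"brand": "B", "category": "cereal", "store_region": "NE", "on_promo": "yes"}
x_new = encode(new_sku, index)
```

Because the model learns weights per attribute value rather than per SKU, a product with no history can still be scored from its attributes. Numeric inputs such as price would typically enter as real-valued columns instead of one-hot flags.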

POSSIBLE SUPERVISED LEARNING MODELS


Random forests

Restricted Boltzmann machines

Deep learning

We chose factorization machines for several reasons

● Linear regression heritage of market mix modeling

● SGD/online suitability for handling large data sets

● Trend can be modeled
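For readers unfamiliar with factorization machines, here is a minimal NumPy sketch of the standard second-order model (Rendle, 2010) with a single SGD step on squared error. It illustrates the linear-regression heritage (the `w0 + w @ x` part) and the online update. It is not the speaker's implementation; the sizes, learning rate and data are hypothetical.

```python
import numpy as np


def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction for one feature vector x.

    y = w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j
    The pairwise term is computed in O(k * n) instead of O(k * n^2).
    """
    linear = w0 + w @ x
    xv = V.T @ x                  # shape (k,)
    x2v2 = (V ** 2).T @ (x ** 2)  # shape (k,)
    return linear + 0.5 * np.sum(xv ** 2 - x2v2)


def fm_sgd_step(x, y, w0, w, V, lr=0.01):
    """One SGD update on 0.5 * (prediction - y)^2; returns updated parameters."""
    err = fm_predict(x, w0, w, V) - y
    xv = V.T @ x
    # d(pairwise)/dV[i, f] = x_i * sum_j V[j, f] x_j - V[i, f] * x_i^2
    grad_V = np.outer(x, xv) - V * (x ** 2)[:, None]
    return w0 - lr * err, w - lr * err * x, V - lr * err * grad_V


# Hypothetical one-hot feature vector and target demand.
rng = np.random.default_rng(0)
n_features, k = 7, 3
x = np.array([0, 1, 1, 1, 1, 0, 0], dtype=float)
w0, w = 0.0, np.zeros(n_features)
V = 0.1 * rng.normal(size=(n_features, k))
for _ in range(100):
    w0, w, V = fm_sgd_step(x, 4.0, w0, w, V, lr=0.05)
```

Each row of `V` is a latent vector for one feature, so an interaction between two attribute values can be estimated even if that exact pair was rarely observed together.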

ZERO-FILLING --- KNOWING WHY DEMAND DID AND DIDN’T OCCUR AND WHEN

● Unlike for product recommender systems, retail forecasting must predict the timing of when demand will happen (not just the rating whenever it happens)

● An observation of sales might have (sku,store,day) primary key
  ○ Was the product on the shelf, available to be sold?
  ○ How much was sold, if any?

● In many retail contexts, the vast majority of observations have zero sales
  ○ Recent example: zero sales observations account for >97.5% of the training set
  ○ It is important to know why demand was zero


Extreme case: demand only occurs when there’s a discount
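The zero-filling idea can be sketched as follows: transaction logs usually record only non-zero sales, so explicit zero rows are added for every (sku, store, day) where the product was available but did not sell. This is a minimal sketch under assumed data structures; the function name, the availability set and the sample keys are hypothetical.

```python
from itertools import product


def zero_fill(sales, availability, skus, stores, days):
    """Expand sparse sales records into one row per available (sku, store, day).

    sales:        dict mapping (sku, store, day) -> units sold (non-zero rows only)
    availability: set of (sku, store, day) keys where the product was on the shelf
    """
    rows = []
    for key in product(skus, stores, days):
        if key in sales:
            rows.append((*key, sales[key]))
        elif key in availability:
            # On the shelf but nothing sold: a genuine zero-demand observation.
            rows.append((*key, 0))
        # Otherwise the product could not be sold, so a zero would carry no
        # demand signal and the row is left out of the training set.
    return rows


# Hypothetical data: one toy, two stores, three days.
sales = {("toy-1", "s1", 2): 4}
availability = {
    ("toy-1", "s1", 1), ("toy-1", "s1", 2), ("toy-1", "s1", 3),
    ("toy-1", "s2", 1),
}
rows = zero_fill(sales, availability, ["toy-1"], ["s1", "s2"], [1, 2, 3])
```

Separating "available but unsold" from "not available" is what lets a model attribute zeros to causes such as the absence of a discount instead of treating every gap as missing data.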

EXAMPLE FORECASTS - TOYS


Training set

Test set

EXAMPLE FORECASTS - SEASONAL GROCERY ITEM


Training on the left and middle

One month of holdout / test at the very right

EXAMPLE FORECASTS - QUICK SERVICE RESTAURANT


For very dense data (few zeros), almost unbiased forecasts with WAPE values below 12.5% can be achieved
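WAPE is not defined on the slide. Assuming the common definition (total absolute error divided by total actual demand), it can be sketched as below, with hypothetical numbers.

```python
def wape(actuals, forecasts):
    """Weighted absolute percentage error: sum |a - f| / sum |a|."""
    total = sum(abs(a) for a in actuals)
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / total


# Hypothetical daily unit sales, for illustration only.
actuals = [200, 0, 180, 220]
forecasts = [190, 5, 200, 210]
error = wape(actuals, forecasts)
```

Because the denominator is an aggregate, the measure stays defined when individual periods have zero sales, where a per-period percentage error would divide by zero.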

NEW SKUS CAN READILY BE FORECASTED


REPLACEMENT SKUS CAN BE READILY FORECASTED


CHALLENGES / ONGOING WORK

● Zero-filling / training set cardinality control using weighted least squares

● Global effects and 2-way interactions are easily trainable, but 3-way and higher-order interactions require judicious feature engineering

● Parallel learning / consensus of learners

● Visualization / explanation of hidden factors used for interaction modeling

● Automated pruning of non-important attributes
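The first challenge above does not spell out the method. One plausible reading, offered here only as an assumption, is to keep a random fraction of the zero-sales rows and up-weight them in a weighted least-squares fit so that they stand in for the dropped rows. The function names, the keep fraction and the toy data are hypothetical.

```python
import numpy as np


def downsample_zeros(X, y, keep_fraction, rng):
    """Keep all non-zero rows and a random fraction of the zero rows.

    Kept zero rows get weight 1 / keep_fraction so that, in expectation,
    they represent the zero rows that were dropped.
    """
    is_zero = (y == 0)
    keep = ~is_zero | (rng.random(len(y)) < keep_fraction)
    weights = np.where(is_zero, 1.0 / keep_fraction, 1.0)
    return X[keep], y[keep], weights[keep]


def weighted_least_squares(X, y, weights):
    """Solve min_b sum_i weights_i * (y_i - X_i b)^2 by scaling rows."""
    root = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(X * root[:, None], y * root, rcond=None)
    return coef


# Hypothetical design matrix: intercept plus a promotion flag.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(8), np.array([0, 0, 0, 0, 0, 0, 1, 1], dtype=float)])
y = np.array([0, 0, 0, 0, 0, 0, 3, 5], dtype=float)
Xs, ys, ws = downsample_zeros(X, y, keep_fraction=0.5, rng=rng)
coef = weighted_least_squares(Xs, ys, ws)
```

With more than 97.5% zeros in a training set, a scheme of this kind would shrink the row count substantially while keeping the fitted coefficients approximately unchanged in expectation.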


THANK YOU.

