feature engineering for diverse data types
![Page 1: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/1.jpg)
FEATURE ENGINEERING FOR DIVERSE DATA TYPES
Alice Zheng
October 10, 2016
Seattle PyLadies Meetup
1
![Page 2: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/2.jpg)
2
MY JOURNEY SO FAR
Shortage of expertise and good tools in the market.
Applied machine learning/data science
Build ML tools
Write a book
![Page 3: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/3.jpg)
3
MACHINE LEARNING IS USEFUL!
Model data.
Make predictions.
Build intelligent applications.
Play chess and go!
![Page 4: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/4.jpg)
4
THE MACHINE LEARNING PIPELINE
It is a puppy and it is extremely cute.
Raw data
Features
Models
Predictions
Deploy in production
![Page 5: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/5.jpg)
Models
![Page 6: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/6.jpg)
6
A SIMPLE MODEL

| X | Y | X and Y |
|---|---|---------|
| 1 | 1 | 1 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 0 |

f(x, y) = 0.5 x + 0.5 y - 1
g(x, y) = 1 if f(x, y) >= 0
          0 if f(x, y) < 0
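A minimal Python sketch of this model. The names f and g follow the slide; the `truth_table` variable is illustrative. One assumption to flag: at (1, 1) the score f is exactly 0, so the sketch thresholds with f >= 0 in order to reproduce the "X and Y" column.

```python
def f(x, y):
    """Linear score from the slide: 0.5*x + 0.5*y - 1."""
    return 0.5 * x + 0.5 * y - 1


def g(x, y):
    """Threshold the score. Assumption: >= is used so that (1, 1),
    where f is exactly 0, is classified as 1."""
    return 1 if f(x, y) >= 0 else 0


# Evaluate g on all four binary inputs, in the same order as the table.
truth_table = [(x, y, g(x, y)) for x in (1, 0) for y in (1, 0)]

for x, y, out in truth_table:
    print(f"X={x} Y={y} -> g={out}")
```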
![Page 7: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/7.jpg)
7
VISUALIZING A MODEL
[Figure: g(x, y) plotted over axes X and Y, with tick marks at 0 and 1.]
![Page 8: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/8.jpg)
8
FROM SIMPLE TO COMPLEX
Use more complicated functions
or
Stack layers of simple functions (e.g., deep neural nets)

[Diagram: stacked layers of simple functions]
Inputs: X1, X2, X3, …, Xn
Layer 1: r1(X1, X2), r2(X2, X3), …, rm(X1, Xn)
Layer 2: s1(r1, r2), s2(r1, r3), …, sm(rm-1, rm)
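As a hedged illustration of stacking, here is a small Python sketch that composes three threshold functions to compute XOR, which no single linear threshold can represent. The names r1, r2, and s1 echo the diagram's labels, but the weights are hand-picked for this example and are not taken from the talk.

```python
def step(z):
    """Simple threshold unit: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


# First layer: two simple functions of the raw inputs (hand-picked weights).
def r1(x1, x2):
    return step(x1 + x2 - 1)   # fires if at least one input is 1 (OR)


def r2(x1, x2):
    return step(x1 + x2 - 2)   # fires only if both inputs are 1 (AND)


# Second layer: a simple function of the first layer's outputs.
def s1(a, b):
    return step(a - b - 1)     # fires if a is 1 and b is 0


def xor(x1, x2):
    """Stacked model: OR and not AND."""
    return s1(r1(x1, x2), r2(x1, x2))


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))
```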
![Page 9: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/9.jpg)
9
BETWEEN RAW DATA AND MODELS
• Mathematical models take numeric input
• Raw data are not numeric (or not the right kind of numeric)
• Featurization: the step in-between
• Feature space: multi-dimensional numeric space where modeling happens
![Page 10: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/10.jpg)
Feature Generation
Feature: An individual measurable property of a phenomenon being observed.
⎯ Christopher Bishop, “Pattern Recognition and Machine Learning”
![Page 11: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/11.jpg)
TEXT
![Page 12: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/12.jpg)
12
TURNING TEXT INTO FEATURES
Raw text:
It is a puppy and it is extremely cute.

What are the important measures? Keywords? Verb tense? Subject, object?

Bag of words feature vector:

| Word | Count |
|------|-------|
| it | 2 |
| is | 2 |
| puppy | 1 |
| and | 1 |
| cat | 0 |
| aardvark | 0 |
| cute | 1 |
| extremely | 1 |
| … | … |
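A minimal sketch of how such a count vector could be produced in Python. The tokenizer (lowercasing plus a simple regular expression) and the names `bag_of_words`, `vocabulary`, and `vector` are assumptions made for illustration; the slide does not specify a tokenization scheme.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


bow = bag_of_words("It is a puppy and it is extremely cute.")

# A fixed vocabulary turns the counts into a vector; words that do not
# occur in the text (here "cat" and "aardvark") get a count of 0.
vocabulary = ["it", "is", "puppy", "and", "cat", "aardvark", "cute", "extremely"]
vector = [bow[word] for word in vocabulary]

for word, count in zip(vocabulary, vector):
    print(word, count)
```

Each vocabulary word becomes one dimension of the feature space, so a real vocabulary yields a very long and mostly zero vector.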
![Page 13: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/13.jpg)
13
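VISUALIZING BAG-OF-WORDS
[Figure: axes "puppy" and "cute", each with a tick mark at 1. The sentence "It is a puppy and it is extremely cute" is plotted as a point in this space.]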
![Page 14: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/14.jpg)
14
CLASSIFYING BAG-OF-WORDS
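[Figure: axes "puppy", "cat", and "have", with tick marks at 1 and 2. Four documents are plotted as points: "I have a puppy", "I have a cat", "I have a kitten", and "I have a dog and I have a pen". A decision surface is drawn through the space.]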
![Page 15: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/15.jpg)
Feature Cleaning and Transformation
![Page 16: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/16.jpg)
16
AUTO-GENERATED FEATURES ARE NOISY
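| Rank | Word | Doc Count | Rank | Word | Doc Count |
|------|------|-----------|------|------|-----------|
| 1 | the | 1,416,058 | 11 | was | 929,703 |
| 2 | and | 1,381,324 | 12 | this | 844,824 |
| 3 | a | 1,263,126 | 13 | but | 822,313 |
| 4 | i | 1,230,214 | 14 | my | 786,595 |
| 5 | to | 1,196,238 | 15 | that | 777,045 |
| 6 | it | 1,027,835 | 16 | with | 775,044 |
| 7 | of | 1,025,638 | 17 | on | 735,419 |
| 8 | for | 993,430 | 18 | they | 720,994 |
| 9 | is | 988,547 | 19 | you | 701,015 |
| 10 | in | 961,518 | 20 | have | 692,749 |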
Most popular words in Yelp reviews dataset (~ 6M reviews).
![Page 17: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/17.jpg)
17
AUTO-GENERATED FEATURES ARE NOISY
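| Rank | Word | Doc Count | Rank | Word | Doc Count |
|------|------|-----------|------|------|-----------|
| 357,480 | cmtk8xyqg | 1 | 357,470 | attractif | 1 |
| 357,479 | tangified | 1 | 357,469 | chappagetti | 1 |
| 357,478 | laaaaaaasts | 1 | 357,468 | herdy | 1 |
| 357,477 | bailouts | 1 | 357,467 | csmpus | 1 |
| 357,476 | feautred | 1 | 357,466 | costoso | 1 |
| 357,475 | résine | 1 | 357,465 | freebased | 1 |
| 357,474 | chilyl | 1 | 357,464 | tikme | 1 |
| 357,473 | cariottis | 1 | 357,463 | traditionresort | 1 |
| 357,472 | enfeebled | 1 | 357,462 | jallisco | 1 |
| 357,471 | sparklely | 1 | 357,461 | zoawan | 1 |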
Least popular words in Yelp reviews dataset (~ 6M reviews).
![Page 18: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/18.jpg)
18
FEATURE CLEANING
• Popular words and rare words are not helpful
• Manually defined blacklist – stopwords

| a | b | c | d | e | f | g | h | i |
|---|---|---|---|---|---|---|---|---|
| able | be | came | definitely | each | far | get | had | ie |
| about | became | can | described | edu | few | gets | happens | if |
| above | because | cannot | despite | eg | fifth | getting | hardly | ignored |
| according | become | cant | did | eight | first | given | has | immediately |
| accordingly | becomes | cause | different | either | five | gives | have | in |
| across | becoming | causes | do | else | followed | go | having | inasmuch |
| … | … | … | … | … | … | … | … | … |
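A short sketch of stopword filtering in Python. The `STOPWORDS` set below holds only a handful of entries copied from the table, so it is deliberately incomplete; the function name `remove_stopwords` and the whitespace tokenization are assumptions for illustration.

```python
# A tiny, incomplete stopword list drawn from the table above.
STOPWORDS = {
    "a", "able", "about", "above", "be", "because", "can", "do",
    "each", "get", "had", "has", "have", "i", "if", "in",
}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "I have a puppy".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

The list is fixed ahead of time, so no pass over the data is needed, but it also cannot adapt to a new domain or catch rare misspellings.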
![Page 19: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/19.jpg)
19
FEATURE CLEANING
• Frequency-based pruning
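One way frequency-based pruning could look in Python: count how many documents contain each word, then drop words that fall below a minimum count or above a maximum share of documents. The toy documents, the function names, and the thresholds `min_df` and `max_df_ratio` are all hypothetical and would need tuning on real data.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # set(): count each doc once
    return df


def prune_vocabulary(docs, min_df=2, max_df_ratio=0.9):
    """Keep words that appear in at least min_df documents and in at
    most max_df_ratio of all documents (both thresholds are assumptions)."""
    df = document_frequency(docs)
    n_docs = len(docs)
    return {
        word for word, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    }


# Hypothetical toy corpus for illustration only.
docs = [
    "i have a puppy",
    "i have a cat",
    "i have a cute puppy",
    "i have a cute cat",
    "i have a sparklely pen",
]
kept = prune_vocabulary(docs)
print(sorted(kept))
```

Here "i", "have", and "a" are dropped for being too common, while "sparklely" and "pen" are dropped for being too rare.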
![Page 20: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/20.jpg)
20
STOPWORDS VS. FREQUENCY FILTERS
| Stopwords | Frequency filters |
|-----------|-------------------|
| No training required | Adapts to data |
| Can be exhaustive | Also deals with rare words |
| Inflexible | Needs tuning, hard to control |

Both require manual attention
![Page 21: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/21.jpg)
21
FEATURE SCALING WITH TF-IDF
• Scaling “evens out” the features
  • A soft filter
• Tf-idf = term frequency x inverse document frequency
• Tf = number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
  • Large for uncommon words, small for popular words
• Discounts popular words, highlights rare words
![Page 22: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/22.jpg)
22
VISUALIZING TF-IDF
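[Figure: axes "puppy", "cat", and "have", with tick marks at 1 and 2. Four documents are plotted as points: "I have a puppy", "I have a cat", "I have a kitten", and "I have a dog and I have a pen".]

idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0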
![Page 23: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/23.jpg)
23
VISUALIZING TF-IDF
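[Figure: the same documents after tf-idf scaling, on axes "puppy", "cat", and "have". "I have a puppy" sits at log 4 on the puppy axis and "I have a cat" at log 4 on the cat axis. "I have a dog and I have a pen" and "I have a kitten" fall on the same point.]

tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0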
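The idf and tf-idf values on these two slides can be checked with a short Python sketch. It uses the four example sentences and the formula Idf = log(# total docs / # docs containing word w); the whitespace tokenization and the names `idf`, `tfidf`, and `vectors` are assumptions for illustration.

```python
import math
from collections import Counter

docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
tokenized = [d.lower().split() for d in docs]


def idf(word):
    """log(total docs / docs containing word). Assumes the word occurs
    in at least one document; otherwise this would divide by zero."""
    n_containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log(len(tokenized) / n_containing)


def tfidf(word, tokens):
    """Term frequency (raw count in this document) times idf."""
    return Counter(tokens)[word] * idf(word)


# Project every document onto the three axes used in the figures.
axes = ["puppy", "cat", "have"]
vectors = [[tfidf(word, tokens) for word in axes] for tokens in tokenized]

for doc, vec in zip(docs, vectors):
    print(doc, vec)
```

Because "have" appears in every document, its idf is 0 and that axis collapses, which is why the kitten and dog-and-pen sentences land on the same point in this three-word space.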
![Page 24: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/24.jpg)
IMAGES
![Page 25: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/25.jpg)
25
REPRESENTING IMAGES
What are the “semantic atoms” of images?
• Semantic atom = a unit of meaning
![Page 26: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/26.jpg)
26
COLOR HISTOGRAM
[Figure: two color histograms, each with bars for White and Blue at 40% and 60%.]
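A minimal sketch of a color histogram in Python, using toy images whose pixels are color names rather than RGB values. The images, the function name `color_histogram`, and the choice of which color gets the 40% share are assumptions for illustration. The point is that two images with different layouts can produce the same histogram.

```python
from collections import Counter


def color_histogram(image):
    """Return the fraction of pixels of each color in a 2D image."""
    pixels = [pixel for row in image for pixel in row]
    counts = Counter(pixels)
    return {color: n / len(pixels) for color, n in counts.items()}


W, B = "white", "blue"

# Two hypothetical 2x5 images: same colors, different spatial layout.
image_a = [[W, W, B, B, B],
           [W, W, B, B, B]]
image_b = [[B, W, B, W, B],
           [W, B, W, B, B]]

hist_a = color_histogram(image_a)
hist_b = color_histogram(image_b)
print(hist_a)
print(hist_b)
```

The histogram throws away where each pixel sits, so it cannot tell these two images apart.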
![Page 27: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/27.jpg)
27
INFORMATION ABOUT STRUCTURE
Collection of local patches encapsulates global structure
![Page 28: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/28.jpg)
28
IMAGE GRADIENTS AND ORIENTATION HISTOGRAM
• Color changes indicate edges, patterns, or texture
• Image gradient: direction of largest change in color, starting from a pixel
[Figure: gradient orientation bins at -135°, -90°, -45°, 0°, 45°, 90°, 135°, and 180°.]
• Gradient orientation histogram: indicates the prominent directions of color change in a patch of pixels
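A hedged Python sketch of a gradient orientation histogram for a small grayscale patch. Central differences, eight 45° bins, and weighting each vote by gradient magnitude are assumptions chosen for illustration; they are common choices but are not stated on the slide. The function name and the toy patch are hypothetical.

```python
import math


def gradient_orientation_histogram(patch, n_bins=8):
    """Histogram of gradient directions over the interior pixels of a
    2D grayscale patch. Bin i is centered at -180 + i * (360 / n_bins)
    degrees, so with 8 bins, bin 4 is centered at 0 degrees."""
    bin_width = 360 / n_bins
    hist = [0.0] * n_bins
    rows, cols = len(patch), len(patch[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences; rows increase downward in image coordinates.
            gx = (patch[r][c + 1] - patch[r][c - 1]) / 2
            gy = (patch[r + 1][c] - patch[r - 1][c]) / 2
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region: no direction to vote for
            angle = math.degrees(math.atan2(gy, gx))  # in [-180, 180]
            # Shift by half a bin so that bins are centered on multiples
            # of bin_width, and wrap 180 degrees around to -180.
            index = int(((angle + 180 + bin_width / 2) % 360) // bin_width)
            hist[index] += magnitude  # assumption: magnitude-weighted vote
    return hist


# Toy patch: intensity increases left to right, so every gradient
# points along the positive x direction (0 degrees).
patch = [[0, 1, 2, 3] for _ in range(4)]
hist = gradient_orientation_histogram(patch)
print(hist)
```

For this ramp, all of the weight lands in the bin centered at 0°, matching the single dominant direction of change in the patch.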
![Page 29: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/29.jpg)
29
SIFT IMAGE FEATURE PIPELINE
Lowe, ICCV 1999
![Page 30: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/30.jpg)
30
DEEP LEARNING APPROACH
• Stack multiple layers – combine local features to form global features
• Similar in spirit to SIFT/HOG

“AlexNet” – Krizhevsky et al., NIPS 2012
![Page 31: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/31.jpg)
31
VISUALIZING ALEXNET
Weights of a trained AlexNet. Left: first layer. Right: second layer.
![Page 32: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/32.jpg)
32
FEATURIZATION CHALLENGES
Text example: It is a puppy and it is extremely cute.

[Diagram: text, image, and audio data placed along a spectrum from “human native” to conceptually abstract.]
Semantic content in data: Low → High
Difficulty of feature generation: Higher → Lower
![Page 33: Feature engineering for diverse data types](https://reader033.vdocuments.mx/reader033/viewer/2022042907/587f5bf91a28ab0d378b7627/html5/thumbnails/33.jpg)
33
KEY TO FEATURE ENGINEERING
• Features sit in-between data and models
• Need to encapsulate necessary semantic information from raw data
• Distribution of data in feature space should be easily manageable by intended model
• Natural text and logs contain higher level semantic information
  • Easier to featurize than images and audio
• Requires ingenuity and intuition!
@RainyData [email protected]
Amazon Ad Platform is hiring!