talk data science meetup
How to Visualize High-Dimensional Data? (Also: How to make 2000 bucks in an hour?)
Laurens van der Maaten
Visualization
• Visualization is a key tool in the analysis of data
Works for low-dimensional data only!
Data visualization
• What can we do to visualize Big Data that has lots of variables?
• Make a scatter plot in which each point corresponds to a measurement
• Arrange the points such that nearby points model similar measurements
• How do we determine the locations of the points in the map?
• Techniques for dimension reduction, multidimensional scaling, or embedding
Embedding
• The input of an embedding algorithm is:
• Collection of high-dimensional data points or...
• Collection of pairwise (dis)similarities (a distance table)
• The output of an embedding algorithm is:
• Collection of low-dimensional data points (a map)
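As a small illustration of the second kind of input, the sketch below builds a distance table from a collection of data points. It assumes Euclidean distance and uses NumPy; the function name `distance_table` and the toy points are made up for this example.

```python
import numpy as np

def distance_table(X):
    """Return the N x N table of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]      # all pairwise differences, shape (N, N, D)
    return np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean norm of each difference

# Three toy points in 3-D (illustrative values only).
X = np.array([[0.0, 0.0, 0.0],
              [3.0, 4.0, 0.0],
              [0.0, 0.0, 2.0]])
D = distance_table(X)
print(D)
```

An embedding algorithm that accepts a distance table can be given `D` directly, without ever seeing the original coordinates.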
Principal components analysis
• Principal components analysis (PCA) maps the data onto a linear subspace, such that the variance of the projected data is maximized
• [Figure: PCA projection, labeled $w^\top x$]
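The projection can be sketched in a few lines of NumPy. This is one common way to compute PCA (via the singular value decomposition of the centered data), not necessarily the route used in the talk; the function name `pca_map` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def pca_map(X, n_components=2):
    """Project the rows of X onto the directions of largest variance."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:n_components].T          # each column is one direction w
    return X_centered @ W, W         # map coordinates w^T x, and the directions

# Synthetic 5-D data whose coordinates have very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
Y, W = pca_map(X)
print(Y.shape)
```

Because the map is a linear projection, it preserves the large-scale spread of the data but has no way to unfold curved structure.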
t-Distributed Stochastic Neighbor Embedding
• Measure pairwise similarities between high-dimensional objects:
$$p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_k \sum_{l \neq k} \exp(-\|x_k - x_l\|^2 / 2\sigma^2)}$$
High-D
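A minimal NumPy sketch of the high-dimensional similarities follows. It uses a single bandwidth σ for all points, as in the slide formula; implementations usually tune a bandwidth per point from a perplexity setting instead. The function name and the toy points are assumptions made for this example.

```python
import numpy as np

def high_dim_similarities(X, sigma=1.0):
    """Pairwise similarities p_ij: Gaussian affinities normalized over all pairs."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    affinities = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)      # the sums run over pairs with l != k
    return affinities / affinities.sum()   # all p_ij together sum to one

# Two nearby points and one far-away point (illustrative values only).
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0]])
P = high_dim_similarities(X)
print(P)
```

Nearby points receive almost all of the probability mass, while the far-away point is nearly ignored.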
• Move points around to minimize:
$$KL(P \| Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
where
$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_k \sum_{l \neq k} (1 + \|y_k - y_l\|^2)^{-1}}$$
Low-D
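The objective can be evaluated directly for small examples. The sketch below computes the low-dimensional similarities and the KL divergence for two candidate maps; the similarity matrix `P` and both maps are hand-made toy values, and the function names are assumptions made for this example. It only evaluates the cost and does not perform the minimization.

```python
import numpy as np

def low_dim_similarities(Y):
    """Pairwise similarities q_ij: Student-t kernel normalized over all pairs."""
    sq_dists = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    kernel = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(kernel, 0.0)        # the sums run over pairs with l != k
    return kernel / kernel.sum()

def kl_divergence(P, Q):
    """KL(P || Q), summed over the pairs where p_ij is positive."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

# Toy similarities: objects 0 and 1 are similar, object 2 is unlike both.
P = np.array([[0.00, 0.40, 0.05],
              [0.40, 0.00, 0.05],
              [0.05, 0.05, 0.00]])

Y_good = np.array([[0.0, 0.0], [0.5, 0.0], [4.0, 0.0]])  # similar objects close together
Y_bad = np.array([[0.0, 0.0], [4.0, 0.0], [0.5, 0.0]])   # similar objects far apart

cost_good = kl_divergence(P, low_dim_similarities(Y_good))
cost_bad = kl_divergence(P, low_dim_similarities(Y_bad))
print(cost_good < cost_bad)
```

A map that keeps similar objects close has a lower cost, which is what moving the points around is meant to achieve.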
[Figure: t-SNE map with a legend for classes 0 to 9]
van der Maaten & Hinton, 2008
Scaling up t-SNE
• Interpret evaluating t-SNE gradient as simulating an N-body system
• Use a Barnes-Hut algorithm to approximate t-SNE gradient in O(N logN)
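To make the Barnes-Hut idea concrete, the sketch below builds a quadtree over a 2-D map and uses it to approximate, for one point, the sum of the kernel values (1 + ||y_i - y_j||^2)^-1 over all other points. A far-away cell is summarized by its center of mass and its point count. This covers only that one sum, not the full t-SNE gradient; the function names, the dictionary-based tree, and the threshold name `theta` are assumptions made for this example.

```python
import numpy as np

def build_tree(points, center, half):
    """Recursively split a square cell until each leaf holds a single point."""
    node = {"n": len(points), "com": points.mean(axis=0),
            "size": 2.0 * half, "children": []}
    if len(points) > 1 and half > 1e-9:      # tiny cells can only hold duplicates
        right = points[:, 0] >= center[0]
        upper = points[:, 1] >= center[1]
        for r in (False, True):
            for u in (False, True):
                subset = points[(right == r) & (upper == u)]
                if len(subset) > 0:
                    offset = 0.5 * half * np.array([1.0 if r else -1.0,
                                                    1.0 if u else -1.0])
                    node["children"].append(
                        build_tree(subset, center + offset, 0.5 * half))
    return node

def approx_kernel_sum(node, y, theta=0.5):
    """Approximate the sum of (1 + ||y - y_j||^2)^-1 over the points in a cell."""
    sq_dist = float(np.sum((y - node["com"]) ** 2))
    # A cell counts as far away when (cell width / distance) < theta.
    is_far = node["size"] ** 2 < theta ** 2 * sq_dist
    if is_far or not node["children"]:
        return node["n"] / (1.0 + sq_dist)   # summarize the cell by its center of mass
    return sum(approx_kernel_sum(child, y, theta) for child in node["children"])

rng = np.random.default_rng(0)
Y = rng.normal(size=(500, 2))                # a stand-in for a 2-D map
center = 0.5 * (Y.min(axis=0) + Y.max(axis=0))
half = 0.5 * (Y.max(axis=0) - Y.min(axis=0)).max()
root = build_tree(Y, center, half)

# Subtract 1 for the j == i term (exact as long as theta < 1/sqrt(2)).
approx = approx_kernel_sum(root, Y[0]) - 1.0
exact = float(np.sum(1.0 / (1.0 + np.sum((Y - Y[0]) ** 2, axis=1)))) - 1.0
print(approx, exact)
```

Setting `theta=0` visits every leaf and reproduces the exact sum; larger values skip more of the tree in exchange for a coarser approximation.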
[Figure: map with a legend for classes 0 to 9]
• Scale up t-SNE to large data sets (MNIST, N = 70,000 points; running time T = 10 minutes):
van der Maaten, 2013
• It even scales to data sets with millions of data points (TIMIT, N = 1.1 million points; running time T = 3 hours 40 minutes):
So how did you win 2000 bucks in an hour?
• Kaggle and Merck hosted a molecular activity visualization challenge:
• Features derived from molecules’ chemical structure
• Each molecule also has an activity value
• The data distribution somehow changes over time
• Visualize features using t-SNE, and color according to activity and time
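The workflow in the last bullet can be sketched with off-the-shelf tools. The example below uses scikit-learn's `TSNE` and Matplotlib, which are assumptions of this sketch rather than the code used for the challenge entry. The features, activity values, and time values are random placeholders, not the Merck data, so the resulting plot shows no real structure.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Random placeholders standing in for molecular features, activity, and time.
rng = np.random.default_rng(0)
n_molecules = 300
features = rng.normal(size=(n_molecules, 20))
activity = rng.uniform(5.5, 10.0, size=n_molecules)
timestamp = rng.uniform(0.0, 1.0, size=n_molecules)

# Embed the features in 2-D with scikit-learn's t-SNE implementation.
embedding = TSNE(n_components=2, perplexity=30.0, init="pca",
                 random_state=0).fit_transform(features)

# One scatter plot per coloring: the same map, colored by activity and by time.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, values, title in zip(axes, (activity, timestamp),
                             ("colored by activity", "colored by time")):
    points = ax.scatter(embedding[:, 0], embedding[:, 1], c=values, s=8)
    ax.set_title(title)
    fig.colorbar(points, ax=ax)
```

Both panels share the same map coordinates, so regions that stand out in one coloring can be compared directly with the other.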
Merck visualization (1)
[Figure: two t-SNE maps of data set #8, one colored by activity (color scale from 5.5 to 10) and one colored by time (color scale from 0.1 to 1)]
Merck visualization (2)
[Figure: two t-SNE maps of data set #8, one colored by time (color scale from 0 to 0.9) and one colored by activity (color scale from 5.5 to 10)]
Limitations of using a single map
• Suppose we are visualizing words based on association data, or authors based on co-authorships, or Enron emails, or scale-free networks, etc.
• How can we model the words “river”, “bank”, and “bailout” in a single map?
[Figure: a single map containing the words RIVER, BANK, and BAILOUT]
Multiple maps t-SNE
• Construct multiple maps, and give each object a point in each map
• Assign an importance weight to each point
• Define the similarity between two points under the multiple maps model as a weighted sum over the similarities in the individual maps
van der Maaten & Hinton, MLJ 2012
[Figure: Map 1 and Map 2, showing RIVER, BANK, BAILOUT, and BANK with importance weights 1, ½, 1, and ½]
• Definition of similarity under multiple maps model:
$$q_{j|i} = \frac{\sum_m \pi_i^{(m)} \pi_j^{(m)} (1 + \|y_i^{(m)} - y_j^{(m)}\|^2)^{-1}}{\sum_{m'} \sum_{k \neq i} \pi_i^{(m')} \pi_k^{(m')} (1 + \|y_i^{(m')} - y_k^{(m')}\|^2)^{-1}}$$
• Herein, we define the importance weights as:
$$\pi_i^{(m)} = \frac{\exp(w_i^{(m)})}{\sum_{m'} \exp(w_i^{(m')})}$$
• All map coordinates and importance weights are learned jointly
van der Maaten & Hinton, 2012
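The two definitions can be checked on the RIVER, BANK, BAILOUT example. In the sketch below, the map coordinates and the weight parameters are hand-picked toy values rather than learned ones, and the function names are assumptions made for this example. RIVER lives almost entirely in the first map, BAILOUT in the second, and BANK is split evenly between both.

```python
import numpy as np

def importance_weights(W):
    """Softmax over maps: W[m, i] is the parameter w_i^(m); each column sums to one."""
    shifted = W - W.max(axis=0, keepdims=True)   # shift for numerical stability
    exp_w = np.exp(shifted)
    return exp_w / exp_w.sum(axis=0, keepdims=True)

def multiple_maps_similarities(Y, W):
    """Return Q with Q[i, j] = q_{j|i}; Y[m, i] is the position of object i in map m."""
    pi = importance_weights(W)                                      # shape (M, N)
    sq_dists = ((Y[:, :, None, :] - Y[:, None, :, :]) ** 2).sum(axis=-1)
    kernel = 1.0 / (1.0 + sq_dists)                                 # shape (M, N, N)
    weighted = (pi[:, :, None] * pi[:, None, :] * kernel).sum(axis=0)
    np.fill_diagonal(weighted, 0.0)                                 # sum over k != i
    return weighted / weighted.sum(axis=1, keepdims=True)

words = ["RIVER", "BANK", "BAILOUT"]
# Weight parameters per map (rows) and word (columns); toy values.
W = np.array([[4.0, 0.0, -4.0],
              [-4.0, 0.0, 4.0]])
# Positions of the three words in each of the two maps; toy values.
Y = np.array([[[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]])
Q = multiple_maps_similarities(Y, W)
print(np.round(Q, 3))
```

With these values, BANK comes out similar to both RIVER and BAILOUT, while RIVER and BAILOUT stay dissimilar to each other, which is the pattern the multiple maps model is meant to represent.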
[Figure: multiple maps t-SNE visualization of words, with labels such as GOVERNMENT, PRESIDENT, BASEBALL, JACKET, KING, and QUEEN]
I want to give this stuff a try!
• Type “t-SNE” into Google, and click the first link
• You’ll find papers, examples, and implementations (in Matlab, Python, R, and C++)
• You can also drop me a line: [email protected]