Attend to You: Personalized Image Captioning with Context Sequence Memory Networks

Cesc Chunseong Park, Byeongchang Kim, Gunhee Kim
Seoul National University
Poster: openaccess.thecvf.com/content_cvpr_2017/poster/293_POSTER.pdf

Motivation
• Previous image captioning creates a general description of an image (e.g., "A cup of coffee …") [Vinyals et al. CVPR2015] [Xu et al. ICML2015] [Karpathy et al. CVPR2015] [Denton et al. KDD2015], and many more
• Users craft sentences based on their experiences, using their own words:
  User 1: "Beautiful solitude in the morning"
  User 2: "Beautiful day for a wedding"
  User 3: "The beautiful Melbourne, I love spring"
• Extend image captioning to reflect the user's personality

Objective
• Generate captions from an image and the user's context
• Input: a query image and the user's active vocabulary
• Task 1: Hashtag prediction
• Task 2: Post generation

Our Solution: CSMN (Context Sequence Memory Network)
(1) Multi-type memory cells condition on different types of context information: the image feature, the user's active vocabulary, and previously generated words
(2) A CNN memory I/O structure jointly represents nearby ordered memory slots for better context understanding
(3) Sequence generation without an RNN: appending generated words to the memory gives state-based sequence generation and captures long-term information without vanishing gradients

InstaPIC-1.1M Dataset
• Collected from 1,124,815 unique posts by 6,315 unique users
• Collection pipeline:
  (1) Collect Instagram posts: refined posts from 27 general categories from Pinterest, with 5 < caption length < 15 and 50 < # posts per user < 1,000
  (2) Preprocess posts/hashtags: build a vocabulary dictionary of 40K words for captions and 60K for hashtags
  (3) Extract the user's active vocabulary: the TF-IDF weighted top-D frequent words from the user's previous posts

  Dataset   # posts   # users
  caption   721,176   4,820
  hashtag   518,116   3,633

• Code and dataset are available at https://github.com/cesc-park/attend2u
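Step (3) of the pipeline above can be sketched in a few lines. This is a minimal illustration assuming each user's posts are available as plain caption strings; the function name, the toy data, and the default D are made up here and are not taken from the attend2u code:

```python
import math
from collections import Counter

def active_vocabulary(user_posts, all_users_posts, D=40):
    """Top-D words of one user, ranked by TF-IDF over all users' post collections."""
    def words(posts):
        return [w for post in posts for w in post.lower().split()]

    tf = Counter(words(user_posts))                # term frequency within this user's posts
    n_users = len(all_users_posts)
    df = Counter()                                 # number of users whose posts contain each word
    for posts in all_users_posts:
        df.update(set(words(posts)))

    def tfidf(w):
        return tf[w] * math.log(n_users / (1 + df[w]))

    return sorted(tf, key=tfidf, reverse=True)[:D]

# toy usage with three users' caption lists
users = [
    ["beautiful solitude in the morning", "coffee and quiet"],
    ["beautiful day for a wedding", "the dress was perfect"],
    ["the beautiful melbourne", "i love spring"],
]
print(active_vocabulary(users[0], users, D=5))
```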

CSMN Architecture
• Memory setup: image feature (ResNet), user context (active vocabulary), and word output memory cells
• Prediction step (a): the query attends over the memory (attention, memory CNN, softmax) to predict the next word y_t
• Word output memory update (b): y_t is appended to the word output memory and a new query is formed for the next step
[Figure: CSMN memory setup and the (a) prediction and (b) word output memory update steps]
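The prediction/update step can be illustrated with a much-simplified sketch of the RNN-free decoding loop. All names, dimensions, and random weights below are placeholders, greedy argmax stands in for the real decoder, and the memory-CNN read is collapsed into a single attention step, so this only shows the idea of writing generated words back into memory, not the released attend2u model:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, T_max = 1000, 128, 12               # vocab size, memory dimension, max caption length

img_mem  = rng.normal(size=(1, d))        # pooled image feature projected into memory space
user_mem = rng.normal(size=(30, d))       # embeddings of the user's active-vocabulary words
word_mem = np.zeros((0, d))               # word output memory, grows as words are generated

W_out = rng.normal(size=(d, V)) * 0.01    # output projection (stand-in for the softmax layer)
embed = rng.normal(size=(V, d)) * 0.01    # word embedding table

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

q = rng.normal(size=(d,))                 # initial query vector
caption = []
for t in range(T_max):
    memory = np.concatenate([img_mem, user_mem, word_mem])   # multi-type memory
    attn   = softmax(memory @ q)                             # attention over every memory slot
    read   = attn @ memory                                   # attended memory summary
    logits = (q + read) @ W_out                              # the real model applies a memory CNN here
    y_t    = int(np.argmax(logits))                          # greedy choice of the next word
    caption.append(y_t)
    word_mem = np.vstack([word_mem, embed[y_t]])             # append y_t to the word output memory
    q = read                                                 # new query for the next step; no RNN state
print(caption)
```

The part that matters is the last two lines of the loop: long-range context is carried by the growing word output memory and the updated query rather than by a recurrent hidden state, which is the role the poster assigns to the word output memory update.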

Quantitative Results
• Measured by both language and retrieval metrics
• Compared methods: (CSMN-*) ours and its variants, (seq2seq) [Vinyals et al. NIPS15], (ShowTell) [Vinyals et al. TPAMI16], (AttendTell) [Xu et al. ICML15], (1NN) 1-nearest neighbor
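The poster does not list the individual metrics, so purely as an illustration of a retrieval-style metric, here is a small recall@K helper; it is a hypothetical example, not code from the paper's evaluation:

```python
def recall_at_k(ranked_lists, ground_truths, k=10):
    """Fraction of queries whose ground-truth item appears among the top-k ranked candidates."""
    hits = sum(gt in ranked[:k] for ranked, gt in zip(ranked_lists, ground_truths))
    return hits / len(ground_truths)

# toy usage: two queries, candidate lists already sorted by model score
print(recall_at_k([["a", "b", "c"], ["x", "y", "z"]], ["b", "q"], k=2))  # 0.5
```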

User Studies via Amazon Mechanical Turk
• General users' preferences over the captions created by different methods for a query image

Qualitative Results
• Post generation examples

  (GT) pool pass for the summer ✔  (Ours) the pool was absolutely perfect ☀

  (GT) awesome view of the city  (Ours) the city of cincinnati is so pretty  (NoCNN) the beach  (UsrIm) there are no words

  (GT) dinner and drinks with @username  (Ours) wine and movie night with @username  (Im) my afternoon is sorted

  (GT) this speaks to me literarily  (Ours) I love this #quote  (ShowTell) is the only thing that matters _UNK

• Hashtag prediction examples
  (GT) #style #fashion #shopping #shoes #kennethcole …  (Ours) #newclothes #fashion #shoes #brogues

  (GT) #boudoir #heartprint #love #weddings #potterybarn  (Ours) #decor #homedecor #interiors #interiordesign #rustic #bride #pretty #wedding #home #white

  (GT) #coffee #dailycortado #love #vscocam #vscogood #vscophile #coffeebreak …  (Ours) #coffee #coffeetime #coffeeart #latte #latteart #coffeebreak #vsco

  (GT) #greensmoothie #dairyfree #lifewithatoddler #glutenfree #vegetarian …  (Ours) #greensmoothie #greenjuice #smoothie #vegan #raw #juicing #eatclean #detox #cleanse

Please see the paper for more details!