Corinna Cortes, Head of Research, Google, at MLconf NYC 2017


Harnessing Neural Networks

Corinna Cortes, Google Research, NY

Harnessing the Power of Neural Networks: Introduction

How do we standardize the output?

How do we speed up inference?

How do we automatically find a good network architecture?

Google’s mission is to organize the world’s information and make it universally accessible and useful.

Google Translate

Smart reply in Inbox

10% of all responses sent on mobile

LSTM in Action

LSTMs and Extrapolation

They daydream or hallucinate :-)

Feature or bug?

DeepDream Art Auction and Symposium (A&MI)

Magenta

[Figure: recurrent cell A with input x_t and output h_t]

A.I. Duet: https://aiexperiments.withgoogle.com/ai-duet/view/

Harnessing the Power of Neural Networks: Introduction

How do we standardize the output?

How do we speed up inference?

How do we automatically find a good network architecture?

Restricting the Output. Smart Replies. http://www.kdd.org/kdd2016/papers/files/Paper_1069.pdf

● Ungrammatical and inappropriate answers
○ thanks hon!; Yup, got it thx; Leave me alone!

● Work with a fixed response set
○ Sanitized answers are clustered into semantically similar groups using label propagation;
○ The answers in the clusters are used to filter the candidate set generated by the LSTM. Diversity is ensured by taking top answers from different clusters.

● Efficient search via tries (sketched below)

Search Tree, Trie, for Valid Responses

[Figure: trie over clustered responses, with branches such as “How about Tuesday? / Wednesday?”, “I can do Tuesday / Wednesday”, and “What time works for you?”]
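As an illustration of the trie idea (the response strings, tokenization, and helper names below are my own, not from the talk), here is a minimal sketch in which only tokens that keep the decoded prefix inside the trie of whitelisted responses are allowed at each step:

```python
# Minimal sketch: a trie over a fixed, sanitized response set.
# At each decoding step only tokens that keep us inside the trie are allowed,
# so the model can never emit a response outside the whitelisted set.

class TrieNode:
    def __init__(self):
        self.children = {}        # token -> TrieNode
        self.is_response = False  # True if a whitelisted response ends here

def build_trie(responses):
    root = TrieNode()
    for response in responses:
        node = root
        for token in response.split():
            node = node.children.setdefault(token, TrieNode())
        node.is_response = True
    return root

def allowed_next_tokens(trie, prefix_tokens):
    """Tokens the decoder may emit after the given prefix."""
    node = trie
    for token in prefix_tokens:
        if token not in node.children:
            return set()
        node = node.children[token]
    return set(node.children)

# Illustrative responses only.
trie = build_trie([
    "I can do Tuesday",
    "I can do Wednesday",
    "How about Tuesday ?",
    "What time works for you ?",
])
print(allowed_next_tokens(trie, ["I", "can", "do"]))  # {'Tuesday', 'Wednesday'}
```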

Computational Complexity

● Exhaustive: R × l, where R is the size of the response set and l the length of the longest sentence.

● Beam search: b × l.

Typical size of R ~ millions; typical size of b ~ 10-30.
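For a rough sense of scale (the sizes are from the slide; the arithmetic is only illustrative): with R on the order of 10^6 responses and l ≈ 10 tokens, exhaustive scoring costs on the order of 10^7 decoding steps per message, while beam search with b = 20 costs about 200, and the trie further restricts each step to the handful of tokens that are valid continuations of the current prefix.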

● A more elegant solution based on rules
○ Exploit rules to efficiently enlarge the response set:
■ “Can you do Monday?” → “Yes, I can do Monday”
■ “Can you do Tuesday?” → “Yes, I can do Tuesday”
■ ...
○ Generalize to templates (see the sketch below):

“Can you do <time>?”

“Yes, I can do <time>” or “No, I can do <time + 1>”
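A small sketch of how such slot-filling rules can enlarge the response set without listing every surface form; the templates, slot name, and helper function are illustrative assumptions, not the production rules:

```python
# Sketch: enlarge the response set with slot-filling rules instead of
# enumerating every surface form. Templates and values are illustrative only.

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

TEMPLATES = [
    "Can you do <time>?",
    "Yes, I can do <time>",
    "No, how about <time>?",
]

def expand(templates, slot, values):
    """Instantiate every template once per slot value."""
    expanded = []
    for template in templates:
        if slot in template:
            expanded.extend(template.replace(slot, value) for value in values)
        else:
            expanded.append(template)
    return expanded

responses = expand(TEMPLATES, "<time>", WEEKDAYS)
print(len(responses))    # 15 concrete responses from 3 templates
print(responses[:3])
```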

What if the Response Set is in the Billions?

Rules for Response Set

Text Normalization for Text-to-Speech (TTS) Systems. Navigation assistant.

Text Normalization

Richard Sproat, Navdeep Jaitly, Google: “RNN Approaches to Text Normalization: A Challenge”, https://arxiv.org/pdf/1611.00068.pdf

Break the Task in Two

● Channel model
○ What are the possible normalizations of a given token? Sequence of tokens to words.
○ Example: 123
■ one hundred twenty three, one two three, one twenty three, ...

● Language model
○ Which one is appropriate in the given context? Words to words.
○ Example: 123
■ 123 King Ave.: the correct reading in American English would normally be one twenty three.

(A toy sketch of this decomposition follows below.)
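A toy sketch of the two-stage decomposition; the candidate list and the scoring function are invented stand-ins, not the RNN models from the paper:

```python
# Toy sketch of the channel model + language model decomposition.
# Real systems use learned models; these functions are stand-ins.

def channel_model(token):
    """Propose possible verbalizations of a written token."""
    if token == "123":
        return ["one hundred twenty three", "one two three", "one twenty three"]
    return [token]

def language_model_score(words, context):
    """Dummy LM: prefer the address reading when a street name follows."""
    if "Ave." in context and words == "one twenty three":
        return 0.9
    return 0.1

def normalize(token, context):
    candidates = channel_model(token)   # channel model proposes
    return max(candidates, key=lambda words: language_model_score(words, context))  # LM picks

print(normalize("123", "123 King Ave."))  # -> one twenty three
```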

Combining the Models

One combined LSTM

Silly Mistakes

Add a Grammar to Constrain the Output

Rule: <number> + <measurement abbreviation> => <number> + the possible verbalizations of the measurement abbreviation.

Instantiation: 24.2kg => twenty four point two kilogram, twenty four point two kilograms, twenty four point two kilo.

Finite State Transducers: finite state automata that produce output as well as reading input; used for pattern matching and regular expressions.

Thrax Grammar

MEASURE: <number> + <measurement abbreviation> -> <number> + measurement verbalizations

Input: 5 kg -> five kilo/kilograms/kilogram

MONEY: $ <number> -> <number> dollars

Input composed with FSTs. The output of the FST is used to restrict the output of the LSTM.
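The talk uses Thrax grammars compiled into FSTs; as a rough stand-in, here is a plain-Python sketch of the MEASURE rule using a regular expression. The measure table and function name are assumptions for illustration, and verbalizing the number itself is omitted:

```python
import re

# Plain-Python stand-in for the MEASURE rule:
# <number> <measurement abbreviation> -> <number> + allowed verbalizations.
# In the talk this is a finite state transducer, not a regex.

MEASURE_WORDS = {"kg": ["kilogram", "kilograms", "kilo"]}

def allowed_verbalizations(text):
    """Verbalizations the grammar permits for a measure expression (number left as digits)."""
    match = re.fullmatch(r"([\d.]+)\s*([a-zA-Z]+)", text)
    if not match or match.group(2) not in MEASURE_WORDS:
        return []
    number, unit = match.groups()
    return [f"{number} {word}" for word in MEASURE_WORDS[unit]]

# The LSTM's candidate outputs are then filtered against this permitted set.
print(allowed_verbalizations("24.2 kg"))
# ['24.2 kilogram', '24.2 kilograms', '24.2 kilo']
```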

TTS: RNN + FST. Measure and Money restricted by grammar.

Harnessing the Power of Neural Networks: Introduction

How do we standardize the output?

How do we speed up inference?

How do we automatically find a good network architecture?

Super-Multiclass Classification Problem

One class per image type (horse, car, …): M classes.

Output layer: M units. Last hidden layer: N units.

Neural network inference: just computing the last layer requires M·N multiply-adds.

Asymmetric Hashing

[Figure: output-layer weight vectors W1, W2, W3, …, WM, each split into N/k chunks]

Weights to the output layer, split into N/k chunks:

● Represent each chunk with a set of cluster centers (256) using k-means.

● Save the coordinates of the centers, (ID, coordinates).

● Save each weight vector as its sequence of closest-center IDs, its hash code.


Asymmetric Hashing, Searching

● For a given activation u, divide it into its N/k chunks u_j:
○ Compute the distances to the 256 centers of each chunk: 256·N multiply-adds in total, not M·N.
○ Compute the distances to all hash codes: M·N/k additions needed.

● The “asymmetric” in “asymmetric hashing” refers to the fact that we hash the weight vectors but not the activation vector.

(A rough sketch of the procedure follows below.)
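A rough numpy sketch of the build-and-search procedure described above, using scikit-learn's KMeans for the per-chunk clustering; the sizes, shapes, and variable names are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative sizes: M classes, N hidden units, chunks of length k, 256 centers.
M, N, k, C = 5000, 256, 8, 256
W = np.random.randn(M, N)          # weight vectors to the output layer
n_chunks = N // k

# Build: per chunk, cluster all M weight-vector pieces and store (centers, IDs).
centers, codes = [], []
for j in range(n_chunks):
    chunk = W[:, j * k:(j + 1) * k]                  # (M, k)
    km = KMeans(n_clusters=C, n_init=1).fit(chunk)
    centers.append(km.cluster_centers_)              # (C, k) center coordinates
    codes.append(km.labels_)                         # (M,) closest-center ID per weight vector
codes = np.stack(codes, axis=1)                      # (M, n_chunks) hash codes

def approx_distances(u):
    """Approximate squared distances from activation u to all M weight vectors.
    Asymmetric: u is not hashed, only the weights are."""
    dist = np.zeros(M)
    for j in range(n_chunks):
        u_j = u[j * k:(j + 1) * k]
        table = ((centers[j] - u_j) ** 2).sum(axis=1)   # 256*k mult-adds per chunk -> 256*N total
        dist += table[codes[:, j]]                      # M additions per chunk -> M*N/k total
    return dist

u = np.random.randn(N)                                  # last-hidden-layer activation
top5 = np.argsort(approx_distances(u))[:5]              # approximate top classes
print(top5)
```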

Asymmetric Hashing

Incredible saving in inference time, sometimes also with a bit of improved accuracy.

Harnessing the Power of Neural Networks: Introduction

How do we standardize the output?

How do we speed up inference?

How do we automatically find a good network architecture?

“Learning to Learn”, a.k.a. “Automated Hyperparameter Tuning”

Google: AdaNet, Architecture Search with Reinforcement Learning

MIT: Designing Neural Network Architectures Using Reinforcement Learning

Harvard, Toronto, MIT, Intel: Scalable Bayesian Optimization Using Deep Neural Networks

Genetic Algorithms, Reinforcement Learning, Boosting Algorithm

Modeling Challenges for ML

The right model choice can significantly improve performance. For deep learning it is particularly hard, as the search space is huge and:

● the optimization is difficult and non-convex;

● there is a lack of sufficient theory.

Questions

● Can neural network architectures be learned together with their weights?

● Can this problem be solved efficiently and in a principled way?

● Can we capture the end-to-end process?

AdaNet

● Incremental construction: at each round, the algorithm adds a subnetwork to the existing neural network;

● The algorithm leverages embeddings previously learned;

● Adaptively grows the network, balancing the trade-off between empirical error and model complexity (a conceptual sketch follows below);

● Learning bound:
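A purely conceptual sketch of the incremental construction loop; the subnetwork training, empirical error, and complexity measure below are trivial stand-ins, not the quantities defined in the AdaNet paper or its learning bound:

```python
import random

# Conceptual sketch only: at each round, train candidate subnetworks, score
# each by empirical error plus a complexity penalty, and add the best one.
# The three helpers are toy stand-ins so the control flow runs end to end.

def train_subnetwork(depth, warm_start):
    """Stand-in: pretend to train a candidate subnetwork of the given depth."""
    return {"depth": depth, "error": random.uniform(0.1, 0.3) / depth}

def empirical_error(ensemble):
    """Stand-in for the combined network's training error."""
    return min(sub["error"] for sub in ensemble)

def complexity(subnetwork):
    """Stand-in complexity penalty: deeper candidates pay more."""
    return 0.01 * subnetwork["depth"]

def adanet_sketch(rounds=3, candidate_depths=(1, 2, 3), lam=1.0):
    ensemble = []
    for _ in range(rounds):
        best, best_obj = None, float("inf")
        for depth in candidate_depths:
            # Candidates can reuse embeddings learned in earlier rounds.
            candidate = train_subnetwork(depth, warm_start=ensemble)
            obj = empirical_error(ensemble + [candidate]) + lam * complexity(candidate)
            if obj < best_obj:
                best, best_obj = candidate, obj
        ensemble.append(best)     # adaptively grow the network
    return ensemble

print([sub["depth"] for sub in adanet_sketch()])
```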

Experimental Results, AdaNet

CIFAR-10: 60,000 images, 10 classes

SD of all #’s: 0.01

Label Pair         AdaNet   Log. Reg.   NN
deer-truck         0.94     0.90        0.92
deer-horse         0.84     0.77        0.81
automobile-truck   0.85     0.80        0.81
cat-dog            0.69     0.67        0.66
dog-horse          0.84     0.80        0.81

Neural Architecture Search with RL

[Tables: error rates on CIFAR-10; perplexity on Penn Treebank]

Current accuracy of NAS on ImageNet: 78%. State of the art: 80.x%.

“Learning to Learn”, a.k.a. “Automated Hyperparameter Tuning”

Google: AdaNet, Architecture Search with Reinforcement Learning

MIT: Designing Neural Network Architectures Using Reinforcement Learning

Harvard, Toronto, MIT, Intel: Scalable Bayesian Optimization Using Deep Neural Networks

Genetic Algorithms, Reinforcement Learning, Boosting Algorithm

Harnessing the Power of Neural Networks: Introduction

How do we standardize the output?

How do we speed up inference?

How do we automatically find a good network architecture?
