Deep Learning for NLP Applications
TRANSCRIPT
Deep Learning for NLP Applications
What is Deep Learning?
Just a Neural Network!
"Deep learning" refers toDeep Neural Networks
A Deep Neural Network is simply a Neural Network with multiple hidden layers.
Neural Networks have been around since the 1970s.
So why now?
Large Networks are hard to train
- Vanishing gradients make backpropagation harder
- Overfitting becomes a serious issue
- So we settled (for the time being) for simpler, more useful variations of Neural Networks
Then, suddenly ...
- We realized we can stack these simpler Neural Networks, making them easier to train
- We derived more efficient parameter estimation and model regularization methods
- Also, Moore's law kicked in and GPU computation became viable
So what's the big deal?
MASSIVE improvements in Computer Vision
Speech Recognition
- Baidu (with Andrew Ng as their chief scientist) has built a state-of-the-art speech recognition system with Deep Learning
- Their dataset: 7,000 hours of conversation coupled with background noise synthesis for a total of 100,000 hours
- They processed this through a massive GPU cluster
Cross Domain Representations
What if you wanted to take an image and generate a description of it?
The beauty of representation learning is its ability to be distributed across tasks.
This is the real power of Neural Networks.
But Samiur, what about NLP?
Deep Learning NLP
- Distributed word representations
- Dependency Parsing
- Sentiment Analysis
- And many others ...
Standard Bag of Words
- A one-hot encoding
- 20k to 50k dimensions
- Can be improved by factoring in document frequency
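As a minimal sketch of the two representations just described (scikit-learn is an assumed choice here; the talk doesn't name a library for this step):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["mattermark raised a funding round",
        "the startup announced a new funding round today"]

# Plain bag of words: one dimension per vocabulary word (20k-50k on real corpora)
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())

# Factoring in document frequency: TF-IDF down-weights words common to all docs
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())
```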
Neural Word Embeddings
- Uses a vector space that attempts to predict a word given a context window
- 200-400 dimensions
motel [0.06, -0.01, 0.13, 0.07, -0.06, -0.04, 0, -0.04]
hotel [0.07, -0.03, 0.07, 0.06, -0.06, -0.03, 0.01, -0.05]
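Using the two example vectors above, a quick numpy check shows why related words end up near each other in the embedding space:

```python
import numpy as np

motel = np.array([0.06, -0.01, 0.13, 0.07, -0.06, -0.04, 0.00, -0.04])
hotel = np.array([0.07, -0.03, 0.07, 0.06, -0.06, -0.03, 0.01, -0.05])

# Cosine similarity is ~0.94: the embeddings place the two words close together
cos = motel @ hotel / (np.linalg.norm(motel) * np.linalg.norm(hotel))
print(cos)
```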
Word Representations
Word embeddings make semantic similarity and synonym detection possible.
Word embeddings have cool properties:
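The best-known of these properties is vector arithmetic on analogies (e.g. king - man + woman ≈ queen). A sketch using Gensim's KeyedVectors, assuming a pretrained vector file on disk (the path here is hypothetical):

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # hypothetical path

# nearest neighbours capture synonymy: hotel ~ motel, inn, ...
print(kv.most_similar("hotel", topn=3))

# analogy arithmetic: king - man + woman ~ queen
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```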
Dependency Parsing
Converting sentences to a dependency-based grammar
Simplifying this to the verbs and their agents is called Semantic Role Labeling.
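The talk doesn't name a parsing library, but as an illustrative sketch, spaCy exposes dependency relations and a crude verb-and-agents view like the one described:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm
doc = nlp("Mattermark raised a round of $6.5 million")

for token in doc:
    # each word, its dependency label, and the head it attaches to
    print(f"{token.text:<10} {token.dep_:<8} -> {token.head.text}")

# crude SRL-style view: each verb with its subject/object children as agents
for verb in (t for t in doc if t.pos_ == "VERB"):
    agents = [c.text for c in verb.children if c.dep_ in ("nsubj", "dobj")]
    print(verb.text, agents)
```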
Sentiment Analysis
Recursive Neural Networks
- Can model tree structures very well
- This makes them great for other NLP tasks too (such as parsing)
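The core idea is a single composition function applied bottom-up over a parse tree. A minimal numpy sketch (dimensions, weights, and the toy tree are illustrative assumptions, not the talk's model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # toy embedding dimension
W = rng.normal(scale=0.1, size=(d, 2 * d))   # shared composition matrix
b = np.zeros(d)

def compose(left, right):
    # parent = tanh(W [left; right] + b), reused at every node of the tree
    return np.tanh(W @ np.concatenate([left, right]) + b)

very, good, movie = (rng.normal(size=d) for _ in range(3))
phrase = compose(good, movie)      # ("good" "movie")
root = compose(very, phrase)       # ("very" ("good" "movie"))
# the root vector could then feed a softmax sentiment classifier
```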
Get to the applications part already!
Tools
Python
- Theano/PyLearn2
- Gensim (for word2vec)
- nolearn (uses scikit-learn)
Java/Clojure/Scala
- DeepLearning4j
- neuralnetworks by Ivan Vasilev
APIs
- Alchemy API
- MetaMind
Problem: Funding Sentence Classifier
Build a binary classifier that can take any sentence from a news article and tell whether it's about funding or not.
eg. "Mattermark is today announcing that it has raised a
round of $6.5 million"
Word Vectors
- Used Gensim's Word2Vec implementation to train unsupervised word vectors on the UMBC Webbase Corpus (~100M documents, ~48GB of text)
- Then iterated 20 times on text from news articles in the tech news domain (~1M documents, ~300MB of text)
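That two-stage setup maps directly onto Gensim's API. A sketch with tiny stand-in corpora (gensim >= 4 parameter names; older versions use size= instead of vector_size=):

```python
from gensim.models import Word2Vec

# tiny stand-ins for the UMBC Webbase and tech-news corpora from the slide
webbase_sentences = [["the", "hotel", "was", "near", "the", "motel"],
                     ["she", "booked", "a", "room", "at", "the", "hotel"]]
technews_sentences = [["mattermark", "raised", "a", "funding", "round"]]

# stage 1: unsupervised training on the large general corpus
model = Word2Vec(webbase_sentences, vector_size=300, window=5,
                 min_count=1, workers=4)

# stage 2: keep iterating on the in-domain news text (20 passes on the slide)
model.build_vocab(technews_sentences, update=True)
model.train(technews_sentences, total_examples=len(technews_sentences), epochs=20)
```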
Sentence Vectors
How can you compose word vectors to make sentence vectors?
- Use the paragraph vector model proposed by Quoc Le
- Feed them into an RNN constructed from the dependency tree of the sentence
- Use some heuristic function to combine the string of word vectors
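The simplest heuristic functions are an unweighted or TF-IDF-weighted average of the word vectors; a minimal sketch (weights and dimensions are illustrative):

```python
import numpy as np

def average_compose(word_vecs):
    # simplest heuristic: the sentence vector is the mean of its word vectors
    return np.mean(word_vecs, axis=0)

def tfidf_weighted_compose(word_vecs, tfidf_weights):
    # rarer, more informative words contribute more to the sentence vector
    w = np.asarray(tfidf_weights)[:, None]
    return (w * np.asarray(word_vecs)).sum(axis=0) / w.sum()

vecs = [np.random.rand(300) for _ in range(4)]     # stand-in word vectors
sentence_vec = tfidf_weighted_compose(vecs, [2.1, 0.3, 0.4, 1.5])
```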
What did we try?
- TF-IDF + Naive Bayes
- Word2Vec + Composition Methods
- Word2Vec + TF-IDF + Composition Methods
- Word2Vec + TF-IDF + Semantic Role Labeling (SRL) + Composition Methods
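The first baseline is a few lines in scikit-learn (an assumed implementation; the toy labels just mirror the funding/non-funding task):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = ["Mattermark is today announcing that it has raised a round of $6.5 million",
             "The conference takes place next week in San Francisco"]
labels = [1, 0]  # 1 = funding sentence, 0 = not

baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(sentences, labels)
print(baseline.predict(["The startup raised $10 million in Series A"]))
```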
Composition Methods

[Composition formulas shown on slide] where wᵢ represents the i-th word vector, wᵥ the word vector for the verb, and a₀ and a₁ are the agents.
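From the next slide, the operators involved are addition and circular convolution over the verb and agent vectors. A sketch of both operations (the exact pairing of wᵥ, a₀, a₁ on the slide isn't recoverable, so the pairing below is an assumption):

```python
import numpy as np

def circular_convolution(a, b):
    # (a * b)[k] = sum_i a[i] * b[(k - i) mod d], computed via FFT
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

d = 8
w_v = np.random.rand(d)   # verb vector
a0 = np.random.rand(d)    # first agent (e.g. subject)
a1 = np.random.rand(d)    # second agent (e.g. object)

additive = w_v + a0 + a1                         # additive composition
convolved = circular_convolution(w_v, a0 + a1)   # assumed convolution pairing
```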
What worked?
Word2Vec + TF-IDF + SRL + Circular Convolution/Additive
- The first method with simple TF-IDF/Naive Bayes performed extremely poorly because of its large dimensionality
- Combining TF-IDF with Word2Vec provided a small but noticeable improvement
- Adding SRL and a more sophisticated composition method increased performance by almost 5%
What else could we try?
Can we apply this method to generate general-purpose document vectors?
- We are currently using LDA (a topic analysis method) or simple TF-IDF to create document vectors
- How will this method compare to the already-proposed paragraph vector method by Quoc Le?
Can we associate these document vectors with much smaller query strings?
e.g. Search for "artificial intelligence" against our companies and get better results than keyword search
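Quoc Le's paragraph vector model is available in Gensim as Doc2Vec, and it supports exactly this query-against-documents comparison; a sketch with toy documents (gensim >= 4 names: model.dv was model.docvecs in older versions):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(["mattermark", "builds", "machine", "learning", "tools"], ["d0"]),
        TaggedDocument(["a", "new", "phone", "was", "released", "today"], ["d1"])]

# paragraph vector model (Quoc Le), as implemented in gensim
model = Doc2Vec(docs, vector_size=100, min_count=1, epochs=40)

# a short query string is embedded into the same space as the documents
query_vec = model.infer_vector(["artificial", "intelligence"])
print(model.dv.most_similar([query_vec], topn=1))
```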
Who's doing ML at Mattermark?
We need more people! Refer anyone you know that does Data Science/ML.