text intensive diligence - clojure nlp (january 18, 2017)

Weighing our Words

Clojure Chicago Meetup (January 18, 2017)

Kripa Rajshekhar (Metonymy Labs)

[email protected]



Patent owner success rates in litigation

Patent text - “natural” language processing

TITLE: Process for manufacturing a DMOS transistor

ABSTRACT: In a new process of making a DMOS transistor, the doping of the sloping side walls can be set independently from the doping of the floor region in a trench structure. Furthermore, different dopings can be established among the side walls. This is achieved especially by a sequence of implantation doping, etching to form the trench, formation of a scattering oxide protective layer on the side walls, and two-stage perpendicular and tilted final implantation doping. For DMOS transistors, this achieves high breakthrough voltages even with low turn-on resistances, and reduces the space requirement, in particular with regard to driver structures.

Extracting the core ideas …

{:phrase ["in"], :tag "PP"}
{:phrase ["a" "new" "process"], :tag "NP"} <<<<<<<<<<
{:phrase ["of"], :tag "PP"}
{:phrase ["making"], :tag "VP"} <<<<<<<<<<
{:phrase ["a" "dmos" "transistor"], :tag "NP"} <<<<<<<<<<
{:phrase ["the" "doping"], :tag "NP"}
{:phrase ["of"], :tag "PP"}
{:phrase ["the" "sloping" "side" "walls"], :tag "NP"}
{:phrase ["can" "be" "set"], :tag "VP"}
{:phrase ["independently"], :tag "ADVP"}
{:phrase ["from"], :tag "PP"}
{:phrase ["the" "doping"], :tag "NP"}
{:phrase ["of"], :tag "PP"}
{:phrase ["the" "floor" "region"], :tag "NP"}
{:phrase ["in"], :tag "PP"}
{:phrase ["a" "trench" "structure" "furthermore" "different" "dopings"], :tag "NP"} …
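One way to pull the marked noun and verb phrases out of chunker output like the listing above can be sketched in plain Clojure. This is a minimal sketch, not the speaker's method: it assumes each chunk is a map with `:phrase` and `:tag` keys, and `core-phrases` is a hypothetical helper name introduced only for illustration.

```clojure
(require '[clojure.string :as str])

;; Assumption: chunks are maps with :phrase (vector of tokens) and :tag keys.
;; The sample data is copied from the first few chunks shown above.
(def chunks
  [{:phrase ["in"] :tag "PP"}
   {:phrase ["a" "new" "process"] :tag "NP"}
   {:phrase ["of"] :tag "PP"}
   {:phrase ["making"] :tag "VP"}
   {:phrase ["a" "dmos" "transistor"] :tag "NP"}])

(defn core-phrases
  "Hypothetical helper: keep only noun (NP) and verb (VP) phrases and
  join each phrase's tokens into a single string."
  [chunks]
  (->> chunks
       (filter #(contains? #{"NP" "VP"} (:tag %)))
       (map #(str/join " " (:phrase %)))))

(core-phrases chunks)
;; => ("a new process" "making" "a dmos transistor")
```

Filtering on the tag alone keeps every NP and VP; deciding which of them carry the "core idea" of a patent would need further ranking that this sketch does not attempt.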

35% chance of owning prime property

www.linkedin.com/pulse/legal-information-retrieval-1962-kripa-rajshekhar

‘The cost of preparing data … between five and six cents per line … 31,113 statutory sections (documents) on 611,195 punched cards … between 25 and 40 minutes [to search] on the IBM 7070.’

We have (1) exceeded the author’s computer hardware on a science-fiction scale, and (2) missed the mark on almost all other fronts (algorithms, depth of application, knowledge models, ...).


[Figure: diligence performance with the current toolkit versus diligence augmented with advanced NLP. Axes: Efficiency (e.g. deal screening, financing timeline) and Knowledge (e.g. % of legally relevant patents).]

NLP is an AI-complete problem

“You shall know a word by the company it keeps”

(Firth, J. R. 1957)
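Firth's idea can be made concrete by counting which words appear near a target word. The following is a minimal sketch in plain Clojure; `tokens` and `context-counts` are hypothetical helpers, and the sample sentence is an illustrative string loosely adapted from the patent abstract, not real corpus data.

```clojure
(require '[clojure.string :as str])

(defn tokens
  "Hypothetical tokenizer: lower-case the text and keep runs of letters."
  [text]
  (re-seq #"[a-z]+" (str/lower-case text)))

(defn context-counts
  "Count the words found within `window` positions on either side of
  each occurrence of `target` (the 'company' the word keeps)."
  [words target window]
  (let [v (vec words)
        n (count v)]
    (frequencies
      (for [i (range n)
            :when (= target (v i))
            j (range (max 0 (- i window)) (min n (+ i window 1)))
            :when (not= i j)]
        (v j)))))

(context-counts
  (tokens "the doping of the side walls and the doping of the floor region")
  "doping" 1)
;; => {"the" 2, "of" 2}
```

Words whose context counts look alike tend to be used alike; that intuition is what vector models of meaning build on.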

1. Automatic Summarization: Produce a readable summary of text (e.g. articles in the financial section of a newspaper).
2. Coreference Resolution: Given a body of text, determine which words (called "mentions") refer to the same objects.
3. Discourse Analysis: Discover the nature of relationships (e.g. elaboration, explanation, contrast, yes-no question, content question, statement).
4. Machine Translation: Translate written text in one language into another.
5. Named Entity Recognition (NER): Given a stream of text, determine which items map to proper names and identify the type of each (e.g. person, place, organization).
6. Natural Language Generation: Convert data into readable language.
7. Part of Speech Tagging: Identify noun, verb, conjunction, pronoun, etc.
8. Question Answering: Given a natural-language question, generate a natural-language answer.
9. Relationship Extraction: Identify relationships between named entities.
10. Sentiment Analysis: Extract subjective information (e.g. response to a product release from social media).

… and many more could be added.

Christopher Manning’s 2015 Survey of NLP: http://science.sciencemag.org/content/349/6245/261.full

Stanford NLP - Book: http://nlp.stanford.edu/IR-book/

Stanford NLP - 102 Course Videos: https://www.youtube.com/playlist?list=PL6397E4B26D00A269

OpenNLP: https://opennlp.apache.org/

https://github.com/dakrone/clojure-opennlp

Word2Vec: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Instructions for example POS run, assuming lein repl available:

git clone https://github.com/dakrone/clojure-opennlp
cd clojure-opennlp/
lein repl
=> (use 'opennlp.nlp)
=> (use 'opennlp.tools.filters)
=> (def get-sentences (make-sentence-detector "models/en-sent.bin"))
=> (def tokenize (make-tokenizer "models/en-token.bin"))
=> (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
=> (tokenize "Mr. Adams seriously suggested that the answer was 42")
=> (pos-tag *1)
=> (nouns *1)
=> (verbs (pos-tag (tokenize "There is a theory which states that if ever anyone discovers exactly what the Universe is for and why it is here, it will instantly disappear and be replaced by something even more bizarre and inexplicable. There is another theory mentioned, which states that this has already happened.")))


Instructions for example W2V run:

# Uses https://github.com/Bridgei2i/clojure-word2vec
# Which uses https://github.com/medallia/Word2VecJava
# Edit: project.clj dependencies [org.bridgei2i/word2vec "0.2.1"]
curl "http://www.gutenberg.org/files/98/98-0.txt" > dickens.txt
lein repl
=> (ns clojure-word2vec.examples (:require [clojure-word2vec.core :refer :all] [clojure.java.io :as io]))
=> (def data (create-input-format "dickens.txt"))
=> (def model (word2vec data :window-size 15))
=> (count (.getVocab model))
=> (get-matches model "wild") ; compare with: grep "wild" dickens.txt
=> (.getRawVector (.forSearch model) "wild")
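Once raw vectors are available, a common way to compare two words is the cosine of the angle between their vectors. The sketch below is plain Clojure and does not call the word2vec library; it assumes a raw vector can be treated as a non-zero sequence of numbers, and the three toy vectors are invented for illustration rather than taken from a trained model.

```clojure
(defn dot
  "Dot product of two equal-length numeric sequences."
  [a b]
  (reduce + (map * a b)))

(defn cosine-similarity
  "Cosine of the angle between vectors a and b (assumes neither is all zeros)."
  [a b]
  (/ (dot a b)
     (* (Math/sqrt (dot a a)) (Math/sqrt (dot b b)))))

;; Toy 3-dimensional vectors, made up for this example only.
(def wild   [0.9 0.1 0.3])
(def savage [0.8 0.2 0.4])
(def ledger [0.0 1.0 0.1])

;; Vectors pointing in similar directions score closer to 1.
(> (cosine-similarity wild savage)
   (cosine-similarity wild ledger))
;; => true
```

With a trained model, the same function could presumably be applied to two raw vectors after converting them with `vec`, though that conversion is an assumption about the returned type and is not shown in the original instructions.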


“How could a little mathematics transmute itself into linguistics?”

(Harris, Z. 1991)

Thank you.

[email protected]


