text intensive diligence - clojure nlp (january 18, 2017)
TRANSCRIPT
Weighing our Words
Clojure Chicago Meetup (January 18, 2017)
Kripa Rajshekhar (Metonymy Labs)
Patent owner success rates in litigation
Patent text - “natural” language processing
TITLE: Process for manufacturing a DMOS transistor
ABSTRACT: In a new process of making a DMOS transistor, the doping of the
sloping side walls can be set independently from the doping of the floor region in a
trench structure. Furthermore, different dopings can be established among the side
walls. This is achieved especially by a sequence of implantation doping, etching to
form the trench, formation of a scattering oxide protective layer on the side walls,
and two-stage perpendicular and tilted final implantation doping. For DMOS
transistors, this achieves high breakthrough voltages even with low turn-on
resistances, and reduces the space requirement, in particular with regard to driver
structures.
Extracting the core ideas …
{:phrase ["in"], :tag "PP"}
{:phrase ["a" "new" "process"], :tag "NP"} <<<
{:phrase ["of"], :tag "PP"}
{:phrase ["making"], :tag "VP"} <<<
{:phrase ["a" "dmos" "transistor"], :tag "NP"} <<<
{:phrase ["the" "doping"], :tag "NP"}
{:phrase ["of"], :tag "PP"}
{:phrase ["the" "sloping" "side" "walls"], :tag "NP"}
{:phrase ["can" "be" "set"], :tag "VP"}
{:phrase ["independently"], :tag "ADVP"}
{:phrase ["from"], :tag "PP"}
{:phrase ["the" "doping"], :tag "NP"}
{:phrase ["of"], :tag "PP"}
{:phrase ["the" "floor" "region"], :tag "NP"}
{:phrase ["in"], :tag "PP"}
{:phrase ["a" "trench" "structure" "furthermore" "different" "dopings"], :tag "NP"} …
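The highlighted chunks suggest one way to pull out the core idea: keep only the noun and verb phrases. The following is a minimal sketch of that selection, not the speaker's actual pipeline. The `chunks` data is hand-copied from the output above, the `:phrase` key is assumed to follow the clojure-opennlp chunker's output shape, and `stop-words` and `core-phrases` are hypothetical names introduced for illustration.

```clojure
(require '[clojure.string :as str])

;; Hand-copied from the chunk listing above (first six chunks only).
;; The :phrase key is an assumption about the chunker's output shape.
(def chunks
  [{:phrase ["in"] :tag "PP"}
   {:phrase ["a" "new" "process"] :tag "NP"}
   {:phrase ["of"] :tag "PP"}
   {:phrase ["making"] :tag "VP"}
   {:phrase ["a" "dmos" "transistor"] :tag "NP"}
   {:phrase ["the" "doping"] :tag "NP"}])

;; Hypothetical, deliberately tiny determiner list.
(def stop-words #{"a" "an" "the"})

(defn core-phrases
  "Keep NP and VP chunks, drop determiners, and join each phrase into a string."
  [chunks]
  (->> chunks
       (filter #(#{"NP" "VP"} (:tag %)))
       (map (fn [{:keys [phrase]}]
              (str/join " " (remove stop-words phrase))))
       (remove empty?)))

(core-phrases chunks)
;; => ("new process" "making" "dmos transistor" "doping")
```

Filtering on the chunk tag alone is crude; it is shown only to make the "keep the NPs and VPs" step concrete.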
35% chance of owning prime property
www.linkedin.com/pulse/legal-information-retrieval-1962-kripa-rajshekhar
‘The cost of preparing data … between five and six cents per line …
31,113 statutory sections (documents) on 611,195 punched cards …
between 25 and 40 minutes [to search] on the IBM 7070.’
We have (1) exceeded the author’s computer hardware on a
science-fiction scale, and (2) missed the mark on almost all other
fronts (algorithms, depth of application, knowledge models, ...)
[Figure: diligence performance with the current toolkit versus diligence augmented with advanced NLP, compared on Efficiency (e.g. deal screening, financing timeline) and Knowledge (e.g. % of legally relevant patents)]
NLP is an AI-complete problem
“You shall know a word by the company it keeps”
(Firth, J. R. 1957)
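Firth's remark can be made concrete by counting which words appear near a target word. The sketch below is an illustration of that idea under stated assumptions, not part of the talk: `context-counts` is a hypothetical helper, the window size is arbitrary, and the input is one clause taken from the patent abstract above.

```clojure
(require '[clojure.string :as str])

(defn context-counts
  "Count the words that occur within `window` positions of each
  occurrence of `target` in the vector `tokens`."
  [tokens target window]
  (let [n (count tokens)]
    (frequencies
     (for [i (range n)
           :when (= target (nth tokens i))
           j (range (max 0 (- i window)) (min n (+ i window 1)))
           :when (not= i j)]
       (nth tokens j)))))

;; One clause from the abstract, split on spaces (str/split returns a vector).
(def tokens
  (str/split (str "the doping of the sloping side walls can be set "
                  "independently from the doping of the floor region")
             #" "))

(context-counts tokens "doping" 2)
;; => {"the" 4, "of" 2, "from" 1}
```

Words with similar context counts keep similar company; word2vec learns dense vectors that capture this kind of information rather than storing raw counts.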
1. Automatic Summarization: Produce a readable summary of a text (e.g. articles in the
financial section of a newspaper).
2. Coreference Resolution: Given a body of text, determine which words (called "mentions")
refer to the same objects.
3. Discourse Analysis: Discover the nature of the relationships between sentences (e.g.
elaboration, explanation, contrast) and classify speech acts (e.g. yes-no question,
content question, statement).
4. Machine Translation: Translate written text in one language into another.
5. Named Entity Recognition (NER): Given a stream of text, determine which items map to
proper names and identify the type of each (e.g. place, person, organization).
6. Natural Language Generation: Convert data into readable language.
7. Part-of-Speech Tagging: Identify the part of speech of each word (noun, verb,
conjunction, pronoun, etc.).
8. Question Answering: Given a natural-language question, generate a natural-language answer.
9. Relationship Extraction: Identify the relationships between named entities.
10. Sentiment Analysis: Extract subjective information (e.g. the response to a
product release from social media).
… Many more could be added.
Christopher Manning’s 2015 Survey of NLP
Stanford NLP - Book, 102 Course Videos
Open NLP
https://opennlp.apache.org/
https://github.com/dakrone/clojure-opennlp
Word2Vec
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Instructions for example POS run, assuming lein repl available:
git clone https://github.com/dakrone/clojure-opennlp
cd clojure-opennlp/
lein repl
=> (use 'opennlp.nlp)
=> (use 'opennlp.tools.filters)
=> (def get-sentences (make-sentence-detector "models/en-sent.bin"))
=> (def tokenize (make-tokenizer "models/en-token.bin"))
=> (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
=> (tokenize "Mr. Adams seriously suggested that the answer was 42")
=> (pos-tag *1)
=> (nouns *1)
=> (verbs (pos-tag (tokenize "There is a theory which states that if ever anyone discovers exactly what the Universe is for and why it is here, it will instantly disappear and be replaced by something even more bizarre and inexplicable. There is another theory mentioned, which states that this has already happened.")))
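To show what a filter such as nouns or verbs does with tagger output, here is a minimal sketch that runs without the OpenNLP models. The `tagged` pairs are hand-written in Penn Treebank style for illustration and were not captured from a real tagger run. `filter-by-tag` is a hypothetical helper, not the library's own function, and it returns bare tokens for brevity.

```clojure
;; Illustrative [token tag] pairs, written by hand (an assumption,
;; not real output from pos-tag).
(def tagged
  [["Mr." "NNP"] ["Adams" "NNP"] ["seriously" "RB"] ["suggested" "VBD"]
   ["that" "IN"] ["the" "DT"] ["answer" "NN"] ["was" "VBD"] ["42" "CD"]])

(defn filter-by-tag
  "Return the tokens whose POS tag matches the regex `re`."
  [re tagged]
  (for [[token tag] tagged
        :when (re-find re tag)]
    token))

(filter-by-tag #"^NN" tagged) ;; => ("Mr." "Adams" "answer")
(filter-by-tag #"^VB" tagged) ;; => ("suggested" "was")
```

Matching on a tag prefix works because Penn Treebank noun tags all begin with NN and verb tags with VB.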
Instructions for example W2V run:
# Uses https://github.com/Bridgei2i/clojure-word2vec
# Which uses https://github.com/medallia/Word2VecJava
# Edit: project.clj dependencies [org.bridgei2i/word2vec "0.2.1"]
curl "http://www.gutenberg.org/files/98/98-0.txt" > dickens.txt
lein repl
=> (ns clojure-word2vec.examples (:require [clojure-word2vec.core :refer :all] [clojure.java.io :as io]))
=> (def data (create-input-format "dickens.txt"))
=> (def model (word2vec data :window-size 15))
=> (count (.getVocab model))
=> (get-matches model "wild") ; grep "wild" dickens.txt
=> (.getRawVector (.forSearch model) "wild")
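Word matching over word vectors is commonly done by comparing vectors with cosine similarity. The sketch below shows that comparison on toy data and makes no claim about how the library implements its matching. The three-dimensional `toy-vectors` are invented for illustration (a trained model's vectors are much longer), and `nearest` is a hypothetical helper.

```clojure
(defn dot [a b] (reduce + (map * a b)))

(defn norm [a] (Math/sqrt (dot a a)))

(defn cosine-similarity
  "Cosine of the angle between vectors a and b (1.0 means same direction)."
  [a b]
  (/ (dot a b) (* (norm a) (norm b))))

;; Invented toy vectors, not taken from any trained model.
(def toy-vectors
  {"wild"   [0.9 0.1 0.3]
   "savage" [0.8 0.2 0.4]
   "tea"    [0.1 0.9 0.2]})

(defn nearest
  "Return the word in `vectors`, other than `word`, whose vector is
  most similar to that of `word`."
  [vectors word]
  (let [v (vectors word)]
    (->> (dissoc vectors word)
         (apply max-key (fn [[_ u]] (cosine-similarity v u)))
         first)))

(nearest toy-vectors "wild")
;; => "savage"
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing word vectors.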
“How could a little mathematics transmute itself into linguistics?”
(Harris, Z. 1991)