Text Intensive Diligence - Clojure NLP (January 18, 2017)

Weighing our Words Clojure Chicago Meetup (January 18, 2017) Kripa Rajshekhar (Metonymy Labs) [email protected]

Upload: kripa-rajshekhar

Post on 12-Apr-2017


TRANSCRIPT

Page 1: Text Intensive Diligence - Clojure NLP (January 18, 2017)

Weighing our Words

Clojure Chicago Meetup (January 18, 2017)

Kripa Rajshekhar (Metonymy Labs)

[email protected]

Page 7: Text Intensive Diligence - Clojure NLP (January 18, 2017)

Patent owner success rates in litigation

Page 8: Text Intensive Diligence - Clojure NLP (January 18, 2017)

Patent text - “natural” language processing

TITLE: Process for manufacturing a DMOS transistor

ABSTRACT: In a new process of making a DMOS transistor, the doping of the sloping side walls can be set independently from the doping of the floor region in a trench structure. Furthermore, different dopings can be established among the side walls. This is achieved especially by a sequence of implantation doping, etching to form the trench, formation of a scattering oxide protective layer on the side walls, and two-stage perpendicular and tilted final implantation doping. For DMOS transistors, this achieves high breakthrough voltages even with low turn-on resistances, and reduces the space requirement, in particular with regard to driver structures.

Page 9: Text Intensive Diligence - Clojure NLP (January 18, 2017)

Extracting the core ideas …

{:phrase ["in"], :tag "PP"}
{:phrase ["a" "new" "process"], :tag "NP"} <<<<<<<<<<
{:phrase ["of"], :tag "PP"}
{:phrase ["making"], :tag "VP"} <<<<<<<<<<
{:phrase ["a" "dmos" "transistor"], :tag "NP"} <<<<<<<<<<
{:phrase ["the" "doping"], :tag "NP"}
{:phrase ["of"], :tag "PP"}
{:phrase ["the" "sloping" "side" "walls"], :tag "NP"}
{:phrase ["can" "be" "set"], :tag "VP"}
{:phrase ["independently"], :tag "ADVP"}
{:phrase ["from"], :tag "PP"}
{:phrase ["the" "doping"], :tag "NP"}
{:phrase ["of"], :tag "PP"}
{:phrase ["the" "floor" "region"], :tag "NP"}
{:phrase ["in"], :tag "PP"}
{:phrase ["a" "trench" "structure" "furthermore" "different" "dopings"], :tag "NP"} …
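One way to pull candidate "core ideas" out of chunker output like the listing above is to keep only the noun and verb phrases. The sketch below is a simplification, not the speaker's method: it assumes each chunk is a map with a :phrase vector and a :tag string, and the helper name core-phrases is hypothetical. The slide highlights only some NP/VP chunks, so a real selection rule would need more than a tag filter.

```clojure
(require '[clojure.string :as str])

;; A few chunks copied from the slide, assuming :phrase/:tag keys.
(def chunks
  [{:phrase ["in"] :tag "PP"}
   {:phrase ["a" "new" "process"] :tag "NP"}
   {:phrase ["of"] :tag "PP"}
   {:phrase ["making"] :tag "VP"}
   {:phrase ["a" "dmos" "transistor"] :tag "NP"}])

(defn core-phrases
  "Keep chunks whose tag is in the set `tags` and join each phrase into a string.
  Hypothetical helper for illustration only."
  [tags chunks]
  (->> chunks
       (filter #(contains? tags (:tag %)))
       (map #(str/join " " (:phrase %)))))

(core-phrases #{"NP" "VP"} chunks)
;; => ("a new process" "making" "a dmos transistor")
```

Passing the tag set as an argument keeps the filter reusable, for example with #{"NP"} alone to list only candidate entities.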

Page 10: Text Intensive Diligence - Clojure NLP (January 18, 2017)

35% chance of owning prime property

Page 11: Text Intensive Diligence - Clojure NLP (January 18, 2017)

www.linkedin.com/pulse/legal-information-retrieval-1962-kripa-rajshekhar

‘The cost of preparing data … between five and six cents per line … 31,113 statutory sections (documents) on 611,195 punched cards … between 25 and 40 minutes [to search] on the IBM 7070.’

We have (1) exceeded the author’s computer hardware on a science-fiction scale, and (2) missed the mark on almost all other fronts (algorithms, depth of application, knowledge models, ...).

Page 13: Text Intensive Diligence - Clojure NLP (January 18, 2017)

Diligence performance with current toolkit

Augmented with advanced NLP diligence

Efficiency e.g. Deal screening, Financing timeline

Knowledge e.g. % of legally relevant patents

Page 14: Text Intensive Diligence - Clojure NLP (January 18, 2017)

NLP is an AI-complete problem

“You shall know a word by the company it keeps”

(Firth, J. R. 1957)
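Firth's remark is commonly read as the distributional idea: a word can be characterised by counting the words that occur near it. The following is a minimal sketch of that counting step, assuming a simple lowercase tokenizer and a symmetric window; the sample sentence is a made-up fragment echoing the patent abstract, and the function names are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn tokens
  "Lowercase the text and split it into alphabetic tokens."
  [text]
  (re-seq #"[a-z]+" (str/lower-case text)))

(defn context-counts
  "For each word, count the words appearing within `window` positions of it.
  Returns a nested map: word -> neighbour -> count."
  [window words]
  (let [v (vec words)
        n (count v)]
    (reduce
      (fn [acc i]
        (let [lo (max 0 (- i window))
              hi (min n (+ i window 1))
              neighbours (concat (subvec v lo i) (subvec v (inc i) hi))]
          (reduce (fn [a w] (update-in a [(v i) w] (fnil inc 0)))
                  acc
                  neighbours)))
      {}
      (range n))))

;; Hypothetical sample text, not taken verbatim from the slides.
(def counts
  (context-counts 2 (tokens "the doping of the side walls and the doping of the floor region")))

(get counts "doping")
;; => {"the" 4, "of" 2, "and" 1}
```

Rows of this table are the "company" each word keeps; comparing rows is one simple way to compare words.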

Page 15: Text Intensive Diligence - Clojure NLP (January 18, 2017)

1. Automatic Summarization: Produce a readable summary of text (e.g. articles in the financial section of a newspaper).
2. Coreference Resolution: Given a body of text, determine which words (called "mentions") refer to the same objects.
3. Discourse Analysis: Discover the nature of discourse relationships within a text (e.g. elaboration, explanation, contrast, yes-no question, content question, statement).
4. Machine Translation: Translate written text in one language into another.
5. Named Entity Recognition (NER): Given a stream of text, determine which items map to proper names and identify their type (e.g. place, name, organization).
6. Natural Language Generation: Convert data into readable language.
7. Part of Speech Tagging: Identify noun, verb, conjunction, pronoun, etc.
8. Question Answering: Given a natural language question, generate a natural language answer.
9. Relationship Extraction: Identify the relationships between named entities.
10. Sentiment Analysis: Extract subjective information (e.g. response to a product release from social media).

... Many more could be added.

Page 17: Text Intensive Diligence - Clojure NLP (January 18, 2017)

Instructions for example POS run, assuming lein repl available:

git clone https://github.com/dakrone/clojure-opennlp
cd clojure-opennlp/
lein repl
=> (use 'opennlp.nlp)
=> (use 'opennlp.tools.filters)
=> (def get-sentences (make-sentence-detector "models/en-sent.bin"))
=> (def tokenize (make-tokenizer "models/en-token.bin"))
=> (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
=> (tokenize "Mr. Adams seriously suggested that the answer was 42")
=> (pos-tag *1)
=> (nouns *1)
=> (verbs (pos-tag (tokenize "There is a theory which states that if ever anyone discovers exactly what the Universe is for and why it is here, it will instantly disappear and be replaced by something even more bizarre and inexplicable. There is another theory mentioned, which states that this has already happened.")))
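To make the nouns and verbs steps less opaque, here is a library-free sketch of filtering tagged tokens by tag prefix. It assumes the tagger yields [token tag] pairs; the tags below are hand-assigned Penn Treebank style labels for illustration, not captured model output, and by-tag-prefix is a hypothetical helper rather than the library's own filter implementation. It uses clojure.string/starts-with?, which requires Clojure 1.8 or later.

```clojure
(require '[clojure.string :as str])

;; Hand-written sample in [token tag] form; tags are illustrative only.
(def tagged
  [["Mr." "NNP"] ["Adams" "NNP"] ["seriously" "RB"] ["suggested" "VBD"]
   ["that" "IN"] ["the" "DT"] ["answer" "NN"] ["was" "VBD"] ["42" "CD"]])

(defn by-tag-prefix
  "Keep the [token tag] pairs whose tag starts with `prefix`."
  [prefix tagged]
  (filter (fn [[_ tag]] (str/starts-with? tag prefix)) tagged))

(map first (by-tag-prefix "NN" tagged))
;; => ("Mr." "Adams" "answer")

(map first (by-tag-prefix "VB" tagged))
;; => ("suggested" "was")
```

Matching on a prefix groups related tags together, so "NN" covers both common nouns (NN) and proper nouns (NNP).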

Page 18: Text Intensive Diligence - Clojure NLP (January 18, 2017)

Instructions for example W2V run:

# Uses https://github.com/Bridgei2i/clojure-word2vec
# Which uses https://github.com/medallia/Word2VecJava
# Edit: project.clj dependencies [org.bridgei2i/word2vec "0.2.1"]
curl "http://www.gutenberg.org/files/98/98-0.txt" > dickens.txt
lein repl
=> (ns clojure-word2vec.examples (:require [clojure-word2vec.core :refer :all] [clojure.java.io :as io]))
=> (def data (create-input-format "dickens.txt"))
=> (def model (word2vec data :window-size 15))
=> (count (.getVocab model))
=> (get-matches model "wild") ; compare with: grep "wild" dickens.txt
=> (.getRawVector (.forSearch model) "wild")
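Once words are represented as raw vectors, nearby words are commonly found by ranking on cosine similarity. The sketch below shows that ranking in plain Clojure; whether get-matches uses exactly this measure is an assumption, and the three-dimensional vectors and the words "savage" and "ledger" are invented for illustration, not output of the trained model.

```clojure
(defn dot
  "Dot product of two equal-length numeric vectors."
  [a b]
  (reduce + (map * a b)))

(defn norm
  "Euclidean length of a vector."
  [a]
  (Math/sqrt (dot a a)))

(defn cosine-similarity
  "Cosine of the angle between a and b; near 1.0 means same direction."
  [a b]
  (/ (dot a b) (* (norm a) (norm b))))

;; Hypothetical toy vectors; a real model would have far more dimensions.
(def vectors
  {"wild"   [1.0 2.0 0.0]
   "savage" [2.0 4.0 0.0]
   "ledger" [0.0 0.0 3.0]})

(defn nearest
  "Rank every other word by cosine similarity to `word`, best first."
  [vectors word]
  (->> (dissoc vectors word)
       (map (fn [[w v]] [w (cosine-similarity (vectors word) v)]))
       (sort-by second >)))

(map first (nearest vectors "wild"))
;; => ("savage" "ledger")
```

Cosine similarity ignores vector length, which is why "savage" (the same direction as "wild", twice as long) ranks first here.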

Page 19: Text Intensive Diligence - Clojure NLP (January 18, 2017)

“How could a little mathematics transmute itself into linguistics?”

(Harris, Z. 1991)