
Ontology-Guided Search and Text Mining for Intelligence Gathering

Kurt Godden, Ph.D.
MSR Lab, R&D
[email protected]

Outline

• Definitions of terms
• Customers (Who cares?)
• Finding Text: ontology-guided search
• Text Processing
  – Content extraction
  – Text Mining
• Temporal Data Mining at GM
• Multi-Lingual Text Processing
• Summary

What is Text Mining?

• Data Mining:
  – The process of analyzing data to discover new patterns or relationships
  – The 1st International Conference was KDD-95
  – http://www-aig.jpl.nasa.gov/public/kdd95/
• Text Mining (TM) is a subfield of Data Mining
  – As such, ideally TM is the process of analyzing unstructured text to discover new patterns or relationships
  – In practice, TM often refers simply to the Content Extraction (CE) of structured data from unstructured text, usually by finite-state parsers.

Content Extraction: Structured Data from Unstructured Text

“Company XYZ is known to ship products through the port of Dubai.”

<XYZ-Corp, exports-through, Dubai>

From Text to Actionable Knowledge:

• Automatic multi-language scanning
• Entity and Relation extraction/distillation
• Filtering

[Figure: network graph of extracted entities; node labels are mostly place names, e.g. AdenYemen, DubaiUAE, HongKong, Karachi, ShanghaiPort, WichitaUSA]

Who Cares?

• Government
  – NSA, CIA, DIA, DHS, DARPA
• Industry
  – Automotive
  – Chemical
  – Pharmaceutical
  – Legal
  – Consumer goods
  – Aerospace

Why do they care?

• Intelligence and Security
  – Valdis E. Krebs was able to manually map much of the 9/11 terrorist cell from public documents.
    • http://vlado.fmf.uni-lj.si/pub/networks/doc/Seminar/Krebs.pdf
• Industrial
  – Urban Legend (is it true?): “80% of all corporate knowledge is in text.”
  – Market research
  – Fraud detection
  – Root cause analysis
  – Document clustering and categorization
  – Competitive intelligence
  – Patent analysis
  – etc.


Before Mining Must Come Text

• How to find it?

Ontology-Guided Search (OGS)

• Oft-cited definition of ontology, after T.R. Gruber:
  – An ontology is a formal specification of a shared conceptualization. (Gruber's original 1993 wording: “an explicit specification of a conceptualization.”)
• www.vivisimo.com clusters search results according to semantic categories
• OGS: use an ontology to guide the search so that retrieved documents include not only the keywords of interest, but also terms that are semantically related to those keywords

What ontology to use?

• Public
  – WordNet: http://wordnet.princeton.edu/
    • Organizes content words (N, V, Adj, Adv) into sets of semantically related concepts connected by relations
    • Currently 207k word-sense pairs
      – <bank1, monetary institution>
      – <bank2, land adjacent to river>
• Custom
  – Parts
  – Products
  – Processes
• Tool: Protégé at http://protege.stanford.edu/

Ontology-Guided Search (OGS)

avoids      neighborhood     riot              “driving through”
avoiding    neighborhoods    riots             “drive through”
avoided     suburb           “civil unrest”    “drove through”
            suburbs

• Use ontology to search not only on keywords, but on semantically-related keywords
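The expansion step can be sketched in a few lines of Python. This is only an illustration: the `TOY_ONTOLOGY` dictionary is a hypothetical hand-built stand-in using the related terms from the slide, not WordNet or any real ontology, and the `expand_query` and `quote` helpers are names invented here.

```python
# Toy ontology: each keyword maps to semantically related terms.
# These entries are assumptions copied from the slide's example terms,
# not the contents of WordNet or any real ontology.
TOY_ONTOLOGY = {
    "avoid": ["avoids", "avoiding", "avoided"],
    "neighborhood": ["neighborhoods", "suburb", "suburbs"],
    "riot": ["riots", "civil unrest"],
    "drive through": ["driving through", "drove through"],
}


def quote(term: str) -> str:
    """Quote multi-word terms so a search engine treats them as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(keywords: list[str], ontology: dict[str, list[str]]) -> str:
    """Build a boolean query: OR within a concept, AND across concepts."""
    groups = []
    for keyword in keywords:
        terms = [keyword] + ontology.get(keyword, [])
        groups.append("(" + " OR ".join(quote(t) for t in terms) + ")")
    return " AND ".join(groups)


query = expand_query(["neighborhood", "riot"], TOY_ONTOLOGY)
print(query)
# (neighborhood OR neighborhoods OR suburb OR suburbs) AND (riot OR riots OR "civil unrest")
```

Keywords that are missing from the ontology pass through unchanged, so the expansion never narrows the original query.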

Pitfalls of OGS

• Beware of semantically related terms
• Simulation of OGS using WordNet
  – Original query:
    • Which neighborhoods of Paris are safe?
  – One of several transformed queries was:
    • Which suburbs of Paris are condoms?
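One plausible reading of this failure is that the expansion mixed word senses: "safe" is an adjective in the query, but a sense inventory may also list noun senses for it. The sketch below illustrates that reading with a hypothetical `TOY_SENSES` table (invented for this example, modeled loosely on WordNet-style senses) and shows how restricting the lookup to the part of speech used in the query filters out the unwanted sense.

```python
# Toy sense inventory keyed by (word, part of speech). The entries are
# illustrative assumptions, not data read from WordNet.
TOY_SENSES = {
    ("neighborhood", "noun"): ["suburb", "district"],
    ("safe", "adj"): ["secure"],
    ("safe", "noun"): ["strongbox", "condom"],
}


def related_terms(word, pos=None):
    """Return related terms; with pos=None all senses are mixed together."""
    terms = []
    for (w, p), synonyms in TOY_SENSES.items():
        if w == word and (pos is None or p == pos):
            terms.extend(synonyms)
    return terms


# Naive lookup ignores part of speech, so the noun senses leak in.
print(related_terms("safe"))         # ['secure', 'strongbox', 'condom']
# Restricting to the adjective sense used in the query avoids them.
print(related_terms("safe", "adj"))  # ['secure']
```

Part-of-speech filtering is only a partial remedy; words with several senses in the same part of speech (such as the two "bank" senses) would still need disambiguation.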

Content Extraction Technology

• Regular Expressions Mapped to Semantic Templates
• Regular Expression for Passives:
    NP1 BE TV [by NP2]
    “The lecture was presented by Kurt Godden”
• Mapping of Match Registers to Template
    <NP2:agent, TV:relation, NP1:object>
    <kg, presented, lecture>
• Post-Processing Rule:
    if NP2 is empty string, then use ‘someone’:agent
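A minimal Python sketch of the passive pattern and its template mapping follows. The sub-patterns for NP1, BE, TV and NP2 are crude assumptions chosen to fit the example sentence (a real system would use a tagger or chunker), and `extract_passive` is a name invented here. The sketch returns the raw agent string rather than a canonical identifier such as `kg`.

```python
import re

# Rough stand-ins for the slide's categories (assumptions for this sketch):
#   NP1 = determiner + one word, BE = a form of "to be",
#   TV  = a word ending in -ed, NP2 = one or more capitalized words.
PASSIVE = re.compile(
    r"\b(?:[Tt]he|[Aa]n?)\s+(?P<np1>\w+)\s+"
    r"(?P<be>is|are|was|were)\s+"
    r"(?P<tv>\w+ed)\b"
    r"(?:\s+by\s+(?P<np2>[A-Z]\w*(?:\s+[A-Z]\w*)*))?"
)


def extract_passive(sentence):
    """Map match registers to an <agent, relation, object> triple."""
    match = PASSIVE.search(sentence)
    if match is None:
        return None
    # Post-processing rule: an empty NP2 becomes the agent 'someone'.
    agent = match.group("np2") or "someone"
    return (agent, match.group("tv"), match.group("np1"))


print(extract_passive("The lecture was presented by Kurt Godden"))
# ('Kurt Godden', 'presented', 'lecture')
print(extract_passive("The lecture was presented."))
# ('someone', 'presented', 'lecture')
```

Named groups play the role of the match registers, which keeps the mapping to the template readable when more patterns are added.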

Content Extraction Example

“Some 40 vehicles were torched in the Val d'Oise area NW of Paris.”
http://www.breitbart.com/news/2005/11/04/D8DLFA780.html

For pattern: NP1 BE TV [by NP2]
    ‘vehicles’ matches NP1
    ‘were’ matches BE
    ‘torched’ matches TV
    No match for NP2

• Canonicalize tokens via a domain ontology (e.g. vehicles→vehicle, torched→burn)
    <someone, burn, vehicle>
• Additional triples can be matched by other RegExp patterns, giving:
    <vehicle, count, 40>
    <vehicle, located-in, val-d’oise>
    <val-d’oise, near, paris>
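The canonicalization step and one of the additional patterns can be sketched as follows. The `CANONICAL` dictionary is a hypothetical two-entry stand-in for a domain ontology, and both regular expressions are simplified assumptions that happen to fit this sentence. The location triples (located-in, near) are left out because they would need gazetteer-style patterns not described in the source.

```python
import re

# Hypothetical domain ontology: surface token -> canonical concept.
CANONICAL = {"vehicles": "vehicle", "torched": "burn"}


def canonicalize(token):
    """Look the token up in the ontology; fall back to its lowercase form."""
    return CANONICAL.get(token.lower(), token.lower())


sentence = "Some 40 vehicles were torched in the Val d'Oise area NW of Paris."
triples = []

# Passive without a by-phrase: the agent defaults to 'someone'.
passive = re.search(r"(\w+)\s+(?:was|were)\s+(\w+ed)\b", sentence)
if passive:
    noun, verb = passive.group(1), passive.group(2)
    triples.append(("someone", canonicalize(verb), canonicalize(noun)))

# Count pattern: a number directly followed by a noun.
count = re.search(r"(\d+)\s+(\w+)", sentence)
if count:
    triples.append((canonicalize(count.group(2)), "count", int(count.group(1))))

print(triples)
# [('someone', 'burn', 'vehicle'), ('vehicle', 'count', 40)]
```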

Why Only Regular Expressions?

• Computational Efficiency
• Practical Adequacy
• Workaround for lack of recursion: lots of REs!

    NP → NP and NP

  becomes

    NP → CN and CN
    NP → CN and CN and CN
    NP → NAME and NAME
    NP → NAME and NAME and NAME
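Rather than writing each bounded expansion by hand, the patterns can be generated. The sketch below assumes toy definitions for `CN` and `NAME` (a small word list and a capitalized-word pattern, both invented for illustration) and enumerates coordinations up to a fixed number of conjuncts, longest first.

```python
import re

# Illustrative stand-ins for the slide's terminal categories (assumptions).
CN = r"(?:lectures|vehicles|suburbs)"   # a tiny list of common nouns
NAME = r"(?:[A-Z][a-z]+)"               # a single capitalized word


def coordination_patterns(units, max_conjuncts):
    """Enumerate 'X and X', 'X and X and X', ... with the longest first."""
    patterns = []
    for n in range(max_conjuncts, 1, -1):
        for unit in units:
            patterns.append(r"\b" + r"\s+and\s+".join([unit] * n) + r"\b")
    return patterns


PATTERNS = [re.compile(p) for p in coordination_patterns([CN, NAME], 3)]


def match_np(text):
    """Return the first coordinated NP found, trying longer patterns first."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None


print(match_np("Alice and Bob and Carol spoke"))   # Alice and Bob and Carol
print(match_np("vehicles and suburbs"))            # vehicles and suburbs
```

The bound is the price of avoiding recursion: with `max_conjuncts=3`, a four-way coordination is only matched up to its third conjunct.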

After Text Must Come Mining

• Temporal Data Mining research by K.P. Unnikrishnan (GM R&D) and P.S. Sastry (IISc, Bangalore)
• TDMiner
  – Proprietary tool
  – Discovers frequent sequences of events from symbolic data
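To make "frequent sequences of events" concrete, here is a deliberately naive Python sketch. It is not TDMiner's algorithm, which the slides do not describe; it only counts how often one symbol follows another within a short window and keeps the pairs that reach a threshold. The function name, the window parameter, and the sample event stream are all assumptions made for illustration.

```python
from collections import Counter


def frequent_pairs(events, window, min_count):
    """Count ordered pairs (a, b) where b occurs within `window` steps after a."""
    counts = Counter()
    for i, a in enumerate(events):
        seen = set()
        for b in events[i + 1 : i + 1 + window]:
            if b != a and b not in seen:  # count each follower once per start
                counts[(a, b)] += 1
                seen.add(b)
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical symbolic event stream.
events = list("ABCABDABC")
print(frequent_pairs(events, window=1, min_count=3))
# {('A', 'B'): 3}
```

Longer sequences could be built by extending frequent pairs one event at a time, in the spirit of level-wise pattern mining.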


For More Info:

• 4th Workshop on Temporal Data Mining: Network Reconstruction from Dynamic Data
  – http://www.kdd2006.com/workshops.html

• Laxman, Sastry and Unnikrishnan. “Discovering Frequent Episodes and Learning Hidden Markov Models: a Formal Connection.” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1505-1517. 2005

Network Reconstruction

• How to determine directed, acyclic graphs from sequential event data

[Figure: example directed graph with nodes x, z, a, n, p, g]


Multilingual Problem

• What if source text is not in English?

Machine Translation (MT)

• Free, web-based tools are not state-of-the-art
    e.g. http://babelfish.altavista.com/
• LanguageWeaver uses statistical MT
    Spin-off of USC Information Sciences Institute
    www.languageweaver.com


Hypothesis

• Effective Content Extraction rules can be custom-developed for raw machine-translated text.

Summary

• Text Mining Can Offer Real Value
  – Used Extensively by Gov’t Intel Agencies
  – Several COTS tools available for Content Extraction:
    • SAS Text Miner
    • AeroText (Lockheed Martin)
    • ClearForest
    • Attensity
    • etc.
  – GATE: Univ. of Sheffield, open-source
    • http://gate.ac.uk/