
    Mining Unstructured Data

Ronen Feldman, Computer Science Department, Bar-Ilan University, Israel

    ClearForest [email protected]

    2

    Outline

Introduction to Text Mining

Information Extraction

Entity Extraction

    Relationship Extraction

    KE Approach

    IE Languages

ML Approaches to IE

HMM

    Anaphora resolution

    Evaluation

    Link Detection

    Introduction to LD

Architecture of LD systems

    9/11

    Link Analysis Demos


    3

    Background

Rapid proliferation of information available in digital format

People have less time to absorb more information

    4

    The Information Landscape

Unstructured: Textual

Structured: Databases

Problem: Lack of tools to handle unstructured data


    5

Find Documents matching the Query

Display Information relevant to the Query

Extract Information from within the documents

Actual information buried inside documents

Long lists of documents

Aggregate over entire collection

    6

    Text Mining

Input: Documents

Output: Patterns, Connections, Profiles, Trends

Seeing the Forest for the Trees


    7

Read

Consolidate

Absorb / Act

Understand

Find Material

Let Text Mining Do the Legwork for You


    9

    Document Types

    Structured documents

    Output from CGI

    Semi-structured documents

    Seminar announcements

    Job listings

    Ads

    Free format documents

News

Scientific papers

    10

    Text Representations

    Character Trigrams

    Words Linguistic Phrases

    Non-consecutive phrases

    Frames

    Scripts

    Role annotation

    Parse trees


    11

    The 100,000 foot Picture

Development Environment

    12

Intelligent Auto-Tagging

(c) 2001, Chicago Tribune.

    Visit the Chicago Tribune on the Internet at

    http://www.chicago.tribune.com/

    Distributed by Knight Ridder/Tribune

    Information Services.

    By Stephen J. Hedges and Cam Simpson

    Finsbury Park Mosque


    13

    Tagging a Document

BC-dynegy-enron-offer-update5

Dynegy May Offer at Least $8 Bln to Acquire Enron (Update5)

By George Stein

SOURCE: c.2001 Bloomberg News

    BODY

[Tagged entities highlighted in the article: Dynegy Inc; Roger Hamilton, money manager, John Hancock Advisers Inc.; Enron Corp; Moody's Investors Service; downgraded, Baa2, bonds]

``Dynegy has to act fast,'' said Roger Hamilton, a money manager with John Hancock Advisers Inc., which sold its Enron shares in recent weeks. ``If Enron can't get financing and its bonds go to junk, they lose counterparties and their marvelous business vanishes.''

Moody's Investors Service lowered its rating on Enron's bonds to ``Baa2'' and Standard & Poor's cut the debt to ``BBB'' in the past two weeks.

    14

[Tagged entities and relations highlighted in the abstract: kidney enlargement; Vascular endothelial growth factor (VEGF); diabetic glomerular enlargement; insulin-like growth factor I (IGF-I); increase in glomerular volume and kidney weight]

Various growth factors and cytokines have been implicated in different forms of kidney enlargement. Vascular endothelial growth factor (VEGF) is essential for normal renal development and plays a role in diabetic glomerular enlargement. To explore a possible role for VEGF in compensatory renal changes after uninephrectomy, we examined the effect of a neutralizing VEGF-antibody (VEGF-Ab) on glomerular volume and kidney weight in mice treated for 7 days. Serum and kidney insulin-like growth factor I (IGF-I) levels were measured, since IGF-I has been implicated in the pathogenesis of compensatory renal growth, and VEGF has been suggested to be a downstream mediator of IGF-I. Placebo-treated uninephrectomized mice displayed an early transient increase in kidney IGF-I concentration and an increase in glomerular volume and kidney weight. In VEGF-Ab-treated uninephrectomized animals, increased glomerular volume was abolished, whereas renal hypertrophy was partially blocked. Furthermore, the renal effects of VEGF-Ab administration were seen without affecting the renal IGF-I levels. In conclusion, these results demonstrate that compensatory glomerular growth after uninephrectomy is VEGF dependent.

Intelligent Auto-Tagging

Compensatory glomerular growth after unilateral nephrectomy is VEGF dependent.

    Flyvbjerg A, Schrijvers BF, De Vriese AS, Tilton RG, Rasch R.

Medical Department M, Medical Research Laboratories, Institute of Experimental Clinical Research, Aarhus University, DK-8000 Aarhus, Denmark.

    BIOMEDICAL


    15

    A Tagged News Document

    16

    Original Document

    000019641

    3560177

    November 25, 1991

    News Release

Applied Materials, Inc. today announced its newest source technology, called the Durasource, for the Endura 5500 PVD system. This enhanced source includes new magnet configurations, giving the industry's most advanced sputtered aluminum step coverage in sub-micron contacts, and a new one-piece target that more than doubles target life to approximately 8000 microns of deposition compared to conventional two-piece bonded targets. The Durasource enhancement is fully retrofittable to installed Endura 5500 PVD systems. The Durasource technology has been specially designed for 200mm wafer applications, although it is also available for 125mm and 1s0mm wafer sizes. For example, step coverage symmetry is maintained within 3% between the inner and outer walls of contacts across a 200mm wafer. Film thickness uniformity averages 3% (3 sigma) over the life of the target.


    17

    Extraction Results I

:=
DOC NR: 3560177
DOC DATE: 251192
DOCUMENT SOURCE: News Release
CONTENT:
DATE TEMPLATE COMPLETED: 021292
EXTRACTION TIME: 5
COMMENT: article focuses on nonreportable target source but reportable info available
/TOOL_VERSION: LOCKE.3.4
/FILLRULES_VERSION: EME.4.0
:=
PROCESS:
MANUFACTURER:

    18

    Extraction Results II

    :=

    NAME: Applied Materials, INC

    TYPE: COMPANY :=

    TYPE: SPUTTERING

    FILM: ALUMINUM

    EQUIPMENT:

    :=

    NAME_OR_MODEL: Endura(TM) 5500

    MANUFACTURER:

    EQUIPMENT_TYPE: PVD_SYSTEM

    STATUS: IN_USE

    WAFER_SIZE: (200 MM)(121 MM)

    COMMENT: actually three wafer sizes, third is error 1s0mm


    19

    BioMedical Example

Template: Phosphorylation(Protein, Kinase): "X ... some kinase"

Text: "BES1 is phosphorylated and appears to be destabilized by the glycogen synthase kinase-3 (GSK-3) BIN2"

Extracted: Phosphorylation(Protein: BES1, Kinase: glycogen synthase kinase-3 (GSK-3) BIN2)

    20

    Leveraging Content Investment

Any type of content

Unstructured textual content (current focus)

Structured data, audio, video (future)

From any source

WWW, file systems, news feeds, etc.

Single source or combined sources

In any format

Documents, PDFs, Emails, articles, etc.

Raw or categorized

Formal, informal, combination


    21

    The Tagging Process (a Unified View)

[Architecture diagram: ClearTags Suite (Intelligent Auto-Tagging). Components: Tagging Controller; Semantic tagger, Classification tagger, Structural tagger; Rulebooks, Classifiers, Structural Templates; Fetcher; Rich XML API; Unstructured Content; ClearLab with Semantic Trainer, Statistical Trainer and Structural Trainer.]

    22

    Text Mining Horizontal Applications

Text Mining: Lead Generation, Enterprise Portals (EIP), Competitive Intelligence, Content Uniformity, CRM Analysis, Market Research, New Products Offerings, Events Detections and Notifications, IP Management, Content Management


    23

    Text Mining: Vertical Applications

Insurance, Pharma, Publishing, Financial Services, Global Chemical, Legal, Biotech, Defense

    Information Extraction


    25

    What is Information Extraction?

IE does not indicate which documents need to be read by a user; rather, it extracts pieces of information that are salient to the user's needs.

Links between the extracted information and the original documents are maintained to allow the user to reference context.

The kinds of information that systems extract vary in detail and reliability.

Named entities such as persons and organizations can be extracted with reliability in the 90th percentile range, but do not provide the attributes, facts, or events that those entities have or participate in.

    26

    Relevant IE Definitions

Entity: an object of interest such as a person or organization.

Attribute: a property of an entity such as its name, alias, descriptor, or type.

Fact: a relationship held between two or more entities, such as the Position of a Person in a Company.

Event: an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction.


    27

Example (Entities and Facts)

Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. La Jolla Genomatics will release its product Geninfo in June 1999. Geninfo is a turnkey system to assist biotechnology researchers in keeping up with the voluminous literature in all aspects of their field.

Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo. Oliver's brother, Ambrose, follows more in his father's footsteps and will be the CFO of L.J.G., headquartered in the Maddox family's hometown of La Jolla, CA.

Persons: Fletcher Maddox, Dr. Maddox, Oliver, Oliver, Ambrose, Maddox
Organizations: UCSD Business School, La Jolla Genomatics, La Jolla Genomatics, L.J.G.
Locations: La Jolla, CA
Artifacts: Geninfo, Geninfo
Dates: June 1999

PERSON Employee_of ORGANIZATION:
Fletcher Maddox Employee_of UCSD Business School
Fletcher Maddox Employee_of La Jolla Genomatics
Oliver Employee_of La Jolla Genomatics
Ambrose Employee_of La Jolla Genomatics

ARTIFACT Product_of ORGANIZATION:
Geninfo Product_of La Jolla Genomatics

LOCATION Location_of ORGANIZATION:
La Jolla Location_of La Jolla Genomatics
CA Location_of La Jolla Genomatics

    28

    Events and Attributes

COMPANY-FORMATION_EVENT:
COMPANY: La Jolla Genomatics
PRINCIPALS: Fletcher Maddox, Oliver, Ambrose
DATE:
CAPITAL:

RELEASE-EVENT:
COMPANY: La Jolla Genomatics
PRODUCT: Geninfo
DATE: June 1999
COST:

NAME: Fletcher Maddox / Maddox
DESCRIPTOR: former Dean of the UCSD Business School / his father / the firm's CEO
CATEGORY: PERSON

NAME: Oliver
DESCRIPTOR: His son / Chief Scientist
CATEGORY: PERSON

NAME: Ambrose
DESCRIPTOR: Oliver's brother / the CFO of L.J.G.
CATEGORY: PERSON

NAME: UCSD Business School
DESCRIPTOR:
CATEGORY: ORGANIZATION

NAME: La Jolla Genomatics / L.J.G.
DESCRIPTOR:
CATEGORY: ORGANIZATION

NAME: Geninfo
DESCRIPTOR: its product
CATEGORY: ARTIFACT

NAME: La Jolla
DESCRIPTOR: the Maddox family's hometown
CATEGORY: LOCATION

NAME: CA
DESCRIPTOR:
CATEGORY: LOCATION


    29

    IE Accuracy by Information Type

Information Type | Accuracy
Events | 50-60%
Facts | 60-70%
Attributes | 80%
Entities | 90-98%

    30

Unstructured Text

POLICE ARE INVESTIGATING A ROBBERY THAT OCCURRED AT THE 7-ELEVEN STORE LOCATED AT 2545 LITTLE RIVER TURNPIKE IN THE LINCOLNIA AREA ABOUT 12:30 AM FRIDAY. A 24 YEAR OLD ALEXANDRIA AREA EMPLOYEE WAS APPROACHED BY TWO MEN WHO DEMANDED MONEY. SHE RELINQUISHED AN UNDISCLOSED AMOUNT OF CASH AND THE MEN LEFT. NO ONE WAS INJURED. THEY WERE DESCRIBED AS BLACK, IN THEIR MID TWENTIES, BOTH WERE FIVE FEET NINE INCHES TALL, WITH MEDIUM BUILDS, BLACK HAIR AND CLEAN SHAVEN. THEY WERE BOTH WEARING BLACK PANTS AND BLACK COATS. ANYONE WITH INFORMATION ABOUT THE INCIDENT OR THE SUSPECTS INVOLVED IS ASKED TO CALL POLICE AT (703) 555-5555.


    31

    Structured (Desired) Information

Day | Time | Town | Address | Crime
FRIDAY | 3:00 AM | FAIRFAX | 7-ELEVEN STORE LOCATED AT 5624 OX ROAD | ROBBERY
FRIDAY | 12:45 AM | LINCOLNIA | 7-ELEVEN STORE LOCATED AT 2545 LITTLE RIVER TURNPIKE | ROBBERY
SUNDAY | 11:30 PM | ANNANDALE | 8700 BLOCK OF LITTLE RIVER TURNPIKE | ABDUCTION

    32

    MUC Conferences

Conference | Year | Topic
MUC 7 | 1997 | Space Vehicles and Missile Launches
MUC 6 | 1995 | Management Changes
MUC 5 | 1993 | Joint Venture and Micro Electronics
MUC 4 | 1992 | Terrorist Activity
MUC 3 | 1991 | Terrorist Activity
MUC 2 | 1989 | Naval Operations
MUC 1 | 1987 | Naval Operations


    33

The ACE Evaluation

The ACE program is dedicated to the challenge of extracting content from human language. This is a fundamental capability that the ACE program addresses with a basic research effort that is directed to master first the extraction of entities, then the extraction of relations among these entities, and finally the extraction of events that are causally related sets of relations.

After two years of research on the entity detection and tracking task, top systems have achieved a capability that is useful by itself and that, in the context of the ACE EDT task, successfully captures and outputs well over 50 percent of the value at the entity level.

Here value is defined to be the benefit derived by successfully extracting the entities, where each individual entity provides a value that is a function of the entity type (i.e., person, organization, etc.) and level (i.e., named, unnamed). Thus each entity contributes to the overall value through the incremental value that it provides.

    34

[Bar chart: normalized value/cost (in % relative to maximum value) per entity type (FAC, LOC, GPE, ORG, PER), comparing actual value against maximum value, averaged over the top 4 systems. Miss 36%, False Alarm 22%, Type Error 6%.]

ACE Entity Detection & Tracking Evaluation, 2/2002

Goal: Extract entities. Each entity is assigned a value. This value is a function of its Type and Level. The value is gained when the entity is successfully detected, and lost when an entity is missed, spuriously detected, or mischaracterized.

    Table of Entity Values

Level | FAC | LOC | GPE | ORG | PER
NAM | 0.05 | 0.1 | 0.25 | 0.5 | 1
NOM | 0.01 | 0.02 | 0.05 | 0.1 | 0.2
PRO | 0.002 | 0.004 | 0.01 | 0.02 | 0.04

    Human performance ~80


    35

Applications of Information Extraction

Routing of Information

Infrastructure for IR and for Categorization (higher level features)

Event Based Summarization

Automatic Creation of Databases and Knowledge Bases

    36

    Where would IE be useful?

    Semi-Structured Text

Generic documents like News articles

Most of the information in the document is centered around a set of easily identifiable entities


    37

    Approaches for Building IE

    Systems

Knowledge Engineering Approach

Rules are crafted by linguists in cooperation with domain experts.

Most of the work is done by inspecting a set of relevant documents.

    Can take a lot of time to fine tune the rule set.

    Best results were achieved with KB based IE systems.

    Skilled/gifted developers are needed.

    A strong development environment is a MUST!

    38

    Approaches for Building IE

    Systems

Automatically Trainable Systems

The techniques are based on pure statistics and almost no linguistic knowledge.

They are language independent.

The main input is an annotated corpus.

A relatively small effort is needed to build the rules; however, creating the annotated corpus is extremely laborious.

A huge number of training examples is needed in order to achieve reasonable accuracy.

Hybrid approaches can utilize the user input in the development loop.


    39

    Components of IE System

    Tokenization

Morphological and Lexical Analysis

Syntactic Analysis

    Domain Analysis

    Zoning

    Part of Speech Tagging

Sense Disambiguation

    Deep Parsing

    Shallow Parsing

    Anaphora Resolution

    Integration

    Must

    Advisable

    Nice to have

    Can pass

    40

    The Extraction Engine

[Diagram: the Extraction Engine, accessed through an API, combines Generic Extraction Rules, Generic Lexicons and Thesauri, Customized Extraction Rules, and Domain-Specific Lexicons and Thesauri; it takes a Document Collection as input and produces Entities, Facts and Events.]


    41

Why is IE Difficult?

Different Languages

Morphology is very easy in English, much harder in German and Hebrew.

Identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese and Japanese.

Some languages use orthography (like English) while others (like Hebrew, Arabic, etc.) do not have it.

Different types of style

Scientific papers

Newspapers

Memos

Emails

Speech transcripts

Type of Document

Tables

Graphics

Small messages vs. Books

    42

    Morphological Analysis

    Easy

    English, Japanese

    Listing all inflections of a word is a real possibility

    Medium

French, Spanish

    A simple morphological component adds value.

    Difficult

German, Hebrew, Arabic

A sophisticated morphological component is a must!


    43

    Using Vocabularies

Size doesn't matter: large lists tend to cause more mistakes

    Examples:

    Said as a person name (male)

    Alberta as a name of a person (female)

It might be better to have small domain-specific dictionaries

    44

    Part of Speech Tagging

    POS can help to reduce ambiguity, and to deal

    with ALL CAPS text. However

    It usually fails exactly when you need it

It is domain dependent, so to get the best results you need to retrain it on a relevant corpus.

    It takes a lot of time to prepare a training corpus.


    45

    A simple POS Strategy

Use a tag frequency table to determine the right POS.

This will lead to elimination of rare senses.

The overhead is very small.

It improves accuracy by a small percentage.

Compared to full POS tagging, it provides a similar boost to accuracy.
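A minimal Python sketch of this strategy (the toy corpus, tag names, and function names below are hypothetical, not from the slides): each word is assigned its single most frequent tag from a precomputed frequency table, which drops rare senses at almost no runtime cost.

from collections import Counter, defaultdict

def build_tag_freq(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs from any POS-annotated corpus."""
    freq = defaultdict(Counter)
    for word, tag in tagged_corpus:
        freq[word.lower()][tag] += 1
    return freq

def most_frequent_tag(word, tag_freq, default="NN"):
    """Assign the single most frequent tag seen for this word; rare senses are dropped."""
    counts = tag_freq.get(word.lower())
    return counts.most_common(1)[0][0] if counts else default

# Hypothetical usage:
corpus = [("the", "DT"), ("plant", "NN"), ("plant", "NN"), ("plant", "VB"), ("said", "VBD")]
tag_freq = build_tag_freq(corpus)
print([most_frequent_tag(w, tag_freq) for w in ["the", "plant", "said"]])  # ['DT', 'NN', 'VBD']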

    46

    Dealing with Proper Names

The problem

Impossible to enumerate

New candidates are generated all the time

Hard to provide syntactic rules

Types of proper names

People

Companies

Organizations

Products

Technologies

Locations (cities, states, countries, rivers, mountains)


    47

    Comparing RB Systems with ML

    Based Systems

Corpus | HMM | Rule Based
HUB4 (Transcribed Speech) | 90.6 | 90.3
MUC 7 | 90.4 | 93.7
MUC 6 (Wall Street Journal) | 93 | 96.4

    48

Building a RB Proper Name Extractor

A Lexicon is always a good start.

The rules can be based on the lexicon and on:

The context (preceding/following verbs or nouns)

Regular expressions: Companies: Capital* [,] inc, Capital* corporation; Locations: Capital* Lake, Capital* River

Capitalization

List structure

After the creation of an initial set of rules:

Run on the corpus

Analyze the results

Fix the rules and repeat

This process can take around 2-3 weeks and result in performance of between 85-90% break-even. Better performance can be achieved with more effort (2-3 months), and then performance can get to 95-98%.
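A small Python sketch in the spirit of the regular-expression rules above ("Capital* inc", "Capital* corporation", "Capital* Lake/River"); the exact regexes, names, and example sentence are illustrative assumptions, and a real rulebook would also use the lexicon, context words, and list structure.

import re

# Illustrative regexes following the slide's pattern style; not an actual rulebook.
CAPSEQ = r"(?:[A-Z][a-zA-Z&.-]+\s)+"           # one or more capitalized words
COMPANY = re.compile(CAPSEQ + r"(?:Inc\.?|Corp\.?|Corporation|Ltd\.?)")
LOCATION = re.compile(CAPSEQ + r"(?:Lake|River|Mountains?)")

def extract_candidates(text):
    """Return (type, span-text) pairs proposed by the regular-expression rules."""
    hits = [("COMPANY", m.group().strip()) for m in COMPANY.finditer(text)]
    hits += [("LOCATION", m.group().strip()) for m in LOCATION.finditer(text)]
    return hits

text = "Applied Materials Inc announced a deal near Crater Lake with Dynegy Corp."
print(extract_candidates(text))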


Introduction to HMMs for IE

    50

    Motivation

We can view named entity extraction as a classification problem, where we classify each word as belonging to one of the named entity classes or to the no-name class.

One of the most popular techniques for classifying sequences is the HMM.

An example of using an HMM for another NLP classification task is part-of-speech tagging (Church, 1988; Weischedel et al., 1993).


    51

    What is HMM?

An HMM (Hidden Markov Model) is a finite state automaton with stochastic state transitions and symbol emissions (Rabiner 1989).

The automaton models a probabilistic generative process.

In this process a sequence of symbols is produced by starting in an initial state, transitioning to a new state, emitting a symbol selected by the state, and repeating this transition/emission cycle until a designated final state is reached.

    52

    Disadvantage of HMM

The main disadvantage of using an HMM for information extraction is the need for a large amount of training data, i.e., a carefully tagged corpus.

The corpus needs to be tagged with all the concepts whose definitions we want to learn.


    53

    Notational Conventions

T = length of the sequence of observations (training set)

N = number of states in the model

$q_t$ = the actual state at time $t$

$S = \{S_1, \dots, S_N\}$ (finite set of possible states)

$V = \{O_1, \dots, O_M\}$ (finite set of observation symbols)

$\pi = \{\pi_i\} = \{P(q_1 = S_i)\}$: starting probabilities

$A = \{a_{ij}\} = \{P(q_{t+1} = S_j \mid q_t = S_i)\}$: transition probabilities

$B = \{b_i(O_t)\} = \{P(O_t \mid q_t = S_i)\}$: emission probabilities

    54

    The Classic Problems Related to

    HMMs

Find $P(O \mid \lambda)$: the probability of an observation sequence given the HMM model.

Find the most likely state trajectory given $\lambda$ and $O$.

Adjust $\lambda = (\pi, A, B)$ to maximize $P(O \mid \lambda)$.


    55

Calculating $P(O \mid \lambda)$

The most obvious way to do that would be to enumerate every possible state sequence of length T (the length of the observation sequence). Let $Q = q_1, \dots, q_T$; then by assuming independence between the states we have

$P(O \mid Q, \lambda) = \prod_{i=1}^{T} P(O_i \mid q_i, \lambda) = \prod_{i=1}^{T} b_{q_i}(O_i)$

$P(Q \mid \lambda) = \pi_{q_1} \prod_{i=1}^{T-1} a_{q_i q_{i+1}}$

By using Bayes theorem we have

$P(O, Q \mid \lambda) = P(O \mid Q, \lambda)\, P(Q \mid \lambda)$

Finally, summing over all state sequences, $P(O \mid \lambda) = \sum_{Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda)$.

The main problem with this is that we need to do $2T \cdot N^T$ multiplications, which is certainly not feasible even for a modest T like 10.

    56

    The forward-backward algorithm

In order to solve that we use the forward-backward algorithm, which is far more efficient. The forward part is based on the computation of terms called the alpha terms. We define the alpha values as follows, and compute them inductively in a very efficient way; this calculation requires just $N^2 T$ multiplications:

$\alpha_1(i) = \pi_i\, b_i(O_1)$

$\alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big]\, b_j(O_{t+1})$

$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$


    57

    The backward phase

In a similar manner we can define a backward variable called beta that computes the probability of a partial observation sequence from t+1 to T. The beta variable is also computed inductively, but in a backward fashion:

$\beta_T(i) = 1$

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$
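A compact Python sketch of the forward and backward passes defined above, following the notational conventions slide (pi, A, B); the toy parameters at the bottom are hypothetical.

import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(O_1..O_t, q_t = S_i | model); returns alphas and P(O | model)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # sum_i alpha_t(i) a_ij, then * b_j(O_t+1)
    return alpha, alpha[-1].sum()                      # P(O | model) = sum_i alpha_T(i)

def backward(A, B, obs):
    """beta[t, i] = P(O_t+1..O_T | q_t = S_i, model)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                     # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) # sum_j a_ij b_j(O_t+1) beta_t+1(j)
    return beta

# Hypothetical 2-state, 2-symbol model
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows: states, columns: symbols
obs = [0, 1, 0]
alpha, prob = forward(pi, A, B, obs)
beta = backward(A, B, obs)
print(prob, (alpha * beta).sum(axis=1))  # each per-t sum equals P(O | model)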

    58

Solution for the second problem

Our main goal is to find the optimal state sequence. We will do that by maximizing the probabilities of each state individually.

We will start by defining a set of gamma variables that measure the probability that at time t we are in state $S_i$. The denominator is used just to make gamma a true probability measure:

$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)}$

Now we can find the best state at each time slot in a local fashion:

$q_t = \arg\max_{1 \le i \le N} [\gamma_t(i)]$

The main problem with this approach is that the optimization is done locally, and not on the whole sequence of states. This can lead either to a local maximum or even to an invalid sequence. In order to solve that problem we use a well known dynamic programming algorithm called the Viterbi algorithm.


    59

    The Viterbi Algorithm

    Intuition

Compute the most likely sequence starting with the empty observation sequence; use this result to compute the most likely sequence with an output sequence of length one; recurse until you have the most likely sequence for the entire sequence of observations.

Algorithmic Details

The delta variables compute the highest probability of a partial sequence up to time t that ends in state $S_i$. The psi variables enable us to accumulate the best sequence as we move along the time slices.

1. Initialization:

$\delta_1(i) = \pi_i\, b_i(O_1)$

$\psi_1(i) = 0$

    60

    Viterbi (Cont).

Recursion:

$\delta_t(j) = \max_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big]\, b_j(O_t)$

$\psi_t(j) = \arg\max_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big]$

Termination:

$P^* = \max_{1 \le i \le N} [\delta_T(i)]$

$q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)]$

Reconstruction:

$q_t^* = \psi_{t+1}(q_{t+1}^*)$ for $t = T-1, T-2, \dots, 1$.

The resulting sequence $q_1^*, q_2^*, \dots, q_T^*$ solves Problem 2.


    61

    Viterbi (Example)

[Diagram: a three-state HMM (S1, S2, S3) emitting symbols a and b, with initial probabilities 0.5 and 0.5, transition probabilities drawn from 0.1, 0.4 and 0.5, and emission probabilities 0.5/0.5, 0.2/0.8 and 0.8/0.2 for a/b.]

Observation sequence: a b a

Viterbi trellis (delta values per state and time slice):

state | t1 | t2 | t3
S1 | 0.25 | 0.02 | 0.016
S2 | 0 | 0.04 | 0.008
S3 | 0.1 | 0.08 | 0.016
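A Python sketch of the delta/psi recursion above; since the diagram's exact transition probabilities are not fully recoverable from the transcript, the parameters below are hypothetical rather than the slide's example.

import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most likely state sequence and its probability (delta/psi recursion)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                           # delta_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A                 # scores[i, j] = delta_t-1(i) a_ij
        psi[t] = scores.argmax(axis=0)                     # psi_t(j) = argmax_i ...
        delta[t] = scores.max(axis=0) * B[:, obs[t]]       # delta_t(j) = max_i ... * b_j(O_t)
    # Termination and path reconstruction
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return list(reversed(path)), float(delta[-1].max())

# Hypothetical 3-state model over symbols {a: 0, b: 1}
pi = np.array([0.5, 0.0, 0.5])
A = np.array([[0.1, 0.4, 0.5], [0.4, 0.1, 0.5], [0.5, 0.4, 0.1]])
B = np.array([[0.5, 0.5], [0.2, 0.8], [0.8, 0.2]])
print(viterbi(pi, A, B, [0, 1, 0]))   # -> ([2, 1, 2], ~0.0512) for these hypothetical parameters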

    62

    The Just Research HMM

Each HMM extracts just one field of a given document. If more fields are needed, several HMMs need to be constructed.

The HMM takes the entire document as one observation sequence.

The HMM contains two classes of states, background states and target states. The background states emit words in which we are not interested, while the target states emit words that constitute the information to be extracted.

The state topology is designed by hand and only a few transitions are allowed between the states.


    63

    Possible HMM Topologies

[Diagram of two possible HMM topologies. The first has an initial state, a background state, prefix states, a target state, suffix states, and a final state; the second has only an initial state, a background state, a target state, and a final state.]

    64

    A more General HMM

    Architecture

[Diagram: an initial state, a background state, prefix states, multiple target states, suffix states, and a final state.]


    65

    Experimental Evaluation

Status of Acquisition: 46.7% | Price of Acquisition: 55.3% | Abbreviation of Acquired Company: 40.1% | Acquired Company: 48.1% | Acquiring Company: 30.9%

Speaker: 71.1% | Location: 83.9% | Start Time: 99.1% | End Time: 59.5%

    66

BBN's IdentiFinder

    An ergodic bigram model.

    Each Named Class has a separate region in theHMM.

    The number of states in each NC region is equal to

    |V|. Each word has its own state.

Rather than using plain words, extended words are used. An extended word is a pair <w, f>, where f is a feature of the word w.


    67

BBN's HMM Architecture

[Diagram: Start and End states connected to the name-class regions Company Name, Person Name, and No Name.]

    68

Possible word Features

1. 2 digit number (01)
2. 4 digit number (1996)
3. alphanumeric string (A34-24)
4. digits and dashes (12-16-02)
5. digits and slashes (12/16/02)
6. digits and comma (1,000)
7. digits and period (2.34)
8. any other number (100)
9. All capital letters (CLF)
10. Capital letter and a period (M.)
11. First word of a sentence (The)
12. Initial letter of the word is capitalized (Albert)
13. word in lower case (country)
14. all other words and tokens (;)
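A simplified Python sketch of such a feature function (the feature names and regular expressions are illustrative assumptions; IdentiFinder's actual feature computation is not shown in the slides):

import re

def word_feature(token, first_in_sentence=False):
    """Map a token to one of the 14 feature classes listed above (a simplified sketch)."""
    checks = [
        ("twoDigitNum",            lambda t: re.fullmatch(r"\d{2}", t)),
        ("fourDigitNum",           lambda t: re.fullmatch(r"\d{4}", t)),
        ("containsDigitAndAlpha",  lambda t: re.fullmatch(r"(?=.*\d)(?=.*[A-Za-z])[\w-]+", t)),
        ("containsDigitAndDash",   lambda t: re.fullmatch(r"\d[\d-]*-[\d-]*", t)),
        ("containsDigitAndSlash",  lambda t: re.fullmatch(r"\d[\d/]*/[\d/]*", t)),
        ("containsDigitAndComma",  lambda t: re.fullmatch(r"\d[\d,]*,[\d,]*", t)),
        ("containsDigitAndPeriod", lambda t: re.fullmatch(r"\d[\d.]*\.[\d.]*", t)),
        ("otherNum",               lambda t: re.fullmatch(r"\d+", t)),
        ("allCaps",                lambda t: re.fullmatch(r"[A-Z]+", t)),
        ("capPeriod",              lambda t: re.fullmatch(r"[A-Z]\.", t)),
        ("firstWord",              lambda t: first_in_sentence and t[:1].isalpha()),
        ("initCap",                lambda t: re.fullmatch(r"[A-Z][a-z]+", t)),
        ("lowerCase",              lambda t: re.fullmatch(r"[a-z]+", t)),
    ]
    for name, test in checks:
        if test(token):
            return name
    return "other"          # all other words and tokens, e.g. ";"

for tok in ["01", "1996", "A34-24", "12/16/02", "1,000", "2.34", "CLF", "M.", "Albert", "country", ";"]:
    print(tok, "->", word_feature(tok))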


    69

    Statistical Model

The design of the formal model is done in levels.

At the first level we have the most accurate model, which requires the largest amount of training data.

At the lower levels we have back-off models that are less accurate but also require much smaller amounts of training data.

We always try to use the most accurate model possible given the amount of available training data.

    70

    Computing State Transition

    Probabilities When we want to analyze formally the probability of

    annotating a given word sequence with a set of name

    classes, we need to consider three different statistical

    models:

    A model for generating a name class

    A model to generate the first word in a name class

    A model to generate all other words (but the first word) in

    a name class


    71

Computing the Probabilities: Details

The model to generate a name class depends on the previous name class and on the word that precedes the name class; this is the last word in the previous name class, and we annotate it by $w_{-1}$. So formally this amounts to $P(NC \mid NC_{-1}, w_{-1})$.

The model to generate the first word in a name class depends on the current name class and the previous name class, and hence is $P(\langle w, f \rangle_{first} \mid NC, NC_{-1})$.

The model to generate all other words within the same name class depends on the previous word (within the same name class) and the current name class, so formally it is $P(\langle w, f \rangle \mid \langle w, f \rangle_{-1}, NC)$.

    72

    The Actual Computation

$P(NC \mid NC_{-1}, w_{-1}) = \frac{c(NC, NC_{-1}, w_{-1})}{c(NC_{-1}, w_{-1})}$

$P(\langle w, f \rangle_{first} \mid NC, NC_{-1}) = \frac{c(\langle w, f \rangle_{first}, NC, NC_{-1})}{c(NC, NC_{-1})}$

where $c(\cdot)$ counts the occurrences of the corresponding events in the training data.

Recall/precision on MUC data, by training corpus:

Category | ace+muc7 (r/p) | muc7 (r/p)
percent | 50/29.6 | 93.7/40.54
money | 86.6/82.1 | 97.6/82.1
location | 77.7/91.7 | 90.7/91.3
time | 68.6/92.5 | 76.4/77.6
date | 59/89.5 | 90.9/76.6
organization | 83.1/95.9 | 91.1/93.7
person | 84.9/88.6 | 91.9/85.5


    87

    Results with our new algorithm

[Learning-curve chart: F-measure (%) versus amount of training data in KWords, on a logarithmic scale from 10.7 to 3741 (each tick multiplies the amount by 7/6), for DATE, LOCATION, ORGANIZATION and PERSON, compared against the corresponding Nymble curves (DATE-Nymble, LOC-Nymble, ORG-Nymble); a marker indicates the currently available amount of training data.]

    88

Why is HMM not enough?

The HMM model is flat, so the most it can do is assign a tag to each token in a sentence.

This is suitable for tasks where the tagged sequences do not nest and where there are no explicit relations between the sequences.

Part-of-speech tagging and entity extraction belong to this category.

Extracting relationships is different, because the tagged sequences can (and must) nest, and there are relations between them which must be explicitly recognized.

<ACQUISITION> Ethicon Endo-Surgery, Inc., a Johnson & Johnson company, has acquired <ACQUIRED>Obtech Medical AG</ACQUIRED> </ACQUISITION>


    89

A Hybrid Approach

The hybrid strategy attempts to strike a balance between the two knowledge engineering chores: writing the extraction rules and manually tagging the documents.

In this strategy, the knowledge engineer writes SCFG rules, which are then trained on the data which is available.

The powerful disambiguating ability of the SCFG makes writing rules a much simpler and cleaner task.

The knowledge engineer has control over the generality of the rules (s)he writes, and consequently over the amount and the quality of the manually tagged training data the system would require.

    90

    Defining SCFG

Classical definition: A stochastic context-free grammar (SCFG) is a quintuple G = (T, N, S, R, P), where

T is the alphabet of terminal symbols (tokens)

N is the set of nonterminals

S is the starting nonterminal

R is the set of rules

P : R → [0..1] defines their probabilities.

The rules have the form n → s1 s2 ... sk, where n is a nonterminal and each si is either a token or another nonterminal.

An SCFG is a usual context-free grammar with the addition of the P function.
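A minimal Python sketch of the quintuple as a data structure, together with the constraint (stated on the next slide) that the rule probabilities of each nonterminal sum to one; the toy grammar below is hypothetical.

from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    head: str            # nonterminal n
    body: tuple          # s1 s2 ... sk: tokens and/or nonterminals
    prob: float          # P(r)

def check_scfg(nonterminals, rules, tol=1e-9):
    """For every nonterminal n, the sum of P(r) over rules headed by n must equal 1."""
    totals = defaultdict(float)
    for r in rules:
        totals[r.head] += r.prob
    return {n: abs(totals[n] - 1.0) < tol for n in nonterminals}

# Hypothetical toy grammar: S is the start symbol, "a"/"b" are terminal tokens
rules = [
    Rule("S", ("NP", "VP"), 1.0),
    Rule("NP", ("a",), 0.7),
    Rule("NP", ("a", "NP"), 0.3),
    Rule("VP", ("b",), 1.0),
]
print(check_scfg(["S", "NP", "VP"], rules))   # all True: a well-formed probability function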


    91

    The Probability Function

If r is the rule n → s1 s2 ... sk, then P(r) is the frequency of expanding n using this rule.

In Bayesian terms, if it is known that a given sequence of tokens was generated by expanding n, then P(r) is the a priori likelihood that n was expanded using the rule r.

Thus, it follows that for every nonterminal n, the sum of P(r) over all rules r headed by n must equal one.

    92

    IE with SCFG

A very basic parsing is employed for the bulk of a text, but within the relevant parts the grammar is much more detailed.

The IE grammars can be said to define sublanguages for very specific domains.


    93

IE with SCFG

In the classical definition of SCFG it is assumed that the rules are all independent. In this case it is possible to find the (unconditional) probability of a given parse tree by simply multiplying the probabilities of all rules participating in it.

The usual parsing problem is, given a sequence of tokens (a string) S, to find the most probable parse tree T which could generate S. A simple generalization of the Viterbi algorithm is able to efficiently solve this problem.

In practical applications of SCFGs, it is rarely the case that the rules are truly independent. Then, the easiest way to cope with this problem while leaving most of the formalism intact is to let the probabilities P(r) be conditioned upon the context where the rule is applied.

    94

Markovian SCFG

HMM entity extractors are a simple case of markovian SCFGs.

Every possible rule which can be formed from the available symbols has nonzero probability.

Usually, all probabilities are initially set to be equal, and then adjusted according to the distributions found in the training data.


    95

    Training Issues

For some problems the available training corpora appear to be adequate.

In particular, markovian SCFG parsers trained on the Penn Treebank perform quite well (Collins 1997, Charniak 2000, Roark 2001, etc.).

But for the task of relationship extraction it turns out to be impractical to manually tag the amount of documents that would be sufficient to adequately train a markovian SCFG.

At a certain point it becomes more productive to go back to the original hand-crafted system and write rules for it, even though it is much more skilled labor!

    96

SCFG Syntax

A rulebook consists of declarations and rules. All nonterminals must be declared before usage.

Some of them can be declared as output concepts, which are the entities, events, and facts that the system is designed to extract. Additionally, two classes of terminal symbols also require declaration: termlists and ngrams.

A termlist is a collection of terms from a single semantic category, either written explicitly or loaded from an external source.

An ngram can expand to any single token, but the probability of generating a given token is not fixed in the rules; it is learned from the training dataset and may be conditioned upon one or more previous tokens. Thus, ngrams are one of the ways the probabilities of the SCFG rules can be context-dependent.


    97

Example

output concept Acquisition(Acquirer, Acquired);
ngram AdjunctWord;
nonterminal Adjunct;
Adjunct :- AdjunctWord Adjunct | AdjunctWord;
termlist AcquireTerm = acquired bought (has acquired) (has bought);
Acquisition :- Company->Acquirer [ "," Adjunct "," ] AcquireTerm Company->Acquired;

    98

Emulation of an HMM Entity Extractor in SCFG

output concept Company();
ngram CompanyFirstWord;
ngram CompanyWord;
ngram CompanyLastWord;
nonterminal CompanyNext;
Company :- CompanyFirstWord CompanyNext | CompanyFirstWord;
CompanyNext :- CompanyWord CompanyNext | CompanyLastWord;


    99

Putting it all together

start Text;
nonterminal None;
ngram NoneWord;
None :- NoneWord None | ;
Text :- None Text | Company Text | Acquisition Text | ;

    100

    SCFG Training Currently there are three different classes of

    trainable parameters in a TEG rulebook:

    the probabilities of rules of nonterminals

    the probabilities of different expansions of ngrams

    the probabilities of terms in a wordclass.

    All those probabilities are smoothed maximum

    likelihood estimates, calculated directly from the

    frequencies of the corresponding elements in the

    training dataset.


    101

    Sample Rules

    concept Text;

    concept output Person;

    ngram NGFirstName;

    ngram NGLastName;

    ngram NGNone;

    wordclass WCHonorific = Mr Mrs Miss Ms Dr;

    Person :- WCHonorific NGLastName;

    Person :- NGFirstName NGLastName;

    Text :- NGNone Text;

    Text :- Person Text;

    Text :- ;

    102

    Training the rule book

By default, the initial untrained frequencies of all elements are assumed to be 1. They can be changed using syntax.

Let us train this rulebook on a training set containing one sentence: "Yesterday, Dr Simmons, the distinguished scientist, presented the discovery."

This is done in two steps. First, the sentence is parsed using the untrained rulebook, but with the constraints specified by the annotations. In our case the constraints are satisfied by two different parses:


    103

2 Possible Parse Trees

[Two parse trees shown side by side, identical except for the expansion of the Person nonterminal: in the first, "Dr" is generated by WCHonorific and "Simmons" by NGLastName; in the second, "Dr" is generated by NGFirstName and "Simmons" by NGLastName. The surrounding tokens ("Yesterday", the commas, and the rest of the sentence) are generated by NGNone under Text in both trees.]

    104

    How to pick the right Parse?

The difference is in the expansion of the Person nonterminal. Both Person rules can produce the output instance, therefore there is an ambiguity.

In this case it is resolved in favor of the WCHonorific interpretation, because in the untrained rulebook we have

P(Dr | WCHonorific) = 1/5 (choice of one term among five equiprobable ones),

P(Dr | NGFirstName) = 1/N, where N is the number of all known words.
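Spelling the comparison out (a short derivation under the stated untrained-rulebook assumptions): the two parses share every other rule and emission, so their probabilities differ only in how "Dr" is generated,

$$P(\text{parse}_1) \propto P(\text{Dr} \mid \text{WCHonorific}) = \tfrac{1}{5}, \qquad P(\text{parse}_2) \propto P(\text{Dr} \mid \text{NGFirstName}) = \tfrac{1}{N},$$

and since N (the number of all known words) is far larger than 5, the WCHonorific parse wins.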


    105

    The Trained Rule Book

    concept Text;

    concept output Person;

    ngram NGFirstName;

    ngram NGLastName;

    ngram NGNone;

    wordclass WCHonorific = Mr Mrs Miss Ms Dr;

    Person :- WCHonorific NGLastName;

    Person :- NGFirstName NGLastName;

    Text :- NGNone Text;

    Text :- Person Text;

    Text :- ;

    106

    Ngram Training

Additionally, the training program will generate a separate file containing the statistics for the ngrams. It is similar in nature, but a bit more complex, because the bigram frequencies, token feature frequencies and unknown word frequencies are also saved.

In order to understand the details of ngram training it is necessary to go over the details of their internal working.

An ngram always generates a single token. Any ngram can generate any token, but naturally the probability of generating it depends on the ngram, on the token, and on the immediately preceding context of the token.


    107

    MUC 7

Category | Full SCFG system (F/Prec/Recall) | Emulation using SCFG (F/Prec/Recall) | HMM entity extractor (F/Prec/Recall)
Location | 90.58/94.42/87.05 | 86.91/90.12/83.93 | 86.66/87.2/86.12
Organization | 90.19/90.9/89.49 | 87.7/89.53/85.94 | 88.84/89.75/87.94
Person | 92.24/90.78/93.75 | 86.57/86.83/86.31 | 86.01/85.13/86.91

    108

    DIAL vs. SCFG

Category | Full SCFG system (F/Prec/Recall) | Manual Rules written in DIAL (F/Prec/Recall)
Location | 90.58/94.42/87.05 | 90.49/89.5/91.46
Organization | 90.19/90.9/89.49 | 88.05/93.4/82.74
Person | 92.24/90.78/93.75 | 87.53/93.8/81.32


    109

    ACE 2

Category | Full SCFG system (F/Prec/Recall) | Ergodic SCFG (F/Prec/Recall) | HMM entity extractor (F/Prec/Recall)
GPE | 86.84/84.94/88.83 | 85.84/84.96/86.74 | 84.37/83.22/85.54
Organization | 64.76/71.06/59.49 | 65.06/71.02/60.03 | 58.05/64.735/52.62
Person | 85.56/81.68/89.82 | 84.82/81.65/88.25 | 84.37/83.22/85.54
Role | 80.25/77.3/83.44 | 61.11/76.24/50.99 |

    110

    DIAL vs. TEG (ACE-1)

Category | Full TEG system (F/Prec/Recall) | Manual Rules written in DIAL (F/Prec/Recall)
Facility | 40.74/50.00/31.48 | 47.87/55.0/40.74
Total | 76.25/77.48/75.01 | 80.56/85.73/75.39
GPE | 86.96/83.63/90.03 | 89.39/89.26/89.53
Person | 80.71/80.68/80.75 | 83.59/88.56/78.62
Location | 32.38/44.44/20.33 | 37.44/54.54/20.34
Organization | 59.5/64.50/54.50 | 68.58/78.25/58.92


    113

    Final Comparison

Category | HMM (F1) | G TEG (F1) | DIAL (F1) | Best TEG (F1)
{Sum} | 71.4135 | 72.633 | 80.5605 | 80.695
FAC | 22.9755 | 24.074 | 47.8705 | 38.0825
LOC | 29.3645 | 31.598 | 37.442 | 35.091
GPE | 83.8815 | 85.256 | 89.392 | 88.957
PERSON | 78.0975 | 76.422 | 83.5945 | 84.835
ORG | 49.691 | 54.113 | 68.5895 | 69.4895

    114

    Complexity

    Notation:

    S: number of states in the grammar

    N: number of tokens in the sentence

    C: number of non-terminals in the grammar

Well-behaved HMM: S·N

Nymble-like HMM: S²·N (S = C)

TEG: S·C·N³

    However:

    Most portions can be executed as transducers

    We can use smaller sliding windows

    We can use approximations


    115

Simple ROLE Rules (ACE-2)

ROLE :- [Position_Before] ORGANIZATION->ROLE_2 Position ["in" GPE] [","] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 Position [","] PERSON->ROLE_1;
ROLE :- PERSON->ROLE_1 "of" GPE->ROLE_2;
ROLE :- ORGANIZATION->ROLE_2 "'" "s" [Position] PERSON->ROLE_1;

    116

A little more complicated set of rules

ROLE :- [Position_Before] ORGANIZATION->ROLE_2 Position ["in" GPE] [","] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 Position [","] PERSON->ROLE_1;
ROLE :- PERSON->ROLE_1 "of" GPE->ROLE_2;
ROLE :- ORGANIZATION->ROLE_2 "'" "s" [Position] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 [Position] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 "'" "s" ORGANIZATION->ROLE_1;
ROLE :- PERSON->ROLE_1 "," Position WCPreposition ORGANIZATION->ROLE_2;


    119

    Easy and hard in Coreference

Mohamed Atta, a suspected leader of the hijackers, had spent time in Belle Glade, Fla., where a crop-dusting business is located. Atta and other Middle Eastern men came to South Florida Crop Care nearly every weekend for two months.

Will Lee, the firm's general manager, said the men asked repeated questions about the crop-dusting business. He said the questions seemed "odd," but he didn't find the men suspicious until after the Sept. 11 attack.

    120

    The Simple Coreference

    Proper Names

    IBM, International Business Machines, Big Blue

    Osama Bin Ladin, Bin Ladin, Usama Bin Laden. (note the

    variations)

    Definite Noun Phrases

    The Giant Computer Manufacturer, The Company, The

    owner of over 600,000 patents

    Pronouns

It, he, she, we...


    121

Coreference Example

Granite Systems provides Service Resource Management (SRM) software for communication service providers with wireless, wireline, optical and packet technology networks. Utilizing Granite's Xng System, carriers can manage inventories of network resources and capacity, define network configurations, order and schedule resources and provide a database of record to other operational support systems (OSSs). An award-winning company, including an Inc. 500 company in 2000 and 2001, Granite Systems enables clients including AT&T Wireless, KPN Belgium, COLT Telecom, ATG and Verizon to eliminate resource redundancy, improve network reliability and speed service deployment. Founded in 1993, the company is headquartered in Manchester, NH with offices in Denver, CO; Miami, FL; London, U.K.; Nice, France; Paris, France; Madrid, Spain; Rome, Italy; Copenhagen, Denmark; and Singapore.

    122

A KE Approach to Coreference

Mark each noun phrase with the following:

Type (company, person, location)

Singular vs. plural

Gender (male, female, neutral)

Syntactic form (name, pronoun, definite/indefinite)

For each candidate:

Find accessible antecedents

Each antecedent has a different scope: a proper name's scope is the whole document; a definite clause's scope is the preceding paragraphs; a pronoun's scope might be just the previous sentence, or the same paragraph.

Filter by consistency check

Order by dynamic syntactic preference
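A Python sketch of the markup and consistency check described above (the attribute names, mention objects, and the specific checks are illustrative assumptions):

from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    type: str        # "person", "company", "location", ...
    number: str      # "singular" or "plural"
    gender: str      # "male", "female", "neutral"
    syntax: str      # "name", "pronoun", "definite", "indefinite"

def consistent(pronoun, antecedent):
    """Consistency check sketched on the slide: type, number and gender must not clash."""
    if pronoun.number != antecedent.number:
        return False
    if pronoun.gender != "neutral" and antecedent.gender != "neutral" \
            and pronoun.gender != antecedent.gender:
        return False
    # "it"/"they" should not resolve to a person, "he"/"she" not to a company, etc.
    if pronoun.gender == "neutral" and antecedent.type == "person":
        return False
    return True

# Hypothetical mentions echoing the next slide's examples
bush = Mention("George Bush", "person", "singular", "male", "name")
she = Mention("she", "person", "singular", "female", "pronoun")
company = Mention("the company", "company", "singular", "neutral", "definite")
it = Mention("it", "company", "singular", "neutral", "pronoun")
print(consistent(she, bush), consistent(it, bush), consistent(it, company))  # False False True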


    123

    Filtering Antecedents

"George Bush" will not match "she" or "it".

"George Bush" cannot be an antecedent of "the company" or "they".

Using an ontology, we can use background information to be smarter.

Example: "The big automaker is planning to get out of the car business. The company feels that it can no longer make a profit making cars."

    124

    AutoSlog (Riloff, 1993)

Creates Extraction Patterns from annotated texts (NPs).

Uses a sentence analyzer (CIRCUS, Lehnert, 1991) to identify clause boundaries and syntactic constituents (subject, verb, direct object, prepositional phrase).

It then uses heuristic templates to generate extraction patterns.


    125

Example Templates

Example | Template
was aimed at | passive-verb prep
killed with | active-verb prep
bomb against | noun prep
bombed | active-verb
was victim | aux noun
was murdered | passive-verb

    126

    AutoSlog-TS (Riloff, 1996)

It took 8 hours to annotate 160 documents, and hence probably a week to annotate 1000 documents.

This bottleneck is a major problem for using IE in new domains.

Hence there is a need for a system that can generate IE patterns from un-annotated documents.

AutoSlog-TS is such a system.


    127

    AutoSlog TS

Stage 1: The Sentence Analyzer parses the corpus (e.g. S: World Trade Center, V: was bombed, PP: by terrorists) and the AutoSlog heuristics generate candidate extraction patterns such as "was killed" and "bombed by".

Stage 2: The Sentence Analyzer applies the extraction patterns ("was killed", "was bombed", "bombed by", "saw") to the corpus and computes relevance statistics:

EP | REL %
was bombed | 87%
bombed by | 84%
was killed | 63%
saw | 49%

    128

    Top 24 Extraction Patterns

exploded on, kidnapped, one of, was murdered, destroyed, was wounded in, occurred on, responsibility for, took place on, was located, occurred, was wounded, claimed, caused, took place, death of, exploded in, was injured, attack on, was kidnapped, was killed, assassination of, murder of, exploded


    129

Evaluation

Data Set: 1500 docs from MUC-4 (772 relevant)

AutoSlog generated 1237 patterns, which were manually filtered to 450 in 5 hours.

AutoSlog-TS generated 32,345 patterns; after discarding singleton patterns, 11,225 were left. The remaining patterns were ranked by

$\mathrm{Rank}(EP) = \frac{rel\_freq}{freq} \cdot \log_2(freq)$

The user reviewed the top 1970 patterns and selected 210 of them in 85 minutes.

AutoSlog achieved better recall, while AutoSlog-TS achieved better precision.
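A small Python sketch of ranking candidate patterns with the formula reconstructed above; the pattern statistics below reuse the REL % figures from the AutoSlog-TS slide as hypothetical counts out of 100.

import math

def rank(rel_freq, freq):
    """Rank(EP) = (rel_freq / freq) * log2(freq), following the reconstructed formula above."""
    if freq == 0 or rel_freq == 0:
        return 0.0
    return (rel_freq / freq) * math.log2(freq)

# Hypothetical pattern statistics: (pattern, occurrences in relevant docs, total occurrences)
stats = [("was bombed", 87, 100), ("bombed by", 84, 100), ("was killed", 63, 100), ("saw", 49, 100)]
for pattern, rel, total in sorted(stats, key=lambda s: -rank(s[1], s[2])):
    print(f"{pattern:12s} {rank(rel, total):.2f}")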

    130

    Learning Dictionaries by Bootstrapping

    (Riloff and Jones, 1999)

    Learn dictionary entries (semantic lexicon) and

    extraction patterns simultaneously.

    Use untagged text as a training source for learning.

    Start with a set of seed lexicon entries and using

    mutual bootstrapping learn extraction patterns and

    more lexicon entries.


    131

    Mutual Bootstrapping Algorithm

Using AutoSlog, generate all possible extraction patterns. Apply the patterns to the corpus and save the results to EPData.

SemLex = Seed Words
Cat_EPList = {}

1. Score all extraction patterns in EPData
2. Best_EP = highest scoring pattern
3. Add Best_EP to Cat_EPList
4. Add Best_EP's extractions to SemLex
5. Goto 1
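A Python sketch of this loop; the data is hypothetical and the scoring function (counting extractions already in the lexicon) is a simplified stand-in for the metric used in the paper.

def mutual_bootstrap(ep_data, seed_words, iterations=10):
    """Sketch of the loop above. ep_data maps each extraction pattern to the set of NPs it extracts."""
    sem_lex = set(seed_words)        # SemLex
    cat_ep_list = []                 # Cat_EPList
    for _ in range(iterations):
        best_ep, best_score = None, -1.0
        for ep, extractions in ep_data.items():
            if ep in cat_ep_list:
                continue
            # Simplified score: how many of the pattern's extractions are already in the lexicon
            score = len(extractions & sem_lex)
            if score > best_score:
                best_ep, best_score = ep, score
        if best_ep is None:
            break
        cat_ep_list.append(best_ep)              # 3. Add Best_EP to Cat_EPList
        sem_lex |= ep_data[best_ep]              # 4. Add Best_EP's extractions to SemLex
    return cat_ep_list, sem_lex

# Hypothetical data for a "location" category
ep_data = {
    "offices in <np>":  {"London", "Paris", "Denver"},
    "operates in <np>": {"Paris", "Singapore"},
    "owned by <np>":    {"AT&T Wireless"},
}
print(mutual_bootstrap(ep_data, seed_words={"London", "Paris"}))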

    132

    Meta Bootstrapping Process

[Diagram of the meta-bootstrapping process: Seed Words initialize the Permanent Semantic Lexicon; candidate extraction patterns and their extractions feed a Temporary Semantic Lexicon and a Category EP List (select best_EP, add best_EP's extractions); after each run the 5 best NPs are added to the Permanent Semantic Lexicon.]


    133

    Sample Extraction Patterns

terrorism location | www company | www location
process in | offices of | expanded into
returned to | has positions | services in
taken in | request information | distributors in
part in | message to | customers in
ministers of | thrive | outlets in
to enter | devoted to | consulting in
parts of | sold to | activities in
presidents of | motivated | seminars in
sought in | positioning | operates in
become in | is distributor | operations in
traveled to | employed | facilities in
living in | owned by | offices in

    134

    Experimental Evaluation

Category | Iter 1 | Iter 10 | Iter 20 | Iter 30 | Iter 40 | Iter 50
Web Company | 5/5 (1) | 25/32 (.78) | 52/65 (.80) | 72/113 (.64) | 86/163 (.53) | 95/206 (.46)
Web Location | 5/5 (1) | 46/50 (.92) | 88/100 (.88) | 129/150 (.86) | 163/200 (.82) | 191/250 (.76)
Web Title | 0/1 (0) | 22/31 (.71) | 63/81 (.78) | 86/131 (.66) | 101/181 (.56) | 107/231 (.46)
Terr. Location | 5/5 (1) | 32/50 (.64) | 66/100 (.66) | 100/150 (.67) | 127/200 (.64) | 158/250 (.63)
Terr. Weapon | 4/4 (1) | 31/44 (.70) | 68/94 (.72) | 85/144 (.59) | 101/194 (.52) | 124/244 (.51)


Semi-Automatic Approach

    Smart Tagging

    136


    137

    138


    139

    140


    141

    142


    Auditing Environments

    Fixing Recall and Precision Problems

    144

    Auditing Events


    145

Auditing a Person-Position-Company Event

    146

    Updating the Taxonomy and the

    Thesaurus with New Entities


    149

Using Hoover's

    150

References I

Anand T. and Kahn G., 1993. Opportunity Explorer: Navigating Large Databases Using Knowledge Discovery Templates. In Proceedings of the 1993 Workshop on Knowledge Discovery in Databases.

Appelt, Douglas E., Jerry R. Hobbs, John Bear, David Israel, and Mabry Tyson, 1993. "FASTUS: A Finite-State Processor for Information Extraction from Real-World Text". Proceedings of IJCAI-93, Chambery, France, August 1993.

Appelt, Douglas E., Jerry R. Hobbs, John Bear, David Israel, Megumi Kameyama, and Mabry Tyson, 1993a. "The SRI MUC-5 JV-FASTUS Information Extraction System". Proceedings, Fifth Message Understanding Conference (MUC-5), Baltimore, Maryland, August 1993.

Bookstein A., Klein S.T. and Raita T., 1995. Clumping Properties of Content-Bearing Words. In Proceedings of SIGIR'95.

Brachman R. J., Selfridge P. G., Terveen L. G., Altman B., Borgida A., Halper F., Kirk T., Lazar A., McGuinness D. L., and Resnick L. A., 1993. Integrated Support for Data Archaeology. International Journal of Intelligent and Cooperative Information Systems 2(2):159-185.

Brill E., 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543-565.

Church K.W. and Hanks P., 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1):22-29.

Cohen W. and Singer Y., 1996. Context Sensitive Learning Methods for Text Categorization. In Proceedings of SIGIR'96.

Cohen W., 1992. "Compiling Prior Knowledge into an Explicit Bias". Working notes of the 1992 AAAI Spring Symposium on Knowledge Assimilation. Stanford, CA, March 1992.

Daille B., Gaussier E. and Lange J.M., 1994. Towards Automatic Extraction of Monolingual and Bilingual Terminology. In Proceedings of the International Conference on Computational Linguistics, COLING'94, pages 515-521.

Dumais, S. T., Platt, J., Heckerman, D. and Sahami, M., 1998. Inductive learning algorithms and representations for text categorization. Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM'98), 148-155.

Dunning T., 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1).


    151

References II

Feldman R. and Dagan I., 1995. KDT: Knowledge Discovery in Texts. In Proceedings of the First International Conference on Knowledge Discovery, KDD-95.

Feldman R. and Hirsh H., 1996. Exploiting Background Information in Knowledge Discovery from Text. Journal of Intelligent Information Systems, 1996.

Feldman R., Aumann Y., Amir A., Klösgen W. and Zilberstien A., 1997. Maximal Association Rules: a New Tool for Mining for Keyword Co-occurrences in Document Collections. In Proceedings of the 3rd International Conference on Knowledge Discovery, KDD-97, Newport Beach, CA.

Feldman R., Rosenfeld B., Stoppi J., Liberzon Y. and Schler J., 2000. A Framework for Specifying Explicit Bias for Revision of Approximate Information Extraction Rules. KDD 2000: 189-199.

Fisher D., Soderland S., McCarthy J., Feng F. and Lehnert W., 1995. "Description of the UMass Systems as Used for MUC-6". In Proceedings of the 6th Message Understanding Conference, November 1995, pp. 127-140.

Frantzi T.K., 1997. Incorporating Context Information for the Extraction of Terms. In Proceedings of ACL-EACL'97.

Frawley W. J., Piatetsky-Shapiro G., and Matheus C. J., 1991. Knowledge Discovery in Databases: an Overview. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 1-27, MIT Press.

Grishman R., 1996. The role of syntax in Information Extraction. In: Advances in Text Processing: Tipster Program Phase II, Morgan Kaufmann.

Hahn, U. and Schnattinger, K., 1997. Deep Knowledge Discovery from Natural Language Texts. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 175-178.

Hayes Ph., 1992. Intelligent High-Volume Processing Using Shallow, Domain-Specific Techniques. Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval. New Jersey, pp. 227-242.

Hayes, P.J. and Weinstein, S.P., 1990. CONSTRUE: A System for Content-Based Indexing of a Database of News Stories. Second Annual Conference on Innovative Applications of Artificial Intelligence.

Hobbs, J., 1986. Resolving Pronoun References. In B. J. Grosz, K. Sparck Jones, & B. L. Webber (Eds.), Readings in Natural Language Processing (pp. 339-352), Los Altos, CA: Morgan Kaufmann Publishers, Inc.

Hobbs, Jerry R., Douglas E. Appelt, John Bear, David Israel, Megumi Kameyama, and Mabry Tyson, 1992. "FASTUS: A System for Extracting Information from Text". Proceedings, Human Language Technology, Princeton, New Jersey, March 1992, pp. 133-137.

Hobbs, Jerry R., Mark Stickel, Douglas Appelt, and Paul Martin, 1993. "Interpretation as Abduction". Artificial Intelligence, Vol. 63, Nos. 1-2, pp. 69-142. Also published as SRI International Artificial Intelligence Center Technical Note 499, December 1990.

    152

    References III

    Hull D., 1996. Stemming algorithms - a case study for detailed evaluation. Journal of the American Society for Information Science, 47(1):70-84.

    Ingria, R. and Stallard, D., 1989. A computational mechanism for pronominal reference. Proceedings, 27th Annual Meeting of the Association for Computational Linguistics, Vancouver.

    Cowie J. and Lehnert W., 1996. Information Extraction. Communications of the Association for Computing Machinery, vol. 39 (1), pp. 80-91.

    Joachims, T., 1998. Text categorization with support vector machines: Learning with many relevant features. Proceedings of the European Conference on Machine Learning (ECML-98).

    Justeson J. S. and Katz S. M., 1995. Technical Terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1):9-27.

    Klösgen W., 1992. Problems for Knowledge Discovery in Databases and Their Treatment in the Statistics Interpreter EXPLORA. International Journal for Intelligent Systems, 7(7):649-673.

    Lappin, S. and McCord, M., 1990. A syntactic filter on pronominal anaphora for slot grammar. Proceedings, 28th Annual Meeting of the Association for Computational Linguistics, University of Pittsburgh, Association for Computational Linguistics.

    Larkey, L. S. and Croft, W. B., 1996. Combining classifiers in text categorization. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, CH, 1996), pp. 289-297.

    Lehnert, Wendy, Claire Cardie, David Fisher, Ellen Riloff, and Robert Williams, 1991. ``Description of the CIRCUS System as Used for MUC-3'', Proceedings, Third Message Understanding Conference (MUC-3), San Diego, California, pp. 223-233.

    Lent B., Agrawal R. and Srikant R., 1997. Discovering Trends in Text Databases. In Proceedings of the 3rd International Conference on Knowledge Discovery, KDD-97, Newport Beach, CA.

    Lewis, D. D., 1995a. Evaluating and optimizing autonomous text classification systems. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, US, 1995), pp. 246-254.

    Lewis, D. D. and Hayes, P. J., 1994. Guest editorial for the special issue on text categorization. ACM Transactions on Information Systems 12, 3, 231.

    Lewis, D. D. and Ringuette, M., 1994. A comparison of two learning algorithms for text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US, 1994), pp. 81-93.

    Lewis, D. D., Schapire, R. E., Callan, J. P., and Papka, R., 1996. Training algorithms for linear text classifiers. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, CH, 1996), pp. 298-306.

    153

    References IV

    Lin D., 1995. University of Manitoba: Description of the PIE System as Used for MUC-6. In Proceedings of the Sixth Conference on Message Understanding (MUC-6), Columbia, Maryland.

    Piatetsky-Shapiro G. and Frawley W. (Eds.), 1991. Knowledge Discovery in Databases, AAAI Press, Menlo Park, CA.

    Rajman M. and Besançon R., 1997. Text Mining: Natural Language Techniques and Text Mining Applications. In Proceedings of the Seventh IFIP 2.6 Working Conference on Database Semantics (DS-7), Chapman & Hall IFIP Proceedings series. Leysin, Switzerland, Oct 7-10, 1997.

    Riloff Ellen and Lehnert Wendy, 1994. Information Extraction as a Basis for High-Precision Text Classification. ACM Transactions on Information Systems (special issue on text categorization); also Umass-TE-24, 1994.

    Sebastiani F., 2002. Machine learning in automated text categorization. ACM Computing Surveys, Vol. 34, number 1, pages 1-47.

    Sheila Tejada, Craig A. Knoblock and Steven Minton, 2001. Learning Object Identification Rules For Information Integration. Information Systems, Vol. 26, No. 8, pp. 607-633.

    Soderland S., Fisher D., Aseltine J., and Lehnert W., 1995. "Issues in Inductive Learning of Domain-Specific Text Extraction Rules". Proceedings of the Workshop on New Approaches to Learning for Natural Language Processing at the Fourteenth International Joint Conference on Artificial Intelligence.

    Srikant R. and Agrawal R., 1995. Mining generalized association rules. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland.

    Sundheim, Beth, ed., 1993. Proceedings, Fifth Message Understanding Conference (MUC-5), Baltimore, Maryland, August 1993. Distributed by Morgan Kaufmann Publishers, Inc., San Mateo, California.

    Tipster Text Program (Phase I), 1993. Proceedings, Advanced Research Projects Agency, September 1993.

    Vapnik, V., 1979. Estimation of Dependencies Based on Data [in Russian], Nauka, Moscow. (English translation: Springer Verlag, 1982.)

    Vapnik, V., 1995. The Nature of Statistical Learning Theory, Springer-Verlag.

    Yang, Y. and Chute, C. G., 1994. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems 12, 3, 252-277.

    Yang, Y. and Liu, X., 1999. A re-examination of text categorization methods. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, US, 1999), pp. 42-49.