mining_unstructured_data
TRANSCRIPT
-
8/7/2019 Mining_Unstructured_Data
1/77
Mining Unstructured Data
Ronen Feldman
Computer Science Department
Bar-Ilan University, ISRAEL
ClearForest [email protected]
2
Outline
Introduction to Text Mining
Information Extraction
  Entity Extraction
  Relationship Extraction
  KE Approach
  IE Languages
  ML Approaches to IE: HMM
  Anaphora resolution
  Evaluation
Link Detection
  Introduction to LD
  Architecture of LD systems
  9/11 Link Analysis Demos
3
Background
Rapid proliferation of information available in digital format
People have less time to absorb more information
4
The Information Landscape
Unstructured: Textual
Structured: Databases
Problem: lack of tools to handle unstructured data
5
Find Documents matching the Query
Display Information relevant to the Query
Extract Information from within the documents
Actual information is buried inside documents
Long lists of documents vs. aggregation over the entire collection
6
Text Mining
Input: Documents
Output: Patterns, Connections, Profiles, Trends
Seeing the Forest for the Trees
7
Read
Consolidate
Absorb / Act
Understand
Find Material
Let Text Mining Do the Legwork for You
9
Document Types
Structured documents
Output from CGI
Semi-structured documents
Seminar announcements
Job listings
Ads
Free format documents
News
Scientific papers
10
Text Representations
Character Trigrams
Words
Linguistic Phrases
Non-consecutive phrases
Frames
Scripts
Role annotation
Parse trees
11
The 100,000 foot Picture
Development
Environment
12
Intelligent Auto-Tagging
(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services.
By Stephen J. Hedges and Cam Simpson
Finsbury Park Mosque
13
Tagging a Document
BC-dynegy-enron-offer-update5
Dynegy May Offer at Least $8 Bln to Acquire Enron (Update5)
By George Stein
SOURCE: c.2001 Bloomberg News
BODY
Tagged entities: Dynegy Inc; Roger Hamilton, money manager, John Hancock Advisers Inc.; Enron Corp; Moody's Investors Service; bonds downgraded to Baa2
``Dynegy has to act fast,'' said Roger Hamilton, a money manager with John Hancock Advisers Inc., which sold its Enron shares in recent weeks. ``If Enron can't get financing and its bonds go to junk, they lose counterparties and their marvelous business vanishes.''
Moody's Investors Service lowered its rating on Enron's bonds to ``Baa2'' and Standard & Poor's cut the debt to ``BBB'' in the past two weeks.
14
Tagged entities and relations: kidney enlargement; VEGF (vascular endothelial growth factor); diabetic glomerular enlargement; IGF-I (insulin-like growth factor I); increase in glomerular volume and kidney weight
Various growth factors and cytokines have been implicated in different forms of kidney enlargement. Vascular endothelial growth factor (VEGF) is essential for normal renal development and plays a role in diabetic glomerular enlargement. To explore a possible role for VEGF in compensatory renal changes after uninephrectomy, we examined the effect of a neutralizing VEGF-antibody (VEGF-Ab) on glomerular volume and kidney weight in mice treated for 7 days. Serum and kidney insulin-like growth factor I (IGF-I) levels were measured, since IGF-I has been implicated in the pathogenesis of compensatory renal growth, and VEGF has been suggested to be a downstream mediator of IGF-I. Placebo-treated uninephrectomized mice displayed an early transient increase in kidney IGF-I concentration and an increase in glomerular volume and kidney weight. In VEGF-Ab-treated uninephrectomized animals, increased glomerular volume was abolished, whereas renal hypertrophy was partially blocked. Furthermore, the renal effects of VEGF-Ab administration were seen without affecting the renal IGF-I levels. In conclusion, these results demonstrate that compensatory glomerular growth after uninephrectomy is VEGF dependent.
Intelligent Auto-Tagging
Compensatory glomerular growth after unilateral nephrectomy is VEGF dependent.
Flyvbjerg A, Schrijvers BF, De Vriese AS, Tilton RG, Rasch R.
Medical Department M, Medical Research Laboratories, Institute of Experimental Clinical Research, Aarhus University, DK-8000 Aarhus, Denmark.
BIOMEDICAL
15
A Tagged News Document
16
Original Document
000019641
3560177
November 25, 1991
News Release
Applied Materials, Inc. today announced its newest source technology, called the Durasource, for the Endura 5500 PVD system. This enhanced source includes new magnet configurations, giving the industry's most advanced sputtered aluminum step coverage in sub-micron contacts, and a new one-piece target that more than doubles target life to approximately 8000 microns of deposition compared to conventional two-piece bonded targets. The Durasource enhancement is fully retrofittable to installed Endura 5500 PVD systems. The Durasource technology has been specially designed for 200mm wafer applications, although it is also available for 125mm and 1s0mm wafer sizes. For example, step coverage symmetry is maintained within 3% between the inner and outer walls of contacts across a 200mm wafer. Film thickness uniformity averages 3% (3 sigma) over the life of the target.
17
Extraction Results I
:=
DOC NR: 3560177
DOC DATE: 251192
DOCUMENT SOURCE: News Release
CONTENT:
DATE TEMPLATE COMPLETED: 021292
EXTRACTION TIME: 5
COMMENT: article focuses on nonreportable target source but reportable info available
/TOOL_VERSION: LOCKE.3.4
/FILLRULES_VERSION: EME.4.0
:=
PROCESS:
MANUFACTURER:
18
Extraction Results II
:=
NAME: Applied Materials, INC
TYPE: COMPANY :=
TYPE: SPUTTERING
FILM: ALUMINUM
EQUIPMENT:
:=
NAME_OR_MODEL: Endura(TM) 5500
MANUFACTURER:
EQUIPMENT_TYPE: PVD_SYSTEM
STATUS: IN_USE
WAFER_SIZE: (200 MM)(121 MM)
COMMENT: actually three wafer sizes, third is error 1s0mm
19
BioMedical Example
Template: some kinase phosphorylates protein X -> Phosphorylation(Protein, Kinase)
Text: "BES1 is phosphorylated and appears to be destabilized by the glycogen synthase kinase-3 (GSK-3) BIN2"
Extracted: Phosphorylation(Protein = BES1, Kinase = glycogen synthase kinase-3 (GSK-3) BIN2)
20
Leveraging Content Investment
Any type of content
  Unstructured textual content (current focus)
  Structured data, audio, video (future)
From any source
  WWW, file systems, news feeds, etc.
  Single source or combined sources
In any format
  Documents, PDFs, Emails, articles, etc.
  Raw or categorized
  Formal, informal, combination
21
The Tagging Process (a Unified View)
ClearTags Suite: Intelligent Auto-Tagging
Tagging Controller
  Semantic tagger
  Classification tagger
  Structural tagger
Rulebooks
Fetcher
ClearLab
  Semantic Trainer
  Statistical Trainer
  Structural Trainer
Rich XML API
Unstructured Content
Classifiers
Structural Templates
Intelligent Tagging
22
Text Mining Horizontal Applications
Lead Generation
Enterprise Portals (EIP)
Competitive Intelligence
Content Uniformity
CRM Analysis
Market Research
New Products / Offerings
Events Detections / Notifications
IP Management
Content Management
23
Text Mining: Vertical Applications
Insurance
Pharma
Publishing
Financial Services
Global Chemical
Legal
Biotech
Defense
Information Extraction
25
What is Information Extraction?
IE does not indicate which documents need to be read by a user; rather, it extracts pieces of information that are salient to the user's needs.
Links between the extracted information and the original documents are maintained to allow the user to reference context.
The kinds of information that systems extract vary in detail and reliability.
Named entities such as persons and organizations can be extracted with reliability above 90%, but named entities alone do not provide the attributes, facts, or events that those entities have or participate in.
26
Relevant IE Definitions
Entity: an object of interest such as a person or organization.
Attribute: a property of an entity such as its name, alias, descriptor, or type.
Fact: a relationship held between two or more entities, such as the Position of a Person in a Company.
Event: an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction.
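These four object types can be sketched as a small data model. This is an illustrative sketch only; the class and field names below are hypothetical and not part of the talk's terminology beyond the definitions above.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """An object of interest, e.g. a person or an organization."""
    name: str
    category: str                                   # PERSON, ORGANIZATION, ...
    attributes: dict = field(default_factory=dict)  # alias, descriptor, type

@dataclass
class Fact:
    """A relationship held between two or more entities."""
    relation: str      # e.g. Employee_of
    args: tuple        # the participating entities

@dataclass
class Event:
    """An activity involving several entities."""
    event_type: str    # e.g. COMPANY-FORMATION
    roles: dict        # role name -> filler(s)

# The Fletcher Maddox example from the following slide, in this model:
maddox = Entity("Fletcher Maddox", "PERSON",
                {"descriptor": "former Dean of the UCSD Business School"})
ljg = Entity("La Jolla Genomatics", "ORGANIZATION")
employment = Fact("Employee_of", (maddox, ljg))
formation = Event("COMPANY-FORMATION", {"COMPANY": ljg, "PRINCIPALS": [maddox]})
```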
27
Example (Entities and Facts)
Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. La Jolla Genomatics will release its product Geninfo in June 1999. Geninfo is a turnkey system to assist biotechnology researchers in keeping up with the voluminous literature in all aspects of their field.
Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo. Oliver's brother, Ambrose, follows more in his father's footsteps and will be the CFO of L.J.G., headquartered in the Maddox family's hometown of La Jolla, CA.
Persons: Fletcher Maddox, Dr. Maddox, Oliver, Ambrose, Maddox
Organizations: UCSD Business School, La Jolla Genomatics, L.J.G.
Locations: La Jolla, CA
Artifacts: Geninfo
Dates: June 1999
Maddox
PERSON Employee_of ORGANIZATION:
  Fletcher Maddox Employee_of UCSD Business School
  Fletcher Maddox Employee_of La Jolla Genomatics
  Oliver Employee_of La Jolla Genomatics
  Ambrose Employee_of La Jolla Genomatics
ARTIFACT Product_of ORGANIZATION:
  Geninfo Product_of La Jolla Genomatics
LOCATION Location_of ORGANIZATION:
  La Jolla Location_of La Jolla Genomatics
  CA Location_of La Jolla Genomatics
28
Events and Attributes
COMPANY-FORMATION_EVENT:
  COMPANY: La Jolla Genomatics
  PRINCIPALS: Fletcher Maddox, Oliver, Ambrose
  DATE:
  CAPITAL:
RELEASE-EVENT:
  COMPANY: La Jolla Genomatics
  PRODUCT: Geninfo
  DATE: June 1999
  COST:
NAME: Fletcher Maddox / Maddox
  DESCRIPTOR: former Dean of the UCSD Business School / his father / the firm's CEO
  CATEGORY: PERSON
NAME: Oliver
  DESCRIPTOR: His son / Chief Scientist
  CATEGORY: PERSON
NAME: Ambrose
  DESCRIPTOR: Oliver's brother / the CFO of L.J.G.
  CATEGORY: PERSON
NAME: UCSD Business School
  DESCRIPTOR:
  CATEGORY: ORGANIZATION
NAME: La Jolla Genomatics / L.J.G.
  DESCRIPTOR:
  CATEGORY: ORGANIZATION
NAME: Geninfo
  DESCRIPTOR: its product
  CATEGORY: ARTIFACT
NAME: La Jolla
  DESCRIPTOR: the Maddox family's hometown
  CATEGORY: LOCATION
NAME: CA
  DESCRIPTOR:
  CATEGORY: LOCATION
29
IE Accuracy by Information Type
Information Type   Accuracy
Entities           90-98%
Attributes         80%
Facts              60-70%
Events             50-60%
30
Unstructured Text
POLICE ARE INVESTIGATING A ROBBERY THAT OCCURRED AT THE 7-ELEVEN STORE LOCATED AT 2545 LITTLE RIVER TURNPIKE IN THE
LINCOLNIA AREA ABOUT 12:30 AM FRIDAY. A 24 YEAR OLD
ALEXANDRIA AREA EMPLOYEE WAS APPROACHED BY TWO MEN WHO
DEMANDED MONEY. SHE RELINQUISHED AN UNDISCLOSED AMOUNT
OF CASH AND THE MEN LEFT. NO ONE WAS INJURED. THEY WERE
DESCRIBED AS BLACK, IN THEIR MID TWENTIES, BOTH WERE FIVE
FEET NINE INCHES TALL, WITH MEDIUM BUILDS, BLACK HAIR AND
CLEAN SHAVEN. THEY WERE BOTH WEARING BLACK PANTS AND
BLACK COATS. ANYONE WITH INFORMATION ABOUT THE INCIDENT
OR THE SUSPECTS INVOLVED IS ASKED TO CALL POLICE AT (703) 555-
5555.
31
Structured (Desired) Information
Crime      Address                                              Town       Time      Day
ROBBERY    7-ELEVEN STORE LOCATED AT 5624 OX ROAD               FAIRFAX    3:00 AM   FRIDAY
ROBBERY    7-ELEVEN STORE LOCATED AT 2545 LITTLE RIVER TURNPIKE LINCOLNIA  12:45 AM  FRIDAY
ABDUCTION  8700 BLOCK OF LITTLE RIVER TURNPIKE                  ANNANDALE  11:30 PM  SUNDAY
32
MUC Conferences
Conference  Year  Topic
MUC-1       1987  Naval Operations
MUC-2       1989  Naval Operations
MUC-3       1991  Terrorist Activity
MUC-4       1992  Terrorist Activity
MUC-5       1993  Joint Venture and Microelectronics
MUC-6       1995  Management Changes
MUC-7       1997  Space Vehicles and Missile Launches
33
The ACE Evaluation
The ACE program is dedicated to the challenge of extracting content from human language. This is a fundamental capability that the ACE program addresses with a basic research effort that is directed to master first the extraction of entities, then the extraction of relations among these entities, and finally the extraction of events that are causally related sets of relations.
After two years of research on the entity detection and tracking task, top systems have achieved a capability that is useful by itself and that, in the context of the ACE EDT task, successfully captures and outputs well over 50 percent of the value at the entity level.
Here value is defined to be the benefit derived by successfully extracting the entities, where each individual entity provides a value that is a function of the entity type (i.e., person, organization, etc.) and level (i.e., named, unnamed). Thus each entity contributes to the overall value through the incremental value that it provides.
34
ACE Entity Detection & Tracking Evaluation (2/2002)
Goal: Extract entities. Each entity is assigned a value, which is a function of its Type and Level. The value is gained when the entity is successfully detected, and lost when an entity is missed, spuriously detected, or mischaracterized.
[Figure: normalized value/cost (in % relative to maximum value), actual vs. maximum, for the average of the top 4 systems, broken down by entity type (PER, ORG, GPE, LOC, FAC). Miss 36%, False Alarm 22%, Type Error 6%. Human performance ~80.]
Table of Entity Values:
       PER    ORG    GPE    LOC    FAC
NAM    1      0.5    0.25   0.1    0.05
NOM    0.2    0.1    0.05   0.02   0.01
PRO    0.04   0.02   0.01   0.004  0.002
35
Applications of Information Extraction
Routing of Information
Infrastructure for IR and for Categorization (higher-level features)
Event-Based Summarization
Automatic Creation of Databases and Knowledge Bases
36
Where would IE be useful?
Semi-Structured Text
Generic documents like news articles
Most of the information in the document is centered around a set of easily identifiable entities
37
Approaches for Building IE Systems
Knowledge Engineering Approach
Rules are crafted by linguists in cooperation with domain experts.
Most of the work is done by inspecting a set of relevant documents.
It can take a lot of time to fine-tune the rule set.
The best results have been achieved with knowledge-based IE systems.
Skilled/gifted developers are needed.
A strong development environment is a MUST!
38
Approaches for Building IE Systems
Automatically Trainable Systems
The techniques are based on pure statistics and almost no linguistic knowledge.
They are language independent.
The main input is an annotated corpus.
Relatively little effort is needed to build the rules; however, creating the annotated corpus is extremely laborious.
A huge number of training examples is needed in order to achieve reasonable accuracy.
Hybrid approaches can utilize the user input in the development loop.
39
Components of IE System
Tokenization
Morphological andLexical Analysis
Syntactic Analysis
Domain Analysis
Zoning
Part of Speech Tagging
Sense Disambiguation
Deep Parsing
Shallow Parsing
Anaphora Resolution
Integration
Must
Advisable
Nice to have
Can pass
40
The Extraction Engine
Extraction Engine
API
Generic Extraction Rules
Generic Lexicons and Thesauri
Customized Extraction Rules
Domain-Specific Lexicons and Thesauri
Document Collection
Entities, Facts and Events
41
Why is IE Difficult?
Different languages
  Morphology is very easy in English, much harder in German and Hebrew.
  Identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese and Japanese.
  Some languages have orthographic cues such as capitalization (like English), while others (like Hebrew, Arabic, etc.) do not.
Different types of style
  Scientific papers
  Newspapers
  Memos
  Emails
  Speech transcripts
Type of document
  Tables
  Graphics
  Small messages vs. books
42
Morphological Analysis
Easy
English, Japanese
Listing all inflections of a word is a real possibility
Medium
French, Spanish
A simple morphological component adds value.
Difficult
German, Hebrew, Arabic A sophisticated morphological component is a must!
43
Using Vocabularies
Size doesn't matter: large lists tend to cause more mistakes.
Examples:
  "Said" as a person name (male)
  "Alberta" as a person name (female)
It might be better to have small domain-specific dictionaries.
44
Part of Speech Tagging
POS tagging can help to reduce ambiguity and to deal with ALL CAPS text. However:
It usually fails exactly when you need it.
It is domain dependent, so to get the best results you need to retrain it on a relevant corpus.
It takes a lot of time to prepare a training corpus.
45
A Simple POS Strategy
Use a tag frequency table to determine the right POS.
This will lead to the elimination of rare senses.
The overhead is very small.
It improves accuracy by a small percentage.
Compared to full POS tagging, it provides a similar boost to accuracy.
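The strategy above can be sketched in a few lines of Python: count (word, tag) frequencies in a tagged corpus and always emit each word's most frequent tag. The tiny corpus here is invented for illustration.

```python
from collections import Counter, defaultdict

def train_tag_freq(tagged_corpus):
    """Count (word, tag) frequencies from a list of (word, tag) pairs."""
    freq = defaultdict(Counter)
    for word, tag_ in tagged_corpus:
        freq[word.lower()][tag_] += 1
    return freq

def tag(word, freq, default="NN"):
    """Return the most frequent tag for the word; rare senses are eliminated."""
    counts = freq.get(word.lower())
    return counts.most_common(1)[0][0] if counts else default

corpus = [("the", "DT"), ("results", "NNS"), ("results", "NNS"),
          ("results", "VBZ"), ("improve", "VB")]
freq = train_tag_freq(corpus)
# "results" was seen twice as a noun and once as a verb, so it is tagged NNS
```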
46
Dealing with Proper Names
The problem:
  Impossible to enumerate
  New candidates are generated all the time
  Hard to provide syntactic rules
Types of proper names:
  People
  Companies
  Organizations
  Products
  Technologies
  Locations (cities, states, countries, rivers, mountains)
47
Comparing RB Systems with ML-Based Systems
Corpus                        Rule Based  HMM
MUC-6 (Wall Street Journal)   96.4        93
MUC-7                         93.7        90.4
HUB-4 (Transcribed Speech)    90.3        90.6
48
Building a RB Proper Name Extractor
A lexicon is always a good start.
The rules can be based on the lexicon and on:
  The context (preceding/following verbs or nouns)
  Regular expressions: Companies: Capital* [,] Inc, Capital* Corporation; Locations: Capital* Lake, Capital* River
  Capitalization
  List structure
After the creation of an initial set of rules: run on the corpus, analyze the results, fix the rules and repeat.
This process can take around 2-3 weeks and results in break-even performance of 85-90%. Better performance (95-98%) can be achieved with more effort (2-3 months).
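The rule styles listed above (lexicon lookups aside) can be sketched with Python's re module. The two patterns below are simplified illustrations of the "Capital* Inc" and "Capital* River" rule shapes, not the actual rule set of any product.

```python
import re

# A sequence of capitalized words, e.g. "Applied Materials"
CAP = r"(?:[A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*)"

# Companies: Capital* [,] Inc / Corp / Corporation
COMPANY = re.compile(CAP + r"\s*,?\s+(?:Inc|Corp|Corporation)\b")

# Locations: Capital* Lake / River
LOCATION = re.compile(CAP + r"\s+(?:Lake|River)\b")

text = "Dynegy Inc may offer to acquire Enron Corp near the Hudson River."
companies = COMPANY.findall(text)   # full matches, since all groups are non-capturing
locations = LOCATION.findall(text)
```

Note that the patterns use only non-capturing groups, so findall returns the full matched names rather than group fragments.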
Introduction to HMMs for IE
50
Motivation
We can view named entity extraction as a classification problem, where we classify each word as belonging to one of the named entity classes or to the no-name class.
One of the most popular techniques for classifying sequences is the HMM.
An example of using HMMs for another NLP classification task is part-of-speech tagging (Church, 1988; Weischedel et al., 1993).
51
What is HMM?
An HMM (Hidden Markov Model) is a finite state automaton with stochastic state transitions and symbol emissions (Rabiner 1989).
The automaton models a probabilistic generative process.
In this process a sequence of symbols is produced by starting in an initial state, transitioning to a new state, emitting a symbol selected by the state, and repeating this transition/emission cycle until a designated final state is reached.
52
Disadvantage of HMM
The main disadvantage of using an HMM for information extraction is the need for a large amount of training data, i.e., a carefully tagged corpus.
The corpus needs to be tagged with all the concepts whose definitions we want to learn.
53
Notational Conventions
T = length of the sequence of observations (training set)
N = number of states in the model
q_t = the actual state at time t
S = {S_1, ..., S_N} (finite set of possible states)
V = {O_1, ..., O_M} (finite set of observation symbols)
π = {π_i} = {P(q_1 = S_i)}, the starting probabilities
A = {a_ij} = {P(q_{t+1} = S_j | q_t = S_i)}, the transition probabilities
B = {b_i(O_t)} = {P(O_t | q_t = S_i)}, the emission probabilities
54
The Classic Problems Related to HMMs
1. Find P(O | λ): the probability of an observation sequence given the HMM model.
2. Find the most likely state trajectory given λ and O.
3. Adjust λ = (π, A, B) to maximize P(O | λ).
55
Calculating P(O | λ)
The most obvious way to do this would be to enumerate every possible state sequence of length T (the length of the observation sequence). Let Q = q_1, ..., q_T; then by assuming independence between the states we have
P(O | Q, λ) = Π_{t=1..T} P(O_t | q_t, λ) = Π_{t=1..T} b_{q_t}(O_t)
P(Q | λ) = π_{q_1} Π_{t=1..T-1} a_{q_t q_{t+1}}
By using Bayes' theorem we have
P(O, Q | λ) = P(O | Q, λ) P(Q | λ)
Finally, P(O | λ) = Σ_Q P(O, Q | λ). The main problem with this is that we need to do 2T·N^T multiplications, which is certainly not feasible even for a modest T like 10.
56
The forward-backward algorithm
In order to solve this we use the forward-backward algorithm, which is far more efficient. The forward part is based on the computation of terms called the alpha terms. We define the alpha values as follows:
α_1(i) = π_i b_i(O_1)
α_{t+1}(j) = [Σ_{i=1..N} α_t(i) a_ij] b_j(O_{t+1})
P(O | λ) = Σ_{i=1..N} α_T(i)
We can compute the alpha values inductively in a very efficient way. This calculation requires just N²T multiplications.
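The forward computation can be written directly from these formulas. The two-state weather model below is a standard textbook example, used here only as hypothetical test data; it is not from the slides.

```python
def forward(pi, A, B, obs):
    """Compute P(O | lambda) with the forward algorithm in O(N^2 T) time.

    pi[i]   : starting probability of state i
    A[i][j] : transition probability from state i to state j
    B[i][o] : probability that state i emits symbol o
    obs     : the observation sequence
    """
    states = list(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = {i: pi[i] * B[i][obs[0]] for i in states}
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(O_{t+1})
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in states) * B[j][o]
                 for j in states}
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha.values())

pi = {"Rainy": 0.6, "Sunny": 0.4}
A = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
B = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
     "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

p = forward(pi, A, B, ["walk", "shop", "clean"])  # 0.033612
```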
57
The backward phase
In a similar manner we can define a backward variable called beta that computes the probability of a partial observation sequence from t+1 to T. The beta variable is also computed inductively, but in a backward fashion:
β_T(i) = 1
β_t(i) = Σ_{j=1..N} a_ij b_j(O_{t+1}) β_{t+1}(j)
58
Solution for the second problem
Our main goal is to find the optimal state sequence. We will do that by maximizing the probability of each state individually.
We start by defining a set of gamma variables that measure the probability that at time t we are in state S_i:
γ_t(i) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1..N} α_t(i) β_t(i)
The denominator is used just to make gamma a true probability measure. Now we can find the best state at each time slot in a local fashion:
q_t = argmax_{1≤i≤N} γ_t(i)
The main problem with this approach is that the optimization is done locally, and not on the whole sequence of states. This can lead either to a local maximum or even to an invalid sequence. In order to solve that problem we use a well known dynamic programming algorithm called the Viterbi algorithm.
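Combining the alpha and beta values gives the gamma variables and the locally best state at each time slice. A minimal sketch, reusing the same hypothetical two-state weather model as in the forward-algorithm discussion:

```python
def posterior_decode(pi, A, B, obs):
    """Pick argmax_i gamma_t(i) at every time slice (local optimization)."""
    states = list(pi)
    T = len(obs)
    # Forward pass: alpha_t(i)
    alphas = [{i: pi[i] * B[i][obs[0]] for i in states}]
    for o in obs[1:]:
        prev = alphas[-1]
        alphas.append({j: sum(prev[i] * A[i][j] for i in states) * B[j][o]
                       for j in states})
    # Backward pass: beta_T(i) = 1, then induct backwards
    betas = [{i: 1.0 for i in states}]
    for t in range(T - 2, -1, -1):
        nxt = betas[0]
        betas.insert(0, {i: sum(A[i][j] * B[j][obs[t + 1]] * nxt[j]
                                for j in states) for i in states})
    p_obs = sum(alphas[-1].values())          # P(O | lambda)
    # gamma_t(i) = alpha_t(i) * beta_t(i) / P(O | lambda)
    gammas = [{i: alphas[t][i] * betas[t][i] / p_obs for i in states}
              for t in range(T)]
    return [max(g, key=g.get) for g in gammas]

pi = {"Rainy": 0.6, "Sunny": 0.4}
A = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
B = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
     "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

path = posterior_decode(pi, A, B, ["walk", "shop", "clean"])
```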
59
The Viterbi Algorithm
Intuition
Compute the most likely sequence starting with the empty observation sequence; use this result to compute the most likely sequence with an output sequence of length one; recurse until you have the most likely sequence for the entire sequence of observations.
Algorithmic Details
The delta variables compute the highest probability of a partial sequence up to time t that ends in state S_i. The psi variables enable us to accumulate the best sequence as we move along the time slices.
1. Initialization:
δ_1(i) = π_i b_i(O_1)
ψ_1(i) = 0
60
Viterbi (cont.)
2. Recursion:
δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_ij] b_j(O_t)
ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij]
3. Termination:
P* = max_{1≤i≤N} [δ_T(i)]
q*_T = argmax_{1≤i≤N} [δ_T(i)]
4. Reconstruction:
q*_t = ψ_{t+1}(q*_{t+1}) for t = T-1, T-2, ..., 1.
The resulting sequence q*_1, q*_2, ..., q*_T solves Problem 2.
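The initialization, recursion, termination and reconstruction steps translate directly into code. The weather model is again the standard textbook example (hypothetical data, not from the slides); for the observations walk, shop, clean its well-known optimal path is Sunny, Rainy, Rainy.

```python
def viterbi(pi, A, B, obs):
    """Return the most likely state sequence for obs (the Viterbi algorithm)."""
    states = list(pi)
    # Initialization: delta_1(i) = pi_i * b_i(O_1), psi_1(i) = 0
    delta = {i: pi[i] * B[i][obs[0]] for i in states}
    psis = []
    # Recursion: delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(O_t)
    for o in obs[1:]:
        psi = {j: max(states, key=lambda i: delta[i] * A[i][j]) for j in states}
        delta = {j: delta[psi[j]] * A[psi[j]][j] * B[j][o] for j in states}
        psis.append(psi)
    # Termination: q*_T = argmax_i delta_T(i)
    best = max(delta, key=delta.get)
    # Reconstruction: q*_t = psi_{t+1}(q*_{t+1})
    path = [best]
    for psi in reversed(psis):
        path.insert(0, psi[path[0]])
    return path

pi = {"Rainy": 0.6, "Sunny": 0.4}
A = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
B = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
     "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

path = viterbi(pi, A, B, ["walk", "shop", "clean"])  # Sunny, Rainy, Rainy
```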
61
Viterbi (Example)
[Figure: a three-state HMM over the alphabet {a, b}, with states S1, S2, S3, their starting probabilities, transition probabilities, and per-state emission probabilities for a and b.]
Trellis of delta values for the observation sequence a b a:
state   t1     t2     t3
S1      0.25   0.02   0.016
S2      0      0.04   0.008
S3      0.1    0.08   0.016
62
The Just Research HMM
Each HMM extracts just one field of a given document. If more fields are needed, several HMMs need to be constructed.
The HMM takes the entire document as one observation sequence.
The HMM contains two classes of states: background states and target states. The background states emit words in which we are not interested, while the target states emit words that constitute the information to be extracted.
The state topology is designed by hand and only a few transitions are allowed between the states.
63
Possible HMM Topologies
[Figure: example topologies built from an initial state, a final state, a background state, and a target state surrounded by prefix and suffix states.]
64
A More General HMM Architecture
[Figure: an initial state and final state, a background state, and multiple target states surrounded by prefix and suffix states.]
65
Experimental Evaluation
Acquisition fields:
  Acquiring Company                 30.9%
  Acquired Company                  48.1%
  Abbreviation of Acquired Company  40.1%
  Price of Acquisition              55.3%
  Status of Acquisition             46.7%
Seminar announcement fields:
  Speaker     71.1%
  Location    83.9%
  Start Time  99.1%
  End Time    59.5%
66
BBN's IdentiFinder
An ergodic bigram model.
Each Name Class has a separate region in the HMM.
The number of states in each NC region is equal to |V|; each word has its own state.
Rather than using plain words, extended words are used. An extended word is a pair ⟨w, f⟩, where f is a feature of the word w.
67
BBN's HMM Architecture
[Figure: a Start state and an End state, with separate regions for Company Name, Person Name, and No Name.]
68
Possible Word Features
1. 2 digit number (01)
2. 4 digit number (1996)
3. alphanumeric string (A34-24)
4. digits and dashes (12-16-02)
5. digits and slashes (12/16/02)
6. digits and comma (1,000)
7. digits and period (2.34)
8. any other number (100)
9. All capital letters (CLF)
10. Capital letter and a period (M.)
11. First word of a sentence (The)
12. Initial letter of the word is capitalized (Albert)
13. word in lower case (country)
14. all other words and tokens (;)
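The feature list can be implemented as a first-match cascade of regular-expression checks. Treating the list order as priority order is an assumption, and the feature names below are made up for this sketch.

```python
import re

def word_feature(token, sentence_initial=False):
    """Map a token to one of the 14 word features listed above."""
    checks = [
        ("two_digit_num",  r"\d{2}"),                                  # 01
        ("four_digit_num", r"\d{4}"),                                  # 1996
        ("alphanumeric",   r"(?=.*\d)(?=.*[A-Za-z])[A-Za-z\d-]+"),     # A34-24
        ("num_and_dash",   r"\d+(-\d+)+"),                             # 12-16-02
        ("num_and_slash",  r"\d+(/\d+)+"),                             # 12/16/02
        ("num_and_comma",  r"\d{1,3}(,\d{3})+"),                       # 1,000
        ("num_and_period", r"\d+\.\d+"),                               # 2.34
        ("other_num",      r"\d+"),                                    # 100
        ("all_caps",       r"[A-Z]{2,}"),                              # CLF
        ("cap_period",     r"[A-Z]\."),                                # M.
        ("init_cap",       r"[A-Z][a-z]+"),                            # Albert
        ("lowercase",      r"[a-z]+"),                                 # country
    ]
    for name, pattern in checks:
        if re.fullmatch(pattern, token):
            # A capitalized word that starts a sentence is its own feature
            if name == "init_cap" and sentence_initial:
                return "first_word"                                    # The
            return name
    return "other"                                                     # ;
```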
69
Statistical Model
The design of the formal model is done in levels.
At the first level we have the most accurate model, which requires the largest amount of training data.
At the lower levels we have back-off models that are less accurate but also require much smaller amounts of training data.
We always try to use the most accurate model possible given the amount of available training data.
70
Computing State Transition
Probabilities When we want to analyze formally the probability of
annotating a given word sequence with a set of name
classes, we need to consider three different statistical
models:
A model for generating a name class
A model to generate the first word in a name class
A model to generate all other words (but the first word) in
a name class
71
Computing the Probabilities: Details
The model to generate a name class depends on the previous name class and on the word that precedes the name class; this is the last word in the previous name class, and we annotate it by w_{-1}. So formally this amounts to P(NC | NC_{-1}, w_{-1}).
The model to generate the first word in a name class depends on the current name class and the previous name class, and hence is P(⟨w,f⟩_first | NC, NC_{-1}).
The model to generate all other words within the same name class depends on the previous word (within the same name class) and the current name class, so formally it is P(⟨w,f⟩ | ⟨w,f⟩_{-1}, NC).
72
The Actual Computation
P(NC | NC_{-1}, w_{-1}) = c(NC, NC_{-1}, w_{-1}) / c(NC_{-1}, w_{-1})
P(⟨w,f⟩_first | NC, NC_{-1}) = c(⟨w,f⟩_first, NC, NC_{-1}) / c(NC, NC_{-1})
where c(...) denotes the number of occurrences in the training data.
Results (recall/precision):
              ace+muc7    muc7
person        84.9/88.6   91.9/85.5
organization  83.1/95.9   91.1/93.7
date          59/89.5     90.9/76.6
time          68.6/92.5   76.4/77.6
location      77.7/91.7   90.7/91.3
money         86.6/82.1   97.6/82.1
percent       50/29.6     93.7/40.54
87
Results with our new algorithm
[Figure: F-measure (%) vs. amount of training data in KWords, on a logarithmic scale from 10.7 to 3741 where each tick multiplies the amount by 7/6, with curves for DATE, LOCATION, ORGANIZATION and PERSON compared against DATE-Nymble, LOC-Nymble and ORG-Nymble; the currently available amount of training data is marked.]
88
Why is HMM not enough?
The HMM model is flat, so the most it can do is assign a tag to each token in a sentence.
This is suitable for tasks where the tagged sequences do not nest and where there are no explicit relations between the sequences. Part-of-speech tagging and entity extraction belong to this category.
Extracting relationships is different, because the tagged sequences can (and must) nest, and there are relations between them which must be explicitly recognized.
<ACQUISITION> Ethicon Endo-Surgery, Inc., a Johnson & Johnson company, has acquired <ACQUIRED> Obtech Medical AG
89
A Hybrid Approach
The hybrid strategy attempts to strike a balance between the two knowledge-engineering chores: writing the extraction rules and manually tagging the documents.
In this strategy, the knowledge engineer writes SCFG rules, which are then trained on whatever data is available.
The powerful disambiguating ability of the SCFG makes writing rules a much simpler and cleaner task.
The knowledge engineer controls the generality of the rules (s)he writes, and consequently the amount and the quality of the manually tagged training data the system would require.
90
Defining SCFG
Classical definition: A stochastic context-free grammar (SCFG) is a quintuple G = (T, N, S, R, P), where
T is the alphabet of terminal symbols (tokens)
N is the set of nonterminals
S is the starting nonterminal
R is the set of rules
P : R → [0..1] defines their probabilities.
The rules have the form n → s_1 s_2 ... s_k, where n is a nonterminal and each s_i is either a token or another nonterminal.
An SCFG is a usual context-free grammar with the addition of the P function.
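A minimal sketch of the quintuple in code: rules mapped to probabilities, a check that P is a per-nonterminal probability distribution, and the probability of a derivation under the classical independence assumption. The toy grammar is invented for illustration.

```python
from collections import defaultdict

# Rules as (head_nonterminal, tuple_of_symbols); P maps each rule to its probability.
rules = {
    ("NP", ("Det", "N")):    0.7,
    ("NP", ("N",)):          0.3,
    ("Det", ("the",)):       1.0,
    ("N", ("company",)):     0.6,
    ("N", ("acquisition",)): 0.4,
}

def is_valid_scfg(P):
    """For every nonterminal n, the sum of P(r) over rules headed by n must be 1."""
    totals = defaultdict(float)
    for (head, _), p in P.items():
        totals[head] += p
    return all(abs(t - 1.0) < 1e-9 for t in totals.values())

def derivation_prob(P, derivation):
    """Probability of a parse = product of the probabilities of its rules."""
    prob = 1.0
    for rule in derivation:
        prob *= P[rule]
    return prob

# Deriving "the company" from NP: NP -> Det N, Det -> the, N -> company
p_parse = derivation_prob(rules, [("NP", ("Det", "N")),
                                  ("Det", ("the",)),
                                  ("N", ("company",))])  # 0.7 * 1.0 * 0.6
```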
91
The Probability Function
If r is the rule n → s_1 s_2 ... s_k, then P(r) is the frequency of expanding n using this rule.
In Bayesian terms, if it is known that a given sequence of tokens was generated by expanding n, then P(r) is the a priori likelihood that n was expanded using the rule r.
Thus, it follows that for every nonterminal n, the sum of P(r) over all rules r headed by n must equal one.
92
IE with SCFG
A very basic parsing is employed for the
bulk of a text, but within the relevant parts,the grammar is much more detailed.
The IE grammars can be said to define
sublanguages for very specific domains.
93
IE with SCFG
In the classical definition of SCFG it is assumed that the rules are all independent. In this case it is possible to find the (unconditional) probability of a given parse tree by simply multiplying the probabilities of all rules participating in it.
The usual parsing problem is, given a sequence of tokens (a string) S, to find the most probable parse tree T which could generate S. A simple generalization of the Viterbi algorithm is able to efficiently solve this problem.
In practical applications of SCFGs, it is rarely the case that the rules are truly independent. Then the easiest way to cope with this problem, while leaving most of the formalism intact, is to let the probabilities P(r) be conditioned upon the context where the rule is applied.
94
Markovian SCFG
HMM entity extractors are a simple case of Markovian SCFGs.
Every possible rule which can be formed from the available symbols has nonzero probability.
Usually, all probabilities are initially set to be equal, and then adjusted according to the distributions found in the training data.
95
Training Issues
For some problems the available training corpora appear to be adequate.
In particular, Markovian SCFG parsers trained on the Penn Treebank perform quite well (Collins 1997, Charniak 2000, Roark 2001, etc.).
But for the task of relationship extraction it turns out to be impractical to manually tag the amount of documents that would be sufficient to adequately train a Markovian SCFG.
At a certain point it becomes more productive to go back to the original hand-crafted system and write rules for it, even though that is much more skilled labor!
96
SCFG Syntax

A rulebook consists of declarations and rules. All nonterminals must be declared before usage.
Some of them can be declared as output concepts, which are the entities, events, and facts that the system is designed to extract.
Additionally, two classes of terminal symbols also require declaration: termlists and ngrams.
A termlist is a collection of terms from a single semantic category, either written explicitly or loaded from an external source.
An ngram can expand to any single token, but the probability of generating a given token is not fixed in the rules; it is learned from the training dataset and may be conditioned upon one or more previous tokens. Thus ngrams are one of the ways the probabilities of the SCFG rules can be made context-dependent.
97
Example

output concept Acquisition(Acquirer, Acquired);
ngram AdjunctWord;
nonterminal Adjunct;
Adjunct :- AdjunctWord Adjunct | AdjunctWord;
termlist AcquireTerm = acquired bought (has acquired) (has bought);
Acquisition :- Company->Acquirer [ "," Adjunct "," ]
               AcquireTerm Company->Acquired;
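The Acquisition rule above can be approximated, very loosely, with a regular expression. The naive capitalized-word company pattern and the sample sentence are illustrative stand-ins, not part of the actual system:

```python
import re

# Toy regex rendering of the Acquisition rule: acquirer company, optional
# appositive adjunct between commas, an acquisition term, acquired company.
company = r"(?:[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
acquire_term = r"(?:acquired|bought|has acquired|has bought)"
pattern = re.compile(
    rf"(?P<Acquirer>{company})(?:, [^,]+,)? (?P<Term>{acquire_term}) "
    rf"(?P<Acquired>{company})"
)

m = pattern.search("Acme Corp, a maker of widgets, acquired Globex Industries.")
print(m.group("Acquirer"), "->", m.group("Acquired"))  # Acme Corp -> Globex Industries
```

Unlike this regex, the real grammar rule assigns probabilities to its parts, so it can weigh competing analyses rather than just match or fail.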
98
Emulation of an HMM Entity Extractor in SCFG
output concept Company();
ngram CompanyFirstWord;
ngram CompanyWord;
ngram CompanyLastWord;
nonterminal CompanyNext;
Company :- CompanyFirstWord CompanyNext | CompanyFirstWord;
CompanyNext :- CompanyWord CompanyNext | CompanyLastWord;
99
Putting it all together

start Text;
nonterminal None;
ngram NoneWord;
None :- NoneWord None | ;
Text :- None Text | Company Text | Acquisition Text | ;
100
SCFG Training

Currently there are three different classes of trainable parameters in a TEG rulebook:
the probabilities of the rules of nonterminals
the probabilities of different expansions of ngrams
the probabilities of terms in a wordclass
All these probabilities are smoothed maximum-likelihood estimates, calculated directly from the frequencies of the corresponding elements in the training dataset.
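A sketch of smoothed maximum-likelihood estimation over rule frequencies. The counts are hypothetical and add-one smoothing is just one choice; the actual TEG trainer may use a different smoothing scheme:

```python
from collections import Counter

def smoothed_mle(counts, alpha=1.0):
    """Add-alpha smoothed MLE over the observed items.
    (A stand-in for whatever smoothing the real trainer applies; a full
    implementation would also reserve mass for unseen rules.)"""
    total = sum(counts.values()) + alpha * len(counts)
    return {item: (c + alpha) / total for item, c in counts.items()}

# Hypothetical frequencies of the two Person rules in a training parse forest:
rule_counts = Counter({
    ("Person", ("WCHonorific", "NGLastName")): 3,
    ("Person", ("NGFirstName", "NGLastName")): 1,
})
probs = smoothed_mle(rule_counts)
print(probs[("Person", ("WCHonorific", "NGLastName"))])  # (3+1)/(4+2)
```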
101
Sample Rules
concept Text;
concept output Person;
ngram NGFirstName;
ngram NGLastName;
ngram NGNone;
wordclass WCHonorific = Mr Mrs Miss Ms Dr;
Person :- WCHonorific NGLastName;
Person :- NGFirstName NGLastName;
Text :- NGNone Text;
Text :- Person Text;
Text :- ;
102
Training the rulebook

By default, the initial untrained frequencies of all elements are assumed to be 1. They can be changed explicitly in the rulebook syntax.
Let us train this rulebook on a training set containing one sentence: Yesterday, Dr Simmons, the distinguished scientist, presented the discovery.
This is done in two steps. First, the sentence is parsed using the untrained rulebook, but with the constraints specified by the annotations. In our case the constraints are satisfied by two different parses:
103
2 Possible Parse Trees

Parse 1:                        Parse 2:
Text                            Text
  NGNone: Yesterday               NGNone: Yesterday
  Text                            Text
    NGNone: ,                       NGNone: ,
    Text                            Text
      Person                          Person
        WCHonorific: Dr                 NGFirstName: Dr
        NGLastName: Simmons             NGLastName: Simmons
      Text                            Text
        NGNone: ,                       NGNone: ,
        Text                            Text
          ...                             ...
104
How to pick the right Parse?

The difference is in the expansion of the Person nonterminal. Both Person rules can produce the output instance, so there is an ambiguity.
In this case it is resolved in favor of the WCHonorific interpretation, because in the untrained rulebook we have
P(Dr | WCHonorific) = 1/5 (the choice of one term among five equiprobable ones), while
P(Dr | NGFirstName) ~ 1/N, where N is the number of all known words.
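The arithmetic behind the disambiguation, assuming a hypothetical vocabulary of 50,000 known words:

```python
# Untrained estimates from the slide: "Dr" as an honorific is one of five
# equiprobable terms, while as a first name it competes with every known word.
VOCAB_SIZE = 50_000              # hypothetical number of known words (N)

p_dr_honorific = 1 / 5           # P(Dr | WCHonorific)
p_dr_firstname = 1 / VOCAB_SIZE  # P(Dr | NGFirstName) ~ 1/N

# The WCHonorific parse wins because it assigns "Dr" far more probability mass:
print(round(p_dr_honorific / p_dr_firstname))  # 10000
```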
105
The Trained Rule Book
concept Text;
concept output Person;
ngram NGFirstName;
ngram NGLastName;
ngram NGNone;
wordclass WCHonorific = Mr Mrs Miss Ms Dr;
Person :- WCHonorific NGLastName;
Person :- NGFirstName NGLastName;
Text :- NGNone Text;
Text :- Person Text;
Text :- ;
106
Ngram Training
Additionally, the training program will generate a separate file containing the statistics for the ngrams. It is similar in nature but a bit more complex, because the bigram frequencies, token-feature frequencies, and unknown-word frequencies are also saved.
In order to understand the details of ngram training it is necessary to go over the details of their internal workings.
An ngram always generates a single token. Any ngram can generate any token, but naturally the probability of generating it depends on the ngram, on the token, and on the immediately preceding context of the token.
107
MUC 7
              Full SCFG system        Emulation using SCFG    HMM entity extractor
              F      Prec   Recall    F      Prec   Recall    F      Prec   Recall
Location      90.58  94.42  87.05     86.91  90.12  83.93     86.66  87.2   86.12
Organization  90.19  90.9   89.49     87.7   89.53  85.94     88.84  89.75  87.94
Person        92.24  90.78  93.75     86.57  86.83  86.31     86.01  85.13  86.91
108
DIAL vs. SCFG
              Full SCFG system        Manual Rules (written in DIAL)
              F      Prec   Recall    F      Prec   Recall
Location      90.58  94.42  87.05     90.49  89.5   91.46
Organization  90.19  90.9   89.49     88.05  93.4   82.74
Person        92.24  90.78  93.75     87.53  93.8   81.32
109
ACE 2
              Full SCFG system        Ergodic SCFG            HMM entity extractor
              F      Prec   Recall    F      Prec   Recall    F      Prec   Recall
GPE           86.84  84.94  88.83     85.84  84.96  86.74     84.37  83.22  85.54
Organization  64.76  71.06  59.49     65.06  71.02  60.03     58.05  64.73  52.62
Person        85.56  81.68  89.82     84.82  81.65  88.25     84.37  83.22  85.54
Role          80.25  77.3   83.44     61.11  76.24  50.99     -      -      -
110
DIAL vs. TEG (ACE-1)
              Full TEG system         Manual Rules (written in DIAL)
              F      Prec   Recall    F      Prec   Recall
Facility      40.74  50.00  31.48     47.87  55.0   40.74
Total         76.25  77.48  75.01     80.56  85.73  75.39
GPE           86.96  83.63  90.03     89.39  89.26  89.53
Person        80.71  80.68  80.75     83.59  88.56  78.62
Location      32.38  44.44  20.33     37.44  54.54  20.34
Organization  59.5   64.50  54.50     68.58  78.25  58.92
113
Final Comparison
F1 scores:

          HMM      G TEG    DIAL     Best TEG
{Sum}     71.4135  72.633   80.5605  80.695
FAC       22.9755  24.074   47.8705  38.0825
LOC       29.3645  31.598   37.442   35.091
GPE       83.8815  85.256   89.392   88.957
PERSON    78.0975  76.422   83.5945  84.835
ORG       49.691   54.113   68.5895  69.4895
114
Complexity
Notation:
S: number of states in the grammar
N: number of tokens in the sentence
C: number of non-terminals in the grammar
Well-behaved HMM: S*N
Nymble-like HMM: S^2*N (where S = C)
TEG: S*C*N^3
However:
Most portions can be executed as transducers
We can use smaller sliding windows
We can use approximations
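Plugging hypothetical grammar and sentence sizes into these cost formulas shows why the N^3 factor dominates:

```python
def hmm_cost(S, N):            # well-behaved HMM: O(S * N)
    return S * N

def nymble_cost(C, N):         # Nymble-like HMM, where S = C: O(S^2 * N)
    return C**2 * N

def teg_cost(S, C, N):         # TEG: O(S * C * N^3)
    return S * C * N**3

# Hypothetical sizes: 50 states, 10 nonterminals, a 20-token sentence.
print(hmm_cost(50, 20))        # 1000
print(nymble_cost(10, 20))     # 2000
print(teg_cost(50, 10, 20))    # 4000000
```

Hence the mitigations listed above (transducers, sliding windows, approximations) target the N^3 term.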
115
Simple ROLE Rules (ACE-2)

ROLE :- [Position_Before] ORGANIZATION->ROLE_2 Position ["in" GPE] [","] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 Position [","] PERSON->ROLE_1;
ROLE :- PERSON->ROLE_1 "of" GPE->ROLE_2;
ROLE :- ORGANIZATION->ROLE_2 "'" "s" [Position] PERSON->ROLE_1;
116
A slightly more complicated set of rules

ROLE :- [Position_Before] ORGANIZATION->ROLE_2 Position ["in" GPE] [","] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 Position [","] PERSON->ROLE_1;
ROLE :- PERSON->ROLE_1 "of" GPE->ROLE_2;
ROLE :- ORGANIZATION->ROLE_2 "'" "s" [Position] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 [Position] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 "'" "s" ORGANIZATION->ROLE_1;
ROLE :- PERSON->ROLE_1 "," Position WCPreposition ORGANIZATION->ROLE_2;
119
Easy and hard in Coreference

Mohamed Atta, a suspected leader of the hijackers, had spent time in Belle Glade, Fla., where a crop-dusting business is located. Atta and other Middle Eastern men came to South Florida Crop Care nearly every weekend for two months.
Will Lee, the firm's general manager, said the men asked repeated questions about the crop-dusting business. He said the questions seemed "odd," but he didn't find the men suspicious until after the Sept. 11 attack.
120
The Simple Coreference

Proper Names
IBM, International Business Machines, Big Blue
Osama Bin Ladin, Bin Ladin, Usama Bin Laden (note the variations)
Definite Noun Phrases
The Giant Computer Manufacturer, The Company, The owner of over 600,000 patents
Pronouns
It, he, she, we, ...
121
Coreference Example

Granite Systems provides Service Resource Management (SRM) software for communication service providers with wireless, wireline, optical and packet technology networks. Utilizing Granite's Xng System, carriers can manage inventories of network resources and capacity, define network configurations, order and schedule resources and provide a database of record to other operational support systems (OSSs). An award-winning company, including an Inc. 500 company in 2000 and 2001, Granite Systems enables clients including AT&T Wireless, KPN Belgium, COLT Telecom, ATG and Verizon to eliminate resource redundancy, improve network reliability and speed service deployment. Founded in 1993, the company is headquartered in Manchester, NH with offices in Denver, CO; Miami, FL; London, U.K.; Nice, France; Paris, France; Madrid, Spain; Rome, Italy; Copenhagen, Denmark; and Singapore.
122
A KE Approach to Coreference

Mark each noun phrase with the following:
Type (company, person, location)
Singular vs. plural
Gender (male, female, neutral)
Syntactic type (name, pronoun, definite/indefinite)
For each candidate, find accessible antecedents
Each antecedent type has a different scope:
A proper name's scope is the whole document
A definite clause's scope is the preceding paragraphs
A pronoun's scope might be just the previous sentence, or the same paragraph
Filter by consistency check
Order by dynamic syntactic preference
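The scoping step can be sketched as a candidate-antecedent filter. The mention encoding and the exact window sizes below are illustrative assumptions:

```python
def accessible_antecedents(mention, mentions):
    """Return candidate antecedents within the scope of the mention's kind.
    Mentions carry sentence and paragraph indices; windows are illustrative."""
    earlier = [m for m in mentions
               if m is not mention and m["sent"] <= mention["sent"]]
    if mention["kind"] == "name":            # proper names: the whole document
        return earlier
    if mention["kind"] == "definite":        # definite NPs: preceding paragraphs
        return [m for m in earlier if mention["para"] - m["para"] <= 1]
    # pronouns: the previous sentence or the same one
    return [m for m in earlier if mention["sent"] - m["sent"] <= 1]

mentions = [
    {"text": "Granite Systems", "kind": "name", "sent": 0, "para": 0},
    {"text": "the company", "kind": "definite", "sent": 5, "para": 1},
    {"text": "it", "kind": "pronoun", "sent": 6, "para": 1},
]
print([m["text"] for m in accessible_antecedents(mentions[2], mentions)])
# ['the company']
```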
123
Filtering Antecedents
George Bush will not match "she" or "it"
George Bush cannot be an antecedent of "the company" or "they"
Using an ontology, we can use background information to be smarter.
Example: The big automaker is planning to get out of the car business. The company feels that it can no longer make a profit making cars.
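A minimal sketch of the consistency check on number and gender; real systems also check semantic type and, as noted above, consult an ontology:

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    number: str   # "singular" / "plural"
    gender: str   # "male" / "female" / "neutral"

def consistent(anaphor, antecedent):
    """An anaphor and its antecedent must agree in number and gender."""
    return (anaphor.number == antecedent.number
            and anaphor.gender == antecedent.gender)

bush = Mention("George Bush", "singular", "male")
print(consistent(Mention("she", "singular", "female"), bush))   # False
print(consistent(Mention("they", "plural", "neutral"), bush))   # False
print(consistent(Mention("he", "singular", "male"), bush))      # True
```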
124
AutoSlog (Riloff, 1993)
Creates extraction patterns from annotated texts (NPs).
Uses a sentence analyzer (CIRCUS; Lehnert, 1991) to identify clause boundaries and syntactic constituents (subject, verb, direct object, prepositional phrase).
It then uses heuristic templates to generate extraction patterns.
125
Example Templates
Template            Example
passive-verb prep   was aimed at
active-verb prep    killed with
noun prep           bomb against
active-verb         bombed
aux noun            was victim
passive-verb        was murdered
126
AutoSlog-TS (Riloff, 1996)
It took 8 hours to annotate 160 documents, and hence probably a week to annotate 1000 documents.
This bottleneck is a major problem for using IE in new domains.
Hence there is a need for a system that can generate IE patterns from un-annotated documents.
AutoSlog-TS is such a system.
127
AutoSlog-TS

Stage 1: The sentence analyzer (S: World Trade Center, V: was bombed, PP: by terrorists) and the AutoSlog heuristics generate candidate extraction patterns (e.g., was killed, bombed by).

Stage 2: The sentence analyzer applies all extraction patterns (was killed, was bombed, bombed by, saw) to the corpus and computes relevance statistics:

EP           REL %
was bombed   87%
bombed by    84%
was killed   63%
saw          49%
128
Top 24 Extraction Patterns
exploded on       kidnapped            one of
was murdered      destroyed            was wounded in
occurred on       responsibility for   took place on
was located       occurred             was wounded
claimed           caused               took place
death of          exploded in          was injured
attack on         was kidnapped        was killed
assassination of  murder of            exploded
129
Evaluation
Data Set: 1500 docs from MUC-4 (772 relevant).
AutoSlog generated 1237 patterns, which were manually filtered down to 450 in 5 hours.
AutoSlog-TS generated 32,345 patterns; after discarding singleton patterns, 11,225 were left.
Rank(EP) = (rel-freq / total-freq) * log2(freq)
The user reviewed the top 1970 patterns and selected 210 of them in 85 minutes.
AutoSlog achieved better recall, while AutoSlog-TS achieved better precision.
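The ranking function in Python, with hypothetical pattern counts chosen to match the relevance rates shown earlier:

```python
import math

def rank(rel_freq, total_freq):
    """AutoSlog-TS score: relevance rate times log2 of the pattern frequency."""
    if total_freq == 0:
        return 0.0
    return (rel_freq / total_freq) * math.log2(total_freq)

# Hypothetical counts: each pattern seen 100 times in total.
print(rank(87, 100) > rank(49, 100))  # True: "was bombed" outranks "saw"
```

The log factor keeps rare but perfectly "relevant" patterns from dominating frequent, broadly useful ones.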
130
Learning Dictionaries by Bootstrapping (Riloff and Jones, 1999)

Learn dictionary entries (a semantic lexicon) and extraction patterns simultaneously.
Use untagged text as a training source for learning.
Start with a set of seed lexicon entries and, using mutual bootstrapping, learn extraction patterns and more lexicon entries.
131
Mutual Bootstrapping Algorithm
Using AutoSlog, generate all possible extraction patterns.
Apply the patterns to the corpus and save the results to EPData.
SemLex = Seed Words
Cat_EPList = {}
1. Score all extraction patterns in EPData
2. Best_EP = highest scoring pattern
3. Add Best_EP to Cat_EPList
4. Add Best_EP's extractions to SemLex
5. Goto 1
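The loop above can be sketched in runnable form. Here patterns are scored by raw overlap with the lexicon, whereas the actual system uses an R*log F style score, and the pattern data is invented for illustration:

```python
def mutual_bootstrap(ep_data, seed_words, iterations=5):
    """Grow a semantic lexicon and a category pattern list together.
    ep_data maps each candidate pattern to the set of NPs it extracts."""
    sem_lex = set(seed_words)
    cat_ep_list = []
    for _ in range(iterations):
        # Score each unused pattern by how many known lexicon entries it finds.
        scored = [(len(nps & sem_lex), ep) for ep, nps in ep_data.items()
                  if ep not in cat_ep_list]
        if not scored:
            break
        score, best_ep = max(scored)
        if score == 0:
            break
        cat_ep_list.append(best_ep)          # add Best_EP to Cat_EPList
        sem_lex |= ep_data[best_ep]          # add Best_EP's extractions to SemLex
    return sem_lex, cat_ep_list

ep_data = {                                  # invented corpus extractions
    "offices in <x>": {"nigeria", "germany", "texas"},
    "traveled to <x>": {"germany", "japan"},
    "sold to <x>": {"acme corp"},
}
lex, eps = mutual_bootstrap(ep_data, seed_words={"germany", "japan"})
print(eps)  # ['traveled to <x>', 'offices in <x>']
```

Note how "offices in <x>" drags "nigeria" and "texas" into the lexicon, which is exactly the mechanism by which noise can also creep in.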
132
Meta Bootstrapping Process
Initialize: the Seed Words populate the Temporary Semantic Lexicon.
Select the best_EP from the Candidate Extraction Patterns and their Extractions; add it to the Category EP List.
Add the best_EP's extractions to the Temporary Semantic Lexicon.
After each mutual-bootstrapping run, add the 5 best NPs to the Permanent Semantic Lexicon.
133
Sample Extraction Patterns
terrorism location   www company           www location
process in           offices of            expanded into
returned to          has positions         services in
taken in             request information   distributors in
part in              message to            customers in
ministers of         thrive                outlets in
to enter             devoted to            consulting in
parts of             sold to               activities in
presidents of        motivated             seminars in
sought in            positioning           operates in
become in            is distributor        operations in
traveled to          employed              facilities in
living in            owned by              offices in
134
Experimental Evaluation
               Iter 1    Iter 10      Iter 20       Iter 30       Iter 40       Iter 50
Terr. Weapon   4/4 (1)   31/44 (.70)  68/94 (.72)   85/144 (.59)  101/194 (.52) 124/244 (.51)
Terr. Location 5/5 (1)   32/50 (.64)  66/100 (.66)  100/150 (.67) 127/200 (.64) 158/250 (.63)
Web Title      0/1 (0)   22/31 (.71)  63/81 (.78)   86/131 (.66)  101/181 (.56) 107/231 (.46)
Web Location   5/5 (1)   46/50 (.92)  88/100 (.88)  129/150 (.86) 163/200 (.82) 191/250 (.76)
Web Company    5/5 (1)   25/32 (.78)  52/65 (.80)   72/113 (.64)  86/163 (.53)  95/206 (.46)
Semi-Automatic Approach
Smart Tagging
136
137
138
139
140
141
142
Auditing Environments
Fixing Recall and Precision Problems
144
Auditing Events
145
Auditing a Person-Position-Company Event
146
Updating the Taxonomy and the
Thesaurus with New Entities
149
Using Hoover's
150
References I

Anand T. and Kahn G., 1993. Opportunity Explorer: Navigating Large Databases Using Knowledge Discovery Templates. In Proceedings of the 1993 Workshop on Knowledge Discovery in Databases.
Appelt, Douglas E., Jerry R. Hobbs, John Bear, David Israel, and Mabry Tyson, 1993. "FASTUS: A Finite-State Processor for Information Extraction from Real-World Text". Proceedings of IJCAI-93, Chambery, France, August 1993.
Appelt, Douglas E., Jerry R. Hobbs, John Bear, David Israel, Megumi Kameyama, and Mabry Tyson, 1993a. "The SRI MUC-5 JV-FASTUS Information Extraction System". Proceedings, Fifth Message Understanding Conference (MUC-5), Baltimore, Maryland, August 1993.
Bookstein A., Klein S.T. and Raita T., 1995. Clumping Properties of Content-Bearing Words. In Proceedings of SIGIR95.
Brachman R. J., Selfridge P. G., Terveen L. G., Altman B., Borgida A., Halper F., Kirk T., Lazar A., McGuinness D. L., and Resnick L. A., 1993. Integrated Support for Data Archaeology. International Journal of Intelligent and Cooperative Information Systems 2(2):159-185.
Brill E., 1995. Transformation-based Error-driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4):543-565.
Church K.W. and Hanks P., 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1):22-29.
Cohen W. and Singer Y., 1996. Context Sensitive Learning Methods for Text Categorization. In Proceedings of SIGIR96.
Cohen W., 1992. "Compiling Prior Knowledge into an Explicit Bias". Working Notes of the 1992 AAAI Spring Symposium on Knowledge Assimilation. Stanford, CA, March 1992.
Daille B., Gaussier E. and Lange J.M., 1994. Towards Automatic Extraction of Monolingual and Bilingual Terminology. In Proceedings of the International Conference on Computational Linguistics, COLING94, pages 515-521.
Dumais, S. T., Platt, J., Heckerman, D. and Sahami, M., 1998. Inductive Learning Algorithms and Representations for Text Categorization. Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM98), 148-155.
Dunning T., 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1).
151
References II

Feldman R. and Dagan I., 1995. KDT - Knowledge Discovery in Texts. In Proceedings of the First International Conference on Knowledge Discovery, KDD-95.
Feldman R. and Hirsh H., 1996. Exploiting Background Information in Knowledge Discovery from Text. Journal of Intelligent Information Systems.
Feldman R., Aumann Y., Amir A., Klösgen W. and Zilberstien A., 1997. Maximal Association Rules: a New Tool for Mining for Keyword Co-occurrences in Document Collections. In Proceedings of the 3rd International Conference on Knowledge Discovery, KDD-97, Newport Beach, CA.
Feldman R., Rosenfeld B., Stoppi J., Liberzon Y. and Schler J., 2000. A Framework for Specifying Explicit Bias for Revision of Approximate Information Extraction Rules. KDD 2000: 189-199.
Fisher D., Soderland S., McCarthy J., Feng F. and Lehnert W., 1995. "Description of the UMass Systems as Used for MUC-6". In Proceedings of the 6th Message Understanding Conference, November 1995, pp. 127-140.
Frantzi T.K., 1997. Incorporating Context Information for the Extraction of Terms. In Proceedings of ACL-EACL97.
Frawley W. J., Piatetsky-Shapiro G., and Matheus C. J., 1991. Knowledge Discovery in Databases: an Overview. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 1-27, MIT Press.
Grishman R., 1996. The Role of Syntax in Information Extraction. In: Advances in Text Processing: Tipster Program Phase II, Morgan Kaufmann.
Hahn U. and Schnattinger K., 1997. Deep Knowledge Discovery from Natural Language Texts. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 175-178.
Hayes Ph., 1992. Intelligent High-Volume Processing Using Shallow, Domain-Specific Techniques. Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval. New Jersey, pp. 227-242.
Hayes P.J. and Weinstein S.P., 1990. CONSTRUE: A System for Content-Based Indexing of a Database of News Stories. Second Annual Conference on Innovative Applications of Artificial Intelligence.
Hobbs J., 1986. Resolving Pronoun References. In B. J. Grosz, K. Sparck Jones, & B. L. Webber (Eds.), Readings in Natural Language Processing (pp. 339-352), Los Altos, CA: Morgan Kaufmann Publishers, Inc.
Hobbs, Jerry R., Douglas E. Appelt, John Bear, David Israel, Megumi Kameyama, and Mabry Tyson, 1992. "FASTUS: A System for Extracting Information from Text". Proceedings, Human Language Technology, Princeton, New Jersey, March 1992, pp. 133-137.
Hobbs, Jerry R., Mark Stickel, Douglas Appelt, and Paul Martin, 1993. "Interpretation as Abduction". Artificial Intelligence, Vol. 63, Nos. 1-2, pp. 69-142. Also published as SRI International Artificial Intelligence Center Technical Note 499, December 1990.
152
References III

Hull D., 1996. Stemming Algorithms - a Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84.
Ingria R. and Stallard D., 1989. A Computational Mechanism for Pronominal Reference. Proceedings, 27th Annual Meeting of the Association for Computational Linguistics, Vancouver.
Cowie J. and Lehnert W., "Information Extraction". Communications of the Association of Computing Machinery, vol. 39 (1), pp. 80-91.
Joachims T., 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning (ECML98).
Justeson J. S. and Katz S. M., 1995. Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text. Natural Language Engineering, 1(1):9-27.
Klösgen W., 1992. Problems for Knowledge Discovery in Databases and Their Treatment in the Statistics Interpreter EXPLORA. International Journal for Intelligent Systems, 7(7):649-673.
Lappin S. and McCord M., 1990. A Syntactic Filter on Pronominal Anaphora for Slot Grammar. Proceedings, 28th Annual Meeting of the Association for Computational Linguistics, University of Pittsburgh.
Larkey L. S. and Croft W. B., 1996. Combining Classifiers in Text Categorization. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, CH), pp. 289-297.
Lehnert, Wendy, Claire Cardie, David Fisher, Ellen Riloff, and Robert Williams, 1991. "Description of the CIRCUS System as Used for MUC-3". Proceedings, Third Message Understanding Conference (MUC-3), San Diego, California, pp. 223-233.
Lent B., Agrawal R. and Srikant R., 1997. Discovering Trends in Text Databases. In Proceedings of the 3rd International Conference on Knowledge Discovery, KDD-97, Newport Beach, CA.
Lewis D. D., 1995. Evaluating and Optimizing Autonomous Text Classification Systems. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, US), pp. 246-254.
Lewis D. D. and Hayes P. J., 1994. Guest Editorial for the Special Issue on Text Categorization. ACM Transactions on Information Systems 12, 3, 231.
Lewis D. D. and Ringuette M., 1994. A Comparison of Two Learning Algorithms for Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US), pp. 81-93.
Lewis D. D., Schapire R. E., Callan J. P., and Papka R., 1996. Training Algorithms for Linear Text Classifiers. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, CH), pp. 298-306.
153
References IV

Lin D., 1995. University of Manitoba: Description of the PIE System as Used for MUC-6. In Proceedings of the Sixth Conference on Message Understanding (MUC-6), Columbia, Maryland.
Piatetsky-Shapiro G. and Frawley W. (Eds.), 1991. Knowledge Discovery in Databases, AAAI Press, Menlo Park, CA.
Rajman M. and Besançon R., 1997. Text Mining: Natural Language Techniques and Text Mining Applications. In Proceedings of the Seventh IFIP 2.6 Working Conference on Database Semantics (DS-7), Chapman & Hall IFIP Proceedings series. Leysin, Switzerland, Oct 7-10, 1997.
Riloff Ellen and Lehnert Wendy, 1994. Information Extraction as a Basis for High-Precision Text Classification. ACM Transactions on Information Systems (special issue on text categorization); also Umass-TE-24.
Sebastiani F., 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, number 1, pages 1-47.
Sheila Tejada, Craig A. Knoblock and Steven Minton, 2001. Learning Object Identification Rules for Information Integration. Information Systems Vol. 26, No. 8, pp. 607-633.
Soderland S., Fisher D., Aseltine J., and Lehnert W., 1995. "Issues in Inductive Learning of Domain-Specific Text Extraction Rules". Proceedings of the Workshop on New Approaches to Learning for Natural Language Processing at the Fourteenth International Joint Conference on Artificial Intelligence.
Srikant R. and Agrawal R., 1995. Mining Generalized Association Rules. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland.
Sundheim, Beth, ed., 1993. Proceedings, Fifth Message Understanding Conference (MUC-5), Baltimore, Maryland, August 1993. Distributed by Morgan Kaufmann Publishers, Inc., San Mateo, California.
Tipster Text Program (Phase I), 1993. Proceedings, Advanced Research Projects Agency, September 1993.
Vapnik V., 1979. Estimation of Dependencies Based on Data [in Russian], Nauka, Moscow. (English translation: Springer Verlag, 1982.)
Vapnik V., 1995. The Nature of Statistical Learning Theory, Springer-Verlag.
Yang Y. and Chute C. G., 1994. An Example-Based Mapping Method for Text Categorization and Retrieval. ACM Transactions on Information Systems 12, 3, 252-277.
Yang Y. and Liu X., 1999. A Re-examination of Text Categorization Methods. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, US), pp. 42-49.