
    Mining Unstructured Data

Ronen Feldman, Computer Science Department, Bar-Ilan University, Israel

    ClearForest [email protected]

    2

    Outline

Introduction to Text Mining

Information Extraction

Entity Extraction

    Relationship Extraction

    KE Approach

    IE Languages

ML Approaches to IE

HMM

    Anaphora resolution

    Evaluation

    Link Detection

    Introduction to LD

Architecture of LD systems

    9/11

    Link Analysis Demos


    3

    Background

Rapid proliferation of information available in digital format

People have less time to absorb more information

    4

    The Information Landscape

Unstructured: Textual

Structured: Databases

Problem: Lack of tools to handle unstructured data


    5

Find Documents matching the Query

Display Information relevant to the Query

Extract Information from within the documents

Actual information buried inside documents

Long lists of documents

Aggregate over entire collection

    6

    Text Mining

Input: Documents

Output: Patterns, Connections, Profiles, Trends

Seeing the Forest for the Trees


    7

Read

Consolidate

Absorb / Act

Understand

Find Material

Let Text Mining Do the Legwork for You


    9

    Document Types

    Structured documents

    Output from CGI

    Semi-structured documents

    Seminar announcements

    Job listings

    Ads

    Free format documents

News

Scientific papers

    10

    Text Representations

    Character Trigrams

    Words Linguistic Phrases

    Non-consecutive phrases

    Frames

    Scripts

    Role annotation

    Parse trees


    11

    The 100,000 foot Picture

Development Environment

    12

Intelligent Auto-Tagging

(c) 2001, Chicago Tribune.

    Visit the Chicago Tribune on the Internet at

    http://www.chicago.tribune.com/

    Distributed by Knight Ridder/Tribune

    Information Services.

    By Stephen J. Hedges and Cam Simpson

    Finsbury Park Mosque


    13

    Tagging a Document

BC-dynegy-enron-offer-update5

Dynegy May Offer at Least $8 Bln to Acquire Enron (Update5)

By George Stein

SOURCE: c.2001 Bloomberg News

    BODY

[Tagged entities highlighted in the article: Dynegy Inc; Roger Hamilton, money manager, John Hancock Advisers Inc.; Enron Corp; Moody's Investors Service; downgraded, Baa2, bonds]

``Dynegy has to act fast,'' said Roger Hamilton, a money manager with John Hancock Advisers Inc., which sold its Enron shares in recent weeks. ``If Enron can't get financing and its bonds go to junk, they lose counterparties and their marvelous business vanishes.''

Moody's Investors Service lowered its rating on Enron's bonds to ``Baa2'' and Standard & Poor's cut the debt to ``BBB'' in the past two weeks.

    14

[Tagged entities and relations highlighted in the abstract: kidney enlargement; Vascular endothelial growth factor (VEGF); diabetic glomerular enlargement; insulin-like growth factor I (IGF-I); increase in glomerular volume and kidney weight]

Various growth factors and cytokines have been implicated in different forms of kidney enlargement. Vascular endothelial growth factor (VEGF) is essential for normal renal development and plays a role in diabetic glomerular enlargement. To explore a possible role for VEGF in compensatory renal changes after uninephrectomy, we examined the effect of a neutralizing VEGF-antibody (VEGF-Ab) on glomerular volume and kidney weight in mice treated for 7 days. Serum and kidney insulin-like growth factor I (IGF-I) levels were measured, since IGF-I has been implicated in the pathogenesis of compensatory renal growth, and VEGF has been suggested to be a downstream mediator of IGF-I. Placebo-treated uninephrectomized mice displayed an early transient increase in kidney IGF-I concentration and an increase in glomerular volume and kidney weight. In VEGF-Ab-treated uninephrectomized animals, increased glomerular volume was abolished, whereas renal hypertrophy was partially blocked. Furthermore, the renal effects of VEGF-Ab administration were seen without affecting the renal IGF-I levels. In conclusion, these results demonstrate that compensatory glomerular growth after uninephrectomy is VEGF dependent.

Intelligent Auto-Tagging

Compensatory glomerular growth after unilateral nephrectomy is VEGF dependent.

    Flyvbjerg A, Schrijvers BF, De Vriese AS, Tilton RG, Rasch R.

Medical Department M, Medical Research Laboratories, Institute of Experimental Clinical Research, Aarhus University, DK-8000 Aarhus, Denmark.

    BIOMEDICAL


    15

    A Tagged News Document

    16

    Original Document

    000019641

    3560177

    November 25, 1991

    News Release

Applied Materials, Inc. today announced its newest source technology, called the Durasource, for the Endura 5500 PVD system. This enhanced source includes new magnet configurations, giving the industry's most advanced sputtered aluminum step coverage in sub-micron contacts, and a new one-piece target that more than doubles target life to approximately 8000 microns of deposition compared to conventional two-piece bonded targets. The Durasource enhancement is fully retrofittable to installed Endura 5500 PVD systems. The Durasource technology has been specially designed for 200mm wafer applications, although it is also available for 125mm and 1s0mm wafer sizes. For example, step coverage symmetry is maintained within 3% between the inner and outer walls of contacts across a 200mm wafer. Film thickness uniformity averages 3% (3 sigma) over the life of the target.


    17

    Extraction Results I

:=
DOC NR: 3560177
DOC DATE: 251192
DOCUMENT SOURCE: News Release
CONTENT:
DATE TEMPLATE COMPLETED: 021292
EXTRACTION TIME: 5
COMMENT: article focuses on nonreportable target source but reportable info available
/TOOL_VERSION: LOCKE.3.4
/FILLRULES_VERSION: EME.4.0
:=
PROCESS:
MANUFACTURER:

    18

    Extraction Results II

    :=

    NAME: Applied Materials, INC

    TYPE: COMPANY :=

    TYPE: SPUTTERING

    FILM: ALUMINUM

    EQUIPMENT:

    :=

    NAME_OR_MODEL: Endura(TM) 5500

    MANUFACTURER:

    EQUIPMENT_TYPE: PVD_SYSTEM

    STATUS: IN_USE

    WAFER_SIZE: (200 MM)(121 MM)

    COMMENT: actually three wafer sizes, third is error 1s0mm


    19

    BioMedical Example

Template: Phosphorylation(Protein, Kinase): "X ... some kinase"

Text: "BES1 is phosphorylated and appears to be destabilized by the glycogen synthase kinase-3 (GSK-3) BIN2"

Extracted: Phosphorylation(Protein: BES1, Kinase: glycogen synthase kinase-3 (GSK-3) BIN2)

    20

    Leveraging Content Investment

Any type of content

Unstructured textual content (current focus)

Structured data, audio, video (future)

From any source

WWW, file systems, news feeds, etc.

Single source or combined sources

In any format

Documents, PDFs, Emails, articles, etc.

Raw or categorized

Formal, informal, combination


    21

    The Tagging Process (a Unified View)

[Architecture diagram: ClearTags Suite (Intelligent Auto-Tagging). Components: Tagging Controller; Semantic tagger, Classification tagger, Structural tagger; Rulebooks, Classifiers, Structural Templates; Fetcher; Rich XML API; Unstructured Content; ClearLab with Semantic Trainer, Statistical Trainer and Structural Trainer.]

    22

    Text Mining Horizontal Applications

Text Mining: Lead Generation, Enterprise Portals (EIP), Competitive Intelligence, Content Uniformity, CRM Analysis, Market Research, New Products Offerings, Events Detections and Notifications, IP Management, Content Management


    23

    Text Mining: Vertical Applications

Insurance, Pharma, Publishing, Financial Services, Global Chemical, Legal, Biotech, Defense

    Information Extraction


    25

    What is Information Extraction?

IE does not indicate which documents need to be read by a user; rather, it extracts pieces of information that are salient to the user's needs.

Links between the extracted information and the original documents are maintained to allow the user to reference context.

The kinds of information that systems extract vary in detail and reliability.

Named entities such as persons and organizations can be extracted with reliability in the 90th percentile range, but do not provide the attributes, facts, or events that those entities have or participate in.

    26

    Relevant IE Definitions

Entity: an object of interest such as a person or organization.

Attribute: a property of an entity such as its name, alias, descriptor, or type.

Fact: a relationship held between two or more entities, such as the Position of a Person in a Company.

Event: an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction.


    27

Example (Entities and Facts)

Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. La Jolla Genomatics will release its product Geninfo in June 1999. Geninfo is a turnkey system to assist biotechnology researchers in keeping up with the voluminous literature in all aspects of their field.

Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo. Oliver's brother, Ambrose, follows more in his father's footsteps and will be the CFO of L.J.G., headquartered in the Maddox family's hometown of La Jolla, CA.

Persons: Fletcher Maddox, Dr. Maddox, Oliver, Oliver, Ambrose, Maddox
Organizations: UCSD Business School, La Jolla Genomatics, La Jolla Genomatics, L.J.G.
Locations: La Jolla, CA
Artifacts: Geninfo, Geninfo
Dates: June 1999

PERSON Employee_of ORGANIZATION:
Fletcher Maddox Employee_of UCSD Business School
Fletcher Maddox Employee_of La Jolla Genomatics
Oliver Employee_of La Jolla Genomatics
Ambrose Employee_of La Jolla Genomatics

ARTIFACT Product_of ORGANIZATION:
Geninfo Product_of La Jolla Genomatics

LOCATION Location_of ORGANIZATION:
La Jolla Location_of La Jolla Genomatics
CA Location_of La Jolla Genomatics

    28

    Events and Attributes

COMPANY-FORMATION_EVENT:
COMPANY: La Jolla Genomatics
PRINCIPALS: Fletcher Maddox, Oliver, Ambrose
DATE:
CAPITAL:

RELEASE-EVENT:
COMPANY: La Jolla Genomatics
PRODUCT: Geninfo
DATE: June 1999
COST:

NAME: Fletcher Maddox / Maddox
DESCRIPTOR: former Dean of the UCSD Business School / his father / the firm's CEO
CATEGORY: PERSON

NAME: Oliver
DESCRIPTOR: His son / Chief Scientist
CATEGORY: PERSON

NAME: Ambrose
DESCRIPTOR: Oliver's brother / the CFO of L.J.G.
CATEGORY: PERSON

NAME: UCSD Business School
DESCRIPTOR:
CATEGORY: ORGANIZATION

NAME: La Jolla Genomatics / L.J.G.
DESCRIPTOR:
CATEGORY: ORGANIZATION

NAME: Geninfo
DESCRIPTOR: its product
CATEGORY: ARTIFACT

NAME: La Jolla
DESCRIPTOR: the Maddox family's hometown
CATEGORY: LOCATION

NAME: CA
DESCRIPTOR:
CATEGORY: LOCATION


    29

    IE Accuracy by Information Type

Information Type | Accuracy
Events | 50-60%
Facts | 60-70%
Attributes | 80%
Entities | 90-98%

    30

Unstructured Text

POLICE ARE INVESTIGATING A ROBBERY THAT OCCURRED AT THE 7-ELEVEN STORE LOCATED AT 2545 LITTLE RIVER TURNPIKE IN THE LINCOLNIA AREA ABOUT 12:30 AM FRIDAY. A 24 YEAR OLD ALEXANDRIA AREA EMPLOYEE WAS APPROACHED BY TWO MEN WHO DEMANDED MONEY. SHE RELINQUISHED AN UNDISCLOSED AMOUNT OF CASH AND THE MEN LEFT. NO ONE WAS INJURED. THEY WERE DESCRIBED AS BLACK, IN THEIR MID TWENTIES, BOTH WERE FIVE FEET NINE INCHES TALL, WITH MEDIUM BUILDS, BLACK HAIR AND CLEAN SHAVEN. THEY WERE BOTH WEARING BLACK PANTS AND BLACK COATS. ANYONE WITH INFORMATION ABOUT THE INCIDENT OR THE SUSPECTS INVOLVED IS ASKED TO CALL POLICE AT (703) 555-5555.


    31

    Structured (Desired) Information

Day | Time | Town | Address | Crime
FRIDAY | 3:00 AM | FAIRFAX | 7-ELEVEN STORE LOCATED AT 5624 OX ROAD | ROBBERY
FRIDAY | 12:45 AM | LINCOLNIA | 7-ELEVEN STORE LOCATED AT 2545 LITTLE RIVER TURNPIKE | ROBBERY
SUNDAY | 11:30 PM | ANNANDALE | 8700 BLOCK OF LITTLE RIVER TURNPIKE | ABDUCTION

    32

    MUC Conferences

Conference | Year | Topic
MUC 7 | 1997 | Space Vehicles and Missile Launches
MUC 6 | 1995 | Management Changes
MUC 5 | 1993 | Joint Venture and Micro Electronics
MUC 4 | 1992 | Terrorist Activity
MUC 3 | 1991 | Terrorist Activity
MUC 2 | 1989 | Naval Operations
MUC 1 | 1987 | Naval Operations


    33

The ACE Evaluation

The ACE program is dedicated to the challenge of extracting content from human language. This is a fundamental capability that the ACE program addresses with a basic research effort that is directed to master first the extraction of entities, then the extraction of relations among these entities, and finally the extraction of events that are causally related sets of relations.

After two years of research on the entity detection and tracking task, top systems have achieved a capability that is useful by itself and that, in the context of the ACE EDT task, successfully captures and outputs well over 50 percent of the value at the entity level.

Here value is defined to be the benefit derived by successfully extracting the entities, where each individual entity provides a value that is a function of the entity type (i.e., person, organization, etc.) and level (i.e., named, unnamed). Thus each entity contributes to the overall value through the incremental value that it provides.

    34

[Bar chart: normalized value/cost (in % relative to maximum value) per entity type (FAC, LOC, GPE, ORG, PER), comparing actual value against maximum value, averaged over the top 4 systems. Miss 36%, False Alarm 22%, Type Error 6%.]

ACE Entity Detection & Tracking Evaluation, 2/2002

Goal: Extract entities. Each entity is assigned a value. This value is a function of its Type and Level. The value is gained when the entity is successfully detected, and lost when an entity is missed, spuriously detected, or mischaracterized.

    Table of Entity Values

Level | FAC | LOC | GPE | ORG | PER
NAM | 0.05 | 0.1 | 0.25 | 0.5 | 1
NOM | 0.01 | 0.02 | 0.05 | 0.1 | 0.2
PRO | 0.002 | 0.004 | 0.01 | 0.02 | 0.04

    Human performance ~80


    35

Applications of Information Extraction

Routing of Information

Infrastructure for IR and for Categorization (higher level features)

Event Based Summarization

Automatic Creation of Databases and Knowledge Bases

    36

    Where would IE be useful?

    Semi-Structured Text

Generic documents like News articles

Most of the information in the document is centered around a set of easily identifiable entities


    37

    Approaches for Building IE

    Systems

Knowledge Engineering Approach

Rules are crafted by linguists in cooperation with domain experts.

Most of the work is done by inspecting a set of relevant documents.

    Can take a lot of time to fine tune the rule set.

    Best results were achieved with KB based IE systems.

    Skilled/gifted developers are needed.

    A strong development environment is a MUST!

    38

    Approaches for Building IE

    Systems

Automatically Trainable Systems

The techniques are based on pure statistics and almost no linguistic knowledge.

They are language independent.

The main input is an annotated corpus.

A relatively small effort is needed to build the rules; however, creating the annotated corpus is extremely laborious.

A huge number of training examples is needed in order to achieve reasonable accuracy.

Hybrid approaches can utilize the user input in the development loop.


    39

    Components of IE System

    Tokenization

Morphological and Lexical Analysis

Syntactic Analysis

    Domain Analysis

    Zoning

    Part of Speech Tagging

Sense Disambiguation

    Deep Parsing

    Shallow Parsing

    Anaphora Resolution

    Integration

    Must

    Advisable

    Nice to have

    Can pass

    40

    The Extraction Engine

[Diagram: the Extraction Engine, accessed through an API, combines Generic Extraction Rules, Generic Lexicons and Thesauri, Customized Extraction Rules, and Domain-Specific Lexicons and Thesauri; it takes a Document Collection as input and produces Entities, Facts and Events.]


    41

Why is IE Difficult?

Different Languages

Morphology is very easy in English, much harder in German and Hebrew.

Identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese and Japanese.

Some languages use orthography (like English) while others (like Hebrew, Arabic, etc.) do not have it.

Different types of style

Scientific papers

Newspapers

Memos

Emails

Speech transcripts

Type of Document

Tables

Graphics

Small messages vs. Books

    42

    Morphological Analysis

    Easy

    English, Japanese

    Listing all inflections of a word is a real possibility

    Medium

French, Spanish

    A simple morphological component adds value.

    Difficult

German, Hebrew, Arabic

A sophisticated morphological component is a must!


    43

    Using Vocabularies

Size doesn't matter: large lists tend to cause more mistakes

    Examples:

    Said as a person name (male)

    Alberta as a name of a person (female)

It might be better to have small domain-specific dictionaries

    44

    Part of Speech Tagging

    POS can help to reduce ambiguity, and to deal

    with ALL CAPS text. However

    It usually fails exactly when you need it

It is domain dependent, so to get the best results you need to retrain it on a relevant corpus.

    It takes a lot of time to prepare a training corpus.


    45

    A simple POS Strategy

Use a tag frequency table to determine the right POS.

This will lead to elimination of rare senses.

The overhead is very small.

It improves accuracy by a small percentage.

Compared to full POS tagging, it provides a similar boost to accuracy.
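A minimal Python sketch of this strategy (the toy corpus, tag names, and function names below are hypothetical, not from the slides): each word is assigned its single most frequent tag from a precomputed frequency table, which drops rare senses at almost no runtime cost.

from collections import Counter, defaultdict

def build_tag_freq(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs from any POS-annotated corpus."""
    freq = defaultdict(Counter)
    for word, tag in tagged_corpus:
        freq[word.lower()][tag] += 1
    return freq

def most_frequent_tag(word, tag_freq, default="NN"):
    """Assign the single most frequent tag seen for this word; rare senses are dropped."""
    counts = tag_freq.get(word.lower())
    return counts.most_common(1)[0][0] if counts else default

# Hypothetical usage:
corpus = [("the", "DT"), ("plant", "NN"), ("plant", "NN"), ("plant", "VB"), ("said", "VBD")]
tag_freq = build_tag_freq(corpus)
print([most_frequent_tag(w, tag_freq) for w in ["the", "plant", "said"]])  # ['DT', 'NN', 'VBD']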

    46

    Dealing with Proper Names

The problem

Impossible to enumerate

New candidates are generated all the time

Hard to provide syntactic rules

Types of proper names

People

Companies

Organizations

Products

Technologies

Locations (cities, states, countries, rivers, mountains)


    47

    Comparing RB Systems with ML

    Based Systems

Corpus | HMM | Rule Based
HUB4 (Transcribed Speech) | 90.6 | 90.3
MUC 7 | 90.4 | 93.7
MUC 6 (Wall Street Journal) | 93 | 96.4

    48

Building a RB Proper Name Extractor

A Lexicon is always a good start.

The rules can be based on the lexicon and on:

The context (preceding/following verbs or nouns)

Regular expressions: Companies: Capital* [,] inc, Capital* corporation; Locations: Capital* Lake, Capital* River

Capitalization

List structure

After the creation of an initial set of rules:

Run on the corpus

Analyze the results

Fix the rules and repeat

This process can take around 2-3 weeks and result in performance of between 85-90% break-even. Better performance can be achieved with more effort (2-3 months), and then performance can get to 95-98%.
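A small Python sketch in the spirit of the regular-expression rules above ("Capital* inc", "Capital* corporation", "Capital* Lake/River"); the exact regexes, names, and example sentence are illustrative assumptions, and a real rulebook would also use the lexicon, context words, and list structure.

import re

# Illustrative regexes following the slide's pattern style; not an actual rulebook.
CAPSEQ = r"(?:[A-Z][a-zA-Z&.-]+\s)+"           # one or more capitalized words
COMPANY = re.compile(CAPSEQ + r"(?:Inc\.?|Corp\.?|Corporation|Ltd\.?)")
LOCATION = re.compile(CAPSEQ + r"(?:Lake|River|Mountains?)")

def extract_candidates(text):
    """Return (type, span-text) pairs proposed by the regular-expression rules."""
    hits = [("COMPANY", m.group().strip()) for m in COMPANY.finditer(text)]
    hits += [("LOCATION", m.group().strip()) for m in LOCATION.finditer(text)]
    return hits

text = "Applied Materials Inc announced a deal near Crater Lake with Dynegy Corp."
print(extract_candidates(text))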


Introduction to HMMs for IE

    50

    Motivation

We can view named entity extraction as a classification problem, where we classify each word as belonging to one of the named entity classes or to the no-name class.

One of the most popular techniques for classifying sequences is the HMM.

An example of using an HMM for another NLP classification task is part-of-speech tagging (Church, 1988; Weischedel et al., 1993).


    51

    What is HMM?

An HMM (Hidden Markov Model) is a finite state automaton with stochastic state transitions and symbol emissions (Rabiner 1989).

The automaton models a probabilistic generative process.

In this process a sequence of symbols is produced by starting in an initial state, transitioning to a new state, emitting a symbol selected by the state, and repeating this transition/emission cycle until a designated final state is reached.

    52

    Disadvantage of HMM

The main disadvantage of using an HMM for information extraction is the need for a large amount of training data, i.e., a carefully tagged corpus.

The corpus needs to be tagged with all the concepts whose definitions we want to learn.


    53

    Notational Conventions

T = length of the sequence of observations (training set)

N = number of states in the model

$q_t$ = the actual state at time $t$

$S = \{S_1, \dots, S_N\}$ (finite set of possible states)

$V = \{O_1, \dots, O_M\}$ (finite set of observation symbols)

$\pi = \{\pi_i\} = \{P(q_1 = S_i)\}$: starting probabilities

$A = \{a_{ij}\} = \{P(q_{t+1} = S_j \mid q_t = S_i)\}$: transition probabilities

$B = \{b_i(O_t)\} = \{P(O_t \mid q_t = S_i)\}$: emission probabilities

    54

    The Classic Problems Related to

    HMMs

Find $P(O \mid \lambda)$: the probability of an observation sequence given the HMM model.

Find the most likely state trajectory given $\lambda$ and $O$.

Adjust $\lambda = (\pi, A, B)$ to maximize $P(O \mid \lambda)$.


    55

Calculating $P(O \mid \lambda)$

The most obvious way to do that would be to enumerate every possible state sequence of length T (the length of the observation sequence). Let $Q = q_1, \dots, q_T$; then by assuming independence between the states we have

$P(O \mid Q, \lambda) = \prod_{i=1}^{T} P(O_i \mid q_i, \lambda) = \prod_{i=1}^{T} b_{q_i}(O_i)$

$P(Q \mid \lambda) = \pi_{q_1} \prod_{i=1}^{T-1} a_{q_i q_{i+1}}$

By using Bayes theorem we have

$P(O, Q \mid \lambda) = P(O \mid Q, \lambda)\, P(Q \mid \lambda)$

Finally, summing over all state sequences, $P(O \mid \lambda) = \sum_{Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda)$.

The main problem with this is that we need to do $2T \cdot N^T$ multiplications, which is certainly not feasible even for a modest T like 10.

    56

    The forward-backward algorithm

In order to solve that we use the forward-backward algorithm, which is far more efficient. The forward part is based on the computation of terms called the alpha terms. We define the alpha values as follows, and compute them inductively in a very efficient way; this calculation requires just $N^2 T$ multiplications:

$\alpha_1(i) = \pi_i\, b_i(O_1)$

$\alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big]\, b_j(O_{t+1})$

$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$


    57

    The backward phase

In a similar manner we can define a backward variable called beta that computes the probability of a partial observation sequence from t+1 to T. The beta variable is also computed inductively, but in a backward fashion:

$\beta_T(i) = 1$

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$
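A compact Python sketch of the forward and backward passes defined above, following the notational conventions slide (pi, A, B); the toy parameters at the bottom are hypothetical.

import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(O_1..O_t, q_t = S_i | model); returns alphas and P(O | model)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # sum_i alpha_t(i) a_ij, then * b_j(O_t+1)
    return alpha, alpha[-1].sum()                      # P(O | model) = sum_i alpha_T(i)

def backward(A, B, obs):
    """beta[t, i] = P(O_t+1..O_T | q_t = S_i, model)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                     # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) # sum_j a_ij b_j(O_t+1) beta_t+1(j)
    return beta

# Hypothetical 2-state, 2-symbol model
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows: states, columns: symbols
obs = [0, 1, 0]
alpha, prob = forward(pi, A, B, obs)
beta = backward(A, B, obs)
print(prob, (alpha * beta).sum(axis=1))  # each per-t sum equals P(O | model)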

    58

Solution for the second problem

Our main goal is to find the optimal state sequence. We will do that by maximizing the probabilities of each state individually.

We will start by defining a set of gamma variables that measure the probability that at time t we are in state $S_i$. The denominator is used just to make gamma a true probability measure:

$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)}$

Now we can find the best state at each time slot in a local fashion:

$q_t = \arg\max_{1 \le i \le N} [\gamma_t(i)]$

The main problem with this approach is that the optimization is done locally, and not on the whole sequence of states. This can lead either to a local maximum or even to an invalid sequence. In order to solve that problem we use a well known dynamic programming algorithm called the Viterbi algorithm.


    59

    The Viterbi Algorithm

    Intuition

Compute the most likely sequence starting with the empty observation sequence; use this result to compute the most likely sequence with an output sequence of length one; recurse until you have the most likely sequence for the entire sequence of observations.

Algorithmic Details

The delta variables compute the highest probability of a partial sequence up to time t that ends in state $S_i$. The psi variables enable us to accumulate the best sequence as we move along the time slices.

1. Initialization:

$\delta_1(i) = \pi_i\, b_i(O_1)$

$\psi_1(i) = 0$

    60

    Viterbi (Cont).

Recursion:

$\delta_t(j) = \max_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big]\, b_j(O_t)$

$\psi_t(j) = \arg\max_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big]$

Termination:

$P^* = \max_{1 \le i \le N} [\delta_T(i)]$

$q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)]$

Reconstruction:

$q_t^* = \psi_{t+1}(q_{t+1}^*)$ for $t = T-1, T-2, \dots, 1$.

The resulting sequence $q_1^*, q_2^*, \dots, q_T^*$ solves Problem 2.


    61

    Viterbi (Example)

[Diagram: a three-state HMM (S1, S2, S3) emitting symbols a and b, with initial probabilities 0.5 and 0.5, transition probabilities drawn from 0.1, 0.4 and 0.5, and emission probabilities 0.5/0.5, 0.2/0.8 and 0.8/0.2 for a/b.]

Observation sequence: a b a

Viterbi trellis (delta values per state and time slice):

state | t1 | t2 | t3
S1 | 0.25 | 0.02 | 0.016
S2 | 0 | 0.04 | 0.008
S3 | 0.1 | 0.08 | 0.016
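A Python sketch of the delta/psi recursion above; since the diagram's exact transition probabilities are not fully recoverable from the transcript, the parameters below are hypothetical rather than the slide's example.

import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most likely state sequence and its probability (delta/psi recursion)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                           # delta_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A                 # scores[i, j] = delta_t-1(i) a_ij
        psi[t] = scores.argmax(axis=0)                     # psi_t(j) = argmax_i ...
        delta[t] = scores.max(axis=0) * B[:, obs[t]]       # delta_t(j) = max_i ... * b_j(O_t)
    # Termination and path reconstruction
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return list(reversed(path)), float(delta[-1].max())

# Hypothetical 3-state model over symbols {a: 0, b: 1}
pi = np.array([0.5, 0.0, 0.5])
A = np.array([[0.1, 0.4, 0.5], [0.4, 0.1, 0.5], [0.5, 0.4, 0.1]])
B = np.array([[0.5, 0.5], [0.2, 0.8], [0.8, 0.2]])
print(viterbi(pi, A, B, [0, 1, 0]))   # -> ([2, 1, 2], ~0.0512) for these hypothetical parameters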

    62

    The Just Research HMM

Each HMM extracts just one field of a given document. If more fields are needed, several HMMs need to be constructed.

The HMM takes the entire document as one observation sequence.

The HMM contains two classes of states, background states and target states. The background states emit words in which we are not interested, while the target states emit words that constitute the information to be extracted.

The state topology is designed by hand and only a few transitions are allowed between the states.


    63

    Possible HMM Topologies

[Diagram of two possible HMM topologies. The first has an initial state, a background state, prefix states, a target state, suffix states, and a final state; the second has only an initial state, a background state, a target state, and a final state.]

    64

    A more General HMM

    Architecture

[Diagram: an initial state, a background state, prefix states, multiple target states, suffix states, and a final state.]


    65

    Experimental Evaluation

Status of Acquisition: 46.7% | Price of Acquisition: 55.3% | Abbreviation of Acquired Company: 40.1% | Acquired Company: 48.1% | Acquiring Company: 30.9%

Speaker: 71.1% | Location: 83.9% | Start Time: 99.1% | End Time: 59.5%

    66

BBN's IdentiFinder

    An ergodic bigram model.

    Each Named Class has a separate region in theHMM.

    The number of states in each NC region is equal to

    |V|. Each word has its own state.

Rather than using plain words, extended words are used. An extended word is a pair <w, f>, where f is a feature of the word w.


    67

BBN's HMM Architecture

[Diagram: Start and End states connected to the name-class regions Company Name, Person Name, and No Name.]

    68

Possible word Features

1. 2 digit number (01)
2. 4 digit number (1996)
3. alphanumeric string (A34-24)
4. digits and dashes (12-16-02)
5. digits and slashes (12/16/02)
6. digits and comma (1,000)
7. digits and period (2.34)
8. any other number (100)
9. All capital letters (CLF)
10. Capital letter and a period (M.)
11. First word of a sentence (The)
12. Initial letter of the word is capitalized (Albert)
13. word in lower case (country)
14. all other words and tokens (;)
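A simplified Python sketch of such a feature function (the feature names and regular expressions are illustrative assumptions; IdentiFinder's actual feature computation is not shown in the slides):

import re

def word_feature(token, first_in_sentence=False):
    """Map a token to one of the 14 feature classes listed above (a simplified sketch)."""
    checks = [
        ("twoDigitNum",            lambda t: re.fullmatch(r"\d{2}", t)),
        ("fourDigitNum",           lambda t: re.fullmatch(r"\d{4}", t)),
        ("containsDigitAndAlpha",  lambda t: re.fullmatch(r"(?=.*\d)(?=.*[A-Za-z])[\w-]+", t)),
        ("containsDigitAndDash",   lambda t: re.fullmatch(r"\d[\d-]*-[\d-]*", t)),
        ("containsDigitAndSlash",  lambda t: re.fullmatch(r"\d[\d/]*/[\d/]*", t)),
        ("containsDigitAndComma",  lambda t: re.fullmatch(r"\d[\d,]*,[\d,]*", t)),
        ("containsDigitAndPeriod", lambda t: re.fullmatch(r"\d[\d.]*\.[\d.]*", t)),
        ("otherNum",               lambda t: re.fullmatch(r"\d+", t)),
        ("allCaps",                lambda t: re.fullmatch(r"[A-Z]+", t)),
        ("capPeriod",              lambda t: re.fullmatch(r"[A-Z]\.", t)),
        ("firstWord",              lambda t: first_in_sentence and t[:1].isalpha()),
        ("initCap",                lambda t: re.fullmatch(r"[A-Z][a-z]+", t)),
        ("lowerCase",              lambda t: re.fullmatch(r"[a-z]+", t)),
    ]
    for name, test in checks:
        if test(token):
            return name
    return "other"          # all other words and tokens, e.g. ";"

for tok in ["01", "1996", "A34-24", "12/16/02", "1,000", "2.34", "CLF", "M.", "Albert", "country", ";"]:
    print(tok, "->", word_feature(tok))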


    69

    Statistical Model

The design of the formal model is done in levels.

At the first level we have the most accurate model, which requires the largest amount of training data.

At the lower levels we have back-off models that are less accurate but also require much smaller amounts of training data.

We always try to use the most accurate model possible given the amount of available training data.

    70

    Computing State Transition

    Probabilities When we want to analyze formally the probability of

    annotating a given word sequence with a set of name

    classes, we need to consider three different statistical

    models:

    A model for generating a name class

    A model to generate the first word in a name class

    A model to generate all other words (but the first word) in

    a name class


    71

Computing the Probabilities: Details

The model to generate a name class depends on the previous name class and on the word that precedes the name class; this is the last word in the previous name class, and we annotate it by $w_{-1}$. So formally this amounts to $P(NC \mid NC_{-1}, w_{-1})$.

The model to generate the first word in a name class depends on the current name class and the previous name class, and hence is $P(\langle w, f \rangle_{first} \mid NC, NC_{-1})$.

The model to generate all other words within the same name class depends on the previous word (within the same name class) and the current name class, so formally it is $P(\langle w, f \rangle \mid \langle w, f \rangle_{-1}, NC)$.

    72

    The Actual Computation

$P(NC \mid NC_{-1}, w_{-1}) = \frac{c(NC, NC_{-1}, w_{-1})}{c(NC_{-1}, w_{-1})}$

$P(\langle w, f \rangle_{first} \mid NC, NC_{-1}) = \frac{c(\langle w, f \rangle_{first}, NC, NC_{-1})}{c(NC, NC_{-1})}$

where $c(\cdot)$ counts the occurrences of the corresponding events in the training data.

Recall/precision on MUC data, by training corpus:

Category | ace+muc7 (r/p) | muc7 (r/p)
percent | 50/29.6 | 93.7/40.54
money | 86.6/82.1 | 97.6/82.1
location | 77.7/91.7 | 90.7/91.3
time | 68.6/92.5 | 76.4/77.6
date | 59/89.5 | 90.9/76.6
organization | 83.1/95.9 | 91.1/93.7
person | 84.9/88.6 | 91.9/85.5


    87

    Results with our new algorithm

[Learning-curve chart: F-measure (%) versus amount of training data in KWords, on a logarithmic scale from 10.7 to 3741 (each tick multiplies the amount by 7/6), for DATE, LOCATION, ORGANIZATION and PERSON, compared against the corresponding Nymble curves (DATE-Nymble, LOC-Nymble, ORG-Nymble); a marker indicates the currently available amount of training data.]

    88

Why is HMM not enough?

The HMM model is flat, so the most it can do is assign a tag to each token in a sentence.

This is suitable for tasks where the tagged sequences do not nest and where there are no explicit relations between the sequences.

Part-of-speech tagging and entity extraction belong to this category.

Extracting relationships is different, because the tagged sequences can (and must) nest, and there are relations between them which must be explicitly recognized.

<ACQUISITION> Ethicon Endo-Surgery, Inc., a Johnson & Johnson company, has acquired <ACQUIRED>Obtech Medical AG</ACQUIRED> </ACQUISITION>


    89

A Hybrid Approach

The hybrid strategy attempts to strike a balance between the two knowledge engineering chores: writing the extraction rules and manually tagging the documents.

In this strategy, the knowledge engineer writes SCFG rules, which are then trained on the data which is available.

The powerful disambiguating ability of the SCFG makes writing rules a much simpler and cleaner task.

The knowledge engineer has control over the generality of the rules (s)he writes, and consequently over the amount and the quality of the manually tagged training data the system would require.

    90

    Defining SCFG

Classical definition: A stochastic context-free grammar (SCFG) is a quintuple G = (T, N, S, R, P), where

T is the alphabet of terminal symbols (tokens)

N is the set of nonterminals

S is the starting nonterminal

R is the set of rules

P : R → [0..1] defines their probabilities.

The rules have the form n → s1 s2 ... sk, where n is a nonterminal and each si is either a token or another nonterminal.

An SCFG is a usual context-free grammar with the addition of the P function.
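A minimal Python sketch of the quintuple as a data structure, together with the constraint (stated on the next slide) that the rule probabilities of each nonterminal sum to one; the toy grammar below is hypothetical.

from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    head: str            # nonterminal n
    body: tuple          # s1 s2 ... sk: tokens and/or nonterminals
    prob: float          # P(r)

def check_scfg(nonterminals, rules, tol=1e-9):
    """For every nonterminal n, the sum of P(r) over rules headed by n must equal 1."""
    totals = defaultdict(float)
    for r in rules:
        totals[r.head] += r.prob
    return {n: abs(totals[n] - 1.0) < tol for n in nonterminals}

# Hypothetical toy grammar: S is the start symbol, "a"/"b" are terminal tokens
rules = [
    Rule("S", ("NP", "VP"), 1.0),
    Rule("NP", ("a",), 0.7),
    Rule("NP", ("a", "NP"), 0.3),
    Rule("VP", ("b",), 1.0),
]
print(check_scfg(["S", "NP", "VP"], rules))   # all True: a well-formed probability function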


    91

    The Probability Function

If r is the rule n → s1 s2 ... sk, then P(r) is the frequency of expanding n using this rule.

In Bayesian terms, if it is known that a given sequence of tokens was generated by expanding n, then P(r) is the a priori likelihood that n was expanded using the rule r.

Thus, it follows that for every nonterminal n, the sum of P(r) over all rules r headed by n must equal one.

    92

    IE with SCFG

A very basic parsing is employed for the bulk of a text, but within the relevant parts the grammar is much more detailed.

The IE grammars can be said to define sublanguages for very specific domains.


    93

IE with SCFG

In the classical definition of SCFG it is assumed that the rules are all independent. In this case it is possible to find the (unconditional) probability of a given parse tree by simply multiplying the probabilities of all rules participating in it.

The usual parsing problem is, given a sequence of tokens (a string) S, to find the most probable parse tree T which could generate S. A simple generalization of the Viterbi algorithm is able to efficiently solve this problem.

In practical applications of SCFGs, it is rarely the case that the rules are truly independent. Then, the easiest way to cope with this problem while leaving most of the formalism intact is to let the probabilities P(r) be conditioned upon the context where the rule is applied.

    94

Markovian SCFG

HMM entity extractors are a simple case of markovian SCFGs.

Every possible rule which can be formed from the available symbols has nonzero probability.

Usually, all probabilities are initially set to be equal, and then adjusted according to the distributions found in the training data.


    95

    Training Issues

For some problems the available training corpora appear to be adequate.

In particular, markovian SCFG parsers trained on the Penn Treebank perform quite well (Collins 1997, Charniak 2000, Roark 2001, etc.).

But for the task of relationship extraction it turns out to be impractical to manually tag the amount of documents that would be sufficient to adequately train a markovian SCFG.

At a certain point it becomes more productive to go back to the original hand-crafted system and write rules for it, even though it is much more skilled labor!

    96

SCFG Syntax

A rulebook consists of declarations and rules. All nonterminals must be declared before usage.

Some of them can be declared as output concepts, which are the entities, events, and facts that the system is designed to extract. Additionally, two classes of terminal symbols also require declaration: termlists and ngrams.

A termlist is a collection of terms from a single semantic category, either written explicitly or loaded from an external source.

An ngram can expand to any single token, but the probability of generating a given token is not fixed in the rules; it is learned from the training dataset and may be conditioned upon one or more previous tokens. Thus, ngrams are one of the ways the probabilities of the SCFG rules can be context-dependent.


    97

Example

output concept Acquisition(Acquirer, Acquired);
ngram AdjunctWord;
nonterminal Adjunct;
Adjunct :- AdjunctWord Adjunct | AdjunctWord;
termlist AcquireTerm = acquired bought (has acquired) (has bought);
Acquisition :- Company->Acquirer [ "," Adjunct "," ] AcquireTerm Company->Acquired;

    98

Emulation of an HMM Entity Extractor in SCFG

output concept Company();
ngram CompanyFirstWord;
ngram CompanyWord;
ngram CompanyLastWord;
nonterminal CompanyNext;
Company :- CompanyFirstWord CompanyNext | CompanyFirstWord;
CompanyNext :- CompanyWord CompanyNext | CompanyLastWord;


    99

Putting it all together

start Text;
nonterminal None;
ngram NoneWord;
None :- NoneWord None | ;
Text :- None Text | Company Text | Acquisition Text | ;

    100

    SCFG Training Currently there are three different classes of

    trainable parameters in a TEG rulebook:

    the probabilities of rules of nonterminals

    the probabilities of different expansions of ngrams

    the probabilities of terms in a wordclass.

    All those probabilities are smoothed maximum

    likelihood estimates, calculated directly from the

    frequencies of the corresponding elements in the

    training dataset.


    101

    Sample Rules

    concept Text;

    concept output Person;

    ngram NGFirstName;

    ngram NGLastName;

    ngram NGNone;

    wordclass WCHonorific = Mr Mrs Miss Ms Dr;

    Person :- WCHonorific NGLastName;

    Person :- NGFirstName NGLastName;

    Text :- NGNone Text;

    Text :- Person Text;

    Text :- ;

    102

    Training the rule book

By default, the initial untrained frequencies of all elements are assumed to be 1. They can be changed using syntax.

Let us train this rulebook on a training set containing one sentence: "Yesterday, Dr Simmons, the distinguished scientist, presented the discovery."

This is done in two steps. First, the sentence is parsed using the untrained rulebook, but with the constraints specified by the annotations. In our case the constraints are satisfied by two different parses:


    103

2 Possible Parse Trees

[Two parse trees shown side by side, identical except for the expansion of the Person nonterminal: in the first, "Dr" is generated by WCHonorific and "Simmons" by NGLastName; in the second, "Dr" is generated by NGFirstName and "Simmons" by NGLastName. The surrounding tokens ("Yesterday", the commas, and the rest of the sentence) are generated by NGNone under Text in both trees.]

    104

    How to pick the right Parse?

The difference is in the expansion of the Person nonterminal. Both Person rules can produce the output instance, therefore there is an ambiguity.

In this case it is resolved in favor of the WCHonorific interpretation, because in the untrained rulebook we have

P(Dr | WCHonorific) = 1/5 (choice of one term among five equiprobable ones),

P(Dr | NGFirstName) = 1/N, where N is the number of all known words.
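Spelling the comparison out (a short derivation under the stated untrained-rulebook assumptions): the two parses share every other rule and emission, so their probabilities differ only in how "Dr" is generated,

$$P(\text{parse}_1) \propto P(\text{Dr} \mid \text{WCHonorific}) = \tfrac{1}{5}, \qquad P(\text{parse}_2) \propto P(\text{Dr} \mid \text{NGFirstName}) = \tfrac{1}{N},$$

and since N (the number of all known words) is far larger than 5, the WCHonorific parse wins.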


    105

    The Trained Rule Book

    concept Text;

    concept output Person;

    ngram NGFirstName;

    ngram NGLastName;

    ngram NGNone;

    wordclass WCHonorific = Mr Mrs Miss Ms Dr;

    Person :- WCHonorific NGLastName;

    Person :- NGFirstName NGLastName;

    Text :- NGNone Text;

    Text :- Person Text;

    Text :- ;

    106

    Ngram Training

Additionally, the training program will generate a separate file containing the statistics for the ngrams. It is similar in nature, but a bit more complex, because the bigram frequencies, token feature frequencies and unknown word frequencies are also saved.

In order to understand the details of ngram training it is necessary to go over the details of their internal working.

An ngram always generates a single token. Any ngram can generate any token, but naturally the probability of generating it depends on the ngram, on the token, and on the immediately preceding context of the token.


    107

    MUC 7

Category | Full SCFG system (F/Prec/Recall) | Emulation using SCFG (F/Prec/Recall) | HMM entity extractor (F/Prec/Recall)
Location | 90.58/94.42/87.05 | 86.91/90.12/83.93 | 86.66/87.2/86.12
Organization | 90.19/90.9/89.49 | 87.7/89.53/85.94 | 88.84/89.75/87.94
Person | 92.24/90.78/93.75 | 86.57/86.83/86.31 | 86.01/85.13/86.91

    108

    DIAL vs. SCFG

Category | Full SCFG system (F/Prec/Recall) | Manual Rules written in DIAL (F/Prec/Recall)
Location | 90.58/94.42/87.05 | 90.49/89.5/91.46
Organization | 90.19/90.9/89.49 | 88.05/93.4/82.74
Person | 92.24/90.78/93.75 | 87.53/93.8/81.32


    109

    ACE 2

Category | Full SCFG system (F/Prec/Recall) | Ergodic SCFG (F/Prec/Recall) | HMM entity extractor (F/Prec/Recall)
GPE | 86.84/84.94/88.83 | 85.84/84.96/86.74 | 84.37/83.22/85.54
Organization | 64.76/71.06/59.49 | 65.06/71.02/60.03 | 58.05/64.735/52.62
Person | 85.56/81.68/89.82 | 84.82/81.65/88.25 | 84.37/83.22/85.54
Role | 80.25/77.3/83.44 | 61.11/76.24/50.99 |

    110

    DIAL vs. TEG (ACE-1)

Category | Full TEG system (F/Prec/Recall) | Manual Rules written in DIAL (F/Prec/Recall)
Facility | 40.74/50.00/31.48 | 47.87/55.0/40.74
Total | 76.25/77.48/75.01 | 80.56/85.73/75.39
GPE | 86.96/83.63/90.03 | 89.39/89.26/89.53
Person | 80.71/80.68/80.75 | 83.59/88.56/78.62
Location | 32.38/44.44/20.33 | 37.44/54.54/20.34
Organization | 59.5/64.50/54.50 | 68.58/78.25/58.92


    113

    Final Comparison

Category | HMM (F1) | G TEG (F1) | DIAL (F1) | Best TEG (F1)
{Sum} | 71.4135 | 72.633 | 80.5605 | 80.695
FAC | 22.9755 | 24.074 | 47.8705 | 38.0825
LOC | 29.3645 | 31.598 | 37.442 | 35.091
GPE | 83.8815 | 85.256 | 89.392 | 88.957
PERSON | 78.0975 | 76.422 | 83.5945 | 84.835
ORG | 49.691 | 54.113 | 68.5895 | 69.4895

    114

    Complexity

    Notation:

    S: number of states in the grammar

    N: number of tokens in the sentence

    C: number of non-terminals in the grammar

Well-behaved HMM: S·N

Nymble-like HMM: S²·N (S = C)

TEG: S·C·N³

    However:

    Most portions can be executed as transducers

    We can use smaller sliding windows

    We can use approximations


    115

Simple ROLE Rules (ACE-2)

ROLE :- [Position_Before] ORGANIZATION->ROLE_2 Position ["in" GPE] [","] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 Position [","] PERSON->ROLE_1;
ROLE :- PERSON->ROLE_1 "of" GPE->ROLE_2;
ROLE :- ORGANIZATION->ROLE_2 "'" "s" [Position] PERSON->ROLE_1;

    116

A little more complicated set of rules

ROLE :- [Position_Before] ORGANIZATION->ROLE_2 Position ["in" GPE] [","] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 Position [","] PERSON->ROLE_1;
ROLE :- PERSON->ROLE_1 "of" GPE->ROLE_2;
ROLE :- ORGANIZATION->ROLE_2 "'" "s" [Position] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 [Position] PERSON->ROLE_1;
ROLE :- GPE->ROLE_2 "'" "s" ORGANIZATION->ROLE_1;
ROLE :- PERSON->ROLE_1 "," Position WCPreposition ORGANIZATION->ROLE_2;


    119

    Easy and hard in Coreference

Mohamed Atta, a suspected leader of the hijackers, had spent time in Belle Glade, Fla., where a crop-dusting business is located. Atta and other Middle Eastern men came to South Florida Crop Care nearly every weekend for two months.

Will Lee, the firm's general manager, said the men asked repeated questions about the crop-dusting business. He said the questions seemed "odd," but he didn't find the men suspicious until after the Sept. 11 attack.

    120

    The Simple Coreference

    Proper Names

    IBM, International Business Machines, Big Blue

    Osama Bin Ladin, Bin Ladin, Usama Bin Laden. (note the

    variations)

    Definite Noun Phrases

    The Giant Computer Manufacturer, The Company, The

    owner of over 600,000 patents

    Pronouns

It, he, she, we...


    121

Coreference Example

Granite Systems provides Service Resource Management (SRM) software for communication service providers with wireless, wireline, optical and packet technology networks. Utilizing Granite's Xng System, carriers can manage inventories of network resources and capacity, define network configurations, order and schedule resources and provide a database of record to other operational support systems (OSSs). An award-winning company, including an Inc. 500 company in 2000 and 2001, Granite Systems enables clients including AT&T Wireless, KPN Belgium, COLT Telecom, ATG and Verizon to eliminate resource redundancy, improve network reliability and speed service deployment. Founded in 1993, the company is headquartered in Manchester, NH with offices in Denver, CO; Miami, FL; London, U.K.; Nice, France; Paris, France; Madrid, Spain; Rome, Italy; Copenhagen, Denmark; and Singapore.

    122

A KE Approach to Coreference

Mark each noun phrase with the following:

Type (company, person, location)

Singular vs. plural

Gender (male, female, neutral)

Syntactic form (name, pronoun, definite/indefinite)

For each candidate:

Find accessible antecedents

Each antecedent has a different scope: a proper name's scope is the whole document; a definite clause's scope is the preceding paragraphs; a pronoun's scope might be just the previous sentence, or the same paragraph.

Filter by consistency check

Order by dynamic syntactic preference
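A Python sketch of the markup and consistency check described above (the attribute names, mention objects, and the specific checks are illustrative assumptions):

from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    type: str        # "person", "company", "location", ...
    number: str      # "singular" or "plural"
    gender: str      # "male", "female", "neutral"
    syntax: str      # "name", "pronoun", "definite", "indefinite"

def consistent(pronoun, antecedent):
    """Consistency check sketched on the slide: type, number and gender must not clash."""
    if pronoun.number != antecedent.number:
        return False
    if pronoun.gender != "neutral" and antecedent.gender != "neutral" \
            and pronoun.gender != antecedent.gender:
        return False
    # "it"/"they" should not resolve to a person, "he"/"she" not to a company, etc.
    if pronoun.gender == "neutral" and antecedent.type == "person":
        return False
    return True

# Hypothetical mentions echoing the next slide's examples
bush = Mention("George Bush", "person", "singular", "male", "name")
she = Mention("she", "person", "singular", "female", "pronoun")
company = Mention("the company", "company", "singular", "neutral", "definite")
it = Mention("it", "company", "singular", "neutral", "pronoun")
print(consistent(she, bush), consistent(it, bush), consistent(it, company))  # False False True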


    123

    Filtering Antecedents

"George Bush" will not match "she" or "it".

"George Bush" cannot be an antecedent of "the company" or "they".

Using an ontology, we can use background information to be smarter.

Example: "The big automaker is planning to get out of the car business. The company feels that it can no longer make a profit making cars."

    124

    AutoSlog (Riloff, 1993)

Creates Extraction Patterns from annotated texts (NPs).

Uses a sentence analyzer (CIRCUS, Lehnert, 1991) to identify clause boundaries and syntactic constituents (subject, verb, direct object, prepositional phrase).

It then uses heuristic templates to generate extraction patterns.


    125

Example Templates

Example | Template
was aimed at | passive-verb prep
killed with | active-verb prep
bomb against | noun prep
bombed | active-verb
was victim | aux noun
was murdered | passive-verb

    126

    AutoSlog-TS (Riloff, 1996)

It took 8 hours to annotate 160 documents, and hence probably a week to annotate 1000 documents.

This bottleneck is a major problem for using IE in new domains.

Hence there is a need for a system that can generate IE patterns from un-annotated documents.

AutoSlog-TS is such a system.


    127

    AutoSlog TS

Stage 1: The Sentence Analyzer parses the corpus (e.g. S: World Trade Center, V: was bombed, PP: by terrorists) and the AutoSlog heuristics generate candidate extraction patterns such as "was killed" and "bombed by".

Stage 2: The Sentence Analyzer applies the extraction patterns ("was killed", "was bombed", "bombed by", "saw") to the corpus and computes relevance statistics:

EP | REL %
was bombed | 87%
bombed by | 84%
was killed | 63%
saw | 49%

    128

    Top 24 Extraction Patterns

exploded on, kidnapped, one of, was murdered, destroyed, was wounded in, occurred on, responsibility for, took place on, was located, occurred, was wounded, claimed, caused, took place, death of, exploded in, was injured, attack on, was kidnapped, was killed, assassination of, murder of, exploded


    129

Evaluation

Data Set: 1500 docs from MUC-4 (772 relevant)

AutoSlog generated 1237 patterns, which were manually filtered to 450 in 5 hours.

AutoSlog-TS generated 32,345 patterns; after discarding singleton patterns, 11,225 were left. The remaining patterns were ranked by

$\mathrm{Rank}(EP) = \frac{rel\_freq}{freq} \cdot \log_2(freq)$

The user reviewed the top 1970 patterns and selected 210 of them in 85 minutes.

AutoSlog achieved better recall, while AutoSlog-TS achieved better precision.
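A small Python sketch of ranking candidate patterns with the formula reconstructed above; the pattern statistics below reuse the REL % figures from the AutoSlog-TS slide as hypothetical counts out of 100.

import math

def rank(rel_freq, freq):
    """Rank(EP) = (rel_freq / freq) * log2(freq), following the reconstructed formula above."""
    if freq == 0 or rel_freq == 0:
        return 0.0
    return (rel_freq / freq) * math.log2(freq)

# Hypothetical pattern statistics: (pattern, occurrences in relevant docs, total occurrences)
stats = [("was bombed", 87, 100), ("bombed by", 84, 100), ("was killed", 63, 100), ("saw", 49, 100)]
for pattern, rel, total in sorted(stats, key=lambda s: -rank(s[1], s[2])):
    print(f"{pattern:12s} {rank(rel, total):.2f}")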

    130

    Learning Dictionaries by Bootstrapping

    (Riloff and Jones, 1999)

    Learn dictionary entries (semantic lexicon) and

    extraction patterns simultaneously.

    Use untagged text as a training source for learning.

    Start with a set of seed lexicon entries and using

    mutual bootstrapping learn extraction patterns and

    more lexicon entries.


    131

    Mutual Bootstrapping Algorithm

Using AutoSlog, generate all possible extraction patterns. Apply the patterns to the corpus and save the results to EPData.

SemLex = Seed Words
Cat_EPList = {}

1. Score all extraction patterns in EPData
2. Best_EP = highest scoring pattern
3. Add Best_EP to Cat_EPList
4. Add Best_EP's extractions to SemLex
5. Goto 1
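A Python sketch of this loop; the data is hypothetical and the scoring function (counting extractions already in the lexicon) is a simplified stand-in for the metric used in the paper.

def mutual_bootstrap(ep_data, seed_words, iterations=10):
    """Sketch of the loop above. ep_data maps each extraction pattern to the set of NPs it extracts."""
    sem_lex = set(seed_words)        # SemLex
    cat_ep_list = []                 # Cat_EPList
    for _ in range(iterations):
        best_ep, best_score = None, -1.0
        for ep, extractions in ep_data.items():
            if ep in cat_ep_list:
                continue
            # Simplified score: how many of the pattern's extractions are already in the lexicon
            score = len(extractions & sem_lex)
            if score > best_score:
                best_ep, best_score = ep, score
        if best_ep is None:
            break
        cat_ep_list.append(best_ep)              # 3. Add Best_EP to Cat_EPList
        sem_lex |= ep_data[best_ep]              # 4. Add Best_EP's extractions to SemLex
    return cat_ep_list, sem_lex

# Hypothetical data for a "location" category
ep_data = {
    "offices in <np>":  {"London", "Paris", "Denver"},
    "operates in <np>": {"Paris", "Singapore"},
    "owned by <np>":    {"AT&T Wireless"},
}
print(mutual_bootstrap(ep_data, seed_words={"London", "Paris"}))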

    132

    Meta Bootstrapping Process

[Diagram of the meta-bootstrapping process: Seed Words initialize the Permanent Semantic Lexicon; candidate extraction patterns and their extractions feed a Temporary Semantic Lexicon and a Category EP List (select best_EP, add best_EP's extractions); after each run the 5 best NPs are added to the Permanent Semantic Lexicon.]


    133

    Sample Extraction Patterns

terrorism location | www company | www location
process in | offices of | expanded into
returned to | has positions | services in
taken in | request information | distributors in
part in | message to | customers in
ministers of | thrive | outlets in
to enter | devoted to | consulting in
parts of | sold to | activities in
presidents of | motivated | seminars in
sought in | positioning | operates in
become in | is distributor | operations in
traveled to | employed | facilities in
living in | owned by | offices in

    134

    Experimental Evaluation

Category | Iter 1 | Iter 10 | Iter 20 | Iter 30 | Iter 40 | Iter 50
Web Company | 5/5 (1) | 25/32 (.78) | 52/65 (.80) | 72/113 (.64) | 86/163 (.53) | 95/206 (.46)
Web Location | 5/5 (1) | 46/50 (.92) | 88/100 (.88) | 129/150 (.86) | 163/200 (.82) | 191/250 (.76)
Web Title | 0/1 (0) | 22/31 (.71) | 63/81 (.78) | 86/131 (.66) | 101/181 (.56) | 107/231 (.46)
Terr. Location | 5/5 (1) | 32/50 (.64) | 66/100 (.66) | 100/150 (.67) | 127/200 (.64) | 158/250 (.63)
Terr. Weapon | 4/4 (1) | 31/44 (.70) | 68/94 (.72) | 85/144 (.59) | 101/194 (.52) | 124/244 (.51)


Semi-Automatic Approach

    Smart Tagging

    136


    137

    138


    139

    140


    141

    142


    Auditing Environments

    Fixing Recall and Precision Problems

    144

    Auditing Events


    145

Auditing a Person-Position-Company Event

    146

    Updating the Taxonomy and the

    Thesaurus with New Entities


    149

Using Hoover's

    150

References I

Anand T. and Kahn G., 1993. Opportunity Explorer: Navigating Large Databases Using Knowledge Discovery Templates. In Proceedings of the 1993 Workshop on Knowledge Discovery in Databases.

Appelt, Douglas E., Jerry R. Hobbs, John Bear, David Israel, and Mabry Tyson, 1993. "FASTUS: A Finite-State Processor for Information Extraction from Real-World Text". Proceedings of IJCAI-93, Chambery, France, August 1993.

Appelt, Douglas E., Jerry R. Hobbs, John Bear, David Israel, Megumi Kameyama, and Mabry Tyson, 1993a. "The SRI MUC-5 JV-FASTUS Information Extraction System". Proceedings, Fifth Message Understanding Conference (MUC-5), Baltimore, Maryland, August 1993.

Bookstein A., Klein S.T. and Raita T., 1995. Clumping Properties of Content-Bearing Words. In Proceedings of SIGIR'95.

Brachman R. J., Selfridge P. G., Terveen L. G., Altman B., Borgida A., Halper F., Kirk T., Lazar A., McGuinness D. L., and Resnick L. A., 1993. Integrated Support for Data Archaeology. International Journal of Intelligent and Cooperative Information Systems 2(2):159-185.

Brill E., 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543-565.

Church K.W. and Hanks P., 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1):22-29.

Cohen W. and Singer Y., 1996. Context Sensitive Learning Methods for Text Categorization. In Proceedings of SIGIR'96.

Cohen W., 1992. "Compiling Prior Knowledge into an Explicit Bias". Working notes of the 1992 AAAI Spring Symposium on Knowledge Assimilation. Stanford, CA, March 1992.

Daille B., Gaussier E. and Lange J.M., 1994. Towards Automatic Extraction of Monolingual and Bilingual Terminology. In Proceedings of the International Conference on Computational Linguistics, COLING'94, pages 515-521.

Dumais, S. T., Platt, J., Heckerman, D. and Sahami, M., 1998. Inductive learning algorithms and representations for text categorization. Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM'98), 148-155.

Dunning T., 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1).


    151

References II

Feldman R. and Dagan I., 1995. KDT: Knowledge Discovery in Texts. In Proceedings of the First International Conference on Knowledge Discovery, KDD-95.

Feldman R. and Hirsh H., 1996. Exploiting Background Information in Knowledge Discovery from Text. Journal of Intelligent Information Systems, 1996.

Feldman R., Aumann Y., Amir A., Klösgen W. and Zilberstien A., 1997. Maximal Association Rules: a New Tool for Mining for Keyword Co-occurrences in Document Collections. In Proceedings of the 3rd International Conference on Knowledge Discovery, KDD-97, Newport Beach, CA.

Feldman R., Rosenfeld B., Stoppi J., Liberzon Y. and Schler J., 2000. A Framework for Specifying Explicit Bias for Revision of Approximate Information Extraction Rules. KDD 2000: 189-199.

Fisher D., Soderland S., McCarthy J., Feng F. and Lehnert W., 1995. "Description of the UMass Systems as Used for MUC-6". In Proceedings of the 6th Message Understanding Conference, November 1995, pp. 127-140.

Frantzi T.K., 1997. Incorporating Context Information for the Extraction of Terms. In Proceedings of ACL-EACL'97.

Frawley W. J., Piatetsky-Shapiro G., and Matheus C. J., 1991. Knowledge Discovery in Databases: an Overview. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 1-27, MIT Press.

Grishman R., 1996. The role of syntax in Information Extraction. In: Advances in Text Processing: Tipster Program Phase II, Morgan Kaufmann.

Hahn, U. and Schnattinger, K., 1997. Deep Knowledge Discovery from Natural Language Texts. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 175-178.

Hayes Ph., 1992. Intelligent High-Volume Processing Using Shallow, Domain-Specific Techniques. Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval. New Jersey, pp. 227-242.

Hayes, P.J. and Weinstein, S.P., 1990. CONSTRUE: A System for Content-Based Indexing of a Database of News Stories. Second Annual Conference on Innovative Applications of Artificial Intelligence.

Hobbs, J., 1986. Resolving Pronoun References. In B. J. Grosz, K. Sparck Jones, & B. L. Webber (Eds.), Readings in Natural Language Processing (pp. 339-352), Los Altos, CA: Morgan Kaufmann Publishers, Inc.

Hobbs, Jerry R., Douglas E. Appelt, John Bear, David Israel, Megumi Kameyama, and Mabry Tyson, 1992. "FASTUS: A System for Extracting Information from Text". Proceedings, Human Language Technology, Princeton, New Jersey, March 1992, pp. 133-137.

Hobbs, Jerry R., Mark Stickel, Douglas Appelt, and Paul Martin, 1993. "Interpretation as Abduction". Artificial Intelligence, Vol. 63, Nos. 1-2, pp. 69-142. Also published as SRI International Artificial Intelligence Center Technical Note 499, December 1990.

    152

    References III

    Hull D., 1996. Stemming algorithms - a case study for detailed evaluation. Journal of the American Society for Information Science, 47(1):70-84.

    Ingria, R. and Stallard, D., 1989. A computational mechanism for pronominal reference. Proceedings, 27th Annual Meeting of the Association for Computational Linguistics, Vancouver.

    Cowie J. and Lehnert W., 1996. Information Extraction. Communications of the Association for Computing Machinery, vol. 39 (1), pp. 80-91.

    Joachims, T., 1998. Text categorization with support vector machines: Learning with many relevant features. Proceedings of the European Conference on Machine Learning (ECML-98).

    Justeson J. S. and Katz S. M., 1995. Technical Terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1):9-27.

    Klösgen W., 1992. Problems for Knowledge Discovery in Databases and Their Treatment in the Statistics Interpreter EXPLORA. International Journal for Intelligent Systems, 7(7):649-673.

    Lappin, S. and McCord, M., 1990. A syntactic filter on pronominal anaphora for slot grammar. Proceedings, 28th Annual Meeting of the Association for Computational Linguistics, University of Pittsburgh, Association for Computational Linguistics.

    Larkey, L. S. and Croft, W. B., 1996. Combining classifiers in text categorization. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, CH, 1996), pp. 289-297.

    Lehnert, Wendy, Claire Cardie, David Fisher, Ellen Riloff, and Robert Williams, 1991. ``Description of the CIRCUS System as Used for MUC-3'', Proceedings, Third Message Understanding Conference (MUC-3), San Diego, California, pp. 223-233.

    Lent B., Agrawal R. and Srikant R., 1997. Discovering Trends in Text Databases. In Proceedings of the 3rd International Conference on Knowledge Discovery, KDD-97, Newport Beach, CA.

    Lewis, D. D., 1995a. Evaluating and optimizing autonomous text classification systems. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, US, 1995), pp. 246-254.

    Lewis, D. D. and Hayes, P. J., 1994. Guest editorial for the special issue on text categorization. ACM Transactions on Information Systems 12, 3, 231.

    Lewis, D. D. and Ringuette, M., 1994. A comparison of two learning algorithms for text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US, 1994), pp. 81-93.

    Lewis, D. D., Schapire, R. E., Callan, J. P., and Papka, R., 1996. Training algorithms for linear text classifiers. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, CH, 1996), pp. 298-306.

    153

    References IV

    Lin D., 1995. University of Manitoba: Description of the PIE System as Used for MUC-6. In Proceedings of the Sixth Conference on Message Understanding (MUC-6), Columbia, Maryland.

    Piatetsky-Shapiro G. and Frawley W. (Eds.), 1991. Knowledge Discovery in Databases, AAAI Press, Menlo Park, CA.

    Rajman M. and Besançon R., 1997. Text Mining: Natural Language Techniques and Text Mining Applications. In Proceedings of the Seventh IFIP 2.6 Working Conference on Database Semantics (DS-7), Chapman & Hall IFIP Proceedings series. Leysin, Switzerland, Oct 7-10, 1997.

    Riloff Ellen and Lehnert Wendy, 1994. Information Extraction as a Basis for High-Precision Text Classification. ACM Transactions on Information Systems (special issue on text categorization); also Umass-TE-24, 1994.

    Sebastiani F., 2002. Machine learning in automated text categorization. ACM Computing Surveys, Vol. 34, number 1, pages 1-47.

    Sheila Tejada, Craig A. Knoblock and Steven Minton, 2001. Learning Object Identification Rules For Information Integration. Information Systems, Vol. 26, No. 8, pp. 607-633.

    Soderland S., Fisher D., Aseltine J., and Lehnert W., 1995. "Issues in Inductive Learning of Domain-Specific Text Extraction Rules". Proceedings of the Workshop on New Approaches to Learning for Natural Language Processing at the Fourteenth International Joint Conference on Artificial Intelligence.

    Srikant R. and Agrawal R., 1995. Mining generalized association rules. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland.

    Sundheim, Beth, ed., 1993. Proceedings, Fifth Message Understanding Conference (MUC-5), Baltimore, Maryland, August 1993. Distributed by Morgan Kaufmann Publishers, Inc., San Mateo, California.

    Tipster Text Program (Phase I), 1993. Proceedings, Advanced Research Projects Agency, September 1993.

    Vapnik, V., 1979. Estimation of Dependencies Based on Data [in Russian], Nauka, Moscow. (English translation: Springer Verlag, 1982.)

    Vapnik, V., 1995. The Nature of Statistical Learning Theory, Springer-Verlag.

    Yang, Y. and Chute, C. G., 1994. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems 12, 3, 252-277.

    Yang, Y. and Liu, X., 1999. A re-examination of text categorization methods. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, US, 1999), pp. 42-49.