TRANSCRIPT
Information Extraction
Sources:
• Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377.
• Hobbs, J. R., & Riloff, E. (2010). Information extraction. Handbook of Natural Language Processing, 2nd edition.
CONTEXT
History
• Genesis = recognition of named entities (organization & people names)
• Online access pushes towards
– personal desktops -> structured databases
– scientific publications -> structured records
– Internet -> structured fact-finding queries
Driving workshops / conferences
– 1987-97: MUC (Message Understanding Conference): filling slots, named entities & coreference (95-)
– 1999-08: ACE (Automatic Content Extraction): "supporting various classification, filtering, and selection applications by extracting and representing language content"
– 2008-now: TAC (Text Analysis Conference)
• Knowledge Base Population (09-11)
• Others: Textual entailment, Summarization, QA (until 2009)
Example: MUC
0. MESSAGE: ID TST1-MUC3-0001
1. MESSAGE: TEMPLATE 1
2. INCIDENT: DATE 02 FEB 90
3. INCIDENT: LOCATION GUATEMALA: SANTO TOMAS (FARM)
4. INCIDENT: TYPE ATTACK
5. INCIDENT: STAGE OF EXECUTION ACCOMPLISHED
6. INCIDENT: INSTRUMENT ID -
7. INCIDENT: INSTRUMENT TYPE -
8. PERP: INCIDENT CATEGORY TERRORIST ACT
9. PERP: INDIVIDUAL ID "GUERRILLA COLUMN" / "GUERRILLAS"
10. PERP: ORGANIZATION ID "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
11. PERP: ORGANIZATION CONFIDENCE REPORTED AS FACT / CLAIMED OR ADMITTED: "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
12. PHYS TGT: ID "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
13. PHYS TGT: TYPE GOVERNMENT OFFICE OR RESIDENCE: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
14. PHYS TGT: NUMBER 1: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
15. PHYS TGT: FOREIGN NATION -
16. PHYS TGT: EFFECT OF INCIDENT -
17. PHYS TGT: TOTAL NUMBER -
18. HUM TGT: NAME "CEREZO"
19. HUM TGT: DESCRIPTION "PRESIDENT": "CEREZO" "CIVILIAN"
20. HUM TGT: TYPE GOVERNMENT OFFICIAL: "CEREZO" CIVILIAN: "CIVILIAN"
21. HUM TGT: NUMBER 1: "CEREZO" 1: "CIVILIAN"
22. HUM TGT: FOREIGN NATION -
23. HUM TGT: EFFECT OF INCIDENT NO INJURY: "CEREZO" DEATH: "CIVILIAN"
24. HUM TGT: TOTAL NUMBER -
Application
• Enterprise Applications
– News Tracking (terrorists, disease)
– Customer care (linking mails to products, etc.)
– Data Cleaning
– Classified Ads
• Personal Information Management (PIM)
• Scientific Applications (e.g. bio-informatics)
• Web Oriented
– Citation databases
– Opinion databases
– Community websites (DBLife, Rexa - UMASS)
– Comparison Shopping
– Ad Placement on Webpages
– Structured Web Searches
IE - Taxonomy
• Types of structures extracted
– Entities, Records, Relationships
– Open/Closed IE
• Sources
– Granularity of extraction
– Heterogeneity: machine generated, (semi)structured, open
• Input resources
– Structured DB
– Labelled Unstructured Text
– Preprocessing (tokenizer, chunker, parser)
Process (I)
• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
Process (I)
• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules evaluated by humans
Process (II)
• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
Process (III)
• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
– Logic: First Order Logic
– Sequence: e.g. HMM
– Classifiers: e.g. MEM (Maximum Entropy Model), CRF
• Decomposition into a series of subproblems
– Complex words, basic phrases, complex phrases, events and merging
Process (IV)
• Annotated documents
• Relevant & non-relevant documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
– Logic: First Order Logic
– Sequence: e.g. HMM
– Classifiers: e.g. MEM, CRF
Process (V)
• Annotated documents
• Relevant & non-relevant documents
• Seeds -> bootstrapping
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
– Logic: First Order Logic
– Sequence: e.g. HMM
– Classifiers: e.g. MEM, CRF
RECOGNIZING ENTITIES / FILLING SLOTS
Rule based systems
• Rules to mark an entity (or more)
– Before the start of the entity
– Tokens of the entity
– After the end of the entity
• Rules to mark the boundaries
• Conflicts between rules
– Larger span
– Merge (if same action)
– Order the rules
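A minimal sketch of such a rule matcher, resolving conflicts by preferring the larger span. The rules, labels, and the use of one regex to stand in for a (before, entity, after) pattern are illustrative assumptions, not from the source:

```python
import re

# Toy rule set: each rule is a regex standing in for a pattern over the
# tokens before, inside, and after the entity, paired with the label it
# assigns. Patterns and labels are illustrative.
RULES = [
    (re.compile(r"\bDr\. [A-Z][a-z]+\b"), "PERSON"),
    (re.compile(r"\bSmith\b"), "PERSON"),
    (re.compile(r"\b[A-Z][a-z]+ Inc\."), "ORG"),
]

def extract(text, rules):
    matches = []
    for pattern, label in rules:
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), label))
    # Conflict resolution: at the same start position prefer the larger
    # span, then drop any match overlapping an already-chosen one.
    matches.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    chosen, last_end = [], 0
    for start, end, label in matches:
        if start >= last_end:
            chosen.append((start, end, label))
            last_end = end
    return chosen
```

On "Dr. Smith works at Acme Inc." the full "Dr. Smith" span wins over the shorter, overlapping "Smith" match.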
Entity Extraction – rule based
Learning rules
• Algorithms are based on
– Coverage [how many cases are covered by the rule]
– Precision
• Two approaches
– Top-down (e.g. FOIL): start with coverage = 100%
– Bottom-up: start with precision = 100%
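Coverage and precision of a candidate rule can be computed as below; the example format (text, label pairs) and the rule-as-predicate encoding are assumptions for the sketch:

```python
# Score a candidate rule against labelled examples.
# coverage  = positive examples matched / all positive examples
# precision = positive examples matched / all examples matched
def score_rule(rule, examples):
    fired = [(text, pos) for text, pos in examples if rule(text)]
    true_hits = sum(1 for _, pos in fired if pos)
    positives = sum(1 for _, pos in examples if pos)
    coverage = true_hits / positives if positives else 0.0
    precision = true_hits / len(fired) if fired else 0.0
    return coverage, precision
```

A top-down learner starts from a maximally general rule (coverage 1.0) and specializes it to raise precision; a bottom-up learner starts from a rule built on a single example (precision 1.0) and generalizes it to raise coverage.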
Rules – AutoSlog
• Rule Learning
– Look at sentences containing targets
– Heuristic: looking for a linguistic pattern
Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks. Proceedings of AAAI-93, 811–816.
Rules – LIEP
Huffman, S. B. (1996). Learning information extraction patterns from examples.
Learn (sets of meta-heuristics) by using syntactic paths that relate two role-filling constituents, e.g. [subject(Bob, named), object(named, CEO)]. Followed by generalization (matching + disjunction).
Statistical models
• How to label
– IOB sequences (Inside, Outside, Beginning)
– Sequences
– Segmentation
Alleged/B guerrilla/I urban/I commandos/I launched/O two/B highpower/I bombs/I against/O a/B car/I dealership/I in/O downtown/O San/B Salvador/I this/B morning/I.
– Grammar based (longer dependencies)
• Many ML models:
– HMM
– ME, CRF
– SVM
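The IOB encoding above can be produced mechanically from gold entity spans; a small sketch, where the span format (start, end-exclusive) over token indices is an assumption:

```python
# Convert gold entity spans into IOB tags over a token sequence:
# B marks the first token of an entity, I the remaining entity tokens,
# O every token outside any entity.
def to_iob(tokens, spans):
    tags = ["O"] * len(tokens)
    for start, end in spans:  # end is exclusive
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return list(zip(tokens, tags))
```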
Statistical models (cont’d)
• Features
– Word
– Orthographic
– Dictionary
– …
• First order
– Position
– Segment
Examples of features
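For instance, a hypothetical feature function for one token position, combining word, orthographic, dictionary, and first-order (previous-position) features; the gazetteer contents and feature names are illustrative:

```python
# Toy gazetteer standing in for a dictionary resource.
GAZETTEER = {"salvador", "guatemala"}

def token_features(tokens, i):
    w = tokens[i]
    return {
        "word": w.lower(),                                   # word feature
        "is_capitalized": w[0].isupper(),                    # orthographic
        "has_digit": any(c.isdigit() for c in w),            # orthographic
        "in_gazetteer": w.lower() in GAZETTEER,              # dictionary
        "prev_word": tokens[i - 1].lower() if i else "<s>",  # first-order context
    }
```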
Statistical models (cont’d)
• Learning:
– Likelihood
– Max-Margin
PREDICTING RELATIONSHIPS
Overall
• Goal: classify (E1, E2, x)
• Features
– Surface tokens (words, entities)
[Entity label of E1 = Person, Entity label of E2 = Location]
– Parse tree (syntactic, dependency graph)
[POS = (noun, verb, noun), flag = "(1, none, 2)", type = "dependency"]
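A sketch of such surface features for a candidate pair (E1, E2) in a sentence x; the mention encoding and feature names are assumptions:

```python
# Surface features for relation classification over two entity mentions.
# Each mention is (start_token, end_token_exclusive, entity_label),
# with e1 occurring before e2 in the sentence.
def relation_features(tokens, e1, e2):
    between = tokens[e1[1]:e2[0]]  # tokens separating the two mentions
    return {
        "e1_label": e1[2],
        "e2_label": e2[2],
        "words_between": " ".join(between).lower(),
        "n_between": len(between),
    }
```

The resulting dictionary can be fed directly to a standard classifier after one-hot vectorization.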
Models
• Standard classifier (e.g. SVM)
• Kernel-based methods
– e.g. measure of common properties between two paths in the dependency tree
– Convolution based kernels
• Rule-based methods
Extracting entities for a set of relationships
• Three steps
– Learn extraction patterns for the seeds
• Find documents where entities appear close to each other
• Filtering
– Generate candidate triplets
• Pattern or keyword-based
– Validation
• # of occurrences
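The three steps can be sketched end to end for a single relation; the toy corpus, the seed pair, the string-context patterns, and the capitalized-word entity heuristic below are all illustrative assumptions:

```python
import re
from collections import Counter

# Crude stand-in for entity recognition: a single capitalized word.
ENTITY = r"[A-Z][a-z]+"

def learn_patterns(corpus, seeds):
    # Step 1: learn patterns from sentences where both seed entities
    # appear, keeping the text between them as the pattern.
    patterns = set()
    for sent in corpus:
        for e1, e2 in seeds:
            if e1 in sent and e2 in sent:
                middle = sent.split(e1, 1)[1].split(e2, 1)[0]
                patterns.add(middle)
    return patterns

def generate_candidates(corpus, patterns):
    # Step 2: match patterns elsewhere to propose new entity pairs.
    cands = Counter()
    for sent in corpus:
        for p in patterns:
            for m in re.finditer(f"({ENTITY}){re.escape(p)}({ENTITY})", sent):
                cands[(m.group(1), m.group(2))] += 1
    return cands

def validate(cands, min_count=2):
    # Step 3: keep only pairs extracted often enough.
    return {pair for pair, n in cands.items() if n >= min_count}
```

Starting from the seed (Paris, France), the loop learns the pattern " is located in " and then validates (Lyon, France), which occurs twice in the toy corpus.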
MANAGEMENT
Summary
• Performance
– Document selection: subset, crawling
– Queries to DB: related entities (top-k retrieval)
• Handling changes
– Detecting when a page has changed
• Integration
– Detecting duplicate entities
– Redundant extractions (open IE)
EVALUATION
Metrics
• Metrics
– Precision-Recall
– F-measure (harmonic mean of precision and recall)
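Over sets of predicted and gold extractions these metrics reduce to a few lines (representing extractions as a set of hashable items, e.g. spans, is an assumption):

```python
# Precision, recall and F-measure over predicted vs. gold extraction
# sets; the F-measure is the harmonic mean of precision and recall.
def prf(predicted, gold):
    tp = len(predicted & gold)                    # true positives
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```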
The 60% barrier