Information Extraction
CS 652 Information Extraction and Integration
Post on 19-Dec-2015
Information Extraction (IE)
  Task
  Information Retrieval (IR) and IE
  History of IE
  Evaluation Metrics
  Approaches to IE
  Free, Structured, and Semistructured Text
  Web Documents
  IE Systems
  Discussion
IR and IE
  IR
    Retrieves relevant documents from collections
    Information theory, probabilistic theory, and statistics
  IE
    Extracts relevant information from documents
    Computational linguistics and natural language processing
History of IE
  Large amount of both online and offline textual data
  Message Understanding Conference (MUC)
    Quantitative evaluation of IE systems
    Tasks
      Latin American terrorism
      Joint ventures
      Microelectronics
      Company management changes
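Quantitative evaluation of IE systems is commonly reported as precision, recall, and F-measure over the extracted slot fills. The following is a minimal sketch under that assumption, not the MUC scoring software: the function name `precision_recall_f1` and the representation of a fill as a `(slot, value)` pair are hypothetical, and only exact matches count as correct.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted slot fills against a gold-standard answer key.

    Both arguments are iterables of hashable fills, here (slot, value) pairs.
    Only exact matches count as correct (an assumption of this sketch).
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    # Precision: fraction of extracted fills that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: fraction of gold fills that were extracted.
    recall = correct / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [("perpetrator", "Carlos"), ("target", "Parliament building"),
        ("location", "unknown city"), ("instrument", "bomb")]
extracted = [("perpetrator", "Carlos"), ("target", "Embassy")]
p, r, f1 = precision_recall_f1(extracted, gold)
```

In this made-up example one of the two extracted fills is correct, and one of the four gold fills is found, so precision is higher than recall.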
Approaches to IE
  Knowledge Engineering Approach
    Grammars are constructed by hand
    Domain patterns are discovered by human experts through introspection and inspection of a corpus
    Much laborious tuning and “hill climbing”
  Automatic Training Approach
    Use statistical methods when possible
    Learn rules from annotated corpora
    Learn rules from interaction with user
Knowledge Engineering
  Advantages
    With skills and experience, well-performing systems are not conceptually hard to develop.
    The best-performing systems have been hand-crafted.
  Disadvantages
    Very laborious development process
    Some changes to specifications can be hard to accommodate
    Required expertise may not be available
Automatic Training
  Advantages
    Domain portability is relatively straightforward
    System expertise is not required for customization
    “Data driven” rule acquisition ensures full coverage of examples
  Disadvantages
    Training data may not exist, and may be very expensive to acquire
    Large volume of training data may be required
    Changes to specifications may require reannotation of large quantities of training data
Texts
  Free Text
    Natural language processing
  Structured Text
    Textual information in a database or file following a predefined and strict format
  Semistructured Text
    Ungrammatical
    Telegraphic
Web Documents
Web Document Categorization [Hsu, 1998]
  Structured
    Itemised information
    Uniform syntactic clues (e.g., delimiters, attribute orders, …)
  Semistructured (e.g., missing attributes, multi-value attributes, …)
  Unstructured (e.g., linguistic knowledge is required, …)
HASTEN [1995]
  The Parliament building was bombed by Carlos.
  Egraphs (SemanticLabel, StructuralElement)
WHISK [1999]
  The Parliament building was bombed by Carlos.
  WHISK Rule: *(PhyObj)*@passive *F ‘bombed’ * {PP ‘by’ *F (Person)}
  Context-based patterns
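The WHISK rule above reads roughly like a regular expression with two extraction slots, so its effect on the example sentence can be approximated with Python's `re` module. This is only a sketch, not WHISK's rule engine: the pattern `BOMBING_RULE`, the function `extract_bombing`, and the slot names `target` and `perpetrator` are assumptions for illustration, and the semantic-class checks (PhyObj, Person) and syntactic tags (@passive, PP) are replaced by plain string matching.

```python
import re

# Hypothetical stand-in for the rule: (PhyObj) ... 'bombed' ... 'by' (Person).
# Semantic classes are NOT verified; any text in the slot position is accepted.
BOMBING_RULE = re.compile(
    r"^(?P<target>.+?)\s+was\s+bombed\s+by\s+(?P<perpetrator>[^.]+)\.?$"
)


def extract_bombing(sentence):
    """Return both slots from one passive 'bombed by' sentence, or None."""
    match = BOMBING_RULE.match(sentence.strip())
    if match is None:
        return None
    return {
        "target": match.group("target"),
        "perpetrator": match.group("perpetrator"),
    }


result = extract_bombing("The Parliament building was bombed by Carlos.")
```

Both slots are filled by a single pattern, which is what makes this a multi-slot rule. A sentence in the active voice, such as "Carlos bombed the Parliament building.", does not match this pattern and would need a separate rule.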
Comparison
  Criteria: Extraction granularity, Semantic Class Constraint, Single_Slot Rule, Multi_Slot Rule, Syntactic Constraints
  Systems: AutoSlog, Liep, Palka, Hasten, Crystal, WHISK
  [Table cell values not recoverable]
Web Documents
  Semistructured and Unstructured
    RAPIER (E. Califf, 1997)
    SRV (D. Freitag, 1998)
    WHISK (S. Soderland, 1998)
  Semistructured and Structured
    WIEN (N. Kushmerick, 1997)
    SoftMealy (C-H. Hsu, 1998)
    STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)
Inductive Learning
  Task
    Inductive Inference
  Learning Systems
    Zero-order
    First-order, e.g., Inductive Logic Programming (ILP)
RAPIER [1997]
  Inductive Logic Programming
  Extraction Rules
    Syntactic information
    Semantic information
  Advantage
    Efficient learning (bottom-up)
  Drawback
    Single-slot extraction
SRV [1998]
  Relational Algorithm (top-down)
  Features
    Simple features (e.g., length, character type, …)
    Relational features (e.g., next-token, …)
  Advantages
    Expressive rule representation
  Drawbacks
    Single-slot rule generation
    Large volume of training data
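The distinction between simple and relational features can be illustrated with a few lines of Python. This is a hedged sketch, not SRV's feature set or API: the functions `simple_features` and `next_token` and the sample token list are hypothetical. A simple feature maps a token to a value, while a relational feature maps a token position to another token, whose simple features can then be tested in turn.

```python
def simple_features(token):
    """Features computed from a single token in isolation."""
    return {
        "length": len(token),
        "capitalized": token[:1].isupper(),
        "numeric": token.isdigit(),
    }


def next_token(tokens, i):
    """Relational feature: map position i to the token that follows it."""
    return tokens[i + 1] if i + 1 < len(tokens) else None


tokens = "Seminar at 3 pm in Wean Hall".split()

# Composing the two kinds: "is the token after 'in' capitalized?"
after_in = next_token(tokens, tokens.index("in"))
after_in_is_capitalized = simple_features(after_in)["capitalized"]
```

Chaining relational features with simple ones in this way is what lets a rule refer to the context around a candidate fill rather than only to the fill itself.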
WHISK [1998]
  Covering Algorithm (top-down)
  Advantages
    Learns multi-slot extraction rules
    Handles various orders of items to be extracted
    Handles document types from free text to structured text
  Drawbacks
    Must see all the permutations of items
    Less expressive feature set
    Needs a large volume of training data
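The general idea of a covering algorithm can be sketched as a loop that repeatedly picks a rule, removes the training examples it covers, and continues until nothing is left. The code below is a generic greedy covering loop over a fixed pool of candidate rules, written under that assumption; it is not WHISK's procedure, which grows each rule top-down rather than choosing from a pool. The name `sequential_covering` and the toy rules and examples are hypothetical.

```python
def sequential_covering(candidates, positives, negatives):
    """Greedy covering: select rules until all positives are covered.

    candidates: dict mapping a rule name to a predicate over examples.
    Returns (names of selected rules, positives left uncovered).
    """
    remaining = set(positives)
    learned = []
    while remaining:
        best_name, best_covered = None, set()
        for name, rule in candidates.items():
            # Discard any rule that fires on a negative example.
            if any(rule(example) for example in negatives):
                continue
            covered = {example for example in remaining if rule(example)}
            if len(covered) > len(best_covered):
                best_name, best_covered = name, covered
        if not best_covered:
            break  # no acceptable rule covers what is left
        learned.append(best_name)
        remaining -= best_covered
    return learned, remaining


candidates = {
    "has_by": lambda s: " by " in s,
    "has_bombed": lambda s: "bombed" in s,
    "has_attacked": lambda s: "attacked" in s,
}
positives = ["X was bombed by Y", "A was attacked by B"]
negatives = ["a book written by Z"]
learned, uncovered = sequential_covering(candidates, positives, negatives)
```

Because each selected rule only has to cover part of the training set, the learner needs an example of every variation it is expected to handle, which is consistent with the drawbacks listed above.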
Wrapper Induction
  Wrapper: an IE application for one particular information source
  Delimiter-based Rules
  No linguistic constraints
WIEN [1997]
  Assumes
    Items are always in fixed, known order
  Introduces several types of wrappers
  Advantages
    Fast to learn and extract
  Drawbacks
    Cannot handle permutations and missing items
    Must label entire pages
    Does not use semantic classes
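A delimiter-based wrapper for items in a fixed, known order can be sketched as a loop over (left, right) delimiter pairs, one pair per item. The code below is an illustration under that assumption, in the spirit of a left-right delimiter wrapper; it is not WIEN's implementation, and the function `lr_extract`, the sample page, and the delimiters are hypothetical. In wrapper induction the delimiters would be learned from labeled pages; here they are given by hand.

```python
def lr_extract(page, delimiters):
    """Extract rows of items using one (left, right) delimiter pair per item.

    Items are assumed to appear in the same fixed order in every row.
    """
    rows = []
    if not delimiters:
        return rows
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # no further item: stop
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows  # unterminated item: stop
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(row)


page = "<b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
rows = lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")])
```

Extraction is a single left-to-right scan, which is why such wrappers are fast. The same scan also shows the limitation noted above: if a row lacks its second item, the search for its left delimiter runs on into the next row and the rows become misaligned.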
SoftMealy [1998]
  Learns a transducer
  Advantages
    Learns order of items
    Allows item permutations and missing items
    Allows both the use of semantic classes and disjunctions
  Drawbacks
    Must see all possible permutations
    Cannot use delimiters that do not immediately precede and follow the relevant items
STALKER [1998, 1999, 2001]
  Hierarchical Information Extraction
  Embedded Catalog Tree (ECT) Formalism
  Advantages
    Extracts nested data
    Allows item permutations and missing items
    Need not see all of the permutations
    One hard-to-extract item does not affect others
  Drawbacks
    Does not exploit item order
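Hierarchical extraction can be illustrated by treating a page as a small tree: a list of records, each with its own fields, one of which is itself a list. In the sketch below every field is located independently by its own landmarks, so the order of fields within a record does not matter and a missing field leaves the others unaffected. This is an illustration of the idea only, not STALKER's rule language or learning algorithm; the page layout, the landmarks, and the names `between` and `extract_restaurants` are hypothetical.

```python
import re


def between(text, start_landmark, end_landmark):
    """Return the text between two landmarks, or None if either is absent."""
    i = text.find(start_landmark)
    if i == -1:
        return None
    i += len(start_landmark)
    j = text.find(end_landmark, i)
    return text[i:j] if j != -1 else None


def extract_restaurants(page):
    """Page -> list of records -> per-record fields, with a nested address list."""
    records = []
    for chunk in page.split("<hr>"):          # level 1: split into records
        name = between(chunk, "<b>", "</b>")  # level 2: each field on its own
        if name is None:
            continue
        phone = between(chunk, "Phone:", "<")
        address_block = between(chunk, "<ul>", "</ul>") or ""
        addresses = re.findall(r"<li>(.*?)</li>", address_block)  # level 3
        records.append({
            "name": name,
            "phone": phone.strip() if phone else None,
            "addresses": addresses,
        })
    return records


page = (
    "<b>Kim's</b> Phone: 555-1234<br>"
    "<ul><li>12 Pico St.</li><li>9 Main St.</li></ul><hr>"
    "<ul><li>4 Oak Ave.</li></ul><b>Joe's</b><hr>"
)
records = extract_restaurants(page)
```

The second record lists its addresses before its name and has no phone number, yet both records are extracted. The trade-off is the one named above: because each field is found on its own, information about the usual item order goes unused.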
Applications
  Product Descriptions (ShopBot)
  Restaurant Guides (STALKER)
  Seminar Announcements (SRV)
  Job Advertisements (RAPIER)
  Executive Succession (WHISK)