Information Retrieval course ::: Information Management Technologies
Kalliopi Zervanou
Overview
The need for information processing
Structured vs. unstructured data (text)
The challenges of text
Textual information processing technologies
The need for info processing
Large amounts of data in electronic form
Need for large scale & fast info processing
Most information is to be found in text
Types of Data
Structured data
Semi-structured data
Unstructured, free-text data
Structured Data: e.g. Databases
Title: Introduction to Information Retrieval
Author: C.D. Manning, P. Raghavan, H. Schütze
Doc type: Book
Publisher: Cambridge University Press
Pub date: 2008
Id: CM20B
Location: Computer Science section
Keywords: Information Retrieval, Indexing, …
Semi-Structured Data (e.g. XML)
<?xml version="1.0" encoding="utf-8" ?>
<cmsbwsa_iisg_nl>
  <bwsa>
    <path> bios/bymholt.html </path>
    <voornaam> Berend </voornaam>
    <achternaam> Bymholt </achternaam>
    <geboortejaar> 1864 </geboortejaar>
    <geboortedatum>07-09</geboortedatum>
    <sterfjaar>1947</sterfjaar>
    <sterfdatum>05-27</sterfdatum>
    <extrainfo> socialistisch en anarchistisch publicist en auteur van de Geschiedenis der Arbeidersbeweging in Nederland</extrainfo>
    <id>77</id>
  </bwsa>
  ...
</cmsbwsa_iisg_nl>
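Because the tags name each field, a semi-structured record like the one above can be read by a program without any language understanding. A minimal sketch using Python's standard `xml.etree.ElementTree`; the record is an abridged copy of the example (XML declaration and some fields left out for brevity), and `read_records` and `year_span` are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example record (assumption: same element names).
XML = """<cmsbwsa_iisg_nl>
  <bwsa>
    <path> bios/bymholt.html </path>
    <voornaam> Berend </voornaam>
    <achternaam> Bymholt </achternaam>
    <geboortejaar> 1864 </geboortejaar>
    <sterfjaar>1947</sterfjaar>
    <id>77</id>
  </bwsa>
</cmsbwsa_iisg_nl>"""


def read_records(xml_text):
    """Return one dict per <bwsa> element, with whitespace-trimmed values."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall("bwsa"):
        records.append({child.tag: (child.text or "").strip() for child in entry})
    return records


records = read_records(XML)
person = records[0]
# Difference between year of death and year of birth.
year_span = int(person["sterfjaar"]) - int(person["geboortejaar"])
print(person["voornaam"], person["achternaam"], year_span)
```

The free-text `<extrainfo>` field is the part that still needs text processing.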
Free-Text/ Unstructured data
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses.
Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices.
Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider.
Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.
Data Mining
analysis of structured data
detection of unknown, interesting patterns:
  groups of data records (cluster analysis)
  unusual records (anomaly detection)
  data dependencies (association rule mining)
Text Mining / Text Analytics
analysis of text (semi-/unstructured data)
detection of unknown, interesting information:
  group documents (classification/clustering)
  extract information (content descriptors, concepts of interest)
  associate/link information (e.g. concept relations)
  discover previously unknown facts
The challenges of text
Full text understanding beyond current technology
Human understanding based on context
Context: text, but also world knowledge
Text: ambiguity (syntactic, semantic, lexical, pragmatic)
[Figure: overview diagram of the technologies, spanning unstructured data to structured data. Labels: Doc Collection; IR; Relevant Docs; IE; Relevant Info (NE …, EVENT …); Summarisation (or Abstracting); Important Info; (Indexing); Index Terms; ATR; Terminology; Data Bases; Thesauri; Lexicons; Ontologies; Gazetteers; Structured Info; Data Mining; Reasoning, etc…; Derived Info. Legend: Process, Resource.]
IR: Select relevant documents
Query: “query term”
Relevant: Documents containing the “term”
Methods: Indexing or Automatic Term Recognition
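The indexing route can be sketched as an inverted index: each term maps to the set of documents that contain it, and a query returns the documents containing all of its terms. This is a minimal illustration, not a full IR system; the three toy documents and the function names are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Boolean AND retrieval: documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Toy collection (invented for this sketch).
docs = {
    "d1": "Bertelsmann posted a slight decline in operating profit",
    "d2": "Trade union press clippings from the Second World War",
    "d3": "Operating profit of the media group is expected to decline",
}
index = build_index(docs)
print(sorted(search(index, "operating profit")))  # ['d1', 'd3']
```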
Automatic Term Recognition
supervised/unsupervised task
Methods: rule-based, statistics-based, machine learning, hybrid
Objective: detect words or phrases denoting specialised concepts, i.e. terms
ATR: example
C-value     Candidate term
338.13958   trade union [trade union, Trades Union,…]
213.127     ernst papanek [Ernst Papanek]
200.55471   new york [New York]
143.48147   press clipping [Press clippings, press -clippings,…]
139.07053   world war [world war, world wars, World Wars,…]
134.47055   print material [printed materials, Printed material,…]
131.19386   executive committee [executive committee, …]
124.91502   communist party [Communist party,…]
94.48066    second world war [Second World War, …]
91.18482    spanish civil war [Spanish Civil War, …]
90.80228    great britain [Great Britain, Great -Britain]
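How a C-value score is obtained can be sketched as follows. In the commonly cited definition, a candidate of |a| words with frequency f(a) scores log2|a| · f(a) if it is not nested in a longer candidate; otherwise the average frequency of the longer candidates containing it is first subtracted from f(a). The sketch below is a simplified version that uses the raw frequencies of the longer candidates, and the frequencies are invented toy numbers, so it does not reproduce the scores in the table.

```python
import math


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_values(freq):
    """freq maps a candidate term (tuple of words) to its corpus frequency."""
    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates in which `term` is nested.
        nesting = [f_b for other, f_b in freq.items() if contains(other, term)]
        if nesting:
            f = f - sum(nesting) / len(nesting)
        scores[term] = math.log2(len(term)) * f
    return scores


# Toy frequencies (assumption, not taken from the collection in the table).
freq = {
    ("world", "war"): 10,
    ("second", "world", "war"): 4,
    ("first", "world", "war"): 2,
}
scores = c_values(freq)
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:.3f}  {' '.join(term)}")
```

Single-word candidates score 0 under this formula because log2(1) = 0, which is why the measure is applied to multi-word terms.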
Document clustering
unsupervised task
“clusters”: group categories unknown
machine learning and statistics-based approaches
Objective: group documents based on their content / semantic similarities
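One simple statistics-based way to group documents without known categories is single-pass ("leader") clustering over bag-of-words vectors: a document joins the first cluster whose leader is similar enough, otherwise it starts a new cluster. This is only one of many possible methods, chosen here for brevity; the toy documents and the threshold of 0.3 are assumptions.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leader_clustering(texts, threshold=0.3):
    """Return clusters as lists of document indices; index 0 is the leader."""
    vectors = [vectorize(t) for t in texts]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar leader found: start a new cluster
    return clusters


# Toy documents (invented for this sketch).
texts = [
    "trade union executive committee meeting",
    "media group operating profit decline",
    "executive committee of the trade union",
    "operating profit of the media group",
]
clusters = leader_clustering(texts)
print(clusters)  # [[0, 2], [1, 3]]
```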
Document classification
supervised task: we know the classes/categories
use of machine learning, or statistics-based methods
Objective: classify documents based on their content / semantics
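When the categories are known, a classifier can be trained from labelled examples. The sketch below uses multinomial Naive Bayes with add-one smoothing as one example of a statistics-based method; the two categories and the four training sentences are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_docs):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for word in tokens(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for word in tokens(text):
            if word in vocab:  # words never seen in training are ignored
                count = word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set (assumption: two made-up categories).
training = [
    ("operating profit declined slightly", "finance"),
    ("sales were up and earnings rose", "finance"),
    ("the trade union called a strike", "labour"),
    ("executive committee of the communist party", "labour"),
]
model = train(training)
print(classify(model, "profit and sales rose"))  # finance
```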
Summarisation or Abstracting
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses.
Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices.
Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider.
Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.
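A very simple form of summarisation is extractive: keep the sentences whose content words are most frequent in the document. The sketch below scores each sentence by the average document frequency of its non-stopword tokens; it is a crude baseline, and the stopword list, the naive sentence splitter and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Small ad-hoc stopword list (assumption, not a standard resource).
STOPWORDS = {"a", "an", "and", "as", "due", "for", "had", "in", "it", "its",
             "of", "on", "said", "still", "the", "to", "well"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def summarise(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


TEXT = (
    "Media conglomerate Bertelsmann posted a slight decline in nine-month "
    "operating profit due to start-up losses related to new businesses. "
    "Europe's largest media group on Thursday said it still expects its 2011 "
    "operating profit to decline slightly year-on-year. "
    "It had cut its outlook in August due to costs for new projects and "
    "rising energy prices. "
    "Bertelsmann owns publishers Gruner + Jahr and Random House as well as "
    "European TV broadcaster RTL Group and Arvato, an outsourcing service provider."
)
summary = summarise(TEXT, n_sentences=2)
for sentence in summary:
    print(sentence)
```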
Information Extraction
supervised, or unsupervised/generic task
Methods: rule-based, machine learning
Objective: detect specific types of info in documents, e.g. names, events, relations
IE tasks
Named Entity (NE): recognise entities/concepts of interest, e.g. persons, organisations, dates & times
Co-reference (CO): recognise mentions of the same entity
Template Relation (TR) & Scenario Template (ST): recognise relations among concepts, e.g. concept properties & entities involved in facts & events of interest
IE Tasks
Bertelsmann said operating earnings before interest
and tax (EBIT) rose 35 percent to 215 million euros
($272.1 million) compared with 2005, and sales were
up 17.3 percent at 4.5 billion euros.
Europe's largest media group on Thursday said it still
expects its 2011 operating profit to decline slightly
year-on-year.
Entity types marked in the text: ORGANISATION, PERCENT, DATE, AMOUNT
ORGANISATION=“Bertelsmann”, DATE=“2011-11-10”
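A rule-based recogniser for these entity types can be sketched with regular expressions. The patterns below are hand-written for this one example sentence and are assumptions, not a general NE grammar; in particular, the ORGANISATION pattern is a stand-in for a gazetteer lookup, and DATE only catches four-digit years.

```python
import re

# Hand-written patterns (assumption: tuned to the example sentence only).
PATTERNS = {
    "ORGANISATION": re.compile(r"\bBertelsmann\b"),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?\s+percent\b"),
    "AMOUNT": re.compile(r"\$?\d+(?:\.\d+)?\s+(?:million|billion)(?:\s+euros)?"),
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def tag_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


TEXT = ("Bertelsmann said operating earnings before interest and tax (EBIT) "
        "rose 35 percent to 215 million euros ($272.1 million) compared with "
        "2005, and sales were up 17.3 percent at 4.5 billion euros.")
entities = tag_entities(TEXT)
for label, value in entities:
    print(f"{label:13} {value}")
```

Resolving "Europe's largest media group" to Bertelsmann, or "Thursday" to a calendar date, is the co-reference part of the task and is beyond what such patterns can do.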
IE Tasks
Bertelsmann said operating earnings before interest
and tax (EBIT) rose 35 percent to 215 million euros
($272.1 million) compared with 2005, and sales were
up 17.3 percent at 4.5 billion euros.
Europe's largest media group on Thursday said it still
expects its 2011 operating profit to decline slightly
year-on-year.
SALES_of
Event_type: sales
Organisation_type: Company
Organisation_name: Bertelsmann
Sector: media
Sales_mode: increase
Sales_amount: 4.500.000.000
Currency: euros
Period: ??
Date: ??
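Filling such a template can be sketched as pattern matching plus normalisation. The `SalesEvent` class, the regular expression and `fill_template` below are hypothetical names written for this one sentence; slots that need world knowledge (organisation type, sector) are left out, and Period and Date stay empty, as in the template above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SalesEvent:
    organisation_name: str
    sales_mode: str
    sales_amount: int
    currency: str
    event_type: str = "sales"
    period: Optional[str] = None  # unfilled ("??" in the template)
    date: Optional[str] = None    # unfilled ("??" in the template)


SCALE = {"million": 10**6, "billion": 10**9}

# Pattern for one phrasing only (assumption): "sales were up X percent at N billion euros".
SALES_RE = re.compile(
    r"sales were (?P<mode>up|down) [\d.]+ percent at "
    r"(?P<number>[\d.]+) (?P<scale>million|billion) (?P<currency>\w+)"
)


def fill_template(text, organisation):
    """Return a SalesEvent for the first matching phrase, or None."""
    match = SALES_RE.search(text)
    if match is None:
        return None
    amount = round(float(match["number"]) * SCALE[match["scale"]])
    mode = "increase" if match["mode"] == "up" else "decrease"
    return SalesEvent(organisation, mode, amount, match["currency"])


TEXT = ("Bertelsmann said operating earnings before interest and tax (EBIT) "
        "rose 35 percent to 215 million euros ($272.1 million) compared with "
        "2005, and sales were up 17.3 percent at 4.5 billion euros.")
event = fill_template(TEXT, "Bertelsmann")
print(event)
```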
Sentiment analysis/Opinion mining
Polarity classification (positive/negative)
Objectivity/Subjectivity detection
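Polarity classification in its simplest form counts words from a positive and a negative word list. The two tiny lexicons below are invented for this sketch; real systems use much larger resources or trained classifiers, and a text with no lexicon hits is simply reported as neutral here.

```python
import re

# Tiny ad-hoc lexicons (assumption, for illustration only).
POSITIVE = {"rose", "gain", "growth", "improved", "increase"}
NEGATIVE = {"slips", "decline", "losses", "cut", "eased"}


def polarity(text):
    """Return 'positive', 'negative' or 'neutral' from lexicon counts."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("Bertelsmann 9-mth profit slips on start-up losses"))  # negative
print(polarity("Sales rose and earnings improved"))                   # positive
```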
Structured Data: Ontologies
Structure of concepts:
  Entities (concepts, objects)
  Properties (concept properties)
  Relations (links between concepts)
  Domain specific relations, e.g., “has_capital”
Objective: describe domain knowledge and reason about concepts & relations
Einstein's riddle
we have five houses in a row
each house is painted with a different colour
each house has a single inhabitant
each inhabitant is of a different nationality, drinks a different beverage, owns a different pet, and smokes a different brand of cigarettes
Source: http://en.wikipedia.org/wiki/Zebra_puzzle
1. There are five houses.
2. The Englishman lives in the red house.
3. The Spaniard owns the dog.
4. Coffee is drunk in the green house.
5. The Ukrainian drinks tea.
6. The green house is immediately to the right of the ivory house.
7. The Old Gold smoker owns snails.
8. Kools are smoked in the yellow house.
9. Milk is drunk in the middle house.
10. The Norwegian lives in the first house.
11. The man who smokes Chesterfields lives in the house next to the man with the fox.
12. Kools are smoked in a house next to the house where the horse is kept.
13. The Lucky Strike smoker drinks orange juice.
14. The Japanese smokes Parliaments.
15. The Norwegian lives next to the blue house.
Who drinks water?
Who owns a zebra?
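The riddle can be answered mechanically by treating the numbered clues as constraints over house positions. The sketch below is a plain brute-force search over permutations that checks each clue as early as possible; it is one straightforward way to reason over these facts, not the ontology-based reasoning itself. Positions are numbered 0 to 4 from left to right, which is an assumption about how "first" and "right of" are read.

```python
from itertools import permutations


def right_of(a, b):
    """House a is immediately to the right of house b."""
    return a - b == 1


def next_to(a, b):
    return abs(a - b) == 1


def solve():
    """Return (who drinks water, who owns the zebra)."""
    houses = range(5)  # positions 0..4, left to right
    first, middle = 0, 2
    for red, green, ivory, yellow, blue in permutations(houses):
        if not right_of(green, ivory):                      # clue 6
            continue
        for english, spaniard, ukrainian, norwegian, japanese in permutations(houses):
            if english != red:                              # clue 2
                continue
            if norwegian != first:                          # clue 10
                continue
            if not next_to(norwegian, blue):                # clue 15
                continue
            for coffee, tea, milk, orange_juice, water in permutations(houses):
                if coffee != green:                         # clue 4
                    continue
                if ukrainian != tea:                        # clue 5
                    continue
                if milk != middle:                          # clue 9
                    continue
                for old_gold, kools, chesterfield, lucky_strike, parliament in permutations(houses):
                    if kools != yellow:                     # clue 8
                        continue
                    if lucky_strike != orange_juice:        # clue 13
                        continue
                    if japanese != parliament:              # clue 14
                        continue
                    for dog, snails, fox, horse, zebra in permutations(houses):
                        if spaniard != dog:                 # clue 3
                            continue
                        if old_gold != snails:              # clue 7
                            continue
                        if not next_to(chesterfield, fox):  # clue 11
                            continue
                        if not next_to(kools, horse):       # clue 12
                            continue
                        nationality = {
                            english: "Englishman", spaniard: "Spaniard",
                            ukrainian: "Ukrainian", norwegian: "Norwegian",
                            japanese: "Japanese",
                        }
                        return nationality[water], nationality[zebra]
    return None


water_drinker, zebra_owner = solve()
print(water_drinker, "drinks water;", zebra_owner, "owns the zebra")
```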
Ontology: hierarchical structure
Thing/Root
  House: House-1, House-2, House-3, House-4, House-5
  Inhabitant: Englishman, Spaniard, Japanese, Norwegian, Ukrainian
  Colour: Red, Green, Blue, Ivory, Yellow
  Pet: Dog, Horse, Snails, Fox, Zebra
  Beverage
Ontology
“is-a” or taxonomic relationships
Denote the “kind” of a concept
But ontologies: more than taxonomic relationships!
Thing/Root
  House: House-1, House-2, ...
  Inhabitant: Englishman, Spaniard, ...
  Colour: Red, Green, ...
  Pet: Dog, Horse, ...
  Brand
  Beverage
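The taxonomic part of the ontology can be sketched as a child-to-parent map, with "is-a" answered by walking up the hierarchy. Only the concepts named on the slide are included, and treating the individual houses, inhabitants, colours and pets as children of their class is a simplification made for this sketch.

```python
# Child -> parent links of the "is-a" hierarchy (partial, as on the slide).
PARENT = {
    "House": "Thing/Root", "Inhabitant": "Thing/Root", "Colour": "Thing/Root",
    "Pet": "Thing/Root", "Brand": "Thing/Root", "Beverage": "Thing/Root",
    "House-1": "House", "House-2": "House",
    "Englishman": "Inhabitant", "Spaniard": "Inhabitant",
    "Red": "Colour", "Green": "Colour",
    "Dog": "Pet", "Horse": "Pet",
}


def ancestors(concept):
    """All concepts above `concept`, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def is_a(concept, kind):
    """True if `kind` is reachable from `concept` via is-a links."""
    return kind in ancestors(concept)


print(ancestors("Dog"))             # ['Pet', 'Thing/Root']
print(is_a("Spaniard", "Inhabitant"))  # True
print(is_a("Dog", "Colour"))           # False
```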
Ontology: properties
Thing/Root
  House (e.g. House-1)
    Has_colour: [Colour] (Colour > Is_ColourOf: [House])
    Has_inhabitant: [Inhabitant] (Inhabitant > LivesIn: [House])
    Is_rightTo: [House]
  Inhabitant
  Colour
  Pet
  Brand
  Beverage
Ontology: properties
Thing/Root
  House
  Inhabitant (e.g. Spaniard)
    LivesIn: [House] (House > Has_inhabitant: [Inhabitant])
    Has_pet: [Pet] (Pet > Has_owner: [Inhabitant])
    Drinks: [Beverage] (Beverage > Drunk_by: [Inhabitant])
    Uses_brand: [Brand] (Brand > Used_by: [Inhabitant])
  Colour
  Pet
  Brand
  Beverage
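The property notation on these slides pairs each property with a counterpart on the other class, e.g. Has_pet on Inhabitant and Has_owner on Pet. Reading the two as inverse properties is an assumption; under that reading, asserting a fact in one direction lets the other direction be derived automatically. The `Ontology` class below is a hypothetical minimal triple store written for this sketch, and taking House-1 to be the first house is also an assumption.

```python
# Property -> inverse property, taken from the pairs on the slides.
INVERSE = {
    "Has_colour": "Is_ColourOf",
    "Has_inhabitant": "LivesIn",
    "Has_pet": "Has_owner",
    "Drinks": "Drunk_by",
    "Uses_brand": "Used_by",
}
# Make the mapping symmetric, so LivesIn also maps back to Has_inhabitant.
INVERSE.update({inverse: prop for prop, inverse in INVERSE.items()})


class Ontology:
    """Minimal store of (subject, property, object) facts."""

    def __init__(self):
        self.facts = set()

    def assert_fact(self, subject, prop, obj):
        """Store a fact and, if the property has an inverse, the inverse fact."""
        self.facts.add((subject, prop, obj))
        inverse = INVERSE.get(prop)
        if inverse is not None:
            self.facts.add((obj, inverse, subject))

    def values(self, subject, prop):
        """All objects o such that (subject, prop, o) is known."""
        return {o for s, p, o in self.facts if s == subject and p == prop}


onto = Ontology()
onto.assert_fact("Spaniard", "Has_pet", "Dog")              # clue 3
onto.assert_fact("House-1", "Has_inhabitant", "Norwegian")  # clue 10
print(onto.values("Dog", "Has_owner"))     # {'Spaniard'}
print(onto.values("Norwegian", "LivesIn"))  # {'House-1'}
```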