![Page 1: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/1.jpg)
Information Retrieval course ::: Information Management Technologies
Kalliopi Zervanou
![Page 2: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/2.jpg)
Overview
The need for information processing
Structured vs. unstructured data (text)
The challenges of text
Textual information processing technologies
![Page 3: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/3.jpg)
The need for info processing
Large amounts of data in electronic form
Need for large scale & fast info processing
Most information to be found in text
![Page 4: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/4.jpg)
Types of Data
Structured data
Semi-structured data
Unstructured, free-text data
![Page 5: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/5.jpg)
Structured Data: e.g. Databases
Title: Introduction to Information Retrieval
Author: C.D.Manning, P.Raghavan, H.Schütze
Doc type: Book
Publisher: Cambridge University Press
Pub date: 2008
Id: CM20B
Location: Computer Science section
Keywords: Information Retrieval, Indexing, …
![Page 6: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/6.jpg)
Semi-Structured Data (e.g. XML)
<?xml version="1.0" encoding="utf-8" ?>
<cmsbwsa_iisg_nl>
  <bwsa>
    <path> bios/bymholt.html </path>
    <voornaam> Berend </voornaam>
    <achternaam> Bymholt </achternaam>
    <geboortejaar> 1864 </geboortejaar>
    <geboortedatum>07-09</geboortedatum>
    <sterfjaar>1947</sterfjaar>
    <sterfdatum>05-27</sterfdatum>
    <extrainfo> socialistisch en anarchistisch publicist en auteur van de Geschiedenis der Arbeidersbeweging in Nederland</extrainfo>
    <id>77</id>
  </bwsa>
  ...
</cmsbwsa_iisg_nl>
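A minimal sketch, assuming Python's standard `xml.etree.ElementTree`, of how the tagged fields of such a semi-structured record can be read. The sample string is a shortened copy of the record above (the elided `...` and some fields are left out), and the function name `parse_records` is ours, for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the record on the slide; some values keep surrounding spaces.
SAMPLE = """<cmsbwsa_iisg_nl>
  <bwsa>
    <voornaam> Berend </voornaam>
    <achternaam> Bymholt </achternaam>
    <geboortejaar> 1864 </geboortejaar>
    <sterfjaar>1947</sterfjaar>
    <id>77</id>
  </bwsa>
</cmsbwsa_iisg_nl>"""


def parse_records(xml_text):
    """Return one dict per <bwsa> element, mapping tag name to stripped text."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall("bwsa"):
        records.append({child.tag: (child.text or "").strip() for child in entry})
    return records


records = parse_records(SAMPLE)
print(records[0]["achternaam"], records[0]["geboortejaar"])
```

The tags give each value a name, but the values themselves (for example the free-text `extrainfo` field of the full record) remain untyped strings, which is what makes the data only semi-structured.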
![Page 7: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/7.jpg)
Free-Text/ Unstructured data
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses.
Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices.
Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider.
Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.
![Page 8: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/8.jpg)
Data Mining
analysis of structured data
detection of unknown interesting patterns:
groups of data records (cluster analysis)
unusual records (anomaly detection)
data dependencies (association rule mining)
![Page 9: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/9.jpg)
Text Mining / Text Analytics
analysis of text (semi-/unstructured data)
detection of unknown, interesting information:
group documents (classification/clustering)
extract information (content descriptors, concepts of interest)
associate/link information (e.g. concept relations)
discover previously unknown facts
![Page 10: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/10.jpg)
The challenges of text
Full text understanding beyond current technology
Human understanding based on context
Context: text, but also world knowledge
Text: ambiguity (syntactic, semantic, lexical, pragmatic)
![Page 11: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/11.jpg)
[Figure: overview diagram of information management technologies, spanning unstructured data and structured data. Legend: Process / Resource. Labels: Doc Collection; IR; Relevant Docs; IE; Relevant Info (NE, EVENT, …); Summarisation (or Abstracting); Important Info; (Indexing); Index Terms; ATR; Terminology; Structured Info; Data Bases; Thesauri, Lexicons, Ontologies, Gazetteers; Data Mining, Reasoning, etc.; Derived Info.]
![Page 12: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/12.jpg)
IR: Select relevant documents
Query: “query term”
Relevant: Documents containing the “term”
Methods: Indexing or Automatic Term Recognition
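To make "documents containing the term" concrete, here is a minimal sketch of an inverted index with conjunctive (all terms must match) lookup. The three documents are invented for illustration, whitespace tokenisation is a simplifying assumption, and `build_index` and `search` are hypothetical names, not part of the course material.

```python
from collections import defaultdict

# Toy collection; the texts are invented for illustration.
DOCS = {
    1: "trade union press clippings",
    2: "second world war press material",
    3: "executive committee of the trade union",
}


def build_index(docs):
    """Map each lower-cased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(DOCS)
print(sorted(search(index, "trade union")))
```

The index is built once over the collection, so each query only touches the posting sets of its own terms rather than scanning every document.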
![Page 13: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/13.jpg)
Automatic Term Recognition
supervised/unsupervised task
Methods: rule based, statistics-based, machine learning, hybrid
Objective: detect words or phrases denoting specialised concepts, i.e. terms
![Page 14: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/14.jpg)
ATR: example
C-value      Candidate term
338.13958    trade union [trade union, Trades Union,…]
213.127      ernst papanek [Ernst Papanek]
200.55471    new york [New York]
143.48147    press clipping [Press clippings, press -clippings,…]
139.07053    world war [world war, world wars, World Wars,…]
134.47055    print material [printed materials, Printed material,…]
131.19386    executive committee [executive committee, …]
124.91502    communist party [Communist party,…]
94.48066     second world war [Second World War, …]
91.18482     spanish civil war [Spanish Civil War, …]
90.80228     great britain [Great Britain, Great -Britain]
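The slide lists C-value scores without stating how they are computed. As an assumption, the sketch below follows the commonly cited definition of C-value: a candidate term's frequency, reduced by the average frequency of the longer candidates that contain it, weighted by log2 of its length in words. The frequencies are invented and do not reproduce the scores above; `is_nested` and `c_value` are hypothetical helper names.

```python
import math

# Invented candidate frequencies, for illustration only.
FREQ = {
    ("world", "war"): 10,
    ("second", "world", "war"): 6,
}


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of the longer `long`."""
    n = len(short)
    return len(long) > n and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_value(term, freq):
    """C-value of `term` given a dict of candidate-term frequencies."""
    containers = [t for t in freq if is_nested(term, t)]
    weight = math.log2(len(term))
    if not containers:
        # Not nested in any longer candidate: length weight times frequency.
        return weight * freq[term]
    # Nested: subtract the mean frequency of the longer candidates containing it.
    nested_freq = sum(freq[t] for t in containers) / len(containers)
    return weight * (freq[term] - nested_freq)


for term in FREQ:
    print(" ".join(term), round(c_value(term, FREQ), 3))
```

The length weight favours longer multi-word candidates, while the subtraction penalises strings that mostly occur as part of a longer term.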
![Page 15: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/15.jpg)
Document clustering
unsupervised task: “clusters”, group categories unknown
machine learning and statistics-based approaches
Objective: group documents based on their content / semantic similarities
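One simple way to group documents by content similarity, sketched under stated assumptions: documents are bags of words, similarity is cosine similarity, and a document joins the first cluster whose seed document is similar enough (single-pass clustering). The documents, the `threshold` value, and the function names are invented for illustration; practical systems typically use weighted vectors and more robust algorithms.

```python
import math
from collections import Counter

# Invented mini-collection, for illustration only.
DOCS = [
    "profit decline media group",
    "media group operating profit",
    "trade union executive committee",
    "communist party trade union",
]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Single-pass clustering: join the first cluster whose seed is similar enough."""
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = []  # each cluster is a list of document indices; the first is its seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters


print(cluster(DOCS))
```

No category labels are supplied anywhere: the two groups emerge only from word overlap, which is what makes the task unsupervised.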
![Page 16: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/16.jpg)
Document classification
supervised task: we know the classes/categories
use of machine learning, or statistics-based methods
Objective: classify documents based on their content / semantics
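As one example of a statistics-based supervised method, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing. The choice of Naive Bayes, the labelled examples, the class names `finance` and `labour`, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Invented labelled examples, for illustration only.
TRAIN = [
    ("operating profit decline euros", "finance"),
    ("sales up billion euros", "finance"),
    ("trade union executive committee", "labour"),
    ("communist party trade union", "labour"),
]


def train(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(classify("profit in euros", model))
```

Unlike clustering, the classes are fixed in advance by the labelled training examples, and a new document can only be assigned to one of them.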
![Page 17: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/17.jpg)
[Figure: overview diagram of information management technologies. Legend: Process / Resource. Labels: Doc Collection; IR; Relevant Docs; IE; Relevant Info (NE, EVENT, …); Summarisation (or Abstracting); Important Info; (Indexing); Index Terms; ATR; Terminology; Structured Info; Data Bases; Thesauri, Lexicons, Ontologies, Gazetteers; Data Mining, Reasoning, etc.; Derived Info.]
![Page 18: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/18.jpg)
Summarisation or Abstracting
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses.
Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices.
Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider.
Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.
![Page 19: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/19.jpg)
Information Extraction
supervised, or unsupervised/generic task
Methods: rule-based, machine learning
Objective: detect specific types of info in documents, e.g. names, events, relations
![Page 20: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/20.jpg)
IE tasks
Named Entity (NE): recognise entities/concepts of interest, e.g. persons, organisations, dates & times
Co-reference (CO): recognise mentions of the same entity
Template Relation (TR) & Scenario Template (ST): recognise relations among concepts, e.g. concept properties & entities involved in facts & events of interest
![Page 21: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/21.jpg)
IE Tasks
Bertelsmann said operating earnings before interest
and tax (EBIT) rose 35 percent to 215 million euros
($272.1 million) compared with 2005, and sales were
up 17.3 percent at 4.5 billion euros.
Europe's largest media group on Thursday said it still
expects its 2011 operating profit to decline slightly
year-on-year.
ORGANISATION
PERCENT
DATE
AMOUNT
ORGANISATION=“Bertelsmann” DATE=“2011-11-10”
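A minimal rule-based sketch of the named entity task on the first sentence of the example. The patterns are hand-written assumptions, the organisation pattern is a one-entry stand-in for a gazetteer, and `tag_entities` is a hypothetical name. DATE is left out: resolving "Thursday" to a calendar date needs the document's dateline, which simple patterns do not capture.

```python
import re

TEXT = (
    "Bertelsmann said operating earnings before interest and tax (EBIT) "
    "rose 35 percent to 215 million euros ($272.1 million) compared with 2005, "
    "and sales were up 17.3 percent at 4.5 billion euros."
)

# Hand-written patterns; the organisation pattern stands in for a gazetteer lookup.
PATTERNS = {
    "ORGANISATION": re.compile(r"\bBertelsmann\b"),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)? percent\b"),
    "AMOUNT": re.compile(r"\b\d+(?:\.\d+)? (?:million|billion) euros\b"),
}


def tag_entities(text):
    """Return (label, matched text) pairs ordered by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, span) for _, label, span in sorted(found)]


for label, span in tag_entities(TEXT):
    print(label, "=", span)
```

Rules like these are precise on the phrasings they anticipate and silent on everything else, which is one reason machine learning methods are listed alongside them.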
![Page 22: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/22.jpg)
IE Tasks
Bertelsmann said operating earnings before interest
and tax (EBIT) rose 35 percent to 215 million euros
($272.1 million) compared with 2005, and sales were
up 17.3 percent at 4.5 billion euros.
Europe's largest media group on Thursday said it still
expects its 2011 operating profit to decline slightly
year-on-year.
SALES_of
Event_type: sales
Organisation_type: Company
Organisation_name: Bertelsmann
Sector: media
Sales_mode: increase
Sales_amount: 4.500.000.000
Currency: euros
Period: ??
Date: ??
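The filled template can be pictured as a record with named slots, some of which stay empty when the text does not supply them. The sketch below mirrors the slots on the slide; the class name `SalesEvent` and the use of `None` for the `??` slots are illustrative assumptions.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SalesEvent:
    """Slots of the SALES template; None marks a slot the text did not fill."""
    organisation_type: str
    organisation_name: str
    sector: str
    sales_mode: str
    sales_amount: int
    currency: str
    period: Optional[str] = None
    date: Optional[str] = None
    event_type: str = "sales"


event = SalesEvent(
    organisation_type="Company",
    organisation_name="Bertelsmann",
    sector="media",
    sales_mode="increase",
    sales_amount=4_500_000_000,
    currency="euros",
)

# Slots still open after extraction (the "??" entries on the slide).
unfilled = [slot for slot, value in asdict(event).items() if value is None]
print(unfilled)
```

Once the facts are in this form they are structured data, so they can be stored in a database and queried like any other record.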
![Page 23: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/23.jpg)
Sentiment analysis/Opinion mining
Polarity classification (positive/negative)
Objectivity/Subjectivity detection
![Page 24: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/24.jpg)
[Figure: overview diagram of information management technologies. Legend: Process / Resource. Labels: Doc Collection; IR; Relevant Docs; IE; Relevant Info (NE, EVENT, …); Summarisation (or Abstracting); Important Info; (Indexing); Index Terms; ATR; Terminology; Structured Info; Data Bases; Thesauri, Lexicons, Ontologies, Gazetteers; Data Mining, Reasoning, etc.; Derived Info.]
![Page 25: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/25.jpg)
Structured Data: e.g. Databases
Title: Introduction to Information Retrieval
Author: C.D.Manning, P.Raghavan, H.Schütze
Doc type: Book
Publisher: Cambridge University Press
Pub date: 2008
Id: CM20B
Location: Computer Science section
Keywords: Information Retrieval, Indexing, …
![Page 26: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/26.jpg)
Structured Data: Ontologies
Structure of concepts:
Entities (concepts, objects)
Properties (concept properties)
Relations (links between concepts)
Domain specific relations, e.g., “has_capital”
Objective: describe domain knowledge and reason about concepts & relations
![Page 27: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/27.jpg)
Einstein's riddle
we have five houses in a row
each house is painted a different colour
each house has a single inhabitant
each inhabitant is of a different nationality, drinks a different beverage, owns a different pet, smokes a different brand of cigarettes
Source: http://en.wikipedia.org/wiki/Zebra_puzzle
![Page 28: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/28.jpg)
Einstein's riddle
1. There are five houses.
2. The Englishman lives in the red house.
3. The Spaniard owns the dog.
4. Coffee is drunk in the green house.
5. The Ukrainian drinks tea.
Source: http://en.wikipedia.org/wiki/Zebra_puzzle
![Page 29: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/29.jpg)
Einstein's riddle
6. The green house is immediately to the right of the ivory house.
7. The Old Gold smoker owns snails.
8. Kools are smoked in the yellow house.
9. Milk is drunk in the middle house.
10. The Norwegian lives in the first house.
Source: http://en.wikipedia.org/wiki/Zebra_puzzle
![Page 30: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/30.jpg)
Einstein's riddle
11. The man who smokes Chesterfields lives in the house next to the man with the fox.
12. Kools are smoked in a house next to the house where the horse is kept.
13. The Lucky Strike smoker drinks orange juice.
14. The Japanese smokes Parliaments.
15. The Norwegian lives next to the blue house.
Source: http://en.wikipedia.org/wiki/Zebra_puzzle
![Page 31: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/31.jpg)
Einstein's riddle
Who drinks water?
Who owns a zebra?
Source: http://en.wikipedia.org/wiki/Zebra_puzzle
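The fifteen statements can be checked mechanically. The sketch below is a plain exhaustive search over house positions with early pruning, offered as an illustration of reasoning over stated facts rather than as an ontology reasoner. It assumes the houses are numbered 1 to 5 from left to right, so "first" is house 1 and "middle" is house 3; the numbers in the comments refer to the statements above.

```python
from itertools import permutations

HOUSES = (1, 2, 3, 4, 5)  # positions, assumed left to right
FIRST, MIDDLE = 1, 3


def next_to(a, b):
    """True if houses a and b are neighbours."""
    return abs(a - b) == 1


def solve():
    """Return (water drinker, zebra owner), or None if no assignment fits."""
    orders = list(permutations(HOUSES))
    for red, green, ivory, yellow, blue in orders:
        if green - ivory != 1:                                # 6
            continue
        for english, spaniard, ukrainian, japanese, norwegian in orders:
            if english != red or norwegian != FIRST:          # 2, 10
                continue
            if not next_to(norwegian, blue):                  # 15
                continue
            for coffee, tea, milk, juice, water in orders:
                if coffee != green or tea != ukrainian:       # 4, 5
                    continue
                if milk != MIDDLE:                            # 9
                    continue
                for old_gold, kools, chester, lucky, parl in orders:
                    if kools != yellow or lucky != juice:     # 8, 13
                        continue
                    if parl != japanese:                      # 14
                        continue
                    for dog, snails, fox, horse, zebra in orders:
                        if (dog == spaniard                   # 3
                                and snails == old_gold        # 7
                                and next_to(chester, fox)     # 11
                                and next_to(kools, horse)):   # 12
                            who = {
                                english: "Englishman",
                                spaniard: "Spaniard",
                                ukrainian: "Ukrainian",
                                japanese: "Japanese",
                                norwegian: "Norwegian",
                            }
                            return who[water], who[zebra]
    return None


print(solve())
```

Each variable holds the position of the house where that colour, nationality, beverage, brand, or pet is found, so every statement becomes an equality or a neighbour test between positions. The search returns the Norwegian as the water drinker and the Japanese as the zebra owner.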
![Page 32: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/32.jpg)
Ontology: hierarchical structure
Thing/Root
House: House-1, House-2, House-3, House-4, House-5
Inhabitant: Englishman, Spaniard, Japanese, Norwegian, Ukrainian
Colour: Red, Green, Blue, Ivory, Yellow
Pet: Dog, Horse, Snails, Fox, Zebra
Beverage
![Page 33: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/33.jpg)
Ontology
“is-a” or taxonomic relationships
Denote the “kind” of a concept
But ontologies: more than taxonomic relationships!
Thing/Root
House: House-1, House-2, ...
Inhabitant: Englishman, Spaniard, ...
Colour: Red, Green, ...
Pet: Dog, Horse, ...
Brand
Beverage
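A small sketch of how "is-a" links support a "what kind of thing is this?" query. The child-to-parent table copies part of the hierarchy from the slides, with `Thing` standing for Thing/Root; treating individuals such as House-1 with the same link as sub-concepts is a simplification, and `ancestors` and `is_a` are hypothetical helper names.

```python
# child -> parent ("is-a") links, copied in part from the hierarchy on the slides
PARENT = {
    "House": "Thing", "Inhabitant": "Thing", "Colour": "Thing",
    "Pet": "Thing", "Brand": "Thing", "Beverage": "Thing",
    "House-1": "House", "Spaniard": "Inhabitant",
    "Green": "Colour", "Dog": "Pet",
}


def ancestors(concept):
    """Return the chain of ever more general concepts above `concept`."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def is_a(concept, kind):
    """True if `kind` lies on the path from `concept` up to the root."""
    return kind in ancestors(concept)


print(ancestors("Dog"), is_a("Spaniard", "Inhabitant"))
```

A taxonomy alone can answer only such "kind of" questions; relating a Spaniard to a Dog needs the properties that an ontology adds on top of the hierarchy.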
![Page 34: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/34.jpg)
Ontology: properties
Thing/Root: Inhabitant, Colour, Pet, House, Brand, Beverage
Properties of House (instance shown: House-1):
Has_colour: [Colour] (Colour > Is_ColourOf: [House])
Has_inhabitant: [Inhabitant] (Inhabitant > LivesIn: [House])
Is_rightTo: [House]
![Page 35: Information Retrieval course ::: Information Management Technologies Kalliopi Zervanou k.zervanou@uvt.nl](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56649eb45503460f94bbbac8/html5/thumbnails/35.jpg)
Ontology: properties
Thing/Root: Inhabitant, Colour, Pet, House, Brand, Beverage
Properties of Inhabitant (instance shown: Spaniard):
LivesIn: [House] (House > Has_inhabitant: [Inhabitant])
Has_pet: [Pet] (Pet > Has_owner: [Inhabitant])
Drinks: [Beverage] (Beverage > Drunk_by: [Inhabitant])
Uses_brand: [Brand] (Brand > Used_by: [Inhabitant])
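The paired property names on these slides suggest that each property has an inverse on the concept it points to. The sketch below illustrates that reading: asserting a fact in one direction makes the inverse fact available as well. The `Ontology` class and its method names are hypothetical, and taking House-1 to be the "first house" of statement 10 is an assumption.

```python
from collections import defaultdict

# Property -> inverse property, as paired on the slides.
INVERSE = {
    "Has_inhabitant": "LivesIn",
    "Has_colour": "Is_ColourOf",
    "Has_pet": "Has_owner",
    "Drinks": "Drunk_by",
    "Uses_brand": "Used_by",
}
# Make the table symmetric so either direction can be asserted.
INVERSE.update({inv: prop for prop, inv in list(INVERSE.items())})


class Ontology:
    """Stores (subject, property, object) facts and adds inverse facts automatically."""

    def __init__(self):
        self.facts = defaultdict(set)  # (subject, property) -> set of objects

    def assert_fact(self, subject, prop, obj):
        self.facts[(subject, prop)].add(obj)
        inverse = INVERSE.get(prop)
        if inverse is not None:
            self.facts[(obj, inverse)].add(subject)

    def query(self, subject, prop):
        return set(self.facts.get((subject, prop), set()))


onto = Ontology()
onto.assert_fact("Spaniard", "Has_pet", "Dog")              # statement 3 of the riddle
onto.assert_fact("House-1", "Has_inhabitant", "Norwegian")  # statement 10 (assumed mapping)
print(onto.query("Dog", "Has_owner"), onto.query("Norwegian", "LivesIn"))
```

Deriving `Has_owner` from `Has_pet` is a very small instance of reasoning about concepts and relations: the second fact was never stated, only implied by the property definitions.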