Introduction to Information Retrieval
What is IR?
Sit down before fact as a little child, be prepared to give up every preconceived notion, follow humbly wherever and to whatever abysses nature leads, or you shall learn nothing.
-- Thomas Huxley -- Search Engines 2
Queries tried on each search engine: “What is IR?” and “What is information retrieval?”
► Google, Ask.com, Yahoo!
► Google Korea, Naver, Daum
IR: Key Questions
► What are we looking for?
► How do we find it?
► Why is it difficult?
“A prudent question is one-half of wisdom” -- Francis Bacon --
IR: What are we looking for?
We are
► Looking for X.
  • Q&A: population of China
  • Known-item Search: “Catcher in the Rye”
► Looking for something like/about X.
  • General/background info: Taliban
  • Collection Development: IR Literature
  • Similar to (known) X: like “Catcher in the Rye”
  • WhatyoumacallX: “the rye-boy story”
► Looking for something
  • Problem Resolution: how can we fight terrorism?
  • Knowledge Development: what is IR?
► Looking
  • Need something, but don’t know what (what’s it all about?)
  • Serendipity: Web surfing
IR: How do we find it?
Brute force search
► Easy to build, maintain, and use
► Searcher does all the work; hard to get satisfaction
Organize/structure the data
► Intuitive to use
► Hard to build and maintain
► Knowledge of builder’s language & organization structure is crucial
Use a search tool
► Easier to build and maintain: less manipulation of data
► Sometimes works, sometimes not (helps to know the language of the data)
Ask the experts
► Easy and satisfying to use (by definition)
► “Expert” knowledge is transitory, hard to encapsulate
Go with the crowd
► Relatively easy to build and maintain
► Limited utility: doesn’t work with “unpopular” X
Zen-Fusion search.
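To make the contrast between brute force search and using a search tool concrete, here is a minimal Python sketch. The three toy documents, the function names, and the "all query words must appear" matching rule are illustrative assumptions, not part of the lecture. Both approaches return the same answer; the difference is that the brute-force scan reads every document on every query, while the index is built once and then only looked up.

```python
# Toy documents (an assumption for illustration only).
DOCS = {
    "D1": "information retrieval seminars",
    "D2": "retrieval models and information retrieval",
    "D3": "information model",
}


def brute_force_search(query, docs):
    """Scan every document and keep those that contain every query word."""
    words = query.lower().split()
    hits = []
    for doc_id, text in docs.items():
        doc_words = set(text.lower().split())
        if all(word in doc_words for word in words):
            hits.append(doc_id)
    return sorted(hits)


def build_inverted_index(docs):
    """Map each word to the set of documents that contain it (built once)."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def index_search(query, index):
    """Intersect the document sets of the query words; no document is re-read."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


INDEX = build_inverted_index(DOCS)
scan_hits = brute_force_search("information retrieval", DOCS)
index_hits = index_search("information retrieval", INDEX)
```

Note that a query word that is spelled differently from the document text (for example "models" versus "model") finds nothing in either version, which is one reading of "helps to know the language of the data".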
Information Seeking Process: Dynamic, Interactive, Iterative
User
► What am I looking for? - Identification of info. need
► What question do I ask? - Query formulation
Intermediary
► What is the searcher looking for? - Discovery of user’s info. need
► How should the question be posed? - Query representation
► Where is the relevant information? - Query-document matching
Information
► What data to collect? - Collection development
► What information to index? - Indexing/Representation
► How to represent it? - Data structure
Information Seeking Models
Berry-picking Model
► Interesting information is scattered like berries among bushes.
► Information seeking is a dynamic, non-linear process, where information need/queries continually shift.
► Information needs are not satisfied by a single, final retrieved set of documents, but rather by a series of selections and bits of information found along the way.
Traditional Model
► Linear process:
  1. Problem identification
  2. Identification of information need
  3. Query formulation
  4. Result evaluation
► Static information need
► The goal is to retrieve a perfect match of the information need
Bates, 1989; Broder, 2002
IR Research: Overview
Information Organization: - Add structure & annotation
Information Retrieval - Create a searchable index
Information Access - Retrieve information
Data Mining - Discover Knowledge
IR Research: Information Retrieval
Raw Data
D1: information retrieval seminars (wd1 wd3 wd4)
D2: retrieval models and information retrieval (wd3 wd2 wd1 wd3)
D3: information model (wd1 wd2)

Representation - indexing, term weighting

Searchable Index
Index Term          D1  D2  D3
wd1 (information)    1   1   1
wd2 (model)          0   1   1
wd3 (retrieval)      1   2   0
wd4 (seminar)        1   0   0

Query Formulation - “What is information retrieval?”

Search Results - (ranked) document list
Rank  docID  score
1     D2     3
2     D1     2
3     D3     1
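The scores on this slide are consistent with one simple scheme: count how often each index term occurs in each document, then score a document by summing the counts of the query terms it contains. The Python sketch below reproduces the slide's numbers under that reading. The stopword list and the crude plural stripping are assumptions standing in for real indexing and term weighting, and the function names are hypothetical.

```python
from collections import Counter

# Toy collection copied from the slide.
DOCS = {
    "D1": "information retrieval seminars",
    "D2": "retrieval models and information retrieval",
    "D3": "information model",
}

# Assumption: a tiny stopword list, just enough for this example.
STOPWORDS = {"what", "is", "and"}


def tokenize(text):
    """Lowercase, strip punctuation, drop stopwords, and remove a plural 's'."""
    terms = []
    for raw in text.lower().split():
        word = raw.strip("?.,!\"'")
        if not word or word in STOPWORDS:
            continue
        # Assumption: crude plural stripping stands in for real stemming.
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]
        terms.append(word)
    return terms


def build_index(docs):
    """Map each term to a Counter of {doc_id: term frequency}."""
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, Counter())[doc_id] += 1
    return index


def rank(query, index):
    """Score each document by summing the frequencies of the query terms."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(DOCS)
ranking = rank("What is information retrieval?", index)
print(ranking)  # [('D2', 3), ('D1', 2), ('D3', 1)]
```

D2 ranks first only because "retrieval" occurs twice in it. Raw term counts like these are the simplest possible term weighting; the "term weighting" step on the slide is where a real system would replace them with something more discriminating.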
IR Research: Information Organization
Representation - NLP & Machine Learning
Organized Data Raw Data
Query Formulation - “What is IR?”
Search Results - document groups
IR Research: Natural Language Processing
Goal
► Understanding/effective processing of natural language
  • Not just pattern matching
Research area, technique, tool for
► Knowledge Discovery, Data Mining
Lexical Analysis using
► Part-of-Speech (POS) tagging
► Sentence Parsing
IR Research: Machine Learning
Research area, technique, tool for
► Information Organization, Knowledge Discovery, Data Mining
Information Organization via
► Supervised Learning (Automatic Classification)
► Unsupervised Learning (Clustering)
[Figure: Classification (Class 1, Class 2) vs. Clustering]
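The difference between the two can be sketched in a few lines of Python. In the supervised case the class labels are given and the system learns to assign new items to them; in the unsupervised case there are no labels and the system has to find the groups itself. The techniques used here (a nearest-centroid classifier and a small k-means loop), the 2-D points, and all function names are assumptions chosen for brevity, not the methods used in the lecture.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def nearest(point, centers):
    """Return the key of the center closest to point (Euclidean distance)."""
    return min(centers, key=lambda key: math.dist(point, centers[key]))


def train_classifier(labeled):
    """Supervised: labels are given; learn one centroid per class."""
    by_label = {}
    for point, label in labeled:
        by_label.setdefault(label, []).append(point)
    return {label: centroid(pts) for label, pts in by_label.items()}


def kmeans(points, k, iterations=10):
    """Unsupervised: no labels; group points around k moving centers."""
    # Assumption: the first k points serve as the initial centers.
    centers = {i: points[i] for i in range(k)}
    for _ in range(iterations):
        groups = {i: [] for i in range(k)}
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group ends up empty.
        centers = {i: centroid(g) if g else centers[i] for i, g in groups.items()}
    return [nearest(p, centers) for p in points]


# Hypothetical labeled items, e.g. documents reduced to two features.
labeled = [
    ((1.0, 1.0), "Class 1"),
    ((1.5, 2.0), "Class 1"),
    ((8.0, 8.0), "Class 2"),
    ((9.0, 9.5), "Class 2"),
]

# Classification: a new item is assigned to one of the known classes.
model = train_classifier(labeled)
predicted = nearest((2.0, 1.5), model)

# Clustering: the same items, with labels removed, are grouped from scratch.
points = [point for point, _ in labeled]
clusters = kmeans(points, k=2)
```

The classifier returns a class name that a person defined in advance, while the clustering output is only a group number: deciding what each group means is left to whoever reads the result.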
IR Research: Lifecycle
1. Identify a research question
2. Find out what others have done (i.e. Literature Review)
3. Design an experiment
   i. Form a hypothesis
   ii. Determine specifications (task, data, system, evaluation, user)
   iii. Construct a strategy to accomplish task
4. Conduct the experiments
   i. Design an IR system architecture based on the experiment design
   ii. Implement the system
   iii. Tune system modules with training data
   iv. Execute retrieval runs with test data
5. Write papers
   i. Analyze results
   ii. Execute post-experiment runs
   iii. Analyze the post-experiment results
   iv. Write a conference paper
   v. Present the paper at a conference
   vi. Conduct a follow-up study
   vii. Analyze the follow-up study results
   viii. Write a journal paper
What is the Text REtrieval Conference (TREC)?
Annual Information Retrieval conference
► Sponsored by
  • National Institute of Standards & Technology (NIST)
  • Defense Advanced Research Projects Agency (DARPA)
  • Other U.S. agencies (e.g. DOD)
► Attended by
  • International researchers from academic, commercial, and government institutions
Goals
► Advance IR research based on large-scale data
► Refine IR evaluation methodologies
► Create test collections for various aspects of IR
► Stimulate exchange of ideas & communication among academia, industry, and government
Voorhees, 2014
TREC Tasks: Tracks
Voorhees, 2014