information retrieval thur jan 23 2014 data…. framework for today’s lecture…
STRUCTURED vs unstructured data
easy to envision structured data in terms of “tables”
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000
Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
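The example query can be sketched in a few lines of Python. This is only an illustration: the list-of-dicts layout and the `query` helper are assumptions made for the sketch, not part of the lecture.

```python
# Rows copied from the Employee / Manager / Salary table.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]


def query(rows, max_salary, manager):
    """Numerical range test on salary AND exact text match on manager."""
    return [
        row["employee"]
        for row in rows
        if row["salary"] < max_salary and row["manager"] == manager
    ]


# Salary < 60000 AND Manager = Smith
matches = query(employees, 60000, "Smith")
print(matches)
```

Only rows that satisfy both conditions are returned; Chang is excluded because 60000 is not strictly less than 60000.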
structured vs UNSTRUCTURED data
• typically refers to free text
• email is a good example of unstructured data: it's indexed by date, time, sender, recipient, and subject, but the body of an email remains unstructured
• other examples of unstructured data include books, documents, medical records, and social media posts
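[Figure: retrieval framework. The document collection (corpus) passes through a representation function to form the index; the query passes through its own representation function; a matching function compares the two and returns the results.]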
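The pieces named in this framework (corpus, representation function, index, query, matching function, results) can be sketched as a toy program. Everything here is an assumption chosen for illustration: the representation function is plain lowercase word splitting, the matching function requires every query term to appear, and the three documents are made up.

```python
from collections import defaultdict


def represent(text):
    """Representation function: lowercase the text and split it into a set of terms."""
    return set(text.lower().split())


def build_index(corpus):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in represent(text):
            index[term].add(doc_id)
    return index


def match(index, query):
    """Matching function: return ids of documents containing every query term."""
    terms = represent(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical document collection (corpus).
corpus = {
    1: "structured data lives in tables",
    2: "email bodies are unstructured data",
    3: "facets label recipes",
}

index = build_index(corpus)
results = match(index, "unstructured data")
print(results)
```

The same representation function is applied to documents and to the query, which is what lets the matching function compare them term by term.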
CATEGORIES / SUBJECT HEADINGS
What is Metadata?
• Classic definition: data about data
• Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. (NISO)
• 3 primary “types”:
– Descriptive
– Structural
– Administrative (rights management, preservation)
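One way to picture the three types is a single record whose fields are grouped by type. The field names and values below are hypothetical placeholders, not taken from any real catalog record.

```python
# Hypothetical metadata record grouped by the three primary types.
record = {
    "descriptive": {"title": "Example Title", "author": "Example Author"},
    "structural": {"chapters": 12, "format": "print"},
    "administrative": {"rights": "in copyright", "preservation": "microfilm copy"},
}


def fields_of_type(record, kind):
    """Return the sorted field names stored under one metadata type."""
    return sorted(record.get(kind, {}))


print(fields_of_type(record, "administrative"))
```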
http://search.lib.unc.edu/search?R=UNCb4448196
More Metadata: A Cataloging Record
The Idea of Facets
• Facets are a way of labeling data
– A kind of Metadata (data about data)
– Can be thought of as properties of items
• Facets vs. Categories
– Items are placed INTO a category system
– Multiple facet labels are ASSIGNED TO items
Facets Epicurious example http://www.epicurious.com/
• Create INDEPENDENT categories (facets)
– Each facet has labels (sometimes arranged in a hierarchy)
• Assign labels from the facets to every item
– Example: recipe collection
Course: Main Course
Cooking Method: Stir-fry
Cuisine: Thai
Ingredient: Bell Pepper, Curry, Chicken
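The recipe example can be written down as one item carrying labels from several independent facets. This is a minimal sketch: the recipe name and the `has_label` helper are made up for illustration, while the facets and labels are the ones listed above.

```python
# One item with labels assigned from four independent facets.
recipe = {
    "name": "Thai chicken curry stir-fry",  # hypothetical recipe name
    "facets": {
        "Course": {"Main Course"},
        "Cooking Method": {"Stir-fry"},
        "Cuisine": {"Thai"},
        "Ingredient": {"Bell Pepper", "Curry", "Chicken"},
    },
}


def has_label(item, facet, label):
    """True if the item carries the given label within the given facet."""
    return label in item["facets"].get(facet, set())


print(has_label(recipe, "Cuisine", "Thai"))
print(has_label(recipe, "Ingredient", "Tofu"))
```

Note that a facet such as Ingredient can hold several labels at once, which a single category slot could not.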
The Idea of Facets
• Break out all the important concepts into their own facets
• Sometimes the facets are hierarchical
– Assign labels to items from any level of the hierarchy
Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Desserts: Cakes, Cookies, Dairy (Ice Cream, Sorbet, Flan)
Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple
Using Facets
• Now there are multiple ways to get to each item
Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Desserts: Cakes, Cookies, Dairy (Ice Cream, Sherbet, Flan)
Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple
Fruit > Pineapple
Dessert > Cake
Preparation > Bake

Dessert > Dairy > Sherbet
Fruit > Berries > Strawberries
Preparation > Freeze
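The "multiple ways to get to each item" idea can be sketched by storing each item's facet paths and browsing by path prefix. Grouping the six paths into two items, and the item names themselves, are assumptions made for this sketch.

```python
# Each item is labeled with one path per facet; paths may stop at any level.
items = {
    "pineapple cake": [  # hypothetical item name
        ("Fruit", "Pineapple"),
        ("Dessert", "Cake"),
        ("Preparation", "Bake"),
    ],
    "strawberry sherbet": [  # hypothetical item name
        ("Dessert", "Dairy", "Sherbet"),
        ("Fruit", "Berries", "Strawberries"),
        ("Preparation", "Freeze"),
    ],
}


def browse(items, *prefix):
    """Return names of items that have a facet path starting with prefix."""
    return sorted(
        name
        for name, paths in items.items()
        if any(path[: len(prefix)] == prefix for path in paths)
    )


print(browse(items, "Fruit"))
print(browse(items, "Dessert", "Dairy"))
print(browse(items, "Preparation", "Bake"))
```

Each item is reachable from any of its facets, and from any level of a hierarchical facet.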
caveat: semi-structured data
• in fact almost no data is absolutely “unstructured”
• e.g., this slide has distinctly identified zones such as the title and bullets
• facilitates “semi-structured” search such as
– title contains data and bullets contain structure
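A zone-restricted search of this kind can be sketched by treating the slide as a small document with a title zone and a bullets zone. The dictionary layout and the `zone_contains` helper are assumptions for illustration; the text is this slide's own.

```python
import re

# This slide, split into its two identified zones.
slide = {
    "title": "caveat: semi-structured data",
    "bullets": [
        "in fact almost no data is absolutely unstructured",
        "this slide has distinctly identified zones such as the title and bullets",
        "title contains data and bullets contain structure",
    ],
}


def zone_contains(doc, zone, term):
    """True if term appears as a whole word inside the named zone."""
    value = doc.get(zone, "")
    text = " ".join(value) if isinstance(value, list) else value
    return term.lower() in re.findall(r"\w+", text.lower())


# title contains "data" AND bullets contain "structure"
hit = zone_contains(slide, "title", "data") and zone_contains(slide, "bullets", "structure")
print(hit)
```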
Let’s look at a database of magazine & journal articles…
…Academic Search Complete
>> UNC Libraries Homepage: http://www.lib.unc.edu/
>> E-Research Tools
>> Frequently Used
>> Academic Search Complete [off-campus log in with onyen/password]
Organization / Search
• We organize to enable retrieval
• The more effort we put into organizing information, the more effectively it can be retrieved
• The more effort we put into retrieving information, the less it needs to be organized first
• We need to think in terms of investment, allocation of costs and benefits between the organizer and retriever
• The allocation differs according to the relationship between them; who does the work and who gets the benefit?