Post on 18-Dec-2015
1
You are a document too:
Web mining and IR
for next-generation
information literacy
Bettina Berendt
K.U. Leuven, Belgium
www.berendt.de
2
From
IR / WM tools for solving a (information-getting) problem
to
IR / WM as cognitive tools for thinking about what the problem is
“Another grand challenge“
4
About me: My public (and mine-able) profile
: Information Systems: Computer Science / Cognitive Science: Artificial Intelligence: Business Science: Economics
: Computer Science
5
Agenda
Outlook
(Some) Questions
Goal
Concepts
(Some) answers
Goal: IR / WM for teaching and learning Information Literacy
Concepts: From information to communication & privacy
(Some) answers: IR / WM tools elucidate communication patterns
7
Should an unknown Web user get this news ...
... or these?
[Genderlens. See Liu & Mihalcea, Proc. ICWSM 2007]
12
How do embattled politicians minimize their responsibility?
I acknowledge that mistakes were made here (just not by me)
14
The goal: to use Information Retrieval / Web Mining for teaching and learning Information Literacy
Information literacy:
a set of competencies that an informed citizen of an information society ought to possess to participate intelligently and actively in that society
“the ability to recognize when information is needed and to locate, evaluate and use effectively the needed information”
15
Information Literacy
“the ability to recognize when information is needed and to locate, evaluate and use effectively the needed information”
1. Task Definition
   1.1 Define the information problem
   1.2 Identify information needed
2. Information Seeking Strategies
   2.1 Determine all possible sources
   2.2 Select the best sources
3. Location and Access
   3.1 Locate sources (intellectually and physically)
   3.2 Find information within sources
4. Use of Information
   4.1 Engage (e.g., read, hear, view, touch)
   4.2 Extract relevant information
5. Synthesis
   5.1 Organize from multiple sources
   5.2 Present the information
6. Evaluation
   6.1 Judge the product (effectiveness)
   6.2 Judge the process (efficiency)
16
a set of competencies that an informed citizen of an information society ought to possess to participate intelligently and actively in that society
Using Information Retrieval / Web Mining for teaching and learning Information Literacy
1. Task Definition
   1.1 Define the information problem
   1.2 Identify information needed
2. Information Seeking Strategies
   2.1 Determine all possible sources
   2.2 Select the best sources
3. Location and Access
   3.1 Locate sources (intellectually and physically)
   3.2 Find information within sources
4. Use of Information
   4.1 Engage (e.g., read, hear, view, touch)
   4.2 Extract relevant information
5. Synthesis
   5.1 Organize from multiple sources
   5.2 Present the information
6. Evaluation
   6.1 Judge the product (effectiveness)
   6.2 Judge the process (efficiency)
Information: get, produce, communicate IR / WM
IR / WM
19
Some history / motivations
Information Retrieval
1940s: The US military confronts problems of indexing and retrieving wartime scientific research documents captured from the Germans
1950s: Growing concern in the US about a “science gap” with the USSR leads to mechanized literature searching systems
Information Literacy
1983 report A Nation at Risk: The Imperative for Educational Reform : a “rising tide of mediocrity” is eroding the very foundations of the American educational system
Data Mining
1990s: ‘If a business knew more about its customers, they wouldn‘t run away to competitors‘
All motivated by combinations of scarcity and abundance
20
Information retrieval as a communication process
Authors
Documents
Users with tasks and goals
Information needs
Queries
Document representations
Found documents
Evaluation of the documents
Kuropka, Advances in Inf. Systems and Mgt. Science, 2004 (simplified)
21
... this assumes “mutually wanted“ communication
wants to disclose info
wants to get info
information owner (subject)
information seeker
Information
Intention
Intention
22
4 cases
Case 1: wants to disclose info / wants to get info
Case 2: does not want to disclose / wants to get info
Case 3: wants to disclose info / does not want to get info
Case 4: does not want to disclose / does not want to get info
25
Case 1: Example 2
“... I want to make three brief points about the resignations of the eight United States' attorneys, a topic that I know is foremost in your minds. First, those eight attorneys deserved better. ... Each is a fine lawyer and dedicated professional. I regret how they were treated, and I apologize to them and to their families for allowing this matter to become an unfortunate and undignified public spectacle. I accept full responsibility for this. Second, I want to address allegations that I have failed to tell the truth about my involvement in these resignations. These attacks on my integrity have been very painful to me. ...“
wants to disclose info
wants to get info
26
... but ...
does not want to disclose
wants to get info
... but ... [Method: Learning a trie from the string sequences]
http://services.alphaworks.ibm.com/manyeyes/view/SgoRsIsOtha6bhEf6arzI2-
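The method named above, learning a trie from string sequences, can be sketched as follows. This is only a minimal illustration under stated assumptions: whitespace tokenization, a made-up toy sentence rather than the speech itself, and hypothetical helper names (`TrieNode`, `build_trie`, `branches`). It is not claimed to be how the Many Eyes word tree is implemented.

```python
from collections import defaultdict


class TrieNode:
    """One word in the tree, with the number of sequences passing through it."""

    def __init__(self):
        self.count = 0
        self.children = defaultdict(TrieNode)


def build_trie(text, root_word, depth=3):
    """Insert the (up to) `depth` words following each occurrence of root_word."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenization
    root = TrieNode()
    for i, tok in enumerate(tokens):
        if tok != root_word:
            continue
        root.count += 1
        node = root
        for nxt in tokens[i + 1 : i + 1 + depth]:
            node = node.children[nxt]
            node.count += 1
    return root


def branches(node, prefix=()):
    """Yield (continuation, count) pairs depth-first, frequent branches first."""
    ordered = sorted(node.children.items(), key=lambda kv: (-kv[1].count, kv[0]))
    for word, child in ordered:
        path = prefix + (word,)
        yield " ".join(path), child.count
        yield from branches(child, path)


# Toy text (invented for illustration only).
toy_text = (
    "i accept responsibility but i do not recall "
    "but i do not know but others decided"
)
trie = build_trie(toy_text, "but", depth=3)
top = list(branches(trie))
```

Reading `top` from the start shows which continuations of "but" recur most often, which is the pattern the word-tree picture makes visible at a glance.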
27
Going further in analysing implicit messages: What differentiates news sources?
[Fortuna, Galleguillos, & Cristianini, in press]
[Method: Nearest neighbour / best reciprocal hit for document matching; Kernel Canonical Correlation Analysis and vector operations for finding topics and characteristic keywords]
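The document-matching step (nearest neighbour / best reciprocal hit) might look like the sketch below. It assumes bag-of-words vectors and cosine similarity, which the slide does not specify. The headlines are invented and the function names are hypothetical. The Kernel Canonical Correlation Analysis step is not covered here.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector (assumption: lower-cased whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def nearest(query, candidates):
    """Index of the candidate vector most similar to query."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def best_reciprocal_hits(docs_a, docs_b):
    """Pairs (i, j) where b_j is a_i's nearest neighbour and a_i is b_j's."""
    va = [bow(d) for d in docs_a]
    vb = [bow(d) for d in docs_b]
    pairs = []
    for i, vec in enumerate(va):
        j = nearest(vec, vb)
        if nearest(vb[j], va) == i:
            pairs.append((i, j))
    return pairs


# Invented headlines from two hypothetical outlets.
outlet_a = [
    "election results announced in capital",
    "storm hits coastal towns",
    "central bank raises interest rates",
]
outlet_b = [
    "coastal towns battered by storm",
    "capital awaits election results",
]
pairs = best_reciprocal_hits(outlet_a, outlet_b)
```

Requiring the hit to be reciprocal drops the third article of `outlet_a`, which has no real counterpart in `outlet_b`; plain nearest-neighbour matching would have paired it with something anyway.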
wants to disclose info
wants to get info
? ... depends
29
Her queries (sample from an anonymized search-query log)
http://www.nytimes.com/2006/08/09/technology/09aol.html
31
Input data and prediction problem
Informal observations of correlations between browsing behaviour and demographic attributes (gender, age)
Problem:
How to predict a user‘s gender from the Web pages s/he clicked on
Basic idea:
users – user-to-page matrix – pages – document-to-term matrix – terms
[Jian Hu, Hua-Jun Zeng, Hua Li, Cheng Niu, Zheng Chen (2007). Demographic Prediction Based on User’s Browsing Behavior. In Proc. WWW 2007]
32
[Method: Learn a classifier]
1. Define the gender tendency of a Web page
   Proportion of requests for the page by male/female (c) users, relative to all requests (R: user-to-page matrix)
2. Learn the gender tendency of Web pages
   Pages: with variance on gender ≥ threshold
   Linear form of support-vector machine regression
   Features: content words with highest information gain
   Target attribute: gender tendency
3. Predict the user‘s gender
   Naive Bayes
   Features: visited pages
   Target attribute: gender (and some more optimization)
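Steps 1 and 3 above can be illustrated with a small sketch. It rests on several assumptions that are not in the slides: an invented click log, hypothetical function names, add-one smoothing, and class priors taken from the labelled users. Step 2 (support-vector regression on content words) and the further optimizations are left out, so this is not the cited authors' implementation.

```python
import math
from collections import defaultdict


def gender_tendency(clicks, gender):
    """Step 1: per page, the share of requests made by users of each gender."""
    counts = defaultdict(lambda: {"f": 0, "m": 0})
    for user, page in clicks:
        counts[page][gender[user]] += 1
    return {
        page: {g: n / sum(c.values()) for g, n in c.items()}
        for page, c in counts.items()
    }


def train_naive_bayes(clicks, gender, alpha=1.0):
    """Step 3 (training): P(page | gender) with add-alpha smoothing (assumption)."""
    pages = sorted({page for _, page in clicks})
    page_counts = {g: defaultdict(int) for g in ("f", "m")}
    for user, page in clicks:
        page_counts[gender[user]][page] += 1
    likelihood = {}
    for g, pc in page_counts.items():
        total = sum(pc.values()) + alpha * len(pages)
        likelihood[g] = {p: (pc[p] + alpha) / total for p in pages}
    n_users = len(gender)
    prior = {
        g: sum(1 for v in gender.values() if v == g) / n_users for g in ("f", "m")
    }
    return prior, likelihood


def predict_gender(visited, prior, likelihood):
    """Step 3 (prediction): most probable gender given the visited pages."""
    scores = {}
    for g in prior:
        scores[g] = math.log(prior[g]) + sum(
            math.log(likelihood[g][p]) for p in visited if p in likelihood[g]
        )
    return max(scores, key=scores.get)


# Invented training data: users with known gender and the pages they requested.
gender = {"u1": "f", "u2": "f", "u3": "m", "u4": "m"}
clicks = [
    ("u1", "fashion"), ("u1", "news"), ("u2", "fashion"), ("u2", "health"),
    ("u3", "cars"), ("u3", "news"), ("u4", "cars"), ("u4", "sports"),
]
tendency = gender_tendency(clicks, gender)
prior, likelihood = train_naive_bayes(clicks, gender)
guess_a = predict_gender(["fashion", "news"], prior, likelihood)
guess_b = predict_gender(["cars"], prior, likelihood)
```

Even in this toy setting, a page such as "news" that both groups request equally contributes nothing to the decision, while a single strongly skewed page is enough to tip the prediction.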
33
Results
A kind of analogue of the bag of words (BOW) used for predicting gender from produced content (words over all visited pages)
41
[Typical method: learn a classification model (usually with at least some features being words)]
[Ntoulas et al., Proc. WWW 2006]
42
Case 3: Example 2
does not want to get info
wants to disclose info
Solution approach?: Learn classification models
A blog reader: “I don‘t mind personal blogs, but if they get to really really personal stuff, like if they‘re going to start talking about suicide, it‘s not something that you wanna share ... I avoid reading content that I consider too personal.”
[Baumer, Sueyoshi, & Tomlinson. In Proc. ICWSM 2008]
43
Case 4
Do these people beat their kids?
does not want to disclose
does not want to get info
Move this communication out of the long tail?!
<a picture of a happy family>
45
[Method: Information visualization / history flow, e.g. visualizing conflict, here: “edit wars”]
[Viégas, Wattenberg, & Dave, Proc. CHI 2004]
47
Web mining for articulation and reflection
Repetition Organisation Elaboration
[Berendt, in Neues Handbuch Hochschullehre, 2006; BRMIC ‘01]
Proxy server
Logfile
ASP
[Methods: Usage tracking, semantic graph coarsening]
49
Challenge 1: Understanding and keeping up with the communications arms race
“membership - or 'log in' - is the new anonymous.“
[Digital Methods Initiative (2007). Comparison between Anonymous Palestinian and Israeli Wikipedia Edits. wiki2.issuecrawler.net/twiki/bin/view/Dmi/ComparisonBetweenAnonymousPalestinianAndIsraeliWikipediaEdits]
52
Network effects (3): Inferences
Friendship is generally symmetric
If A wants to hide her friendships, but B shows that “A is my friend”, then B has disclosed private information about A.
(More elaborate problems follow from this ...)
For a discussion, see
Preibusch, S., Hoser, B., Gürses, S., & Berendt, B. (2007). Ubiquitous social networks - opportunities and challenges for privacy-aware user modelling. In Proceedings of the Workshop on Data Mining for User Modelling at UM 2007, Corfu, Greece, June 2007.
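The inference described on this slide is simple enough to write down. The sketch below assumes that friendship is strictly symmetric and that friend lists are available as plain sets. The user names and the function `infer_hidden_friendships` are hypothetical.

```python
def infer_hidden_friendships(disclosed, hiders):
    """Return the friendships of privacy-seeking users that other profiles reveal.

    disclosed: dict mapping each user to the set of friends shown on their profile
    hiders:    set of users who do not show a friend list themselves
    """
    leaked = {}
    for user, friends in disclosed.items():
        for friend in friends:
            if friend in hiders:
                # Symmetry assumption: "friend is my friend" also tells us
                # that "user" is one of friend's friends.
                leaked.setdefault(friend, set()).add(user)
    return leaked


# A hides her friend list; B and D show theirs.
disclosed = {"B": {"A", "C"}, "C": {"B"}, "D": {"A"}}
leaked = infer_hidden_friendships(disclosed, hiders={"A"})
```

Here `leaked` recovers part of A's friend list although A disclosed nothing herself, which is the network effect the slide points to.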
53
Network effects (4): Requirements interact
[Preibusch, S., Hoser, B., Gürses, S., & Berendt, B. (2007). Ubiquitous social networks - opportunities and challenges for privacy-aware user modelling. In Proceedings of the Workshop on Data Mining for User Modelling at UM 2007, Corfu, Greece, June 2007.]
54
Challenge 3: Countermeasures against re-identification and their effect on democracy (and other things)
Is this the same person?
55
Keeping identities apart – the basic setting
Paper published by the MovieLens team (collaborative-filtering movie ratings) who were considering publishing a ratings dataset, see http://movielens.umn.edu/
Public dataset: users mention films in forum posts
Private dataset (may be released e.g. for research purposes): users‘ ratings
Film IDs can easily be extracted from the posts
Observation: Every user will talk about items from a sparse relation space (those – generally few – films s/he has seen)
[Frankowski, D., Cosley, D., Sen, S., Terveen, L., & Riedl, J. (2006). You are what you say: Privacy risks of public mentions. In Proc. SIGIR‘06]
56
[Method: Compute similarities between people (films as features)]
Given a target user t from the forum users, find similar users (in terms of which items they related to) in the ratings dataset
Rank these users u by their likelihood of being t
Evaluate:
If t is in the top k of this list, then t is k-identified
Count percentage of users who are k-identified
E.g. measure likelihood by TF.IDF (m: item)
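The evaluation above can be sketched in a few lines. The assumptions here are not taken from the cited paper: binary term frequency (a film is either mentioned/rated or not), IDF computed over the ratings dataset, cosine similarity as the likelihood score, alphabetical tie-breaking, and invented toy data. The function names are hypothetical.

```python
import math


def idf_weights(ratings):
    """IDF of each item m: films rated by few users are more identifying."""
    n_users = len(ratings)
    df = {}
    for items in ratings.values():
        for m in items:
            df[m] = df.get(m, 0) + 1
    return {m: math.log(n_users / c) for m, c in df.items()}


def score(mentioned, rated, idf):
    """Cosine similarity of two item sets under binary-TF times IDF weights."""
    dot = sum(idf.get(m, 0.0) ** 2 for m in mentioned & rated)
    norm_t = math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in mentioned))
    norm_u = math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in rated))
    return dot / (norm_t * norm_u) if norm_t and norm_u else 0.0


def rank_candidates(mentioned, ratings, idf):
    """Users of the ratings dataset, most likely match first (ties: by name)."""
    return sorted(ratings, key=lambda u: (-score(mentioned, ratings[u], idf), u))


def k_identified_share(mentions, ratings, k):
    """Fraction of forum users whose own ratings record ranks in the top k."""
    idf = idf_weights(ratings)
    hits = sum(
        1
        for t, items in mentions.items()
        if t in rank_candidates(items, ratings, idf)[:k]
    )
    return hits / len(mentions)


# Invented data: private ratings and public forum mentions for three users.
ratings = {
    "u1": {"Alien", "Brazil", "Casablanca"},
    "u2": {"Alien", "Dune"},
    "u3": {"Alien", "Casablanca", "Eraserhead"},
}
mentions = {
    "u1": {"Brazil"},
    "u2": {"Alien"},
    "u3": {"Eraserhead", "Casablanca"},
}
share_k1 = k_identified_share(mentions, ratings, k=1)
share_k2 = k_identified_share(mentions, ratings, k=2)
```

In the toy data, the user who only mentions a film that everybody rated stays hidden at k = 1, while a single mention of a rarely rated film is enough to single out its author.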
60
Summary and conclusions
Information-related activities involve disclosing and withholding.
Each information-related activity has (at least) one source, one manifestation as data/document, one user and one stakeholder; network effects abound.
The dichotomy of information-seeking users and information-containing data/documents has vanished.
A new operationalisation of information literacy: getting, producing, communicating, … information
IR/WM tools can support this type of information literacy
For whom is that interesting? Researchers, Instructors, Practitioners, Citizens
Who can do something about this? Researchers, Instructors, Practitioners, Anyone who funds such work ...
62
Picture and some more literature credits
pp. 1 and 61: http://farm2.static.flickr.com/1062/932116791_490db77985_m.jpg
pp. 8 and 30: http://www.theage.com.au/news/World/Charles-coronation-to-move-with-the-times/2004/12/26/1103996438678.html
pp. 9 and 28: http://www.nytimes.com/2006/08/09/technology/09aol.html
pp. 10 and 40: http://seiplecandis.googlepages.com/exerterton.jpg
pp. 10 and 42: http://www.crowncombo.com/articles/2005/100205_monster/monster04.jpg
pp. 11 and 35: http://www.radarmagazine.com/features/images/2006/12/atomic-energy-lab-01.jpg
pp. 12 and 25: http://graphics8.nytimes.com/images/2007/07/24/us/24gonzales-2-600.jpg, with inspiration by http://www.lifeclever.com/wp-content/uploads/2007/03/gonzales_passive.jpg
p. 13: http://www.ffc-turbine.de/graphs/news/070303_nadineangerer.jpg
p. 14: based on http://en.wikipedia.org/wiki/Information_literacy , „yellow definition“ quoted from there and based on Shapiro, J.J. & Hughes, S.K. (1996). Information Literacy as a Liberal Art. Enlightenment proposals for a new curriculum. Educom Review, 31 (2), http://www.educause.edu/pub/er/review/reviewarticles/31231.html; „light blue definition“ based on Presidential Committee on Information Literacy. 1989, p. 1 (see Wikipedia page)
p. 15: http://www.big6.com/what-is-the-big6%E2%84%A2/
p. 19 uses input from en.wikipedia.org/wiki/Information_literacy and en.wikipedia.org/wiki/Information_retrieval
p. 24: http://media.mcclatchydc.com/smedia/2007/10/08/16/854-8web-clinton-obama-minor.standalone.prod_affiliate.91.jpg
p. 25 (text): http://services.alphaworks.ibm.com/manyeyes/static-resources/data/89ade5ae14e1dd2c0114ff78100c0b61.txt
p. 36: from the Wikipedia page (some editing done for illustration)
p. 37: http://wikiscanner.virgil.gr/
p. 39: http://de.wikipedia.org/wiki/Wikipedia:Wikiscanner
p. 46: http://www.surrealcoconut.com/surrealism_gallery/coulage/chocolate1.html
p. 51: http://eu.inmagine.com/img/imagewerksrf/iwf06015/iwf019005.jpg, http://img.timeinc.net/time/time100/2007/images/queen_elizabeth.jpg, http://www.barmala.de/wp-content/uploads/2005/02/spam.jpg