10.12.2009
Data Mining MTAT.03.183
Text Mining
Social Network Analysis
Graph Mining
Jaak Vilo
2009 Fall
Topics
• Information Retrieval
• Text Mining
• Web Mining
• Social Network Analysis
– friends, epidemiology, co‐authoring, co‐citation, espionage, …
• Graph Mining
• STACC
Jaak Vilo and other authors UT: Data Mining 2009 2
Books
• Feldman, Sanger: The Text Mining Handbook
• Information Retrieval
• Web Mining
Materials
• Modern Information Retrieval by Ricardo Baeza‐Yates and Berthier Ribeiro‐Neto.
– http://people.ischool.berkeley.edu/~hearst/irbook/
– http://www.amazon.co.uk/Modern‐Information‐Retrieval‐ACM‐Press/dp/020139829X/ref=sr_1_1?ie=UTF8&s=books&qid=1228237684&sr=8‐1
– http://books.google.com/books?id=HLyAAAAACAAJ&dq=modern+information+retrieval
– New edition in May 2009 (overdue?)
• Google Books: Information Retrieval
– http://books.google.com/books?q=information+retrieval
• ESSCaSS’08: Ricardo Baeza‐Yates and Filippo Menczer
– http://courses.cs.ut.ee/schools/esscass2008/Main/Materials
Topics
• Information Retrieval
• Text Mining
• Web Mining
• Social Network Analysis
– friends, epidemiology, co‐authoring, co‐citation, espionage, …
• Graph Mining
Concepts
Information Retrieval ‐ the study of systems for representing, indexing (organising), searching (retrieving), and recalling (delivering) data.
Information Filtering ‐ given a large amount of data, return the data that the user wants to see.
Information Need ‐ what the user really wants to know; a query is an approximation to the information need.
Query ‐ a string of words that characterizes the information that the user seeks.
Browsing ‐ a sequence of user interaction tasks that characterizes the information that the user seeks.
Documents
• News, articles
• Laws, legal documents
• Scientific publications, patents
• E‐mail
• Technical documents
• Books
• Encyclopediae
• Dictionaries
• …
Information Retrieval
• DB of indexed documents
• Query
• Find documents relevant to query
More features
• Metadata
• Context
• Hypertext – xrefs
• Language
• Structured vs unstructured
• Semantics
• Tags
• …
Confusion matrix: TP, FP, FN, TN
T = True, F = False; P = Positive, N = Negative
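The four counts are what precision and recall are usually computed from. A minimal sketch, assuming the standard definitions precision = TP/(TP+FP) and recall = TP/(TP+FN); the function name and the example counts are illustrative and do not come from the slides:

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # share of retrieved items that are relevant
    recall = tp / (tp + fn) if tp + fn else 0.0      # share of relevant items that were retrieved
    return precision, recall


# Illustrative counts: 40 documents returned, 30 of them relevant, 20 relevant ones missed.
p, r = precision_recall(tp=30, fp=10, fn=20)
```

TN does not enter either measure, which is convenient in retrieval, where the set of irrelevant, non-retrieved documents is huge.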
Web search
• data mining university of tartu fall 2009
• Java
• Paris Hilton
• climate change
Web search engines
Rooted in Information Retrieval (IR) systems
• Prepare a keyword index for corpus
• Respond to keyword queries with a ranked list of documents.
ARCHIE
• Earliest application of rudimentary IR systems to the Internet
• Title search across sites serving files over FTP
Document representation
• Feature vector
– Bag of words
• vector Di – which words are present
– Query Q
– similarity M(Q, Di) = Q ∙ Di (scalar product)
– Weighted version – easy to generalise
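The bag-of-words similarity above can be sketched in a few lines. This assumes whitespace tokenisation and a fixed, hand-picked vocabulary; the documents, the vocabulary and the function names are illustrative only:

```python
def bag_of_words(text, vocabulary):
    """Binary feature vector: 1 if the vocabulary word is present in the text."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]


def similarity(q, d):
    """Scalar product M(Q, Di) = Q . Di."""
    return sum(qi * di for qi, di in zip(q, d))


vocabulary = ["data", "mining", "java", "coffee", "island"]
docs = [
    "Java coffee from the island of Java",
    "data mining with Java",
    "text and data mining",
]
query = bag_of_words("java data mining", vocabulary)
scores = [similarity(query, bag_of_words(d, vocabulary)) for d in docs]
```

Replacing the 0/1 entries with real-valued weights gives the weighted version without changing `similarity`.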
TF‐IDF
• Term Frequency – Inverse Document Frequency
• Weights: frequent in doc, infrequent in DB
– Measures frequency, and relevance
– TF‐IDF(w,d) = tf(w,d)*idf(w)
TF‐IDF_Weight(w,d) = Term_Frequency(w,d)*log(N/DocFreq(w))
• log( N / (1+DocFreq(w)) ) to avoid division by 0 if not in the database at all…
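A small sketch of the weight formula, assuming raw counts for Term_Frequency and the natural logarithm (the slide does not fix the base). It uses the unsmoothed log(N/DocFreq(w)) form, which is safe here because weights are only computed for words that occur in the collection; the toy documents are illustrative:

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    # DocFreq(w): number of documents that contain w at least once.
    doc_freq = Counter(w for d in docs for w in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)  # Term_Frequency(w, d) as a raw count
        weights.append({w: tf[w] * math.log(n / doc_freq[w]) for w in tf})
    return weights


docs = [["java", "coffee", "java"], ["java", "island"], ["data", "mining"]]
w = tf_idf(docs)
```

In the first document "coffee" outweighs "java" even though "java" occurs twice, because "java" appears in two of the three documents.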
Boolean queries: Examples
• Simple queries involving relationships between terms and documents
– Java (documents containing word Java)
– Java AND (NOT coffee), i.e. Java -coffee
• Proximity queries
– phrase “Java beans” or the term API
– Documents where Java and island occur in the same sentence
Mining the Web Chakrabarti and Ramakrishnan 25
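The example queries can be tried on a toy collection. This is a sketch, assuming lower-cased, whitespace-tokenised documents; term lookups use Python sets, and the phrase check scans the token list directly rather than using positional postings:

```python
docs = {
    1: "java coffee beans",
    2: "java beans api",
    3: "java island",
}

# term -> set of document ids containing the term
index = {}
for did, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(did)

all_docs = set(docs)

# Java AND (NOT coffee)
java_not_coffee = index["java"] & (all_docs - index.get("coffee", set()))


def phrase_match(text, phrase):
    """True if the words of `phrase` occur consecutively in `text`."""
    tokens, p = text.split(), phrase.split()
    return any(tokens[i:i + len(p)] == p for i in range(len(tokens) - len(p) + 1))


# phrase "java beans" OR the term api
phrase_or_api = {d for d, t in docs.items() if phrase_match(t, "java beans")} | index.get("api", set())
```

AND, OR and NOT map to set intersection, union and difference, which is why Boolean queries are cheap once the term-to-documents mapping exists.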
Document preprocessing
• Tokenization
– Filtering away tags
– Tokens regarded as nonempty sequence of characters excluding spaces and punctuations.
– Token represented by a suitable integer, tid, typically 32 bits
– Optional: stemming/conflation of words
– Result: document (did) transformed into a sequence of integers (tid, pos)
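The preprocessing steps can be sketched as follows. The tag-stripping regular expression and the ASCII-only token pattern are simplifying assumptions, stemming is left out, and `lexicon` is a hypothetical in-memory term-to-tid map:

```python
import re


def tokenize(html, lexicon):
    """Strip tags, split into tokens, and map each token to an integer tid."""
    text = re.sub(r"<[^>]+>", " ", html)                 # filter away tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # nonempty runs without spaces/punctuation
    postings = []
    for pos, tok in enumerate(tokens):
        tid = lexicon.setdefault(tok, len(lexicon))      # reuse the tid or assign the next free one
        postings.append((tid, pos))
    return postings


lexicon = {}
seq = tokenize("<p>Java beans, <b>Java</b> API.</p>", lexicon)
```

Sharing one `lexicon` across documents keeps tids consistent for the whole collection.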
Storing tokens
• Straight-forward implementation using a relational database
– Example figure
– Space scales to almost 10 times
• Accesses to table show common pattern
– reduce the storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples.
– Indexing = transposing document-term matrix
Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as “document/position”) may be implemented using a B-tree or a hash-table.
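Both index variants can be illustrated in memory. This is a sketch only: a Python dict stands in for the B-tree or hash table, terms are used directly instead of tids, and the two documents are made up:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {did: text}. Returns {term: [(did, pos), ...]} sorted by (did, pos)."""
    index = defaultdict(list)
    for did in sorted(docs):
        for pos, term in enumerate(docs[did].lower().split()):
            index[term].append((did, pos))
    return index


docs = {1: "java beans api", 2: "java island java"}
with_offsets = build_inverted_index(docs)

# The simpler variant drops the offsets and keeps only the document ids.
without_offsets = {t: sorted({did for did, _ in plist}) for t, plist in with_offsets.items()}
```

Phrase and proximity queries need the variant with offsets; plain Boolean queries can be answered from the smaller one.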
Storage
• For dynamic corpora
– Berkeley DB2 storage manager
– Can frequently add, modify and delete documents
• For static collections
– Index compression techniques (to be discussed)
Stopwords
• Function words and connectives
• Appear in large number of documents and are of little use in pinpointing documents
• Indexing stopwords
– Stopwords not indexed
• For reducing index space and improving performance
– Replace stopwords with a placeholder (to remember the offset)
• Issues
– Queries containing only stopwords ruled out
– Polysemous words that are stopwords in one sense but not in others
• E.g.: can as a verb vs. can as a noun
Stemming
• Conflating words to help match a query term with a morphological variant in the corpus.
• Remove inflections that convey parts of speech, tense and number
• E.g.: university and universal both stem to universe.
• Techniques
– morphological analysis (e.g., Porter's algorithm)
– dictionary lookup (e.g., WordNet).
• Stemming may increase recall but at the price of precision
– Abbreviations, polysemy and names coined in the technical and commercial sectors
– E.g.: Stemming “ides” to “IDE”, “SOCKS” to “sock”, “gated” to “gate”, may be bad!
Batch indexing and updates
• Incremental indexing
– Time-consuming due to random disk IO
– High level of disk block fragmentation
• Simple sort-merges.
– To replace the indexed update of variable-length postings
• For a dynamic collection
– single document-level change may need to update hundreds to thousands of records.
– Solution: create an additional “stop-press” index.
Maintaining indices over dynamic collections.
Relevance ranking
• Keyword queries
– In natural language
– Not precise, unlike SQL
• Boolean decision for response unacceptable
– Solution
• Rate each document for how likely it is to satisfy the user's information need
• Sort in decreasing order of the score
• Present results in a ranked list.
• No algorithmic way of ensuring that the ranking strategy always favors the information need
– Query: only a part of the user's information need
Responding to queries
• Set-valued response
– Response set may be very large
• (E.g., by recent estimates, over 12 million Web pages contain the word java.)
• Demanding selective query from user
• Guessing user's information need and ranking responses
• Evaluating rankings
Precision and interpolated precision plotted against recall for the given relevance vector. Missing r_k are zeroes.
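The plotted quantities can be computed from a relevance vector as sketched below. The assumptions are the usual definitions: r_k is 1 if the k-th ranked response is relevant, recall at rank k is the fraction of all relevant documents seen so far, precision at rank k is the fraction of the first k responses that are relevant, and interpolated precision at a recall level is the best precision reached at that recall or higher. The relevance vector shown is illustrative, not the one from the figure:

```python
def precision_recall_curve(relevance, total_relevant):
    """Return [(recall, precision)] after each rank k = 1..len(relevance)."""
    points, hits = [], 0
    for k, r in enumerate(relevance, start=1):
        hits += r
        points.append((hits / total_relevant, hits / k))
    return points


def interpolated_precision(points, recall_level):
    """Maximum precision over all ranks whose recall is at least recall_level."""
    candidates = [p for rec, p in points if rec >= recall_level]
    return max(candidates, default=0.0)


relevance = [1, 0, 1, 0, 0]          # r_k for the top five responses
points = precision_recall_curve(relevance, total_relevant=2)
```

Interpolation removes the sawtooth shape of the raw curve, which makes rankings from different systems easier to compare at fixed recall levels.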
Beyond keyword search
Bayesian Inferencing
Bayesian inference network for relevance ranking. A document is relevant to the extent that setting its corresponding belief node to true lets us assign a high degree of belief in the node corresponding to the query.
Manual specification of mappings between terms to approximate concepts.
Bayesian Inferencing (contd.)
• Four layers
1. Document layer
2. Representation layer
3. Query concept layer
4. Query
• Each node is associated with a random Boolean variable, reflecting belief
• Directed arcs signify that the belief of a node is a function of the belief of its immediate parents (and so on..)
NLP – Natural Language Processing
• Tokenisation – words, sentences
• Part‐of‐Speech (POS) tagging
– articles, nouns, verbs, adjective, number, … (87+?)
• Syntactical Parsing
• Shallow Parsing
Text Categorisation
• Indexing texts with controlled vocabularies
• Document sorting, text filtering
– topic
– spam
• Hierarchical web page categorisation
– Yahoo!, etc
• Single label vs multilabel
– overlapping
– binary
– Multilabel can be represented by binary classifiers
• Document‐pivoted or category‐pivoted
• Hard vs Soft
Bridging the gap: raw data vs actionable information
[Figure labels: Document Management Systems, Technical documentation, CRM, News Stories, Web pages; Tagging; Search, Personalisation, Analysis, Alerting, Decision Support; Machine Readable vs Machine Understandable]
Topics
• Information Retrieval
• Text Mining
• Web Mining
• Social Network Analysis
– friends, epidemiology, co‐authoring, co‐citation, espionage, …
• Graph Mining
Information Extraction (IE)
• Information extraction tasks…
Typical subtasks of IE are
• Content Noise Removal
– remove noise contents. For example, tag clouds, navigational menus, related contents, and context‐related advertisements.
• Named Entity Recognition (NER)
– recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
• Coreference resolution
– detection of coreference and anaphoric links between text entities. In IE tasks, this is typically restricted to finding links between previously extracted named entities. For example, "International Business Machines" and "IBM" refer to the same real world entity.
• Terminology extraction: finding the relevant terms for a given corpus
• Relationship Extraction: identification of relations between entities, such as:
– PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
– PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
Example from Feldman
Intelligent Auto-Tagging
(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson
…….
The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States.
``The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited. .''
……
<Facility>Finsbury Park Mosque</Facility>
<Country>England</Country>
<Country>England</Country>
<Country>France </Country>
<Country>United States</Country>
<Country>Belgium</Country>
<Person>Abu Hamza al-Masri</Person>
<PersonPositionOrganization><OFFLEN OFFSET="3576" LENGTH="33" /><Person>Abu Hamza al-Masri</Person><Position>chief cleric</Position><Organization>Finsbury Park Mosque</Organization></PersonPositionOrganization>
<City>London</City>
<PersonArrest><OFFLEN OFFSET="3814" LENGTH="61" /><Person>Abu Hamza al-Masri</Person><Location>London</Location><Date>1999</Date><Reason>his alleged involvement in a Yemen bomb plot</Reason></PersonArrest>
Intelligence Article
IE Accuracy by Information Type
Information Type    Accuracy
Entities            90-98%
Attributes          80%
Facts               60-70%
Events              50-60%
Ontologies
• Names, synonyms, objects…
• Controlled vocabularies
• Ontologies – definitions and relationships
Topics
• Information Retrieval
• Text Mining
• Web Mining
• Social Network Analysis
– friends, epidemiology, co‐authoring, co‐citation, espionage, …
• Graph Mining
History of Hypertext
• Citation
– Hyperlinking
• Ramayana, Mahabharata, Talmud
– branching, non-linear discourse, nested commentary
• Dictionary, encyclopedia
– self-contained networks of textual nodes
– joined by referential links
Hypertext systems
• Memex [Vannevar Bush, 1945]
– stands for “memory extension”
– photoelectrical-mechanical storage and computing device
– Aim: to create and help follow hyperlinks across documents
• Hypertext
– Coined by Ted Nelson
– Xanadu hypertext (1981): system with
• robust two-way hyperlinks, version management, controversy management, annotation and copyright management.
• http://www.robotwisdom.com/web/timeline.html
• HyperTerrorist's Timeline of Hypertext History
• Jorn Barger, 4 March 1996
World-wide Web
• Initiated at CERN (the European Organization for Nuclear Research)
– By Tim Berners-Lee
• GUIs
– Berners-Lee (1990)
– Erwise and Viola (1992), Midas (1993)
• Mosaic (1993)
– a hypertext GUI for the X-window system
– HTML: markup language for rendering hypertext
– HTTP: hypertext transfer protocol for sending HTML and other data over the Internet
– CERN HTTPD: server of hypertext documents
World Wide Web
Hypertext documents
• Text
• Links
Web
• billions of documents
• authored by millions of diverse people
• edited by no one in particular
• distributed over millions of computers, connected by variety of media
Chakrabarti: http://www.cse.iitb.ac.in/~soumen/mining-the-web/
The early days of the Web: CERN HTTP traffic grows by 1000 between 1991-1994 (image courtesy W3C)
The early days of the Web: The number of servers grows from a few hundred to a million between 1991 and 1997 (image courtesy Nielsen)
1994: the landmark year
• Foundation of the “Mosaic Communications Corporation”
• first World-wide Web conference
• MIT and CERN agreed to set up the World-wide Web Consortium (W3C).
Web: A populist, participatory medium
• number of writers =(approx) number of readers.
• the evolution of MEMES
– ideas, theories etc that spread from person to person by imitation.
– Now they have constructed the Internet !!
– E.g.: “Free speech online”, chain letters, and email viruses
Abundance and authority crisis
• liberal and informal culture of content generation and dissemination.
• Very little uniform civil code.
• redundancy and non-standard form and content.
• millions of qualifying pages for most broad queries
– Example: java or kayaking
• no authoritative information about the reliability of a site
Problems due to Uniform accessibility
• little support for adapting to the background of specific users.
• commercial interests routinely influence the operation of Web search
– “Search Engine Optimization” !!
Hypertext data
• Semi-structured or unstructured
– No schema
• Large number of attributes
Crawling and indexing
• Purpose of crawling and indexing
– quick fetching of large number of Web pages into a local repository
– indexing based on keywords
– Ordering responses to maximize user’s chances of the first few responses satisfying his information need.
• Earliest search engine: Lycos (Jan 1994)
• Followed by….
– Alta Vista (1995), HotBot and Inktomi, Excite
Topic directories
• Yahoo! directory
– to locate useful Web sites
• WWW-Wärk (www.ee), Neti (www.neti.ee)
• Efforts for organizing knowledge into ontologies
– Centralized: (Yahoo!)
– Decentralized: About.COM and the Open Directory
Clustering and classification
• Clustering
– discover groups in the set of documents such that documents within a group are more similar than documents across groups.
– Subjective disagreements due to
• different similarity measures
• Large feature sets
• Classification
– For assisting human efforts in maintaining taxonomies
– E.g.: IBM's Lotus Notes text processing system & Universal Database text extenders
Hyperlink analysis
• Take advantage of the structure of the Web graph.
– Indicators of prestige of a page (E.g. citations)
– HITS & PageRank
• Bibliometry
– bibliographic citation graph of academic papers
• Topic distillation
– Adapting to idioms of Web authorship and linking styles
Resource discovery and vertical portals
• Federations of crawling and search services
– each specializing in specific topical areas.
• Goal-driven Web resource discovery
– language analysis does not scale to billions of documents
– counter by throwing more hardware
Structured vs. Web data mining
• Traditional data mining
– data is structured and relational
– well-defined tables, columns, rows, keys, and constraints.
• Web data
– readily available data rich in features and patterns
– spontaneous formation and evolution of
• topic-induced graph clusters
• hyperlink-induced communities
• Goal of book: discovering patterns which are spontaneously driven by semantics
Crawl “all” Web pages?
• Problem: no catalog of all accessible URLs on the Web.
• Solution:
– start from a given set of URLs
– Progressively fetch and scan them for new outlinking URLs
– fetch these pages in turn…..
– Submit the text in page to a text indexing system
– and so on……….
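The loop above can be sketched against a toy in-memory "web" instead of real HTTP. `fetch` is a hypothetical callback returning (text, outlinks), `TOY_WEB` is made-up data, and the breadth-first order and the `max_pages` cut-off are choices of this sketch rather than something the slides prescribe:

```python
from collections import deque


def crawl(seeds, fetch, max_pages=100):
    """Fetch pages starting from `seeds`, following outlinks breadth-first."""
    frontier = deque(seeds)          # URLs waiting to be fetched
    seen = set(seeds)                # URLs already queued, to avoid refetching
    indexed = []
    while frontier and len(indexed) < max_pages:
        url = frontier.popleft()
        text, outlinks = fetch(url)
        indexed.append((url, text))  # stands in for submitting the text to an indexer
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed


TOY_WEB = {
    "a": ("page a", ["b", "c"]),
    "b": ("page b", ["a", "d"]),
    "c": ("page c", []),
    "d": ("page d", ["a"]),
    "e": ("page e", ["a"]),   # nothing links to "e", so it is never discovered
}
pages = crawl(["a"], lambda url: TOY_WEB[url])
```

Page "e" illustrates why a crawl from a seed set cannot guarantee that every accessible page is found.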
Crawling procedure
• Simple
– Great deal of engineering goes into industry-strength crawlers
– Industry crawlers crawl a substantial fraction of the Web
– E.g.: Alta Vista, Northern Lights, Inktomi
• No guarantee that all accessible Web pages will be located in this fashion
• Crawler may never halt …….
– pages will be added continually even as it is running.
Crawling overheads
• Delays involved in
– Resolving the host name in the URL to an IP address using DNS
– Connecting a socket to the server and sending the request
– Receiving the requested page in response
• Solution: Overlap the above delays by
– fetching many pages at the same time
Anatomy of a crawler.
• Page fetching threads
– Starts with DNS resolution
– Finishes when the entire page has been fetched
• Each page
– stored in compressed form to disk/tape
– scanned for outlinks
• Work pool of outlinks
– maintain network utilization without overloading it
• Dealt with by load manager
• Continue till the crawler has collected a sufficient number of pages.
Typical anatomy of a large-scale crawler.
Large-scale crawlers: performance and reliability considerations
• Need to fetch many pages at same time
– utilize the network bandwidth
– single page fetch may involve several seconds of network latency
• Highly concurrent and parallelized DNS lookups
• Use of asynchronous sockets
– Explicit encoding of the state of a fetch context in a data structure
– Polling socket to check for completion of network transfers
– Multi-processing or multi-threading: Impractical
• Care in URL extraction
– Eliminating duplicates to reduce redundant fetches
– Avoiding “spider traps”
What does WWW look like?
Graph G=(V,E)
• V – vertices (nodes)
• E – edges (arcs, connections)
Find a shortest path from station A to station B.
– need serious thinking to get a correct algorithm.
(Slide from CS4335 Design and Analysis of Algorithms, WANG Lusheng)
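One standard answer to the station problem is Dijkstra's algorithm, shown here as a sketch; the slide does not say which algorithm it has in mind. It assumes non-negative edge weights, and the station graph and its weights are invented for illustration:

```python
import heapq


def shortest_path(graph, source, target):
    """graph: {node: [(neighbour, weight), ...]}. Returns (distance, path) or (inf, [])."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            # Walk the predecessor links back to the source.
            path = [u]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return d, path[::-1]
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry; a shorter route to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return float("inf"), []


stations = {
    "A": [("C", 2), ("D", 5)],
    "C": [("A", 2), ("D", 1), ("B", 7)],
    "D": [("A", 5), ("C", 1), ("B", 3)],
    "B": [("C", 7), ("D", 3)],
}
distance, route = shortest_path(stations, "A", "B")
```

The greedy choice of always taking the first edge or the fewest hops fails on this graph, which is the point of the "serious thinking" remark: the direct two-hop routes cost 8 and 9, while the three-hop route is cheaper.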
Representation For An Undirected Graph
Representation For A Directed Graph
1 bit per entry
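The figures for these two slides are lost in the transcript; "1 bit per entry" suggests an adjacency matrix, so the sketch below assumes that representation. Vertex numbering from 0 and the example edges are illustrative:

```python
def adjacency_matrix(n, edges, directed=False):
    """n x n matrix with m[u][v] = 1 if there is an edge from u to v."""
    m = [[0] * n for _ in range(n)]
    for u, v in edges:
        m[u][v] = 1
        if not directed:
            m[v][u] = 1   # undirected edges are stored in both directions
    return m


edges = [(0, 1), (1, 2), (0, 2)]
undirected = adjacency_matrix(3, edges)
directed = adjacency_matrix(3, edges, directed=True)
```

The undirected matrix is symmetric, so only one triangle needs to be stored; the directed one generally is not.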
Topics
• Information Retrieval
• Text Mining
• Web Mining
• Social Network Analysis
– friends, epidemiology, co‐authoring, co‐citation, espionage, …
• Graph Mining
Skype: 400+M
Facebook: 350+M
Social network of 9/11 hijackers
Running Example
Hijackers by Flight
Flight 77: Pentagon    Flight 11: WTC 1      Flight 175: WTC 2    Flight 93: PA
Khalid Al-Midhar       Satam Al Suqami       Marwan Al-Shehhi     Saeed Alghamdi
Majed Moqed            Waleed M. Alshehri    Fayez Ahmed          Ahmed Alhaznawi
Nawaq Alhamzi          Wail Alshehri         Ahmed Alghamdi       Ahmed Alnami
Salem Alhamzi          Mohamed Atta          Hamza Alghamdi       Ziad Jarrahi
Hani Hanjour           Abdulaziz Alomari     Mohald Alshehri
http://www.cs.cmu.edu/~jure/pubs/msn-www08.pdf
240M nodes
30B conversations
180M, 1.3B
average 6.6 degrees
social aspects
Smaller scale questions
[Figure: graph linking Friends, Bars and Drinks]
http://yury.name/webguide/02webguide.pdf
[Figure: the same graph with a node v’ among Friends]
Suggestions:
[Figure: Friends (with node v’), Bars, Drinks]
Possible questions in SNA
• Prestige
• Centrality
• Co‐citation/co‐occurrence
• Radius/diameter
• …
Prestige
• p = vector of prestige values
• E ‐ citations
• E[i,j] – document i cites document j
• Calculate a new prestige where prestige of citing documents is added to current prestige
Prestige
• $p' = E^T p$
$p'[v] = \sum_u E[u,v]\, p[u] = \sum_u E^T[v,u]\, p[u] = (E^T p)[v]$
• Find fixpoint, starting with p=(1,…n)^T
• Ensure p is normalised at each step, ||p||=1
• p converges to the principal eigenvector of $E^T$
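The fixpoint computation can be sketched as plain power iteration. The assumptions are labelled: the start vector is all ones, the norm is Euclidean, the iteration count is fixed instead of testing for convergence, and the 3-document citation matrix is made up (it is strongly connected and aperiodic, so the iteration settles):

```python
import math


def prestige(E, iterations=100):
    """Power iteration p' = E^T p with normalisation; E[u][v] = 1 if u cites v."""
    n = len(E)
    p = [1.0] * n
    for _ in range(iterations):
        # p'[v] = sum over citing documents u of E[u][v] * p[u]
        new_p = [sum(E[u][v] * p[u] for u in range(n)) for v in range(n)]
        norm = math.sqrt(sum(x * x for x in new_p))
        if norm == 0:
            return new_p          # no citations at all: prestige is undefined
        p = [x / norm for x in new_p]
    return p


# Document 0 cites 1 and 2, document 1 cites 2, document 2 cites 0.
E = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
p = prestige(E)
```

Document 2 ends up with the highest prestige (it is cited by both other documents), followed by document 0, which is cited by the prestigious document 2.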
Centrality
• Importance notion based on centrality
• removing a central node disconnects the graph into large components
• d(u,v) ‐ the shortest‐path distance between u and v
• r(u) = max_v d(u,v) ‐ radius of node u
• arg min_u r(u) ‐ center of the graph
• Various other notions of centrality in the literature
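The radius and center definitions translate directly into code. A sketch, assuming an unweighted, connected, undirected graph stored as adjacency lists, so that breadth-first search gives d(u,v); the five-node path graph is illustrative:

```python
from collections import deque


def distances_from(graph, source):
    """Shortest-path distances d(source, v) by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def radius(graph, u):
    """r(u) = max_v d(u, v)."""
    return max(distances_from(graph, u).values())


def center(graph):
    """arg min_u r(u); ties are broken by iteration order."""
    return min(graph, key=lambda u: radius(graph, u))


path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
radii = {u: radius(path_graph, u) for u in path_graph}
```

This runs one BFS per node, which is fine for small graphs but too slow for networks with hundreds of millions of nodes.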
Topics
• Information Retrieval
• Text Mining
• Web Mining
• Social Network Analysis
– friends, epidemiology, co‐authoring, co‐citation, espionage, …
• Graph Mining
Finding the modules
Jüri Reimand: GraphWeb. Genome Informatics, CSHL. Nov 1 2007
Public datasets for H.sapiens
• IntAct: Protein interactions (PPI), 18773 interactions
• IntAct: PPI via orthologs from IntAct, 6705 interactions
• MEM: gene expression similarity over 89 tumor datasets, 46286 interactions
• Transfac: gene regulation data, 5183 interactions
Module evaluation
• GO: JAK-STAT cascade, Kinase inhibitor activity, Insulin receptor signaling pw. KEGG: Type II diabetes mellitus
• GO: Transforming growth factor beta signaling pw., embryonic development, gastrulation. KEGG: Cell cycle, cancers, WNT pw.
• GO: Brain development, Pigment granule, Melanin metabolic process
MCL clustering algorithm
• Markov Cluster algorithm (MCL), by Stijn van Dongen
– http://www.micans.org/mcl/
• Random walks according to edge weights
• Follow the different paths according to their probability
• Regions that are traversed “often” form clusters
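The random-walk idea can be sketched as the usual MCL loop of expansion (squaring the transition matrix, i.e. taking longer walks) and inflation (element-wise powers, which strengthen frequently used paths). This is a small dense-matrix illustration, not the mcl program from micans.org; the self-loops, the inflation value 2.0, the fixed iteration count and the 1e-6 threshold are assumptions of this sketch:

```python
def mcl(adjacency, inflation=2.0, iterations=50):
    """Cluster a weighted undirected graph given as a square adjacency matrix."""
    n = len(adjacency)

    def normalise(m):
        # Make every column sum to 1 (column-stochastic transition matrix).
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
        return m

    # Self-loops keep the random walk from oscillating.
    m = normalise([[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                   for i in range(n)])
    for _ in range(iterations):
        # Expansion: two-step random walks.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: boost strong transitions, weaken weak ones.
        m = normalise([[x ** inflation for x in row] for row in m])
    # Each nonzero row lists the nodes attracted to that row's node.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Two triangles joined by a single edge between nodes 2 and 3.
bridged = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(bridged)
```

Higher inflation values tend to produce more, smaller clusters; lower values produce fewer, larger ones.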
Approximation and compression = learning
http://www.cs.cmu.edu/~christos/PUBLICATIONS/sdm07-lsm.pdf
We need to understand
• how networks emerge
• what are their properties
• features
• how to calculate interesting/important characteristics
http://www.cs.cmu.edu/~deepay/mywww/papers/csur06.pdf
ACM Computing Surveys, Vol. 38, March 2006
• Graphs and networks – also other types
• http://www.weizmann.ac.il/mcb/UriAlon/
– => Complex networks
Ricardo Baeza-Yates
STACC
Software Technologies and Applications Competence Center
Jaak Vilo
November 26, 2009
Research Tracks and Programs
Data Integration and Mining (DIM)
• 1.1 Web Analytics and Social Network Analysis
– Partners: Delfi, Logica, Quretec, Regio, Skype
• 1.2 Biomedical data integration and mining
– Partners: ITK, Cybernetica, Quretec
• 1.3 Privacy‐Preserving Data Mining
– Partners: Cybernetica, Swedbank, Quretec
Software and Services Engineering
• 2.1 Smart Internet Interfaces
– Partners: Delfi, Regio, Webmedia
• 2.2 Smart Services
– Partners: Webmedia, Regio, Logica, Cybernetica
• 2.3 Software Development Productivity
– Partners: Webmedia, Logica, Cybernetica, KnowIT
1.1 Web Analytics and Social Network Analysis
• Methods for analysing the structure and dynamics of very large social networks and raw web usage data in order to discover user clusters, user goals, service or product consumption patterns, customer churn patterns, spam and fraud patterns and other patterns of individual or collective user behaviour.
• Application to user interface personalization and re‐organization, personalized search, targeted advertising, peer‐to‐peer network monitoring, and derivation of e‐Business metrics associated with advertising, customer acquisition and retention/churning, and business intelligence.
Partners: Delfi, Logica, Quretec, Regio, Skype
Activities
1. Social Network Analysis
• mining of very large (400M nodes, x 10 edges) graphs for elementary properties
2. Web (and software) log mining
• warehousing
• clustering, visualisation, decision support …
3. User Intent Prediction
• real time learning of user intent and goals
• collaborative filtering
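As a rough illustration of what "elementary properties" of a graph could mean in activity 1, the following minimal sketch computes node count, edge count, a degree histogram and the number of connected components in one pass over an undirected edge list. The function name `elementary_properties` and the toy edge list are hypothetical, chosen only for this example; a graph with hundreds of millions of nodes would need out-of-core or distributed processing rather than in-memory dictionaries.

```python
from collections import Counter


def elementary_properties(edges):
    """Summarise an undirected graph given as an iterable of (u, v) pairs."""
    degree = Counter()   # node -> number of incident edge endpoints
    parent = {}          # union-find forest for connected components

    def find(x):
        # Find the component representative of x, with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    num_edges = 0
    for u, v in edges:
        num_edges += 1
        degree[u] += 1
        degree[v] += 1
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    return {
        "nodes": len(degree),
        "edges": num_edges,
        "degree_histogram": dict(Counter(degree.values())),
        "components": len({find(x) for x in parent}),
    }


# Hypothetical toy network: a triangle a-b-c plus a separate pair d-e.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]
stats = elementary_properties(toy_edges)
print(stats)
```

The single streaming pass is a deliberate choice: the edge list is never held in memory, only the per-node degree and union-find entries are.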
1.2 Biomedical data integration and mining
• To develop data integration and analysis methods for electronic patient records and biomarker data with the goal of improving the (early) diagnosis and medical treatment of diseases.
• The complex disease COPD as the proof of principle
• Mining electronic patient records and E‐Health data
• UT – data mining; TTU – medical know‐how and data
• Hospital and e‐health solutions
Partners: ITK, Cybernetica, Quretec, + E-Health, Medicum, Finnish NIH
Activities
1. COPD patient cohort buildup based on smoking etc. characteristics
2. Data warehousing and decision support for hospital e‐health data
3. Introducing ontologies into the clinic
4. Text mining of medical records
1.3 Privacy‐Preserving Data Mining
To develop and to evaluate privacy‐enhancing methods for data storage and processing, along two complementary directions:
1. Security of micro‐data releases and query auditing: Detection and elimination of possible privacy breaches in data to be published, and detection of queries to (medical and financial) databases that may breach the privacy of individuals.
2. Privacy‐preserving data aggregation: Development of secure and practically efficient methods for aggregating data from multiple sources that leak nothing beyond the end aggregate results.
Partners: Cybernetica, Swedbank, Skype
Privacy preserving DM
Q: sexual behavior of HIV patients?
[Figure: individual answers (# sex partners: 2, 34, 0) are hidden behind random-looking values (876113, 764224, 9823, 1235, 68612, 126531); only the aggregate AVERAGE = 12 is revealed.]
ShareMind
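The slide's example (answers 2, 34 and 0 with an average of 12) can be reproduced with additive secret sharing, one common way to aggregate data without revealing individual inputs. The sketch below is an illustration under stated assumptions, not a description of ShareMind's actual protocol: the ring size `MODULUS`, the party count `NUM_PARTIES` and all function names are chosen for this example only, and everything runs in a single process instead of on separate servers.

```python
import secrets

MODULUS = 2**32   # assumed ring size for this illustration
NUM_PARTIES = 3   # assumed number of computing parties


def share(value, num_parties=NUM_PARTIES, modulus=MODULUS):
    """Split value into random shares that sum to value modulo modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares


def reconstruct(shares, modulus=MODULUS):
    """Recombine shares; any proper subset alone looks uniformly random."""
    return sum(shares) % modulus


def secure_sum(values, num_parties=NUM_PARTIES, modulus=MODULUS):
    """Each party adds up the shares it receives; only the total is opened."""
    per_party_totals = [0] * num_parties
    for value in values:
        for party, piece in enumerate(share(value, num_parties, modulus)):
            per_party_totals[party] = (per_party_totals[party] + piece) % modulus
    return reconstruct(per_party_totals, modulus)


answers = [2, 34, 0]            # the three answers shown on the slide
total = secure_sum(answers)     # no party ever sees an individual answer
average = total / len(answers)
print(average)                  # 12.0
```

Because addition commutes with sharing, each party only ever handles random-looking numbers, yet the opened total (and hence the average) is exact as long as the true sum stays below the modulus.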
Activities in 1.3
1. Micro‐data protection mechanisms
2. An environment for developing privacy‐preserving applications
3. Demonstration: questionnaire system
4. Protocol analysis for secret‐shared applications
4 Sep 2008
HPC @ UT anno 2009
• 42 nodes, 2 x 4-core CPU each: 336 cores
• 32 GB / node: 1.2 TB RAM
• Ultrafast network (Infiniband)
• UT Inst of CS: 32-core, 256 GB RAM
October 2009
The first record is dated 10/01/2009, last record is dated 10/31/2009.
                          Wallclock           Average  Average
Username  Group    #jobs       days  Percent   #nodes   q-days  Full name
--------  -----    -----  ---------  -------  -------  -------  ---------
TOTAL     -        36187    7293.91   100.00     1.64     0.07
cipo      users      363    3445.66    47.24     1.85     0.07  Heiki Kasemagi
thcmob    users    33397    1427.46    19.57     1.00     0.06  Juri Reimand
bronto    hirlam     119     759.92    10.42     8.21     0.12  Andres Luhamaa
ivan      users      102     566.69     7.77    48.87     0.13  Ivan Suhhonenko
bayazit1  users      649     510.45     7.00     1.00     0.31  Bayazit Yunusbaev
priit85   users       69     184.23     2.53     1.00     0.17  Priit Priimagi
maitb     users      467     180.12     2.47     1.00     0.21  Mait Metspalu
alfonsog  users       28     106.21     1.46     1.01     0.02  Alfonso Tlatoani Garcia-Sosa
jaas      users        8      57.53     0.79    13.80     0.04  Jaas Jezov
siims     users      124      21.12     0.29     1.00     0.01  Siim Sober
t6nuesko  users      629      16.87     0.23     1.00     0.08  Tonu Esko
a72094    users       40      13.72     0.19     1.00     0.03  Kaur Alasoo
wire      users       43       2.07     0.03     1.00     0.00  Lauri Juhan Liivamagi
lauria    users       95       0.89     0.01     9.31     0.00  Lauri Anton
reidar    users       46       0.81     0.01     1.00     1.29  Reidar Andreson
kasak105  users        4       0.17     0.00     8.00     0.09  Kait Kasak
eero      users        4       0.00     0.00     1.00     0.00  Eero Vainikko