05/03/23 1
Introduction to Graphs
15-111 Advanced Programming Concepts/Data Structures
Ananda Gunawardena
An Airline Route Map
Introduction
• Many real-world problems can be modeled using graphs
  – Airline route map
    • What is the fastest way to get from Pittsburgh to St. Louis?
    • What is the cheapest way to get from Pittsburgh to St. Louis?
  – Electric circuits
    • Circuit elements: transistors, resistors, capacitors
    • Is everything connected together?
      – Depends on the interconnections (wires)
    • If this circuit is built, will it work?
      – Depends on the wires and the objects they connect
Graphs
• More applications
  – Job scheduling
    • Interconnections indicate which jobs must be performed before others
    • When should each task be performed?
• All these questions can be answered using a mathematical structure called a “graph”. We will answer the questions
  – What are graphs?
  – What are their basic properties?
Graph Definitions
• Graph
  – A set of vertices (nodes) V = {v1, v2, …, vn}
  – A set of edges (arcs) that connect the vertices, E = {e1, e2, …, em}
  – Each edge ei is a pair (v, w) where v, w in V
  – |V| = number of vertices (cardinality)
  – |E| = number of edges
• Graphs can be
  – Directed (order of (v, w) matters)
  – Undirected (order of (v, w) doesn’t matter)
• Edges can be
  – Weighted (cost associated with the edge)
  – e.g. neural network, airline route map (Vanguard Airlines)
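The definitions above can be written down almost literally. The following is a minimal Python sketch, assuming a small made-up route map: the airport codes other than Pittsburgh (PIT) and St. Louis (STL), all costs, and the helper name `has_edge` are illustrative and not from the slides.

```python
# Vertices V of a small, made-up airline route map.
V = {"PIT", "STL", "ORD", "MCI"}

# Weighted edges E: each key is a pair (v, w), each value is the edge cost.
E = {
    ("PIT", "ORD"): 120,
    ("ORD", "STL"): 90,
    ("PIT", "MCI"): 200,
    ("MCI", "STL"): 60,
}

def has_edge(v, w, directed=True):
    """Return True if (v, w) is an edge; for undirected graphs the order is ignored."""
    if (v, w) in E:
        return True
    return (not directed) and (w, v) in E

num_vertices = len(V)  # |V|
num_edges = len(E)     # |E|
```

Read as a directed graph, `has_edge("ORD", "PIT")` is false because only the pair ("PIT", "ORD") was stored. Read as an undirected graph (`directed=False`), the same call is true.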
Graph Representation
• How do we represent a graph internally?
• Two ways
  – Adjacency matrix
  – Adjacency list
• Adjacency Matrix
  – Use matrix entries to represent edges in the graph
• Adjacency List
  – Use an array of lists to represent edges in the graph (we will discuss this later)
Adjacency Matrix
• Adjacency Matrix
  – For each edge (v, w) in E, set A[v][w] = edge_cost
  – Nonexistent edges are marked with a logical infinity
• Cost of implementation
  – O(|V|^2) time for initialization
  – O(|V|^2) space
    • OK for dense graphs
    • Unacceptable for sparse graphs
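A minimal Python sketch of the adjacency matrix described above. It assumes vertices are numbered 0 to |V|-1 and that the graph is directed. The function name `build_adjacency_matrix` and the sample edges are hypothetical.

```python
import math

INF = math.inf  # "logical infinity" marking a nonexistent edge

def build_adjacency_matrix(num_vertices, weighted_edges):
    """Build a |V| x |V| cost matrix from (v, w, cost) triples of a directed graph."""
    # Initialization alone costs O(|V|^2) time and space.
    A = [[INF] * num_vertices for _ in range(num_vertices)]
    for v, w, cost in weighted_edges:
        A[v][w] = cost
    return A

# Hypothetical 4-vertex graph with three edges.
edges = [(0, 1, 5), (0, 2, 3), (2, 3, 7)]
A = build_adjacency_matrix(4, edges)
```

Here 13 of the 16 entries hold infinity. That wasted space is why the matrix suits dense graphs and not sparse ones.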
Adjacency List
• Adjacency List
  – Ideal solution for sparse graphs
  – For each vertex keep a list of all adjacent vertices
  – Adjacent vertices are the vertices that are connected to the vertex directly by an edge
  – Example
    [Figure: adjacency lists List 0, List 1 and List 2 for a graph on vertices 0, 1 and 2]
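A minimal Python sketch of an adjacency list, assuming vertices are numbered from 0. The example graph is made up and is not a reconstruction of the slide's figure. `build_adjacency_list` is a hypothetical helper name.

```python
def build_adjacency_list(num_vertices, edges, directed=True):
    """Return a list of lists: adj[v] holds every vertex adjacent to v."""
    adj = [[] for _ in range(num_vertices)]  # one (initially empty) list per vertex
    for v, w in edges:
        adj[v].append(w)                     # one list node per directed edge
        if not directed:
            adj[w].append(v)                 # undirected: store both directions
    return adj

# Hypothetical directed graph on vertices 0, 1, 2.
adj = build_adjacency_list(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
```

Construction touches each edge once, so it takes time linear in the number of edges plus the time to create the |V| empty lists.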
Adjacency List
• The number of list nodes equals the number of edges
  – O(|E|) space
• Space is also required to store the lists
  – O(|V|) for |V| lists
• Note that the number of edges is at least ⌈|V|/2⌉ (|V|/2 rounded up)
  – Assuming each vertex is in some edge
  – Therefore disregard any O(|V|) term when O(|E|) is present
• Adjacency list can be constructed in linear time (with respect to the number of edges)
Breadth First Traversal
• Algorithm
  – Start from any node in the graph
  – Traverse its neighbors (nodes that are directly connected to it) using some heuristic
  – Next, traverse the neighbors of the neighbors, etc., until some limit is reached or all the nodes in the graph are visited
  – Use a queue to perform the breadth first traversal
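The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: neighbors are visited in the order they appear in the adjacency list (standing in for "some heuristic"), there is no depth or size limit, and the sample graph is made up.

```python
from collections import deque

def breadth_first(adj, start):
    """Return the vertices reachable from start in breadth-first order."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        v = queue.popleft()      # take the oldest discovered vertex
        order.append(v)
        for w in adj[v]:         # neighbors, in adjacency-list order
            if w not in visited:
                visited.add(w)   # mark on discovery so nothing is queued twice
                queue.append(w)
    return order

# Hypothetical undirected graph stored as adjacency lists.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
order = breadth_first(adj, 0)
```

Because the queue is first-in, first-out, all neighbors of the start node are visited before any neighbor of a neighbor.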
Depth First Traversal
• Algorithm
  – Start from any node in the graph
  – Traverse deeper and deeper until a dead end is reached
  – Backtrack and traverse other nodes that have not been visited
  – Use a stack to perform the depth first traversal
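A minimal Python sketch of the depth first traversal with an explicit stack. Two choices here are assumptions rather than part of the slides: the order in which neighbors are pushed, and the sample graph.

```python
def depth_first(adj, start):
    """Return the vertices reachable from start in depth-first order."""
    visited = set()
    order = []
    stack = [start]
    while stack:
        v = stack.pop()          # take the most recently discovered vertex
        if v in visited:
            continue             # already explored via another path
        visited.add(v)
        order.append(v)
        # Push in reverse so the first-listed neighbor is explored first.
        for w in reversed(adj[v]):
            if w not in visited:
                stack.append(w)
    return order

# Hypothetical undirected graph stored as adjacency lists.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
order = depth_first(adj, 0)
```

Popping from the stack is what performs the backtracking: when a vertex has no unvisited neighbors, the next pop returns to the most recent vertex that still has unexplored branches.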
Web as a Graph
[Figure: a graph whose nodes are web pages, labeled URL 1 through URL 7]
Web Algorithms
• Search
  – Google, MSN, Altavista
• Image search
  – Games
• Routing
• Distributed Computing
• Shortest Path Algorithms
  – Google Maps, MapQuest
• Semantic Web
  – XML metadata
• Etc.
Web Search Engines: A Cool Application of Graphs
Building a Search Engine
• Crawl the web
• Build a web index
• Then when we build/search, we may have to sort the index
  – Google sorts more than 100 billion index items
• Novel algorithms, novel data structures, distributed computing
A Basic Search Engine Architecture
Google Architecture
Google’s Server Farm
Web Crawlers
• Start with an initial page P0. Find URLs on P0 and add them to a queue
• When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat
• Can be specialized (e.g. only look for email addresses)
• Issues
  – Which page to look at next? (Special subjects, recency)
  – How deep within a site do you go (depth search)?
  – How frequently to visit pages?
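The crawl loop above is a breadth first traversal of the web graph. The sketch below runs it over a simulated web, a dictionary mapping each page to the URLs found on it, so that it is runnable without network access. The `crawl` function, the `max_pages` limit, and the page names beyond P0 and P1 are assumptions made for illustration.

```python
from collections import deque

def crawl(web, start, max_pages=100):
    """Crawl a simulated web (dict: URL -> list of URLs found on that page)."""
    queue = deque([start])
    seen = {start}       # URLs already queued, so no page is processed twice
    indexed = []         # stands in for "pass it to an indexing program"
    while queue and len(indexed) < max_pages:
        page = queue.popleft()
        for url in web.get(page, []):   # find the URLs on this page
            if url not in seen:
                seen.add(url)
                queue.append(url)
        indexed.append(page)            # done with the page: hand it to the indexer
    return indexed

# Hypothetical four-page web.
web = {
    "P0": ["P1", "P2"],
    "P1": ["P2", "P3"],
    "P2": ["P0"],
    "P3": [],
}
pages = crawl(web, "P0")
```

The issues listed on the slide map onto this loop. Replacing the plain queue with a priority queue changes which page is looked at next, and `max_pages` is one crude way to bound how far the crawl goes.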
So, Why Spider the Web?
• Refresh collection by deleting dead links
  – OK if index is slightly smaller
  – Done every 1-2 weeks in best engines
• Finding new sites
  – Respider the entire web
  – Done every 2-4 weeks in best engines
Cost of Spidering
• Spider can (and does) run in parallel on hundreds of servers
• Very high network connectivity (e.g. T3 line)
• Servers can migrate from spidering to query processing depending on time-of-day load
• Running a full web spider takes days even with hundreds of dedicated servers
Indexing
• Arrangement of data (data structure) to permit fast searching
• Which list is easier to search?
  – sow fox pig eel yak hen ant cat dog hog
  – ant cat dog eel fox hen hog pig sow yak
• Sorting helps. Why?
  – Permits binary search. About log2(n) probes into the list
    • log2(1 billion) ~ 30
  – Permits interpolation search. About log2(log2(n)) probes
    • log2(log2(1 billion)) ~ 5
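A minimal Python sketch of binary search that also counts probes, run on the slide's sorted word list, followed by the two estimates for a list of one billion items. The `binary_search` function is illustrative. The log2(log2(n)) figure for interpolation search is an average-case estimate that assumes evenly distributed keys.

```python
import math

def binary_search(sorted_items, target):
    """Return (index, probes); index is -1 if target is absent."""
    lo, hi = 0, len(sorted_items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                      # one probe into the list
        if sorted_items[mid] == target:
            return mid, probes
        if sorted_items[mid] < target:
            lo = mid + 1                 # discard the lower half
        else:
            hi = mid - 1                 # discard the upper half
    return -1, probes

words = sorted("sow fox pig eel yak hen ant cat dog hog".split())
index, probes = binary_search(words, "pig")

n = 1_000_000_000
binary_estimate = math.log2(n)                    # roughly 30 probes
interpolation_estimate = math.log2(math.log2(n))  # roughly 5 probes
```

Each probe halves the remaining range, which is where the log2(n) bound comes from. On the unsorted list, the only option is to scan the words one by one.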
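Inverted Files
• A file is a list of words by position
  – First entry is the word in position 1 (first word)
  – Entry 4562 is the word in position 4562 (4562nd word)
  – Last entry is the last word
• An inverted file is a list of positions by word!

FILE
POS  TEXT
1    A file is a list of words by position
10   First entry is the word in position 1 (first word)
20   Entry 4562 is the word in position 4562 (4562nd word)
30   Last entry is the last word
36   An inverted file is a list of positions by word!

INVERTED FILE
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (43)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)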
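A minimal Python sketch that builds an inverted file, using the slide's own text as the file. The tokenization is an assumption, not something the slides specify: words are lowercased, punctuation is dropped, and positions are counted from 1. The function name `invert` is illustrative.

```python
import re
from collections import defaultdict

def invert(text):
    """Map each word to the list of 1-based positions where it occurs."""
    inverted = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, strip punctuation
    for position, word in enumerate(words, start=1):
        inverted[word].append(position)
    return dict(inverted)

text = (
    "A file is a list of words by position "
    "First entry is the word in position 1 (first word) "
    "Entry 4562 is the word in position 4562 (4562nd word) "
    "Last entry is the last word "
    "An inverted file is a list of positions by word!"
)
inverted = invert(text)
```

Looking up `inverted["file"]` returns every position of the word directly, without scanning the text. That direct lookup is what makes the structure useful for search.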
Inverted Files for Multiple Documents

LEXICON
WORD        NDOCS  PTR
jezebel     20
jezer       3
jezerit     1
jeziah      1
jeziel      1
jezliah     1
jezoar      1
jezrahliah  1
jezreel     39

WORD INDEX
DOCID  OCCUR  POS 1  POS 2  . . .
107    4      322  354  381  405
232    6      15  195  248  1897  1951  2192
677    1      481
713    3      42  312  802
34     6      1  118  2087  3922  3981  5002
44     3      215  2291  3010
56     4      5  22  134  992
566    3      203  245  287
67     1      132
. . .

“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
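A minimal Python sketch of the two-level structure above: a lexicon giving the number of documents per word, and a word index giving, per document, the occurrence count and positions. The documents, their ids, and the function names are made up. A Python dictionary stands in for the lexicon's PTR column.

```python
from collections import defaultdict

def build_index(documents):
    """documents: dict of doc id -> text. Returns (lexicon, word_index)."""
    word_index = defaultdict(dict)  # word -> {doc_id: [positions]}
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            word_index[word].setdefault(doc_id, []).append(position)
    # Lexicon: word -> number of documents containing it (NDOCS).
    lexicon = {word: len(postings) for word, postings in word_index.items()}
    return lexicon, dict(word_index)

def postings_rows(word_index, word):
    """Rows of (DOCID, OCCUR, positions) for one word, ordered by document id."""
    return [(doc_id, len(pos), pos) for doc_id, pos in sorted(word_index[word].items())]

# Hypothetical documents; the ids and texts are invented for illustration.
documents = {
    34: "jezebel spoke and jezebel left",
    44: "the city of jezreel",
    56: "jezebel went to jezreel",
}
lexicon, word_index = build_index(documents)
rows = postings_rows(word_index, "jezebel")
```

Each row returned by `postings_rows` has the same shape as a line of the word index: a document id, an occurrence count, and the positions.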
Ranking (Scoring) Hits
• Hits must be presented in some order
• What order?
  – Relevance, recency, popularity, reliability, alphabetic?
• Some ranking methods
  – Presence of keywords in title of document
  – Closeness of keywords to start of document
  – Frequency of keyword in document
  – Link popularity (how many pages point to this one)
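One way to combine the four ranking methods above is a weighted sum. The following Python sketch is purely illustrative: the weights, the document fields (`title`, `body`, `inbound_links`), and the sample documents are all assumptions, and it does not describe how any real search engine scores hits.

```python
def score(doc, keyword):
    """Combine four simple signals into one score (the weights are arbitrary)."""
    keyword = keyword.lower()
    words = doc["body"].lower().split()
    in_title = keyword in doc["title"].lower().split()  # keyword present in title
    frequency = words.count(keyword)                    # keyword frequency in body
    # Closeness to the start: 1.0 if the keyword is the first word, lower if later.
    closeness = 1.0 - words.index(keyword) / len(words) if frequency else 0.0
    popularity = doc["inbound_links"]                   # pages pointing to this one
    return 3.0 * in_title + 1.0 * frequency + 2.0 * closeness + 0.5 * popularity

# Hypothetical hits for the query "graph".
docs = [
    {"title": "Cooking", "body": "a recipe mentions a graph", "inbound_links": 1},
    {"title": "Graph basics", "body": "graph theory studies a graph", "inbound_links": 4},
]
ranked = sorted(docs, key=lambda d: score(d, "graph"), reverse=True)
```

Sorting by descending score puts the document with the keyword in its title, near its start, and with more inbound links first.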