Introduction to Computing Using Python
Data Storage and Processing
Databases and SQL
Python Database Programming
List comprehension and MapReduce
Parallel Computing
Data storage
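[Figure: five web pages linked by hyperlinks, with the number of occurrences of each word in each page: one.html (Beijing × 3, Paris × 5, Chicago × 5), two.html (Bogota × 3, Beijing × 2, Paris × 1), three.html (Chicago × 3, Beijing × 6), four.html (Chicago × 3, Paris × 2, Nairobi × 1), five.html (Nairobi × 7, Bogota × 2)]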
The data collected by a web crawler can be stored in a text file
Data storage
URL word count
http://reed.cs.depaul.edu/lperkovic/one.html Paris 5
http://reed.cs.depaul.edu/lperkovic/one.html Beijing 3
http://reed.cs.depaul.edu/lperkovic/one.html Chicago 5
URL link
http://reed.cs.depaul.edu/lperkovic/one.html http://reed.cs.depaul.edu/lperkovic/two.html
http://reed.cs.depaul.edu/lperkovic/one.html http://reed.cs.depaul.edu/lperkovic/three.html
URL word count
http://reed.cs.depaul.edu/lperkovic/two.html Bogota 3
http://reed.cs.depaul.edu/lperkovic/two.html Paris 1
http://reed.cs.depaul.edu/lperkovic/two.html Beijing 2
URL link
http://reed.cs.depaul.edu/lperkovic/two.html http://reed.cs.depaul.edu/lperkovic/four.html
URL word count
http://reed.cs.depaul.edu/lperkovic/four.html Paris 2
...
Data storage
A search engine app may then need to access this file to make queries such as
1. In which web pages does word X appear?
2. What is the ranking of web pages containing word X, based on the number of occurrences of word X in the page?
3. How many pages contain word X?
4. What pages have a hyperlink to page Y?
5. What is the total number of occurrences of word ‘Paris’ across all web pages?
6. How many outgoing links does each visited page have?
7. How many incoming links does each visited page have?
8. What pages have a link to a page containing word X?
9. What page containing word X has the most incoming links?
A text file is not ideal for this ...
Database files
The data collected by a web crawler can be stored in a text file ...
... or in a database file
Table Hyperlinks
Url Link
one.html two.html
one.html three.html
two.html four.html
three.html four.html
four.html five.html
five.html one.html
five.html two.html
five.html four.html
Table Keywords
Url Word Freq
one.html Beijing 3
one.html Paris 5
one.html Chicago 5
two.html Bogota 3
two.html Beijing 2
two.html Paris 1
three.html Chicago 3
three.html Beijing 6
four.html Chicago 3
four.html Paris 2
four.html Nairobi 5
five.html Nairobi 7
five.html Bogota 2
Database files
A database file consists of one or more tables
Each table has a name and consists of rows and columns
Each column has a name and contains data of a specific type
Each row is a database record
Database files
Database files are not read from or written to directly
Instead, “read/write” commands are sent to a special type of server program called a database engine that manages the database
The database engine accesses the database file on the user’s behalf
The commands accepted by database engines are statements written in the Structured Query Language (SQL)
SQL SELECT FROM statement
SQL statement SELECT is used to make queries into a database
SELECT Link FROM Hyperlinks
Result table:
Link
two.html
three.html
four.html
four.html
five.html
one.html
two.html
four.html
SQL SELECT FROM statement
SQL statement SELECT is used to make queries into a database.
SELECT Url, Word FROM Keywords
Result table:
Url Word
one.html Beijing
one.html Paris
one.html Chicago
two.html Bogota
two.html Beijing
two.html Paris
three.html Chicago
three.html Beijing
four.html Chicago
four.html Paris
four.html Nairobi
five.html Nairobi
five.html Bogota
SQL SELECT FROM statement
SQL statement SELECT is used to make queries into a database
The wildcard * selects every column of the table
SELECT * FROM Hyperlinks
Result table:
Url Link
one.html two.html
one.html three.html
two.html four.html
three.html four.html
four.html five.html
five.html one.html
five.html two.html
five.html four.html
SQL DISTINCT keyword
SQL keyword DISTINCT removes duplicate records in the result table
SELECT DISTINCT Link FROM Hyperlinks
Result table:
Link
two.html
three.html
four.html
five.html
one.html
SQL WHERE clause
SQL clause WHERE is used to select only those records that satisfy a condition
“In which pages does word X appear?”
SELECT Url FROM Keywords
WHERE Word = 'Paris'
Result table:
Url
one.html
two.html
four.html
Operator Explanation
= Equal
<> Not equal
> Greater than
< Less than
>= Greater than or equal
<= Less than or equal
BETWEEN Within an inclusive range
SQL WHERE clause
SQL clause WHERE is used to select only those records that satisfy a condition
SELECT Column(s) FROM Table
WHERE Column operator value
SELECT Column(s) FROM Table
WHERE Column BETWEEN value1 AND value2
SQL keyword DESC
SQL keyword DESC is used to order the records in the result table in descending order
“What is the ranking of web pages containing word X, based on the number of occurrences of word X in the page?”
SELECT Url, Freq FROM Keywords
WHERE Word = 'Paris'
ORDER BY Freq DESC
Result table:
Url Freq
one.html 5
four.html 2
two.html 1
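The WHERE, BETWEEN, and ORDER BY ... DESC queries above can be tried out from Python. The sketch below is only an illustration: it uses module sqlite3 (introduced later in these slides) and, as an assumption, an in-memory database loaded with the Keywords records rather than a database file.

```python
import sqlite3

# In-memory database (an assumption; the slides use a database file)
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
cur.executemany("INSERT INTO Keywords VALUES (?, ?, ?)", [
    ('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
    ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
    ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
    ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
    ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
    ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
    ('five.html', 'Bogota', 2),
])

# WHERE + ORDER BY ... DESC: pages containing 'Paris', most occurrences first
cur.execute("SELECT Url, Freq FROM Keywords "
            "WHERE Word = 'Paris' ORDER BY Freq DESC")
ranking = cur.fetchall()
print(ranking)         # [('one.html', 5), ('four.html', 2), ('two.html', 1)]

# WHERE + BETWEEN: records whose Freq lies in the inclusive range 3..5
cur.execute("SELECT * FROM Keywords WHERE Freq BETWEEN 3 AND 5")
in_range = cur.fetchall()
print(len(in_range))   # 7

con.close()
```

Without the ORDER BY clause, SQL does not promise any particular order for the records in the result table.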
Exercise
Write an SQL query that returns:
1. The URL of every page that has a link to web page four.html
SELECT DISTINCT Url FROM Hyperlinks
WHERE Link = 'four.html'
2. The URL of every page that has an incoming link from page four.html
SELECT DISTINCT Link FROM Hyperlinks
WHERE Url = 'four.html'
3. The URL and word for every word that appears exactly three times in the web page associated with the URL
SELECT Url, Word FROM Keywords
WHERE Freq = 3
4. The URL, word, and frequency for every word that appears between 3 and 5 times, inclusive, in the web page associated with the URL
SELECT * FROM Keywords
WHERE Freq BETWEEN 3 AND 5
SQL built-in functions
SQL includes built-in aggregate functions such as COUNT() and SUM()
“How many pages contain the word Paris?”
SELECT COUNT(*) FROM Keywords
WHERE Word = 'Paris'
Result: 3
“What is the total number of occurrences of word ‘Paris’ across all web pages?”
SELECT SUM(Freq) FROM Keywords
WHERE Word = 'Paris'
Result: 8
SQL GROUP BY clause
SQL clause GROUP BY groups the records of a table that have the same value in a column
“How many outgoing links does each web page have?”
SELECT Url, COUNT(*) FROM Hyperlinks
GROUP BY Url
Result table:
Url COUNT(*)
one.html 2
two.html 1
three.html 1
four.html 1
five.html 3
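As a rough check of the COUNT(), SUM(), and GROUP BY results above, the queries can be run from Python. This is a sketch only: it assumes an in-memory sqlite3 database filled with the records of tables Hyperlinks and Keywords.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # in-memory database (an assumption)
cur = con.cursor()
cur.execute("CREATE TABLE Hyperlinks (Url text, Link text)")
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
cur.executemany("INSERT INTO Hyperlinks VALUES (?, ?)", [
    ('one.html', 'two.html'), ('one.html', 'three.html'),
    ('two.html', 'four.html'), ('three.html', 'four.html'),
    ('four.html', 'five.html'), ('five.html', 'one.html'),
    ('five.html', 'two.html'), ('five.html', 'four.html'),
])
cur.executemany("INSERT INTO Keywords VALUES (?, ?, ?)", [
    ('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
    ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
    ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
    ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
    ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
    ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
    ('five.html', 'Bogota', 2),
])

# COUNT(*): number of records (pages) containing 'Paris'
cur.execute("SELECT COUNT(*) FROM Keywords WHERE Word = 'Paris'")
paris_pages = cur.fetchone()[0]

# SUM(Freq): total number of occurrences of 'Paris'
cur.execute("SELECT SUM(Freq) FROM Keywords WHERE Word = 'Paris'")
paris_total = cur.fetchone()[0]

# GROUP BY: one result record per distinct Url, with its outgoing link count
cur.execute("SELECT Url, COUNT(*) FROM Hyperlinks GROUP BY Url")
outgoing = dict(cur.fetchall())     # a dict, since group order is not promised

print(paris_pages, paris_total)     # 3 8
print(outgoing['five.html'])        # 3
con.close()
```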
Exercise
Write an SQL query that returns:
1. The number of words, including duplicates, that page two.html contains
SELECT SUM(Freq) FROM Keywords
WHERE Url = 'two.html'
2. The number of distinct words page two.html contains
SELECT COUNT(*) FROM Keywords
WHERE Url = 'two.html'
3. The number of words, including duplicates, that each web page has
SELECT Url, SUM(Freq) FROM Keywords
GROUP BY Url
4. The number of incoming links each web page has
SELECT Link, COUNT(*) FROM Hyperlinks
GROUP BY Link
SQL queries involving multiple tables
“What web pages have a link to a page containing word ‘Bogota’?”
This question requires a lookup of both tables:
• Look up Keywords to find the set S of URLs of pages containing word ‘Bogota’
• Then look up Hyperlinks to find the URLs of pages with links to pages in S
SQL queries involving multiple tables
The SELECT statement can be used on multiple tables.
SELECT * FROM Hyperlinks, Keywords
Result table (the first two columns come from Hyperlinks, the last three from Keywords):
Url Link Url Word Freq
one.html two.html one.html Beijing 3
one.html two.html one.html Paris 5
one.html two.html one.html Chicago 5
one.html two.html two.html Bogota 3
... ... ... ... ...
five.html four.html four.html Nairobi 5
five.html four.html five.html Nairobi 7
five.html four.html five.html Bogota 2
The result table is the cross join of tables Hyperlinks and Keywords
• It has 104 records, each a combination of a record in Hyperlinks and a record in Keywords
• It has five named columns corresponding to the two columns of table Hyperlinks and three columns of table Keywords.
SQL queries involving multiple tables
The SELECT statement can be used on multiple tables.
SELECT * FROM Hyperlinks, Keywords
WHERE Hyperlinks.Link = Keywords.Url
Result table (the first two columns come from Hyperlinks, the last three from Keywords):
Url Link Url Word Freq
one.html two.html two.html Bogota 3
one.html two.html two.html Beijing 2
one.html two.html two.html Paris 1
one.html three.html three.html Chicago 3
... ... ... ... ...
five.html four.html four.html Paris 2
five.html four.html four.html Nairobi 5
SQL queries involving multiple tables
“What web pages have a link to a page containing word ‘Bogota’?”
SELECT * FROM Hyperlinks, Keywords
WHERE Keywords.Word = 'Bogota' AND Hyperlinks.Link = Keywords.Url
Result table (the first two columns come from Hyperlinks, the last three from Keywords):
Url Link Url Word Freq
one.html two.html two.html Bogota 3
four.html five.html five.html Bogota 2
five.html two.html two.html Bogota 3
SQL queries involving multiple tables
“What web pages have a link to a page containing word ‘Bogota’?”
SELECT Hyperlinks.Url FROM Hyperlinks, Keywords
WHERE Keywords.Word = 'Bogota' AND Hyperlinks.Link = Keywords.Url
Result table:
Url
one.html
four.html
five.html
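The two-table query can be wrapped in a small Python function. The sketch below is illustrative only: the function name pages_linking_to_word is hypothetical, the database is an in-memory one (an assumption), and DISTINCT is added so that a page is reported once even if it links to several matching pages.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # in-memory database (an assumption)
cur = con.cursor()
cur.execute("CREATE TABLE Hyperlinks (Url text, Link text)")
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
cur.executemany("INSERT INTO Hyperlinks VALUES (?, ?)", [
    ('one.html', 'two.html'), ('one.html', 'three.html'),
    ('two.html', 'four.html'), ('three.html', 'four.html'),
    ('four.html', 'five.html'), ('five.html', 'one.html'),
    ('five.html', 'two.html'), ('five.html', 'four.html'),
])
cur.executemany("INSERT INTO Keywords VALUES (?, ?, ?)", [
    ('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
    ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
    ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
    ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
    ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
    ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
    ('five.html', 'Bogota', 2),
])

def pages_linking_to_word(cur, word):
    'hypothetical helper: sorted URLs of pages linking to a page containing word'
    cur.execute("SELECT DISTINCT Hyperlinks.Url FROM Hyperlinks, Keywords "
                "WHERE Keywords.Word = ? AND Hyperlinks.Link = Keywords.Url",
                (word,))
    # each record is a 1-tuple such as ('one.html',)
    return sorted(url for (url,) in cur.fetchall())

bogota_linkers = pages_linking_to_word(cur, 'Bogota')
print(bogota_linkers)   # ['five.html', 'four.html', 'one.html']
con.close()
```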
SQL CREATE TABLE statement
SQL statement CREATE TABLE is used to create a table in a database file
CREATE TABLE Keywords (
    Url text,
    Word text,
    Freq int
)
The result is an empty table Keywords with columns Url, Word, and Freq
In general:
CREATE TABLE TableName (
    Column1 dataType1,
    Column2 dataType2,
    ...
)
SQL Type Python Type Explanation
INTEGER int Holds integer values
REAL float Holds floating-point values
TEXT str Holds string values, delimited with quotes
BLOB bytes Holds sequence of bytes
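The correspondence between SQL types and Python types in the table above can be observed with a short experiment. This is a sketch under assumptions: the table name Sample and its column names are made up for illustration, and the database is an in-memory one. SQLite's built-in typeof() function reports the storage type of a value.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # in-memory database (an assumption)
cur = con.cursor()

# Hypothetical table with one column of each SQL type from the table above
cur.execute("CREATE TABLE Sample (n INTEGER, x REAL, s TEXT, data BLOB)")
cur.execute("INSERT INTO Sample VALUES (?, ?, ?, ?)",
            (3, 2.5, 'Paris', b'\x00\x01'))

# Python types of the values read back from the table
cur.execute("SELECT n, x, s, data FROM Sample")
python_types = [type(value).__name__ for value in cur.fetchone()]

# SQL storage types of the same values, as reported by typeof()
cur.execute("SELECT typeof(n), typeof(x), typeof(s), typeof(data) FROM Sample")
sql_types = cur.fetchone()

print(python_types)   # ['int', 'float', 'str', 'bytes']
print(sql_types)      # ('integer', 'real', 'text', 'blob')
con.close()
```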
SQL INSERT statement
SQL statement INSERT is used to add a record to a table
INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)
Table Keywords after the insertion:
Url Word Freq
one.html Beijing 3
SQL UPDATE statement
SQL statement UPDATE is used to modify a record in a table
UPDATE Keywords SET Freq = 4
WHERE Url = 'two.html' AND Word = 'Bogota'
In table Keywords, record (two.html, Bogota, 3) becomes (two.html, Bogota, 4); all other records are unchanged
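The effect of the UPDATE statement can be checked from Python. The following is a minimal sketch, assuming an in-memory sqlite3 database holding just three of the Keywords records; attribute rowcount of the Cursor object reports how many records the statement modified.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # in-memory database (an assumption)
cur = con.cursor()
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
cur.executemany("INSERT INTO Keywords VALUES (?, ?, ?)", [
    ('two.html', 'Bogota', 3),
    ('two.html', 'Beijing', 2),
    ('five.html', 'Bogota', 2),
])

# Only the record matching both conditions of the WHERE clause is modified
cur.execute("UPDATE Keywords SET Freq = 4 "
            "WHERE Url = 'two.html' AND Word = 'Bogota'")
changed = cur.rowcount              # number of records modified

cur.execute("SELECT * FROM Keywords ORDER BY Url, Word")
rows = cur.fetchall()
print(changed)   # 1
print(rows)      # [('five.html', 'Bogota', 2), ('two.html', 'Beijing', 2), ('two.html', 'Bogota', 4)]
con.close()
```

An UPDATE without a WHERE clause would modify every record in the table.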
Standard Library module sqlite3
The Python Standard Library includes module sqlite3 that provides an API for accessing database files
• It is an interface to a library of functions that accesses the database files directly
>>> import sqlite3
>>> con = sqlite3.connect('web.db')
sqlite3 function connect() takes as input the name of a database file and returns an object of type Connection, a type defined in module sqlite3
• The Connection object con is associated with database file web.db
• If database file web.db does not exist in the current working directory, a new database file web.db is created
Standard Library module sqlite3
>>> import sqlite3
>>> con = sqlite3.connect('web.db')
>>> cur = con.cursor()
Connection method cursor() returns an object of type Cursor, another type defined in the module sqlite3
• Cursor objects are responsible for executing SQL statements
The Cursor class supports method execute() which takes an SQL statement as a string, and executes it
>>> cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
<sqlite3.Cursor object at 0x100575730>
>>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
<sqlite3.Cursor object at 0x100575730>
In the INSERT statement, the values 'one.html', 'Beijing', and 3 are hardcoded
Parameter substitution
In general, the values used in an SQL statement will not be hardcoded in the program but come from Python variables
Parameter substitution is the technique used to construct SQL statements that make use of Python variable values
• similar to string formatting
>>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
<sqlite3.Cursor object at 0x100575730>
>>> url, word, freq = 'one.html', 'Paris', 5
>>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", (url, word, freq))
<sqlite3.Cursor object at 0x100575730>
The second argument of execute() is a tuple containing one value for each ? placeholder
>>> record = ('one.html','Chicago', 5)
>>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", record)
<sqlite3.Cursor object at 0x100575730>
Changes to a database file are not written to the database file immediately; they are only recorded temporarily, in memory
In order to ensure that the changes are written to the database file, the commit() method must be called on the Connection object
>>> con.commit()
A database file should be closed just like any other file
>>> con.close()
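When many records have to be inserted, the steps above can be combined. The sketch below is one possible arrangement, not the slides' own code: it uses an in-memory database (an assumption), method executemany() to apply one parameterized statement to a list of records, and the Connection object as a context manager, which commits when the block succeeds and rolls back if it raises. The record containing O'Hare is hypothetical and is there to show that parameter substitution copes with a quote character inside a value.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # in-memory stand-in for web.db
con.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")

records = [
    ('one.html', 'Beijing', 3),
    ('one.html', 'Paris', 5),
    ('six.html', "O'Hare", 1),      # hypothetical record with a quote in it
]

# 'with con:' commits if the block succeeds and rolls back if it raises
with con:
    con.executemany("INSERT INTO Keywords VALUES (?, ?, ?)", records)

count = con.execute("SELECT COUNT(*) FROM Keywords").fetchone()[0]
stored = con.execute("SELECT Word FROM Keywords WHERE Url = ?",
                     ('six.html',)).fetchone()[0]
print(count, stored)                # 3 O'Hare

con.close()                         # the with statement does not close con
```

Building the same INSERT statement with string formatting would produce invalid SQL for the word O'Hare, which is one reason to prefer the ? placeholders.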
Querying a database
The result of a query is stored in the Cursor object
To obtain the result as a list of tuple objects, Cursor method fetchall() is used
>>> import sqlite3
>>> con = sqlite3.connect('links.db')
>>> cur = con.cursor()
>>> cur.execute('SELECT * FROM Keywords')
<sqlite3.Cursor object at 0x102686960>
>>> cur.fetchall()
[('one.html', 'Beijing', 3), ('one.html', 'Paris', 5), ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 5), ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1), ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6), ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2), ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7), ('five.html', 'Bogota', 2)]
An alternative is to iterate over the Cursor object
>>> cur.execute('SELECT * FROM Keywords')
<sqlite3.Cursor object at 0x102686960>
>>> for record in cur:
        print(record)

('one.html', 'Beijing', 3)
('one.html', 'Paris', 5)
('one.html', 'Chicago', 5)
('two.html', 'Bogota', 5)
('two.html', 'Beijing', 2)
('two.html', 'Paris', 1)
('three.html', 'Chicago', 3)
('three.html', 'Beijing', 6)
('four.html', 'Chicago', 3)
('four.html', 'Paris', 2)
('four.html', 'Nairobi', 5)
('five.html', 'Nairobi', 7)
('five.html', 'Bogota', 2)
Parameter substitution is again used whenever Python variable values are needed in the SQL statement
>>> word = 'Paris'
>>> cur.execute('SELECT Url FROM Keywords WHERE Word = ?', (word,))
<sqlite3.Cursor object at 0x102686960>
>>> cur.fetchall()
[('one.html',), ('two.html',), ('four.html',)]
>>> word, n = 'Beijing', 2
>>> cur.execute("SELECT * FROM Keywords WHERE Word = ? AND Freq > ?", (word, n))
<sqlite3.Cursor object at 0x102686960>
>>> cur.fetchall()
[('one.html', 'Beijing', 3), ('three.html', 'Beijing', 6)]
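A query that selects a single column still returns 1-tuples such as ('one.html',). The sketch below shows one way to unpack them into a plain list; the helper name pages_containing is hypothetical, and the in-memory database with four records is an assumption made to keep the example self-contained.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # in-memory database (an assumption)
cur = con.cursor()
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
cur.executemany("INSERT INTO Keywords VALUES (?, ?, ?)", [
    ('one.html', 'Paris', 5), ('two.html', 'Paris', 1),
    ('four.html', 'Paris', 2), ('one.html', 'Beijing', 3),
])

def pages_containing(cur, word):
    'hypothetical helper: URLs of pages containing word, most frequent first'
    cur.execute("SELECT Url FROM Keywords WHERE Word = ? ORDER BY Freq DESC",
                (word,))            # (word,) is a tuple with one item
    # iterate over the Cursor and unpack each 1-tuple into a string
    return [url for (url,) in cur]

paris = pages_containing(cur, 'Paris')
lima = pages_containing(cur, 'Lima')    # a word that appears in no page
print(paris)   # ['one.html', 'four.html', 'two.html']
print(lima)    # []
con.close()
```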
List comprehension
Suppose we want to construct a list from an “old” list by modifying each “old” list item in the same way
>>> lines
['First Line\n', 'Second\n', '\n', 'and Fourth.\n']
Method 1: accumulator pattern
>>> newlines = []
>>> for i in range(len(lines)):
        newlines.append(lines[i][:-1])

>>> newlines
['First Line', 'Second', '', 'and Fourth.']
Method 2: list comprehension
>>> newlines = [line[:-1] for line in lines]
>>> newlines
['First Line', 'Second', '', 'and Fourth.']
List comprehension
The syntax of the list comprehension statement:
[<expression> for <item> in <sequence/iterator>]
More generally:
[<expression> for <item> in <sequence/iterator> if <condition>]
Examples:
>>> [line[:-1] for line in lines if line != '\n']
['First Line', 'Second', 'and Fourth.']
>>> [i for i in range(0, 20, 2)]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
>>> [len(word) for word in ['hawk', 'hen', 'hog', 'hyena']]
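To make the connection between the two forms concrete, here is a small sketch that writes the same computations both as list comprehensions and as an accumulator loop. The variable names are chosen for illustration only.

```python
lines = ['First Line\n', 'Second\n', '\n', 'and Fourth.\n']
animals = ['hawk', 'hen', 'hog', 'hyena']

# Comprehension with a condition: drop blank lines, strip the newline
stripped = [line[:-1] for line in lines if line != '\n']

# Comprehension without a condition: one result per item
lengths = [len(word) for word in animals]

# The accumulator loop that the second comprehension abbreviates
lengths_loop = []
for word in animals:
    lengths_loop.append(len(word))

print(stripped)                   # ['First Line', 'Second', 'and Fourth.']
print(lengths)                    # [4, 3, 3, 5]
print(lengths == lengths_loop)    # True
```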
MapReduce
Suppose we would like to compute the frequency of every word in a list
So, for list
>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
the result would be
[('one', 2), ('five', 2), ('two', 1), ('three', 3)]
We have done this before using a dictionary and the accumulator loop pattern
We will now solve this problem using MapReduce
MapReduce
input list:
'two', 'three', 'one', 'three', 'three', 'one', 'five', 'five'
Map step: every item of the input list is mapped to a list of (key, value) pairs, giving intermediate1:
[('two', 1)], [('three', 1)], [('one', 1)], [('three', 1)], [('three', 1)], [('one', 1)], [('five', 1)], [('five', 1)]
Partition step: the values that share the same key are collected into one list, giving intermediate2:
('two', [1]), ('three', [1,1,1]), ('one', [1,1]), ('five', [1,1])
Reduce step: every (key, list) pair is reduced to a single result, giving the output list:
('two', 1), ('three', 3), ('one', 2), ('five', 2)
MapReduce
ch11.py
def occurrence(word):
    'returns list containing tuple (word, 1)'
    return [(word, 1)]

def occurrenceCount(keyVal):
    '''takes tuple keyVal = (key, lst) as input
       and returns (key, sum(lst))'''
    return (keyVal[0], sum(keyVal[1]))

def partition(intermediate1):
    # to do

>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
>>> intermediate1 = [occurrence(word) for word in words]
>>> intermediate2 = partition(intermediate1)
>>> [occurrenceCount(x) for x in intermediate2]
[('one', 2), ('five', 2), ('two', 1), ('three', 3)]
MapReduce
ch11.py
def partition(intermediate1):
    dct = {}
    # for every list lst of intermediate1
    for lst in intermediate1:
        # for every (key, value) pair in list lst
        for key, value in lst:
            if key in dct:
                dct[key].append(value)
            else:
                dct[key] = [value]
    # return container of (key, values) tuples
    return dct.items()    # return intermediate2
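Putting the three functions together gives a complete, runnable version of the word-frequency computation. This is a sketch that follows the slides' function names; the one deliberate difference is that partition() here returns list(dct.items()) so that the intermediate result prints as a list.

```python
def occurrence(word):
    'Map step: returns list containing tuple (word, 1)'
    return [(word, 1)]

def partition(intermediate1):
    'Partition step: collects the values of each key into one list'
    dct = {}
    for lst in intermediate1:
        for key, value in lst:
            if key in dct:
                dct[key].append(value)
            else:
                dct[key] = [value]
    return list(dct.items())

def occurrenceCount(keyVal):
    'Reduce step: takes (key, lst) and returns (key, sum(lst))'
    return (keyVal[0], sum(keyVal[1]))

words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
intermediate1 = [occurrence(word) for word in words]
intermediate2 = partition(intermediate1)
result = [occurrenceCount(x) for x in intermediate2]

print(intermediate2)
# [('two', [1]), ('three', [1, 1, 1]), ('one', [1, 1]), ('five', [1, 1])]
print(result)
# [('two', 1), ('three', 3), ('one', 2), ('five', 2)]
```

The order shown assumes Python 3.7 or later, where dictionaries keep insertion order; older versions may list the same pairs in a different order, as in the slide output.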
Introduction to Computing Using Python
MapReduce abstracted
The MapReduce framework applies to a range of problems and therefore should be abstracted:

ch11.py

def partition(intermediate1):
    # implementation here

class SeqMapReduce(object):
    'a sequential MapReduce implementation'

    def __init__(self, mapper, reducer):
        'functions mapper and reducer are problem specific'
        self.mapper = mapper
        self.reducer = reducer

    def process(self, data):
        'runs MapReduce on data with mapper and reducer functions'
        intermediate1 = [self.mapper(x) for x in data]       # Map
        intermediate2 = partition(intermediate1)
        return [self.reducer(x) for x in intermediate2]      # Reduce

>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
>>> smr = SeqMapReduce(occurrence, occurrenceCount)
>>> smr.process(words)
[('one', 2), ('five', 2), ('two', 1), ('three', 3)]
>>> numbers = [2,3,4,3,2,3,5,4,3,5,1]
>>> smr.process(numbers)
[(1, 1), (2, 2), (3, 4), (4, 2), (5, 2)]
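Because the mapper and reducer are passed in as arguments, the same class can solve a different problem without any change. As an illustration only, the sketch below counts letters instead of words; the mapper name letters is hypothetical, and partition() is written to return a list.

```python
def partition(intermediate1):
    'groups values by key; returns list of (key, values) tuples'
    dct = {}
    for lst in intermediate1:
        for key, value in lst:
            dct.setdefault(key, []).append(value)
    return list(dct.items())

class SeqMapReduce(object):
    'a sequential MapReduce implementation'
    def __init__(self, mapper, reducer):
        self.mapper = mapper
        self.reducer = reducer

    def process(self, data):
        intermediate1 = [self.mapper(x) for x in data]       # Map
        intermediate2 = partition(intermediate1)
        return [self.reducer(x) for x in intermediate2]      # Reduce

def letters(word):
    'hypothetical mapper: returns one (letter, 1) pair per letter of word'
    return [(ch, 1) for ch in word]

def occurrenceCount(keyVal):
    'reducer: takes (key, lst) and returns (key, sum(lst))'
    return (keyVal[0], sum(keyVal[1]))

smr = SeqMapReduce(letters, occurrenceCount)
letterCounts = sorted(smr.process(['two', 'three', 'one']))
print(letterCounts)
```

Here the mapper returns several pairs per input item, which is why partition() loops over every pair in every list.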
Inverted index problem

Given several text files, we want to know which words appear in which file.

a.txt:
Paris: Miami, Miami
Tokyo, Miami

b.txt:
Tokyo Quito ... Tokyo.
Quito

c.txt:
Paris, Quito.
Cairo, Paris, Quito.

A solution to the problem could be represented as a mapping that maps each word to the list of files containing it:

[('Paris', ['a.txt', 'c.txt']),
 ('Miami', ['a.txt']),
 ('Cairo', ['c.txt']),
 ('Quito', ['b.txt', 'c.txt']),
 ('Tokyo', ['a.txt', 'b.txt'])]

This mapping is called an inverted index.

To apply MapReduce, we need to define the mapper and reducer functions.
Inverted index problem

input list:
a.txt, b.txt, c.txt

intermediate1:
[(Tokyo, a.txt), (Paris, a.txt), (Miami, a.txt)]
[(Tokyo, b.txt), (Quito, b.txt)]
[(Paris, c.txt), (Cairo, c.txt)]

intermediate2:
(Tokyo, [a.txt, b.txt])
(Paris, [a.txt, c.txt])
(Miami, [a.txt])
(Quito, [b.txt])
(Cairo, [c.txt])

output list:
(...), (...), (...), (...), (...)
MapReduce
Mapper:

from string import punctuation

def getWordsFromFile(file):
    'returns set of items (word, file) for every word in file'
    infile = open(file)
    content = infile.read()
    infile.close()

    # remove punctuation
    transTable = str.maketrans(punctuation, ' '*len(punctuation))
    content = content.translate(transTable)

    # construct set of items (word, file) with no duplicates
    res = set()
    for word in content.split():
        res.add((word, file))
    return res    # return intermediate1

Reducer:

def getWordIndex(keyVal):
    'returns input value'
    return keyVal

intermediate2 is actually the desired list, so the reducer just copies its items to the output list.
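The mapper and reducer above can be tried without preparing files by hand. The following sketch writes three small files to a temporary folder and builds the inverted index sequentially. The file contents are made up for illustration (chosen so that the index matches the one shown for the problem), and the final step that strips the folder name from each path is an addition for readability.

```python
import os
import tempfile
from string import punctuation

def partition(intermediate1):
    'groups values by key; returns list of (key, values) tuples'
    dct = {}
    for lst in intermediate1:
        for key, value in lst:
            dct.setdefault(key, []).append(value)
    return list(dct.items())

def getWordsFromFile(file):
    'mapper: returns set of items (word, file) for every word in file'
    with open(file) as infile:
        content = infile.read()
    # replace every punctuation character with a blank
    transTable = str.maketrans(punctuation, ' ' * len(punctuation))
    content = content.translate(transTable)
    return {(word, file) for word in content.split()}

def getWordIndex(keyVal):
    'reducer: returns input value'
    return keyVal

# hypothetical file contents, for illustration only
contents = {'a.txt': 'Paris, Miami. Tokyo, Miami',
            'b.txt': 'Tokyo Quito',
            'c.txt': 'Cairo, Paris, Quito.'}

with tempfile.TemporaryDirectory() as folder:
    files = []
    for name in sorted(contents):
        path = os.path.join(folder, name)
        with open(path, 'w') as outfile:
            outfile.write(contents[name])
        files.append(path)
    intermediate1 = [getWordsFromFile(f) for f in files]     # Map
    intermediate2 = partition(intermediate1)
    result = [getWordIndex(x) for x in intermediate2]        # Reduce

# keep only the base file names so the index is easy to read
index = {word: sorted(os.path.basename(p) for p in paths)
         for word, paths in result}
print(index)
```

Because the mapper returns a set, a word that occurs several times in one file (Miami in a.txt) contributes only one (word, file) pair.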
Module multiprocessing
Standard Library module multiprocessing includes tools that make it possible to execute Python programs in parallel on multi-core machines.

How many processor cores does a given computer have? Let’s check:

>>> from multiprocessing import cpu_count
>>> cpu_count()
8

So this computer has 8 cores (yours may have more or fewer). Note that cpu_count() reports the number of CPUs the operating system sees, which may include logical (hyper-threaded) cores.

Class Pool from module multiprocessing can be used to split a problem and execute its pieces in parallel (i.e. at the same time) on separate cores.

A Pool object represents a pool of one or more processes, each of which is capable of executing code independently on a processor core.

Note: process != core. A process is a unit of execution that the operating system schedules onto a core.
Class Pool in module multiprocessing

Class Pool from module multiprocessing can be used to split a problem and execute its pieces in parallel.

A Pool object represents a pool of one or more processes, each of which is capable of executing code independently on an available processor core.

parallel.py

from multiprocessing import Pool

animals = ['hawk', 'hen', 'hog', 'hyena']

if __name__ == '__main__':            # needed on platforms that start new processes by re-importing this file
    pool = Pool(2)                    # create pool of 2 processes
    res = pool.map(len, animals)      # apply len() to every animals item
    print(res)                        # print the list of string lengths

Execute this program from an OS shell (not the Python interpreter shell):

> python parallel.py
[4, 3, 3, 5]
The statement

pool.map(len, animals)

and the statement

[len(x) for x in animals]

do the same thing (they construct a list by applying len() to every item of list animals).

It is how they do it that is different:
pool.map(len, animals) is executed by 2 processes
[len(x) for x in animals] is executed by 1 process
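The equivalence of the two statements can be checked directly. The sketch below wraps each one in a small function; the names sequential_lengths and parallel_lengths are hypothetical, and the pool is created inside a with block so that its processes are cleaned up when the block ends.

```python
from multiprocessing import Pool

animals = ['hawk', 'hen', 'hog', 'hyena']

def sequential_lengths(words):
    'list comprehension: executed by the current process only'
    return [len(w) for w in words]

def parallel_lengths(words, num_procs=2):
    'pool.map(): the items are handed out to num_procs processes'
    with Pool(num_procs) as pool:
        return pool.map(len, words)

if __name__ == '__main__':
    seq = sequential_lengths(animals)
    par = parallel_lengths(animals)
    print(seq)
    print(par)
    print(seq == par)     # both approaches build the same list
```

As with parallel.py, this is meant to be run as a script from an OS shell.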
Class Pool in module multiprocessing
Let’s verify that different processes are handling different list items.

parallel2.py

from multiprocessing import Pool
from os import getpid

def length(word):
    'returns length of string word'

    # print the id of the process executing the function
    print('Process {} handling {}'.format(getpid(), word))
    return len(word)

# main program
if __name__ == '__main__':
    pool = Pool(2)
    res = pool.map(length, ['hawk', 'hen', 'hog', 'hyena'])
    print(res)

> python parallel2.py
Process 5129 handling hawk
Process 5130 handling hen
Process 5129 handling hog
Process 5130 handling hyena
[4, 3, 3, 5]

Every process has a unique id.
Parallel speedup
The benefit of using a pool of independent processes is that the operating system’s scheduler can run them in parallel on separate cores
• This should result in faster program running time and parallel speedup
To showcase this, let’s consider a computationally intensive problem from number theory: compare the distribution of prime numbers in several ranges of integers
• Count the number of prime numbers in several equal-size ranges of 100,000 large integers
primeDensity.py

from os import getpid

# isprime() is assumed to be defined elsewhere in this file (not shown)

def countPrimes(start):
    'returns the number of primes in range [start, start+rng)'

    rng = 100000
    formatStr = 'process {} processing range [{}, {})'
    print(formatStr.format(getpid(), start, start+rng))

    # count the numbers i in range [start, start+rng) that are prime
    return sum([1 for i in range(start, start+rng) if isprime(i)])
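Function countPrimes() relies on a primality test that the slides do not show. The sketch below supplies one possible isprime(), using plain trial division; it is an assumption made for illustration, not necessarily the test used in primeDensity.py. The optional rng parameter is also an addition, so that the function can be tried on a small range.

```python
def isprime(n):
    'hypothetical helper: trial division by odd numbers up to sqrt(n)'
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2          # 2 is the only even prime
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

def countPrimes(start, rng=100000):
    'returns the number of primes in range [start, start+rng)'
    return sum(1 for i in range(start, start + rng) if isprime(i))

print(countPrimes(0, 100))     # there are 25 primes below 100
```

Since pool.map() passes a single argument to the function, rng keeps its default value of 100000 when countPrimes() is used with a pool.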
Parallel speedup

If the Pool contains only 1 process

primeDensity.py

def countPrimes(start):
    # not shown

if __name__ == '__main__':
    p = Pool(1)
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]

    t1 = time()                          # start time
    print(p.map(countPrimes, starts))
    t2 = time()                          # end time

    p.close()
    print('Time taken: {} seconds.'.format(t2-t1))

> python primeDensity.py
process 4176 processing range [12345678, 12445678)
process 4176 processing range [23456789, 23556789)
process 4176 processing range [34567890, 34667890)
process 4176 processing range [45678901, 45778901)
process 4176 processing range [56789012, 56889012)
process 4176 processing range [67890123, 67990123)
process 4176 processing range [78901234, 79001234)
process 4176 processing range [89012345, 89112345)
[6185, 5900, 5700, 5697, 5551, 5572, 5462, 5469]
Time taken: 47.84 seconds.
Parallel speedup

If the Pool contains 2 processes

primeDensity.py

def countPrimes(start):
    # not shown

if __name__ == '__main__':
    p = Pool(2)
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]

    t1 = time()                          # start time
    print(p.map(countPrimes, starts))
    t2 = time()                          # end time

    p.close()
    print('Time taken: {} seconds.'.format(t2-t1))

Time taken: 24.60 seconds.

Speedup = sequential time / parallel time = 47.84/24.6 ≈ 1.94
Using 2 processes on 2 cores instead of 1 process on 1 core decreased the running time from 47.84 to 24.6 seconds.
Parallel speedup

If the Pool contains 4 processes

primeDensity.py

def countPrimes(start):
    # not shown

if __name__ == '__main__':
    p = Pool(4)
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]

    t1 = time()                          # start time
    print(p.map(countPrimes, starts))
    t2 = time()                          # end time

    p.close()
    print('Time taken: {} seconds.'.format(t2-t1))

Time taken: 16.78 seconds.

Speedup = 47.84/16.78 ≈ 2.85
Parallel speedup

If the Pool contains 8 processes

primeDensity.py

def countPrimes(start):
    # not shown

if __name__ == '__main__':
    p = Pool(8)
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]

    t1 = time()                          # start time
    print(p.map(countPrimes, starts))
    t2 = time()                          # end time

    p.close()
    print('Time taken: {} seconds.'.format(t2-t1))

Time taken: 14.29 seconds.

Speedup = 47.84/14.29 ≈ 3.35
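The speedup figures can be recomputed from the measured running times. The short sketch below uses the times reported above, with speedup defined as sequential time divided by parallel time; the variable and function names are chosen for illustration.

```python
sequential_time = 47.84                            # 1 process, in seconds
parallel_times = {2: 24.60, 4: 16.78, 8: 14.29}    # processes -> seconds

def speedup(sequential, parallel):
    'speedup = sequential time / parallel time'
    return sequential / parallel

speedups = {n: round(speedup(sequential_time, t), 2)
            for n, t in parallel_times.items()}

for n, s in speedups.items():
    print('{} processes: speedup {:.2f}'.format(n, s))
```

On this machine the speedup grows more slowly than the number of processes: going from 2 to 8 processes raises it from 1.94 to only 3.35.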
MapReduce in parallel

MapReduce reimplemented using a pool of processes and method map()

ch12.py

from multiprocessing import Pool

class MapReduce(object):
    'a parallel implementation of MapReduce'

    def __init__(self, mapper, reducer, numProcs = None):
        'initializes map and reduce functions and process pool'

        self.mapper = mapper
        self.reducer = reducer
        self.pool = Pool(numProcs)

    def process(self, data):
        'runs MapReduce on sequence data'

        intermediate1 = self.pool.map(self.mapper, data)          # Map
        intermediate2 = partition(intermediate1)
        return self.pool.map(self.reducer, intermediate2)         # Reduce
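To see the parallel class at work, it can be applied to the word-count problem. The sketch below is self-contained: it repeats partition(), the mapper, and the reducer, with partition() returning a list, and it closes the pool explicitly at the end, which is an addition not shown on the slide.

```python
from multiprocessing import Pool

def partition(intermediate1):
    'groups values by key; returns list of (key, values) tuples'
    dct = {}
    for lst in intermediate1:
        for key, value in lst:
            dct.setdefault(key, []).append(value)
    return list(dct.items())

def occurrence(word):
    'mapper: returns list containing tuple (word, 1)'
    return [(word, 1)]

def occurrenceCount(keyVal):
    'reducer: takes (key, lst) and returns (key, sum(lst))'
    return (keyVal[0], sum(keyVal[1]))

class MapReduce(object):
    'a parallel implementation of MapReduce'
    def __init__(self, mapper, reducer, numProcs=None):
        self.mapper = mapper
        self.reducer = reducer
        self.pool = Pool(numProcs)

    def process(self, data):
        intermediate1 = self.pool.map(self.mapper, data)         # Map
        intermediate2 = partition(intermediate1)
        return self.pool.map(self.reducer, intermediate2)        # Reduce

if __name__ == '__main__':
    words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
    mr = MapReduce(occurrence, occurrenceCount, 2)
    print(sorted(mr.process(words)))
    mr.pool.close()      # no more work will be submitted
    mr.pool.join()       # wait for the worker processes to exit
```

The mapper and reducer must be defined at the top level of the module so that they can be sent to the worker processes.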
The name cross-checking problem
Tens of thousands of previously classified documents have just been posted on the web. You want to find out which documents mention a particular person, and you want to do that for every person named in one or more documents.
• Assume that people’s names are capitalized, which helps you narrow down the words that can be proper names.
The precise problem is then: given a list of URLs (of the documents), obtain a list of pairs (proper, urlList) in which proper is a capitalized word in any document and urlList is a list of URLs of documents containing proper
In order to use MapReduce, we need to define the map and reduce functions
The name cross-checking problem
The map function takes a URL as input and returns a list of tuples (word, URL) for every word that is capitalized in the document identified by the URL
crosscheck.py

from urllib.request import urlopen
from re import findall

def getProperFromURL(url):
    '''returns list of items (word, url) for every capitalized word
       in the document identified by url'''

    content = urlopen(url).read().decode()
    pattern = r"[A-Z][A-Za-z'\-]*"    # RE for capitalized words
    # collect all capitalized words and remove duplicates
    propers = set(findall(pattern, content))

    res = []
    for word in propers:        # for every capitalized word
        # create pair (word, url) and append to res
        res.append((word, url))
    return res
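The regular expression used by the mapper can be tried on a short string, without downloading anything. The sample sentence below is made up for illustration.

```python
from re import findall

# a capital letter followed by any number of letters, apostrophes or hyphens
pattern = r"[A-Z][A-Za-z'\-]*"

sample = "Mr. Pickwick met Sam Weller's father in London."
propers = sorted(set(findall(pattern, sample)))
print(propers)
```

The pattern matches every capitalized word, including titles such as Mr and the first word of any sentence, so the result is a superset of the proper names in the text.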
The name cross-checking problem
The partition function will, for every capitalized word, collect all tuples (word, url) in every list in intermediate1 to construct list intermediate2 containing pairs (word, [url1, url2, ...])

Since intermediate2 contains the desired result (mapping of capitalized words to urls), the reducer function just returns its input

crosscheck.py

def getWordIndex(keyVal):
    'returns input value'
    return keyVal
The name cross-checking problem
Let’s compare the sequential and parallel implementations of MapReduce by cross-checking the proper names in 8 Charles Dickens novels:

crosscheck.py

from time import time

if __name__ == '__main__':

    urls = [    # URLs of eight Charles Dickens novels
        'http://www.gutenberg.org/cache/epub/2701/pg2701.txt',
        'http://www.gutenberg.org/cache/epub/1400/pg1400.txt',
        'http://www.gutenberg.org/cache/epub/46/pg46.txt',
        'http://www.gutenberg.org/cache/epub/730/pg730.txt',
        'http://www.gutenberg.org/cache/epub/766/pg766.txt',
        'http://www.gutenberg.org/cache/epub/1023/pg1023.txt',
        'http://www.gutenberg.org/cache/epub/580/pg580.txt',
        'http://www.gutenberg.org/cache/epub/786/pg786.txt']

    t1 = time()    # sequential start time
    SeqMapReduce(getProperFromURL, getWordIndex).process(urls)
    t2 = time()    # sequential stop time, parallel start time
    MapReduce(getProperFromURL, getWordIndex, 4).process(urls)
    t3 = time()    # parallel stop time

    print('Sequential: {:5.2f} seconds.'.format(t2-t1))
    print('Parallel: {:5.2f} seconds.'.format(t3-t2))

> python crosscheck.py
Sequential: 19.89 seconds.
Parallel: 14.81 seconds.