
Introduction to Computing Using Python

Data Storage and Processing

Databases and SQL
Python Database Programming
List comprehension and MapReduce
Parallel Computing


Data storage

[Figure: five linked web pages, one.html through five.html, each annotated with the number of occurrences of the city names Beijing, Paris, Chicago, Bogota, and Nairobi found on it.]

The data collected by a web crawler can be stored in a text file


Data storage

URL word count
http://reed.cs.depaul.edu/lperkovic/one.html Paris 5
http://reed.cs.depaul.edu/lperkovic/one.html Beijing 3
http://reed.cs.depaul.edu/lperkovic/one.html Chicago 5

URL link
http://reed.cs.depaul.edu/lperkovic/one.html http://reed.cs.depaul.edu/lperkovic/two.html
http://reed.cs.depaul.edu/lperkovic/one.html http://reed.cs.depaul.edu/lperkovic/three.html

URL word count
http://reed.cs.depaul.edu/lperkovic/two.html Bogota 3
http://reed.cs.depaul.edu/lperkovic/two.html Paris 1
http://reed.cs.depaul.edu/lperkovic/two.html Beijing 2

URL link
http://reed.cs.depaul.edu/lperkovic/two.html http://reed.cs.depaul.edu/lperkovic/four.html

URL word count
http://reed.cs.depaul.edu/lperkovic/four.html Paris 2
...


Data storage

A search engine app may then need to access this file to make queries such as

1. In which web pages does word X appear?
2. What is the ranking of web pages containing word X, based on the number of occurrences of word X in the page?
3. How many pages contain word X?
4. What pages have a hyperlink to page Y?
5. What is the total number of occurrences of word ‘Paris’ across all web pages?
6. How many outgoing links does each visited page have?
7. How many incoming links does each visited page have?
8. What pages have a link to a page containing word X?
9. What page containing word X has the most incoming links?

A text file is not ideal for this ...



Database files

The data collected by a web crawler can be stored in a text file ...

... or in a database file

Table Hyperlinks

Url          Link
one.html     two.html
one.html     three.html
two.html     four.html
three.html   four.html
four.html    five.html
five.html    one.html
five.html    two.html
five.html    four.html

Table Keywords

Url          Word      Freq
one.html     Beijing   3
one.html     Paris     5
one.html     Chicago   5
two.html     Bogota    3
two.html     Beijing   2
two.html     Paris     1
three.html   Chicago   3
three.html   Beijing   6
four.html    Chicago   3
four.html    Paris     2
four.html    Nairobi   5
five.html    Nairobi   7
five.html    Bogota    2


Database files

A database file consists of one or more tables

Each table has a name and consists of rows and columns

Each column has a name and contains data of a specific type

Each row is a database record


Database files

Database files are not read from or written to directly

Instead, “read/write” commands are sent to a special type of server program called a database engine that manages the database

The database engine accesses the database file on the user’s behalf

The commands accepted by database engines are statements written in the Structured Query Language (SQL)


SQL SELECT FROM statement

SQL statement SELECT is used to make queries into a database

SELECT Link FROM Hyperlinks

Result table:

Link
two.html
three.html
four.html
four.html
five.html
one.html
two.html
four.html



SELECT can list several columns, separated by commas

SELECT Url, Word FROM Keywords

Result table:

Url          Word
one.html     Beijing
one.html     Paris
one.html     Chicago
two.html     Bogota
two.html     Beijing
two.html     Paris
three.html   Chicago
three.html   Beijing
four.html    Chicago
four.html    Paris
four.html    Nairobi
five.html    Nairobi
five.html    Bogota



The wildcard * selects every column of the table

SELECT * FROM Hyperlinks

Result table:

Url          Link
one.html     two.html
one.html     three.html
two.html     four.html
three.html   four.html
four.html    five.html
five.html    one.html
five.html    two.html
five.html    four.html


SQL DISTINCT keyword

SQL keyword DISTINCT removes duplicate records in the result table

SELECT DISTINCT Link FROM Hyperlinks

Result table:

Link
two.html
three.html
four.html
five.html
one.html
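The difference between SELECT and SELECT DISTINCT can be checked from Python. The following is a minimal sketch, assuming the sqlite3 module covered later in these slides and an in-memory database (the name ':memory:' and the variable names are choices made for this illustration, not part of the slides).

```python
import sqlite3

# Rows of table Hyperlinks, copied from the slides.
hyperlinks = [
    ('one.html', 'two.html'), ('one.html', 'three.html'),
    ('two.html', 'four.html'), ('three.html', 'four.html'),
    ('four.html', 'five.html'), ('five.html', 'one.html'),
    ('five.html', 'two.html'), ('five.html', 'four.html'),
]

# Assumption: an in-memory database, so no file is created on disk.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE Hyperlinks (Url text, Link text)")
cur.executemany("INSERT INTO Hyperlinks VALUES (?, ?)", hyperlinks)

# SELECT keeps duplicates: one result record per record of Hyperlinks.
cur.execute("SELECT Link FROM Hyperlinks")
all_links = [row[0] for row in cur.fetchall()]

# SELECT DISTINCT drops duplicates. The result is sorted in Python
# because SQL promises no particular order without ORDER BY.
cur.execute("SELECT DISTINCT Link FROM Hyperlinks")
distinct_links = sorted(row[0] for row in cur.fetchall())

print(len(all_links))     # number of records, duplicates included
print(distinct_links)
con.close()
```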


SQL WHERE clause

SQL clause WHERE is used to select only those records that satisfy a condition

“In which pages does word X appear?”

SELECT Url FROM Keywords
WHERE Word = 'Paris'

Result table:

Url
one.html
two.html
four.html

Operator   Explanation
=          Equal
<>         Not equal
>          Greater than
<          Less than
>=         Greater than or equal
<=         Less than or equal
BETWEEN    Within an inclusive range


The general forms of a SELECT statement with a WHERE clause:

SELECT Column(s) FROM Table
WHERE Column operator value

SELECT Column(s) FROM Table
WHERE Column BETWEEN value1 AND value2


SQL keyword DESC

SQL keyword DESC is used with ORDER BY to order the records in the result table in descending order

“What is the ranking of web pages containing word X, based on the number of occurrences of word X in the page?”

SELECT Url, Freq FROM Keywords
WHERE Word = 'Paris'
ORDER BY Freq DESC

Result table:

Url         Freq
one.html    5
four.html   2
two.html    1
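The WHERE, BETWEEN, and ORDER BY ... DESC examples can be tried out as follows. This is a sketch under the assumption that table Keywords is loaded into an in-memory SQLite database through module sqlite3; the variable names are illustrative only.

```python
import sqlite3

# Rows of table Keywords, copied from the slides.
keywords = [
    ('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
    ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
    ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
    ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
    ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
    ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
    ('five.html', 'Bogota', 2),
]

con = sqlite3.connect(':memory:')   # assumption: throwaway in-memory database
cur = con.cursor()
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
cur.executemany("INSERT INTO Keywords VALUES (?, ?, ?)", keywords)

# Pages containing 'Paris', most occurrences first.
cur.execute("SELECT Url, Freq FROM Keywords "
            "WHERE Word = 'Paris' ORDER BY Freq DESC")
ranking = cur.fetchall()

# Records whose frequency lies in the inclusive range 3..5.
cur.execute("SELECT Url, Word FROM Keywords WHERE Freq BETWEEN 3 AND 5")
between = cur.fetchall()

print(ranking)
print(len(between))
con.close()
```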


Exercise

Write an SQL query that returns:

1. The URL of every page that has a link to web page four.html

SELECT DISTINCT Url FROM Hyperlinks
WHERE Link = 'four.html'


Exercise

Write an SQL query that returns:

2. The URL of every page that has an incoming link from page four.html

SELECT DISTINCT Link FROM Hyperlinks
WHERE Url = 'four.html'


Exercise

Write an SQL query that returns:

3. The URL and word for every word that appears exactly three times in the web page associated with the URL

SELECT Url, Word FROM Keywords
WHERE Freq = 3


Exercise

Write an SQL query that returns:

4. The URL, word, and frequency for every word that appears between 3 and 5 times, inclusive, in the web page associated with the URL

SELECT * FROM Keywords
WHERE Freq BETWEEN 3 AND 5


SQL built-in functions

SQL includes built-in math functions such as COUNT() and SUM()

“How many pages contain the word Paris?”

SELECT COUNT(*) FROM Keywords
WHERE Word = 'Paris'

Result: 3


“What is the total number of occurrences of word ‘Paris’ across all web pages?”

SELECT SUM(Freq) FROM Keywords
WHERE Word = 'Paris'

Result: 8


SQL GROUP BY clause

SQL clause GROUP BY groups the records of a table that have the same value in a column

“How many outgoing links does each web page have?”

SELECT Url, COUNT(*) FROM Hyperlinks
GROUP BY Url

Result table:

Url          COUNT(*)
one.html     2
two.html     1
three.html   1
four.html    1
five.html    3
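The COUNT(), SUM(), and GROUP BY results above can be reproduced with a short script. The sketch below assumes both tables are loaded into an in-memory SQLite database through module sqlite3; the variable names are made up for the illustration.

```python
import sqlite3

# Tables Hyperlinks and Keywords, copied from the slides.
hyperlinks = [
    ('one.html', 'two.html'), ('one.html', 'three.html'),
    ('two.html', 'four.html'), ('three.html', 'four.html'),
    ('four.html', 'five.html'), ('five.html', 'one.html'),
    ('five.html', 'two.html'), ('five.html', 'four.html'),
]
keywords = [
    ('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
    ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
    ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
    ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
    ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
    ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
    ('five.html', 'Bogota', 2),
]

con = sqlite3.connect(':memory:')   # assumption: throwaway in-memory database
cur = con.cursor()
cur.execute("CREATE TABLE Hyperlinks (Url text, Link text)")
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
cur.executemany("INSERT INTO Hyperlinks VALUES (?, ?)", hyperlinks)
cur.executemany("INSERT INTO Keywords VALUES (?, ?, ?)", keywords)

# COUNT(*) counts the selected records; the result is a single record.
cur.execute("SELECT COUNT(*) FROM Keywords WHERE Word = 'Paris'")
pages_with_paris = cur.fetchone()[0]

# SUM(Freq) adds up column Freq over the selected records.
cur.execute("SELECT SUM(Freq) FROM Keywords WHERE Word = 'Paris'")
total_paris = cur.fetchone()[0]

# GROUP BY Url produces one result record per distinct Url.
cur.execute("SELECT Url, COUNT(*) FROM Hyperlinks GROUP BY Url")
outgoing = dict(cur.fetchall())

print(pages_with_paris, total_paris)
print(outgoing)
con.close()
```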


Exercise

Write an SQL query that returns:

1. The number of words, including duplicates, that page two.html contains

SELECT SUM(Freq) FROM Keywords
WHERE Url = 'two.html'


Exercise

Write an SQL query that returns:

2. The number of distinct words page two.html contains

SELECT COUNT(*) FROM Keywords
WHERE Url = 'two.html'


Exercise

Write an SQL query that returns:

3. The number of words, including duplicates, that each web page has

SELECT Url, SUM(Freq) FROM Keywords
GROUP BY Url


Exercise

Write an SQL query that returns:

4. The number of incoming links each web page has

SELECT Link, COUNT(*) FROM Hyperlinks
GROUP BY Link

SQL queries involving multiple tables

“What web pages have a link to a page containing word ‘Bogota’?”

This question requires a lookup of both tables:
• Look up Keywords to find the set S of URLs of pages containing word ‘Bogota’
• Then look up Hyperlinks to find the URLs of pages with links to pages in S



The SELECT statement can be used on multiple tables.

SELECT * FROM Hyperlinks, Keywords

Result table:

(Hyperlinks)            (Keywords)
Url         Link        Url         Word      Freq
one.html    two.html    one.html    Beijing   3
one.html    two.html    one.html    Paris     5
one.html    two.html    one.html    Chicago   5
one.html    two.html    two.html    Bogota    3
...         ...         ...         ...       ...
five.html   four.html   four.html   Nairobi   5
five.html   four.html   five.html   Nairobi   7
five.html   four.html   five.html   Bogota    2

The result table is the cross join of tables Hyperlinks and Keywords

• It has 104 records (8 × 13), each a combination of a record in Hyperlinks and a record in Keywords
• It has five named columns corresponding to the two columns of table Hyperlinks and three columns of table Keywords.



A WHERE clause keeps only those combinations in which the link in the Hyperlinks record is the URL in the Keywords record

SELECT * FROM Hyperlinks, Keywords
WHERE Hyperlinks.Link = Keywords.Url

Result table:

(Hyperlinks)             (Keywords)
Url         Link         Url          Word      Freq
one.html    two.html     two.html     Bogota    3
one.html    two.html     two.html     Beijing   2
one.html    two.html     two.html     Paris     1
one.html    three.html   three.html   Chicago   3
...         ...          ...          ...       ...
five.html   four.html    four.html    Paris     2
five.html   four.html    four.html    Nairobi   5




Adding the condition Keywords.Word = 'Bogota' keeps only the links to pages containing word ‘Bogota’

SELECT * FROM Hyperlinks, Keywords
WHERE Keywords.Word = 'Bogota' AND Hyperlinks.Link = Keywords.Url

Result table:

(Hyperlinks)            (Keywords)
Url         Link        Url         Word     Freq
one.html    two.html    two.html    Bogota   3
four.html   five.html   five.html   Bogota   2
five.html   two.html    two.html    Bogota   3

Selecting only column Hyperlinks.Url answers the question

SELECT Hyperlinks.Url FROM Hyperlinks, Keywords
WHERE Keywords.Word = 'Bogota' AND Hyperlinks.Link = Keywords.Url

Result table:

Url
one.html
four.html
five.html
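The two-table query can be run from Python as sketched below, assuming an in-memory SQLite database accessed through module sqlite3. The second statement, written with the explicit JOIN ... ON syntax, is not from the slides; it is included only as an equivalent way to state the same condition.

```python
import sqlite3

hyperlinks = [
    ('one.html', 'two.html'), ('one.html', 'three.html'),
    ('two.html', 'four.html'), ('three.html', 'four.html'),
    ('four.html', 'five.html'), ('five.html', 'one.html'),
    ('five.html', 'two.html'), ('five.html', 'four.html'),
]
keywords = [
    ('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
    ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
    ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
    ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
    ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
    ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
    ('five.html', 'Bogota', 2),
]

con = sqlite3.connect(':memory:')   # assumption: throwaway in-memory database
cur = con.cursor()
cur.execute("CREATE TABLE Hyperlinks (Url text, Link text)")
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
cur.executemany("INSERT INTO Hyperlinks VALUES (?, ?)", hyperlinks)
cur.executemany("INSERT INTO Keywords VALUES (?, ?, ?)", keywords)

# The cross join pairs every Hyperlinks record with every Keywords record.
cur.execute("SELECT COUNT(*) FROM Hyperlinks, Keywords")
cross_join_size = cur.fetchone()[0]

# Query from the slides: pages linking to a page that contains 'Bogota'.
cur.execute("SELECT Hyperlinks.Url FROM Hyperlinks, Keywords "
            "WHERE Keywords.Word = 'Bogota' "
            "AND Hyperlinks.Link = Keywords.Url")
linking_pages = sorted(url for (url,) in cur.fetchall())

# Same condition written with an explicit JOIN (not shown in the slides).
cur.execute("SELECT Hyperlinks.Url FROM Hyperlinks "
            "JOIN Keywords ON Hyperlinks.Link = Keywords.Url "
            "WHERE Keywords.Word = 'Bogota'")
linking_pages_join = sorted(url for (url,) in cur.fetchall())

print(cross_join_size)
print(linking_pages)
con.close()
```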



SQL CREATE TABLE statement

SQL statement CREATE TABLE is used to create a table in a database file

CREATE TABLE Keywords
(
    Url text,
    Word text,
    Freq int
)

In general:

CREATE TABLE TableName
(
    Column1 dataType1,
    Column2 dataType2,
    ...
)

SQL Type   Python Type   Explanation
INTEGER    int           Holds integer values
REAL       float         Holds floating-point values
TEXT       str           Holds string values, delimited with quotes
BLOB       bytes         Holds sequence of bytes
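The correspondence between SQL types and Python types in the table above can be observed directly. The following sketch assumes module sqlite3 and an in-memory database; the table name Samples and its column names are hypothetical, chosen only for this illustration.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # assumption: throwaway in-memory database
cur = con.cursor()

# Hypothetical table with one column of each SQL type from the slide.
cur.execute("CREATE TABLE Samples (I INTEGER, R REAL, T TEXT, B BLOB)")
cur.execute("INSERT INTO Samples VALUES (?, ?, ?, ?)",
            (3, 2.5, 'Paris', b'\x00\x01'))

# Read the record back and record the Python type of each value.
cur.execute("SELECT * FROM Samples")
row = cur.fetchone()
python_types = [type(value).__name__ for value in row]

print(row)
print(python_types)
con.close()
```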


SQL INSERT statement

SQL statement INSERT is used to add a record to a table

INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)

Table Keywords after the insertion into an empty table:

Url         Word      Freq
one.html    Beijing   3


SQL UPDATE statement

SQL statement UPDATE is used to modify a record in a table

UPDATE Keywords SET Freq = 4
WHERE Url = 'two.html' AND Word = 'Bogota'

Record before the update:

Url         Word     Freq
two.html    Bogota   3

Record after the update (all other records of Keywords are unchanged):

Url         Word     Freq
two.html    Bogota   4


Standard Library module sqlite3

The Python Standard Library includes module sqlite3 that provides an API for accessing database files

• It is an interface to a library of functions that accesses the database files directly

>>> import sqlite3
>>> con = sqlite3.connect('web.db')

sqlite3 function connect() takes as input the name of a database file and returns an object of type Connection, a type defined in module sqlite3

• The Connection object con is associated with database file web.db
• If database file web.db does not exist in the current working directory, a new database file web.db is created



Connection method cursor() returns an object of type Cursor, another type defined in the module sqlite3

• Cursor objects are responsible for executing SQL statements

>>> import sqlite3
>>> con = sqlite3.connect('web.db')
>>> cur = con.cursor()

The Cursor class supports method execute(), which takes an SQL statement as a string and executes it

>>> cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
<sqlite3.Cursor object at 0x100575730>
>>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
<sqlite3.Cursor object at 0x100575730>

The values in this INSERT statement are hardcoded in the SQL string


Parameter substitution

In general, the values used in an SQL statement will not be hardcoded in the program but come from Python variables

Parameter substitution is the technique used to construct SQL statements that make use of Python variable values

• similar to string formatting
• each ? in the SQL statement is replaced by the corresponding item of the tuple passed as the second argument of execute()

>>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
<sqlite3.Cursor object at 0x100575730>
>>> url, word, freq = 'one.html', 'Paris', 5
>>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", (url, word, freq))
<sqlite3.Cursor object at 0x100575730>
>>> record = ('one.html','Chicago', 5)
>>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", record)
<sqlite3.Cursor object at 0x100575730>

Changes to a database file are not written to the database file immediately; they are only recorded temporarily, in memory

In order to ensure that the changes are written to the database file, the commit() method must be called on the Connection object

>>> con.commit()

A database file should be closed just like any other file

>>> con.close()
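The effect of commit() can be seen by closing a connection with an uncommitted change and reopening the file. The sketch below makes two assumptions that are not in the slides: the database lives in a temporary directory rather than in web.db in the working directory, and the connection uses the sqlite3 module's default transaction handling.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Assumption: a throwaway file stands in for web.db.
    path = os.path.join(tmpdir, 'web.db')

    con = sqlite3.connect(path)
    cur = con.cursor()
    cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
    cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)",
                ('one.html', 'Beijing', 3))
    con.commit()          # this record is now written to the file

    cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)",
                ('one.html', 'Paris', 5))
    con.close()           # closed without commit(): the insert is lost

    # Reopen the file and look at what was actually saved.
    con = sqlite3.connect(path)
    saved = con.execute("SELECT * FROM Keywords").fetchall()
    con.close()

print(saved)
```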


Querying a database

The result of a query is stored in the Cursor object

To obtain the result as a list of tuple objects, Cursor method fetchall() is used

>>> import sqlite3
>>> con = sqlite3.connect('links.db')
>>> cur = con.cursor()
>>> cur.execute('SELECT * FROM Keywords')
<sqlite3.Cursor object at 0x102686960>
>>> cur.fetchall()
[('one.html', 'Beijing', 3), ('one.html', 'Paris', 5), ('one.html', 'Chicago', 5),
 ('two.html', 'Bogota', 5), ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
 ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6), ('four.html', 'Chicago', 3),
 ('four.html', 'Paris', 2), ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
 ('five.html', 'Bogota', 2)]
>>>


An alternative is to iterate over the Cursor object

>>> cur.execute('SELECT * FROM Keywords')
<sqlite3.Cursor object at 0x102686960>
>>> for record in cur:
        print(record)

('one.html', 'Beijing', 3)
('one.html', 'Paris', 5)
('one.html', 'Chicago', 5)
('two.html', 'Bogota', 5)
('two.html', 'Beijing', 2)
('two.html', 'Paris', 1)
('three.html', 'Chicago', 3)
('three.html', 'Beijing', 6)
('four.html', 'Chicago', 3)
('four.html', 'Paris', 2)
('four.html', 'Nairobi', 5)
('five.html', 'Nairobi', 7)
('five.html', 'Bogota', 2)
>>>


Parameter substitution is again used whenever Python variable values are needed in the SQL statement

>>> word = 'Paris'
>>> cur.execute('SELECT Url FROM Keywords WHERE Word = ?', (word,))
<sqlite3.Cursor object at 0x102686960>
>>> cur.fetchall()
[('one.html',), ('two.html',), ('four.html',)]
>>> word, n = 'Beijing', 2
>>> cur.execute("SELECT * FROM Keywords WHERE Word = ? AND Freq > ?", (word, n))
<sqlite3.Cursor object at 0x102686960>
>>> cur.fetchall()
[('one.html', 'Beijing', 3), ('three.html', 'Beijing', 6)]
>>>
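One practical reason to prefer parameter substitution over string formatting is that the value never becomes part of the SQL text. The sketch below illustrates this with a hypothetical word containing a quote character; the word "O'Hare" and the in-memory database are assumptions made for the example.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # assumption: throwaway in-memory database
cur = con.cursor()
cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")

word = "O'Hare"   # hypothetical word containing a quote character

# Building the statement with string formatting: the quote inside the
# word ends the SQL string early and the statement no longer parses.
statement = "INSERT INTO Keywords VALUES ('one.html', '{}', 1)".format(word)
try:
    cur.execute(statement)
    formatting_worked = True
except sqlite3.OperationalError:
    formatting_worked = False

# Parameter substitution passes the value separately from the SQL text.
cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", ('one.html', word, 1))
cur.execute("SELECT Url FROM Keywords WHERE Word = ?", (word,))

# Each result record is a 1-tuple; unpack it to get plain strings.
urls = [url for (url,) in cur.fetchall()]

print(formatting_worked)
print(urls)
con.close()
```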


List comprehension

Suppose we want to construct a list from an “old” list by modifying each “old” list item in the same way

lines:      ['First Line\n', 'Second\n', '\n', 'and Fourth.\n']
newlines:   ['First Line', 'Second', '', 'and Fourth.']

Method 1: accumulator pattern

>>> lines
['First Line\n', 'Second\n', '\n', 'and Fourth.\n']
>>> newlines = []
>>> for i in range(len(lines)):
        newlines.append(lines[i][:-1])

>>> newlines
['First Line', 'Second', '', 'and Fourth.']

Method 2: list comprehension

>>> newlines = [line[:-1] for line in lines]
>>> newlines
['First Line', 'Second', '', 'and Fourth.']


The syntax of the list comprehension statement:

[<expression> for <item> in <sequence/iterator>]

More generally:

[<expression> for <item> in <sequence/iterator> if <condition>]

Examples:

>>> [line[:-1] for line in lines if line != '\n']
['First Line', 'Second', 'and Fourth.']
>>> [i for i in range(0, 20, 2)]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
>>> [len(word) for word in ['hawk', 'hen', 'hog', 'hyena']]
[4, 3, 3, 5]
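The two methods and the conditional form can be compared side by side in a script. The sketch also points out one caveat that the slides do not discuss: the slice line[:-1] removes the last character whatever it is, so str.rstrip('\n') is used here as a possible alternative when a line might not end with a newline.

```python
lines = ['First Line\n', 'Second\n', '\n', 'and Fourth.\n']

# Method 1: accumulator pattern
accumulated = []
for line in lines:
    accumulated.append(line[:-1])

# Method 2: list comprehension, same result in one expression
comprehended = [line[:-1] for line in lines]

# Conditional form: skip the blank line
nonblank = [line[:-1] for line in lines if line != '\n']

# Caveat: a last line without a trailing newline loses a real character
# when sliced, but is left intact by rstrip('\n').
unterminated = ['last line']
sliced = [line[:-1] for line in unterminated]
stripped = [line.rstrip('\n') for line in unterminated]

print(comprehended)
print(nonblank)
print(sliced, stripped)
```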


MapReduce

Suppose we would like to compute the frequency of every word in a list

So, for list

>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']

the result would be

[('one', 2), ('five', 2), ('two', 1), ('three', 3)]

We have done this before using a dictionary and the accumulator loop pattern

We will now solve this problem using MapReduce


MapReduce

input list:      'two', 'three', 'one', 'three', 'three', 'one', 'five', 'five'

Map step

intermediate1:   [('two', 1)], [('three', 1)], [('one', 1)], [('three', 1)],
                 [('three', 1)], [('one', 1)], [('five', 1)], [('five', 1)]

Partition step

intermediate2:   ('two', [1]), ('three', [1,1,1]), ('one', [1,1]), ('five', [1,1])

Reduce step

output list:     ('two', 1), ('three', 3), ('one', 2), ('five', 2)


MapReduce

ch11.py

def occurrence(word):
    'returns list containing tuple (word, 1)'
    return [(word, 1)]

def occurrenceCount(keyVal):
    '''takes tuple keyVal = (key, lst) as input
       and returns (key, sum(lst))'''
    return (keyVal[0], sum(keyVal[1]))

def partition(intermediate1):
    # to do

>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
>>> intermediate1 = [occurrence(word) for word in words]
>>> intermediate2 = partition(intermediate1)
>>> [occurrenceCount(x) for x in intermediate2]
[('one', 2), ('five', 2), ('two', 1), ('three', 3)]


MapReduce

Function partition() groups the (key, value) pairs of intermediate1 by key to produce intermediate2

ch11.py

def partition(intermediate1):
    dct = {}
    # for every list lst of intermediate1
    for lst in intermediate1:
        # for every (key, value) pair in list lst
        for key, value in lst:
            if key in dct:
                dct[key].append(value)
            else:
                dct[key] = [value]
    # return container of (key, values) tuples
    return dct.items()            # return intermediate2
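The partition step can also be written with collections.defaultdict, which removes the explicit test for whether a key is already in the dictionary. This is an alternative sketch, not the version in ch11.py; the function name partition_by_key is hypothetical.

```python
from collections import defaultdict

def partition_by_key(intermediate1):
    'groups the (key, value) pairs in the lists of intermediate1 by key'
    dct = defaultdict(list)          # missing keys start as an empty list
    for lst in intermediate1:
        for key, value in lst:
            dct[key].append(value)
    return list(dct.items())         # list of (key, values) tuples

words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
intermediate1 = [[(word, 1)] for word in words]       # Map step
intermediate2 = partition_by_key(intermediate1)       # Partition step
counts = dict((key, sum(values)) for key, values in intermediate2)  # Reduce

print(intermediate2)
print(counts)
```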


MapReduce abstracted

The MapReduce framework applies to a range of problems and therefore should be abstracted:

ch11.py

def partition(intermediate1):
    # implementation here

class SeqMapReduce(object):
    'a sequential MapReduce implementation'

    def __init__(self, mapper, reducer):
        'functions mapper and reducer are problem specific'
        self.mapper = mapper
        self.reducer = reducer

    def process(self, data):
        'runs MapReduce on data with mapper and reducer functions'
        intermediate1 = [self.mapper(x) for x in data]       # Map
        intermediate2 = partition(intermediate1)
        return [self.reducer(x) for x in intermediate2]      # Reduce

>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
>>> smr = SeqMapReduce(occurrence, occurrenceCount)
>>> smr.process(words)
[('one', 2), ('five', 2), ('two', 1), ('three', 3)]
>>> numbers = [2,3,4,3,2,3,5,4,3,5,1]
>>> smr.process(numbers)
[(1, 1), (2, 2), (3, 4), (4, 2), (5, 2)]

Inverted index problem

Given several text files, we want to know which words appear in which file.

[Figure: three text files. a.txt contains the words Paris, Miami, and Tokyo; b.txt contains Tokyo and Quito; c.txt contains Cairo, Paris, and Quito.]

A solution to the problem could be represented as a mapping that maps each word to the list of files containing it

[('Paris', ['a.txt', 'c.txt']), ('Miami', ['a.txt']), ('Cairo', ['c.txt']), ('Quito', ['b.txt', 'c.txt']), ('Tokyo', ['a.txt', 'b.txt'])]

This mapping is called an inverted index

To apply MapReduce, we need to define the mapper and reducer functions


Inverted index problem

input list:      a.txt, b.txt, c.txt

Map step

intermediate1:   [(Tokyo, a.txt), (Paris, a.txt), (Miami, a.txt)]
                 [(Tokyo, b.txt), (Quito, b.txt)]
                 [(Paris, c.txt), (Cairo, c.txt), (Quito, c.txt)]

Partition step

intermediate2:   (Tokyo, [a.txt, b.txt])
                 (Paris, [a.txt, c.txt])
                 (Miami, [a.txt])
                 (Quito, [b.txt, c.txt])
                 (Cairo, [c.txt])

Reduce step

output list:     (...), (...), (...), (...), (...)


MapReduce

Mapper

from string import punctuation

def getWordsFromFile(file):
    'returns set of items (word, file) for every word in file'
    infile = open(file)
    content = infile.read()
    infile.close()

    # remove punctuation
    transTable = str.maketrans(punctuation, ' '*len(punctuation))
    content = content.translate(transTable)

    # construct set of items (word, file) with no duplicates
    res = set()
    for word in content.split():
        res.add((word, file))
    return res                   # return intermediate1

Reducer

def getWordIndex(keyVal):
    'returns input value'
    return keyVal

intermediate2 is actually the desired list, so the reducer just copies its items to the output list
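The three steps can be run end to end on small files, as in the sketch below. Several things here are assumptions for the sake of a self-contained example: the file contents are made up to match the index shown earlier, the files are written to a temporary directory, the mapper records only the base name of each file, and the functions use hypothetical snake_case names rather than the ones in ch11.py.

```python
import os
import tempfile
from string import punctuation

def get_words_from_file(path):
    'mapper: returns set of (word, file name) pairs for every word in the file'
    with open(path) as infile:
        content = infile.read()
    # replace every punctuation character with a blank
    table = str.maketrans(punctuation, ' ' * len(punctuation))
    words = content.translate(table).split()
    return {(word, os.path.basename(path)) for word in words}

def partition(intermediate1):
    'groups the (key, value) pairs of intermediate1 by key'
    dct = {}
    for lst in intermediate1:
        for key, value in lst:
            dct.setdefault(key, []).append(value)
    return list(dct.items())

def get_word_index(key_val):
    'reducer: returns its input unchanged'
    return key_val

# Hypothetical file contents, chosen to match the index in the slides.
contents = {
    'a.txt': 'Paris, Miami\nTokyo, Miami',
    'b.txt': 'Tokyo\nTokyo, Quito',
    'c.txt': 'Cairo, Paris, Quito.',
}

with tempfile.TemporaryDirectory() as tmpdir:
    paths = []
    for name in sorted(contents):
        path = os.path.join(tmpdir, name)
        with open(path, 'w') as outfile:
            outfile.write(contents[name])
        paths.append(path)

    intermediate1 = [get_words_from_file(p) for p in paths]    # Map
    intermediate2 = partition(intermediate1)                   # Partition
    index = dict(get_word_index(x) for x in intermediate2)     # Reduce

for word in sorted(index):
    print(word, index[word])
```

Because the files are processed in the order a.txt, b.txt, c.txt, the file names in each list appear in that order; the order of the words themselves depends on set iteration, which is why the printout sorts them.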


Module multiprocessing

Standard Library module multiprocessing includes tools that make it possible to execute Python programs in parallel on multi-core machines

How many processor cores does a given computer have? Let’s check:

>>> from multiprocessing import cpu_count
>>> cpu_count()
8

So 8 cores (your computer may have more or fewer)

Class Pool from module multiprocessing can be used to split a problem and execute its pieces in parallel (i.e. at the same time) on separate cores

A Pool object represents a pool of one or more processes, each of which is capable of executing code independently on a processor core

Note: process != core


Class Pool in module multiprocessing

Class Pool from module multiprocessing can be used to split a problem and execute its pieces in parallel.

A Pool object represents a pool of one or more processes, each of which is capable of executing code independently on an available processor core

from multiprocessing import Pool

animals = ['hawk', 'hen', 'hog', 'hyena']

if __name__ == '__main__':          # needed where worker processes import this file (e.g., Windows)
    pool = Pool(2)                  # create pool of 2 processes
    res = pool.map(len, animals)    # apply len() to every animals item
    print(res)                      # print the list of string lengths

parallel.py

Execute this program from an OS shell (not the Python interpreter shell):

> python parallel.py
[4, 3, 3, 5]




The statement

    pool.map(len, animals)

and the statement

    [len(x) for x in animals]

do the same thing: they construct a list by applying len() to every item of list animals

It is how they do it that is different:

• pool.map(len, animals) is executed by 2 processes
• [len(x) for x in animals] is executed by 1 process
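The equivalence of the two statements can be checked directly. The sketch below is illustrative only; it relies on the documented behavior that pool.map() returns its results in the order of the input items, and it uses a with statement so that the pool is cleaned up when the block ends.

```python
from multiprocessing import Pool

animals = ['hawk', 'hen', 'hog', 'hyena']

# sequential: a single process evaluates len() for every item
sequential = [len(x) for x in animals]

if __name__ == '__main__':
    # parallel: the items are distributed among 2 worker processes
    with Pool(2) as pool:
        parallel = pool.map(len, animals)

    # pool.map() preserves the input order, so the two lists are equal
    print(sequential == parallel)
```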



Let’s verify that different processes are handling different list items

from multiprocessing import Pool
from os import getpid

def length(word):
    'returns length of string word'
    # print the id of the process executing the function
    print('Process {} handling {}'.format(getpid(), word))
    return len(word)

# main program
if __name__ == '__main__':
    pool = Pool(2)
    res = pool.map(length, ['hawk', 'hen', 'hog', 'hyena'])
    print(res)

parallel2.py

> python parallel2.py
Process 5129 handling hawk
Process 5130 handling hen
Process 5129 handling hog
Process 5130 handling hyena
[4, 3, 3, 5]

Every process has a unique id


Parallel speedup

The benefit of using a pool of independent processes is that the operating system can schedule them to execute in parallel on separate cores

• This should result in a shorter program running time, i.e., a parallel speedup

To showcase this, let’s consider a computationally intensive problem from number theory: comparing the distribution of prime numbers in several ranges of integers

• Count the number of prime numbers in several equal-size ranges, each containing 100,000 large integers

from os import getpid

def countPrimes(start):
    'returns the number of primes in range [start, start+rng)'
    rng = 100000
    formatStr = 'process {} processing range [{}, {})'
    print(formatStr.format(getpid(), start, start+rng))

    # count the numbers i in range [start, start+rng) that are prime
    # (function isprime() is not shown on this slide)
    return sum([1 for i in range(start, start+rng) if isprime(i)])

primeDensity.py
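Function countPrimes() relies on a function isprime() that is not shown. One possible implementation, given here purely as an assumption, is trial division by odd numbers up to the square root of n; the version actually used to produce the timings on the following slides may differ, and the running times depend on it.

```python
from math import isqrt

def isprime(n):
    'returns True if integer n is prime, False otherwise (trial division)'
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2                # 2 is the only even prime
    # an odd composite n has an odd divisor no larger than sqrt(n)
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

# count the primes below 100
print(sum(1 for i in range(100) if isprime(i)))
```

Trial division takes time proportional to the square root of n for each prime tested, which is what makes counting primes among large integers computationally intensive.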


Parallel speedup

If the Pool contains only 1 process

from multiprocessing import Pool
from time import time

def countPrimes(start): # not shown

if __name__ == '__main__':
    p = Pool(1)
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]

    t1 = time()                          # start time
    print(p.map(countPrimes, starts))
    t2 = time()                          # end time

    p.close()
    print('Time taken: {} seconds.'.format(t2-t1))

primeDensity.py

> python map.py
process 4176 processing range [12345678, 12445678)
process 4176 processing range [23456789, 23556789)
process 4176 processing range [34567890, 34667890)
process 4176 processing range [45678901, 45778901)
process 4176 processing range [56789012, 56889012)
process 4176 processing range [67890123, 67990123)
process 4176 processing range [78901234, 79001234)
process 4176 processing range [89012345, 89112345)
[6185, 5900, 5700, 5697, 5551, 5572, 5462, 5469]
Time taken: 47.84 seconds.


Parallel speedup

If the Pool contains 2 processes

if __name__ == '__main__':
    p = Pool(2)
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]

primeDensity.py

Time taken: 24.60 seconds.

Speedup = sequential time / parallel time = 47.84/24.60 ≈ 1.94

Using 2 processes on 2 cores instead of 1 process on 1 core decreased the running time from 47.84 to 24.60 seconds






Parallel speedup

primeDensity.py

Speedup = 47.84/16.78 ≈ 2.85






Parallel speedup

primeDensity.py

Speedup = 47.84/14.29 ≈ 3.35
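The speedup figures quoted on these slides can be reproduced with a few lines of arithmetic. In this sketch the function name speedup is just an illustrative choice; the running times are the ones reported above, and speedup is computed as the sequential running time divided by the parallel running time.

```python
def speedup(sequential_time, parallel_time):
    'returns how many times faster the parallel run is'
    return sequential_time / parallel_time

sequential_time = 47.84                   # running time with Pool(1), in seconds
parallel_times = [24.60, 16.78, 14.29]    # running times reported for larger pools

for t in parallel_times:
    print('{:5.2f} s -> speedup {:.2f}'.format(t, speedup(sequential_time, t)))
```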

MapReduce in parallel

MapReduce reimplemented using a pool of processes and method map()

from multiprocessing import Pool

class MapReduce(object):
    'a parallel implementation of MapReduce'

    def __init__(self, mapper, reducer, numProcs = None):
        'initializes map and reduce functions and process pool'
        self.mapper = mapper
        self.reducer = reducer
        self.pool = Pool(numProcs)

    def process(self, data):
        'runs MapReduce on sequence data'
        intermediate1 = self.pool.map(self.mapper, data)      # Map
        intermediate2 = partition(intermediate1)
        return self.pool.map(self.reducer, intermediate2)     # Reduce

ch12.py
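One way to see how little separates the sequential and parallel versions is to pass the map function in as a parameter. The sketch below is not the course implementation: map_reduce, group_by_key, letters, and identity are hypothetical names introduced for illustration, and group_by_key stands in for the partition step.

```python
from collections import defaultdict
from multiprocessing import Pool

def group_by_key(intermediate1):
    'hypothetical stand-in for the partition step: groups values by key'
    groups = defaultdict(list)
    for pairs in intermediate1:
        for key, value in pairs:
            groups[key].append(value)
    return list(groups.items())

def map_reduce(mapper, reducer, data, map_fn=map):
    'runs MapReduce on data, using map_fn for the Map and Reduce steps'
    intermediate1 = list(map_fn(mapper, data))
    intermediate2 = group_by_key(intermediate1)
    return list(map_fn(reducer, intermediate2))

def letters(word):
    'mapper: returns a list of pairs (letter, word), one per distinct letter'
    return [(c, word) for c in sorted(set(word))]

def identity(key_val):
    'reducer: returns its input unchanged'
    return key_val

words = ['hen', 'hog']
sequential = map_reduce(letters, identity, words)     # uses built-in map()

if __name__ == '__main__':
    with Pool(2) as pool:                             # uses pool.map()
        parallel = map_reduce(letters, identity, words, pool.map)
    print(sequential == parallel)
```

Only the map function changes between the two calls; the mapper, the grouping step, and the reducer are the same.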


The name cross-checking problem

Tens of thousands of previously classified documents have just been posted on the web. You want to find out which documents mention a particular person, and you want to do that for every person named in one or more documents.

• Assume that people’s names are capitalized, which helps you narrow down the words that can be proper names.

The precise problem is then: given a list of URLs (of the documents), obtain a list of pairs (proper, urlList) in which proper is a capitalized word in any document and urlList is a list of URLs of documents containing proper

In order to use MapReduce, we need to define the map and reduce functions



The map function takes a URL as input and returns a list of tuples (word, URL), one for every distinct capitalized word in the document identified by the URL

from urllib.request import urlopen
from re import findall

def getProperFromURL(url):
    '''returns list of items (word, url) for every capitalized
       word in the document identified by url'''
    content = urlopen(url).read().decode()
    pattern = r"[A-Z][A-Za-z'\-]*"       # RE for capitalized words
    # collect all capitalized words and remove duplicates
    propers = set(findall(pattern, content))

    res = []
    for word in propers:                 # for every capitalized word
        # create pair (word, url) and append to res
        res.append((word, url))
    return res

crosscheck.py
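The behavior of the regular expression can be tried out without downloading anything. In this small sketch the sample sentence is made up for illustration; the pattern is the one used in the map function, written as a raw string.

```python
from re import findall

pattern = r"[A-Z][A-Za-z'\-]*"     # a capital letter, then letters, ' or -

# made-up sample text standing in for a downloaded document
content = "The letter names Mr. Pickwick and Sam Weller's father. Pickwick replied."

propers = set(findall(pattern, content))    # the set drops the repeated word
print(sorted(propers))
```

Note that the sentence opener "The" is matched as well: capitalization only narrows down the candidates for proper names, as stated in the problem description.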



The partition function will, for every capitalized word, collect all tuples (word, url) in every list in intermediate1 to construct list intermediate2 containing pairs (word, [url1, url2, ...])

Since intermediate2 contains the desired result (a mapping of capitalized words to URLs), the reducer function just returns its input

def getWordIndex(keyVal):
    'returns input value'
    return keyVal

crosscheck.py



Let’s compare the sequential and parallel implementations of MapReduce by cross-checking the proper names in eight Charles Dickens novels:

from time import time

if __name__ == '__main__':

    urls = [        # URLs of eight Charles Dickens novels
        'http://www.gutenberg.org/cache/epub/2701/pg2701.txt',
        'http://www.gutenberg.org/cache/epub/1400/pg1400.txt',
        'http://www.gutenberg.org/cache/epub/46/pg46.txt',
        'http://www.gutenberg.org/cache/epub/730/pg730.txt',
        'http://www.gutenberg.org/cache/epub/766/pg766.txt',
        'http://www.gutenberg.org/cache/epub/1023/pg1023.txt',
        'http://www.gutenberg.org/cache/epub/580/pg580.txt',
        'http://www.gutenberg.org/cache/epub/786/pg786.txt']

    t1 = time()     # sequential start time
    SeqMapReduce(getProperFromURL, getWordIndex).process(urls)
    t2 = time()     # sequential stop time, parallel start time
    MapReduce(getProperFromURL, getWordIndex, 4).process(urls)
    t3 = time()     # parallel stop time

    print('Sequential: {:5.2f} seconds.'.format(t2-t1))
    print('Parallel: {:5.2f} seconds.'.format(t3-t2))

crosscheck.py

> python properNames.py
Sequential: 19.89 seconds.
Parallel: 14.81 seconds.
