Data Mining and Open APIs
Toby Segaran
About Me
- Software Developer at Genstruct
  - Work directly with scientists
  - Design algorithms to aid in drug testing
- "Programming Collective Intelligence"
  - Published by O'Reilly
  - Due out in August
- Consult with open-source projects and other companies
- http://kiwitobes.com
Presentation Goals
- Look at some Open APIs
- Get some data
- Visualize algorithms for data mining
- Work through some Python code
- Variety of techniques and sources
- Advocacy (why you should care)
Open data APIs
- Zillow
- eBay
- Facebook
- del.icio.us
- HotOrNot
- Upcoming
- Yahoo Answers
- Amazon
- Technorati
- Twitter
- Google News

See programmableweb.com/apis for more…
Open API uses
- Mashups
- Integration
- Automation
- Command-line tools
- Most importantly, creating datasets!
What is data mining?
From a large dataset, find the:
- Implicit
- Unknown
- Useful

Data could be:
- Tabular, e.g. price lists
- Free text
- Pictures
Why it’s important now
- More devices produce more data
- People share more data
- The internet is vast
- Products are more customized
- Advertising is targeted
- Human cognition is limited
Traditional Applications
- Computational Biology
- Financial Markets
- Retail Markets
- Fraud Detection
- Surveillance
- Supply Chain Optimization
- National Security
Traditional = Inaccessible
- Real applications are esoteric
- Tutorial examples are trivial
- Generally lacking in "interest value"
Fun, Accessible Applications
- Home price modeling
- Where are the hottest people?
- Which bloggers are similar?
- Important attributes on eBay
- Predicting fashion trends
- Movie popularity
Zillow
The Zillow API
- Allows querying by address
- Returns information about the property:
  - Bedrooms
  - Bathrooms
  - Zip Code
  - Price Estimate
  - Last Sale Price
- Requires registration key

http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm
The Zillow API
REST Request
http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id=key&address=address&citystatezip=citystatezip
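As a sketch, that query string can be assembled with the standard library (modern Python here; `YOUR-KEY` and the address are placeholder values, and the endpoint shown on the slide may no longer be live):

```python
from urllib.parse import urlencode

# Placeholder parameters -- substitute a real zws-id key and address
params = {
    'zws-id': 'YOUR-KEY',
    'address': '2114 Bigelow Ave N',
    'citystatezip': 'Seattle, WA',
}
url = ('http://www.zillow.com/webservice/GetDeepSearchResults.htm?'
       + urlencode(params))
print(url)
```

`urlencode` takes care of escaping spaces and punctuation in the address for you.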
The Zillow API

<SearchResults:searchresults xmlns:SearchResults="http://www.zillow.com/vstatic/3/static/xsd/SearchResults.xsd">
  …
  <response><results><result>
    <zpid>48749425</zpid>
    <links>…</links>
    <address>
      <street>2114 Bigelow Ave N</street>
      <zipcode>98109</zipcode>
      <city>SEATTLE</city>
      <state>WA</state>
      <latitude>47.637934</latitude>
      <longitude>-122.347936</longitude>
    </address>
    <yearBuilt>1924</yearBuilt>
    <lotSizeSqFt>4680</lotSizeSqFt>
    <finishedSqFt>3290</finishedSqFt>
    <bathrooms>2.75</bathrooms>
    <bedrooms>4</bedrooms>
    <lastSoldDate>06/18/2002</lastSoldDate>
    <lastSoldPrice currency="USD">770000</lastSoldPrice>
    <valuation><amount currency="USD">1091061</amount></valuation>
  </result></results></response>
Zillow from Python

import urllib2
import xml.dom.minidom

def getaddressdata(address,city):
    escad=address.replace(' ','+')

    # Construct the URL (zwskey is your Zillow registration key)
    url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?'
    url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city)

    # Parse resulting XML
    doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read())
    code=doc.getElementsByTagName('code')[0].firstChild.data

    # Code 0 means success, otherwise there was an error
    if code!='0': return None

    # Extract the info about this property
    try:
        zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data
        use=doc.getElementsByTagName('useCode')[0].firstChild.data
        year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data
        bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data
        bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data
        rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data
        price=doc.getElementsByTagName('amount')[0].firstChild.data
    except:
        return None

    return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
A home price dataset

| House | Zip   | Bathrooms | Bedrooms | Built | Type    | Price   |
|-------|-------|-----------|----------|-------|---------|---------|
| A     | 02138 | 1.5       | 2        | 1847  | Single  | 505296  |
| B     | 02139 | 3.5       | 9        | 1916  | Triplex | 776378  |
| C     | 02140 | 3.5       | 4        | 1894  | Duplex  | 595027  |
| D     | 02139 | 2.5       | 4        | 1854  | Duplex  | 552213  |
| E     | 02138 | 3.5       | 5        | 1909  | Duplex  | 947528  |
| F     | 02138 | 3.5       | 4        | 1930  | Single  | 2107871 |

etc…
What can we learn?

- A made-up house's price
- How important is Zip Code?
- What are the important attributes?
- Can we do better than averages?
Introducing Regression Trees

| A  | B      | Value |
|----|--------|-------|
| 18 | Circle | 6     |
| 22 | Square | 8     |
| 11 | Square | 22    |
| 10 | Circle | 20    |
Minimizing deviation

- Standard deviation is the "spread" of results
- Try all possible divisions
- Choose the division that decreases deviation the most

| A  | B      | Value |
|----|--------|-------|
| 18 | Circle | 6     |
| 22 | Square | 8     |
| 11 | Square | 22    |
| 10 | Circle | 20    |

Initially: Average = 14, Standard Deviation = 8.2
Split on B = Circle: Average = 13, Standard Deviation = 9.9
Split on B = Square: Average = 15, Standard Deviation = 9.9
Split on A > 18: Average = 8, Standard Deviation = 0
Split on A <= 18: Average = 16, Standard Deviation = 8.7
Split on A > 11: Average = 7, Standard Deviation = 1.4
Split on A <= 11: Average = 21, Standard Deviation = 1.4
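The averages and deviations on these slides can be reproduced in a few lines (a sketch using modern Python's `statistics` module; the slides appear to use the sample standard deviation):

```python
from statistics import mean, stdev

# The toy table from the slides: (A, B, Value)
rows = [(18, 'Circle', 6), (22, 'Square', 8),
        (11, 'Square', 22), (10, 'Circle', 20)]

values = [v for _, _, v in rows]
print(mean(values), round(stdev(values), 1))    # initially: 14 8.2

# The winning division, A > 11
hi = [v for a, _, v in rows if a > 11]
lo = [v for a, _, v in rows if a <= 11]
print(mean(hi), round(stdev(hi), 1))            # 7 1.4
print(mean(lo), round(stdev(lo), 1))            # 21 1.4
```

The A > 11 division leaves both groups with a far smaller spread than any other candidate, so it wins.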
Python Code

def variance(rows):
    if len(rows)==0: return 0
    data=[float(row[len(row)-1]) for row in rows]
    mean=sum(data)/len(data)
    variance=sum([(d-mean)**2 for d in data])/len(data)
    return variance

def divideset(rows,column,value):
    # Make a function that tells us if a row is in
    # the first group (true) or the second group (false)
    split_function=None
    if isinstance(value,int) or isinstance(value,float):
        split_function=lambda row:row[column]>=value
    else:
        split_function=lambda row:row[column]==value

    # Divide the rows into two sets and return them
    set1=[row for row in rows if split_function(row)]
    set2=[row for row in rows if not split_function(row)]
    return (set1,set2)
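A quick sanity check of the two helpers on the toy table (the function bodies are restated so this sketch runs on its own; note `variance` here is the population variance, as on the slide, not the sample deviation used in the worked example):

```python
def variance(rows):
    if len(rows) == 0: return 0
    data = [float(row[-1]) for row in rows]
    mean = sum(data) / len(data)
    return sum((d - mean) ** 2 for d in data) / len(data)

def divideset(rows, column, value):
    # Numeric columns split on >=, everything else on equality
    if isinstance(value, (int, float)):
        split_function = lambda row: row[column] >= value
    else:
        split_function = lambda row: row[column] == value
    set1 = [row for row in rows if split_function(row)]
    set2 = [row for row in rows if not split_function(row)]
    return set1, set2

rows = [[18, 'Circle', 6], [22, 'Square', 8],
        [11, 'Square', 22], [10, 'Circle', 20]]

set1, set2 = divideset(rows, 0, 18)        # split on A >= 18
print(variance(rows))                      # 50.0
print(variance(set1), variance(set2))      # 1.0 1.0
```

Splitting on A >= 18 collapses the variance from 50 to 1 on each side, which is exactly what the tree builder is hunting for.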
CART Algorithm
| A  | B      | Value |
|----|--------|-------|
| 18 | Circle | 6     |
| 22 | Square | 8     |
| 11 | Square | 22    |
| 10 | Circle | 20    |
CART Algorithm
A <= 11:

| A  | B      | Value |
|----|--------|-------|
| 11 | Square | 22    |
| 10 | Circle | 20    |

A > 11:

| A  | B      | Value |
|----|--------|-------|
| 18 | Circle | 6     |
| 22 | Square | 8     |
Python Code

def buildtree(rows,scoref=variance):
    if len(rows)==0: return decisionnode()
    current_score=scoref(rows)

    # Set up some variables to track the best criteria
    best_gain=0.0
    best_criteria=None
    best_sets=None

    column_count=len(rows[0])-1
    for col in range(0,column_count):
        # Generate the list of different values in this column
        column_values={}
        for row in rows:
            column_values[row[col]]=1
        # Now try dividing the rows up for each value in this column
        for value in column_values.keys():
            (set1,set2)=divideset(rows,col,value)

            # Information gain
            p=float(len(set1))/len(rows)
            gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
            if gain>best_gain and len(set1)>0 and len(set2)>0:
                best_gain=gain
                best_criteria=(col,value)
                best_sets=(set1,set2)

    # Create the sub branches
    if best_gain>0:
        trueBranch=buildtree(best_sets[0])
        falseBranch=buildtree(best_sets[1])
        return decisionnode(col=best_criteria[0],value=best_criteria[1],
                            tb=trueBranch,fb=falseBranch)
    else:
        return decisionnode(results=uniquecounts(rows))
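`buildtree` leans on a `decisionnode` class and a `uniquecounts` helper that the slides don't show (they come from the book's `treepredict` module). A minimal sketch of those missing pieces, with the slide's functions restated in compact form, is enough to grow a tree over the toy table:

```python
class decisionnode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # column index tested at this node
        self.value = value      # value the column is compared against
        self.results = results  # leaf nodes only: dict of outcome counts
        self.tb = tb            # branch followed when the test is true
        self.fb = fb            # branch followed when the test is false

def uniquecounts(rows):
    # Count occurrences of each result (the last column)
    results = {}
    for row in rows:
        results[row[-1]] = results.get(row[-1], 0) + 1
    return results

def variance(rows):
    if len(rows) == 0: return 0
    data = [float(row[-1]) for row in rows]
    mean = sum(data) / len(data)
    return sum((d - mean) ** 2 for d in data) / len(data)

def divideset(rows, column, value):
    if isinstance(value, (int, float)):
        split_function = lambda row: row[column] >= value
    else:
        split_function = lambda row: row[column] == value
    set1 = [row for row in rows if split_function(row)]
    set2 = [row for row in rows if not split_function(row)]
    return set1, set2

def buildtree(rows, scoref=variance):
    if len(rows) == 0: return decisionnode()
    current_score = scoref(rows)
    best_gain, best_criteria, best_sets = 0.0, None, None
    column_count = len(rows[0]) - 1
    for col in range(column_count):
        column_values = {row[col]: 1 for row in rows}
        for value in column_values:
            set1, set2 = divideset(rows, col, value)
            p = float(len(set1)) / len(rows)
            gain = current_score - p * scoref(set1) - (1 - p) * scoref(set2)
            if gain > best_gain and len(set1) > 0 and len(set2) > 0:
                best_gain, best_criteria, best_sets = gain, (col, value), (set1, set2)
    if best_gain > 0:
        return decisionnode(col=best_criteria[0], value=best_criteria[1],
                            tb=buildtree(best_sets[0]), fb=buildtree(best_sets[1]))
    return decisionnode(results=uniquecounts(rows))

tree = buildtree([[18, 'Circle', 6], [22, 'Square', 8],
                  [11, 'Square', 22], [10, 'Circle', 20]])
print(tree.col, tree.value)   # 0 18 -- the root splits on column A at 18
```

The root test (A >= 18) produces the same partition as the A > 11 division from the deviation slides, since no data point falls between 11 and 18.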
Zillow Results

(Regression tree diagram: the root split is Bathrooms > 3; deeper splits include Zip: 02139?, After 1903?, Zip: 02140?, Bedrooms > 4?, Duplex?, and Triplex?)
Just for Fun… Hot or Not
Supervised and Unsupervised

Regression trees are supervised:
- "answers" are in the dataset
- tree models predict answers

Some methods are unsupervised:
- there are no answers
- methods just characterize the data
- show interesting patterns
Next challenge - Bloggers

- Millions of blogs online
- Usually focus on a subject area
- Can they be characterized automatically?
- … using only the words in the posts?
The Technorati Top 100
A single blog
Getting the content
- Use Mark Pilgrim's Universal Feed Parser
- Retrieve the post titles and text
- Split up the words
- Count occurrence of each word
Python Code

import feedparser
import re

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
    # Parse the feed
    d=feedparser.parse(url)
    wc={}
    # Loop over all the entries
    for e in d.entries:
        if 'summary' in e: summary=e.summary
        else: summary=e.description
        # Extract a list of words
        words=getwords(e.title+' '+summary)
        for word in words:
            wc.setdefault(word,0)
            wc[word]+=1
    return d.feed.title,wc

def getwords(html):
    # Remove all the HTML tags
    txt=re.compile(r'<[^>]+>').sub('',html)
    # Split words by all non-alpha characters
    words=re.compile(r'[^A-Z^a-z]+').split(txt)
    # Convert to lowercase
    return [word.lower() for word in words if word!='']
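`getwords` is easy to exercise on its own. A sketch (with the slide's character class `[^A-Z^a-z]+` tidied to `[^A-Za-z]+`; the stray `^` inside the original class also made the caret itself a word separator, which looks unintended):

```python
import re

def getwords(html):
    # Remove all the HTML tags
    txt = re.sub(r'<[^>]+>', '', html)
    # Split words by all non-alpha characters
    words = re.split(r'[^A-Za-z]+', txt)
    # Convert to lowercase
    return [word.lower() for word in words if word != '']

print(getwords('<p>Hello, <b>Data</b> Mining 2007!</p>'))
# ['hello', 'data', 'mining']
```

Numbers and punctuation simply disappear, which is fine for the rough word counts this technique needs.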
Building a Word Matrix
- Build a matrix of word counts
- Blogs are rows, words are columns
- Eliminate words that are:
  - Too common
  - Too rare
Python Code

apcount={}
wordcounts={}
feedlist=[line.strip() for line in file('feedlist.txt')]
for feedurl in feedlist:
    title,wc=getwordcounts(feedurl)
    wordcounts[title]=wc
    for word,count in wc.items():
        apcount.setdefault(word,0)
        if count>1:
            apcount[word]+=1

wordlist=[]
for w,bc in apcount.items():
    frac=float(bc)/len(feedlist)
    if frac>0.1 and frac<0.5: wordlist.append(w)

out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
    out.write(blog)
    for word in wordlist:
        if word in wc: out.write('\t%d' % wc[word])
        else: out.write('\t0')
    out.write('\n')
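The too-common/too-rare filter is the interesting step; in isolation it looks like this (toy appearance counts over 10 hypothetical feeds, invented words):

```python
# apcount maps each word to the number of feeds it appeared in
apcount = {'the': 10, 'china': 3, 'kids': 2, 'zyzzyva': 0}
feedcount = 10

wordlist = []
for w, bc in apcount.items():
    frac = float(bc) / feedcount
    if 0.1 < frac < 0.5:        # drop words in >50% or <10% of feeds
        wordlist.append(w)

print(sorted(wordlist))          # ['china', 'kids']
```

Stopwords like "the" appear everywhere and carry no signal; one-off rarities carry too little, so both ends are trimmed.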
![Page 53: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/53.jpg)
Python Codeapcount={}wordcounts={}for feedurl in file('feedlist.txt'):
title,wc=getwordcounts(feedurl)wordcounts[title]=wcfor word,count in wc.items():
apcount.setdefault(word,0)if count>1:
apcount[word]+=1
wordlist=[]for w,bc in apcount.items():
frac=float(bc)/len(feedlist)if frac>0.1 and frac<0.5: wordlist.append(w)
out=file('blogdata.txt','w')out.write('Blog')for word in wordlist: out.write('\t%s' % word)out.write('\n')for blog,wc in wordcounts.items():
out.write(blog)for word in wordlist:
if word in wc: out.write('\t%d' % wc[word])else: out.write('\t0')
out.write('\n')
for feedurl in file('feedlist.txt'):title,wc=getwordcounts(feedurl)wordcounts[title]=wcfor word,count in wc.items():
apcount.setdefault(word,0)if count>1:
apcount[word]+=1
![Page 54: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/54.jpg)
Python Codeapcount={}wordcounts={}for feedurl in file('feedlist.txt'):
title,wc=getwordcounts(feedurl)wordcounts[title]=wcfor word,count in wc.items():
apcount.setdefault(word,0)if count>1:
apcount[word]+=1
wordlist=[]for w,bc in apcount.items():
frac=float(bc)/len(feedlist)if frac>0.1 and frac<0.5: wordlist.append(w)
out=file('blogdata.txt','w')out.write('Blog')for word in wordlist: out.write('\t%s' % word)out.write('\n')for blog,wc in wordcounts.items():
out.write(blog)for word in wordlist:
if word in wc: out.write('\t%d' % wc[word])else: out.write('\t0')
out.write('\n')
wordlist=[]for w,bc in apcount.items():frac=float(bc)/len(feedlist)if frac>0.1 and frac<0.5: wordlist.append(w)
![Page 55: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/55.jpg)
Python Code
out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
  out.write(blog)
  for word in wordlist:
    if word in wc: out.write('\t%d' % wc[word])
    else: out.write('\t0')
  out.write('\n')
![Page 56: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/56.jpg)
The Word Matrix
                   "yahoo"  "music"  "kids"  "china"
Quick Online Tips       12        2       2        0
GigaOM                   2        1       0        6
Gothamist                0        3       3        0
![Page 57: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/57.jpg)
Determining distance
                   "yahoo"  "music"  "kids"  "china"
Quick Online Tips       12        2       2        0
GigaOM                   2        1       0        6
Gothamist                0        3       3        0

Euclidean "as the crow flies"

e.g. GigaOM vs. Quick Online Tips:
sqrt((6-0)^2 + (0-2)^2 + (1-2)^2 + (2-12)^2) = 12 (approx)
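Spelled out as code, the calculation above is a one-line helper (a sketch; the function name is mine, not necessarily the talk's):

```python
from math import sqrt

def euclidean(v1, v2):
    # Straight-line distance between two equal-length word-count vectors
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Rows from the word matrix above: [yahoo, music, kids, china]
print(euclidean([2, 1, 0, 6], [12, 2, 2, 0]))  # about 11.87, i.e. roughly 12
```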
![Page 58: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/58.jpg)
Other Distance Metrics
Manhattan
Tanimoto
Pearson Correlation
Chebyshev
Spearman
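The clustering code on the following slides defaults to `distance=pearson`, which is not itself shown on these slides. Here is a hedged sketch of a Pearson-based distance along the lines of the book's version (1 minus the correlation, so smaller values mean more similar vectors):

```python
from math import sqrt

def pearson(v1, v2):
    # 1 - Pearson correlation: 0.0 for perfectly correlated vectors,
    # larger values for less similar ones.
    n = float(len(v1))
    sum1, sum2 = sum(v1), sum(v2)
    sum1sq = sum(x ** 2 for x in v1)
    sum2sq = sum(x ** 2 for x in v2)
    psum = sum(a * b for a, b in zip(v1, v2))
    num = psum - sum1 * sum2 / n
    den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
    if den == 0: return 0.0
    return 1.0 - num / den
```

Using correlation rather than raw distance means two blogs with the same word pattern at different overall volumes still count as close.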
![Page 59: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/59.jpg)
Hierarchical Clustering
Find the two closest items
Combine them into a single item
Repeat…
![Page 60: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/60.jpg)
Hierarchical Algorithm
![Page 61: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/61.jpg)
Hierarchical Algorithm
![Page 62: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/62.jpg)
Hierarchical Algorithm
![Page 63: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/63.jpg)
Hierarchical Algorithm
![Page 64: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/64.jpg)
Hierarchical Algorithm
![Page 65: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/65.jpg)
Dendrogram
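Before rendering a graphical dendrogram, the merge tree can be inspected as indented text. A sketch (the `clustlines` name and the list-of-strings return value are mine), assuming the `bicluster` attributes defined on the next slide:

```python
def clustlines(clust, labels=None, n=0):
    # Indentation reflects depth; merged clusters (negative ids) print as '-'
    name = '-' if clust.id < 0 else (
        str(clust.id) if labels is None else labels[clust.id])
    lines = [' ' * n + name]
    if clust.left is not None:
        lines += clustlines(clust.left, labels, n + 1)
    if clust.right is not None:
        lines += clustlines(clust.right, labels, n + 1)
    return lines
```

Printing `'\n'.join(clustlines(hcluster(rows), blognames))` would give a rough text-mode view of the same tree the dendrogram draws.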
![Page 66: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/66.jpg)
Python Code
class bicluster:
  def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
    self.left=left
    self.right=right
    self.vec=vec
    self.id=id
    self.distance=distance
![Page 67: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/67.jpg)
Python Code

def hcluster(rows,distance=pearson):
  distances={}
  currentclustid=-1

  # Clusters are initially just the rows
  clust=[bicluster(rows[i],id=i) for i in range(len(rows))]

  while len(clust)>1:
    lowestpair=(0,1)
    closest=distance(clust[0].vec,clust[1].vec)

    # loop through every pair looking for the smallest distance
    for i in range(len(clust)):
      for j in range(i+1,len(clust)):
        # distances is the cache of distance calculations
        if (clust[i].id,clust[j].id) not in distances:
          distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
        d=distances[(clust[i].id,clust[j].id)]
        if d<closest:
          closest=d
          lowestpair=(i,j)

    # calculate the average of the two clusters
    mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
              for i in range(len(clust[0].vec))]

    # create the new cluster
    newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                         right=clust[lowestpair[1]],
                         distance=closest,id=currentclustid)

    # cluster ids that weren’t in the original set are negative
    currentclustid-=1
    del clust[lowestpair[1]]
    del clust[lowestpair[0]]
    clust.append(newcluster)

  return clust[0]
![Page 68: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/68.jpg)
Python Code
distances={}
currentclustid=-1

# Clusters are initially just the rows
clust=[bicluster(rows[i],id=i) for i in range(len(rows))]
![Page 69: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/69.jpg)
Python Code
while len(clust)>1:
  lowestpair=(0,1)
  closest=distance(clust[0].vec,clust[1].vec)

  # loop through every pair looking for the smallest distance
  for i in range(len(clust)):
    for j in range(i+1,len(clust)):
      # distances is the cache of distance calculations
      if (clust[i].id,clust[j].id) not in distances:
        distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
      d=distances[(clust[i].id,clust[j].id)]
      if d<closest:
        closest=d
        lowestpair=(i,j)
![Page 70: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/70.jpg)
Python Code
# calculate the average of the two clusters
mergevec=[
  (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
  for i in range(len(clust[0].vec))
]

# create the new cluster
newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                     right=clust[lowestpair[1]],
                     distance=closest,id=currentclustid)

del clust[lowestpair[1]]
del clust[lowestpair[0]]
clust.append(newcluster)
![Page 71: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/71.jpg)
Hierarchical Blog Clusters
![Page 72: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/72.jpg)
Hierarchical Blog Clusters
![Page 73: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/73.jpg)
Hierarchical Blog Clusters
![Page 74: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/74.jpg)
Rotating the Matrix
Words in a blog -> blogs containing each word
           Quick Onl  GigaOM  Gothamist
"yahoo"           12       2          0
"music"            2       1          3
"kids"             2       0          3
"china"            0       6          0
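The rotation is a plain matrix transpose; a short sketch (the `rotatematrix` name follows the book's convention, but is an assumption here):

```python
def rotatematrix(data):
    # Entry [i][j] becomes entry [j][i]: blog rows become word rows
    return [[data[i][j] for i in range(len(data))]
            for j in range(len(data[0]))]

blogdata = [[12, 2, 2, 0],  # Quick Online Tips
            [2, 1, 0, 6],   # GigaOM
            [0, 3, 3, 0]]   # Gothamist
worddata = rotatematrix(blogdata)  # worddata[0] is the "yahoo" row: [12, 2, 0]
```

Running the same clustering code on `worddata` then groups words by the blogs that use them.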
![Page 75: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/75.jpg)
Hierarchical Word Clusters
![Page 76: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/76.jpg)
K-Means Clustering
Divides data into distinct clusters
User determines how many
Algorithm
  Start with arbitrary centroids
  Assign points to centroids
  Move the centroids
  Repeat
![Page 77: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/77.jpg)
K-Means Algorithm
![Page 78: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/78.jpg)
K-Means Algorithm
![Page 79: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/79.jpg)
K-Means Algorithm
![Page 80: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/80.jpg)
K-Means Algorithm
![Page 81: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/81.jpg)
K-Means Algorithm
![Page 82: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/82.jpg)
Python Code

import random

def kcluster(rows,distance=pearson,k=4):
  # Determine the minimum and maximum values for each point
  ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
          for i in range(len(rows[0]))]

  # Create k randomly placed centroids
  clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
             for i in range(len(rows[0]))] for j in range(k)]

  lastmatches=None
  for t in range(100):
    print 'Iteration %d' % t
    bestmatches=[[] for i in range(k)]

    # Find which centroid is the closest for each row
    for j in range(len(rows)):
      row=rows[j]
      bestmatch=0
      for i in range(k):
        d=distance(clusters[i],row)
        if d<distance(clusters[bestmatch],row): bestmatch=i
      bestmatches[bestmatch].append(j)

    # If the results are the same as last time, this is complete
    if bestmatches==lastmatches: break
    lastmatches=bestmatches

    # Move the centroids to the average of their members
    for i in range(k):
      avgs=[0.0]*len(rows[0])
      if len(bestmatches[i])>0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches
![Page 83: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/83.jpg)
Python Code
# Determine the minimum and maximum values for each point
ranges=[(min([row[i] for row in rows]),
         max([row[i] for row in rows])) for i in range(len(rows[0]))]

# Create k randomly placed centroids
clusters=[[random.random()*
           (ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))]
          for j in range(k)]
![Page 84: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/84.jpg)
Python Code
for t in range(100):
  bestmatches=[[] for i in range(k)]

  # Find which centroid is the closest for each row
  for j in range(len(rows)):
    row=rows[j]
    bestmatch=0
    for i in range(k):
      d=distance(clusters[i],row)
      if d<distance(clusters[bestmatch],row): bestmatch=i
    bestmatches[bestmatch].append(j)
![Page 85: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/85.jpg)
Python Code
# If the results are the same as last time, this is complete
if bestmatches==lastmatches: break
lastmatches=bestmatches
![Page 86: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/86.jpg)
Python Code
# Move the centroids to the average of their members
for i in range(k):
  avgs=[0.0]*len(rows[0])
  if len(bestmatches[i])>0:
    for rowid in bestmatches[i]:
      for m in range(len(rows[rowid])):
        avgs[m]+=rows[rowid][m]
    for j in range(len(avgs)):
      avgs[j]/=len(bestmatches[i])
    clusters[i]=avgs
![Page 87: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/87.jpg)
K-Means Results
>> [rownames[r] for r in k[0]]
['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman',
 'ProBlogger Blog Tips', "Seth's Blog"]

>> [rownames[r] for r in k[1]]
['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']
![Page 88: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/88.jpg)
2D Visualizations
Instead of Clusters, a 2D Map
Goals
  Preserve distances as much as possible
  Draw in two dimensions
Dimension Reduction
  Principal Components Analysis
  Multidimensional Scaling
![Page 89: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/89.jpg)
Multidimensional Scaling
![Page 90: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/90.jpg)
Multidimensional Scaling
![Page 91: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/91.jpg)
Multidimensional Scaling
![Page 92: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/92.jpg)
import random
from math import sqrt

def scaledown(data,distance=pearson,rate=0.01):
  n=len(data)

  # The real distances between every pair of items
  realdist=[[distance(data[i],data[j]) for j in range(n)]
            for i in range(0,n)]

  # Randomly initialize the starting points of the locations in 2D
  loc=[[random.random(),random.random()] for i in range(n)]
  fakedist=[[0.0 for j in range(n)] for i in range(n)]

  lasterror=None
  for m in range(0,1000):
    # Find projected distances
    for i in range(n):
      for j in range(n):
        fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                 for x in range(len(loc[i]))]))

    # Move points
    grad=[[0.0,0.0] for i in range(n)]

    totalerror=0
    for k in range(n):
      for j in range(n):
        if j==k: continue
        # The error is percent difference between the distances
        errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

        # Each point needs to be moved away from or towards the other
        # point in proportion to how much error it has
        grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
        grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm

        # Keep track of the total error
        totalerror+=abs(errorterm)
    print totalerror

    # If the answer got worse by moving the points, we are done
    if lasterror and lasterror<totalerror: break
    lasterror=totalerror

    # Move each of the points by the learning rate times the gradient
    for k in range(n):
      loc[k][0]-=rate*grad[k][0]
      loc[k][1]-=rate*grad[k][1]

  return loc
![Page 93: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/93.jpg)
n=len(data)

# The real distances between every pair of items
realdist=[[distance(data[i],data[j]) for j in range(n)]
          for i in range(0,n)]
![Page 94: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/94.jpg)
# Randomly initialize the starting points of the locations in 2D
loc=[[random.random(),random.random()] for i in range(n)]
fakedist=[[0.0 for j in range(n)] for i in range(n)]
![Page 95: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/95.jpg)
lasterror=None
for m in range(0,1000):
  # Find projected distances
  for i in range(n):
    for j in range(n):
      fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                               for x in range(len(loc[i]))]))
![Page 96: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/96.jpg)
# Move points
grad=[[0.0,0.0] for i in range(n)]

totalerror=0
for k in range(n):
  for j in range(n):
    if j==k: continue
    # The error is percent difference between the distances
    errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

    # Each point needs to be moved away from or towards the
    # other point in proportion to how much error it has
    grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
    grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm

    # Keep track of the total error
    totalerror+=abs(errorterm)
![Page 97: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/97.jpg)
# If the answer got worse by moving the points, we are done
if lasterror and lasterror<totalerror: break
lasterror=totalerror
![Page 98: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/98.jpg)
# Move each of the points by the learning rate times the gradient
for k in range(n):
  loc[k][0]-=rate*grad[k][0]
  loc[k][1]-=rate*grad[k][1]
![Page 99: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/99.jpg)
![Page 100: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/100.jpg)
![Page 101: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/101.jpg)
![Page 102: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/102.jpg)
![Page 103: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/103.jpg)
Numerical Predictions
Back to “supervised” learning
We have a set of numerical attributes
  Specs for a laptop
  Age and rating for wine
  Ratios for a stock
Want to predict another attribute
  Formula/model is unknown
  e.g. price
![Page 104: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/104.jpg)
Regression Trees?
Regression trees find hard boundaries
Can’t deal with complex formulae
![Page 105: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/105.jpg)
Statistical regression
Requires specification of a model
Usually linear
Doesn’t handle context
![Page 106: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/106.jpg)
Alternative - Interpolation
Find “similar” items
Guess price based on similar items
Need to determine:
What is similar?
How should we aggregate prices?
![Page 107: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/107.jpg)
Price Data from EBay
![Page 108: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/108.jpg)
The eBay API
XML API
Send XML over HTTPS
Receive results in XML
http://developer.ebay.com/quickstartguide.
![Page 109: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/109.jpg)
Some Python Code
def sendRequest(apicall,xmlparameters):
    connection = httplib.HTTPSConnection(serverUrl)
    connection.request("POST", '/ws/api.dll', xmlparameters, getHeaders(apicall))
    response = connection.getresponse()
    if response.status != 200:
        print "Error sending request:" + response.reason
    else:
        data = response.read()
        connection.close()
        return data

def getHeaders(apicall,siteID="0",compatabilityLevel = "433"):
    headers = {"X-EBAY-API-COMPATIBILITY-LEVEL": compatabilityLevel,
               "X-EBAY-API-DEV-NAME": devKey,
               "X-EBAY-API-APP-NAME": appKey,
               "X-EBAY-API-CERT-NAME": certKey,
               "X-EBAY-API-CALL-NAME": apicall,
               "X-EBAY-API-SITEID": siteID,
               "Content-Type": "text/xml"}
    return headers
![Page 110: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/110.jpg)
Some Python Code

def getItem(itemID):
    xml = "<?xml version='1.0' encoding='utf-8'?>"+\
          "<GetItemRequest xmlns=\"urn:ebay:apis:eBLBaseComponents\">"+\
          "<RequesterCredentials><eBayAuthToken>" +\
          userToken +\
          "</eBayAuthToken></RequesterCredentials>" + \
          "<ItemID>" + str(itemID) + "</ItemID>"+\
          "<DetailLevel>ItemReturnAttributes</DetailLevel>"+\
          "</GetItemRequest>"
    data=sendRequest('GetItem',xml)
    result={}
    response=parseString(data)
    result['title']=getSingleValue(response,'Title')
    sellingStatusNode = response.getElementsByTagName('SellingStatus')[0]
    result['price']=getSingleValue(sellingStatusNode,'CurrentPrice')
    result['bids']=getSingleValue(sellingStatusNode,'BidCount')
    seller = response.getElementsByTagName('Seller')
    result['feedback'] = getSingleValue(seller[0],'FeedbackScore')
    attributeSet=response.getElementsByTagName('Attribute')
    attributes={}
    for att in attributeSet:
        attID=att.attributes.getNamedItem('attributeID').nodeValue
        attValue=getSingleValue(att,'ValueLiteral')
        attributes[attID]=attValue
    result['attributes']=attributes
    return result
![Page 111: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/111.jpg)
Building an item table
|           | RAM  | CPU  | HDD | DVD | Screen | Price |
|-----------|------|------|-----|-----|--------|-------|
| Pavillion | 1024 | 1600 | 120 | 1   | 17     | $800  |
| T22       | 256  | 900  | 20  | 1   | 14     | $200  |
| Lenovo    | 160  | 300  | 5   | 0   | 13     | $80   |
| D600      | 512  | 1400 | 40  | 1   | 14     | $350  |

etc..
![Page 112: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/112.jpg)
Distance between items
|     | RAM | CPU  | HDD | DVD | Screen | Price |
|-----|-----|------|-----|-----|--------|-------|
| New | 512 | 1400 | 40  | 1   | 14     | ???   |
| T22 | 256 | 900  | 20  | 1   | 14     | $200  |

Euclidean, just like in clustering:

√((512−256)² + (1400−900)² + (40−20)² + (14−14)² + (1−1)²)
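This distance can be computed directly; a minimal sketch applying it to the New and T22 rows above (the `euclidean` helper and the attribute order RAM, CPU, HDD, screen, DVD are taken from the formula):

```python
from math import sqrt

def euclidean(v1, v2):
    # square root of the sum of squared differences
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

new = (512, 1400, 40, 14, 1)
t22 = (256, 900, 20, 14, 1)
d = euclidean(new, t22)  # roughly 562.1
```

The RAM and CPU terms dominate because their raw values are largest, which is exactly the scaling problem discussed later.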
![Page 113: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/113.jpg)
Idea 1 – use the closest item
With the item whose price I want to guess:
Calculate the distance for every item in my dataset
Guess that the price is the same as the closest item

This is called kNN with k=1
![Page 114: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/114.jpg)
Problems with “outliers”
The closest item may be anomalous
Why?
Exceptional deal that won’t occur again
Something missing from the dataset
Data errors
![Page 115: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/115.jpg)
Using an average
|       | RAM  | CPU  | HDD | DVD | Screen | Price |
|-------|------|------|-----|-----|--------|-------|
| New   | 512  | 1400 | 40  | 1   | 14     | ???   |
| No. 1 | 512  | 1400 | 30  | 1   | 13     | $360  |
| No. 2 | 512  | 1400 | 60  | 1   | 14     | $400  |
| No. 3 | 1024 | 1600 | 120 | 0   | 15     | $325  |

k=3, estimate = ($360 + $400 + $325) / 3 ≈ $361
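The k=3 estimate above is a plain average of the three nearest prices; a minimal sketch reproducing it (the `euclidean` helper and the `{'input': ..., 'result': ...}` data layout are assumed to match the code on the later slides):

```python
from math import sqrt

def euclidean(v1, v2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def getdistances(data, vec1):
    # (distance, index) pairs, nearest first
    return sorted((euclidean(row['input'], vec1), i)
                  for i, row in enumerate(data))

def knnestimate(data, vec1, k=3):
    dlist = getdistances(data, vec1)
    # average the prices of the k nearest items
    return sum(data[idx]['result'] for dist, idx in dlist[:k]) / float(k)

# rows from the table: (RAM, CPU, HDD, DVD, screen) -> price
data = [
    {'input': (512, 1400, 30, 1, 13), 'result': 360},    # No. 1
    {'input': (512, 1400, 60, 1, 14), 'result': 400},    # No. 2
    {'input': (1024, 1600, 120, 0, 15), 'result': 325},  # No. 3
]
new = (512, 1400, 40, 1, 14)
knnestimate(data, new, k=3)  # 361.66..., the $361 on the slide
```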
![Page 116: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/116.jpg)
Using a weighted average
|       | RAM  | CPU  | HDD | DVD | Screen | Weight | Price |
|-------|------|------|-----|-----|--------|--------|-------|
| New   | 512  | 1400 | 40  | 1   | 14     |        | ???   |
| No. 1 | 512  | 1400 | 30  | 1   | 13     | 3      | $360  |
| No. 2 | 512  | 1400 | 60  | 1   | 14     | 2      | $400  |
| No. 3 | 1024 | 1600 | 120 | 0   | 15     | 1      | $325  |

Estimate = (3×$360 + 2×$400 + 1×$325) / (3+2+1) ≈ $367
![Page 117: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/117.jpg)
Python code
def weightedknn(data,vec1,k=5,weightf=gaussian):
    # Get distances
    dlist=getdistances(data,vec1)
    avg=0.0
    totalweight=0.0

    # Get weighted average
    for i in range(k):
        dist=dlist[i][0]
        idx=dlist[i][1]
        weight=weightf(dist)
        avg+=weight*data[idx]['result']
        totalweight+=weight
    avg=avg/totalweight
    return avg

def getdistances(data,vec1):
    distancelist=[]
    for i in range(len(data)):
        vec2=data[i]['input']
        distancelist.append((euclidean(vec1,vec2),i))
    distancelist.sort()
    return distancelist
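The `weightf=gaussian` default in `weightedknn` is not defined on these slides; a common choice (the one used in “Programming Collective Intelligence”) weights each neighbor with a Gaussian of its distance, so the weight is 1 for an exact match and falls off smoothly without ever reaching zero. The `sigma` parameter here is an assumed default:

```python
import math

def gaussian(dist, sigma=10.0):
    # weight 1.0 for an exact match, decaying smoothly with distance
    return math.e ** (-dist ** 2 / (2 * sigma ** 2))
```

Because the weight never reaches zero, the weighted average in `weightedknn` never divides by zero, even when the k nearest items are all far away.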
![Page 118: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/118.jpg)
![Page 119: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/119.jpg)
Too few – k too low
![Page 120: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/120.jpg)
Too many – k too high
![Page 121: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/121.jpg)
Determining the best k
Divide the dataset up:
Training set
Test set

Guess the prices for the test set using the training set
See how good the guesses are for different values of k
Known as “cross-validation”
![Page 122: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/122.jpg)
Determining the best k
Test set:

| Attribute | Price |
|-----------|-------|
| 10        | 20    |

Training set:

| Attribute | Price |
|-----------|-------|
| 6         | 0     |
| 8         | 10    |
| 11        | 30    |

For k = 1, guess = 30, error = 10
For k = 2, guess = 20, error = 0
For k = 3, guess = 13, error = 7
Repeat with different test sets, average the error
![Page 123: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/123.jpg)
Python code

def dividedata(data,test=0.05):
    trainset=[]
    testset=[]
    for row in data:
        if random()<test:
            testset.append(row)
        else:
            trainset.append(row)
    return trainset,testset

def testalgorithm(algf,trainset,testset):
    error=0.0
    for row in testset:
        guess=algf(trainset,row['input'])
        error+=(row['result']-guess)**2
    return error/len(testset)

def crossvalidate(algf,data,trials=100,test=0.05):
    error=0.0
    for i in range(trials):
        trainset,testset=dividedata(data,test)
        error+=testalgorithm(algf,trainset,testset)
    return error/trials
![Page 124: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/124.jpg)
![Page 125: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/125.jpg)
![Page 126: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/126.jpg)
![Page 127: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/127.jpg)
Problems with scale
![Page 128: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/128.jpg)
Scaling the data
![Page 129: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/129.jpg)
Scaling to zero
![Page 130: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/130.jpg)
Determining the best scale
Try different weights
Use the “cross-validation” method
Different ways of choosing a scale:
Range-scaling
Intuitive guessing
Optimization
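One way to apply a chosen set of weights is to rescale the data before computing any distances; a sketch (the `rescale` name and the data layout are assumptions, in the spirit of the earlier code):

```python
def rescale(data, scale):
    # multiply each input attribute by its scale factor;
    # a factor of 0 removes the attribute from the distance entirely
    scaleddata = []
    for row in data:
        scaled = [scale[i] * row['input'][i] for i in range(len(scale))]
        scaleddata.append({'input': scaled, 'result': row['result']})
    return scaleddata
```

The scale vector itself can then be chosen by running cross-validation on rescaled copies of the data, or handed to an optimizer as the variable in a cost function.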
![Page 131: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/131.jpg)
Methods covered
Regression trees
Hierarchical clustering
k-means clustering
Multidimensional scaling
Weighted k-nearest neighbors
![Page 132: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/132.jpg)
New projects
Openads
An open-source ad server
Users can share impression/click data
Matrix of what hits based on:
Page text
Ad
Ad placement
Search query
Can we improve targeting?
![Page 133: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/133.jpg)
New Projects
Finance
Analysts already drowning in info
Stories sometimes broken on blogs
Message boards show sentiment
Extremely low signal-to-noise ratio
![Page 134: Data Mining and Open APIs](https://reader033.vdocuments.mx/reader033/viewer/2022052522/547bb311b4af9fea158b4f32/html5/thumbnails/134.jpg)
New Projects
Entertainment
How much buzz is a movie generating?
What psychographic profiles like this type of movie?
Of interest to studios and media investors